Message ID | 1646803679-11433-1-git-send-email-quic_charante@quicinc.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: madvise: return correct bytes advised with process_madvise | expand |
On Wed, Mar 09, 2022 at 10:57:59AM +0530, Charan Teja Kalla wrote: > The process_madvise() system call returns error even after processing > some VMA's passed in the 'struct iovec' vector list which leaves the > user confused to know where to restart the advise next. It is also > against this syscall man page[1] documentation where it mentions that > "return value may be less than the total number of requested bytes, if > an error occurred after some iovec elements were already processed.". > > Consider a user passed 10 VMA's in the 'struct iovec' vector list of > which 9 are processed but one. Then it just returns the error caused on > that failed VMA despite the first 9 VMA's processed, leaving the user > confused about on which VMA it is failed. Returning the number of bytes > processed here can help the user to know which VMA it is failed on and > thus can retry/skip the advise on that VMA. > > [1]https://man7.org/linux/man-pages/man2/process_madvise.2.html. > > Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API" > Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> > --- > mm/madvise.c | 12 +++++++++--- > 1 file changed, 9 insertions(+), 3 deletions(-) > > diff --git a/mm/madvise.c b/mm/madvise.c > index 38d0f51..d3b49b3 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -1426,15 +1426,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, > > while (iov_iter_count(&iter)) { > iovec = iov_iter_iovec(&iter); > + /* > + * Even when [start, end) passed to do_madvise covers > + * some unmapped addresses, it continues processing with > + * returning ENOMEM at the end. Thus consider the range > + * as processed when do_madvise() returns ENOMEM. > + * This makes process_madvise() never returns ENOMEM. > + */ Looks like that this patch has two things. first, returns processed bytes instead of error in case of error. Second, keep working on rest vmas on -ENOMEM due to unmapped hole. First thing totally makes sense to me(that's exactly I wanted to do but somehow missed) so it should go stable tree. However, second stuff might be arguble so it would be great if you split the patch. > ret = do_madvise(mm, (unsigned long)iovec.iov_base, > iovec.iov_len, behavior); > - if (ret < 0) > + if (ret < 0 && ret != -ENOMEM) > break; > iov_iter_advance(&iter, iovec.iov_len); > } > > - if (ret == 0) > - ret = total_len - iov_iter_count(&iter); > + ret = (total_len - iov_iter_count(&iter)) ? : ret; > > release_mm: > mmput(mm); > -- > 2.7.4 >
> On Mar 9, 2022, at 8:47 AM, Minchan Kim <minchan@kernel.org> wrote: > > On Wed, Mar 09, 2022 at 10:57:59AM +0530, Charan Teja Kalla wrote: >> The process_madvise() system call returns error even after processing >> some VMA's passed in the 'struct iovec' vector list which leaves the >> user confused to know where to restart the advise next. It is also >> against this syscall man page[1] documentation where it mentions that >> "return value may be less than the total number of requested bytes, if >> an error occurred after some iovec elements were already processed.". >> >> Consider a user passed 10 VMA's in the 'struct iovec' vector list of >> which 9 are processed but one. Then it just returns the error caused on >> that failed VMA despite the first 9 VMA's processed, leaving the user >> confused about on which VMA it is failed. Returning the number of bytes >> processed here can help the user to know which VMA it is failed on and >> thus can retry/skip the advise on that VMA. >> >> [1]https://man7.org/linux/man-pages/man2/process_madvise.2.html. >> >> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API" >> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> >> --- >> mm/madvise.c | 12 +++++++++--- >> 1 file changed, 9 insertions(+), 3 deletions(-) >> >> diff --git a/mm/madvise.c b/mm/madvise.c >> index 38d0f51..d3b49b3 100644 >> --- a/mm/madvise.c >> +++ b/mm/madvise.c >> @@ -1426,15 +1426,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, >> >> while (iov_iter_count(&iter)) { >> iovec = iov_iter_iovec(&iter); >> + /* >> + * Even when [start, end) passed to do_madvise covers >> + * some unmapped addresses, it continues processing with >> + * returning ENOMEM at the end. Thus consider the range >> + * as processed when do_madvise() returns ENOMEM. >> + * This makes process_madvise() never returns ENOMEM. >> + */ > > Looks like that this patch has two things. first, returns processed > bytes instead of error in case of error. Second, keep working on > rest vmas on -ENOMEM due to unmapped hole. > > First thing totally makes sense to me(that's exactly I wanted to > do but somehow missed) so it should go stable tree. However, > second stuff might be arguble so it would be great if you split > the patch. I fully understand and relate to the basic motivation of this patch. The ENOMEM that this patch checks for, IIUC, is the ENOMEM that is returned on unmapped holes. Such ENOMEM does not appear, according to the man page, to be a valid reason to return ENOMEM to userspace. Presumably process_madvise() is expected to skip unmapped holes and not to fail because of them. Having said that, I do not think that the check that the patch does is clean or clearly documented. In addition, this patch (and some work on process_madvise()) raise in my mind a couple of questions: 1. There are other errors that process_madvise might encounter and can be propagated back to userspace, but are not documented. For instance if can_madv_lru_vma() fails on MADV_COLD, userspace will get EINVAL. EINVAL is not documented as a valid error-code for such case in either madvise() and process_madvise() man pages. 2. If an advice fails due to other reason, specifically split_huge_page(), there might be partial success within a VMA. I guess for now it is fine to ignore such failures since there is no guarantee on the OS behavior following most advices (excluding MADV_DONTNEED). Regards, Nadav
Thanks Minchan for your comment!! On 3/9/2022 10:17 PM, Minchan Kim wrote: >> @@ -1426,15 +1426,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, >> >> while (iov_iter_count(&iter)) { >> iovec = iov_iter_iovec(&iter); >> + /* >> + * Even when [start, end) passed to do_madvise covers >> + * some unmapped addresses, it continues processing with >> + * returning ENOMEM at the end. Thus consider the range >> + * as processed when do_madvise() returns ENOMEM. >> + * This makes process_madvise() never returns ENOMEM. >> + */ > Looks like that this patch has two things. first, returns processed > bytes instead of error in case of error. Second, keep working on > rest vmas on -ENOMEM due to unmapped hole. > > First thing totally makes sense to me(that's exactly I wanted to > do but somehow missed) so it should go stable tree. However, > second stuff might be arguble so it would be great if you split > the patch. Sure, then will split the patch in V2. >
Thanks Amit for the inputs!! On 3/10/2022 12:20 AM, Nadav Amit wrote: > --- > mm/madvise.c | 12 +++++++++--- > 1 file changed, 9 insertions(+), 3 deletions(-) > > diff --git a/mm/madvise.c b/mm/madvise.c > index 38d0f51..d3b49b3 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -1426,15 +1426,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, > > while (iov_iter_count(&iter)) { > iovec = iov_iter_iovec(&iter); > + /* > + * Even when [start, end) passed to do_madvise covers > + * some unmapped addresses, it continues processing with > + * returning ENOMEM at the end. Thus consider the range > + * as processed when do_madvise() returns ENOMEM. > + * This makes process_madvise() never returns ENOMEM. > + */ > > I fully understand and relate to the basic motivation of this > patch. > > The ENOMEM that this patch checks for, IIUC, is the ENOMEM that is > returned on unmapped holes. Such ENOMEM does not appear, according to > the man page, to be a valid reason to return ENOMEM to userspace. > Presumably process_madvise() is expected to skip unmapped holes > and not to fail because of them> True, that ENOMEM represents the VMA passed contains the unmapped holes. Pasting the Documentation of do_madvise(): * -ENOMEM - addresses in the specified range are not currently * mapped, or are outside the AS of the process. Internally process_madvise() calls do_madvise() in a loop by passing the vma it received in 'struct iovec'. And I too agree here that process_madvise() is expected to process the unmapped holes. > Having said that, I do not think that the check that the patch does > is clean or clearly documented. If it is about the Documentation, how about adding: "Since process_madvise() is expected to process unmapped holes, never return ENOMEM received from do_madvise() to user". If the code changes can be made further cleaner, please suggest. > > In addition, this patch (and some work on process_madvise()) raise > in my mind a couple of questions: > > 1. There are other errors that process_madvise might encounter > and can be propagated back to userspace, but are not > documented. For instance if can_madv_lru_vma() fails on > MADV_COLD, userspace will get EINVAL. EINVAL is not documented > as a valid error-code for such case in either madvise() and > process_madvise() man pages. I agree here with the man page documentations too and felt the same while going through them. For the mentioned case too, in the madvise[1] man page, EINVAL return type is only talked for MADV_DONTNEED and MADV_REMOVE. It should also contains for MADV_PAGEOUT, MADV_COLD and as well for MADV_FREE. The other missing return types, which I came across, in process_madvise are: EINVAL - return from process_madvise_behavior_valid(). EINTR - from mm_access() EACCES - from mm_access() Thanks, Charan
diff --git a/mm/madvise.c b/mm/madvise.c index 38d0f51..d3b49b3 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1426,15 +1426,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, while (iov_iter_count(&iter)) { iovec = iov_iter_iovec(&iter); + /* + * Even when [start, end) passed to do_madvise covers + * some unmapped addresses, it continues processing with + * returning ENOMEM at the end. Thus consider the range + * as processed when do_madvise() returns ENOMEM. + * This makes process_madvise() never returns ENOMEM. + */ ret = do_madvise(mm, (unsigned long)iovec.iov_base, iovec.iov_len, behavior); - if (ret < 0) + if (ret < 0 && ret != -ENOMEM) break; iov_iter_advance(&iter, iovec.iov_len); } - if (ret == 0) - ret = total_len - iov_iter_count(&iter); + ret = (total_len - iov_iter_count(&iter)) ? : ret; release_mm: mmput(mm);
The process_madvise() system call returns error even after processing some VMA's passed in the 'struct iovec' vector list which leaves the user confused to know where to restart the advise next. It is also against this syscall man page[1] documentation where it mentions that "return value may be less than the total number of requested bytes, if an error occurred after some iovec elements were already processed.". Consider a user passed 10 VMA's in the 'struct iovec' vector list of which 9 are processed but one. Then it just returns the error caused on that failed VMA despite the first 9 VMA's processed, leaving the user confused about on which VMA it is failed. Returning the number of bytes processed here can help the user to know which VMA it is failed on and thus can retry/skip the advise on that VMA. [1]https://man7.org/linux/man-pages/man2/process_madvise.2.html. Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API" Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> --- mm/madvise.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)