Message ID | 20230104225207.1066932-4-peterx@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/uffd: Fix missing markers on hugetlb | expand |
On Wed, Jan 4, 2023 at 10:52 PM Peter Xu <peterx@redhat.com> wrote: > > Before this patch, when there's any pgtable allocation issues happened > during change_protection(), the error will be ignored from the syscall. > For shmem, there will be an error dumped into the host dmesg. Two issues > with that: > > (1) Doing a trace dump when allocation fails is not anything close to > grace.. > > (2) The user should be notified with any kind of such error, so the user > can trap it and decide what to do next, either by retrying, or stop > the process properly, or anything else. > > For userfault users, this will change the API of UFFDIO_WRITEPROTECT when > pgtable allocation failure happened. It should not normally break anyone, > though. If it breaks, then in good ways. > > One man-page update will be on the way to introduce the new -ENOMEM for > UFFDIO_WRITEPROTECT. Not marking stable so we keep the old behavior on the > 5.19-till-now kernels. > > Reported-by: James Houghton <jthoughton@google.com> > Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: James Houghton <jthoughton@google.com> Thanks Peter! :)
> On Jan 4, 2023, at 2:52 PM, Peter Xu <peterx@redhat.com> wrote: > > Before this patch, when there's any pgtable allocation issues happened > during change_protection(), the error will be ignored from the syscall. > For shmem, there will be an error dumped into the host dmesg. Two issues > with that: > > (1) Doing a trace dump when allocation fails is not anything close to > grace.. > > (2) The user should be notified with any kind of such error, so the user > can trap it and decide what to do next, either by retrying, or stop > the process properly, or anything else. > > For userfault users, this will change the API of UFFDIO_WRITEPROTECT when > pgtable allocation failure happened. It should not normally break anyone, > though. If it breaks, then in good ways. > > One man-page update will be on the way to introduce the new -ENOMEM for > UFFDIO_WRITEPROTECT. Not marking stable so we keep the old behavior on the > 5.19-till-now kernels. I understand that the current assumption is that change_protection() should fully succeed or fail, and I guess this is the current behavior. However, to be more “future-proof” perhaps this needs to be revisited. For instance, UFFDIO_WRITEPROTECT can benefit from the ability to (based on userspace request) prevent write-protection of pages that are pinned. This is necessary to allow userspace uffd monitor to avoid write-protection of O_DIRECT’d memory, for instance, that might change even if a uffd monitor considers it write-protected. In such a case, a “partial failure” is possible, since only part of the memory was write-protected. The uffd monitor should be allowed to continue execution, but it has to know the part of the memory that was successfully write-protected. To support “partial failure”, the kernel should return to UFFDIO_WRITEPROTECT-users the number of pages/bytes that were not successfully write-protected, unless no memory was successfully write-protected. (Unlike NUMA, pages that were skipped should be accounted as “successfully write-protected"). I am only raising this subject to avoid multiple API changes.
On 04.01.23 23:52, Peter Xu wrote: > Before this patch, when there's any pgtable allocation issues happened > during change_protection(), the error will be ignored from the syscall. > For shmem, there will be an error dumped into the host dmesg. Two issues > with that: > > (1) Doing a trace dump when allocation fails is not anything close to > grace.. s/..// > > (2) The user should be notified with any kind of such error, so the user > can trap it and decide what to do next, either by retrying, or stop > the process properly, or anything else. > > For userfault users, this will change the API of UFFDIO_WRITEPROTECT when > pgtable allocation failure happened. It should not normally break anyone, > though. If it breaks, then in good ways. > > One man-page update will be on the way to introduce the new -ENOMEM for > UFFDIO_WRITEPROTECT. Not marking stable so we keep the old behavior on the > 5.19-till-now kernels. We'd now fail after already having modified some state (protected some PTEs). I assume that can already happen when protecting across multiple VMAs and is expected, right?
On 05.01.23 04:10, Nadav Amit wrote: > >> On Jan 4, 2023, at 2:52 PM, Peter Xu <peterx@redhat.com> wrote: >> >> Before this patch, when there's any pgtable allocation issues happened >> during change_protection(), the error will be ignored from the syscall. >> For shmem, there will be an error dumped into the host dmesg. Two issues >> with that: >> >> (1) Doing a trace dump when allocation fails is not anything close to >> grace.. >> >> (2) The user should be notified with any kind of such error, so the user >> can trap it and decide what to do next, either by retrying, or stop >> the process properly, or anything else. >> >> For userfault users, this will change the API of UFFDIO_WRITEPROTECT when >> pgtable allocation failure happened. It should not normally break anyone, >> though. If it breaks, then in good ways. >> >> One man-page update will be on the way to introduce the new -ENOMEM for >> UFFDIO_WRITEPROTECT. Not marking stable so we keep the old behavior on the >> 5.19-till-now kernels. > > I understand that the current assumption is that change_protection() should > fully succeed or fail, and I guess this is the current behavior. > > However, to be more “future-proof” perhaps this needs to be revisited. > > For instance, UFFDIO_WRITEPROTECT can benefit from the ability to (based on > userspace request) prevent write-protection of pages that are pinned. This is > necessary to allow userspace uffd monitor to avoid write-protection of > O_DIRECT’d memory, for instance, that might change even if a uffd monitor > considers it write-protected. Just a note that this is pretty tricky IMHO, because: a) We cannot distinguished "pinned readable" from "pinned writable" b) We can have false positives ("pinned") even for compound pages due to concurrent GUP-fast. c) Synchronizing against GUP-fast is pretty tricky ... as we learned. Concurrent pinning is usually problematic. d) O_DIRECT still uses FOLL_GET and we cannot identify that. (at least that should be figured out at one point) I have a patch lying around for a very long time that removes that special-pinned handling from softdirty code, because of the above reasons (and because it forgets THP). For now I didn't send it because for softdirty, it's acceptable to over-indicate and it hasn't been reported to be an actual problem so far. For existing UFFDIO_WRITEPROTECT users, however, it might be very harmful (especially for existing users) to get false protection errors. Failing due to ENOMEM is different to failing due to some temporary concurrency issues. Having that said, I started thinking about alternative ways of detecting that in that past, without much outcome so far: that latest idea was indicating "this MM has had pinned pages at one point, be careful because any techniques that use write-protection (softdirty, mprotect, uffd-wp) won't be able to catch writes via pinned pages reliably". Hm.
> On Jan 5, 2023, at 12:59 AM, David Hildenbrand <david@redhat.com> wrote: > > On 05.01.23 04:10, Nadav Amit wrote: >>> On Jan 4, 2023, at 2:52 PM, Peter Xu <peterx@redhat.com> wrote: >>> >>> Before this patch, when there's any pgtable allocation issues happened >>> during change_protection(), the error will be ignored from the syscall. >>> For shmem, there will be an error dumped into the host dmesg. Two issues >>> with that: >>> >>> (1) Doing a trace dump when allocation fails is not anything close to >>> grace.. >>> >>> (2) The user should be notified with any kind of such error, so the user >>> can trap it and decide what to do next, either by retrying, or stop >>> the process properly, or anything else. >>> >>> For userfault users, this will change the API of UFFDIO_WRITEPROTECT when >>> pgtable allocation failure happened. It should not normally break anyone, >>> though. If it breaks, then in good ways. >>> >>> One man-page update will be on the way to introduce the new -ENOMEM for >>> UFFDIO_WRITEPROTECT. Not marking stable so we keep the old behavior on the >>> 5.19-till-now kernels. >> I understand that the current assumption is that change_protection() should >> fully succeed or fail, and I guess this is the current behavior. >> However, to be more “future-proof” perhaps this needs to be revisited. >> For instance, UFFDIO_WRITEPROTECT can benefit from the ability to (based on >> userspace request) prevent write-protection of pages that are pinned. This is >> necessary to allow userspace uffd monitor to avoid write-protection of >> O_DIRECT’d memory, for instance, that might change even if a uffd monitor >> considers it write-protected. > > Just a note that this is pretty tricky IMHO, because: > > a) We cannot distinguished "pinned readable" from "pinned writable" > b) We can have false positives ("pinned") even for compound pages due to > concurrent GUP-fast. > c) Synchronizing against GUP-fast is pretty tricky ... as we learned. > Concurrent pinning is usually problematic. > d) O_DIRECT still uses FOLL_GET and we cannot identify that. (at least > that should be figured out at one point) My prototype used the page-count IIRC, so it had false-positives (but addressed O_DIRECT). And yes, precise refinement is complicated. However, if you need to uffd-wp memory, then without such a mechanism you need to ensure no kerenl/DMA write to these pages is possible. The only other option I can think of is interposing/seccomp on a variety of syscalls, to prevent uffd-wp of such memory. > > I have a patch lying around for a very long time that removes that special-pinned handling from softdirty code, because of the above reasons (and because it forgets THP). For now I didn't send it because for softdirty, it's acceptable to over-indicate and it hasn't been reported to be an actual problem so far. > > For existing UFFDIO_WRITEPROTECT users, however, it might be very harmful (especially for existing users) to get false protection errors. Failing due to ENOMEM is different to failing due to some temporary concurrency issues. Yes, I propose it as an optional flag for UFFD-WP. Anyhow, I believe the UFFD-WP as implemented now is not efficient and should’ve been vectored to allow one TLB shootdown for many non-consecutive pages. > > Having that said, I started thinking about alternative ways of detecting that in that past, without much outcome so far: that latest idea was indicating "this MM has had pinned pages at one point, be careful because any techniques that use write-protection (softdirty, mprotect, uffd-wp) won't be able to catch writes via pinned pages reliably". I am not sure what the best way to detect that a page is write-pinned reliably. My point was that if a change is already carried to write-protect mechanisms, then this issue should be considered. Because otherwise, many use-cases of uffd-wp would encounter implementation issues. I will not “kill” myself over it now, but I think it worth consideration.
On Thu, Jan 05, 2023 at 10:01:46AM -0800, Nadav Amit wrote: > > > > On Jan 5, 2023, at 12:59 AM, David Hildenbrand <david@redhat.com> wrote: > > > > On 05.01.23 04:10, Nadav Amit wrote: > >>> On Jan 4, 2023, at 2:52 PM, Peter Xu <peterx@redhat.com> wrote: > >>> > >>> Before this patch, when there's any pgtable allocation issues happened > >>> during change_protection(), the error will be ignored from the syscall. > >>> For shmem, there will be an error dumped into the host dmesg. Two issues > >>> with that: > >>> > >>> (1) Doing a trace dump when allocation fails is not anything close to > >>> grace.. > >>> > >>> (2) The user should be notified with any kind of such error, so the user > >>> can trap it and decide what to do next, either by retrying, or stop > >>> the process properly, or anything else. > >>> > >>> For userfault users, this will change the API of UFFDIO_WRITEPROTECT when > >>> pgtable allocation failure happened. It should not normally break anyone, > >>> though. If it breaks, then in good ways. > >>> > >>> One man-page update will be on the way to introduce the new -ENOMEM for > >>> UFFDIO_WRITEPROTECT. Not marking stable so we keep the old behavior on the > >>> 5.19-till-now kernels. > >> I understand that the current assumption is that change_protection() should > >> fully succeed or fail, and I guess this is the current behavior. > >> However, to be more “future-proof” perhaps this needs to be revisited. > >> For instance, UFFDIO_WRITEPROTECT can benefit from the ability to (based on > >> userspace request) prevent write-protection of pages that are pinned. This is > >> necessary to allow userspace uffd monitor to avoid write-protection of > >> O_DIRECT’d memory, for instance, that might change even if a uffd monitor > >> considers it write-protected. > > > > Just a note that this is pretty tricky IMHO, because: > > > > a) We cannot distinguished "pinned readable" from "pinned writable" > > b) We can have false positives ("pinned") even for compound pages due to > > concurrent GUP-fast. > > c) Synchronizing against GUP-fast is pretty tricky ... as we learned. > > Concurrent pinning is usually problematic. > > d) O_DIRECT still uses FOLL_GET and we cannot identify that. (at least > > that should be figured out at one point) > > My prototype used the page-count IIRC, so it had false-positives (but > addressed O_DIRECT). I think this means the app is fine to not be able to write protect some page being requested? For a swap framework I think it's fine, but maybe not for taking a snapshot, so I agree it should be an optional flag as you mentioned below. > And yes, precise refinement is complicated. However, > if you need to uffd-wp memory, then without such a mechanism you need to > ensure no kerenl/DMA write to these pages is possible. The only other > option I can think of is interposing/seccomp on a variety of syscalls, > to prevent uffd-wp of such memory. > > > > > I have a patch lying around for a very long time that removes that special-pinned handling from softdirty code, because of the above reasons (and because it forgets THP). For now I didn't send it because for softdirty, it's acceptable to over-indicate and it hasn't been reported to be an actual problem so far. > > > > For existing UFFDIO_WRITEPROTECT users, however, it might be very harmful (especially for existing users) to get false protection errors. Failing due to ENOMEM is different to failing due to some temporary concurrency issues. > > Yes, I propose it as an optional flag for UFFD-WP. > Anyhow, I believe the UFFD-WP as implemented now is not efficient and > should’ve been vectored to allow one TLB shootdown for many > non-consecutive pages. Agreed. Would providing a vector of ranges help too for a few uffd ioctls? I'm also curious whether you're still actively developing (or running) your iouring series. > > > > > Having that said, I started thinking about alternative ways of detecting that in that past, without much outcome so far: that latest idea was indicating "this MM has had pinned pages at one point, be careful because any techniques that use write-protection (softdirty, mprotect, uffd-wp) won't be able to catch writes via pinned pages reliably". > > I am not sure what the best way to detect that a page is write-pinned > reliably. My point was that if a change is already carried to > write-protect mechanisms, then this issue should be considered. Because > otherwise, many use-cases of uffd-wp would encounter implementation > issues. > > I will not “kill” myself over it now, but I think it worth consideration. The current interface change is small and limited only to the extra -ENOMEM retval with memory pressures (personally I don't really know how to trigger this, as I never succeeded myself even with memory pressure..). What you said does sound like a new area to explore, and I think it's fine to change the interface again. Thanks,
On 05.01.23 19:01, Nadav Amit wrote: > > >> On Jan 5, 2023, at 12:59 AM, David Hildenbrand <david@redhat.com> wrote: >> >> On 05.01.23 04:10, Nadav Amit wrote: >>>> On Jan 4, 2023, at 2:52 PM, Peter Xu <peterx@redhat.com> wrote: >>>> >>>> Before this patch, when there's any pgtable allocation issues happened >>>> during change_protection(), the error will be ignored from the syscall. >>>> For shmem, there will be an error dumped into the host dmesg. Two issues >>>> with that: >>>> >>>> (1) Doing a trace dump when allocation fails is not anything close to >>>> grace.. >>>> >>>> (2) The user should be notified with any kind of such error, so the user >>>> can trap it and decide what to do next, either by retrying, or stop >>>> the process properly, or anything else. >>>> >>>> For userfault users, this will change the API of UFFDIO_WRITEPROTECT when >>>> pgtable allocation failure happened. It should not normally break anyone, >>>> though. If it breaks, then in good ways. >>>> >>>> One man-page update will be on the way to introduce the new -ENOMEM for >>>> UFFDIO_WRITEPROTECT. Not marking stable so we keep the old behavior on the >>>> 5.19-till-now kernels. >>> I understand that the current assumption is that change_protection() should >>> fully succeed or fail, and I guess this is the current behavior. >>> However, to be more “future-proof” perhaps this needs to be revisited. >>> For instance, UFFDIO_WRITEPROTECT can benefit from the ability to (based on >>> userspace request) prevent write-protection of pages that are pinned. This is >>> necessary to allow userspace uffd monitor to avoid write-protection of >>> O_DIRECT’d memory, for instance, that might change even if a uffd monitor >>> considers it write-protected. >> >> Just a note that this is pretty tricky IMHO, because: >> >> a) We cannot distinguished "pinned readable" from "pinned writable" >> b) We can have false positives ("pinned") even for compound pages due to >> concurrent GUP-fast. >> c) Synchronizing against GUP-fast is pretty tricky ... as we learned. >> Concurrent pinning is usually problematic. >> d) O_DIRECT still uses FOLL_GET and we cannot identify that. (at least >> that should be figured out at one point) > > My prototype used the page-count IIRC, so it had false-positives (but I suspect GUP-fast is still problematic, I might be wrong. > addressed O_DIRECT). And yes, precise refinement is complicated. However, > if you need to uffd-wp memory, then without such a mechanism you need to > ensure no kerenl/DMA write to these pages is possible. The only other > option I can think of is interposing/seccomp on a variety of syscalls, > to prevent uffd-wp of such memory. The whole thing reminds me of MADV_DONTNEED+pinning: an application shouldn't do it, because you can only get it wrong :) I know, that's a bad answer.
Sorry for the late response. >> Yes, I propose it as an optional flag for UFFD-WP. >> Anyhow, I believe the UFFD-WP as implemented now is not efficient and >> should’ve been vectored to allow one TLB shootdown for many >> non-consecutive pages. > > Agreed. Would providing a vector of ranges help too for a few uffd ioctls? > > I'm also curious whether you're still actively developing (or running) your > iouring series. So I finished building a prototype some time ago, and one of the benefits was in reducing memory reclamation time. Unfortunately, MGLRU then came and took away a lot of the benefit. A colleague of mine had a slightly different use-case, so I gave him the code and he showed interest in upstreaming it. After some probing, it turns out he decided he is not into the effort of upstreaming it. I can upstream the vectored WP once I write some tests. >> >> I am not sure what the best way to detect that a page is write-pinned >> reliably. My point was that if a change is already carried to >> write-protect mechanisms, then this issue should be considered. Because >> otherwise, many use-cases of uffd-wp would encounter implementation >> issues. >> >> I will not “kill” myself over it now, but I think it worth consideration. > > The current interface change is small and limited only to the extra -ENOMEM > retval with memory pressures (personally I don't really know how to trigger > this, as I never succeeded myself even with memory pressure..). What you > said does sound like a new area to explore, and I think it's fine to change > the interface again. Understood. Thanks and sorry again for the late response, Nadav
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 9df0b9a762cc..3767f18114ef 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -73,7 +73,7 @@ extern ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long dst_start, extern int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, unsigned long len, bool enable_wp, atomic_t *mmap_changing); -extern void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *vma, +extern long uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *vma, unsigned long start, unsigned long len, bool enable_wp); /* mm helpers */ diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 84bc665c7c86..d82d97e03eae 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6658,8 +6658,10 @@ long hugetlb_change_protection(struct vm_area_struct *vma, * pre-allocations to install pte markers. */ ptep = huge_pte_alloc(mm, vma, address, psize); - if (!ptep) + if (!ptep) { + pages = -ENOMEM; break; + } } ptl = huge_pte_lock(h, mm, ptep); if (huge_pmd_unshare(mm, vma, address, ptep)) { @@ -6749,7 +6751,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma, hugetlb_vma_unlock_write(vma); mmu_notifier_invalidate_range_end(&range); - return pages << h->order; + return pages > 0 ? (pages << h->order) : pages; } /* Return true if reservation was successful, false otherwise. */ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index a86b8f15e2f0..85a34f1f3ab8 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -636,7 +636,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma, tlb_gather_mmu(&tlb, vma->vm_mm); nr_updated = change_protection(&tlb, vma, addr, end, MM_CP_PROT_NUMA); - if (nr_updated) + if (nr_updated > 0) count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated); tlb_finish_mmu(&tlb); diff --git a/mm/mprotect.c b/mm/mprotect.c index 0af22ab59ea8..ade0d5f85a36 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -330,28 +330,34 @@ uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags) /* * If wr-protecting the range for file-backed, populate pgtable for the case * when pgtable is empty but page cache exists. When {pte|pmd|...}_alloc() - * failed it means no memory, we don't have a better option but stop. + * failed we treat it the same way as pgtable allocation failures during + * page faults by kicking OOM and returning error. */ #define change_pmd_prepare(vma, pmd, cp_flags) \ - do { \ + ({ \ + long err = 0; \ if (unlikely(uffd_wp_protect_file(vma, cp_flags))) { \ - if (WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd))) \ - break; \ + if (pte_alloc(vma->vm_mm, pmd)) \ + err = -ENOMEM; \ } \ - } while (0) + err; \ + }) + /* * This is the general pud/p4d/pgd version of change_pmd_prepare(). We need to * have separate change_pmd_prepare() because pte_alloc() returns 0 on success, * while {pmd|pud|p4d}_alloc() returns the valid pointer on success. */ #define change_prepare(vma, high, low, addr, cp_flags) \ - do { \ - if (unlikely(uffd_wp_protect_file(vma, cp_flags))) { \ - low##_t *p = low##_alloc(vma->vm_mm, high, addr); \ - if (WARN_ON_ONCE(p == NULL)) \ - break; \ - } \ - } while (0) + ({ \ + long err = 0; \ + if (unlikely(uffd_wp_protect_file(vma, cp_flags))) { \ + low##_t *p = low##_alloc(vma->vm_mm, high, addr); \ + if (p == NULL) \ + err = -ENOMEM; \ + } \ + err; \ + }) static inline long change_pmd_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud, unsigned long addr, @@ -367,11 +373,15 @@ static inline long change_pmd_range(struct mmu_gather *tlb, pmd = pmd_offset(pud, addr); do { - long this_pages; + long ret; next = pmd_addr_end(addr, end); - change_pmd_prepare(vma, pmd, cp_flags); + ret = change_pmd_prepare(vma, pmd, cp_flags); + if (ret) { + pages = ret; + break; + } /* * Automatic NUMA balancing walks the tables with mmap_lock * held for read. It's possible a parallel update to occur @@ -401,7 +411,11 @@ static inline long change_pmd_range(struct mmu_gather *tlb, * cleared; make sure pmd populated if * necessary, then fall-through to pte level. */ - change_pmd_prepare(vma, pmd, cp_flags); + ret = change_pmd_prepare(vma, pmd, cp_flags); + if (ret) { + pages = ret; + break; + } } else { /* * change_huge_pmd() does not defer TLB flushes, @@ -422,9 +436,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb, } /* fall through, the trans huge pmd just split */ } - this_pages = change_pte_range(tlb, vma, pmd, addr, next, - newprot, cp_flags); - pages += this_pages; + pages += change_pte_range(tlb, vma, pmd, addr, next, + newprot, cp_flags); next: cond_resched(); } while (pmd++, addr = next, addr != end); @@ -443,12 +456,14 @@ static inline long change_pud_range(struct mmu_gather *tlb, { pud_t *pud; unsigned long next; - long pages = 0; + long pages = 0, ret; pud = pud_offset(p4d, addr); do { next = pud_addr_end(addr, end); - change_prepare(vma, pud, pmd, addr, cp_flags); + ret = change_prepare(vma, pud, pmd, addr, cp_flags); + if (ret) + return ret; if (pud_none_or_clear_bad(pud)) continue; pages += change_pmd_range(tlb, vma, pud, addr, next, newprot, @@ -464,12 +479,14 @@ static inline long change_p4d_range(struct mmu_gather *tlb, { p4d_t *p4d; unsigned long next; - long pages = 0; + long pages = 0, ret; p4d = p4d_offset(pgd, addr); do { next = p4d_addr_end(addr, end); - change_prepare(vma, p4d, pud, addr, cp_flags); + ret = change_prepare(vma, p4d, pud, addr, cp_flags); + if (ret) + return ret; if (p4d_none_or_clear_bad(p4d)) continue; pages += change_pud_range(tlb, vma, p4d, addr, next, newprot, @@ -486,14 +503,18 @@ static long change_protection_range(struct mmu_gather *tlb, struct mm_struct *mm = vma->vm_mm; pgd_t *pgd; unsigned long next; - long pages = 0; + long pages = 0, ret; BUG_ON(addr >= end); pgd = pgd_offset(mm, addr); tlb_start_vma(tlb, vma); do { next = pgd_addr_end(addr, end); - change_prepare(vma, pgd, p4d, addr, cp_flags); + ret = change_prepare(vma, pgd, p4d, addr, cp_flags); + if (ret) { + pages = ret; + break; + } if (pgd_none_or_clear_bad(pgd)) continue; pages += change_p4d_range(tlb, vma, pgd, addr, next, newprot, diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 65ad172add27..53c3d916ff66 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -710,11 +710,12 @@ ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start, mmap_changing, 0); } -void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma, +long uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma, unsigned long start, unsigned long len, bool enable_wp) { unsigned int mm_cp_flags; struct mmu_gather tlb; + long ret; if (enable_wp) mm_cp_flags = MM_CP_UFFD_WP; @@ -730,8 +731,10 @@ void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma, if (!enable_wp && vma_wants_manual_pte_write_upgrade(dst_vma)) mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE; tlb_gather_mmu(&tlb, dst_mm); - change_protection(&tlb, dst_vma, start, start + len, mm_cp_flags); + ret = change_protection(&tlb, dst_vma, start, start + len, mm_cp_flags); tlb_finish_mmu(&tlb); + + return ret; } int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, @@ -740,7 +743,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, { struct vm_area_struct *dst_vma; unsigned long page_mask; - int err; + long err; /* * Sanitize the command parameters: @@ -779,9 +782,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, goto out_unlock; } - uffd_wp_range(dst_mm, dst_vma, start, len, enable_wp); + err = uffd_wp_range(dst_mm, dst_vma, start, len, enable_wp); + + /* Return 0 on success, <0 on failures */ + if (err > 0) + err = 0; - err = 0; out_unlock: mmap_read_unlock(dst_mm); return err;
Before this patch, when there's any pgtable allocation issues happened during change_protection(), the error will be ignored from the syscall. For shmem, there will be an error dumped into the host dmesg. Two issues with that: (1) Doing a trace dump when allocation fails is not anything close to grace.. (2) The user should be notified with any kind of such error, so the user can trap it and decide what to do next, either by retrying, or stop the process properly, or anything else. For userfault users, this will change the API of UFFDIO_WRITEPROTECT when pgtable allocation failure happened. It should not normally break anyone, though. If it breaks, then in good ways. One man-page update will be on the way to introduce the new -ENOMEM for UFFDIO_WRITEPROTECT. Not marking stable so we keep the old behavior on the 5.19-till-now kernels. Reported-by: James Houghton <jthoughton@google.com> Signed-off-by: Peter Xu <peterx@redhat.com> --- include/linux/userfaultfd_k.h | 2 +- mm/hugetlb.c | 6 ++- mm/mempolicy.c | 2 +- mm/mprotect.c | 69 +++++++++++++++++++++++------------ mm/userfaultfd.c | 16 +++++--- 5 files changed, 62 insertions(+), 33 deletions(-)