Message ID: 1536699493-69195-3-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
State: New, archived
Series: mm: zap pages with read mmap_sem in munmap for large mapping
On Wed, Sep 12, 2018 at 04:58:11AM +0800, Yang Shi wrote:
> mm/mmap.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
I really think you're going about this the wrong way by duplicating
vm_munmap().
On 9/11/18 2:16 PM, Matthew Wilcox wrote:
> On Wed, Sep 12, 2018 at 04:58:11AM +0800, Yang Shi wrote:
>> mm/mmap.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> I really think you're going about this the wrong way by duplicating
> vm_munmap().

If we don't duplicate vm_munmap() or do_munmap(), we need to pass an extra
parameter to them to tell when it is fine to downgrade the write lock, or
whether the lock has been acquired outside them (i.e. in mmap()/mremap()),
right? But vm_munmap() and do_munmap() are called not only from
mmap-related code but also from some other places, like arch-specific
code, which doesn't need to downgrade the write lock or isn't safe doing so.

Actually, I did it this way in the v1 patches, but it got pushed back by
tglx, who suggested duplicating the code so that the change could be done
in mm only, without touching other files, i.e. arch-specific stuff. I
didn't have a strong argument to convince him.

And Michal prefers having VM_HUGETLB and VM_PFNMAP handled separately for
safety's and bisectability's sake, which needs to call the regular
do_munmap().

In addition to this, I just found that the mpx code may call do_munmap()
recursively while I was looking into it.

We might be able to handle these with the extra parameter, but it sounds
like it would make the code hard to understand and error prone.

Thanks,
Yang
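For context, the arch-specific-caller concern described above follows a
pattern roughly like the sketch below. The function name is made up and this
is not code from any of the patches, but it matches how paths such as the x86
mpx teardown call do_munmap() with mmap_sem already held for write and keep
relying on the write lock afterwards:

/*
 * Illustrative sketch only (hypothetical function name): a caller that
 * takes mmap_sem for write itself, calls do_munmap(), and then keeps
 * modifying the address space.  If do_munmap() silently downgraded or
 * dropped the write lock, the code after the call would run under the
 * wrong lock, which is why a blanket downgrade inside do_munmap() is
 * not safe for such callers.
 */
static int arch_style_unmap(struct mm_struct *mm, unsigned long addr,
			    size_t len)
{
	int ret;

	down_write(&mm->mmap_sem);

	ret = do_munmap(mm, addr, len, NULL);

	/* ... further vma/vm_flags updates that need the write lock ... */

	up_write(&mm->mmap_sem);
	return ret;
}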
On Tue, Sep 11, 2018 at 04:35:03PM -0700, Yang Shi wrote:
> On 9/11/18 2:16 PM, Matthew Wilcox wrote:
> > On Wed, Sep 12, 2018 at 04:58:11AM +0800, Yang Shi wrote:
> > > mm/mmap.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> > I really think you're going about this the wrong way by duplicating
> > vm_munmap().
>
> If we don't duplicate vm_munmap() or do_munmap(), we need to pass an extra
> parameter to them to tell when it is fine to downgrade the write lock, or
> whether the lock has been acquired outside them (i.e. in mmap()/mremap()),
> right? But vm_munmap() and do_munmap() are called not only from
> mmap-related code but also from some other places, like arch-specific
> code, which doesn't need to downgrade the write lock or isn't safe doing so.
>
> Actually, I did it this way in the v1 patches, but it got pushed back by
> tglx, who suggested duplicating the code so that the change could be done
> in mm only, without touching other files, i.e. arch-specific stuff. I
> didn't have a strong argument to convince him.

With my patch, there is nothing to change in arch-specific code.
Here it is again ...

diff --git a/mm/mmap.c b/mm/mmap.c
index de699523c0b7..06dc31d1da8c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2798,11 +2798,11 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
  * work.  This now handles partial unmappings.
  * Jeremy Fitzhardinge <jeremy@goop.org>
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
-	      struct list_head *uf)
+static int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+	      struct list_head *uf, bool downgrade)
 {
 	unsigned long end;
-	struct vm_area_struct *vma, *prev, *last;
+	struct vm_area_struct *vma, *prev, *last, *tmp;
 
 	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
 		return -EINVAL;
@@ -2816,7 +2816,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	if (!vma)
 		return 0;
 	prev = vma->vm_prev;
-	/* we have  start < vma->vm_end  */
+	/* we have start < vma->vm_end */
 
 	/* if it doesn't overlap, we have nothing.. */
 	end = start + len;
@@ -2873,18 +2873,22 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	/*
 	 * unlock any mlock()ed ranges before detaching vmas
+	 * and check to see if there's any reason we might have to hold
+	 * the mmap_sem write-locked while unmapping regions.
 	 */
-	if (mm->locked_vm) {
-		struct vm_area_struct *tmp = vma;
-		while (tmp && tmp->vm_start < end) {
-			if (tmp->vm_flags & VM_LOCKED) {
-				mm->locked_vm -= vma_pages(tmp);
-				munlock_vma_pages_all(tmp);
-			}
-			tmp = tmp->vm_next;
+	for (tmp = vma; tmp && tmp->vm_start < end; tmp = tmp->vm_next) {
+		if (tmp->vm_flags & VM_LOCKED) {
+			mm->locked_vm -= vma_pages(tmp);
+			munlock_vma_pages_all(tmp);
 		}
+		if (tmp->vm_file &&
+		    has_uprobes(tmp, tmp->vm_start, tmp->vm_end))
+			downgrade = false;
 	}
 
+	if (downgrade)
+		downgrade_write(&mm->mmap_sem);
+
 	/*
 	 * Remove the vma's, and unmap the actual pages
 	 */
@@ -2896,7 +2900,13 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	/* Fix up all other VM information */
 	remove_vma_list(mm, vma);
 
-	return 0;
+	return downgrade ? 1 : 0;
+}
+
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+	      struct list_head *uf)
+{
+	return __do_munmap(mm, start, len, uf, false);
 }
 
 int vm_munmap(unsigned long start, size_t len)
@@ -2905,11 +2915,12 @@ int vm_munmap(unsigned long start, size_t len)
 	struct mm_struct *mm = current->mm;
 	LIST_HEAD(uf);
 
-	if (down_write_killable(&mm->mmap_sem))
-		return -EINTR;
-
-	ret = do_munmap(mm, start, len, &uf);
-	up_write(&mm->mmap_sem);
+	down_write(&mm->mmap_sem);
+	ret = __do_munmap(mm, start, len, &uf, true);
+	if (ret == 1)
+		up_read(&mm->mmap_sem);
+	else
+		up_write(&mm->mmap_sem);
 	userfaultfd_unmap_complete(mm, &uf);
 	return ret;
 }

Anybody calling do_munmap() will not get the lock dropped.

> And Michal prefers having VM_HUGETLB and VM_PFNMAP handled separately for
> safety's and bisectability's sake, which needs to call the regular
> do_munmap().

That can be introduced and then taken out ... indeed, you can split this
into many patches, starting with this:

+		if (tmp->vm_file)
+			downgrade = false;

to only allow this optimisation for anonymous mappings at first.

> In addition to this, I just found that the mpx code may call do_munmap()
> recursively while I was looking into it.
>
> We might be able to handle these with the extra parameter, but it sounds
> like it would make the code hard to understand and error prone.

Only if you make the extra parameter mandatory.
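The contract in this version is carried by the return value: __do_munmap()
returns 1 when it has already downgraded mmap_sem, so the caller must drop a
read lock rather than a write lock. A hypothetical extra caller (the function
name below is made up, not part of the patch) would follow the same pattern as
the vm_munmap() hunk above:

/*
 * Hypothetical caller (illustrative name) of the __do_munmap() sketched
 * above: opt in to the downgrade, then release whichever lock is still
 * held based on the return value.
 */
static int example_munmap_path(struct mm_struct *mm, unsigned long start,
			       size_t len, struct list_head *uf)
{
	int ret;

	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;

	ret = __do_munmap(mm, start, len, uf, true);
	if (ret == 1) {
		/* __do_munmap() downgraded mmap_sem; it is held for read now. */
		up_read(&mm->mmap_sem);
		ret = 0;
	} else {
		up_write(&mm->mmap_sem);
	}

	return ret;
}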
On Tue 11-09-18 19:29:21, Matthew Wilcox wrote:
> On Tue, Sep 11, 2018 at 04:35:03PM -0700, Yang Shi wrote:
[...]

I didn't get to read the patch yet.

> > And Michal prefers having VM_HUGETLB and VM_PFNMAP handled separately for
> > safety's and bisectability's sake, which needs to call the regular
> > do_munmap().
>
> That can be introduced and then taken out ... indeed, you can split this
> into many patches, starting with this:
>
> +		if (tmp->vm_file)
> +			downgrade = false;
>
> to only allow this optimisation for anonymous mappings at first.

or add a helper function to check for special cases and make the
downgrade behavior conditional on it.
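Such a helper might look roughly like the sketch below; the name and the exact
set of special cases are placeholders, loosely following the
VM_HUGETLB/VM_PFNMAP and file-backed cases discussed in this thread:

/*
 * Sketch of the helper suggested above; the name and checks are
 * illustrative, not from any posted patch.  Returns true when every vma
 * in [vma, end) is plain enough that the page-zapping phase can run with
 * mmap_sem downgraded to a read lock.
 */
static bool munmap_can_downgrade(struct vm_area_struct *vma,
				 unsigned long end)
{
	for (; vma && vma->vm_start < end; vma = vma->vm_next) {
		/* These may update vm_flags and need the write lock. */
		if (vma->vm_flags & (VM_HUGETLB | VM_PFNMAP))
			return false;
		/* Keep file-backed mappings on the conservative path at first. */
		if (vma->vm_file)
			return false;
	}
	return true;
}

The unmap path would then only downgrade when this returns true, and later
patches could relax the checks one by one.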
On 9/12/18 2:11 AM, Michal Hocko wrote:
> On Tue 11-09-18 19:29:21, Matthew Wilcox wrote:
>> On Tue, Sep 11, 2018 at 04:35:03PM -0700, Yang Shi wrote:
> [...]
>
> I didn't get to read the patch yet.

If you guys think this is the better way, I can convert my patches to go
this way. The conversion is simple to do.

Thanks,
Yang

>
>>> And Michal prefers having VM_HUGETLB and VM_PFNMAP handled separately for
>>> safety's and bisectability's sake, which needs to call the regular
>>> do_munmap().
>> That can be introduced and then taken out ... indeed, you can split this
>> into many patches, starting with this:
>>
>> +		if (tmp->vm_file)
>> +			downgrade = false;
>>
>> to only allow this optimisation for anonymous mappings at first.
> or add a helper function to check for special cases and make the
> downgrade behavior conditional on it.
diff --git a/mm/mmap.c b/mm/mmap.c
index b7092b4..937d2f2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2769,6 +2769,89 @@ static inline void munlock_vmas(struct vm_area_struct *vma,
 	}
 }
 
+/*
+ * Zap pages with read mmap_sem held
+ *
+ * uf is the list for userfaultfd
+ */
+static int do_munmap_zap_rlock(struct mm_struct *mm, unsigned long start,
+			       size_t len, struct list_head *uf)
+{
+	unsigned long end;
+	struct vm_area_struct *start_vma, *prev, *vma;
+	int ret = 0;
+
+	if (!addr_ok(start, len))
+		return -EINVAL;
+
+	len = PAGE_ALIGN(len);
+
+	end = start + len;
+
+	/*
+	 * Need write mmap_sem to split vmas and detach vmas
+	 * splitting vma up-front to save PITA to clean if it is failed
+	 */
+	if (down_write_killable(&mm->mmap_sem))
+		return -EINTR;
+
+	start_vma = munmap_lookup_vma(mm, start, end);
+	if (!start_vma)
+		goto out;
+	if (IS_ERR(start_vma)) {
+		ret = PTR_ERR(start_vma);
+		goto out;
+	}
+
+	prev = start_vma->vm_prev;
+
+	if (unlikely(uf)) {
+		ret = userfaultfd_unmap_prep(start_vma, start, end, uf);
+		if (ret)
+			goto out;
+	}
+
+	/*
+	 * Unmapping vmas, which have VM_HUGETLB or VM_PFNMAP
+	 * need get done with write mmap_sem held since they may update
+	 * vm_flags. Deal with such mappings with regular do_munmap() call.
+	 */
+	for (vma = start_vma; vma && vma->vm_start < end; vma = vma->vm_next) {
+		if (vma->vm_flags & (VM_HUGETLB | VM_PFNMAP))
+			goto regular_path;
+	}
+
+	/* Handle mlocked vmas */
+	if (mm->locked_vm)
+		munlock_vmas(start_vma, end);
+
+	/* Detach vmas from rbtree */
+	detach_vmas_to_be_unmapped(mm, start_vma, prev, end);
+
+	/*
+	 * mpx unmap need to be handled with write mmap_sem. It is safe to
+	 * deal with it before unmap_region().
+	 */
+	arch_unmap(mm, start_vma, start, end);
+
+	downgrade_write(&mm->mmap_sem);
+
+	/* Zap mappings with read mmap_sem */
+	unmap_region(mm, start_vma, prev, start, end);
+
+	remove_vma_list(mm, start_vma);
+	up_read(&mm->mmap_sem);
+
+	return 0;
+
+regular_path:
+	ret = do_munmap(mm, start, len, uf);
+
+out:
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
  * work.  This now handles partial unmappings.
@@ -2830,6 +2913,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	return 0;
 }
 
+static int vm_munmap_zap_rlock(unsigned long start, size_t len)
+{
+	int ret;
+	struct mm_struct *mm = current->mm;
+	LIST_HEAD(uf);
+
+	ret = do_munmap_zap_rlock(mm, start, len, &uf);
+	userfaultfd_unmap_complete(mm, &uf);
+	return ret;
+}
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
@@ -2849,10 +2943,9 @@ int vm_munmap(unsigned long start, size_t len)
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
 	profile_munmap(addr);
-	return vm_munmap(addr, len);
+	return vm_munmap_zap_rlock(addr, len);
 }
 
-
 /*
  * Emulation of deprecated remap_file_pages() syscall.
  */
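The patch relies on helpers introduced by earlier patches in this series that
are not shown here (addr_ok(), munmap_lookup_vma(), munlock_vmas()). As a
rough guide only, munmap_lookup_vma() presumably factors the vma lookup and
boundary splitting out of the original do_munmap(), along the lines of the
sketch below, reconstructed from the original do_munmap() flow; the actual
helper in the series may differ:

/*
 * Rough sketch of a munmap_lookup_vma()-style helper, reconstructed from
 * the original do_munmap() logic and not taken from the posted series.
 * Find the first vma overlapping [start, end), splitting vmas at both
 * boundaries; return NULL if nothing overlaps, or ERR_PTR on failure.
 */
static struct vm_area_struct *munmap_lookup_vma(struct mm_struct *mm,
					unsigned long start, unsigned long end)
{
	struct vm_area_struct *vma, *prev, *last;
	int err;

	/* Find the first potentially overlapping vma, if any. */
	vma = find_vma(mm, start);
	if (!vma)
		return NULL;
	prev = vma->vm_prev;
	if (vma->vm_start >= end)
		return NULL;

	/* Split the starting vma if the unmap range begins inside it. */
	if (start > vma->vm_start) {
		if (end < vma->vm_end &&
		    mm->map_count >= sysctl_max_map_count)
			return ERR_PTR(-ENOMEM);
		err = __split_vma(mm, vma, start, 0);
		if (err)
			return ERR_PTR(err);
		prev = vma;
	}

	/* Split the last vma if the unmap range ends inside it. */
	last = find_vma(mm, end);
	if (last && end > last->vm_start) {
		err = __split_vma(mm, last, end, 1);
		if (err)
			return ERR_PTR(err);
	}

	return prev ? prev->vm_next : mm->mmap;
}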