Message ID | 1530311985-31251-5-git-send-email-yang.shi@linux.alibaba.com (mailing list archive) |
---|---|
State | New, archived |
On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote: > When running some mmap/munmap scalability tests with large memory (i.e. > > 300GB), the below hung task issue may happen occasionally. > > INFO: task ps:14018 blocked for more than 120 seconds. > Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this > message. > ps D 0 14018 1 0x00000004 > ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0 > ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040 > 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000 > Call Trace: > [<ffffffff817154d0>] ? __schedule+0x250/0x730 > [<ffffffff817159e6>] schedule+0x36/0x80 > [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150 > [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30 > [<ffffffff81717db0>] down_read+0x20/0x40 > [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0 > [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100 > [<ffffffff81241d87>] __vfs_read+0x37/0x150 > [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0 > [<ffffffff81242266>] vfs_read+0x96/0x130 > [<ffffffff812437b5>] SyS_read+0x55/0xc0 > [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5 > > It is because munmap holds mmap_sem from very beginning to all the way > down to the end, and doesn't release it in the middle. When unmapping > large mapping, it may take long time (take ~18 seconds to unmap 320GB > mapping with every single page mapped on an idle machine).
> > Zapping pages is the most time consuming part, according to the > suggestion from Michal Hocko [1], zapping pages can be done with holding > read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write > mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set, > the page fault to VM_DEAD vma will trigger SIGSEGV. > > Define large mapping size thresh as PUD size or 1GB, just zap pages with > read mmap_sem for mappings which are >= thresh value. Perhaps it would be better to treat all mappings in this fashion, regardless of size. Simpler code, lesser mmap_sem hold times, much better testing coverage. Is there any particular downside to doing this? > If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just > fallback to regular path since unmapping those mappings need acquire > write mmap_sem. So we'll still get huge latencies and softlockup warnings for some usecases. This is a problem! > For the time being, just do this in munmap syscall path. Other > vm_munmap() or do_munmap() call sites remain intact for stability > reason. > > The below is some regression and performance data collected on a machine > with 32 cores of E5-2680 @ 2.70GHz and 384GB memory. Where is this "regression and performance data"? Something missing from the changelog? > With the patched kernel, write mmap_sem hold time is dropped to us level > from second. > > [1] https://lwn.net/Articles/753269/ > > ... 
> > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma, > return 1; > } > > +/* Consider PUD size or 1GB mapping as large mapping */ > +#ifdef HPAGE_PUD_SIZE > +#define LARGE_MAP_THRESH HPAGE_PUD_SIZE > +#else > +#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024) > +#endif > + > +/* Unmap large mapping early with acquiring read mmap_sem */ > +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start, > + size_t len, struct list_head *uf) Can we have a comment describing what `uf' is and what it does? (at least) > +{ > + unsigned long end = 0; > + struct vm_area_struct *vma = NULL, *prev, *tmp; `tmp' is a poor choice of identifier - it doesn't communicate either the variable's type or its purpose. Perhaps rename vma to start_vma(?) and rename tmp to vma? And declaring start_vma to be const would be a nice readability addition. > + bool success = false; > + int ret = 0; > + > + if (!munmap_addr_sanity(start, len)) > + return -EINVAL; > + > + len = PAGE_ALIGN(len); > + > + end = start + len; > + > + /* Just deal with uf in regular path */ > + if (unlikely(uf)) > + goto regular_path; > + > + if (len >= LARGE_MAP_THRESH) { > + /* > + * need write mmap_sem to split vma and set VM_DEAD flag > + * splitting vma up-front to save PITA to clean if it is failed > + */ > + down_write(&mm->mmap_sem); > + ret = munmap_lookup_vma(mm, &vma, &prev, start, end); > + if (ret != 1) { > + up_write(&mm->mmap_sem); > + return ret; Can just use `goto out' here, and that would avoid the unpleasing use of a deeply embedded `return'. > + } > + /* This ret value might be returned, so reset it */ > + ret = 0; > + > + /* > + * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP > + * flag set or has uprobes set, need acquire write map_sem, > + * so skip them in early zap. Just deal with such mapping in > + * regular path. 
For each case, please describe *why* mmap_sem must be held for writing. > + * Borrow can_madv_dontneed_vma() to check the conditions. > + */ > + tmp = vma; > + while (tmp && tmp->vm_start < end) { > + if (!can_madv_dontneed_vma(tmp) || > + vma_has_uprobes(tmp, start, end)) { > + up_write(&mm->mmap_sem); > + goto regular_path; > + } > + tmp = tmp->vm_next; > + } > + /* > + * set VM_DEAD flag before tear down them. > + * page fault on VM_DEAD vma will trigger SIGSEGV. > + */ > + tmp = vma; > + for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next) > + tmp->vm_flags |= VM_DEAD; > + up_write(&mm->mmap_sem); > + > + /* zap mappings with read mmap_sem */ > + down_read(&mm->mmap_sem); Use downgrade_write()? > + zap_page_range(vma, start, len); > + /* indicates early zap is success */ > + success = true; > + up_read(&mm->mmap_sem); > + } > + > +regular_path: > + /* hold write mmap_sem for vma manipulation or regular path */ > + if (down_write_killable(&mm->mmap_sem)) > + return -EINTR; Why is this _killable() while the preceding down_write() was not? > + if (success) { > + /* vmas have been zapped, here clean up pgtable and vmas */ > + struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap; > + struct mmu_gather tlb; > + tlb_gather_mmu(&tlb, mm, start, end); > + free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, > + next ? next->vm_start : USER_PGTABLES_CEILING); > + tlb_finish_mmu(&tlb, start, end); > + > + detach_vmas_to_be_unmapped(mm, vma, prev, end); > + arch_unmap(mm, vma, start, end); > + remove_vma_list(mm, vma); > + } else { > + /* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */ > + if (vma) { > + if (unlikely(uf)) { > + int ret = userfaultfd_unmap_prep(vma, start, > + end, uf); > + if (ret) > + goto out; Bug? This `ret' shadows the other `ret' in this function. 
> + } > + if (mm->locked_vm) { > + tmp = vma; > + while (tmp && tmp->vm_start < end) { > + if (tmp->vm_flags & VM_LOCKED) { > + mm->locked_vm -= vma_pages(tmp); > + munlock_vma_pages_all(tmp); > + } > + tmp = tmp->vm_next; > + } > + } > + detach_vmas_to_be_unmapped(mm, vma, prev, end); > + unmap_region(mm, vma, prev, start, end); > + remove_vma_list(mm, vma); > + } else > + /* When mapping size < LARGE_MAP_THRESH */ > + ret = do_munmap(mm, start, len, uf); > + } > + > +out: > + up_write(&mm->mmap_sem); > + return ret; > +} > + > /* Munmap is split into 2 main parts -- this part which finds > * what needs doing, and the areas themselves, which do the > * work. This now handles partial unmappings. > @@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, > return 0; > } > > +static int vm_munmap_zap_early(unsigned long start, size_t len) > +{ > + int ret; > + struct mm_struct *mm = current->mm; > + LIST_HEAD(uf); > + > + ret = do_munmap_zap_early(mm, start, len, &uf); > + userfaultfd_unmap_complete(mm, &uf); > + return ret; > +} > + > int vm_munmap(unsigned long start, size_t len) > { > int ret; > @@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len) > SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len) > { > profile_munmap(addr); > - return vm_munmap(addr, len); > + return vm_munmap_zap_early(addr, len); > } > > - > /* > * Emulation of deprecated remap_file_pages() syscall. > */ > -- > 1.8.3.1
On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote: And... > diff --git a/mm/mmap.c b/mm/mmap.c > index 87dcf83..d61e08b 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma, > return 1; > } > > +/* Consider PUD size or 1GB mapping as large mapping */ > +#ifdef HPAGE_PUD_SIZE > +#define LARGE_MAP_THRESH HPAGE_PUD_SIZE > +#else > +#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024) > +#endif So this assumes that 32-bit machines cannot have 1GB mappings (fair enough) and this is the sole means by which we avoid falling into the "len >= LARGE_MAP_THRESH" codepath, which will behave very badly, at least because for such machines, VM_DEAD is zero. This is rather ugly and fragile. And, I guess, explains why we can't give all mappings this treatment: 32-bit machines can't do it. And we're adding a bunch of code to 32-bit kernels which will never be executed. I'm thinking it would be better to be much more explicit with "#ifdef CONFIG_64BIT" in this code, rather than relying upon the above magic. But I tend to think that the fact that we haven't solved anything on locked vmas or on uprobed mappings is a showstopper for the whole approach :(
On 6/29/18 6:28 PM, Andrew Morton wrote: > On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote: > >> When running some mmap/munmap scalability tests with large memory (i.e. >>> 300GB), the below hung task issue may happen occasionally. >> INFO: task ps:14018 blocked for more than 120 seconds. >> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this >> message. >> ps D 0 14018 1 0x00000004 >> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0 >> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040 >> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000 >> Call Trace: >> [<ffffffff817154d0>] ? __schedule+0x250/0x730 >> [<ffffffff817159e6>] schedule+0x36/0x80 >> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150 >> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30 >> [<ffffffff81717db0>] down_read+0x20/0x40 >> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0 >> [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100 >> [<ffffffff81241d87>] __vfs_read+0x37/0x150 >> [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0 >> [<ffffffff81242266>] vfs_read+0x96/0x130 >> [<ffffffff812437b5>] SyS_read+0x55/0xc0 >> [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5 >> >> It is because munmap holds mmap_sem from very beginning to all the way >> down to the end, and doesn't release it in the middle. When unmapping >> large mapping, it may take long time (take ~18 seconds to unmap 320GB >> mapping with every single page mapped on an idle machine). >> >> It is because munmap holds mmap_sem from very beginning to all the way >> down to the end, and doesn't release it in the middle. When unmapping >> large mapping, it may take long time (take ~18 seconds to unmap 320GB >> mapping with every single page mapped on an idle machine). 
>> >> Zapping pages is the most time consuming part, according to the >> suggestion from Michal Hocko [1], zapping pages can be done with holding >> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write >> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set, >> the page fault to VM_DEAD vma will trigger SIGSEGV. >> >> Define large mapping size thresh as PUD size or 1GB, just zap pages with >> read mmap_sem for mappings which are >= thresh value. > Perhaps it would be better to treat all mappings in this fashion, > regardless of size. Simpler code, lesser mmap_sem hold times, much > better testing coverage. Is there any particular downside to doing > this? Actually, my testing uses 4K size to improve the coverage. The only downside AFAICS is that the cost of repeatedly acquiring/releasing the lock may outweigh the benefits. > >> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just >> fallback to regular path since unmapping those mappings need acquire >> write mmap_sem. > So we'll still get huge latencies and softlockup warnings for some > usecases. This is a problem! Because unmapping such an area needs to modify vm_flags, which needs write mmap_sem, in the current code base. > >> For the time being, just do this in munmap syscall path. Other >> vm_munmap() or do_munmap() call sites remain intact for stability >> reason. >> >> The below is some regression and performance data collected on a machine >> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory. > Where is this "regression and performance data"? Something missing from > the changelog? oops, it might be removed inadvertently. >> With the patched kernel, write mmap_sem hold time is dropped to us level >> from second. >> >> [1] https://lwn.net/Articles/753269/ >> >> ... 
>> >> --- a/mm/mmap.c >> +++ b/mm/mmap.c >> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma, >> return 1; >> } >> >> +/* Consider PUD size or 1GB mapping as large mapping */ >> +#ifdef HPAGE_PUD_SIZE >> +#define LARGE_MAP_THRESH HPAGE_PUD_SIZE >> +#else >> +#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024) >> +#endif >> + >> +/* Unmap large mapping early with acquiring read mmap_sem */ >> +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start, >> + size_t len, struct list_head *uf) > Can we have a comment describing what `uf' is and what it does? (at least) Sure. > >> +{ >> + unsigned long end = 0; >> + struct vm_area_struct *vma = NULL, *prev, *tmp; > `tmp' is a poor choice of identifier - it doesn't communicate either > the variable's type nor its purpose. > > Perhaps rename vma to start_vma(?) and rename tmp to vma? > > And declaring start_vma to be const would be a nice readability addition. Sounds ok. > >> + bool success = false; >> + int ret = 0; >> + >> + if (!munmap_addr_sanity(start, len)) >> + return -EINVAL; >> + >> + len = PAGE_ALIGN(len); >> + >> + end = start + len; >> + >> + /* Just deal with uf in regular path */ >> + if (unlikely(uf)) >> + goto regular_path; >> + >> + if (len >= LARGE_MAP_THRESH) { >> + /* >> + * need write mmap_sem to split vma and set VM_DEAD flag >> + * splitting vma up-front to save PITA to clean if it is failed >> + */ >> + down_write(&mm->mmap_sem); >> + ret = munmap_lookup_vma(mm, &vma, &prev, start, end); >> + if (ret != 1) { >> + up_write(&mm->mmap_sem); >> + return ret; > Can just use `goto out' here, and that would avoid the unpleasing use > of a deeply eembded `return'. Yes. > >> + } >> + /* This ret value might be returned, so reset it */ >> + ret = 0; >> + >> + /* >> + * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP >> + * flag set or has uprobes set, need acquire write map_sem, >> + * so skip them in early zap. 
Just deal with such mapping in >> + * regular path. > For each case, please describe *why* mmap_sem must be held for writing. Sure. > >> + * Borrow can_madv_dontneed_vma() to check the conditions. >> + */ >> + tmp = vma; >> + while (tmp && tmp->vm_start < end) { >> + if (!can_madv_dontneed_vma(tmp) || >> + vma_has_uprobes(tmp, start, end)) { >> + up_write(&mm->mmap_sem); >> + goto regular_path; >> + } >> + tmp = tmp->vm_next; >> + } >> + /* >> + * set VM_DEAD flag before tear down them. >> + * page fault on VM_DEAD vma will trigger SIGSEGV. >> + */ >> + tmp = vma; >> + for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next) >> + tmp->vm_flags |= VM_DEAD; >> + up_write(&mm->mmap_sem); >> + >> + /* zap mappings with read mmap_sem */ >> + down_read(&mm->mmap_sem); > Use downgrade_write()? Aha, thanks for reminding. I forgot this. Just focused on "upgrade_read" part suggested by Michal. > >> + zap_page_range(vma, start, len); >> + /* indicates early zap is success */ >> + success = true; >> + up_read(&mm->mmap_sem); >> + } >> + >> +regular_path: >> + /* hold write mmap_sem for vma manipulation or regular path */ >> + if (down_write_killable(&mm->mmap_sem)) >> + return -EINTR; > Why is this _killable() while the preceding down_write() was not? This is copied from the original do_munmap(). The preceding one could be _killable() too. > >> + if (success) { >> + /* vmas have been zapped, here clean up pgtable and vmas */ >> + struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap; >> + struct mmu_gather tlb; >> + tlb_gather_mmu(&tlb, mm, start, end); >> + free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, >> + next ? 
next->vm_start : USER_PGTABLES_CEILING); >> + tlb_finish_mmu(&tlb, start, end); >> + >> + detach_vmas_to_be_unmapped(mm, vma, prev, end); >> + arch_unmap(mm, vma, start, end); >> + remove_vma_list(mm, vma); >> + } else { >> + /* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */ >> + if (vma) { >> + if (unlikely(uf)) { >> + int ret = userfaultfd_unmap_prep(vma, start, >> + end, uf); >> + if (ret) >> + goto out; > Bug? This `ret' shadows the other `ret' in this function. oops, it should just use the same "ret". Thanks, Yang > >> + } >> + if (mm->locked_vm) { >> + tmp = vma; >> + while (tmp && tmp->vm_start < end) { >> + if (tmp->vm_flags & VM_LOCKED) { >> + mm->locked_vm -= vma_pages(tmp); >> + munlock_vma_pages_all(tmp); >> + } >> + tmp = tmp->vm_next; >> + } >> + } >> + detach_vmas_to_be_unmapped(mm, vma, prev, end); >> + unmap_region(mm, vma, prev, start, end); >> + remove_vma_list(mm, vma); >> + } else >> + /* When mapping size < LARGE_MAP_THRESH */ >> + ret = do_munmap(mm, start, len, uf); >> + } >> + >> +out: >> + up_write(&mm->mmap_sem); >> + return ret; >> +} >> + >> /* Munmap is split into 2 main parts -- this part which finds >> * what needs doing, and the areas themselves, which do the >> * work. This now handles partial unmappings. 
>> @@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, >> return 0; >> } >> >> +static int vm_munmap_zap_early(unsigned long start, size_t len) >> +{ >> + int ret; >> + struct mm_struct *mm = current->mm; >> + LIST_HEAD(uf); >> + >> + ret = do_munmap_zap_early(mm, start, len, &uf); >> + userfaultfd_unmap_complete(mm, &uf); >> + return ret; >> +} >> + >> int vm_munmap(unsigned long start, size_t len) >> { >> int ret; >> @@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len) >> SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len) >> { >> profile_munmap(addr); >> - return vm_munmap(addr, len); >> + return vm_munmap_zap_early(addr, len); >> } >> >> - >> /* >> * Emulation of deprecated remap_file_pages() syscall. >> */ >> -- >> 1.8.3.1
On 6/29/18 6:35 PM, Andrew Morton wrote: > On Sat, 30 Jun 2018 06:39:44 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote: > > > And... > >> diff --git a/mm/mmap.c b/mm/mmap.c >> index 87dcf83..d61e08b 100644 >> --- a/mm/mmap.c >> +++ b/mm/mmap.c >> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma, >> return 1; >> } >> >> +/* Consider PUD size or 1GB mapping as large mapping */ >> +#ifdef HPAGE_PUD_SIZE >> +#define LARGE_MAP_THRESH HPAGE_PUD_SIZE >> +#else >> +#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024) >> +#endif > So this assumes that 32-bit machines cannot have 1GB mappings (fair > enough) and this is the sole means by which we avoid falling into the > "len >= LARGE_MAP_THRESH" codepath, which will behave very badly, at > least because for such machines, VM_DEAD is zero. > > This is rather ugly and fragile. And, I guess, explains why we can't > give all mappings this treatment: 32-bit machines can't do it. And > we're adding a bunch of code to 32-bit kernels which will never be > executed. > > I'm thinking it would be better to be much more explicit with "#ifdef > CONFIG_64BIT" in this code, rather than relying upon the above magic. > > But I tend to think that the fact that we haven't solved anything on > locked vmas or on uprobed mappings is a showstopper for the whole > approach :( I agree it is not that perfect. But, it could still improve most use cases. For the locked vmas and hugetlb vmas, unmapping operations need to modify vm_flags. But, I'm wondering whether we might be able to separate the unmap and the vm_flags update. Because we know they will be unmapped right away, the vm_flags might be able to be updated in the write mmap_sem critical section before the actual unmap is called or after it. This is just off the top of my head. For uprobed mappings, I'm not sure how vital that case is. Thanks, Yang >
On Fri, 29 Jun 2018 19:28:15 -0700 Yang Shi <yang.shi@linux.alibaba.com> wrote: > > > > we're adding a bunch of code to 32-bit kernels which will never be > > executed. > > > > I'm thinking it would be better to be much more explicit with "#ifdef > > CONFIG_64BIT" in this code, rather than relying upon the above magic. > > > > But I tend to think that the fact that we haven't solved anything on > > locked vmas or on uprobed mappings is a showstopper for the whole > > approach :( > > I agree it is not that perfect. But, it could still improve most use > cases. Well, those unaddressed usecases will need to be fixed at some point. What's our plan for that? Would one of your earlier designs have addressed all usecases? I expect the dumb unmap-a-little-bit-at-a-time approach would have? > For the locked vmas and hugetlb vmas, unmapping operations need to modify > vm_flags. But, I'm wondering whether we might be able to separate the unmap and > the vm_flags update. Because we know they will be unmapped right away, the > vm_flags might be able to be updated in the write mmap_sem critical section > before the actual unmap is called or after it. This is just off the top > of my head. > > For uprobed mappings, I'm not sure how vital that case is. > > Thanks, > Yang > > >
On 6/29/18 8:15 PM, Andrew Morton wrote: > On Fri, 29 Jun 2018 19:28:15 -0700 Yang Shi <yang.shi@linux.alibaba.com> wrote: > >> >>> we're adding a bunch of code to 32-bit kernels which will never be >>> executed. >>> >>> I'm thinking it would be better to be much more explicit with "#ifdef >>> CONFIG_64BIT" in this code, rather than relying upon the above magic. >>> >>> But I tend to think that the fact that we haven't solved anything on >>> locked vmas or on uprobed mappings is a showstopper for the whole >>> approach :( >> I agree it is not that perfect. But, it could still improve most use >> cases. > Well, those unaddressed usecases will need to be fixed at some point. Yes, definitely. > What's our plan for that? As I mentioned in the earlier email, the locked and hugetlb cases might be able to be solved by separating the vm_flags update from the actual unmap. I will look into it further later. From my point of view, the uprobe mapping case does not sound that vital. > > Would one of your earlier designs have addressed all usecases? I > expect the dumb unmap-a-little-bit-at-a-time approach would have? Yes. The v1 design does the unmap while holding write mmap_sem. So, the vm_flags update is not a problem. Thanks, Yang > >> For the locked vmas and hugetlb vmas, unmapping operations need to modify >> vm_flags. But, I'm wondering whether we might be able to separate the unmap and >> the vm_flags update. Because we know they will be unmapped right away, the >> vm_flags might be able to be updated in the write mmap_sem critical section >> before the actual unmap is called or after it. This is just off the top >> of my head. >> >> For uprobed mappings, I'm not sure how vital that case is. >> >> Thanks, >> Yang >>
On Sat, Jun 30, 2018 at 06:39:44AM +0800, Yang Shi wrote: > When running some mmap/munmap scalability tests with large memory (i.e. > > 300GB), the below hung task issue may happen occasionally. > > INFO: task ps:14018 blocked for more than 120 seconds. > Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this > message. > ps D 0 14018 1 0x00000004 > ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0 > ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040 > 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000 > Call Trace: > [<ffffffff817154d0>] ? __schedule+0x250/0x730 > [<ffffffff817159e6>] schedule+0x36/0x80 > [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150 > [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30 > [<ffffffff81717db0>] down_read+0x20/0x40 > [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0 > [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100 > [<ffffffff81241d87>] __vfs_read+0x37/0x150 > [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0 > [<ffffffff81242266>] vfs_read+0x96/0x130 > [<ffffffff812437b5>] SyS_read+0x55/0xc0 > [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5 > > It is because munmap holds mmap_sem from very beginning to all the way > down to the end, and doesn't release it in the middle. When unmapping > large mapping, it may take long time (take ~18 seconds to unmap 320GB > mapping with every single page mapped on an idle machine). > > It is because munmap holds mmap_sem from very beginning to all the way > down to the end, and doesn't release it in the middle. When unmapping > large mapping, it may take long time (take ~18 seconds to unmap 320GB > mapping with every single page mapped on an idle machine). > > Zapping pages is the most time consuming part, according to the > suggestion from Michal Hock [1], zapping pages can be done with holding > read mmap_sem, like what MADV_DONTNEED does. 
Then re-acquire write > mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set, > the page fault to VM_DEAD vma will trigger SIGSEGV. > > Define large mapping size thresh as PUD size or 1GB, just zap pages with > read mmap_sem for mappings which are >= thresh value. > > If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just > fallback to regular path since unmapping those mappings need acquire > write mmap_sem. > > For the time being, just do this in munmap syscall path. Other > vm_munmap() or do_munmap() call sites remain intact for stability > reason. > > The below is some regression and performance data collected on a machine > with 32 cores of E5-2680 @ 2.70GHz and 384GB memory. > > With the patched kernel, write mmap_sem hold time is dropped to us level > from second. > > [1] https://lwn.net/Articles/753269/ > > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> > --- > mm/mmap.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 134 insertions(+), 2 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index 87dcf83..d61e08b 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma, > return 1; > } > > +/* Consider PUD size or 1GB mapping as large mapping */ > +#ifdef HPAGE_PUD_SIZE > +#define LARGE_MAP_THRESH HPAGE_PUD_SIZE > +#else > +#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024) > +#endif PUD_SIZE is defined everywhere. 
> + > +/* Unmap large mapping early with acquiring read mmap_sem */ > +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start, > + size_t len, struct list_head *uf) > +{ > + unsigned long end = 0; > + struct vm_area_struct *vma = NULL, *prev, *tmp; > + bool success = false; > + int ret = 0; > + > + if (!munmap_addr_sanity(start, len)) > + return -EINVAL; > + > + len = PAGE_ALIGN(len); > + > + end = start + len; > + > + /* Just deal with uf in regular path */ > + if (unlikely(uf)) > + goto regular_path; > + > + if (len >= LARGE_MAP_THRESH) { > + /* > + * need write mmap_sem to split vma and set VM_DEAD flag > + * splitting vma up-front to save PITA to clean if it is failed What errors do you talk about? ENOMEM on VMA split? Anything else? > + */ > + down_write(&mm->mmap_sem); > + ret = munmap_lookup_vma(mm, &vma, &prev, start, end); > + if (ret != 1) { > + up_write(&mm->mmap_sem); > + return ret; > + } > + /* This ret value might be returned, so reset it */ > + ret = 0; > + > + /* > + * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP > + * flag set or has uprobes set, need acquire write map_sem, > + * so skip them in early zap. Just deal with such mapping in > + * regular path. > + * Borrow can_madv_dontneed_vma() to check the conditions. > + */ > + tmp = vma; > + while (tmp && tmp->vm_start < end) { > + if (!can_madv_dontneed_vma(tmp) || > + vma_has_uprobes(tmp, start, end)) { > + up_write(&mm->mmap_sem); > + goto regular_path; > + } > + tmp = tmp->vm_next; > + } > + /* > + * set VM_DEAD flag before tear down them. > + * page fault on VM_DEAD vma will trigger SIGSEGV. > + */ > + tmp = vma; > + for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next) > + tmp->vm_flags |= VM_DEAD; I probably miss the explanation somewhere, but what's wrong with allowing other thread to re-populate the VMA? I would rather allow the VMA to be re-populated by other thread while we are zapping the range. And later zap the range again under down_write. 
It should also lead to a consolidated regular path: take mmap_sem for write and call do_munmap(). On the first path we just skip VMAs we cannot deal with under down_read(mmap_sem); the regular path will take care of them. > + up_write(&mm->mmap_sem); > + > + /* zap mappings with read mmap_sem */ > + down_read(&mm->mmap_sem); Yeah. There's a race between up_write() and down_read(). Use downgrade, as Andrew suggested. > + zap_page_range(vma, start, len); > + /* indicates early zap is success */ > + success = true; > + up_read(&mm->mmap_sem); And here again. This race can be avoided if we didn't carry vma to regular_path, but just went directly to do_munmap(). > + } > + > +regular_path: > + /* hold write mmap_sem for vma manipulation or regular path */ > + if (down_write_killable(&mm->mmap_sem)) > + return -EINTR; > + if (success) { > + /* vmas have been zapped, here clean up pgtable and vmas */ > + struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap; > + struct mmu_gather tlb; > + tlb_gather_mmu(&tlb, mm, start, end); > + free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, > + next ? 
next->vm_start : USER_PGTABLES_CEILING); > + tlb_finish_mmu(&tlb, start, end); > + > + detach_vmas_to_be_unmapped(mm, vma, prev, end); > + arch_unmap(mm, vma, start, end); > + remove_vma_list(mm, vma); > + } else { > + /* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */ > + if (vma) { > + if (unlikely(uf)) { > + int ret = userfaultfd_unmap_prep(vma, start, > + end, uf); > + if (ret) > + goto out; > + } > + if (mm->locked_vm) { > + tmp = vma; > + while (tmp && tmp->vm_start < end) { > + if (tmp->vm_flags & VM_LOCKED) { > + mm->locked_vm -= vma_pages(tmp); > + munlock_vma_pages_all(tmp); > + } > + tmp = tmp->vm_next; > + } > + } > + detach_vmas_to_be_unmapped(mm, vma, prev, end); > + unmap_region(mm, vma, prev, start, end); > + remove_vma_list(mm, vma); > + } else > + /* When mapping size < LARGE_MAP_THRESH */ > + ret = do_munmap(mm, start, len, uf); > + } > + > +out: > + up_write(&mm->mmap_sem); > + return ret; > +} > + > /* Munmap is split into 2 main parts -- this part which finds > * what needs doing, and the areas themselves, which do the > * work. This now handles partial unmappings. > @@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, > return 0; > } > > +static int vm_munmap_zap_early(unsigned long start, size_t len) > +{ > + int ret; > + struct mm_struct *mm = current->mm; > + LIST_HEAD(uf); > + > + ret = do_munmap_zap_early(mm, start, len, &uf); > + userfaultfd_unmap_complete(mm, &uf); > + return ret; > +} > + > int vm_munmap(unsigned long start, size_t len) > { > int ret; > @@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len) > SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len) > { > profile_munmap(addr); > - return vm_munmap(addr, len); > + return vm_munmap_zap_early(addr, len); > } > > - > /* > * Emulation of deprecated remap_file_pages() syscall. > */ > -- > 1.8.3.1 >
On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote: [...] > I probably miss the explanation somewhere, but what's wrong with allowing > other thread to re-populate the VMA? We have discussed that earlier and it boils down to how is racy access to munmap supposed to behave. Right now we have either the original content or SEGV. If we allow to simply madvise_dontneed before real unmap we could get a new page as well. There might be (quite broken I would say) user space code that would simply corrupt data silently that way.
On Sat 30-06-18 06:39:44, Yang Shi wrote: > When running some mmap/munmap scalability tests with large memory (i.e. > > 300GB), the below hung task issue may happen occasionally. > > INFO: task ps:14018 blocked for more than 120 seconds. > Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this > message. > ps D 0 14018 1 0x00000004 > ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0 > ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040 > 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000 > Call Trace: > [<ffffffff817154d0>] ? __schedule+0x250/0x730 > [<ffffffff817159e6>] schedule+0x36/0x80 > [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150 > [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30 > [<ffffffff81717db0>] down_read+0x20/0x40 > [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0 > [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100 > [<ffffffff81241d87>] __vfs_read+0x37/0x150 > [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0 > [<ffffffff81242266>] vfs_read+0x96/0x130 > [<ffffffff812437b5>] SyS_read+0x55/0xc0 > [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5 > > It is because munmap holds mmap_sem from very beginning to all the way > down to the end, and doesn't release it in the middle. When unmapping > large mapping, it may take long time (take ~18 seconds to unmap 320GB > mapping with every single page mapped on an idle machine). > > Zapping pages is the most time consuming part, according to the > suggestion from Michal Hock [1], zapping pages can be done with holding s@Hock@Hocko@ > read mmap_sem, like what MADV_DONTNEED does.
Then re-acquire write > mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set, > the page fault to VM_DEAD vma will trigger SIGSEGV. This really deserves an explanation why all this dance is really needed. It would also be good to mention how you achieve the overall consistency. E.g. you are dropping mmap_sem and then re-taking it for write. What if any pending write lock succeeds and modifies the address space? Does it matter, and why if not? > Define large mapping size thresh as PUD size or 1GB, just zap pages with > read mmap_sem for mappings which are >= thresh value. > > If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just > fallback to regular path since unmapping those mappings need acquire > write mmap_sem. > > For the time being, just do this in munmap syscall path. Other > vm_munmap() or do_munmap() call sites remain intact for stability > reason. What are those stability reasons? > The below is some regression and performance data collected on a machine > with 32 cores of E5-2680 @ 2.70GHz and 384GB memory. > > With the patched kernel, write mmap_sem hold time is dropped to us level > from second. I haven't read through the implementation carefully TBH but the changelog needs quite some work to explain the solution and the resulting semantics of munmap after the change.
On Fri 29-06-18 20:15:47, Andrew Morton wrote: [...] > Would one of your earlier designs have addressed all usecases? I > expect the dumb unmap-a-little-bit-at-a-time approach would have? It has been already pointed out that this will not work. You simply cannot drop the mmap_sem during unmap because another thread could change the address space under your feet. So you need some form of VM_DEAD and handle concurrent and conflicting address space operations.
On 7/2/18 6:53 AM, Michal Hocko wrote: > On Sat 30-06-18 06:39:44, Yang Shi wrote: >> When running some mmap/munmap scalability tests with large memory (i.e. >>> 300GB), the below hung task issue may happen occasionally. >> INFO: task ps:14018 blocked for more than 120 seconds. >> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this >> message. >> ps D 0 14018 1 0x00000004 >> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0 >> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040 >> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000 >> Call Trace: >> [<ffffffff817154d0>] ? __schedule+0x250/0x730 >> [<ffffffff817159e6>] schedule+0x36/0x80 >> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150 >> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30 >> [<ffffffff81717db0>] down_read+0x20/0x40 >> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0 >> [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100 >> [<ffffffff81241d87>] __vfs_read+0x37/0x150 >> [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0 >> [<ffffffff81242266>] vfs_read+0x96/0x130 >> [<ffffffff812437b5>] SyS_read+0x55/0xc0 >> [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5 >> >> It is because munmap holds mmap_sem from very beginning to all the way >> down to the end, and doesn't release it in the middle. When unmapping >> large mapping, it may take long time (take ~18 seconds to unmap 320GB >> mapping with every single page mapped on an idle machine).
>> >> Zapping pages is the most time consuming part, according to the >> suggestion from Michal Hock [1], zapping pages can be done with holding > s@Hock@Hocko@ Sorry for the wrong spelling. > >> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write >> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set, >> the page fault to VM_DEAD vma will trigger SIGSEGV. > This really deserves an explanation why the all dance is really needed. > > It would be also good to mention how do you achieve the overal > consistency. E.g. you are dropping mmap_sem and then re-taking it for > write. What if any pending write lock succeeds and modify the address > space? Does it matter, why if not? Sure. > >> Define large mapping size thresh as PUD size or 1GB, just zap pages with >> read mmap_sem for mappings which are >= thresh value. >> >> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just >> fallback to regular path since unmapping those mappings need acquire >> write mmap_sem. >> >> For the time being, just do this in munmap syscall path. Other >> vm_munmap() or do_munmap() call sites remain intact for stability >> reason. > What are those stability reasons? mmap() and mremap() may call do_munmap() as well, so it may introduce more race conditions if they use the zap-early version of do_munmap() too. They would have many more chances to take mmap_sem to change the address space and cause conflicts. And it looks like they are not the main source of long write mmap_sem hold times. So it does not seem worth making things more complicated for the time being. > >> The below is some regression and performance data collected on a machine >> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory. >> >> With the patched kernel, write mmap_sem hold time is dropped to us level >> from second.
> I haven't read through the implementation carefully TBH but the changelog > needs quite some work to explain the solution and the resulting semantics of > munmap after the change. Thanks for the suggestion. Will polish the changelog. Yang
On 7/2/18 5:33 AM, Kirill A. Shutemov wrote: > On Sat, Jun 30, 2018 at 06:39:44AM +0800, Yang Shi wrote: >> When running some mmap/munmap scalability tests with large memory (i.e. >>> 300GB), the below hung task issue may happen occasionally. >> INFO: task ps:14018 blocked for more than 120 seconds. >> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this >> message. >> ps D 0 14018 1 0x00000004 >> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0 >> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040 >> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000 >> Call Trace: >> [<ffffffff817154d0>] ? __schedule+0x250/0x730 >> [<ffffffff817159e6>] schedule+0x36/0x80 >> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150 >> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30 >> [<ffffffff81717db0>] down_read+0x20/0x40 >> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0 >> [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100 >> [<ffffffff81241d87>] __vfs_read+0x37/0x150 >> [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0 >> [<ffffffff81242266>] vfs_read+0x96/0x130 >> [<ffffffff812437b5>] SyS_read+0x55/0xc0 >> [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5 >> >> It is because munmap holds mmap_sem from very beginning to all the way >> down to the end, and doesn't release it in the middle. When unmapping >> large mapping, it may take long time (take ~18 seconds to unmap 320GB >> mapping with every single page mapped on an idle machine).
>> >> Zapping pages is the most time consuming part, according to the >> suggestion from Michal Hock [1], zapping pages can be done with holding >> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write >> mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set, >> the page fault to VM_DEAD vma will trigger SIGSEGV. >> >> Define large mapping size thresh as PUD size or 1GB, just zap pages with >> read mmap_sem for mappings which are >= thresh value. >> >> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just >> fallback to regular path since unmapping those mappings need acquire >> write mmap_sem. >> >> For the time being, just do this in munmap syscall path. Other >> vm_munmap() or do_munmap() call sites remain intact for stability >> reason. >> >> The below is some regression and performance data collected on a machine >> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory. >> >> With the patched kernel, write mmap_sem hold time is dropped to us level >> from second. >> >> [1] https://lwn.net/Articles/753269/ >> >> Cc: Michal Hocko <mhocko@kernel.org> >> Cc: Matthew Wilcox <willy@infradead.org> >> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> >> --- >> mm/mmap.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- >> 1 file changed, 134 insertions(+), 2 deletions(-) >> >> diff --git a/mm/mmap.c b/mm/mmap.c >> index 87dcf83..d61e08b 100644 >> --- a/mm/mmap.c >> +++ b/mm/mmap.c >> @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma, >> return 1; >> } >> >> +/* Consider PUD size or 1GB mapping as large mapping */ >> +#ifdef HPAGE_PUD_SIZE >> +#define LARGE_MAP_THRESH HPAGE_PUD_SIZE >> +#else >> +#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024) >> +#endif > PUD_SIZE is defined everywhere. 
If THP is defined, otherwise it is: #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; }) > >> + >> +/* Unmap large mapping early with acquiring read mmap_sem */ >> +static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start, >> + size_t len, struct list_head *uf) >> +{ >> + unsigned long end = 0; >> + struct vm_area_struct *vma = NULL, *prev, *tmp; >> + bool success = false; >> + int ret = 0; >> + >> + if (!munmap_addr_sanity(start, len)) >> + return -EINVAL; >> + >> + len = PAGE_ALIGN(len); >> + >> + end = start + len; >> + >> + /* Just deal with uf in regular path */ >> + if (unlikely(uf)) >> + goto regular_path; >> + >> + if (len >= LARGE_MAP_THRESH) { >> + /* >> + * need write mmap_sem to split vma and set VM_DEAD flag >> + * splitting vma up-front to save PITA to clean if it is failed > What errors do you talk about? ENOMEM on VMA split? Anything else? Yes, ENOMEM on vma split. > >> + */ >> + down_write(&mm->mmap_sem); >> + ret = munmap_lookup_vma(mm, &vma, &prev, start, end); >> + if (ret != 1) { >> + up_write(&mm->mmap_sem); >> + return ret; >> + } >> + /* This ret value might be returned, so reset it */ >> + ret = 0; >> + >> + /* >> + * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP >> + * flag set or has uprobes set, need acquire write map_sem, >> + * so skip them in early zap. Just deal with such mapping in >> + * regular path. >> + * Borrow can_madv_dontneed_vma() to check the conditions. >> + */ >> + tmp = vma; >> + while (tmp && tmp->vm_start < end) { >> + if (!can_madv_dontneed_vma(tmp) || >> + vma_has_uprobes(tmp, start, end)) { >> + up_write(&mm->mmap_sem); >> + goto regular_path; >> + } >> + tmp = tmp->vm_next; >> + } >> + /* >> + * set VM_DEAD flag before tear down them. >> + * page fault on VM_DEAD vma will trigger SIGSEGV. 
>> + */ >> + tmp = vma; >> + for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next) >> + tmp->vm_flags |= VM_DEAD; > I probably miss the explanation somewhere, but what's wrong with allowing > other thread to re-populate the VMA? > > I would rather allow the VMA to be re-populated by other thread while we > are zapping the range. And later zap the range again under down_write. > > It should also lead to consolidated regular path: take mmap_sem for write > and call do_munmap(). > > On the first path we just skip VMA we cannot deal with under > down_read(mmap_sem), regular path will take care of them. > > >> + up_write(&mm->mmap_sem); >> + >> + /* zap mappings with read mmap_sem */ >> + down_read(&mm->mmap_sem); > Yeah. There's race between up_write() and down_read(). > Use downgrade, as Andrew suggested. > >> + zap_page_range(vma, start, len); >> + /* indicates early zap is success */ >> + success = true; >> + up_read(&mm->mmap_sem); > And here again. > > This race can be avoided if we wouldn't carry vma to regular_path, but > just go directly to do_munmap(). Thanks, Kirill. Yes, I did think about re-validating vmas before. This sounds reasonable to avoid the race. Although we spend more time re-looking up vmas, it should be very short, and the duplicate zap should be very short too. Yang > >> + } >> + >> +regular_path: >> + /* hold write mmap_sem for vma manipulation or regular path */ >> + if (down_write_killable(&mm->mmap_sem)) >> + return -EINTR; >> + if (success) { >> + /* vmas have been zapped, here clean up pgtable and vmas */ >> + struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap; >> + struct mmu_gather tlb; >> + tlb_gather_mmu(&tlb, mm, start, end); >> + free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, >> + next ?
next->vm_start : USER_PGTABLES_CEILING); >> + tlb_finish_mmu(&tlb, start, end); >> + >> + detach_vmas_to_be_unmapped(mm, vma, prev, end); >> + arch_unmap(mm, vma, start, end); >> + remove_vma_list(mm, vma); >> + } else { >> + /* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */ >> + if (vma) { >> + if (unlikely(uf)) { >> + int ret = userfaultfd_unmap_prep(vma, start, >> + end, uf); >> + if (ret) >> + goto out; >> + } >> + if (mm->locked_vm) { >> + tmp = vma; >> + while (tmp && tmp->vm_start < end) { >> + if (tmp->vm_flags & VM_LOCKED) { >> + mm->locked_vm -= vma_pages(tmp); >> + munlock_vma_pages_all(tmp); >> + } >> + tmp = tmp->vm_next; >> + } >> + } >> + detach_vmas_to_be_unmapped(mm, vma, prev, end); >> + unmap_region(mm, vma, prev, start, end); >> + remove_vma_list(mm, vma); >> + } else >> + /* When mapping size < LARGE_MAP_THRESH */ >> + ret = do_munmap(mm, start, len, uf); >> + } >> + >> +out: >> + up_write(&mm->mmap_sem); >> + return ret; >> +} >> + >> /* Munmap is split into 2 main parts -- this part which finds >> * what needs doing, and the areas themselves, which do the >> * work. This now handles partial unmappings. >> @@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, >> return 0; >> } >> >> +static int vm_munmap_zap_early(unsigned long start, size_t len) >> +{ >> + int ret; >> + struct mm_struct *mm = current->mm; >> + LIST_HEAD(uf); >> + >> + ret = do_munmap_zap_early(mm, start, len, &uf); >> + userfaultfd_unmap_complete(mm, &uf); >> + return ret; >> +} >> + >> int vm_munmap(unsigned long start, size_t len) >> { >> int ret; >> @@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len) >> SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len) >> { >> profile_munmap(addr); >> - return vm_munmap(addr, len); >> + return vm_munmap_zap_early(addr, len); >> } >> >> - >> /* >> * Emulation of deprecated remap_file_pages() syscall. >> */ >> -- >> 1.8.3.1 >>
On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote: > On Fri 29-06-18 20:15:47, Andrew Morton wrote: > [...] > > Would one of your earlier designs have addressed all usecases? I > > expect the dumb unmap-a-little-bit-at-a-time approach would have? > > It has been already pointed out that this will not work. I said "one of". There were others. > You simply > cannot drop the mmap_sem during unmap because another thread could > change the address space under your feet. So you need some form of > VM_DEAD and handle concurrent and conflicting address space operations. Unclear that this is a problem. If a thread does an unmap of a range of virtual address space, there's no guarantee that upon return some other thread has not already mapped new stuff into that address range. So what's changed?
On 6/29/18 9:26 PM, Yang Shi wrote: > > > On 6/29/18 8:15 PM, Andrew Morton wrote: >> On Fri, 29 Jun 2018 19:28:15 -0700 Yang Shi >> <yang.shi@linux.alibaba.com> wrote: >> >>> >>>> we're adding a bunch of code to 32-bit kernels which will never be >>>> executed. >>>> >>>> I'm thinking it would be better to be much more explicit with "#ifdef >>>> CONFIG_64BIT" in this code, rather than relying upon the above magic. >>>> >>>> But I tend to think that the fact that we haven't solved anything on >>>> locked vmas or on uprobed mappings is a showstopper for the whole >>>> approach :( >>> I agree it is not that perfect. But, it still could improve the most >>> use >>> cases. >> Well, those unaddressed usecases will need to be fixed at some point. > > Yes, definitely. > >> What's our plan for that? > > As I mentioned in the earlier email, locked and hugetlb cases might be > able to be solved by separating vm_flags update and actual unmap. I > will look into it further later. By looking into this further, I think both mlocked and hugetlb vmas can be handled. For mlocked vmas, it is easy since we acquire write mmap_sem before unmapping, so the VM_LOCKED flag can be cleared there before the unmap, just like what the regular path does. For hugetlb vmas, the VM_MAYSHARE flag is just checked by huge_pmd_share() in hugetlb_fault()->huge_pte_alloc(); the other call site is dup_mm()->copy_page_range()->copy_hugetlb_page_range(), but we don't care about that call chain in this case. So we may expand VM_DEAD to hugetlb_fault(). Michal suggested checking VM_DEAD in check_stable_address_space(), so it would be called in hugetlb_fault() too (not in the current code); then the page fault handler would bail out before huge_pte_alloc() is called. With this trick, we don't have to care about when vm_flags is updated; we can unmap hugetlb vmas in the read mmap_sem critical section, then update vm_flags with write mmap_sem held, or do it before the unmap.
Yang > > From my point of view, uprobe mapping sounds not that vital. > >> >> Would one of your earlier designs have addressed all usecases? I >> expect the dumb unmap-a-little-bit-at-a-time approach would have? > > Yes. The v1 design does unmap with holding write map_sem. So, the > vm_flags update is not a problem. > > Thanks, > Yang > >> >>> For the locked vmas and hugetlb vmas, unmapping operations need modify >>> vm_flags. But, I'm wondering we might be able to separate unmap and >>> vm_flags update. Because we know they will be unmapped right away, the >>> vm_flags might be able to be updated in write mmap_sem critical section >>> before the actual unmap is called or after it. This is just off the top >>> of my head. >>> >>> For uprobed mappings, I'm not sure how vital it is to this case. >>> >>> Thanks, >>> Yang >>> >
On Mon 02-07-18 13:48:45, Andrew Morton wrote: > On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote: > > > On Fri 29-06-18 20:15:47, Andrew Morton wrote: > > [...] > > > Would one of your earlier designs have addressed all usecases? I > > > expect the dumb unmap-a-little-bit-at-a-time approach would have? > > > > It has been already pointed out that this will not work. > > I said "one of". There were others. Well, I was only aware of two potential solutions: either do the heavy lifting under the shared lock and do the rest with the exclusive one, or this, dropping the lock in parts. Maybe I have missed others? > > You simply > > cannot drop the mmap_sem during unmap because another thread could > > change the address space under your feet. So you need some form of > > VM_DEAD and handle concurrent and conflicting address space operations. > > Unclear that this is a problem. If a thread does an unmap of a range > of virtual address space, there's no guarantee that upon return some > other thread has not already mapped new stuff into that address range. > So what's changed? Well, consider the following scenario: Thread A = calling mmap(NULL, sizeA) Thread B = calling munmap(addr, sizeB) They do not use any external synchronization and rely on the atomic munmap. Thread B only munmaps a range that it knows belongs to it (e.g. it called mmap in the past). It should be clear that ThreadA should not get an address from the addr, sizeB range, right? In the simplest case it will not happen. But let's say that the addr, sizeB range has unmapped holes for whatever reasons. Now anytime munmap drops the exclusive lock after handling one VMA, Thread A might find its sizeA range and use it. ThreadB then might remove this new range as soon as it gets its exclusive lock again. Is such code safe? No it is not, and I would call it fragile at best, but people tend to do weird things and the atomic munmap behavior is something they can easily depend on.
Another example would be an atomic address range probing by MAP_FIXED_NOREPLACE. It would simply break for similar reasons. I remember my attempt to make MAP_LOCKED consistent with mlock (if the population fails then return -ENOMEM) and that required to drop the shared mmap_sem and take it in exclusive mode (because we do not have upgrade_read) and Linus was strongly against [1][2] for very similar reasons. If you drop the lock you simply do not know what happened under your feet. [1] http://lkml.kernel.org/r/CA+55aFydkG-BgZzry5DrTzueVh9VvEcVJdLV8iOyUphQk=0vpw@mail.gmail.com [2] http://lkml.kernel.org/r/CA+55aFyajquhGhw59qNWKGK4dBV0TPmDD7-1XqPo7DZWvO_hPg@mail.gmail.com
On Mon, Jul 02, 2018 at 10:19:32AM -0700, Yang Shi wrote: > > > On 7/2/18 5:33 AM, Kirill A. Shutemov wrote: > > On Sat, Jun 30, 2018 at 06:39:44AM +0800, Yang Shi wrote: > > > When running some mmap/munmap scalability tests with large memory (i.e. > > > > 300GB), the below hung task issue may happen occasionally. > > > INFO: task ps:14018 blocked for more than 120 seconds. > > > Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this > > > message. > > > ps D 0 14018 1 0x00000004 > > > ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0 > > > ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040 > > > 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000 > > > Call Trace: > > > [<ffffffff817154d0>] ? __schedule+0x250/0x730 > > > [<ffffffff817159e6>] schedule+0x36/0x80 > > > [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150 > > > [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30 > > > [<ffffffff81717db0>] down_read+0x20/0x40 > > > [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0 > > > [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100 > > > [<ffffffff81241d87>] __vfs_read+0x37/0x150 > > > [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0 > > > [<ffffffff81242266>] vfs_read+0x96/0x130 > > > [<ffffffff812437b5>] SyS_read+0x55/0xc0 > > > [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5 > > > > > > It is because munmap holds mmap_sem from very beginning to all the way > > > down to the end, and doesn't release it in the middle. When unmapping > > > large mapping, it may take long time (take ~18 seconds to unmap 320GB > > > mapping with every single page mapped on an idle machine). > > > > > > Zapping pages is the most time consuming part, according to the > > > suggestion from Michal Hock [1], zapping pages can be done with holding > > > read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write > > > mmap_sem to cleanup vmas. All zapped vmas will have VM_DEAD flag set, > > > the page fault to VM_DEAD vma will trigger SIGSEGV. > > > > > > Define large mapping size thresh as PUD size or 1GB, just zap pages with > > > read mmap_sem for mappings which are >= thresh value. > > > > > > If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just > > > fallback to regular path since unmapping those mappings need acquire > > > write mmap_sem. > > > > > > For the time being, just do this in munmap syscall path. Other > > > vm_munmap() or do_munmap() call sites remain intact for stability > > > reason. > > > > > > The below is some regression and performance data collected on a machine > > > with 32 cores of E5-2680 @ 2.70GHz and 384GB memory. > > > > > > With the patched kernel, write mmap_sem hold time is dropped to us level > > > from second.
> > > > > > [1] https://lwn.net/Articles/753269/ > > > > > > Cc: Michal Hocko <mhocko@kernel.org> > > > Cc: Matthew Wilcox <willy@infradead.org> > > > Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> > > > --- > > > mm/mmap.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- > > > 1 file changed, 134 insertions(+), 2 deletions(-) > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > index 87dcf83..d61e08b 100644 > > > --- a/mm/mmap.c > > > +++ b/mm/mmap.c > > > @@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma, > > > return 1; > > > } > > > +/* Consider PUD size or 1GB mapping as large mapping */ > > > +#ifdef HPAGE_PUD_SIZE > > > +#define LARGE_MAP_THRESH HPAGE_PUD_SIZE > > > +#else > > > +#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024) > > > +#endif > > PUD_SIZE is defined everywhere. > > If THP is defined, otherwise it is: > > #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; }) I'm talking about PUD_SIZE, not HPAGE_PUD_SIZE.
On Mon, Jul 02, 2018 at 02:49:28PM +0200, Michal Hocko wrote: > On Mon 02-07-18 15:33:50, Kirill A. Shutemov wrote: > [...] > > I probably miss the explanation somewhere, but what's wrong with allowing > > other thread to re-populate the VMA? > > We have discussed that earlier and it boils down to how is racy access > to munmap supposed to behave. Right now we have either the original > content or SEGV. If we allow to simply madvise_dontneed before real > unmap we could get a new page as well. There might be (quite broken I > would say) user space code that would simply corrupt data silently that > way. Okay, so we add a lot of complexity to accommodate broken userspace that may or may not exist. Is it right? :)
On 7/2/18 11:09 PM, Michal Hocko wrote: > On Mon 02-07-18 13:48:45, Andrew Morton wrote: >> On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote: >> >>> On Fri 29-06-18 20:15:47, Andrew Morton wrote: >>> [...] >>>> Would one of your earlier designs have addressed all usecases? I >>>> expect the dumb unmap-a-little-bit-at-a-time approach would have? >>> It has been already pointed out that this will not work. >> I said "one of". There were others. > Well, I was aware only about two potential solutions. Either do the > heavy lifting under the shared lock and do the rest with the exlusive > one and this, drop the lock per parts. Maybe I have missed others? There is the other one, which I presented at the LSFMM summit. But it actually turns out that one looks very similar to the one currently under review. Yang > >>> You simply >>> cannot drop the mmap_sem during unmap because another thread could >>> change the address space under your feet. So you need some form of >>> VM_DEAD and handle concurrent and conflicting address space operations. >> Unclear that this is a problem. If a thread does an unmap of a range >> of virtual address space, there's no guarantee that upon return some >> other thread has not already mapped new stuff into that address range. >> So what's changed? > Well, consider the following scenario: > Thread A = calling mmap(NULL, sizeA) > Thread B = calling munmap(addr, sizeB) > > They do not use any external synchronization and rely on the atomic > munmap. Thread B only munmaps range that it knows belongs to it (e.g. > called mmap in the past). It should be clear that ThreadA should not > get an address from the addr, sizeB range, right? In the most simple case > it will not happen. But let's say that the addr, sizeB range has > unmapped holes for what ever reasons. Now anytime munmap drops the > exclusive lock after handling one VMA, Thread A might find its sizeA > range and use it.
ThreadB then might remove this new range as soon as it > gets its exclusive lock again. > > Is such a code safe? No it is not and I would call it fragile at best > but people tend to do weird things and atomic munmap behavior is > something they can easily depend on. > > Another example would be an atomic address range probing by > MAP_FIXED_NOREPLACE. It would simply break for similar reasons. > > I remember my attempt to make MAP_LOCKED consistent with mlock (if the > population fails then return -ENOMEM) and that required to drop the > shared mmap_sem and take it in exclusive mode (because we do not > have upgrade_read) and Linus was strongly against [1][2] for very > similar reasons. If you drop the lock you simply do not know what > happened under your feet. > > [1] http://lkml.kernel.org/r/CA+55aFydkG-BgZzry5DrTzueVh9VvEcVJdLV8iOyUphQk=0vpw@mail.gmail.com > [2] http://lkml.kernel.org/r/CA+55aFyajquhGhw59qNWKGK4dBV0TPmDD7-1XqPo7DZWvO_hPg@mail.gmail.com
On 7/2/18 11:09 PM, Michal Hocko wrote:
> On Mon 02-07-18 13:48:45, Andrew Morton wrote:
>> On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
>>
>>> On Fri 29-06-18 20:15:47, Andrew Morton wrote:
>>> [...]
>>>> Would one of your earlier designs have addressed all usecases? I
>>>> expect the dumb unmap-a-little-bit-at-a-time approach would have?
>>> It has been already pointed out that this will not work.
>> I said "one of". There were others.
> Well, I was aware only about two potential solutions. Either do the
> heavy lifting under the shared lock and do the rest with the exclusive
> one and this, drop the lock per parts. Maybe I have missed others?
>
>>> You simply
>>> cannot drop the mmap_sem during unmap because another thread could
>>> change the address space under your feet. So you need some form of
>>> VM_DEAD and handle concurrent and conflicting address space operations.
>> Unclear that this is a problem. If a thread does an unmap of a range
>> of virtual address space, there's no guarantee that upon return some
>> other thread has not already mapped new stuff into that address range.
>> So what's changed?
> Well, consider the following scenario:
> Thread A = calling mmap(NULL, sizeA)
> Thread B = calling munmap(addr, sizeB)
>
> They do not use any external synchronization and rely on the atomic
> munmap. Thread B only munmaps a range that it knows belongs to it (e.g.
> called mmap in the past). It should be clear that ThreadA should not
> get an address from the addr, sizeB range, right? In the most simple case
> it will not happen. But let's say that the addr, sizeB range has
> unmapped holes for whatever reasons. Now anytime munmap drops the
> exclusive lock after handling one VMA, Thread A might find its sizeA
> range and use it. ThreadB then might remove this new range as soon as it
> gets its exclusive lock again.

I'm a little bit confused here. If ThreadB has already unmapped that
range, then ThreadA uses it. That doesn't sound like a problem, since
ThreadB should just go ahead and handle the next range when it gets its
exclusive lock again, right? I can't think of a reason why ThreadB would
re-visit that range to remove it.

But, if ThreadA uses MAP_FIXED, it might remap some ranges, and ThreadB
would then remove them. That might trigger SIGSEGV or SIGBUS, but this is
not guaranteed not to happen on a vanilla kernel either, if the
application doesn't do any synchronization. It all depends on timing.

> Is such a code safe? No it is not and I would call it fragile at best
> but people tend to do weird things and atomic munmap behavior is
> something they can easily depend on.
>
> Another example would be an atomic address range probing by
> MAP_FIXED_NOREPLACE. It would simply break for similar reasons.

Yes, I agree, it could simply break MAP_FIXED_NOREPLACE.

Yang

>
> I remember my attempt to make MAP_LOCKED consistent with mlock (if the
> population fails then return -ENOMEM) and that required to drop the
> shared mmap_sem and take it in exclusive mode (because we do not
> have upgrade_read) and Linus was strongly against [1][2] for very
> similar reasons. If you drop the lock you simply do not know what
> happened under your feet.
>
> [1] http://lkml.kernel.org/r/CA+55aFydkG-BgZzry5DrTzueVh9VvEcVJdLV8iOyUphQk=0vpw@mail.gmail.com
> [2] http://lkml.kernel.org/r/CA+55aFyajquhGhw59qNWKGK4dBV0TPmDD7-1XqPo7DZWvO_hPg@mail.gmail.com
On Tue 03-07-18 11:22:17, Yang Shi wrote:
>
> On 7/2/18 11:09 PM, Michal Hocko wrote:
> > On Mon 02-07-18 13:48:45, Andrew Morton wrote:
> > > On Mon, 2 Jul 2018 16:05:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > > On Fri 29-06-18 20:15:47, Andrew Morton wrote:
> > > > [...]
> > > > > Would one of your earlier designs have addressed all usecases? I
> > > > > expect the dumb unmap-a-little-bit-at-a-time approach would have?
> > > > It has been already pointed out that this will not work.
> > > I said "one of". There were others.
> > Well, I was aware only about two potential solutions. Either do the
> > heavy lifting under the shared lock and do the rest with the exclusive
> > one and this, drop the lock per parts. Maybe I have missed others?
> >
> > > > You simply
> > > > cannot drop the mmap_sem during unmap because another thread could
> > > > change the address space under your feet. So you need some form of
> > > > VM_DEAD and handle concurrent and conflicting address space operations.
> > > Unclear that this is a problem. If a thread does an unmap of a range
> > > of virtual address space, there's no guarantee that upon return some
> > > other thread has not already mapped new stuff into that address range.
> > > So what's changed?
> > Well, consider the following scenario:
> > Thread A = calling mmap(NULL, sizeA)
> > Thread B = calling munmap(addr, sizeB)
> >
> > They do not use any external synchronization and rely on the atomic
> > munmap. Thread B only munmaps a range that it knows belongs to it (e.g.
> > called mmap in the past). It should be clear that ThreadA should not
> > get an address from the addr, sizeB range, right? In the most simple case
> > it will not happen. But let's say that the addr, sizeB range has
> > unmapped holes for whatever reasons. Now anytime munmap drops the
> > exclusive lock after handling one VMA, Thread A might find its sizeA
> > range and use it. ThreadB then might remove this new range as soon as it
> > gets its exclusive lock again.
>
> I'm a little bit confused here. If ThreadB has already unmapped that
> range, then ThreadA uses it. That doesn't sound like a problem, since
> ThreadB should just go ahead and handle the next range when it gets its
> exclusive lock again, right? I can't think of a reason why ThreadB would
> re-visit that range to remove it.

Not if the new range overlaps with the follow-up range that ThreadB
unmaps. Example:

B: munmap  [XXXXX]      [XXXXXX]      [XXXXXXXXXX]
B: breaks the lock after processing the first vma.
A: mmap          [XXXXXXXXXXXX]
B: munmap retakes the lock and revalidates from the last vm_end because
   the old vma->vm_next might be gone.
B:         [XXX][XXXXX]               [XXXXXXXXXX]

so you munmap part of the range. Sure, you can play some tricks and skip
over vmas that do not start above your last vma->vm_end or something like
that, but I expect there are other cans of worms hidden there.
diff --git a/mm/mmap.c b/mm/mmap.c
index 87dcf83..d61e08b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2763,6 +2763,128 @@ static int munmap_lookup_vma(struct mm_struct *mm, struct vm_area_struct **vma,
 	return 1;
 }
 
+/* Consider PUD size or 1GB mapping as large mapping */
+#ifdef HPAGE_PUD_SIZE
+#define LARGE_MAP_THRESH HPAGE_PUD_SIZE
+#else
+#define LARGE_MAP_THRESH (1 * 1024 * 1024 * 1024)
+#endif
+
+/* Unmap large mapping early with acquiring read mmap_sem */
+static int do_munmap_zap_early(struct mm_struct *mm, unsigned long start,
+			       size_t len, struct list_head *uf)
+{
+	unsigned long end = 0;
+	struct vm_area_struct *vma = NULL, *prev, *tmp;
+	bool success = false;
+	int ret = 0;
+
+	if (!munmap_addr_sanity(start, len))
+		return -EINVAL;
+
+	len = PAGE_ALIGN(len);
+
+	end = start + len;
+
+	/* Just deal with uf in regular path */
+	if (unlikely(uf))
+		goto regular_path;
+
+	if (len >= LARGE_MAP_THRESH) {
+		/*
+		 * need write mmap_sem to split vma and set VM_DEAD flag
+		 * splitting vma up-front to save PITA to clean if it is failed
+		 */
+		down_write(&mm->mmap_sem);
+		ret = munmap_lookup_vma(mm, &vma, &prev, start, end);
+		if (ret != 1) {
+			up_write(&mm->mmap_sem);
+			return ret;
+		}
+		/* This ret value might be returned, so reset it */
+		ret = 0;
+
+		/*
+		 * Unmapping vmas, which has VM_LOCKED|VM_HUGETLB|VM_PFNMAP
+		 * flag set or has uprobes set, need acquire write map_sem,
+		 * so skip them in early zap. Just deal with such mapping in
+		 * regular path.
+		 * Borrow can_madv_dontneed_vma() to check the conditions.
+		 */
+		tmp = vma;
+		while (tmp && tmp->vm_start < end) {
+			if (!can_madv_dontneed_vma(tmp) ||
+			    vma_has_uprobes(tmp, start, end)) {
+				up_write(&mm->mmap_sem);
+				goto regular_path;
+			}
+			tmp = tmp->vm_next;
+		}
+		/*
+		 * set VM_DEAD flag before tear down them.
+		 * page fault on VM_DEAD vma will trigger SIGSEGV.
+		 */
+		tmp = vma;
+		for ( ; tmp && tmp->vm_start < end; tmp = tmp->vm_next)
+			tmp->vm_flags |= VM_DEAD;
+		up_write(&mm->mmap_sem);
+
+		/* zap mappings with read mmap_sem */
+		down_read(&mm->mmap_sem);
+		zap_page_range(vma, start, len);
+		/* indicates early zap is success */
+		success = true;
+		up_read(&mm->mmap_sem);
+	}
+
+regular_path:
+	/* hold write mmap_sem for vma manipulation or regular path */
+	if (down_write_killable(&mm->mmap_sem))
+		return -EINTR;
+	if (success) {
+		/* vmas have been zapped, here clean up pgtable and vmas */
+		struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
+		struct mmu_gather tlb;
+		tlb_gather_mmu(&tlb, mm, start, end);
+		free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
+			      next ? next->vm_start : USER_PGTABLES_CEILING);
+		tlb_finish_mmu(&tlb, start, end);
+
+		detach_vmas_to_be_unmapped(mm, vma, prev, end);
+		arch_unmap(mm, vma, start, end);
+		remove_vma_list(mm, vma);
+	} else {
+		/* vma is VM_LOCKED|VM_HUGETLB|VM_PFNMAP or has uprobe */
+		if (vma) {
+			if (unlikely(uf)) {
+				int ret = userfaultfd_unmap_prep(vma, start,
+								 end, uf);
+				if (ret)
+					goto out;
+			}
+			if (mm->locked_vm) {
+				tmp = vma;
+				while (tmp && tmp->vm_start < end) {
+					if (tmp->vm_flags & VM_LOCKED) {
+						mm->locked_vm -= vma_pages(tmp);
+						munlock_vma_pages_all(tmp);
+					}
+					tmp = tmp->vm_next;
+				}
+			}
+			detach_vmas_to_be_unmapped(mm, vma, prev, end);
+			unmap_region(mm, vma, prev, start, end);
+			remove_vma_list(mm, vma);
+		} else
+			/* When mapping size < LARGE_MAP_THRESH */
+			ret = do_munmap(mm, start, len, uf);
+	}
+
+out:
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+
 /* Munmap is split into 2 main parts -- this part which finds
  * what needs doing, and the areas themselves, which do the
  * work.  This now handles partial unmappings.
@@ -2829,6 +2951,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	return 0;
 }
 
+static int vm_munmap_zap_early(unsigned long start, size_t len)
+{
+	int ret;
+	struct mm_struct *mm = current->mm;
+	LIST_HEAD(uf);
+
+	ret = do_munmap_zap_early(mm, start, len, &uf);
+	userfaultfd_unmap_complete(mm, &uf);
+	return ret;
+}
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
@@ -2848,10 +2981,9 @@ int vm_munmap(unsigned long start, size_t len)
 
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
 	profile_munmap(addr);
-	return vm_munmap(addr, len);
+	return vm_munmap_zap_early(addr, len);
 }
 
-
 /*
  * Emulation of deprecated remap_file_pages() syscall.
  */