Message ID | 20230920021811.3095089-2-riel@surriel.com (mailing list archive)
---|---
State | New
Series | hugetlbfs: close race between MADV_DONTNEED and page fault
On Tue, Sep 19, 2023 at 10:16:09PM -0400, riel@surriel.com wrote:
> From: Rik van Riel <riel@surriel.com>
>
> Extend the locking scheme used to protect shared hugetlb mappings
> from truncate vs page fault races, in order to protect private
> hugetlb mappings (with resv_map) against MADV_DONTNEED.
>
> Add a read-write semaphore to the resv_map data structure, and
> use that from the hugetlb_vma_(un)lock_* functions, in preparation
> for closing the race between MADV_DONTNEED and page faults.

This feels an awful lot like the invalidate_lock in struct address_space
which was recently added by Jan Kara.
On Wed, 2023-09-20 at 04:57 +0100, Matthew Wilcox wrote:
> On Tue, Sep 19, 2023 at 10:16:09PM -0400, riel@surriel.com wrote:
> > From: Rik van Riel <riel@surriel.com>
> >
> > Extend the locking scheme used to protect shared hugetlb mappings
> > from truncate vs page fault races, in order to protect private
> > hugetlb mappings (with resv_map) against MADV_DONTNEED.
> >
> > Add a read-write semaphore to the resv_map data structure, and
> > use that from the hugetlb_vma_(un)lock_* functions, in preparation
> > for closing the race between MADV_DONTNEED and page faults.
>
> This feels an awful lot like the invalidate_lock in struct address_space
> which was recently added by Jan Kara.

Indeed it does.

It might be even nicer if we could replace the hugetlb_vma_lock
special logic with the invalidate_lock for hugetlbfs.

Mike, can you think of any reason why the hugetlb_vma_lock logic
should not be replaced with the invalidate_lock?

If not, I'd be happy to implement that.
On 09/20/23 00:09, Rik van Riel wrote:
> On Wed, 2023-09-20 at 04:57 +0100, Matthew Wilcox wrote:
> > On Tue, Sep 19, 2023 at 10:16:09PM -0400, riel@surriel.com wrote:
> > > From: Rik van Riel <riel@surriel.com>
> > >
> > > Extend the locking scheme used to protect shared hugetlb mappings
> > > from truncate vs page fault races, in order to protect private
> > > hugetlb mappings (with resv_map) against MADV_DONTNEED.
> > >
> > > Add a read-write semaphore to the resv_map data structure, and
> > > use that from the hugetlb_vma_(un)lock_* functions, in preparation
> > > for closing the race between MADV_DONTNEED and page faults.
> >
> > This feels an awful lot like the invalidate_lock in struct address_space
> > which was recently added by Jan Kara.
>
> Indeed it does.
>
> It might be even nicer if we could replace the hugetlb_vma_lock
> special logic with the invalidate_lock for hugetlbfs.
>
> Mike, can you think of any reason why the hugetlb_vma_lock logic
> should not be replaced with the invalidate_lock?
>
> If not, I'd be happy to implement that.

Sorry Rik, I have some other things that need immediate attention and
have not had a chance to take a close look here. I'll take a closer
look later (my) today or tomorrow.
On 09/19/23 22:16, riel@surriel.com wrote:
> From: Rik van Riel <riel@surriel.com>
>
> Extend the locking scheme used to protect shared hugetlb mappings
> from truncate vs page fault races, in order to protect private
> hugetlb mappings (with resv_map) against MADV_DONTNEED.
>
> Add a read-write semaphore to the resv_map data structure, and
> use that from the hugetlb_vma_(un)lock_* functions, in preparation
> for closing the race between MADV_DONTNEED and page faults.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  include/linux/hugetlb.h |  6 ++++++
>  mm/hugetlb.c            | 36 ++++++++++++++++++++++++++++++++----
>  2 files changed, 38 insertions(+), 4 deletions(-)

This looks straight forward.

However, I ran just this patch through the libhugetlbfs test suite and it
hung on misaligned_offset (2M: 32).
https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/misaligned_offset.c

Added lock/semaphore debugging to the kernel and got:

[   38.094690] =========================
[   38.095517] WARNING: held lock freed!
[   38.096350] 6.6.0-rc2-next-20230921-dirty #4 Not tainted
[   38.097556] -------------------------
[   38.098439] mlock/1002 is freeing memory ffff8881eff8dc00-ffff8881eff8ddff, with a lock still held there!
[   38.100550] ffff8881eff8dce8 (&resv_map->rw_sema){++++}-{3:3}, at: __unmap_hugepage_range_final+0x29/0x120
[   38.103564] 2 locks held by mlock/1002:
[   38.104552]  #0: ffff8881effa42a0 (&mm->mmap_lock){++++}-{3:3}, at: do_vmi_align_munmap+0x5c6/0x650
[   38.106611]  #1: ffff8881eff8dce8 (&resv_map->rw_sema){++++}-{3:3}, at: __unmap_hugepage_range_final+0x29/0x120
[   38.108827]
[   38.108827] stack backtrace:
[   38.109929] CPU: 0 PID: 1002 Comm: mlock Not tainted 6.6.0-rc2-next-20230921-dirty #4
[   38.111812] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[   38.113784] Call Trace:
[   38.114456]  <TASK>
[   38.115066]  dump_stack_lvl+0x57/0x90
[   38.116001]  debug_check_no_locks_freed+0x137/0x170
[   38.117193]  ? remove_vma+0x28/0x70
[   38.118088]  __kmem_cache_free+0x8f/0x2b0
[   38.119080]  remove_vma+0x28/0x70
[   38.119960]  do_vmi_align_munmap+0x3b1/0x650
[   38.121051]  do_vmi_munmap+0xc9/0x1a0
[   38.122006]  __vm_munmap+0xa4/0x190
[   38.122931]  __ia32_sys_munmap+0x15/0x20
[   38.123926]  __do_fast_syscall_32+0x68/0x100
[   38.125031]  do_fast_syscall_32+0x2f/0x70
[   38.126060]  entry_SYSENTER_compat_after_hwframe+0x7b/0x8d
[   38.127366] RIP: 0023:0xf7f05579
[   38.128198] Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
[   38.132534] RSP: 002b:00000000fffa877c EFLAGS: 00000286 ORIG_RAX: 000000000000005b
[   38.135703] RAX: ffffffffffffffda RBX: 00000000f7a00000 RCX: 0000000000200000
[   38.137323] RDX: 00000000f7a00000 RSI: 0000000000200000 RDI: 0000000000000003
[   38.138965] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
[   38.140574] R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000000
[   38.142191] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   38.143865]  </TASK>

Something is not quite right. If you do not get to it first, I will take a
look as time permits.
On 09/21/23 15:42, Mike Kravetz wrote:
> On 09/19/23 22:16, riel@surriel.com wrote:
> > From: Rik van Riel <riel@surriel.com>
> >
> > Extend the locking scheme used to protect shared hugetlb mappings
> > from truncate vs page fault races, in order to protect private
> > hugetlb mappings (with resv_map) against MADV_DONTNEED.
> >
> > Add a read-write semaphore to the resv_map data structure, and
> > use that from the hugetlb_vma_(un)lock_* functions, in preparation
> > for closing the race between MADV_DONTNEED and page faults.
> >
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > ---
> >  include/linux/hugetlb.h |  6 ++++++
> >  mm/hugetlb.c            | 36 ++++++++++++++++++++++++++++++++----
> >  2 files changed, 38 insertions(+), 4 deletions(-)
>
> This looks straight forward.
>
> However, I ran just this patch through the libhugetlbfs test suite and it
> hung on misaligned_offset (2M: 32).
> https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/misaligned_offset.c
>
> Added lock/semaphore debugging to the kernel and got:
>
> [   38.094690] =========================
> [   38.095517] WARNING: held lock freed!
> [   38.096350] 6.6.0-rc2-next-20230921-dirty #4 Not tainted
> [   38.097556] -------------------------
> [   38.098439] mlock/1002 is freeing memory ffff8881eff8dc00-ffff8881eff8ddff, with a lock still held there!
> [   38.100550] ffff8881eff8dce8 (&resv_map->rw_sema){++++}-{3:3}, at: __unmap_hugepage_range_final+0x29/0x120
> [   38.103564] 2 locks held by mlock/1002:
> [   38.104552]  #0: ffff8881effa42a0 (&mm->mmap_lock){++++}-{3:3}, at: do_vmi_align_munmap+0x5c6/0x650
> [   38.106611]  #1: ffff8881eff8dce8 (&resv_map->rw_sema){++++}-{3:3}, at: __unmap_hugepage_range_final+0x29/0x120
> [   38.108827]
> [   38.108827] stack backtrace:
> [   38.109929] CPU: 0 PID: 1002 Comm: mlock Not tainted 6.6.0-rc2-next-20230921-dirty #4
> [   38.111812] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> [   38.113784] Call Trace:
> [   38.114456]  <TASK>
> [   38.115066]  dump_stack_lvl+0x57/0x90
> [   38.116001]  debug_check_no_locks_freed+0x137/0x170
> [   38.117193]  ? remove_vma+0x28/0x70
> [   38.118088]  __kmem_cache_free+0x8f/0x2b0
> [   38.119080]  remove_vma+0x28/0x70
> [   38.119960]  do_vmi_align_munmap+0x3b1/0x650
> [   38.121051]  do_vmi_munmap+0xc9/0x1a0
> [   38.122006]  __vm_munmap+0xa4/0x190
> [   38.122931]  __ia32_sys_munmap+0x15/0x20
> [   38.123926]  __do_fast_syscall_32+0x68/0x100
> [   38.125031]  do_fast_syscall_32+0x2f/0x70
> [   38.126060]  entry_SYSENTER_compat_after_hwframe+0x7b/0x8d
> [   38.127366] RIP: 0023:0xf7f05579
> [   38.128198] Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
> [   38.132534] RSP: 002b:00000000fffa877c EFLAGS: 00000286 ORIG_RAX: 000000000000005b
> [   38.135703] RAX: ffffffffffffffda RBX: 00000000f7a00000 RCX: 0000000000200000
> [   38.137323] RDX: 00000000f7a00000 RSI: 0000000000200000 RDI: 0000000000000003
> [   38.138965] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
> [   38.140574] R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000000
> [   38.142191] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [   38.143865]  </TASK>
>
> Something is not quite right. If you do not get to it first, I will take a
> look as time permits.

Just for grins I threw on patch 2 (with lock debugging) and ran the test
suite. It gets past misaligned_offset, but is spewing locking warnings
too fast to read.

Something is certainly missing.
On Thu, 2023-09-21 at 15:42 -0700, Mike Kravetz wrote:
> On 09/19/23 22:16, riel@surriel.com wrote:
> > From: Rik van Riel <riel@surriel.com>
> >
> > Extend the locking scheme used to protect shared hugetlb mappings
> > from truncate vs page fault races, in order to protect private
> > hugetlb mappings (with resv_map) against MADV_DONTNEED.
> >
> > Add a read-write semaphore to the resv_map data structure, and
> > use that from the hugetlb_vma_(un)lock_* functions, in preparation
> > for closing the race between MADV_DONTNEED and page faults.
> >
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > ---
> >  include/linux/hugetlb.h |  6 ++++++
> >  mm/hugetlb.c            | 36 ++++++++++++++++++++++++++++++++----
> >  2 files changed, 38 insertions(+), 4 deletions(-)
>
> This looks straight forward.
>
> However, I ran just this patch through the libhugetlbfs test suite and it
> hung on misaligned_offset (2M: 32).
> https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/misaligned_offset.c

Ah, so that's why I couldn't find hugetlbfs tests in the kernel
selftests directory. They're in libhugetlbfs.

I'll play around with those tests tomorrow. Let me see what's going on.
On Thu, 2023-09-21 at 15:42 -0700, Mike Kravetz wrote:
> On 09/19/23 22:16, riel@surriel.com wrote:
> > From: Rik van Riel <riel@surriel.com>
> >
> > Extend the locking scheme used to protect shared hugetlb mappings
> > from truncate vs page fault races, in order to protect private
> > hugetlb mappings (with resv_map) against MADV_DONTNEED.
> >
> > Add a read-write semaphore to the resv_map data structure, and
> > use that from the hugetlb_vma_(un)lock_* functions, in preparation
> > for closing the race between MADV_DONTNEED and page faults.
> >
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > ---
> >  include/linux/hugetlb.h |  6 ++++++
> >  mm/hugetlb.c            | 36 ++++++++++++++++++++++++++++++++----
> >  2 files changed, 38 insertions(+), 4 deletions(-)
>
> This looks straight forward.
>
> However, I ran just this patch through the libhugetlbfs test suite and it
> hung on misaligned_offset (2M: 32).
> https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/misaligned_offset.c

Speaking of "looks straightforward", how do I compile the
libhugetlbfs code?

The __morecore variable, which is pointed at either the
THP or hugetlbfs morecore function, does not seem to be
defined anywhere in the sources.

Do I need to run some magic script (didn't find it) to
get a special header file set up before I can build
libhugetlbfs?

$ make
 CC32 obj32/morecore.o
morecore.c: In function ‘__lh_hugetlbfs_setup_morecore’:
morecore.c:368:17: error: ‘__morecore’ undeclared (first use in this function); did you mean ‘thp_morecore’?
  368 |      __morecore = &thp_morecore;
      |      ^~~~~~~~~~
      |      thp_morecore
morecore.c:368:17: note: each undeclared identifier is reported only once for each function it appears in
make: *** [Makefile:292: obj32/morecore.o] Error 1

$ grep __morecore *.[ch]
morecore.c:		__morecore = &thp_morecore;
morecore.c:		__morecore = &hugetlbfs_morecore;
On 09/22/23 10:37, Rik van Riel wrote:
> On Thu, 2023-09-21 at 15:42 -0700, Mike Kravetz wrote:
> > On 09/19/23 22:16, riel@surriel.com wrote:
> > > From: Rik van Riel <riel@surriel.com>
> > >
> > > Extend the locking scheme used to protect shared hugetlb mappings
> > > from truncate vs page fault races, in order to protect private
> > > hugetlb mappings (with resv_map) against MADV_DONTNEED.
> > >
> > > Add a read-write semaphore to the resv_map data structure, and
> > > use that from the hugetlb_vma_(un)lock_* functions, in preparation
> > > for closing the race between MADV_DONTNEED and page faults.
> > >
> > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > > ---
> > >  include/linux/hugetlb.h |  6 ++++++
> > >  mm/hugetlb.c            | 36 ++++++++++++++++++++++++++++++++----
> > >  2 files changed, 38 insertions(+), 4 deletions(-)
> >
> > This looks straight forward.
> >
> > However, I ran just this patch through the libhugetlbfs test suite and it
> > hung on misaligned_offset (2M: 32).
> > https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/misaligned_offset.c
>
> Speaking of "looks straightforward", how do I compile the
> libhugetlbfs code?
>
> The __morecore variable, which is pointed at either the
> THP or hugetlbfs morecore function, does not seem to be
> defined anywhere in the sources.
>
> Do I need to run some magic script (didn't find it) to
> get a special header file set up before I can build
> libhugetlbfs?

libhugetlbfs is a mess! Distros have dropped it. However, I still find
the test cases useful. I have a special VM with an old glibc just for
running the tests. Sorry, can't give instructions for using the tests
on a recent glibc.

But, back to this patch ...

With the hints from the locking debug code, it came to me on my walk
this morning. We need to also have __hugetlb_vma_unlock_write_free()
work for private vmas as called from __unmap_hugepage_range_final.
This additional change (or something like it) is required in this patch.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f906c5fa4d09..8f3d5895fffc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -372,6 +372,11 @@ static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma)
 		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
 
 		__hugetlb_vma_unlock_write_put(vma_lock);
+	} else if (__vma_private_lock(vma)) {
+		struct resv_map *resv_map = vma_resv_map(vma);
+
+		/* no free for anon vmas, but still need to unlock */
+		up_write(&resv_map->rw_sema);
 	}
 }
On Fri, 2023-09-22 at 09:44 -0700, Mike Kravetz wrote:
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f906c5fa4d09..8f3d5895fffc 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -372,6 +372,11 @@ static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma)
>  		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
>
>  		__hugetlb_vma_unlock_write_put(vma_lock);
> +	} else if (__vma_private_lock(vma)) {
> +		struct resv_map *resv_map = vma_resv_map(vma);
> +
> +		/* no free for anon vmas, but still need to unlock */
> +		up_write(&resv_map->rw_sema);
>  	}
>  }

Nice catch. I'll add that.

I was still trying to reproduce the bug here. The libhugetlbfs code
compiles with the offending bits commented out, but the
misaligned_offset test wasn't causing trouble on my test VM here.

Given the potential negative impact of moving from a per-VMA lock to a
per-backing-address_space lock, I'll keep the 3 patches separate, and
in the order they are in now.

Let me go spin and test v2.
On Fri, 2023-09-22 at 09:44 -0700, Mike Kravetz wrote:
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f906c5fa4d09..8f3d5895fffc 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -372,6 +372,11 @@ static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma)
>  		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
>
>  		__hugetlb_vma_unlock_write_put(vma_lock);
> +	} else if (__vma_private_lock(vma)) {
> +		struct resv_map *resv_map = vma_resv_map(vma);
> +
> +		/* no free for anon vmas, but still need to unlock */
> +		up_write(&resv_map->rw_sema);
>  	}
>  }

That did the trick. The libhugetlbfs tests pass now, with lockdep and
KASAN enabled.

Breno's MADV_DONTNEED test case for hugetlbfs still passes, too.
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 5b2626063f4f..694928fa06a3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -60,6 +60,7 @@ struct resv_map {
 	long adds_in_progress;
 	struct list_head region_cache;
 	long region_cache_count;
+	struct rw_semaphore rw_sema;
 #ifdef CONFIG_CGROUP_HUGETLB
 	/*
 	 * On private mappings, the counter to uncharge reservations is stored
@@ -1231,6 +1232,11 @@ static inline bool __vma_shareable_lock(struct vm_area_struct *vma)
 	return (vma->vm_flags & VM_MAYSHARE) && vma->vm_private_data;
 }
 
+static inline bool __vma_private_lock(struct vm_area_struct *vma)
+{
+	return (!(vma->vm_flags & VM_MAYSHARE)) && vma->vm_private_data;
+}
+
 /*
  * Safe version of huge_pte_offset() to check the locks.  See comments
  * above huge_pte_offset().
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ba6d39b71cb1..b99d215d2939 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -97,6 +97,7 @@ static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
 static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
 static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end);
+static struct resv_map *vma_resv_map(struct vm_area_struct *vma);
 
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
@@ -267,6 +268,10 @@ void hugetlb_vma_lock_read(struct vm_area_struct *vma)
 		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
 
 		down_read(&vma_lock->rw_sema);
+	} else if (__vma_private_lock(vma)) {
+		struct resv_map *resv_map = vma_resv_map(vma);
+
+		down_read(&resv_map->rw_sema);
 	}
 }
 
@@ -276,6 +281,10 @@ void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
 		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
 
 		up_read(&vma_lock->rw_sema);
+	} else if (__vma_private_lock(vma)) {
+		struct resv_map *resv_map = vma_resv_map(vma);
+
+		up_read(&resv_map->rw_sema);
 	}
 }
 
@@ -285,6 +294,10 @@ void hugetlb_vma_lock_write(struct vm_area_struct *vma)
 		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
 
 		down_write(&vma_lock->rw_sema);
+	} else if (__vma_private_lock(vma)) {
+		struct resv_map *resv_map = vma_resv_map(vma);
+
+		down_write(&resv_map->rw_sema);
 	}
 }
 
@@ -294,17 +307,27 @@ void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
 		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
 
 		up_write(&vma_lock->rw_sema);
+	} else if (__vma_private_lock(vma)) {
+		struct resv_map *resv_map = vma_resv_map(vma);
+
+		up_write(&resv_map->rw_sema);
 	}
 }
 
 int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
 {
-	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
-
-	if (!__vma_shareable_lock(vma))
-		return 1;
+	if (__vma_shareable_lock(vma)) {
+		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+		return down_write_trylock(&vma_lock->rw_sema);
+	} else if (__vma_private_lock(vma)) {
+		struct resv_map *resv_map = vma_resv_map(vma);
+
+		return down_write_trylock(&resv_map->rw_sema);
+	}
 
-	return down_write_trylock(&vma_lock->rw_sema);
+	return 1;
 }
 
 void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
@@ -313,6 +336,10 @@ void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
 		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
 
 		lockdep_assert_held(&vma_lock->rw_sema);
+	} else if (__vma_private_lock(vma)) {
+		struct resv_map *resv_map = vma_resv_map(vma);
+
+		lockdep_assert_held(&resv_map->rw_sema);
 	}
 }
 
@@ -1068,6 +1095,7 @@ struct resv_map *resv_map_alloc(void)
 	kref_init(&resv_map->refs);
 	spin_lock_init(&resv_map->lock);
 	INIT_LIST_HEAD(&resv_map->regions);
+	init_rwsem(&resv_map->rw_sema);
 	resv_map->adds_in_progress = 0;
 
 	/*