Message ID | 20241016043600.35139-3-lizhe.67@bytedance.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | rwsem: introduce upgrade_read interface | expand |
On Wed, Oct 16, 2024 at 12:36:00PM +0800, lizhe.67@bytedance.com wrote: > From: Li Zhe <lizhe.67@bytedance.com> > > In function collapse_huge_page(), we drop mmap read lock and get > mmap write lock to prevent most accesses to pagetables. There is > a small time window to allow other tasks to acquire the mmap lock. > With the use of upgrade_read(), we don't need to check vma and pmd > again in most cases. This is clearly a performance optimisation. So you must have some numebrs that justify this, please include them.
On Wed, 16 Oct 2024 12:53:15 +0100, willy@infradead.org wrote: >On Wed, Oct 16, 2024 at 12:36:00PM +0800, lizhe.67@bytedance.com wrote: >> From: Li Zhe <lizhe.67@bytedance.com> >> >> In function collapse_huge_page(), we drop mmap read lock and get >> mmap write lock to prevent most accesses to pagetables. There is >> a small time window to allow other tasks to acquire the mmap lock. >> With the use of upgrade_read(), we don't need to check vma and pmd >> again in most cases. > >This is clearly a performance optimisation. So you must have some >numebrs that justify this, please include them. Yes, I will add the relevant data to v2 patch. Thanks!
On Thu, Oct 17, 2024 at 02:18:41PM +0800, lizhe.67@bytedance.com wrote: > On Wed, 16 Oct 2024 12:53:15 +0100, willy@infradead.org wrote: > > >On Wed, Oct 16, 2024 at 12:36:00PM +0800, lizhe.67@bytedance.com wrote: > >> From: Li Zhe <lizhe.67@bytedance.com> > >> > >> In function collapse_huge_page(), we drop mmap read lock and get > >> mmap write lock to prevent most accesses to pagetables. There is > >> a small time window to allow other tasks to acquire the mmap lock. > >> With the use of upgrade_read(), we don't need to check vma and pmd > >> again in most cases. > > > >This is clearly a performance optimisation. So you must have some > >numebrs that justify this, please include them. > > Yes, I will add the relevant data to v2 patch. How about telling us all now so we know whether to continue discussing this?
On Thu, 17 Oct 2024 14:20:12 +0100, willy@infradead.org wrote: > On Thu, Oct 17, 2024 at 02:18:41PM +0800, lizhe.67@bytedance.com wrote: > > On Wed, 16 Oct 2024 12:53:15 +0100, willy@infradead.org wrote: > > > > >On Wed, Oct 16, 2024 at 12:36:00PM +0800, lizhe.67@bytedance.com wrote: > > >> From: Li Zhe <lizhe.67@bytedance.com> > > >> > > >> In function collapse_huge_page(), we drop mmap read lock and get > > >> mmap write lock to prevent most accesses to pagetables. There is > > >> a small time window to allow other tasks to acquire the mmap lock. > > >> With the use of upgrade_read(), we don't need to check vma and pmd > > >> again in most cases. > > > > > >This is clearly a performance optimisation. So you must have some > > >numebrs that justify this, please include them. > > > > Yes, I will add the relevant data to v2 patch. > > How about telling us all now so we know whether to continue discussing > this? In my test environment, function collapse_huge_page() only achieved a 0.25% performance improvement. I use ftrace to get the execution time of collapse_huge_page(). The test code and test command are as follows. (1) Test result: average execution time of collapse_huge_page() before this patch: 1611.06283 us after this patch: 1597.01474 us (2) Test code: #define MMAP_SIZE (2ul*1024*1024) #define ALIGN(x, mask) (((x) + ((mask)-1)) & ~((mask)-1)) int main(void) { int num = 100; size_t page_sz = getpagesize(); while (num--) { size_t index; unsigned char *p_map; unsigned char *p_map_real; p_map = (unsigned char *)mmap(0, 2 * MMAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0); if (p_map == MAP_FAILED) { printf("mmap fail\n"); return -1; } else { p_map_real = (char *)ALIGN((unsigned long)p_map, MMAP_SIZE); printf("mmap get %p, align to %p\n", p_map, p_map_real); } for(index = 0; index < MMAP_SIZE; index += page_sz) p_map_real[index] = 6; int ret = madvise(p_map_real, MMAP_SIZE, 25); printf("ret is %d\n", ret); munmap(p_map, 2 * MMAP_SIZE); } return 0; } (3) Test command: echo never > /sys/kernel/mm/transparent_hugepage/enabled gcc test.c -o test trace-cmd record -p function_graph -g collapse_huge_page --max-graph-depth 1 ./test The optimization of the function collapse_huge_page() seems insignificant. I am not sure whether it will have a more obvious optimization effect in other scenarios.
Hello, kernel test robot noticed "WARNING:at_include/linux/rwsem.h:#collapse_huge_page" on: commit: 6604438065fc95d188ffcde12f687dfc319ef2cc ("[RFC 2/2] khugepaged: use upgrade_read() to optimize collapse_huge_page") url: https://github.com/intel-lab-lkp/linux/commits/lizhe-67-bytedance-com/rwsem-introduce-upgrade_read-interface/20241016-123810 base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 823a566221a5639f6c69424897218e5d6431a970 patch link: https://lore.kernel.org/all/20241016043600.35139-3-lizhe.67@bytedance.com/ patch subject: [RFC 2/2] khugepaged: use upgrade_read() to optimize collapse_huge_page in testcase: boot config: x86_64-rhel-8.3-kselftests compiler: gcc-12 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G (please refer to attached dmesg/kmsg for entire log/backtrace) +------------------------------------------------------+------------+------------+ | | 6a6ad5e040 | 6604438065 | +------------------------------------------------------+------------+------------+ | WARNING:at_include/linux/rwsem.h:#collapse_huge_page | 0 | 18 | | RIP:collapse_huge_page | 0 | 18 | +------------------------------------------------------+------------+------------+ If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <oliver.sang@intel.com> | Closes: https://lore.kernel.org/oe-lkp/202410231434.e98a7287-oliver.sang@intel.com [ 19.879799][ T39] ------------[ cut here ]------------ [ 19.880779][ T39] WARNING: CPU: 0 PID: 39 at include/linux/rwsem.h:203 collapse_huge_page (include/linux/rwsem.h:203 include/linux/mmap_lock.h:70 include/linux/mm.h:735 include/linux/mm.h:754 mm/khugepaged.c:1165) [ 19.881951][ T39] Modules linked in: sch_fq_codel [ 19.882624][ T39] CPU: 0 UID: 0 PID: 39 Comm: khugepaged Not tainted 6.12.0-rc2-00004-g6604438065fc #1 [ 19.883821][ T39] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 19.885086][ T39] RIP: 0010:collapse_huge_page (include/linux/rwsem.h:203 include/linux/mmap_lock.h:70 include/linux/mm.h:735 include/linux/mm.h:754 mm/khugepaged.c:1165) [ 19.885835][ T39] Code: 0f 85 7f f9 ff ff 48 c7 c2 00 e0 59 84 be 6e 03 00 00 48 c7 c7 60 e0 59 84 c6 05 fd 58 49 04 01 e8 88 4e 7e ff e9 5b f9 ff ff <0f> 0b 48 b8 00 00 00 00 00 fc ff df 4c 89 c2 48 c1 ea 03 80 3c 02 All code ======== 0: 0f 85 7f f9 ff ff jne 0xfffffffffffff985 6: 48 c7 c2 00 e0 59 84 mov $0xffffffff8459e000,%rdx d: be 6e 03 00 00 mov $0x36e,%esi 12: 48 c7 c7 60 e0 59 84 mov $0xffffffff8459e060,%rdi 19: c6 05 fd 58 49 04 01 movb $0x1,0x44958fd(%rip) # 0x449591d 20: e8 88 4e 7e ff callq 0xffffffffff7e4ead 25: e9 5b f9 ff ff jmpq 0xfffffffffffff985 2a:* 0f 0b ud2 <-- trapping instruction 2c: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax 33: fc ff df 36: 4c 89 c2 mov %r8,%rdx 39: 48 c1 ea 03 shr $0x3,%rdx 3d: 80 .byte 0x80 3e: 3c 02 cmp $0x2,%al Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 2: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax 9: fc ff df c: 4c 89 c2 mov %r8,%rdx f: 48 c1 ea 03 shr $0x3,%rdx 13: 80 .byte 0x80 14: 3c 02 cmp $0x2,%al [ 19.888059][ T39] RSP: 0000:ffffc90000297998 EFLAGS: 00010246 [ 19.888832][ T39] RAX: 0000000000000000 RBX: 1ffff92000052f39 RCX: 0000000000000000 [ 19.889839][ T39] RDX: 0000000000000000 RSI: ffff88813394afd8 RDI: ffff88810625c320 [ 19.890860][ T39] RBP: ffff8882e908a6c8 R08: ffff8882e908a6d8 R09: ffffed10267295ee [ 19.891844][ T39] R10: ffff88813394af77 R11: ffff8882ea258440 R12: ffff88813394ae40 [ 19.892865][ T39] R13: ffffffff859ef180 R14: ffffc90000297ab8 R15: ffff88813394af68 [ 19.893868][ T39] FS: 0000000000000000(0000) GS:ffff8883aee00000(0000) knlGS:0000000000000000 [ 19.894985][ T39] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 19.895775][ T39] CR2: 0000000028a5b608 CR3: 0000000133892000 CR4: 00000000000406f0 [ 19.896782][ T39] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 19.897783][ T39] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 19.898793][ T39] Call Trace: [ 19.902237][ T39] <TASK> [ 19.902820][ T39] ? collapse_huge_page (include/linux/rwsem.h:203 include/linux/mmap_lock.h:70 include/linux/mm.h:735 include/linux/mm.h:754 mm/khugepaged.c:1165) [ 19.903524][ T39] ? __warn (kernel/panic.c:748) [ 19.904099][ T39] ? collapse_huge_page (include/linux/rwsem.h:203 include/linux/mmap_lock.h:70 include/linux/mm.h:735 include/linux/mm.h:754 mm/khugepaged.c:1165) [ 19.904863][ T39] ? report_bug (lib/bug.c:180 lib/bug.c:219) [ 19.905477][ T39] ? handle_bug (arch/x86/kernel/traps.c:285) [ 19.906057][ T39] ? exc_invalid_op (arch/x86/kernel/traps.c:309 (discriminator 1)) [ 19.906674][ T39] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621) [ 19.907345][ T39] ? collapse_huge_page (include/linux/rwsem.h:203 include/linux/mmap_lock.h:70 include/linux/mm.h:735 include/linux/mm.h:754 mm/khugepaged.c:1165) [ 19.908036][ T39] ? collapse_huge_page (include/linux/rwsem.h:203 include/linux/mmap_lock.h:70 include/linux/mm.h:735 include/linux/mm.h:754 mm/khugepaged.c:1165) [ 19.908739][ T39] ? __pte_offset_map_lock (include/linux/pgtable.h:324 include/linux/pgtable.h:594 mm/pgtable-generic.c:376) [ 19.909447][ T39] ? __pfx_collapse_huge_page (mm/khugepaged.c:1095) [ 19.910150][ T39] ? __pfx_lock_acquire (kernel/locking/lockdep.c:5793) [ 19.910817][ T39] ? __lock_acquire (kernel/locking/lockdep.c:5202) [ 19.911460][ T39] ? do_raw_spin_lock (arch/x86/include/asm/atomic.h:107 include/linux/atomic/atomic-arch-fallback.h:2170 include/linux/atomic/atomic-instrumented.h:1302 include/asm-generic/qspinlock.h:111 kernel/locking/spinlock_debug.c:116) [ 19.912111][ T39] ? find_held_lock (kernel/locking/lockdep.c:5315) [ 19.912753][ T39] ? find_held_lock (kernel/locking/lockdep.c:5315) [ 19.913389][ T39] ? hpage_collapse_scan_pmd (include/linux/rcupdate.h:347 include/linux/rcupdate.h:880 include/linux/pgtable.h:115 mm/khugepaged.c:1422) [ 19.914118][ T39] ? __lock_release+0x10b/0x440 [ 19.914823][ T39] ? hpage_collapse_scan_pmd (include/linux/rcupdate.h:347 include/linux/rcupdate.h:880 include/linux/pgtable.h:115 mm/khugepaged.c:1422) [ 19.915528][ T39] ? hpage_collapse_scan_pmd (include/linux/rcupdate.h:347 include/linux/rcupdate.h:880 include/linux/pgtable.h:115 mm/khugepaged.c:1422) [ 19.916202][ T39] hpage_collapse_scan_pmd (mm/khugepaged.c:1427) [ 19.916915][ T39] ? __pfx_hpage_collapse_scan_pmd (mm/khugepaged.c:1261) [ 19.917678][ T39] ? __thp_vma_allowable_orders (arch/x86/include/asm/bitops.h:206 (discriminator 1) arch/x86/include/asm/bitops.h:238 (discriminator 1) include/asm-generic/bitops/instrumented-non-atomic.h:142 (discriminator 1) mm/huge_memory.c:118 (discriminator 1)) [ 19.918441][ T39] ? __pfx___might_resched (kernel/sched/core.c:8586) [ 19.919187][ T39] khugepaged_scan_mm_slot+0x8bd/0xd90 [ 19.920082][ T39] ? __pfx_khugepaged_scan_mm_slot+0x10/0x10 [ 19.921041][ T39] ? __pfx_lock_acquire (kernel/locking/lockdep.c:5793) [ 19.921756][ T39] ? __pfx_do_raw_spin_lock (kernel/locking/spinlock_debug.c:114) [ 19.922464][ T39] ? __pfx___might_resched (kernel/sched/core.c:8586) [ 19.923200][ T39] khugepaged (include/linux/spinlock.h:391 mm/khugepaged.c:2521 mm/khugepaged.c:2573) [ 19.923801][ T39] ? __pfx_khugepaged (mm/khugepaged.c:2566) [ 19.924471][ T39] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4347 kernel/locking/lockdep.c:4406) [ 19.925235][ T39] ? __kthread_parkme (arch/x86/include/asm/bitops.h:206 arch/x86/include/asm/bitops.h:238 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/kthread.c:280) [ 19.925865][ T39] ? __pfx_khugepaged (mm/khugepaged.c:2566) [ 19.926499][ T39] kthread (kernel/kthread.c:389) [ 19.927083][ T39] ? __pfx_kthread (kernel/kthread.c:342) [ 19.927775][ T39] ret_from_fork (arch/x86/kernel/process.c:153) [ 19.928395][ T39] ? __pfx_kthread (kernel/kthread.c:342) [ 19.929026][ T39] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) [ 19.929685][ T39] </TASK> [ 19.930123][ T39] irq event stamp: 951 [ 19.930689][ T39] hardirqs last enabled at (965): __up_console_sem (arch/x86/include/asm/irqflags.h:42 (discriminator 1) arch/x86/include/asm/irqflags.h:97 (discriminator 1) arch/x86/include/asm/irqflags.h:155 (discriminator 1) kernel/printk/printk.c:344 (discriminator 1)) [ 19.931960][ T39] hardirqs last disabled at (978): __up_console_sem (kernel/printk/printk.c:342 (discriminator 1)) [ 19.933120][ T39] softirqs last enabled at (892): handle_softirqs (arch/x86/include/asm/preempt.h:26 kernel/softirq.c:401 kernel/softirq.c:582) [ 19.934254][ T39] softirqs last disabled at (885): __irq_exit_rcu (kernel/softirq.c:589 kernel/softirq.c:428 kernel/softirq.c:637) [ 19.935397][ T39] ---[ end trace 0000000000000000 ]--- The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20241023/202410231434.e98a7287-oliver.sang@intel.com
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index f9c39898eaff..934051274f7a 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1142,23 +1142,25 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, goto out_nolock; } - mmap_read_unlock(mm); - /* - * Prevent all access to pagetables with the exception of - * gup_fast later handled by the ptep_clear_flush and the VM - * handled by the anon_vma lock + PG_lock. - * - * UFFDIO_MOVE is prevented to race as well thanks to the - * mmap_lock. - */ - mmap_write_lock(mm); - result = hugepage_vma_revalidate(mm, address, true, &vma, cc); - if (result != SCAN_SUCCEED) - goto out_up_write; - /* check if the pmd is still valid */ - result = check_pmd_still_valid(mm, address, pmd); - if (result != SCAN_SUCCEED) - goto out_up_write; + if (upgrade_read(&mm->mmap_lock)) { + mmap_read_unlock(mm); + /* + * Prevent all access to pagetables with the exception of + * gup_fast later handled by the ptep_clear_flush and the VM + * handled by the anon_vma lock + PG_lock. + * + * UFFDIO_MOVE is prevented to race as well thanks to the + * mmap_lock. + */ + mmap_write_lock(mm); + result = hugepage_vma_revalidate(mm, address, true, &vma, cc); + if (result != SCAN_SUCCEED) + goto out_up_write; + /* check if the pmd is still valid */ + result = check_pmd_still_valid(mm, address, pmd); + if (result != SCAN_SUCCEED) + goto out_up_write; + } vma_start_write(vma); anon_vma_lock_write(vma->anon_vma);