diff mbox series

[RFC,2/2] khugepaged: use upgrade_read() to optimize collapse_huge_page

Message ID 20241016043600.35139-3-lizhe.67@bytedance.com (mailing list archive)
State New
Headers show
Series rwsem: introduce upgrade_read interface | expand

Commit Message

lizhe.67@bytedance.com Oct. 16, 2024, 4:36 a.m. UTC
From: Li Zhe <lizhe.67@bytedance.com>

In function collapse_huge_page(), we drop mmap read lock and get
mmap write lock to prevent most accesses to pagetables. There is
a small time window to allow other tasks to acquire the mmap lock.
With the use of upgrade_read(), we don't need to check vma and pmd
again in most cases.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/khugepaged.c | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

Comments

Matthew Wilcox Oct. 16, 2024, 11:53 a.m. UTC | #1
On Wed, Oct 16, 2024 at 12:36:00PM +0800, lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
> 
> In function collapse_huge_page(), we drop mmap read lock and get
> mmap write lock to prevent most accesses to pagetables. There is
> a small time window to allow other tasks to acquire the mmap lock.
> With the use of upgrade_read(), we don't need to check vma and pmd
> again in most cases.

This is clearly a performance optimisation.  So you must have some
numebrs that justify this, please include them.
lizhe.67@bytedance.com Oct. 17, 2024, 6:18 a.m. UTC | #2
On Wed, 16 Oct 2024 12:53:15 +0100, willy@infradead.org wrote:

>On Wed, Oct 16, 2024 at 12:36:00PM +0800, lizhe.67@bytedance.com wrote:
>> From: Li Zhe <lizhe.67@bytedance.com>
>> 
>> In function collapse_huge_page(), we drop mmap read lock and get
>> mmap write lock to prevent most accesses to pagetables. There is
>> a small time window to allow other tasks to acquire the mmap lock.
>> With the use of upgrade_read(), we don't need to check vma and pmd
>> again in most cases.
>
>This is clearly a performance optimisation.  So you must have some
>numebrs that justify this, please include them.

Yes, I will add the relevant data to v2 patch.

Thanks!
Matthew Wilcox Oct. 17, 2024, 1:20 p.m. UTC | #3
On Thu, Oct 17, 2024 at 02:18:41PM +0800, lizhe.67@bytedance.com wrote:
> On Wed, 16 Oct 2024 12:53:15 +0100, willy@infradead.org wrote:
> 
> >On Wed, Oct 16, 2024 at 12:36:00PM +0800, lizhe.67@bytedance.com wrote:
> >> From: Li Zhe <lizhe.67@bytedance.com>
> >> 
> >> In function collapse_huge_page(), we drop mmap read lock and get
> >> mmap write lock to prevent most accesses to pagetables. There is
> >> a small time window to allow other tasks to acquire the mmap lock.
> >> With the use of upgrade_read(), we don't need to check vma and pmd
> >> again in most cases.
> >
> >This is clearly a performance optimisation.  So you must have some
> >numebrs that justify this, please include them.
> 
> Yes, I will add the relevant data to v2 patch.

How about telling us all now so we know whether to continue discussing
this?
lizhe.67@bytedance.com Oct. 18, 2024, 6:37 a.m. UTC | #4
On Thu, 17 Oct 2024 14:20:12 +0100, willy@infradead.org wrote:

> On Thu, Oct 17, 2024 at 02:18:41PM +0800, lizhe.67@bytedance.com wrote:
> > On Wed, 16 Oct 2024 12:53:15 +0100, willy@infradead.org wrote:
> > 
> > >On Wed, Oct 16, 2024 at 12:36:00PM +0800, lizhe.67@bytedance.com wrote:
> > >> From: Li Zhe <lizhe.67@bytedance.com>
> > >> 
> > >> In function collapse_huge_page(), we drop mmap read lock and get
> > >> mmap write lock to prevent most accesses to pagetables. There is
> > >> a small time window to allow other tasks to acquire the mmap lock.
> > >> With the use of upgrade_read(), we don't need to check vma and pmd
> > >> again in most cases.
> > >
> > >This is clearly a performance optimisation.  So you must have some
> > >numebrs that justify this, please include them.
> > 
> > Yes, I will add the relevant data to v2 patch.
> 
> How about telling us all now so we know whether to continue discussing
> this?

In my test environment, function collapse_huge_page() only achieved a 0.25%
performance improvement. I use ftrace to get the execution time of
collapse_huge_page(). The test code and test command are as follows.

(1) Test result:

			average execution time of collapse_huge_page()
before this patch: 		1611.06283 us
after this patch:               1597.01474 us

(2) Test code:

#define MMAP_SIZE (2ul*1024*1024)
#define ALIGN(x, mask)  (((x) + ((mask)-1)) & ~((mask)-1))
int main(void)
{
	int num = 100;
	size_t page_sz = getpagesize();
	while (num--) {
		size_t index;
		unsigned char *p_map;
		unsigned char *p_map_real;
		p_map = (unsigned char *)mmap(0, 2 * MMAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0); 
		if (p_map == MAP_FAILED) { 
			printf("mmap fail\n"); 
			return -1;
		} else {
			p_map_real = (char *)ALIGN((unsigned long)p_map, MMAP_SIZE);
			printf("mmap get %p, align to %p\n", p_map, p_map_real);
		}
		for(index = 0; index < MMAP_SIZE; index += page_sz)
			p_map_real[index] = 6;
		int ret = madvise(p_map_real, MMAP_SIZE, 25);
		printf("ret is %d\n", ret);
		munmap(p_map, 2 * MMAP_SIZE); 
	}
	return 0;
}

(3) Test command:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
gcc test.c -o test
trace-cmd record -p function_graph -g collapse_huge_page --max-graph-depth 1 ./test

The optimization of the function collapse_huge_page() seems insignificant.
I am not sure whether it will have a more obvious optimization effect in
other scenarios.
kernel test robot Oct. 23, 2024, 7:27 a.m. UTC | #5
Hello,

kernel test robot noticed "WARNING:at_include/linux/rwsem.h:#collapse_huge_page" on:

commit: 6604438065fc95d188ffcde12f687dfc319ef2cc ("[RFC 2/2] khugepaged: use upgrade_read() to optimize collapse_huge_page")
url: https://github.com/intel-lab-lkp/linux/commits/lizhe-67-bytedance-com/rwsem-introduce-upgrade_read-interface/20241016-123810
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 823a566221a5639f6c69424897218e5d6431a970
patch link: https://lore.kernel.org/all/20241016043600.35139-3-lizhe.67@bytedance.com/
patch subject: [RFC 2/2] khugepaged: use upgrade_read() to optimize collapse_huge_page

in testcase: boot

config: x86_64-rhel-8.3-kselftests
compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+------------------------------------------------------+------------+------------+
|                                                      | 6a6ad5e040 | 6604438065 |
+------------------------------------------------------+------------+------------+
| WARNING:at_include/linux/rwsem.h:#collapse_huge_page | 0          | 18         |
| RIP:collapse_huge_page                               | 0          | 18         |
+------------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202410231434.e98a7287-oliver.sang@intel.com


[   19.879799][   T39] ------------[ cut here ]------------
[ 19.880779][ T39] WARNING: CPU: 0 PID: 39 at include/linux/rwsem.h:203 collapse_huge_page (include/linux/rwsem.h:203 include/linux/mmap_lock.h:70 include/linux/mm.h:735 include/linux/mm.h:754 mm/khugepaged.c:1165) 
[   19.881951][   T39] Modules linked in: sch_fq_codel
[   19.882624][   T39] CPU: 0 UID: 0 PID: 39 Comm: khugepaged Not tainted 6.12.0-rc2-00004-g6604438065fc #1
[   19.883821][   T39] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 19.885086][ T39] RIP: 0010:collapse_huge_page (include/linux/rwsem.h:203 include/linux/mmap_lock.h:70 include/linux/mm.h:735 include/linux/mm.h:754 mm/khugepaged.c:1165) 
[ 19.885835][ T39] Code: 0f 85 7f f9 ff ff 48 c7 c2 00 e0 59 84 be 6e 03 00 00 48 c7 c7 60 e0 59 84 c6 05 fd 58 49 04 01 e8 88 4e 7e ff e9 5b f9 ff ff <0f> 0b 48 b8 00 00 00 00 00 fc ff df 4c 89 c2 48 c1 ea 03 80 3c 02
All code
========
   0:	0f 85 7f f9 ff ff    	jne    0xfffffffffffff985
   6:	48 c7 c2 00 e0 59 84 	mov    $0xffffffff8459e000,%rdx
   d:	be 6e 03 00 00       	mov    $0x36e,%esi
  12:	48 c7 c7 60 e0 59 84 	mov    $0xffffffff8459e060,%rdi
  19:	c6 05 fd 58 49 04 01 	movb   $0x1,0x44958fd(%rip)        # 0x449591d
  20:	e8 88 4e 7e ff       	callq  0xffffffffff7e4ead
  25:	e9 5b f9 ff ff       	jmpq   0xfffffffffffff985
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  33:	fc ff df 
  36:	4c 89 c2             	mov    %r8,%rdx
  39:	48 c1 ea 03          	shr    $0x3,%rdx
  3d:	80                   	.byte 0x80
  3e:	3c 02                	cmp    $0x2,%al

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
   9:	fc ff df 
   c:	4c 89 c2             	mov    %r8,%rdx
   f:	48 c1 ea 03          	shr    $0x3,%rdx
  13:	80                   	.byte 0x80
  14:	3c 02                	cmp    $0x2,%al
[   19.888059][   T39] RSP: 0000:ffffc90000297998 EFLAGS: 00010246
[   19.888832][   T39] RAX: 0000000000000000 RBX: 1ffff92000052f39 RCX: 0000000000000000
[   19.889839][   T39] RDX: 0000000000000000 RSI: ffff88813394afd8 RDI: ffff88810625c320
[   19.890860][   T39] RBP: ffff8882e908a6c8 R08: ffff8882e908a6d8 R09: ffffed10267295ee
[   19.891844][   T39] R10: ffff88813394af77 R11: ffff8882ea258440 R12: ffff88813394ae40
[   19.892865][   T39] R13: ffffffff859ef180 R14: ffffc90000297ab8 R15: ffff88813394af68
[   19.893868][   T39] FS:  0000000000000000(0000) GS:ffff8883aee00000(0000) knlGS:0000000000000000
[   19.894985][   T39] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   19.895775][   T39] CR2: 0000000028a5b608 CR3: 0000000133892000 CR4: 00000000000406f0
[   19.896782][   T39] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   19.897783][   T39] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   19.898793][   T39] Call Trace:
[   19.902237][   T39]  <TASK>
[ 19.902820][ T39] ? collapse_huge_page (include/linux/rwsem.h:203 include/linux/mmap_lock.h:70 include/linux/mm.h:735 include/linux/mm.h:754 mm/khugepaged.c:1165) 
[ 19.903524][ T39] ? __warn (kernel/panic.c:748) 
[ 19.904099][ T39] ? collapse_huge_page (include/linux/rwsem.h:203 include/linux/mmap_lock.h:70 include/linux/mm.h:735 include/linux/mm.h:754 mm/khugepaged.c:1165) 
[ 19.904863][ T39] ? report_bug (lib/bug.c:180 lib/bug.c:219) 
[ 19.905477][ T39] ? handle_bug (arch/x86/kernel/traps.c:285) 
[ 19.906057][ T39] ? exc_invalid_op (arch/x86/kernel/traps.c:309 (discriminator 1)) 
[ 19.906674][ T39] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621) 
[ 19.907345][ T39] ? collapse_huge_page (include/linux/rwsem.h:203 include/linux/mmap_lock.h:70 include/linux/mm.h:735 include/linux/mm.h:754 mm/khugepaged.c:1165) 
[ 19.908036][ T39] ? collapse_huge_page (include/linux/rwsem.h:203 include/linux/mmap_lock.h:70 include/linux/mm.h:735 include/linux/mm.h:754 mm/khugepaged.c:1165) 
[ 19.908739][ T39] ? __pte_offset_map_lock (include/linux/pgtable.h:324 include/linux/pgtable.h:594 mm/pgtable-generic.c:376) 
[ 19.909447][ T39] ? __pfx_collapse_huge_page (mm/khugepaged.c:1095) 
[ 19.910150][ T39] ? __pfx_lock_acquire (kernel/locking/lockdep.c:5793) 
[ 19.910817][ T39] ? __lock_acquire (kernel/locking/lockdep.c:5202) 
[ 19.911460][ T39] ? do_raw_spin_lock (arch/x86/include/asm/atomic.h:107 include/linux/atomic/atomic-arch-fallback.h:2170 include/linux/atomic/atomic-instrumented.h:1302 include/asm-generic/qspinlock.h:111 kernel/locking/spinlock_debug.c:116) 
[ 19.912111][ T39] ? find_held_lock (kernel/locking/lockdep.c:5315) 
[ 19.912753][ T39] ? find_held_lock (kernel/locking/lockdep.c:5315) 
[ 19.913389][ T39] ? hpage_collapse_scan_pmd (include/linux/rcupdate.h:347 include/linux/rcupdate.h:880 include/linux/pgtable.h:115 mm/khugepaged.c:1422) 
[ 19.914118][ T39] ? __lock_release+0x10b/0x440 
[ 19.914823][ T39] ? hpage_collapse_scan_pmd (include/linux/rcupdate.h:347 include/linux/rcupdate.h:880 include/linux/pgtable.h:115 mm/khugepaged.c:1422) 
[ 19.915528][ T39] ? hpage_collapse_scan_pmd (include/linux/rcupdate.h:347 include/linux/rcupdate.h:880 include/linux/pgtable.h:115 mm/khugepaged.c:1422) 
[ 19.916202][ T39] hpage_collapse_scan_pmd (mm/khugepaged.c:1427) 
[ 19.916915][ T39] ? __pfx_hpage_collapse_scan_pmd (mm/khugepaged.c:1261) 
[ 19.917678][ T39] ? __thp_vma_allowable_orders (arch/x86/include/asm/bitops.h:206 (discriminator 1) arch/x86/include/asm/bitops.h:238 (discriminator 1) include/asm-generic/bitops/instrumented-non-atomic.h:142 (discriminator 1) mm/huge_memory.c:118 (discriminator 1)) 
[ 19.918441][ T39] ? __pfx___might_resched (kernel/sched/core.c:8586) 
[ 19.919187][ T39] khugepaged_scan_mm_slot+0x8bd/0xd90 
[ 19.920082][ T39] ? __pfx_khugepaged_scan_mm_slot+0x10/0x10 
[ 19.921041][ T39] ? __pfx_lock_acquire (kernel/locking/lockdep.c:5793) 
[ 19.921756][ T39] ? __pfx_do_raw_spin_lock (kernel/locking/spinlock_debug.c:114) 
[ 19.922464][ T39] ? __pfx___might_resched (kernel/sched/core.c:8586) 
[ 19.923200][ T39] khugepaged (include/linux/spinlock.h:391 mm/khugepaged.c:2521 mm/khugepaged.c:2573) 
[ 19.923801][ T39] ? __pfx_khugepaged (mm/khugepaged.c:2566) 
[ 19.924471][ T39] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4347 kernel/locking/lockdep.c:4406) 
[ 19.925235][ T39] ? __kthread_parkme (arch/x86/include/asm/bitops.h:206 arch/x86/include/asm/bitops.h:238 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/kthread.c:280) 
[ 19.925865][ T39] ? __pfx_khugepaged (mm/khugepaged.c:2566) 
[ 19.926499][ T39] kthread (kernel/kthread.c:389) 
[ 19.927083][ T39] ? __pfx_kthread (kernel/kthread.c:342) 
[ 19.927775][ T39] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 19.928395][ T39] ? __pfx_kthread (kernel/kthread.c:342) 
[ 19.929026][ T39] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
[   19.929685][   T39]  </TASK>
[   19.930123][   T39] irq event stamp: 951
[ 19.930689][ T39] hardirqs last enabled at (965): __up_console_sem (arch/x86/include/asm/irqflags.h:42 (discriminator 1) arch/x86/include/asm/irqflags.h:97 (discriminator 1) arch/x86/include/asm/irqflags.h:155 (discriminator 1) kernel/printk/printk.c:344 (discriminator 1)) 
[ 19.931960][ T39] hardirqs last disabled at (978): __up_console_sem (kernel/printk/printk.c:342 (discriminator 1)) 
[ 19.933120][ T39] softirqs last enabled at (892): handle_softirqs (arch/x86/include/asm/preempt.h:26 kernel/softirq.c:401 kernel/softirq.c:582) 
[ 19.934254][ T39] softirqs last disabled at (885): __irq_exit_rcu (kernel/softirq.c:589 kernel/softirq.c:428 kernel/softirq.c:637) 
[   19.935397][   T39] ---[ end trace 0000000000000000 ]---


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20241023/202410231434.e98a7287-oliver.sang@intel.com
diff mbox series

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f9c39898eaff..934051274f7a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1142,23 +1142,25 @@  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 			goto out_nolock;
 	}
 
-	mmap_read_unlock(mm);
-	/*
-	 * Prevent all access to pagetables with the exception of
-	 * gup_fast later handled by the ptep_clear_flush and the VM
-	 * handled by the anon_vma lock + PG_lock.
-	 *
-	 * UFFDIO_MOVE is prevented to race as well thanks to the
-	 * mmap_lock.
-	 */
-	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
-	if (result != SCAN_SUCCEED)
-		goto out_up_write;
-	/* check if the pmd is still valid */
-	result = check_pmd_still_valid(mm, address, pmd);
-	if (result != SCAN_SUCCEED)
-		goto out_up_write;
+	if (upgrade_read(&mm->mmap_lock)) {
+		mmap_read_unlock(mm);
+		/*
+		 * Prevent all access to pagetables with the exception of
+		 * gup_fast later handled by the ptep_clear_flush and the VM
+		 * handled by the anon_vma lock + PG_lock.
+		 *
+		 * UFFDIO_MOVE is prevented to race as well thanks to the
+		 * mmap_lock.
+		 */
+		mmap_write_lock(mm);
+		result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+		if (result != SCAN_SUCCEED)
+			goto out_up_write;
+		/* check if the pmd is still valid */
+		result = check_pmd_still_valid(mm, address, pmd);
+		if (result != SCAN_SUCCEED)
+			goto out_up_write;
+	}
 
 	vma_start_write(vma);
 	anon_vma_lock_write(vma->anon_vma);