diff mbox series

[RFC,4/5] mm: truncate: split huge page cache page to a non-zero order if possible.

Message ID 20220321142128.2471199-5-zi.yan@sent.com (mailing list archive)
State New
Headers show
Series Split a huge page to any lower order pages | expand

Commit Message

Zi Yan March 21, 2022, 2:21 p.m. UTC
From: Zi Yan <ziy@nvidia.com>

To minimize the number of pages after a huge page truncation, we do not
need to split it all the way down to order-0. The huge page has at most
three parts, the part before offset, the part to be truncated, the part
remaining at the end. Find the greatest common power of two multiplier of
the non-zero values of them as the new order, so we can split the huge
page to this order and keep the remaining pages as large and as few as
possible.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c |  1 +
 mm/truncate.c    | 33 +++++++++++++++++++++++++++++++--
 2 files changed, 32 insertions(+), 2 deletions(-)

Comments

Roman Gushchin March 21, 2022, 10:32 p.m. UTC | #1
On Mon, Mar 21, 2022 at 10:21:27AM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> To minimize the number of pages after a huge page truncation, we do not
> need to split it all the way down to order-0. The huge page has at most
> three parts, the part before offset, the part to be truncated, the part
> remaining at the end. Find the greatest common power of two multiplier of
> the non-zero values of them as the new order, so we can split the huge
> page to this order and keep the remaining pages as large and as few as
> possible.

Would you mind please to describe the algorithm in more details?

Thanks!
Zi Yan March 22, 2022, 2:19 p.m. UTC | #2
On 21 Mar 2022, at 18:32, Roman Gushchin wrote:

> On Mon, Mar 21, 2022 at 10:21:27AM -0400, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> To minimize the number of pages after a huge page truncation, we do not
>> need to split it all the way down to order-0. The huge page has at most
>> three parts, the part before offset, the part to be truncated, the part
>> remaining at the end. Find the greatest common power of two multiplier of
>> the non-zero values of them as the new order, so we can split the huge
>> page to this order and keep the remaining pages as large and as few as
>> possible.
>
> Would you mind please to describe the algorithm in more details?

Sure.

During truncation, there can be three parts in a huge page:
1. the _offset_ from the beginning of the huge page,
2. the _length_ of the to-be-truncated part,
3. the _remaining_ part after the to-be-truncated part.

the size of the split huge page need to be the greatest common divisor
of the non-zero ones of three after being rounded down to power of two.

OK, I actually find there is a gcd function. I think the algorithm can
simplified to

new_order = ilog2(gcd(gcd(offset, length), remaining)) - PAGE_SHIFT;

I will update the code, the commit message, and the comment in the next
version.

Thank you for the comment.

--
Best Regards,
Yan, Zi
kernel test robot March 23, 2022, 6:40 a.m. UTC | #3
Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 2757cee2d6c6c76f672ec6566ade2dcb8c1605dd ("[RFC PATCH 4/5] mm: truncate: split huge page cache page to a non-zero order if possible.")
url: https://github.com/0day-ci/linux/commits/Zi-Yan/Split-a-huge-page-to-any-lower-order-pages/20220321-222304
base: https://github.com/hnaz/linux-mm master
patch link: https://lore.kernel.org/linux-mm/20220321142128.2471199-5-zi.yan@sent.com

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu Icelake-Server -smp 4 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <oliver.sang@intel.com>


[   25.633813][  T275] UBSAN: shift-out-of-bounds in include/linux/log2.h:67:13
[   25.635404][  T275] shift exponent 4294967295 is too large for 32-bit type 'long unsigned int'
[   25.636781][  T275] CPU: 3 PID: 275 Comm: cron Not tainted 5.17.0-rc8-mm1-00485-g2757cee2d6c6 #1
[   25.638246][  T275] Call Trace:
[ 25.638768][ T275] dump_stack_lvl (lib/dump_stack.c:107) 
[ 25.651568][ T275] dump_stack (lib/dump_stack.c:114) 
[ 25.652209][ T275] ubsan_epilogue (lib/ubsan.c:152) 
[ 25.652926][ T275] __ubsan_handle_shift_out_of_bounds.cold (arch/x86/include/asm/smap.h:85) 
[ 25.654034][ T275] ? lock_release (kernel/locking/lockdep.c:5348 kernel/locking/lockdep.c:5692) 
[ 25.654781][ T275] ? __kmap_local_pfn_prot (mm/highmem.c:532) 
[ 25.655650][ T275] ? kunmap_local_indexed (mm/highmem.c:600 (discriminator 3)) 
[ 25.656382][ T275] ? zero_user_segments (mm/highmem.c:408) 
[ 25.657110][ T275] greatest_pow_of_two_multiplier.cold (include/linux/log2.h:67 mm/truncate.c:204) 
[ 25.658134][ T275] truncate_inode_partial_folio (mm/truncate.c:255) 
g System Logging[ 25.659902][ T275] shmem_undo_range (mm/shmem.c:966) 
Service...
[ 25.660789][ T275] ? zero_user_segments (mm/highmem.c:408) 
[ 25.661744][ T275] ? folio_mark_dirty (mm/page-writeback.c:2717) 
[ 25.662480][ T275] ? unlock_page (mm/folio-compat.c:21) 
[ 25.663179][ T275] ? __lock_acquire (kernel/locking/lockdep.c:5060) 
[ 25.668235][ T275] shmem_truncate_range (mm/shmem.c:1045) 
[ 25.668251][ T275] ? setattr_prepare (fs/attr.c:108) 
[ 25.668256][ T275] ? mark_held_locks (kernel/locking/lockdep.c:4239) 
Startin[ 25.668262][ T275] shmem_setattr (mm/shmem.c:1109) 
g /etc/rc.local [ 25.675802][ T275] ? shmem_setattr (mm/shmem.c:1109) 
Compatibility...[ 25.676621][ T275] ? current_time (fs/inode.c:2406) 

[ 25.677351][ T275] notify_change (fs/attr.c:414) 
[ 25.677989][ T275] ? shmem_evict_inode (mm/shmem.c:1077) 
[ 25.678735][ T275] ? notify_change (fs/attr.c:414) 
[ 25.679481][ T275] ? lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5674) 
[ 25.680186][ T275] do_truncate (fs/open.c:67) 
[ 25.680843][ T275] ? do_truncate (fs/open.c:67) 
[ 25.681520][ T275] ? do_truncate (fs/open.c:67) 
[ 25.682230][ T275] do_sys_ftruncate (fs/open.c:197) 
[ 25.683001][ T275] __ia32_sys_ftruncate (fs/open.c:204) 
[ 25.684027][ T275] __do_fast_syscall_32 (arch/x86/entry/common.c:112 arch/x86/entry/common.c:178) 
[ 25.684801][ T275] ? irqentry_exit_to_user_mode (kernel/entry/common.c:324) 
[ 25.685720][ T275] do_fast_syscall_32 (arch/x86/entry/common.c:203) 
[ 25.686464][ T275] do_SYSENTER_32 (arch/x86/entry/common.c:247) 
[ 25.687172][ T275] entry_SYSENTER_32 (arch/x86/entry/entry_32.S:869) 
[   25.687180][  T275] EIP: 0xa7f00549
[ 25.687184][ T275] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
All code
========
   0:	03 74 c0 01          	add    0x1(%rax,%rax,8),%esi
   4:	10 05 03 74 b8 01    	adc    %al,0x1b87403(%rip)        # 0x1b8740d
   a:	10 06                	adc    %al,(%rsi)
   c:	03 74 b4 01          	add    0x1(%rsp,%rsi,4),%esi
  10:	10 07                	adc    %al,(%rdi)
  12:	03 74 b0 01          	add    0x1(%rax,%rsi,4),%esi
  16:	10 08                	adc    %cl,(%rax)
  18:	03 74 d8 01          	add    0x1(%rax,%rbx,8),%esi
  1c:	00 00                	add    %al,(%rax)
  1e:	00 00                	add    %al,(%rax)
  20:	00 51 52             	add    %dl,0x52(%rcx)
  23:	55                   	push   %rbp
  24:	89 e5                	mov    %esp,%ebp
  26:	0f 34                	sysenter 
  28:	cd 80                	int    $0x80
  2a:*	5d                   	pop    %rbp		<-- trapping instruction
  2b:	5a                   	pop    %rdx
  2c:	59                   	pop    %rcx
  2d:	c3                   	retq   
  2e:	90                   	nop
  2f:	90                   	nop
  30:	90                   	nop
  31:	90                   	nop
  32:	8d 76 00             	lea    0x0(%rsi),%esi
  35:	58                   	pop    %rax
  36:	b8 77 00 00 00       	mov    $0x77,%eax
  3b:	cd 80                	int    $0x80
  3d:	90                   	nop
  3e:	8d                   	.byte 0x8d
  3f:	76                   	.byte 0x76

Code starting with the faulting instruction
===========================================
   0:	5d                   	pop    %rbp
   1:	5a                   	pop    %rdx
   2:	59                   	pop    %rcx
   3:	c3                   	retq   
   4:	90                   	nop
   5:	90                   	nop
   6:	90                   	nop
   7:	90                   	nop
   8:	8d 76 00             	lea    0x0(%rsi),%esi
   b:	58                   	pop    %rax
   c:	b8 77 00 00 00       	mov    $0x77,%eax
  11:	cd 80                	int    $0x80
  13:	90                   	nop
  14:	8d                   	.byte 0x8d
  15:	76                   	.byte 0x76
[   25.687187][  T275] EAX: ffffffda EBX: 00000003 ECX: 00000004 EDX: 00453000
[   25.687190][  T275] ESI: 00000004 EDI: 00000002 EBP: 0207e168 ESP: aff7174c
[   25.687193][  T275] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[   25.687280][  T275] ================================================================================
Starting LKP bootstrap...
[  OK  ] Started D-Bus System Message Bus.
[   25.719619] rc.local[280]: mkdir: cannot create directory '/var/lock/lkp-bootstrap.lock': File exists
[  OK  ] Started Daily Cleanup of Temporary Directories.
Starting Login Service...
[  OK  ] Started Daily apt upgrade and clean activities.
[  OK  ] Reached target Timers.
Starting LSB: Execute the kexec -e command to reboot system...
Starting OpenBSD Secure Shell server...
Starting Permit User Sessions...
[  OK  ] Started LKP bootstrap.
[  OK  ] Started Permit User Sessions.
[  OK  ] Started LSB: Start and stop bmc-watchdog.
[  OK  ] Started OpenBSD Secure Shell server.
[  OK  ] Started LSB: Execute the kexec -e command to reboot system.
[  OK  ] Started Login Service.
Starting LSB: Load kernel image with kexec...
LKP: ttyS0: 290: Kernel tests: Boot OK!
LKP: ttyS0: 290: HOSTNAME vm-icl-74, MAC 52:54:00:12:34:56, kernel 5.17.0-rc8-mm1-00485-g2757cee2d6c6 1
[  OK  ] Reached target Sound Card.
[  OK  ] Reached target Printer.
[  OK  ] Started LSB: Load kernel image with kexec.
LKP: ttyS0: 290:  /lkp/lkp/src/bin/run-lkp /lkp/jobs/scheduled/vm-icl-74/boot-1-debian-i386-20191205.cgz-2757cee2d6c6c76f672ec6566ade2dcb8c1605dd-20220322-39641-na3mp9-2.yaml
[   32.682503][  T516] wget (516) used greatest stack depth: 5812 bytes left
[  OK  ] Started System Logging Service.
[   36.389049][  T556] rsync (556) used greatest stack depth: 5772 bytes left
[   37.653297][  T315] LKP: stdout: 290: Kernel tests: Boot OK!
[   37.653316][  T315]
[   40.012523][  T605] wget (605) used greatest stack depth: 5732 bytes left
LKP: ttyS0: 290: LKP: rebooting forcely
[   42.275963][  T315] LKP: stdout: 290: HOSTNAME vm-icl-74, MAC 52:54:00:12:34:56, kernel 5.17.0-rc8-mm1-00485-g2757cee2d6c6 1
[   42.275991][  T315]
[   42.291539][  T315] install debs round one: dpkg -i --force-confdef --force-depends /opt/deb/gawk_1%3a4.1.4+dfsg-1_i386.deb
[   42.291562][  T315]
[   42.297409][  T315] Selecting previously unselected package gawk.
[   42.297426][  T315]
[   42.314957][  T315] (Reading database ... 16210 files and directories currently installed.)
[   42.314974][  T315]
[   42.319689][  T315] Preparing to unpack .../gawk_1%3a4.1.4+dfsg-1_i386.deb ...
[   42.319710][  T315]
[   42.323192][  T315] Unpacking gawk (1:4.1.4+dfsg-1) ...
[   42.323208][  T315]
[   42.326270][  T315] Setting up gawk (1:4.1.4+dfsg-1) ...
[   42.326285][  T315]
[   42.941438][  T290] sysrq: Emergency Sync
[   42.942418][   T36] Emergency Sync complete
[   42.943081][  T290] sysrq: Resetting

Kboot worker: lkp-worker04
Elapsed time: 60

kvm=(
qemu-system-x86_64
-enable-kvm


To reproduce:

        # build kernel
	cd linux
	cp config-5.17.0-rc8-mm1-00485-g2757cee2d6c6 .config
	make HOSTCC=gcc-9 CC=gcc-9 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
	make HOSTCC=gcc-9 CC=gcc-9 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
	cd <mod-install-dir>
	find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
diff mbox series

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3617aa3ad0b1..76db0092a1e2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2349,6 +2349,7 @@  static void __split_huge_page_tail(struct page *head, int tail,
 		prep_compound_page(page_tail, new_order);
 		prep_transhuge_page(page_tail);
 	}
+	VM_BUG_ON_PAGE(PageTail(page_tail), page_tail);
 
 	/* Finally unfreeze refcount. Additional reference from page cache. */
 	page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
diff --git a/mm/truncate.c b/mm/truncate.c
index ab50d0d59a2a..4f71e67dec09 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -197,6 +197,14 @@  int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
 	return 0;
 }
 
+static unsigned int greatest_pow_of_two_multiplier(unsigned int num)
+{
+	if (num & 1)
+		return 0;
+	return min_t(unsigned int, ilog2(num),
+		ilog2(num - rounddown_pow_of_two(num)));
+}
+
 /*
  * Handle partial folios.  The folio may be entirely within the
  * range if a split has raced with us.  If not, we zero the part of the
@@ -211,7 +219,8 @@  int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
 bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 {
 	loff_t pos = folio_pos(folio);
-	unsigned int offset, length;
+	unsigned int offset, length, remaining;
+	unsigned int new_order = folio_order(folio);
 
 	if (pos < start)
 		offset = start - pos;
@@ -222,6 +231,7 @@  bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 		length = length - offset;
 	else
 		length = end + 1 - pos - offset;
+	remaining = folio_size(folio) - offset - length;
 
 	folio_wait_writeback(folio);
 	if (length == folio_size(folio)) {
@@ -236,11 +246,30 @@  bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 	 */
 	folio_zero_range(folio, offset, length);
 
+	/*
+	 * Find the greatest common power of two multiplier of the non-zero
+	 * offset, length, and remaining as the new order. So we can truncate
+	 * a subpage as large as possible.
+	 */
+	if (offset)
+		new_order = greatest_pow_of_two_multiplier(offset / PAGE_SIZE);
+	if (length)
+		new_order = min_t(unsigned int, new_order,
+			greatest_pow_of_two_multiplier(length / PAGE_SIZE));
+	if (remaining)
+		new_order = min_t(unsigned int, new_order,
+			greatest_pow_of_two_multiplier(remaining / PAGE_SIZE));
+
+	/* order-1 THP not supported, downgrade to order-0 */
+	if (new_order == 1)
+		new_order = 0;
+
+
 	if (folio_has_private(folio))
 		folio_invalidate(folio, offset, length);
 	if (!folio_test_large(folio))
 		return true;
-	if (split_huge_page(&folio->page) == 0)
+	if (split_huge_page_to_list_to_order(&folio->page, NULL, new_order) == 0)
 		return true;
 	if (folio_test_dirty(folio))
 		return false;