diff mbox series

[RFC,3/4] block: set mapping order for the block cache in set_init_blocksize

Message ID 20230621083823.1724337-4-p.raghav@samsung.com (mailing list archive)
State New, archived
Headers show
Series minimum folio order support in filemap | expand

Commit Message

Pankaj Raghav June 21, 2023, 8:38 a.m. UTC
From: Luis Chamberlain <mcgrof@kernel.org>

Automatically set the minimum mapping order for block devices in
set_init_blocksize(). The mapping order will be set only when the block
device uses iomap based aops.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 block/bdev.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Hannes Reinecke June 21, 2023, 9:05 a.m. UTC | #1
On 6/21/23 10:38, Pankaj Raghav wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Automatically set the minimum mapping order for block devices in
> set_init_blocksize(). The mapping order will be set only when the block
> device uses iomap based aops.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   block/bdev.c | 9 +++++++++
>   1 file changed, 9 insertions(+)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 9bb54d9d02a6..db8cede8a320 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -126,6 +126,7 @@ static void set_init_blocksize(struct block_device *bdev)
>   {
>   	unsigned int bsize = bdev_logical_block_size(bdev);
>   	loff_t size = i_size_read(bdev->bd_inode);
> +	int order, folio_order;
>   
>   	while (bsize < PAGE_SIZE) {
>   		if (size & bsize)
> @@ -133,6 +134,14 @@ static void set_init_blocksize(struct block_device *bdev)
>   		bsize <<= 1;
>   	}
>   	bdev->bd_inode->i_blkbits = blksize_bits(bsize);
> +	order = bdev->bd_inode->i_blkbits - PAGE_SHIFT;
> +	folio_order = mapping_min_folio_order(bdev->bd_inode->i_mapping);
> +
> +	if (!IS_ENABLED(CONFIG_BUFFER_HEAD)) {
> +		/* Do not allow changing the folio order after it is set */
> +		WARN_ON_ONCE(folio_order && (folio_order != order));
> +		mapping_set_folio_orders(bdev->bd_inode->i_mapping, order, 31);
> +	}
>   }
>   
>   int set_blocksize(struct block_device *bdev, int size)
This really has nothing to do with buffer heads.

In fact, I've got a patchset to make it work _with_ buffer heads.

So please, don't make it conditional on CONFIG_BUFFER_HEAD.

And we should be calling into 'mapping_set_folio_order()' only if the 
'order' argument is larger than PAGE_ORDER, otherwise we end up enabling
large folio support for _every_ block device.
Which I doubt we want.

Cheers,

Hannes
Pankaj Raghav June 21, 2023, 10:42 a.m. UTC | #2
>>       bdev->bd_inode->i_blkbits = blksize_bits(bsize);
>> +    order = bdev->bd_inode->i_blkbits - PAGE_SHIFT;
>> +    folio_order = mapping_min_folio_order(bdev->bd_inode->i_mapping);
>> +
>> +    if (!IS_ENABLED(CONFIG_BUFFER_HEAD)) {
>> +        /* Do not allow changing the folio order after it is set */
>> +        WARN_ON_ONCE(folio_order && (folio_order != order));
>> +        mapping_set_folio_orders(bdev->bd_inode->i_mapping, order, 31);
>> +    }
>>   }
>>     int set_blocksize(struct block_device *bdev, int size)
> This really has nothing to do with buffer heads.
> 
> In fact, I've got a patchset to make it work _with_ buffer heads.
> 
> So please, don't make it conditional on CONFIG_BUFFER_HEAD.
> 
> And we should be calling into 'mapping_set_folio_order()' only if the 'order' argument is larger
> than PAGE_ORDER, otherwise we end up enabling
> large folio support for _every_ block device.
> Which I doubt we want.
> 

Hmm, which aops are you using for the block device? If you are using the old aops, then we will be
using helpers from buffer.c and mpage.c which do not support large folios. I am getting a BUG_ON
when I don't use iomap based aops for the block device:

[   11.596239] kernel BUG at fs/buffer.c:2384!


[   11.596609] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[   11.597064] CPU: 3 PID: 10 Comm: kworker/u8:0 Not tainted
6.4.0-rc7-next-20230621-00010-g87171074c649-dirty #183
[   11.597934] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014


[   11.598882] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[   11.599370] RIP: 0010:block_read_full_folio+0x70d/0x8f0

Let me know what you think!

> Cheers,
> 
> Hannes
> 
>
Hannes Reinecke June 21, 2023, 11:02 a.m. UTC | #3
On 6/21/23 12:42, Pankaj Raghav wrote:
>>>        bdev->bd_inode->i_blkbits = blksize_bits(bsize);
>>> +    order = bdev->bd_inode->i_blkbits - PAGE_SHIFT;
>>> +    folio_order = mapping_min_folio_order(bdev->bd_inode->i_mapping);
>>> +
>>> +    if (!IS_ENABLED(CONFIG_BUFFER_HEAD)) {
>>> +        /* Do not allow changing the folio order after it is set */
>>> +        WARN_ON_ONCE(folio_order && (folio_order != order));
>>> +        mapping_set_folio_orders(bdev->bd_inode->i_mapping, order, 31);
>>> +    }
>>>    }
>>>      int set_blocksize(struct block_device *bdev, int size)
>> This really has nothing to do with buffer heads.
>>
>> In fact, I've got a patchset to make it work _with_ buffer heads.
>>
>> So please, don't make it conditional on CONFIG_BUFFER_HEAD.
>>
>> And we should be calling into 'mapping_set_folio_order()' only if the 'order' argument is larger
>> than PAGE_ORDER, otherwise we end up enabling
>> large folio support for _every_ block device.
>> Which I doubt we want.
>>
> 
> Hmm, which aops are you using for the block device? If you are using the old aops, then we will be
> using helpers from buffer.c and mpage.c which do not support large folios. I am getting a BUG_ON
> when I don't use iomap based aops for the block device:
> 
I know. I haven't said that mpage.c / buffer.c support large folios 
_now_. All I'm saying is that I have a patchset enabling it to support 
large folios :-)

Cheers,

Hannes
Pankaj Raghav June 21, 2023, 12:02 p.m. UTC | #4
>>
>> Hmm, which aops are you using for the block device? If you are using the old aops, then we will be
>> using helpers from buffer.c and mpage.c which do not support large folios. I am getting a BUG_ON
>> when I don't use iomap based aops for the block device:
>>
> I know. I haven't said that mpage.c / buffer.c support large folios _now_. All I'm saying is that I
> have a patchset enabling it to support large folios :-)
> 

Ah ok! I thought we are not going that route based on the discussion we had in LSF.

> Cheers,
> 
> Hannes
>
kernel test robot June 24, 2023, 8:35 a.m. UTC | #5
Hello,

kernel test robot noticed "WARNING:at_block/bdev.c:#set_init_blocksize" on:

commit: 01671335e600baacb98c70cdb010d539e5f615c2 ("[RFC 3/4] block: set mapping order for the block cache in set_init_blocksize")
url: https://github.com/intel-lab-lkp/linux/commits/Pankaj-Raghav/fs-Allow-fine-grained-control-of-folio-sizes/20230621-172322
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20230621083823.1724337-4-p.raghav@samsung.com/
patch subject: [RFC 3/4] block: set mapping order for the block cache in set_init_blocksize

in testcase: boot

compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202306241656.fa99b0e0-oliver.sang@intel.com


[   20.303844][  T130] ------------[ cut here ]------------
[ 20.304791][ T130] WARNING: CPU: 1 PID: 130 at block/bdev.c:142 set_init_blocksize (block/bdev.c:142 (discriminator 1)) 
[   20.306150][  T130] Modules linked in: sr_mod cdrom sg ata_generic bochs intel_rapl_msr ata_piix intel_rapl_common drm_vram_helper crct10dif_pclmul crc32_pclmul ppdev crc32c_intel ghash_clmulni_intel sha512_ssse3 drm_kms_helper rapl libata syscopyarea sysfillrect sysimgblt drm_ttm_helper ttm joydev ipmi_devintf parport_pc ipmi_msghandler serio_raw i2c_piix4 parport fuse drm ip_tables
[   20.310835][  T130] CPU: 1 PID: 130 Comm: systemd-udevd Not tainted 6.4.0-rc4-00504-g01671335e600 #1
[   20.312212][  T130] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 20.313760][ T130] RIP: 0010:set_init_blocksize (block/bdev.c:142 (discriminator 1)) 
[ 20.314621][ T130] Code: c1 e2 08 48 09 d0 48 0d 00 e0 03 00 48 89 86 98 00 00 00 c3 cc cc cc cc 3d ff 0f 00 00 0f 86 7a ff ff ff c1 e8 09 89 c6 eb 91 <0f> 0b eb cc ba 09 00 00 00 eb 96 66 66 2e 0f 1f 84 00 00 00 00 00
All code
========
   0:	c1 e2 08             	shl    $0x8,%edx
   3:	48 09 d0             	or     %rdx,%rax
   6:	48 0d 00 e0 03 00    	or     $0x3e000,%rax
   c:	48 89 86 98 00 00 00 	mov    %rax,0x98(%rsi)
  13:	c3                   	retq   
  14:	cc                   	int3   
  15:	cc                   	int3   
  16:	cc                   	int3   
  17:	cc                   	int3   
  18:	3d ff 0f 00 00       	cmp    $0xfff,%eax
  1d:	0f 86 7a ff ff ff    	jbe    0xffffffffffffff9d
  23:	c1 e8 09             	shr    $0x9,%eax
  26:	89 c6                	mov    %eax,%esi
  28:	eb 91                	jmp    0xffffffffffffffbb
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	eb cc                	jmp    0xfffffffffffffffa
  2e:	ba 09 00 00 00       	mov    $0x9,%edx
  33:	eb 96                	jmp    0xffffffffffffffcb
  35:	66 66 2e 0f 1f 84 00 	data16 nopw %cs:0x0(%rax,%rax,1)
  3c:	00 00 00 00 

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	eb cc                	jmp    0xffffffffffffffd0
   4:	ba 09 00 00 00       	mov    $0x9,%edx
   9:	eb 96                	jmp    0xffffffffffffffa1
   b:	66 66 2e 0f 1f 84 00 	data16 nopw %cs:0x0(%rax,%rax,1)
  12:	00 00 00 00 
[   20.317420][  T130] RSP: 0000:ffffadd780567c48 EFLAGS: 00010282
[   20.318345][  T130] RAX: 00000000fffffd00 RBX: ffff8d79b1670c00 RCX: 000000000000001d
[   20.319432][  T130] RDX: 00000000fffffffd RSI: ffff8d79b16710f8 RDI: ffff8d79b1670c00
[   20.320641][  T130] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000584843
[   20.321827][  T130] R10: 0000000000000000 R11: 0000000000586c43 R12: ffff8d7998831400
[   20.322912][  T130] R13: ffff8d7998831400 R14: ffff8d7a7aca2800 R15: ffff8d79988315f8
[   20.324947][  T130] FS:  0000000000000000(0000) GS:ffff8d7cafd00000(0063) knlGS:00000000f7929b00
[   20.326981][  T130] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[   20.328939][  T130] CR2: 0000000059b91a8c CR3: 0000000117ac4000 CR4: 00000000000406e0
[   20.330843][  T130] Call Trace:
[   20.332322][  T130]  <TASK>
[ 20.333553][ T130] ? set_init_blocksize (block/bdev.c:142 (discriminator 1)) 
[ 20.335087][ T130] ? __warn (kernel/panic.c:673) 
[ 20.336555][ T130] ? set_init_blocksize (block/bdev.c:142 (discriminator 1)) 
[ 20.337984][ T130] ? report_bug (lib/bug.c:180 lib/bug.c:219) 
[ 20.339449][ T130] ? handle_bug (arch/x86/kernel/traps.c:324) 
[ 20.340945][ T130] ? exc_invalid_op (arch/x86/kernel/traps.c:345 (discriminator 1)) 
[ 20.342386][ T130] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) 
[ 20.344022][ T130] ? set_init_blocksize (block/bdev.c:142 (discriminator 1)) 
[ 20.345530][ T130] blkdev_get_whole (block/bdev.c:626) 
[ 20.346950][ T130] blkdev_get_by_dev (block/bdev.c:837) 
[ 20.348464][ T130] ? __pfx_blkdev_open (block/fops.c:474) 
[ 20.349941][ T130] blkdev_open (block/fops.c:494) 
[ 20.351309][ T130] do_dentry_open (fs/open.c:920) 
[ 20.352846][ T130] do_open (fs/namei.c:3639) 
[ 20.354190][ T130] ? open_last_lookups (fs/namei.c:3583) 
[ 20.355564][ T130] path_openat (fs/namei.c:3792) 
[ 20.356950][ T130] do_filp_open (fs/namei.c:3818) 
[ 20.358276][ T130] do_sys_openat2 (fs/open.c:1356) 
[ 20.359601][ T130] __ia32_compat_sys_openat (fs/open.c:1430) 
[ 20.361103][ T130] __do_fast_syscall_32 (arch/x86/entry/common.c:112 arch/x86/entry/common.c:178) 
[ 20.362534][ T130] do_fast_syscall_32 (arch/x86/entry/common.c:203) 
[ 20.363986][ T130] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:122) 
[   20.365697][  T130] RIP: 0023:0xf7f90589
[ 20.366940][ T130] Code: 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
All code
========
   0:	03 74 d8 01          	add    0x1(%rax,%rbx,8),%esi
	...
  20:	00 51 52             	add    %dl,0x52(%rcx)
  23:	55                   	push   %rbp
  24:*	89 e5                	mov    %esp,%ebp		<-- trapping instruction
  26:	0f 34                	sysenter 
  28:	cd 80                	int    $0x80
  2a:	5d                   	pop    %rbp
  2b:	5a                   	pop    %rdx
  2c:	59                   	pop    %rcx
  2d:	c3                   	retq   
  2e:	90                   	nop
  2f:	90                   	nop
  30:	90                   	nop
  31:	90                   	nop
  32:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi
  39:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi

Code starting with the faulting instruction
===========================================
   0:	5d                   	pop    %rbp
   1:	5a                   	pop    %rdx
   2:	59                   	pop    %rcx
   3:	c3                   	retq   
   4:	90                   	nop
   5:	90                   	nop
   6:	90                   	nop
   7:	90                   	nop
   8:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi
   f:	8d b4 26 00 00 00 00 	lea    0x0(%rsi,%riz,1),%esi
[   20.370969][  T130] RSP: 002b:00000000ffd9b010 EFLAGS: 00200206 ORIG_RAX: 0000000000000127
[   20.372809][  T130] RAX: ffffffffffffffda RBX: 00000000ffffff9c RCX: 00000000578c6540
[   20.374652][  T130] RDX: 00000000000a8800 RSI: 0000000000000000 RDI: 000000005663b8ad
[   20.376552][  T130] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[   20.378381][  T130] R10: 0000000000000000 R11: 0000000000200206 R12: 0000000000000000
[   20.380235][  T130] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   20.381888][  T130]  </TASK>
[   20.382968][  T130] ---[ end trace 0000000000000000 ]---
LKP: ttyS0: 194:  /lkp/lkp/src/bin/run-lkp /lkp/jobs/scheduled/vm-meta-307/boot-1-debian-11.1-i386-20220923.cgz-01671335e600-20230622-46155-yec2o-1.yaml
[   29.071854][  T211] is_virt=true
[   29.071868][  T211]
[   30.394619][  T211] lkp: kernel tainted state: 0
[   30.394639][  T211]
[   31.373739][  T211] LKP: stdout: 194: Kernel tests: Boot OK!
[   31.373784][  T211]
LKP: ttyS0: 194: LKP: rebooting forcely
[   35.497053][  T211] LKP: stdout: 194: HOSTNAME vm-snb, MAC 52:54:00:12:34:56, kernel 6.4.0-rc4-00504-g01671335e600 1
[   35.497067][  T211]
[   35.674956][  T211] install debs round one: dpkg -i --force-confdef --force-depends /opt/deb/gawk_1%3a5.1.0-1_i386.deb
[   35.674975][  T211]
[   35.679306][  T211] Selecting previously unselected package gawk.
[   35.679315][  T211]
[   35.684028][  T211] (Reading database ... 16439 files and directories currently installed.)
[   35.684040][  T211]
[   35.688683][  T211] Preparing to unpack .../deb/gawk_1%3a5.1.0-1_i386.deb ...
[   35.688693][  T211]
[   35.691823][  T211] Unpacking gawk (1:5.1.0-1) ...
[   35.691830][  T211]
[   35.694690][  T211] Setting up gawk (1:5.1.0-1) ...
[   35.694697][  T211]
[   35.696731][  T211] NO_NETWORK=
[   35.696739][  T211]
[   35.730000][  T194] sysrq: Emergency Sync
[   35.730860][   T57] Emergency Sync complete
[   35.731988][  T194] sysrq: Resetting

Kboot worker: lkp-worker23
Elapsed time: 60

kvm=(
qemu-system-x86_64
-enable-kvm
-cpu SandyBridge
-kernel $kernel
-initrd initrd-vm-meta-307.cgz
-m 16384
-smp 2


To reproduce:

        # build kernel
	cd linux
	cp config-6.4.0-rc4-00504-g01671335e600 .config
	make HOSTCC=gcc-12 CC=gcc-12 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
	make HOSTCC=gcc-12 CC=gcc-12 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
	cd <mod-install-dir>
	find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
diff mbox series

Patch

diff --git a/block/bdev.c b/block/bdev.c
index 9bb54d9d02a6..db8cede8a320 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -126,6 +126,7 @@  static void set_init_blocksize(struct block_device *bdev)
 {
 	unsigned int bsize = bdev_logical_block_size(bdev);
 	loff_t size = i_size_read(bdev->bd_inode);
+	int order, folio_order;
 
 	while (bsize < PAGE_SIZE) {
 		if (size & bsize)
@@ -133,6 +134,14 @@  static void set_init_blocksize(struct block_device *bdev)
 		bsize <<= 1;
 	}
 	bdev->bd_inode->i_blkbits = blksize_bits(bsize);
+	order = bdev->bd_inode->i_blkbits - PAGE_SHIFT;
+	folio_order = mapping_min_folio_order(bdev->bd_inode->i_mapping);
+
+	if (!IS_ENABLED(CONFIG_BUFFER_HEAD)) {
+		/* Do not allow changing the folio order after it is set */
+		WARN_ON_ONCE(folio_order && (folio_order != order));
+		mapping_set_folio_orders(bdev->bd_inode->i_mapping, order, 31);
+	}
 }
 
 int set_blocksize(struct block_device *bdev, int size)