
[0/7] RFC: high-order folio support for I/O

Message ID: 20230614114637.89759-1-hare@suse.de

Message

Hannes Reinecke June 14, 2023, 11:46 a.m. UTC
Hi all,

now, that was easy.
Thanks to willy and his recent patchset to support large folios in
gfs2, it turns out that most of the work to support high-order folios
for I/O is actually done.
It only needs two rather obvious patches: one to allocate folios with
the order derived from the mapping blocksize, and one to adjust
readahead to avoid reading off the end of the device.
But with these two patches (and the patchset from hch to switch
the block device over to iomap) (and the patchset from ritesh to
support sub-blocksize iomap buffers) I can now do:

# modprobe brd rd_size=524288 rd_blksize=16384
# mkfs.xfs -b size=16384 /dev/ram0

it still fails when trying to mount the device:

XFS (ram0): Cannot set_blocksize to 16384 on device ram0

but to my understanding this is being worked on.
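
For reference, the two patches basically boil down to deriving the folio
allocation order from the inode's block size, and rounding the readahead
window down to a whole number of blocks. A minimal sketch of the first
part, just to illustrate the idea (this is not the actual patch, and the
helper name is made up):

static inline unsigned int mapping_blocksize_order(struct address_space *mapping)
{
	unsigned int blkbits = mapping->host->i_blkbits;

	/* One folio must cover at least one filesystem block. */
	return blkbits > PAGE_SHIFT ? blkbits - PAGE_SHIFT : 0;
}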

Christoph, any chance to have an updated submission of your
patchset to convert block devices over to iomap?
I don't actually need the last one to switch off buffer heads,
but the others really do help for this case.

The entire tree can be found at:

git.kernel.org:/pub/scm/linux/git/kernel/hare/scsi-devel.git
branch brd.v2

Happy hacking!

Hannes Reinecke (6):
  brd: convert to folios
  brd: abstract page_size conventions
  brd: make sector size configurable
  brd: make logical sector size configurable
  mm/filemap: allocate folios with mapping blocksize
  mm/readahead: align readahead down to mapping blocksize

Pankaj Raghav (1):
  brd: use XArray instead of radix-tree to index backing pages

 drivers/block/brd.c     | 320 +++++++++++++++++++++-------------------
 include/linux/pagemap.h |   7 +
 mm/filemap.c            |   7 +-
 mm/readahead.c          |  10 +-
 4 files changed, 186 insertions(+), 158 deletions(-)

Comments

Hannes Reinecke June 14, 2023, 1:17 p.m. UTC | #1
On 6/14/23 13:46, Hannes Reinecke wrote:
> Hi all,
> 
> now, that was easy.
> Thanks to willy and his recent patchset to support large folios in
> gfs2, it turns out that most of the work to support high-order folios
> for I/O is actually done.
> It only needs two rather obvious patches: one to allocate folios with
> the order derived from the mapping blocksize, and one to adjust
> readahead to avoid reading off the end of the device.
> But with these two patches (and the patchset from hch to switch
> the block device over to iomap) (and the patchset from ritesh to
> support sub-blocksize iomap buffers) I can now do:
> 
> # modprobe brd rd_size=524288 rd_blksize=16384
> # mkfs.xfs -b size=16384 /dev/ram0
> 
> it still fails when trying to mount the device:
> 
> XFS (ram0): Cannot set_blocksize to 16384 on device ram0
> 
> but to my understanding this is being worked on.
> 
Turns out that was quite easy to fix (just remove the check in 
set_blocksize()), but now I get this:

SGI XFS with ACLs, security attributes, quota, no debug enabled
XFS (ram0): File system with blocksize 16384 bytes. Only pagesize (4096) 
or less will currently work.

Hmm. And btrfs doesn't fare better here:

BTRFS info (device ram0): using crc32c (crc32c-intel) checksum algorithm
BTRFS error (device ram0): sectorsize 16384 not yet supported for page 
size 4096
BTRFS error (device ram0): superblock contains fatal errors
BTRFS error (device ram0): open_ctree failed

But at least we _can_ test these filesystems now :-)

Cheers,

Hannes
Matthew Wilcox (Oracle) June 14, 2023, 1:53 p.m. UTC | #2
On Wed, Jun 14, 2023 at 03:17:25PM +0200, Hannes Reinecke wrote:
> Turns out that was quite easy to fix (just remove the check in
> set_blocksize()), but now I get this:
> 
> SGI XFS with ACLs, security attributes, quota, no debug enabled
> XFS (ram0): File system with blocksize 16384 bytes. Only pagesize (4096) or
> less will currently work.

What happens if you just remove this hunk:

+++ b/fs/xfs/xfs_super.c
@@ -1583,18 +1583,6 @@ xfs_fs_fill_super(
                goto out_free_sb;
        }

-       /*
-        * Until this is fixed only page-sized or smaller data blocks work.
-        */
-       if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
-               xfs_warn(mp,
-               "File system with blocksize %d bytes. "
-               "Only pagesize (%ld) or less will currently work.",
-                               mp->m_sb.sb_blocksize, PAGE_SIZE);
-               error = -ENOSYS;
-               goto out_free_sb;
-       }
-
        /* Ensure this filesystem fits in the page cache limits */
        if (xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_dblocks) ||
            xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_rblocks)) {
Hannes Reinecke June 14, 2023, 3:06 p.m. UTC | #3
On 6/14/23 15:53, Matthew Wilcox wrote:
> On Wed, Jun 14, 2023 at 03:17:25PM +0200, Hannes Reinecke wrote:
>> Turns out that was quite easy to fix (just remove the check in
>> set_blocksize()), but now I get this:
>>
>> SGI XFS with ACLs, security attributes, quota, no debug enabled
>> XFS (ram0): File system with blocksize 16384 bytes. Only pagesize (4096) or
>> less will currently work.
> 
> What happens if you just remove this hunk:
> 
> +++ b/fs/xfs/xfs_super.c
> @@ -1583,18 +1583,6 @@ xfs_fs_fill_super(
>                  goto out_free_sb;
>          }
> 
> -       /*
> -        * Until this is fixed only page-sized or smaller data blocks work.
> -        */
> -       if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> -               xfs_warn(mp,
> -               "File system with blocksize %d bytes. "
> -               "Only pagesize (%ld) or less will currently work.",
> -                               mp->m_sb.sb_blocksize, PAGE_SIZE);
> -               error = -ENOSYS;
> -               goto out_free_sb;
> -       }
> -
>          /* Ensure this filesystem fits in the page cache limits */
>          if (xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_dblocks) ||
>              xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_rblocks)) {

Whee! That works!

Rebased things with your memcpy_{to,from}_folio() patches, disabled that 
chunk, and:

# mount /dev/ram0 /mnt
XFS (ram0): Mounting V5 Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
XFS (ram0): Ending clean mount
xfs filesystem being mounted at /mnt supports timestamps until 
2038-01-19 (0x7fffffff)
# umount /mnt
XFS (ram0): Unmounting Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551

Great work, Matthew!

(Now I just need to check why copying data from NFS crashes ...)

Cheers,

Hannes
Hannes Reinecke June 14, 2023, 3:35 p.m. UTC | #4
On 6/14/23 17:06, Hannes Reinecke wrote:
[ .. ]
> 
> Whee! That works!
> 
> Rebased things with your memcpy_{to,from}_folio() patches, disabled that 
> chunk, and:
> 
> # mount /dev/ram0 /mnt
> XFS (ram0): Mounting V5 Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
> XFS (ram0): Ending clean mount
> xfs filesystem being mounted at /mnt supports timestamps until 
> 2038-01-19 (0x7fffffff)
> # umount /mnt
> XFS (ram0): Unmounting Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
> 
> Great work, Matthew!
> 
> (Now I just need to check why copying data from NFS crashes ...)
> 
Hmm. And for that I'm hitting include/linux/pagemap.h:1250 pretty 
consistently; something's going haywire with readahead.

Matthew, are you sure that this one:

/**
 * readahead_length - The number of bytes in this readahead request.
 * @rac: The readahead request.
 */
static inline size_t readahead_length(struct readahead_control *rac)
{
         return rac->_nr_pages * PAGE_SIZE;
}

is tenable for large folios?
Especially as we have in mm/readahead.c:499

         ractl->_nr_pages += 1UL << order;

Hmm?

Cheers,

Hannes
Matthew Wilcox (Oracle) June 14, 2023, 5:46 p.m. UTC | #5
On Wed, Jun 14, 2023 at 05:35:02PM +0200, Hannes Reinecke wrote:
> Hmm. And for that I'm hitting include/linux/pagemap.h:1250 pretty
> consistently; something's going haywire with readahead.

Is that this one?

        VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

in __readahead_folio()

> Matthew, are you sure that this one:
> 
> /**
>  * readahead_length - The number of bytes in this readahead request.
>  * @rac: The readahead request.
>  */
> static inline size_t readahead_length(struct readahead_control *rac)
> {
>         return rac->_nr_pages * PAGE_SIZE;
> }
> 
> is tenable for large folios?
> Especially as we have in mm/readahead.c:499
> 
>         ractl->_nr_pages += 1UL << order;
> 
> Hmm?

Yes.  nr_pages really is the number of pages.
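
To spell out the accounting with a worked example (numbers picked purely
for illustration): queueing a single order-2 folio for a 16k block on a
4k-page system does

	ractl->_nr_pages += 1UL << 2;	/* i.e. 4 pages */

so readahead_length() returns 4 * PAGE_SIZE == 16384, which is exactly
the folio_size() of that folio; the unit stays "pages" throughout.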

I'm somewhat surprised you're not hitting:

                VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);

in __filemap_add_folio().  Maybe the filesystems in question always
access indices which are naturally aligned to the block size.
Dave Chinner June 14, 2023, 11:53 p.m. UTC | #6
On Wed, Jun 14, 2023 at 05:06:14PM +0200, Hannes Reinecke wrote:
> On 6/14/23 15:53, Matthew Wilcox wrote:
> > On Wed, Jun 14, 2023 at 03:17:25PM +0200, Hannes Reinecke wrote:
> > > Turns out that was quite easy to fix (just remove the check in
> > > set_blocksize()), but now I get this:
> > > 
> > > SGI XFS with ACLs, security attributes, quota, no debug enabled
> > > XFS (ram0): File system with blocksize 16384 bytes. Only pagesize (4096) or
> > > less will currently work.
> > 
> > What happens if you just remove this hunk:
> > 
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1583,18 +1583,6 @@ xfs_fs_fill_super(
> >                  goto out_free_sb;
> >          }
> > 
> > -       /*
> > -        * Until this is fixed only page-sized or smaller data blocks work.
> > -        */
> > -       if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> > -               xfs_warn(mp,
> > -               "File system with blocksize %d bytes. "
> > -               "Only pagesize (%ld) or less will currently work.",
> > -                               mp->m_sb.sb_blocksize, PAGE_SIZE);
> > -               error = -ENOSYS;
> > -               goto out_free_sb;
> > -       }
> > -
> >          /* Ensure this filesystem fits in the page cache limits */
> >          if (xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_dblocks) ||
> >              xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_rblocks)) {
> 
> Whee! That works!
> 
> Rebased things with your memcpy_{to,from}_folio() patches, disabled that
> chunk, and:
> 
> # mount /dev/ram0 /mnt
> XFS (ram0): Mounting V5 Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
> XFS (ram0): Ending clean mount
> xfs filesystem being mounted at /mnt supports timestamps until 2038-01-19
> (0x7fffffff)
> # umount /mnt
> XFS (ram0): Unmounting Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551

Mounting the filesystem doesn't mean it works. XFS metadata has
always worked with bs > ps, and mounting the filesystem only does
metadata IO.

It's not until you start reading/writing user data that the
filesystem will start exercising the page cache....

> Great work, Matthew!
> 
> (Now I just need to check why copying data from NFS crashes ...)

.... and then we see it doesn't actually work. :)

Likely you also need the large folio support in the iomap write path
patches from Willy, plus whatever corner cases in iomap that still
have implicit dependencies on PAGE_SIZE need to be fixed (truncate,
invalidate, sub-block zeroing, etc may not be doing exactly the
right thing).

All you need to do now is run the BS > PS filesystems through a full
fstests pass (reflink + rmap enabled, auto group), and then we can
start on the real data integrity validation work. It'll need tens of
billions of fsx ops run on it, days of recoveryloop testing, days of
fsstress based exercise, etc before we can actually enable it in
XFS....

Cheers,

Dave.
Dave Chinner June 15, 2023, 3:44 a.m. UTC | #7
On Wed, Jun 14, 2023 at 05:06:14PM +0200, Hannes Reinecke wrote:
> On 6/14/23 15:53, Matthew Wilcox wrote:
> > On Wed, Jun 14, 2023 at 03:17:25PM +0200, Hannes Reinecke wrote:
> > > Turns out that was quite easy to fix (just remove the check in
> > > set_blocksize()), but now I get this:
> > > 
> > > SGI XFS with ACLs, security attributes, quota, no debug enabled
> > > XFS (ram0): File system with blocksize 16384 bytes. Only pagesize (4096) or
> > > less will currently work.
> > 
> > What happens if you just remove this hunk:
> > 
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1583,18 +1583,6 @@ xfs_fs_fill_super(
> >                  goto out_free_sb;
> >          }
> > 
> > -       /*
> > -        * Until this is fixed only page-sized or smaller data blocks work.
> > -        */
> > -       if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> > -               xfs_warn(mp,
> > -               "File system with blocksize %d bytes. "
> > -               "Only pagesize (%ld) or less will currently work.",
> > -                               mp->m_sb.sb_blocksize, PAGE_SIZE);
> > -               error = -ENOSYS;
> > -               goto out_free_sb;
> > -       }
> > -
> >          /* Ensure this filesystem fits in the page cache limits */
> >          if (xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_dblocks) ||
> >              xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_rblocks)) {
> 
> Whee! That works!
> 
> Rebased things with your memcpy_{to,from}_folio() patches, disabled that
> chunk, and:
> 
> # mount /dev/ram0 /mnt

What is the output of mkfs.xfs?

> XFS (ram0): Mounting V5 Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
> XFS (ram0): Ending clean mount
> xfs filesystem being mounted at /mnt supports timestamps until 2038-01-19
> (0x7fffffff)
> # umount /mnt
> XFS (ram0): Unmounting Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551

Nope. Not here.

Debug kernel builds assert fail at mount time with:

XFS: Assertion failed: PAGE_SHIFT >= sbp->sb_blocklog, file: fs/xfs/xfs_mount.c, line: 133

Because we do a check to ensure that the entire filesystem address
range can be indexed by the page cache. I suspect this is actually a
stale, left over check from the days we used the page cache for
indexing cached metadata, but with that sorted....
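
For reference, that check is roughly of this shape (paraphrased from
memory, not the exact XFS code): with block size <= page size the page
cache index of the filesystem's last block is

	nblocks >> (PAGE_SHIFT - sbp->sb_blocklog)

which only makes sense while PAGE_SHIFT >= sb_blocklog - and that is
exactly the assertion that now trips once the block size exceeds the
page size.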

It fails here (8GB ram disk):

#mkfs.xfs -f -b size=64k /dev/ram0
meta-data=/dev/ram0              isize=512    agcount=4, agsize=32000 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=65536  blocks=128000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0, ftype=1
log      =internal log           bsize=65536  blocks=1024, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=65536  blocks=0, rtextents=0
# mount /dev/ram0 /mnt/test
[   34.047433] XFS (ram0): Mounting V5 Filesystem 074579ae-9c33-447a-a336-8ea99cda87c3
[   34.053962] BUG: Bad rss-counter state mm:00000000b41e2cf6 type:MM_FILEPAGES val:11
[   34.054451] general protection fault, probably for non-canonical address 0x4002888237d00000: 0000 [#1] PREEMPT SMP
[   34.057426] psi: task underflow! cpu=8 t=2 tasks=[0 0 0 0] clear=4 set=0
[   34.065011] CPU: 2 PID: 3689 Comm: mount Not tainted 6.4.0-rc6-dgc+ #1832
[   34.068647] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   34.073236] RIP: 0010:__submit_bio+0x9e/0x110
[   34.075124] Code: c3 e8 d6 c5 88 ff 48 8b 43 08 48 89 df 4c 8b 60 10 49 8b 44 24 60 ff 10 49 8b 5c 24 68 e8 3a 92 88 ff 48 8b 43 10 a8 03 75 56 <65> 48 8
[   34.084879] RSP: 0018:ffffc900045c3a10 EFLAGS: 00010246
[   34.087501] RAX: 4003000000000000 RBX: ffff8885c1568000 RCX: 0000000000000080
[   34.090455] RDX: ffff888805d8a900 RSI: 0000000000000286 RDI: ffffc900045c3ad8
[   34.093419] RBP: ffffc900045c3a20 R08: 0000000000000000 R09: 0000000000000000
[   34.096381] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88880124b000
[   34.099340] R13: ffff8885c1568000 R14: ffff888100620000 R15: 0000000000fa0000
[   34.102285] FS:  00007f1b86428840(0000) GS:ffff888237d00000(0000) knlGS:0000000000000000
[   34.105410] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   34.107577] CR2: 00007f1b86364000 CR3: 000000010482d003 CR4: 0000000000060ee0
[   34.110114] Call Trace:
[   34.111031]  <TASK>
[   34.111782]  ? show_regs+0x61/0x70
[   34.112945]  ? die_addr+0x37/0x90
[   34.114080]  ? exc_general_protection+0x19e/0x3b0
[   34.115619]  ? asm_exc_general_protection+0x27/0x30
[   34.117162]  ? __submit_bio+0x9e/0x110
[   34.118355]  submit_bio_noacct_nocheck+0xf3/0x330
[   34.119819]  submit_bio_noacct+0x196/0x490
[   34.121042]  submit_bio+0x58/0x60
[   34.122045]  submit_bio_wait+0x70/0xd0
[   34.123178]  xfs_rw_bdev+0x188/0x1b0
[   34.124255]  xlog_do_io+0x95/0x170
[   34.125283]  xlog_bwrite+0x14/0x20
[   34.126310]  xlog_write_log_records+0x179/0x260
[   34.127637]  xlog_clear_stale_blocks+0xa5/0x1c0
[   34.128917]  xlog_find_tail+0x372/0x3b0
[   34.130011]  xlog_recover+0x2f/0x190
[   34.131041]  xfs_log_mount+0x1b8/0x350
[   34.132055]  xfs_mountfs+0x451/0x9a0
[   34.133019]  xfs_fs_fill_super+0x4d9/0x920
[   34.134113]  get_tree_bdev+0x16e/0x270
[   34.135130]  ? xfs_open_devices+0x230/0x230
[   34.136184]  xfs_fs_get_tree+0x15/0x20
[   34.137144]  vfs_get_tree+0x24/0xd0
[   34.138041]  path_mount+0x2fd/0xae0
[   34.138955]  ? putname+0x53/0x60
[   34.139749]  __x64_sys_mount+0x108/0x140
[   34.140698]  do_syscall_64+0x34/0x80
[   34.141574]  entry_SYSCALL_64_after_hwframe+0x63/0xcd


Hmmmm - that's logical sector aligned/sized IO that is failing like
this. The fs is using 64kB block size, 512 byte sector size.

So I went looking.

# blockdev --report /dev/ram0
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512  4096               0      8388608000   /dev/ram0
#

Huh. sector size is fine, block size for the device isn't.

# cat /sys/block/ram0/queue/physical_block_size
65536
#

Yup, brd definitely picked up that it is supposed to be using 64kB
blocks.

# blockdev --setbsz 65536 /dev/ram0
blockdev: ioctl error on BLKBSZSET: Invalid argument
#

Huh.

<dig dig dig>

int set_blocksize(struct block_device *bdev, int size)
{
        /* Size must be a power of two, and between 512 and PAGE_SIZE */
        if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
                return -EINVAL;
	.....

Yeah, ok. The block device doesn't support 64kB block sizes. Lucky
that XFS uses this as its sector size:

# mkfs.xfs -f -b size=64k -s size=16k /dev/ram0
....
# mount /dev/ram0 /mnt/test
[  692.564375] XFS (ram0): Cannot set_blocksize to 16384 on device ram0
<mount fails>
#

Now expected. I wonder if the problem is 512 byte sector sizes....

# mkfs.xfs -f -s size=4k -b size=64k /dev/ram0
meta-data=/dev/ram0              isize=512    agcount=4, agsize=32000 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=65536  blocks=128000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0, ftype=1
log      =internal log           bsize=65536  blocks=1024, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=65536  blocks=0, rtextents=0
# mount /dev/ram0 /mnt/test
[  835.711473] XFS (ram0): Mounting V5 Filesystem 72743c95-1264-43cd-8867-1f2b2e30ba24
[  835.722700] XFS (ram0): Ending clean mount
#

Okay, there we go. The patchset appears to have some kind of problem
with the filesystem using the minimum logical sector size of 512
bytes on this modified driver. Setting sector size == PAGE_SIZE
allows the filesystem to mount, but brd should not break if logical
sector aligned/sized IO is done.

$ sudo xfs_io -f -d -c "pwrite -b 1M 0 1M" -c "pread -v 0 1M" /mnt/test/foo
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0004 sec (2.266 GiB/sec and 2320.1856 ops/sec)
.....
000ffff0:  cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd  ................
read 1048576/1048576 bytes at offset 0
1 MiB, 16 ops; 0.9233 sec (1.083 MiB/sec and 17.3284 ops/sec)
$

Ok, direct IO works just fine.

$ xfs_io -c "pwrite -S 0xaa -b 1M 0 1M" -c "pread 0 1M -v" /mnt/test/foo
.....
000ffff0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
read 1048576/1048576 bytes at offset 0
1 MiB, 16 ops; 0.7035 sec (1.421 MiB/sec and 22.7428 ops/sec)
$

Ok, well aligned buffered IO looks like it works.

Right, let's step it up and do some more complex stuff. Let's run a
basic fsx pass on the filesystem:


$ sudo ~/src/xfstests-dev/ltp/fsx -d /mnt/test/baz
Seed set to 1
main: filesystem does not support dedupe range, disabling!
main: filesystem does not support exchange range, disabling!
truncating to largest ever: 0x3aea7
2 trunc from 0x0 to 0x3aea7
3 copy  from 0x1a3d6 to 0x26608, (0xc232 bytes) at 0x2ea8c
Segmentation fault
$

And there's the boom, only 3 operations into the test. This is kinda
what I expected - getting fsx to run for billions of ops without
failure might take a while.

Huh, why did it say FIDEDUPERANGE was not supported - that's weird,
something is broken there, maybe the fsx test.

[ 1787.365339] ------------[ cut here ]------------
[ 1787.368623] kernel BUG at include/linux/pagemap.h:1248!
[ 1787.371488] invalid opcode: 0000 [#1] PREEMPT SMP
[ 1787.374814] CPU: 10 PID: 5153 Comm: fsx Not tainted 6.4.0-rc6-dgc+ #1832
[ 1787.377240] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 1787.383061] RIP: 0010:read_pages+0x11d/0x230
[ 1787.385268] Code: 4c 89 e7 e8 c5 14 ff ff f0 41 ff 4c 24 34 0f 85 55 ff ff ff 4c 89 e7 e8 61 1d 00 00 8b 73 24 8b 43 20 39 f0 0f 83 4d ff ff ff <0f> 0b 0
[ 1787.395078] RSP: 0018:ffffc90004113918 EFLAGS: 00010283
[ 1787.396636] RAX: 0000000000000001 RBX: ffffc90004113ab0 RCX: 0000000000000000
[ 1787.400357] RDX: 0000000000001000 RSI: 0000000000000010 RDI: ffffea0017421000
[ 1787.403989] RBP: ffffc90004113960 R08: 0000000000001000 R09: 0000000000000000
[ 1787.407915] R10: ffff8885d084a000 R11: ffffc900041137d8 R12: ffffffff822b2e60
[ 1787.411472] R13: 000000000000001b R14: ffffea0017421000 R15: ffff8885c0cc8318
[ 1787.415342] FS:  00007f96c86fcc40(0000) GS:ffff8885fed00000(0000) knlGS:0000000000000000
[ 1787.418404] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1787.420717] CR2: 000055ecf4289908 CR3: 00000005feceb001 CR4: 0000000000060ee0
[ 1787.423110] Call Trace:
[ 1787.424214]  <TASK>
[ 1787.425606]  ? show_regs+0x61/0x70
[ 1787.426568]  ? die+0x37/0x90
[ 1787.427584]  ? do_trap+0xec/0x100
[ 1787.428959]  ? do_error_trap+0x6c/0x90
[ 1787.430672]  ? read_pages+0x11d/0x230
[ 1787.432259]  ? exc_invalid_op+0x52/0x70
[ 1787.433312]  ? read_pages+0x11d/0x230
[ 1787.434544]  ? asm_exc_invalid_op+0x1b/0x20
[ 1787.436038]  ? read_pages+0x11d/0x230
[ 1787.437442]  ? read_pages+0x5c/0x230
[ 1787.438943]  page_cache_ra_unbounded+0x128/0x1b0
[ 1787.440419]  do_page_cache_ra+0x6c/0x70
[ 1787.441765]  ondemand_readahead+0x31f/0x350
[ 1787.443426]  page_cache_sync_ra+0x49/0x50
[ 1787.445070]  filemap_get_pages+0x10e/0x680
[ 1787.446259]  ? xfs_ilock+0xc1/0x220
[ 1787.447426]  filemap_read+0xed/0x380
[ 1787.448632]  ? kmem_cache_free+0x1f5/0x480
[ 1787.449926]  ? xfs_log_ticket_put+0x2f/0x60
[ 1787.451152]  ? xfs_inode_item_release+0x2e/0xa0
[ 1787.453128]  generic_file_read_iter+0xdb/0x160
[ 1787.454527]  xfs_file_buffered_read+0x54/0xd0
[ 1787.455894]  xfs_file_read_iter+0x74/0xe0
[ 1787.457544]  generic_file_splice_read+0x8c/0x150
[ 1787.460094]  do_splice_to+0x85/0xb0
[ 1787.461285]  splice_direct_to_actor+0xb3/0x210
[ 1787.462336]  ? pipe_to_sendpage+0xa0/0xa0
[ 1787.463287]  do_splice_direct+0x92/0xd0
[ 1787.464203]  vfs_copy_file_range+0x2af/0x560
[ 1787.465229]  __do_sys_copy_file_range+0xe3/0x1f0
[ 1787.466429]  __x64_sys_copy_file_range+0x24/0x30
[ 1787.468053]  do_syscall_64+0x34/0x80
[ 1787.469392]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

static inline struct folio *__readahead_folio(struct readahead_control *ractl)
{
        struct folio *folio;

>>>>>   BUG_ON(ractl->_batch_count > ractl->_nr_pages);
        ractl->_nr_pages -= ractl->_batch_count;
        ractl->_index += ractl->_batch_count;

....

So something is going wrong in the readahead path from a splice
operation issued by copy_file_range().

..... Wait, what?

Why is it splicing rather than doing a remap operation?  'cp
--reflink=always bar bar2' appears to work fine, so it's unexpected
that it's copying data rather than cloning extents. Something is
going wrong there...

.....

Ok, that's enough time spent on this right now. The BS > PS stuff in
this patchset doesn't allow filesystems to work correctly, 
and the reasons for things going wrong are not obvious.

I suspect that this is going to take quite some work just to muscle
through all these whacky corner cases - fsx will find a lot of
them; you'll need to work through them until it runs without fail
for a couple of billion ops.

The patch I was using is below.

Cheers,

Dave.
Hannes Reinecke June 15, 2023, 6:21 a.m. UTC | #8
On 6/15/23 01:53, Dave Chinner wrote:
> On Wed, Jun 14, 2023 at 05:06:14PM +0200, Hannes Reinecke wrote:
>> On 6/14/23 15:53, Matthew Wilcox wrote:
>>> On Wed, Jun 14, 2023 at 03:17:25PM +0200, Hannes Reinecke wrote:
>>>> Turns out that was quite easy to fix (just remove the check in
>>>> set_blocksize()), but now I get this:
>>>>
>>>> SGI XFS with ACLs, security attributes, quota, no debug enabled
>>>> XFS (ram0): File system with blocksize 16384 bytes. Only pagesize (4096) or
>>>> less will currently work.
>>>
>>> What happens if you just remove this hunk:
>>>
>>> +++ b/fs/xfs/xfs_super.c
>>> @@ -1583,18 +1583,6 @@ xfs_fs_fill_super(
>>>                   goto out_free_sb;
>>>           }
>>>
>>> -       /*
>>> -        * Until this is fixed only page-sized or smaller data blocks work.
>>> -        */
>>> -       if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
>>> -               xfs_warn(mp,
>>> -               "File system with blocksize %d bytes. "
>>> -               "Only pagesize (%ld) or less will currently work.",
>>> -                               mp->m_sb.sb_blocksize, PAGE_SIZE);
>>> -               error = -ENOSYS;
>>> -               goto out_free_sb;
>>> -       }
>>> -
>>>           /* Ensure this filesystem fits in the page cache limits */
>>>           if (xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_dblocks) ||
>>>               xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_rblocks)) {
>>
>> Whee! That works!
>>
>> Rebased things with your memcpy_{to,from}_folio() patches, disabled that
>> chunk, and:
>>
>> # mount /dev/ram0 /mnt
>> XFS (ram0): Mounting V5 Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
>> XFS (ram0): Ending clean mount
>> xfs filesystem being mounted at /mnt supports timestamps until 2038-01-19
>> (0x7fffffff)
>> # umount /mnt
>> XFS (ram0): Unmounting Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
> 
> Mounting the filesystem doesn't mean it works. XFS metadata has
> always worked with bs > ps, and mounting the filesystem only does
> metadata IO.
> 
> It's not until you start reading/writing user data that the
> filesystem will start exercising the page cache....
> 
>> Great work, Matthew!
>>
>> (Now I just need to check why copying data from NFS crashes ...)
> 
> .... and then we see it doesn't actually work. :)
> 
> Likely you also need the large folio support in the iomap write path
> patches from Willy, plus whatever corner cases in iomap that still
> have implicit dependencies on PAGE_SIZE need to be fixed (truncate,
> invalidate, sub-block zeroing, etc may not be doing exactly the
> right thing).
> 
These are built on top of the mm-unstable branch from akpm, which does 
include the iomap write path patches from Willy, so yes, I know.

> All you need to do now is run the BS > PS filesytems through a full
> fstests pass (reflink + rmap enabled, auto group), and then we can
> start on the real data integrity validation work. It'll need tens of
> billions of fsx ops run on it, days of recoveryloop testing, days of
> fstress based exercise, etc before we can actually enable it in
> XFS....
> 
Hey, c'mon. I do know _that_. All I'm saying is that now we can _start_
running tests and figure out corner cases (like NFS crashing on me :-).
With this patchset we now have some infrastructure in place making it
even _possible_ to run those tests.

Don't be so pessimistic ...

Cheers,

Hannes
Dave Chinner June 15, 2023, 8:51 a.m. UTC | #9
On Thu, Jun 15, 2023 at 08:21:10AM +0200, Hannes Reinecke wrote:
> On 6/15/23 01:53, Dave Chinner wrote:
> > On Wed, Jun 14, 2023 at 05:06:14PM +0200, Hannes Reinecke wrote:
> > All you need to do now is run the BS > PS filesystems through a full
> > fstests pass (reflink + rmap enabled, auto group), and then we can
> > start on the real data integrity validation work. It'll need tens of
> > billions of fsx ops run on it, days of recoveryloop testing, days of
> > fsstress based exercise, etc before we can actually enable it in
> > XFS....
> > 
> Hey, c'mon. I do know _that_. All I'm saying is that now we can _start_
> running tests and figure out corner cases (like NFS crashing on me :-).
> With this patchset we now have some infrastructure in place making it
> even _possible_ to run those tests.

I got to this same point several years ago. You know, that patchset
that Luis went back to when he brought up this whole topic again?
That's right when I started running fsx, and I realised it
didn't cover FICLONERANGE, FIDEDUPERANGE and copy_file_range().

Yep, that's when we first realised we had -zero- test coverage of
those operations. Darrick and I spent the next *3 months* pretty
much rewriting the VFS level of those operations and fixing all the
other bugs in the implementations, just so we could validate they
worked correctly on BS <= PS.

But by then Willy had started working over iomap and filemap for
folios, and the bs > PS patches were completely bitrotted and needed
rewriting from scratch. Which I now didn't have time to do....

So today is deja vu all over again: the first time I run fsx on
a new 64kB BS on 4KB PS implementation it hangs doing something
-really weird- and unexpected in copy_file_range(). It shouldn't
even be in the splice code doing a physical data copy.  So something
went wrong in ->remap_file_range(), or maybe in the truncate before
it, before it bugged out over out of range readahead in the page
cache...

I got only 3 fsx ops in today, and at least three bugs have already
manifested themselves....

> Don't be so pessimistic ...

I'm not pessimistic. I'm being realistic. I'm speaking from
experience. Not just as a Linux filesystem engineer who has had to
run this fsx-based data integrity validation process from the ground
up multiple times in the past decade, but also as an Irix OS
engineer that spent many, many hours digging out nasty, subtle bs > ps
data corruption bugs of the Irix buffer/page cache.

Cheers,

Dave.
Kent Overstreet June 16, 2023, 4:06 p.m. UTC | #10
On Thu, Jun 15, 2023 at 06:51:35PM +1000, Dave Chinner wrote:
> On Thu, Jun 15, 2023 at 08:21:10AM +0200, Hannes Reinecke wrote:
> > On 6/15/23 01:53, Dave Chinner wrote:
> > > On Wed, Jun 14, 2023 at 05:06:14PM +0200, Hannes Reinecke wrote:
> > > All you need to do now is run the BS > PS filesystems through a full
> > > fstests pass (reflink + rmap enabled, auto group), and then we can
> > > start on the real data integrity validation work. It'll need tens of
> > > billions of fsx ops run on it, days of recoveryloop testing, days of
> > > fsstress based exercise, etc before we can actually enable it in
> > > XFS....
> > > 
> > Hey, c'mon. I do know _that_. All I'm saying is that now we can _start_
> > running tests and figure out corner cases (like NFS crashing on me :-).
> > With this patchset we now have some infrastructure in place making it
> > even _possible_ to run those tests.
> 
> I got to this same point several years ago. You know, that patchset
> that Luis went back to when he brought up this whole topic again?
> That's right when I started running fsx, and I realised it
> didn't cover FICLONERANGE, FIDEDUPERANGE and copy_file_range().
> 
> Yep, that's when we first realised we had -zero- test coverage of
> those operations. Darrick and I spent the next *3 months* pretty
> much rewriting the VFS level of those operations and fixing all the
> other bugs in the implementations, just so we could validate they
> worked correctly on BS <= PS.

code coverage analysis...