| Message ID | 20240614100329.1203579-2-hch@lst.de (mailing list archive) |
|---|---|
| State | New |
| Series | nfs: fix nfs_swap_rw for large-folio swap |
On Fri, 2024-06-14 at 12:03 +0200, Christoph Hellwig wrote:
> As of Linux 6.10-rc the MM can swap out larger than page size chunks.
> NFS has all code ready to handle this, but has a VM_BUG_ON that
> triggers when this happens. Simply remove the VM_BUG_ON to fix this
> use case.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/nfs/direct.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index bb2f583eb28bf1..90079ca134dd3c 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -141,8 +141,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
>  {
>  	ssize_t ret;
>
> -	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
> -
>  	if (iov_iter_rw(iter) == READ)
>  		ret = nfs_file_direct_read(iocb, iter, true);
>  	else

This definitely seems wrong in a large folio world.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
On Fri, 14 Jun 2024 12:03:25 +0200 Christoph Hellwig <hch@lst.de> wrote:

> As of Linux 6.10-rc the MM can swap out larger than page size chunks.
> NFS has all code ready to handle this, but has a VM_BUG_ON that
> triggers when this happens. Simply remove the VM_BUG_ON to fix this
> use case.
>
> ...
>
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -141,8 +141,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
>  {
>  	ssize_t ret;
>
> -	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
> -
>  	if (iov_iter_rw(iter) == READ)
>  		ret = nfs_file_direct_read(iocb, iter, true);
>  	else

I'm thinking this should precede "mm: swap: entirely map large folios
found in swapcache", or be a part of it.

Barry/Chuanhua, any opinions?
On Sun, Jun 16, 2024 at 5:55 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 14 Jun 2024 12:03:25 +0200 Christoph Hellwig <hch@lst.de> wrote:
>
> > As of Linux 6.10-rc the MM can swap out larger than page size chunks.
> > NFS has all code ready to handle this, but has a VM_BUG_ON that
> > triggers when this happens. Simply remove the VM_BUG_ON to fix this
> > use case.

As I understand it, this isn't happening because we don't support
mTHP swapping out to a swapfile, whether it's on NFS or any other
filesystem:

static int scan_swap_map_slots(struct swap_info_struct *si,
			       unsigned char usage, int nr,
			       swp_entry_t slots[], int order)
{
	...
	if (order > 0) {
		...
		/*
		 * Swapfile is not block device or not using clusters so unable
		 * to allocate large entries.
		 */
		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
			return 0;
	}
}

However, I'm pleased to see this patch, as we might need it someday.

> > ...
> >
> > --- a/fs/nfs/direct.c
> > +++ b/fs/nfs/direct.c
> > @@ -141,8 +141,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
> >  {
> >  	ssize_t ret;
> >
> > -	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
> > -
> >  	if (iov_iter_rw(iter) == READ)
> >  		ret = nfs_file_direct_read(iocb, iter, true);
> >  	else
>
> I'm thinking this should precede "mm: swap: entirely map large folios
> found in swapcache", or be a part of it.
>
> Barry/Chuanhua, any opinions?

The patch "mm: swap: entirely map large folios found in swapcache" does
not perform any I/O operations; it only maps the accessed large folio
that is already in the swapcache. Therefore, this patch can be
considered independent.

BTW, we don't yet support swapping out mTHP to an NFS swapfile. So this
patch is future-proof and can serve as a precedent if we want to extend
mTHP swapout/swapin to non-blkdev systems.

Please correct me if I'm wrong, Ryan.

Thanks
Barry
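[Editor's illustration] The gate Barry quotes can be exercised without a kernel: the sketch below is a hypothetical userspace model (struct swap_info_model and can_allocate_order are invented names; only the SWP_BLKDEV/cluster_info logic mirrors the quoted kernel code), showing why a swapfile on NFS only ever gets order-0 allocations from this path.

```c
#include <stddef.h>

#define SWP_BLKDEV 0x1	/* models the kernel flag: swap backing store is a block device */

/* Minimal stand-in for the fields of struct swap_info_struct the gate reads. */
struct swap_info_model {
	unsigned int flags;
	void *cluster_info;	/* non-NULL when the device tracks swap clusters */
};

/*
 * Models the order > 0 check in scan_swap_map_slots(): large swap entries
 * are only allocated on cluster-tracking block devices; a swapfile on NFS
 * (no SWP_BLKDEV) gets 0 back, i.e. falls back to page-by-page swapout.
 */
static int can_allocate_order(const struct swap_info_model *si, int order)
{
	if (order > 0) {
		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
			return 0;
	}
	return 1;
}
```

Under this model an NFS swapfile should only ever see order-0 writes from the allocator, which is exactly why the VM_BUG_ON firing was surprising at this point in the thread.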
On Sun, Jun 16, 2024 at 12:16:10PM +1200, Barry Song wrote:
> As I understand it, this isn't happening because we don't support
> mTHP swapping out to a swapfile, whether it's on NFS or any
> other filesystem.

It does happen. The reason why I sent this patch is because I observed
the BUG_ON trigger on a trivial swap generation workload (usemem.c from
xfstests).
On Sun, Jun 16, 2024 at 4:54 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Sun, Jun 16, 2024 at 12:16:10PM +1200, Barry Song wrote:
> > As I understand it, this isn't happening because we don't support
> > mTHP swapping out to a swapfile, whether it's on NFS or any
> > other filesystem.
>
> It does happen. The reason why I sent this patch is because I observed
> the BUG_ON trigger on a trivial swap generation workload (usemem.c from
> xfstests).

This is quite unusual. Could you share your setup and backtrace? I'd
like to reproduce the issue, as the mm code only supports mTHP swapout
on block devices. What is your swap device or swap file? Additionally,
on what kind of filesystem is the executable built from usemem.c
located?
On Sun, Jun 16, 2024 at 06:23:16PM +0800, Barry Song wrote:
> This is quite unusual. Could you share your setup and backtrace?

Do this in an x86 kvm vm with 1G of memory:

mkfs.xfs -f /dev/vdb
mount /dev/vdb /mnt/nfs1

# this requires the following line in /etc/exports to actually export the
# file system: "/mnt/nfs1 localhost(rw,no_root_squash)"
systemctl restart nfs-kernel-server

mount -t nfs 127.0.0.1:/mnt/nfs1/ /mnt/test/

cd /mnt/test
dd if=/dev/zero of=swapfile bs=1024 count=524288
mkswap swapfile
chmod 0600 swapfile
swapon swapfile

# xfstests-dev is a checked out and built xfstests repo
~/xfstests-dev/src/usemem 1024

The backtrace is at the end of the mail.

> I'd like to reproduce the issue, as the mm code only supports mTHP
> swapout on block devices. What is your swap device or swap file?

The above swap file on NFS is the only swap in the system.

> Additionally, on what kind of filesystem is the executable file built
> from usemem.c located?

Doesn't matter. I've tried xfs, ext4 and nfs itself.

root@testvm:~# ./xfstests-dev/src/usemem 1024
[ 65.065292] ------------[ cut here ]------------
[ 65.065528] kernel BUG at fs/nfs/direct.c:144!
[ 65.065724] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 65.065988] CPU: 0 PID: 71 Comm: kswapd0 Not tainted 6.10.0-rc3+ #2587
[ 65.066270] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 65.066683] RIP: 0010:nfs_swap_rw+0x3f/0x50
[ 65.066893] Code: 00 00 74 13 e8 e2 fb ff ff 31 d2 48 85 c0 48 0f 4f c2 c3 cc cc cc cc e8 9f f8 ff ff 31 0
[ 65.067698] RSP: 0018:ffffc900001b79c8 EFLAGS: 00010206
[ 65.067949] RAX: ffffffff8162afc0 RBX: ffff88800ca24c00 RCX: 0000000000000004
[ 65.068274] RDX: ffff88800ca24c30 RSI: ffffc900001b79d0 RDI: ffff88800ca24c00
[ 65.068588] RBP: ffff88800dd710e8 R08: 0000000000004000 R09: 0000000000000010
[ 65.068911] R10: 0000000000035a40 R11: 000000000000001d R12: ffffc900001b7a78
[ 65.069225] R13: ffffc900001b7db0 R14: ffffc900001b7a68 R15: ffffea00003f1c40
[ 65.069542] FS: 0000000000000000(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
[ 65.069904] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 65.070157] CR2: 00005606ddec4dd0 CR3: 000000000365a003 CR4: 0000000000770ef0
[ 65.070472] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 65.070784] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 65.071282] PKRU: 55555554
[ 65.071407] Call Trace:
[ 65.071519]  <TASK>
[ 65.071619]  ? die+0x31/0x80
[ 65.071754]  ? do_trap+0xd7/0x100
[ 65.071915]  ? do_error_trap+0x65/0x90
[ 65.072084]  ? nfs_swap_rw+0x3f/0x50
[ 65.072247]  ? exc_invalid_op+0x50/0x70
[ 65.072420]  ? nfs_swap_rw+0x3f/0x50
[ 65.072581]  ? asm_exc_invalid_op+0x1a/0x20
[ 65.072770]  ? __pfx_nfs_swap_rw+0x10/0x10
[ 65.072966]  ? nfs_swap_rw+0x3f/0x50
[ 65.073125]  swap_write_unplug+0x57/0x90
[ 65.073307]  shrink_folio_list+0x3a1/0xed0
[ 65.073500]  ? __mod_zone_page_state+0x61/0xa0
[ 65.073700]  ? isolate_lru_folios+0x2f3/0x400
[ 65.073902]  shrink_lruvec+0x587/0xb60
[ 65.074071]  ? shrink_slab+0x3ad/0x3c0
[ 65.074239]  ? shrink_slab+0x3ad/0x3c0
[ 65.074408]  shrink_node+0x2b7/0x7e0
[ 65.074569]  balance_pgdat+0x291/0x6f0
On Mon, Jun 17, 2024 at 5:32 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Sun, Jun 16, 2024 at 06:23:16PM +0800, Barry Song wrote:
> > This is quite unusual. Could you share your setup and backtrace?
>
> Do this in an x86 kvm vm with 1G of memory:
>
> mkfs.xfs -f /dev/vdb
> mount /dev/vdb /mnt/nfs1
>
> # this requires the following line in /etc/exports to actually export the
> # file system: "/mnt/nfs1 localhost(rw,no_root_squash)"
> systemctl restart nfs-kernel-server
>
> mount -t nfs 127.0.0.1:/mnt/nfs1/ /mnt/test/
>
> cd /mnt/test
> dd if=/dev/zero of=swapfile bs=1024 count=524288
> mkswap swapfile
> chmod 0600 swapfile
> swapon swapfile

I did everything the same as above, but always got a failure at the
last step, swapon:

/mnt/test # swapon swapfile
swapon: /mnt/test/swapfile: swapon failed: Invalid argument

the mount status:

/dev/vdb on /mnt/nfs1 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
127.0.0.1:/mnt/nfs1 on /mnt/test type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)

As long as the swapfile is not on nfs, I am ok to swapon:

/mnt # swapon swapfile
[ 1150.302641] Adding 65532k swap on /mnt/swapfile.  Priority:-2 extents:1 across:65532k

Anything I am missing?

> # xfstests-dev is a checked out and built xfstests repo
> ~/xfstests-dev/src/usemem 1024
>
> The backtrace is at the end of the mail.
>
> > I'd like to reproduce the issue, as the mm code only supports mTHP
> > swapout on block devices. What is your swap device or swap file?
>
> The above swap file on NFS is the only swap in the system.
>
> > Additionally, on what kind of filesystem is the executable file built
> > from usemem.c located?
>
> Doesn't matter. I've tried xfs, ext4 and nfs itself.
>
> root@testvm:~# ./xfstests-dev/src/usemem 1024
> [ 65.065292] ------------[ cut here ]------------
> [ 65.065528] kernel BUG at fs/nfs/direct.c:144!
On 16/06/2024 11:23, Barry Song wrote:
> On Sun, Jun 16, 2024 at 4:54 PM Christoph Hellwig <hch@lst.de> wrote:
>>
>> On Sun, Jun 16, 2024 at 12:16:10PM +1200, Barry Song wrote:
>>> As I understand it, this isn't happening because we don't support
>>> mTHP swapping out to a swapfile, whether it's on NFS or any
>>> other filesystem.
>>
>> It does happen. The reason why I sent this patch is because I observed
>> the BUG_ON trigger on a trivial swap generation workload (usemem.c from
>> xfstests).
>
> This is quite unusual. Could you share your setup and backtrace? I'd
> like to reproduce the issue, as the mm code only supports mTHP
> swapout on block devices. What is your swap device or swap file?
> Additionally, on what kind of filesystem is the executable file built
> from usemem.c located?

Yes, I'm also confused by this, since as Barry says, the swap-out
changes to support mTHP are only intended to be activated when the swap
device is a non-rotating block device - swap files on file systems are
explicitly not supported and all swapping should be done page-by-page in
that case. This constraint is exactly the same as for the pre-existing
PMD-size THP swap-out support. So if you are seeing large folios being
written after the mTHP swap-out change, you should also be seeing large
folios before this change.

Hopefully the stack trace will tell us what's going on here.

(Sorry for my slow responses/lack of engagement over the last month;
it's been a combination of paternity leave/lack of sleep/working on
other things. I'm hoping to get properly back into this stuff within
the next couple of weeks.)

Thanks,
Ryan
On Mon, Jun 17, 2024 at 8:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 16/06/2024 11:23, Barry Song wrote:
> > ...
>
> Yes, I'm also confused by this, since as Barry says, the swap-out changes to
> support mTHP are only intended to be activated when the swap device is a
> non-rotating block device - swap files on file systems are explicitly not
> supported and all swapping should be done page-by-page in that case. This
> constraint is exactly the same as for the pre-existing PMD-size THP swap-out
> support. So if you are seeing large folios being written after the mTHP
> swap-out change, you should also be seeing large folios before this change.
>
> Hopefully the stack trace will tell us what's going on here.

Hi Ryan, Christoph,

I am able to reproduce the issue now. I am debugging and will share the
root cause with you this week.

Initial investigation shows the issue might *not* be related to
THP_SWPOUT.
I am even able to reproduce it after disabling thp and mthp, entirely by
small folios:

[ 215.925069] folio_alloc_swap folio nr:1 anon:1 swapbacked:1
[ 215.926383] vmscan: shrink_folio_list folio nr:1 anon:1 swapbacked:1
[ 215.927008] folio_alloc_swap folio nr:1 anon:1 swapbacked:1
[ 215.929368] ------------[ cut here ]------------
[ 215.929824] kernel BUG at fs/nfs/direct.c:144!
[ 215.930403] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[ 215.931264] Modules linked in:
[ 215.932328] CPU: 3 PID: 214 Comm: mthp_swpout_tes Not tainted 6.10.0-rc3-ga12328d9fb85-dirty #292
[ 215.932953] Hardware name: linux,dummy-virt (DT)
[ 215.933461] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 215.934030] pc : nfs_swap_rw+0x60/0x70
[ 215.935079] lr : swap_write_unplug+0x64/0xb0
[ 215.935559] sp : ffff800087363280
[ 215.935958] x29: ffff800087363280 x28: ffff0000c3241800 x27: fffffdffc323a4c0
[ 215.937012] x26: fffffdffc323a4c8 x25: ffff0001b4a51500 x24: ffff80008250f670
[ 215.937893] x23: 0000000000000001 x22: ffff0000c0b2da00 x21: 0000000000020000
[ 215.938734] x20: ffff0000c46a8bd8 x19: ffff0000c154f800 x18: ffffffffffffffff
[ 215.939594] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800107363097
[ 215.940591] x14: 0000000000000000 x13: 313a64656b636162 x12: 7061777320313a6e
[ 215.941621] x11: 6f6e6120313a726e x10: ffff800083e86318 x9 : ffff8000803e9ad4
[ 215.942673] x8 : ffff800087363168 x7 : 0000000000000000 x6 : ffff0001adbfa4c6
[ 215.943674] x5 : 0000000000000002 x4 : 0000000000020000 x3 : 0000000000020000
[ 215.944673] x2 : ffff8000806015e8 x1 : ffff8000873632a0 x0 : ffff0000c154f800
[ 215.945568] Call trace:
[ 215.945906]  nfs_swap_rw+0x60/0x70
[ 215.946351]  __swap_writepage+0x2e8/0x328
[ 215.946775]  swap_writepage+0x68/0xd0
[ 215.947184]  pageout+0xe4/0x430
[ 215.947587]  shrink_folio_list+0x9bc/0xf60
[ 215.947992]  reclaim_folio_list+0x8c/0x168
[ 215.948454]  reclaim_pages+0xfc/0x178
[ 215.948843]  madvise_cold_or_pageout_pte_range+0x8d8/0xf28
[ 215.949285]  walk_pgd_range+0x390/0x808
[ 215.949660]  __walk_page_range+0x1e0/0x1f0
[ 215.950040]  walk_page_range+0x1f0/0x2c8
[ 215.950458]  madvise_pageout+0xf8/0x280
[ 215.950905]  madvise_vma_behavior+0x314/0xa20
[ 215.951361]  madvise_walk_vmas+0xc0/0x128
[ 215.951807]  do_madvise.part.0+0x110/0x558
[ 215.952298]  __arm64_sys_madvise+0x68/0x88
[ 215.952723]  invoke_syscall+0x50/0x128
[ 215.953148]  el0_svc_common.constprop.0+0x48/0xf8
[ 215.953592]  do_el0_svc+0x28/0x40
[ 215.954036]  el0_svc+0x50/0x150
[ 215.954610]  el0t_64_sync_handler+0x13c/0x158
[ 215.955070]  el0t_64_sync+0x1a4/0x1a8
[ 215.955685] Code: a8c17bfd d50323bf 9a9fd000 d65f03c0 (d4210000)
[ 215.956510] ---[ end trace 0000000000000000 ]---

> (Sorry for my slow responses/lack of engagement over the last month; its
> been a combination of paternity leave/lack of sleep/working on other
> things. I'm hoping to get properly back into this stuff within the next
> couple of weeks).
>
> Thanks,
> Ryan
On 17/06/2024 10:40, Barry Song wrote:
> On Mon, Jun 17, 2024 at 8:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> ...
>>
>> Hopefully the stack trace will tell us what's going on here.
>
> Hi Ryan, Christoph,
>
> I am able to reproduce the issue now. I am debugging and will share the
> root cause with you this week.

Ahh great; for some reason I'm not receiving Christoph's mails so I
didn't see the stack trace and instructions until you replied to them.
I had a go at repro'ing too but am failing to even get the systemd nfs
service to start. I'll leave it to you.
On Mon, Jun 17, 2024 at 08:02:47PM +1200, Barry Song wrote:
> I did everything the same as above, but always got a failure at the
> last step, swapon:
>
> /mnt/test # swapon swapfile
> swapon: /mnt/test/swapfile: swapon failed: Invalid argument

You are probably missing

	CONFIG_NFS_SWAP=y

in your .config.
On Tue, Jun 18, 2024 at 5:52 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Jun 17, 2024 at 08:02:47PM +1200, Barry Song wrote:
> > I did everything the same as above, but always got a failure at the
> > last step, swapon:
> >
> > /mnt/test # swapon swapfile
> > swapon: /mnt/test/swapfile: swapon failed: Invalid argument
>
> You are probably missing
>
> 	CONFIG_NFS_SWAP=y
>
> in your .config.

Yes, that was exactly what I missed. I then reproduced the issue and
discovered that the root cause was unrelated to large folios. It was
actually due to a batched bio plugging optimization from 2022. You can
find the new patch here:

https://lore.kernel.org/linux-mm/20240617220135.43563-1-21cnbao@gmail.com/
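[Editor's illustration] The optimization Barry refers to batches physically adjacent single-page swap writes behind a plug and submits them as one request, so the iov_iter that eventually reaches nfs_swap_rw can span several pages even when every folio is order-0. The toy model below is a hypothetical userspace sketch of that coalescing (swap_plug, plug_write and plug_flush are invented names, not kernel APIs):

```c
#include <stddef.h>

#define PAGE_SIZE 4096
#define MAX_PLUG  8	/* model: flush once this many pages are batched */

/* A toy "plug": accumulates contiguous single-page swap writes instead
 * of issuing one I/O per page. */
struct swap_plug {
	unsigned long next_off;	/* expected offset of the next page */
	size_t pending;		/* bytes batched so far */
	int npages;
};

/* Queue one PAGE_SIZE write at byte offset `off`.  Returns the size of
 * any I/O flushed by this call (0 while still batching).  A flushed run
 * of N contiguous pages yields one N*PAGE_SIZE request. */
static size_t plug_write(struct swap_plug *p, unsigned long off)
{
	size_t flushed = 0;

	if (p->npages && off != p->next_off) {	/* not contiguous: flush */
		flushed = p->pending;
		p->pending = 0;
		p->npages = 0;
	}
	p->pending += PAGE_SIZE;
	p->npages++;
	p->next_off = off + PAGE_SIZE;
	if (p->npages == MAX_PLUG) {		/* plug full: flush */
		flushed = p->pending;
		p->pending = 0;
		p->npages = 0;
	}
	return flushed;
}

/* Unplug: submit whatever is still batched. */
static size_t plug_flush(struct swap_plug *p)
{
	size_t flushed = p->pending;

	p->pending = 0;
	p->npages = 0;
	return flushed;
}
```

Under this model, a strict `count == PAGE_SIZE` assertion at write-out time would trip on any run of contiguous swap slots, even with only small folios involved, matching what Christoph and Barry observed.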
On Tue, Jun 18, 2024 at 06:05:33PM +1200, Barry Song wrote:
> Yes, that was exactly what I missed. I then reproduced the issue and
> discovered that the root cause was unrelated to large folios. It was
> actually due to a batched bio plugging optimization from 2022. You can
> find the new patch here:
>
> https://lore.kernel.org/linux-mm/20240617220135.43563-1-21cnbao@gmail.com/

I don't really see any point in keeping the VM_BUG_ON. The underlying
direct I/O code doesn't really care about the size at all.
On Tue, Jun 18, 2024 at 6:13 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Jun 18, 2024 at 06:05:33PM +1200, Barry Song wrote:
> > Yes, that was exactly what I missed. I then reproduced the issue and
> > discovered that the root cause was unrelated to large folios. It was
> > actually due to a batched bio plugging optimization from 2022. You can
> > find the new patch here:
> >
> > https://lore.kernel.org/linux-mm/20240617220135.43563-1-21cnbao@gmail.com/
>
> I don't really see any point in keeping the VM_BUG_ON. The underlying
> direct I/O code doesn't really care about the size at all.

I am perfectly fine with dropping the VM_BUG/WARN. I was keeping it for
debugging purposes while we add mTHP swapout support for swapfiles:
large folios might then appear in the bio vec, and the check could help
verify that things are still normal once they do. But I agree that this
is not important for the NFS code itself; I can add such a warning
locally for debugging when needed.
On Fri, Jun 14, 2024 at 12:38 PM Christoph Hellwig <hch@lst.de> wrote:
>
> As of Linux 6.10-rc the MM can swap out larger than page size chunks.
> NFS has all code ready to handle this, but has a VM_BUG_ON that
> triggers when this happens. Simply remove the VM_BUG_ON to fix this
> use case.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/nfs/direct.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index bb2f583eb28bf1..90079ca134dd3c 100644

+1 for the patch, and thank you for fixing NFS swap.

Thanks,
Martin
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index bb2f583eb28bf1..90079ca134dd3c 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -141,8 +141,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 {
 	ssize_t ret;
 
-	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
-
 	if (iov_iter_rw(iter) == READ)
 		ret = nfs_file_direct_read(iocb, iter, true);
 	else
As of Linux 6.10-rc the MM can swap out larger than page size chunks.
NFS has all code ready to handle this, but has a VM_BUG_ON that
triggers when this happens. Simply remove the VM_BUG_ON to fix this
use case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/nfs/direct.c | 2 --
 1 file changed, 2 deletions(-)