diff mbox series

nfs: fix nfs_swap_rw for large-folio swap

Message ID 20240614100329.1203579-2-hch@lst.de (mailing list archive)
State New
Headers show
Series nfs: fix nfs_swap_rw for large-folio swap | expand

Commit Message

Christoph Hellwig June 14, 2024, 10:03 a.m. UTC
As of Linux 6.10-rc the MM can swap out larger than page size chunks.
NFS has all code ready to handle this, but has a VM_BUG_ON that
triggers when this happens.  Simply remove the VM_BUG_ON to fix this
use case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/nfs/direct.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Jeff Layton June 14, 2024, 5:52 p.m. UTC | #1
On Fri, 2024-06-14 at 12:03 +0200, Christoph Hellwig wrote:
> As of Linux 6.10-rc the MM can swap out larger than page size chunks.
> NFS has all code ready to handle this, but has a VM_BUG_ON that
> triggers when this happens.  Simply remove the VM_BUG_ON to fix this
> use case.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/nfs/direct.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index bb2f583eb28bf1..90079ca134dd3c 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -141,8 +141,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct
> iov_iter *iter)
>  {
>  	ssize_t ret;
>  
> -	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
> -
>  	if (iov_iter_rw(iter) == READ)
>  		ret = nfs_file_direct_read(iocb, iter, true);
>  	else


This definitely seems wrong in a large folio world.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Andrew Morton June 14, 2024, 6:21 p.m. UTC | #2
On Fri, 14 Jun 2024 12:03:25 +0200 Christoph Hellwig <hch@lst.de> wrote:

> As of Linux 6.10-rc the MM can swap out larger than page size chunks.
> NFS has all code ready to handle this, but has a VM_BUG_ON that
> triggers when this happens.  Simply remove the VM_BUG_ON to fix this
> use case.
> 
> ...
>
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -141,8 +141,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
>  {
>  	ssize_t ret;
>  
> -	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
> -
>  	if (iov_iter_rw(iter) == READ)
>  		ret = nfs_file_direct_read(iocb, iter, true);
>  	else

I'm thinking this should precede "mm: swap: entirely map large folios
found in swapcache", or be a part of it.

Barry/Chuanhua, any opinions?
Barry Song June 16, 2024, 12:16 a.m. UTC | #3
On Sun, Jun 16, 2024 at 5:55 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 14 Jun 2024 12:03:25 +0200 Christoph Hellwig <hch@lst.de> wrote:
>
> > As of Linux 6.10-rc the MM can swap out larger than page size chunks.
> > NFS has all code ready to handle this, but has a VM_BUG_ON that
> > triggers when this happens.  Simply remove the VM_BUG_ON to fix this
> > use case.

As I understand it, this isn't happening because we don't support
mTHP swapping out to a swapfile, whether it's on NFS or any
other filesystem.
static int scan_swap_map_slots(struct swap_info_struct *si,
                               unsigned char usage, int nr,
                               swp_entry_t slots[], int order)
{
        ...
        if (order > 0) {
                ...

                /*
                 * Swapfile is not block device or not using clusters so unable
                 * to allocate large entries.
                 */
                if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
                        return 0;
        }
}

However, I'm pleased to see this patch, as we might need it someday.

> >
> > ...
> >
> > --- a/fs/nfs/direct.c
> > +++ b/fs/nfs/direct.c
> > @@ -141,8 +141,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
> >  {
> >       ssize_t ret;
> >
> > -     VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
> > -
> >       if (iov_iter_rw(iter) == READ)
> >               ret = nfs_file_direct_read(iocb, iter, true);
> >       else
>
> I'm thinking this should precede "mm: swap: entirely map large folios
> found in swapcache", or be a part of it.
>
> Barry/Chuanhua, any opinions?

The patch “mm: swap: entirely map large folios found in swapcache” does not
perform any I/O operations; it only maps the accessed large folio within the
swapcache. Therefore, this patch can be considered independent.

BTW, we haven't supported swapping out mTHP to nfs swapfile.

So, this patch is future-proof and can serve as a precedent if we
want to extend mthp swapout/swpin to non-blkdev systems.
Please correct me if I'm wrong, Ryan.

Thanks
Barry
Christoph Hellwig June 16, 2024, 8:54 a.m. UTC | #4
On Sun, Jun 16, 2024 at 12:16:10PM +1200, Barry Song wrote:
> As I understand it, this isn't happening because we don't support
> mTHP swapping out to a swapfile, whether it's on NFS or any
> other filesystem.

It does happen.  The reason why I sent this patch is becaue I observed
the BUG_ON trigger on a trivial swap generation workload (usemem.c from
xfstests).
Barry Song June 16, 2024, 10:23 a.m. UTC | #5
On Sun, Jun 16, 2024 at 4:54 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Sun, Jun 16, 2024 at 12:16:10PM +1200, Barry Song wrote:
> > As I understand it, this isn't happening because we don't support
> > mTHP swapping out to a swapfile, whether it's on NFS or any
> > other filesystem.
>
> It does happen.  The reason why I sent this patch is becaue I observed
> the BUG_ON trigger on a trivial swap generation workload (usemem.c from
> xfstests).

This is quite unusual. Could you share your setup and backtrace? I'd
like to reproduce the issue, as the mm code only supports mTHP
swapout on block devices. What is your swap device or swap file?
Additionally, on what kind of filesystem is the executable file built
from usemem.c located?

>
Christoph Hellwig June 17, 2024, 5:32 a.m. UTC | #6
On Sun, Jun 16, 2024 at 06:23:16PM +0800, Barry Song wrote:
> This is quite unusual. Could you share your setup and backtrace?

Do this in an x86 kvm vm with 1G of memory:

mkfs.xfs -f /dev/vdb

mount /dev/vdb /mnt/nfs1

# this requires the following line in /etc/exports to actually export the
# file system: "/mnt/nfs2	localhost(rw,no_root_squash)"
systemctl restart nfs-kernel-server

mount -t nfs 127.0.0.1:/mnt/nfs1/ /mnt/test/

cd /mnt/test
dd if=/dev/zero of=swapfile bs=1024 count=524288
mkswap swapfile
chmod 0600 swapfile
swapon swapfile

# xfstests-dev is a checked out and built xfstests repo
~/xfstests-dev/src/usemem 1024

The backtrace is at the end of the mail.

> I'd
> like to reproduce the issue, as the mm code only supports mTHP
> swapout on block devices. What is your swap device or swap file?

The above swap fіle on NFS is the only swap in the system

> Additionally, on what kind of filesystem is the executable file built
> from usemem.c located?

Doesn't matter.  I've tried xfs, ext4 and nfs itself.

root@testvm:~# ./xfstests-dev/src/usemem 1024
[   65.065292] ------------[ cut here ]------------
[   65.065528] kernel BUG at fs/nfs/direct.c:144!
[   65.065724] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   65.065988] CPU: 0 PID: 71 Comm: kswapd0 Not tainted 6.10.0-rc3+ #2587
[   65.066270] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   65.066683] RIP: 0010:nfs_swap_rw+0x3f/0x50
[   65.066893] Code: 00 00 74 13 e8 e2 fb ff ff 31 d2 48 85 c0 48 0f 4f c2 c3 cc cc cc cc e8 9f f8 ff ff 31 0
[   65.067698] RSP: 0018:ffffc900001b79c8 EFLAGS: 00010206
[   65.067949] RAX: ffffffff8162afc0 RBX: ffff88800ca24c00 RCX: 0000000000000004
[   65.068274] RDX: ffff88800ca24c30 RSI: ffffc900001b79d0 RDI: ffff88800ca24c00
[   65.068588] RBP: ffff88800dd710e8 R08: 0000000000004000 R09: 0000000000000010
[   65.068911] R10: 0000000000035a40 R11: 000000000000001d R12: ffffc900001b7a78
[   65.069225] R13: ffffc900001b7db0 R14: ffffc900001b7a68 R15: ffffea00003f1c40
[   65.069542] FS:  0000000000000000(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
[   65.069904] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   65.070157] CR2: 00005606ddec4dd0 CR3: 000000000365a003 CR4: 0000000000770ef0
[   65.070472] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   65.070784] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[   65.071282] PKRU: 55555554
[   65.071407] Call Trace:
[   65.071519]  <TASK>
[   65.071619]  ? die+0x31/0x80
[   65.071754]  ? do_trap+0xd7/0x100
[   65.071915]  ? do_error_trap+0x65/0x90
[   65.072084]  ? nfs_swap_rw+0x3f/0x50
[   65.072247]  ? exc_invalid_op+0x50/0x70
[   65.072420]  ? nfs_swap_rw+0x3f/0x50
[   65.072581]  ? asm_exc_invalid_op+0x1a/0x20
[   65.072770]  ? __pfx_nfs_swap_rw+0x10/0x10
[   65.072966]  ? nfs_swap_rw+0x3f/0x50
[   65.073125]  swap_write_unplug+0x57/0x90
[   65.073307]  shrink_folio_list+0x3a1/0xed0
[   65.073500]  ? __mod_zone_page_state+0x61/0xa0
[   65.073700]  ? isolate_lru_folios+0x2f3/0x400
[   65.073902]  shrink_lruvec+0x587/0xb60
[   65.074071]  ? shrink_slab+0x3ad/0x3c0
[   65.074239]  ? shrink_slab+0x3ad/0x3c0
[   65.074408]  shrink_node+0x2b7/0x7e0
[   65.074569]  balance_pgdat+0x291/0x6f0
Barry Song June 17, 2024, 8:02 a.m. UTC | #7
On Mon, Jun 17, 2024 at 5:32 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Sun, Jun 16, 2024 at 06:23:16PM +0800, Barry Song wrote:
> > This is quite unusual. Could you share your setup and backtrace?
>
> Do this in an x86 kvm vm with 1G of memory:
>
> mkfs.xfs -f /dev/vdb
>
> mount /dev/vdb /mnt/nfs1
>
> # this requires the following line in /etc/exports to actually export the
> # file system: "/mnt/nfs2       localhost(rw,no_root_squash)"
> systemctl restart nfs-kernel-server
>
> mount -t nfs 127.0.0.1:/mnt/nfs1/ /mnt/test/
>
> cd /mnt/test
> dd if=/dev/zero of=swapfile bs=1024 count=524288
> mkswap swapfile
> chmod 0600 swapfile
> swapon swapfile

I did everything the same as above,
but always got failure at the last step to swapon:
/mnt/test # swapon swapfile
swapon: /mnt/test/swapfile: swapon failed: Invalid argument

the mount status:
/dev/vdb on /mnt/nfs1 type xfs
(rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
127.0.0.1:/mnt/nfs1 on /mnt/test type nfs4
(rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)

As long as the swapfile is not on nfs, I am ok to swapon:
/mnt # swapon swapfile
[ 1150.302641] Adding 65532k swap on /mnt/swapfile.  Priority:-2
extents:1 across:65532k

Anything I am missing?

>
> # xfstests-dev is a checked out and built xfstests repo
> ~/xfstests-dev/src/usemem 1024
>
> The backtrace is at the end of the mail.
>
> > I'd
> > like to reproduce the issue, as the mm code only supports mTHP
> > swapout on block devices. What is your swap device or swap file?
>
> The above swap fіle on NFS is the only swap in the system
>
> > Additionally, on what kind of filesystem is the executable file built
> > from usemem.c located?
>
> Doesn't matter.  I've tried xfs, ext4 and nfs itself.
>
> root@testvm:~# ./xfstests-dev/src/usemem 1024
> [   65.065292] ------------[ cut here ]------------
> [   65.065528] kernel BUG at fs/nfs/direct.c:144!
> [   65.065724] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [   65.065988] CPU: 0 PID: 71 Comm: kswapd0 Not tainted 6.10.0-rc3+ #2587
> [   65.066270] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [   65.066683] RIP: 0010:nfs_swap_rw+0x3f/0x50
> [   65.066893] Code: 00 00 74 13 e8 e2 fb ff ff 31 d2 48 85 c0 48 0f 4f c2 c3 cc cc cc cc e8 9f f8 ff ff 31 0
> [   65.067698] RSP: 0018:ffffc900001b79c8 EFLAGS: 00010206
> [   65.067949] RAX: ffffffff8162afc0 RBX: ffff88800ca24c00 RCX: 0000000000000004
> [   65.068274] RDX: ffff88800ca24c30 RSI: ffffc900001b79d0 RDI: ffff88800ca24c00
> [   65.068588] RBP: ffff88800dd710e8 R08: 0000000000004000 R09: 0000000000000010
> [   65.068911] R10: 0000000000035a40 R11: 000000000000001d R12: ffffc900001b7a78
> [   65.069225] R13: ffffc900001b7db0 R14: ffffc900001b7a68 R15: ffffea00003f1c40
> [   65.069542] FS:  0000000000000000(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
> [   65.069904] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   65.070157] CR2: 00005606ddec4dd0 CR3: 000000000365a003 CR4: 0000000000770ef0
> [   65.070472] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   65.070784] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [   65.071282] PKRU: 55555554
> [   65.071407] Call Trace:
> [   65.071519]  <TASK>
> [   65.071619]  ? die+0x31/0x80
> [   65.071754]  ? do_trap+0xd7/0x100
> [   65.071915]  ? do_error_trap+0x65/0x90
> [   65.072084]  ? nfs_swap_rw+0x3f/0x50
> [   65.072247]  ? exc_invalid_op+0x50/0x70
> [   65.072420]  ? nfs_swap_rw+0x3f/0x50
> [   65.072581]  ? asm_exc_invalid_op+0x1a/0x20
> [   65.072770]  ? __pfx_nfs_swap_rw+0x10/0x10
> [   65.072966]  ? nfs_swap_rw+0x3f/0x50
> [   65.073125]  swap_write_unplug+0x57/0x90
> [   65.073307]  shrink_folio_list+0x3a1/0xed0
> [   65.073500]  ? __mod_zone_page_state+0x61/0xa0
> [   65.073700]  ? isolate_lru_folios+0x2f3/0x400
> [   65.073902]  shrink_lruvec+0x587/0xb60
> [   65.074071]  ? shrink_slab+0x3ad/0x3c0
> [   65.074239]  ? shrink_slab+0x3ad/0x3c0
> [   65.074408]  shrink_node+0x2b7/0x7e0
> [   65.074569]  balance_pgdat+0x291/0x6f0
Ryan Roberts June 17, 2024, 8:03 a.m. UTC | #8
On 16/06/2024 11:23, Barry Song wrote:
> On Sun, Jun 16, 2024 at 4:54 PM Christoph Hellwig <hch@lst.de> wrote:
>>
>> On Sun, Jun 16, 2024 at 12:16:10PM +1200, Barry Song wrote:
>>> As I understand it, this isn't happening because we don't support
>>> mTHP swapping out to a swapfile, whether it's on NFS or any
>>> other filesystem.
>>
>> It does happen.  The reason why I sent this patch is becaue I observed
>> the BUG_ON trigger on a trivial swap generation workload (usemem.c from
>> xfstests).
> 
> This is quite unusual. Could you share your setup and backtrace? I'd
> like to reproduce the issue, as the mm code only supports mTHP
> swapout on block devices. What is your swap device or swap file?
> Additionally, on what kind of filesystem is the executable file built
> from usemem.c located?

Yes, I'm also confused by this, since as Barry says, the swap-out changes to
support mTHP are only intended to be activated when the swap device is a
non-rotating block device - swap files on file systems are explicitly not
supported and all swapping should be done page-by-page in that case. This
constraint is exactly the same as for the pre-existing PMD-size THP swap-out
support. So if you are seeing large folios being written after the mTHP swap-out
change, you should also be seeing large folios before this change.

Hopefully the stack trace will tell us what's going on here.

(Sorry for my slow responses/lack of engagement over the last month; its been a
combination of paternity leave/lack of sleep/working on other things. I'm hoping
to get properly back into this stuff within the next couple of weeks).

Thanks,
Ryan
Barry Song June 17, 2024, 9:40 a.m. UTC | #9
On Mon, Jun 17, 2024 at 8:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 16/06/2024 11:23, Barry Song wrote:
> > On Sun, Jun 16, 2024 at 4:54 PM Christoph Hellwig <hch@lst.de> wrote:
> >>
> >> On Sun, Jun 16, 2024 at 12:16:10PM +1200, Barry Song wrote:
> >>> As I understand it, this isn't happening because we don't support
> >>> mTHP swapping out to a swapfile, whether it's on NFS or any
> >>> other filesystem.
> >>
> >> It does happen.  The reason why I sent this patch is becaue I observed
> >> the BUG_ON trigger on a trivial swap generation workload (usemem.c from
> >> xfstests).
> >
> > This is quite unusual. Could you share your setup and backtrace? I'd
> > like to reproduce the issue, as the mm code only supports mTHP
> > swapout on block devices. What is your swap device or swap file?
> > Additionally, on what kind of filesystem is the executable file built
> > from usemem.c located?
>
> Yes, I'm also confused by this, since as Barry says, the swap-out changes to
> support mTHP are only intended to be activated when the swap device is a
> non-rotating block device - swap files on file systems are explicitly not
> supported and all swapping should be done page-by-page in that case. This
> constraint is exactly the same as for the pre-existing PMD-size THP swap-out
> support. So if you are seeing large folios being written after the mTHP swap-out
> change, you should also be seeing large folios before this change.
>
> Hopefully the stack trace will tell us what's going on here.

Hi Ryan, Christoph,

I am able to reproduce the issue now. I am debugging and will update
the root cause
with you this week.

Initial investigation shows the issue might *not* be related to THP_SWPOUT.

I am even able to reproduce it after disabling thp and mthp, entirely by
small folios:

[  215.925069] folio_alloc_swap folio nr:1 anon:1 swapbacked:1
[  215.926383] vmscan: shrink_folio_list folio nr:1 anon:1 swapbacked:1
[  215.927008] folio_alloc_swap folio nr:1 anon:1 swapbacked:1
[  215.929368] ------------[ cut here ]------------
[  215.929824] kernel BUG at fs/nfs/direct.c:144!
[  215.930403] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[  215.931264] Modules linked in:
[  215.932328] CPU: 3 PID: 214 Comm: mthp_swpout_tes Not tainted
6.10.0-rc3-ga12328d9fb85-dirty #292
[  215.932953] Hardware name: linux,dummy-virt (DT)
[  215.933461] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[  215.934030] pc : nfs_swap_rw+0x60/0x70
[  215.935079] lr : swap_write_unplug+0x64/0xb0
[  215.935559] sp : ffff800087363280
[  215.935958] x29: ffff800087363280 x28: ffff0000c3241800 x27: fffffdffc323a4c0
[  215.937012] x26: fffffdffc323a4c8 x25: ffff0001b4a51500 x24: ffff80008250f670
[  215.937893] x23: 0000000000000001 x22: ffff0000c0b2da00 x21: 0000000000020000
[  215.938734] x20: ffff0000c46a8bd8 x19: ffff0000c154f800 x18: ffffffffffffffff
[  215.939594] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800107363097
[  215.940591] x14: 0000000000000000 x13: 313a64656b636162 x12: 7061777320313a6e
[  215.941621] x11: 6f6e6120313a726e x10: ffff800083e86318 x9 : ffff8000803e9ad4
[  215.942673] x8 : ffff800087363168 x7 : 0000000000000000 x6 : ffff0001adbfa4c6
[  215.943674] x5 : 0000000000000002 x4 : 0000000000020000 x3 : 0000000000020000
[  215.944673] x2 : ffff8000806015e8 x1 : ffff8000873632a0 x0 : ffff0000c154f800
[  215.945568] Call trace:
[  215.945906]  nfs_swap_rw+0x60/0x70
[  215.946351]  __swap_writepage+0x2e8/0x328
[  215.946775]  swap_writepage+0x68/0xd0
[  215.947184]  pageout+0xe4/0x430
[  215.947587]  shrink_folio_list+0x9bc/0xf60
[  215.947992]  reclaim_folio_list+0x8c/0x168
[  215.948454]  reclaim_pages+0xfc/0x178
[  215.948843]  madvise_cold_or_pageout_pte_range+0x8d8/0xf28
[  215.949285]  walk_pgd_range+0x390/0x808
[  215.949660]  __walk_page_range+0x1e0/0x1f0
[  215.950040]  walk_page_range+0x1f0/0x2c8
[  215.950458]  madvise_pageout+0xf8/0x280
[  215.950905]  madvise_vma_behavior+0x314/0xa20
[  215.951361]  madvise_walk_vmas+0xc0/0x128
[  215.951807]  do_madvise.part.0+0x110/0x558
[  215.952298]  __arm64_sys_madvise+0x68/0x88
[  215.952723]  invoke_syscall+0x50/0x128
[  215.953148]  el0_svc_common.constprop.0+0x48/0xf8
[  215.953592]  do_el0_svc+0x28/0x40
[  215.954036]  el0_svc+0x50/0x150
[  215.954610]  el0t_64_sync_handler+0x13c/0x158
[  215.955070]  el0t_64_sync+0x1a4/0x1a8
[  215.955685] Code: a8c17bfd d50323bf 9a9fd000 d65f03c0 (d4210000)
[  215.956510] ---[ end trace 0000000000000000 ]---


>
> (Sorry for my slow responses/lack of engagement over the last month; its been a
> combination of paternity leave/lack of sleep/working on other things. I'm hoping
> to get properly back into this stuff within the next couple of weeks).
>
> Thanks,
> Ryan
>
Ryan Roberts June 17, 2024, 10:33 a.m. UTC | #10
On 17/06/2024 10:40, Barry Song wrote:
> On Mon, Jun 17, 2024 at 8:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 16/06/2024 11:23, Barry Song wrote:
>>> On Sun, Jun 16, 2024 at 4:54 PM Christoph Hellwig <hch@lst.de> wrote:
>>>>
>>>> On Sun, Jun 16, 2024 at 12:16:10PM +1200, Barry Song wrote:
>>>>> As I understand it, this isn't happening because we don't support
>>>>> mTHP swapping out to a swapfile, whether it's on NFS or any
>>>>> other filesystem.
>>>>
>>>> It does happen.  The reason why I sent this patch is becaue I observed
>>>> the BUG_ON trigger on a trivial swap generation workload (usemem.c from
>>>> xfstests).
>>>
>>> This is quite unusual. Could you share your setup and backtrace? I'd
>>> like to reproduce the issue, as the mm code only supports mTHP
>>> swapout on block devices. What is your swap device or swap file?
>>> Additionally, on what kind of filesystem is the executable file built
>>> from usemem.c located?
>>
>> Yes, I'm also confused by this, since as Barry says, the swap-out changes to
>> support mTHP are only intended to be activated when the swap device is a
>> non-rotating block device - swap files on file systems are explicitly not
>> supported and all swapping should be done page-by-page in that case. This
>> constraint is exactly the same as for the pre-existing PMD-size THP swap-out
>> support. So if you are seeing large folios being written after the mTHP swap-out
>> change, you should also be seeing large folios before this change.
>>
>> Hopefully the stack trace will tell us what's going on here.
> 
> Hi Ryan, Christoph,
> 
> I am able to reproduce the issue now. I am debugging and will update
> the root cause
> with you this week.

Ahh great; for some reason I'm not receiving Chrostoph's mails so didn't see the
stack trace and instructions until you replied to it. I had a go are repro'ing
too but am failing to even get the systemd nfs service to start. I'll leave it
to you.

> 
> Initial investigation shows the issue might *not* be related to THP_SWPOUT.
> 
> I am even able to reproduce it after disabling thp and mthp, entirely by
> small folios:
> 
> [  215.925069] folio_alloc_swap folio nr:1 anon:1 swapbacked:1
> [  215.926383] vmscan: shrink_folio_list folio nr:1 anon:1 swapbacked:1
> [  215.927008] folio_alloc_swap folio nr:1 anon:1 swapbacked:1
> [  215.929368] ------------[ cut here ]------------
> [  215.929824] kernel BUG at fs/nfs/direct.c:144!
> [  215.930403] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> [  215.931264] Modules linked in:
> [  215.932328] CPU: 3 PID: 214 Comm: mthp_swpout_tes Not tainted
> 6.10.0-rc3-ga12328d9fb85-dirty #292
> [  215.932953] Hardware name: linux,dummy-virt (DT)
> [  215.933461] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [  215.934030] pc : nfs_swap_rw+0x60/0x70
> [  215.935079] lr : swap_write_unplug+0x64/0xb0
> [  215.935559] sp : ffff800087363280
> [  215.935958] x29: ffff800087363280 x28: ffff0000c3241800 x27: fffffdffc323a4c0
> [  215.937012] x26: fffffdffc323a4c8 x25: ffff0001b4a51500 x24: ffff80008250f670
> [  215.937893] x23: 0000000000000001 x22: ffff0000c0b2da00 x21: 0000000000020000
> [  215.938734] x20: ffff0000c46a8bd8 x19: ffff0000c154f800 x18: ffffffffffffffff
> [  215.939594] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800107363097
> [  215.940591] x14: 0000000000000000 x13: 313a64656b636162 x12: 7061777320313a6e
> [  215.941621] x11: 6f6e6120313a726e x10: ffff800083e86318 x9 : ffff8000803e9ad4
> [  215.942673] x8 : ffff800087363168 x7 : 0000000000000000 x6 : ffff0001adbfa4c6
> [  215.943674] x5 : 0000000000000002 x4 : 0000000000020000 x3 : 0000000000020000
> [  215.944673] x2 : ffff8000806015e8 x1 : ffff8000873632a0 x0 : ffff0000c154f800
> [  215.945568] Call trace:
> [  215.945906]  nfs_swap_rw+0x60/0x70
> [  215.946351]  __swap_writepage+0x2e8/0x328
> [  215.946775]  swap_writepage+0x68/0xd0
> [  215.947184]  pageout+0xe4/0x430
> [  215.947587]  shrink_folio_list+0x9bc/0xf60
> [  215.947992]  reclaim_folio_list+0x8c/0x168
> [  215.948454]  reclaim_pages+0xfc/0x178
> [  215.948843]  madvise_cold_or_pageout_pte_range+0x8d8/0xf28
> [  215.949285]  walk_pgd_range+0x390/0x808
> [  215.949660]  __walk_page_range+0x1e0/0x1f0
> [  215.950040]  walk_page_range+0x1f0/0x2c8
> [  215.950458]  madvise_pageout+0xf8/0x280
> [  215.950905]  madvise_vma_behavior+0x314/0xa20
> [  215.951361]  madvise_walk_vmas+0xc0/0x128
> [  215.951807]  do_madvise.part.0+0x110/0x558
> [  215.952298]  __arm64_sys_madvise+0x68/0x88
> [  215.952723]  invoke_syscall+0x50/0x128
> [  215.953148]  el0_svc_common.constprop.0+0x48/0xf8
> [  215.953592]  do_el0_svc+0x28/0x40
> [  215.954036]  el0_svc+0x50/0x150
> [  215.954610]  el0t_64_sync_handler+0x13c/0x158
> [  215.955070]  el0t_64_sync+0x1a4/0x1a8
> [  215.955685] Code: a8c17bfd d50323bf 9a9fd000 d65f03c0 (d4210000)
> [  215.956510] ---[ end trace 0000000000000000 ]---
> 
> 
>>
>> (Sorry for my slow responses/lack of engagement over the last month; its been a
>> combination of paternity leave/lack of sleep/working on other things. I'm hoping
>> to get properly back into this stuff within the next couple of weeks).
>>
>> Thanks,
>> Ryan
>>
Christoph Hellwig June 18, 2024, 5:52 a.m. UTC | #11
On Mon, Jun 17, 2024 at 08:02:47PM +1200, Barry Song wrote:
> I did everything the same as above,
> but always got failure at the last step to swapon:
> /mnt/test # swapon swapfile
> swapon: /mnt/test/swapfile: swapon failed: Invalid argument

You are probably missing

CONFIG_NFS_SWAP=y

in your .config.
Barry Song June 18, 2024, 6:05 a.m. UTC | #12
On Tue, Jun 18, 2024 at 5:52 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Jun 17, 2024 at 08:02:47PM +1200, Barry Song wrote:
> > I did everything the same as above,
> > but always got failure at the last step to swapon:
> > /mnt/test # swapon swapfile
> > swapon: /mnt/test/swapfile: swapon failed: Invalid argument
>
> You are probably missing
>
> CONFIG_NFS_SWAP=y
>
> in your .config.

Yes, that was exactly what I missed. I then figured it out, reproduced
the issue,
and discovered that the root cause was unrelated to large folios. It
was actually
due to a batched bio plugging optimization from 2022. You can find the new patch
here:

https://lore.kernel.org/linux-mm/20240617220135.43563-1-21cnbao@gmail.com/


>
Christoph Hellwig June 18, 2024, 6:13 a.m. UTC | #13
On Tue, Jun 18, 2024 at 06:05:33PM +1200, Barry Song wrote:
> Yes, that was exactly what I missed. I then figured it out, reproduced
> the issue,
> and discovered that the root cause was unrelated to large folios. It
> was actually
> due to a batched bio plugging optimization from 2022. You can find the new patch
> here:
> 
> https://lore.kernel.org/linux-mm/20240617220135.43563-1-21cnbao@gmail.com/

I don't really see any point in keeping the VM_BUG_ON.  The underlying
direct I/O code doesn't really care about the size at all.
Barry Song June 18, 2024, 6:31 a.m. UTC | #14
On Tue, Jun 18, 2024 at 6:13 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Jun 18, 2024 at 06:05:33PM +1200, Barry Song wrote:
> > Yes, that was exactly what I missed. I then figured it out, reproduced
> > the issue,
> > and discovered that the root cause was unrelated to large folios. It
> > was actually
> > due to a batched bio plugging optimization from 2022. You can find the new patch
> > here:
> >
> > https://lore.kernel.org/linux-mm/20240617220135.43563-1-21cnbao@gmail.com/
>
> I don't really see any point in keeping the VM_BUG_ON.  The underlying
> direct I/O code doesn't really care about the size at all.

I am perfectly fine with dropping the VM_BUG/WARN. I was keeping it
for debugging
purposes while we add mTHP swapout support for the swapfile. Large folios might
be in bio vec, this can help verify things are still normal after
having large folios there.

But I agree that this is not important for the NFS code itself. I can
add this warning
locally for debugging when needed.
Martin Wege June 18, 2024, 6:48 a.m. UTC | #15
On Fri, Jun 14, 2024 at 12:38 PM Christoph Hellwig <hch@lst.de> wrote:
>
> As of Linux 6.10-rc the MM can swap out larger than page size chunks.
> NFS has all code ready to handle this, but has a VM_BUG_ON that
> triggers when this happens.  Simply remove the VM_BUG_ON to fix this
> use case.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/nfs/direct.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index bb2f583eb28bf1..90079ca134dd3c 100644

+1 for the patch, and thank you for fixing NFS swap

Thanks,
Martin
diff mbox series

Patch

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index bb2f583eb28bf1..90079ca134dd3c 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -141,8 +141,6 @@  int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 {
 	ssize_t ret;
 
-	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
-
 	if (iov_iter_rw(iter) == READ)
 		ret = nfs_file_direct_read(iocb, iter, true);
 	else