| Message ID | 20171031173945.rgsvdtulqmsa5hkz@tonberry.usersys.redhat.com (mailing list archive) |
|---|---|
| State | New, archived |
Hi Scott,

looks like I can't reproduce the BUG, but now it hangs. I will investigate and
let you know.

Tigran.

----- Original Message -----
> From: "Scott Mayhew" <smayhew@redhat.com>
> To: "Tigran Mkrtchyan" <tigran.mkrtchyan@desy.de>
> Cc: "linux-nfs" <linux-nfs@vger.kernel.org>
> Sent: Tuesday, October 31, 2017 6:39:45 PM
> Subject: Re: Kernel bug triggered with xfstest generic/113
>
> On Thu, 26 Oct 2017, Mkrtchyan, Tigran wrote:
>
>> Against dCache nfs server:
>>
>> [ 3987.717284] ------------[ cut here ]------------
>> [ 3987.717286] kernel BUG at fs/inode.c:567!
>> [ 3987.717292] invalid opcode: 0000 [#1] SMP
>> [ 3987.717293] Modules linked in: loop nfs_layout_nfsv41_files rpcsec_gss_krb5
>> nfsv4 nfs lockd grace fscache binfmt_misc af_packet nf_conntrack_netbios_ns
>> nf_conntrack_broadcast xt_tcpudp xt_CT ip6t_REJECT nf_reject_ipv6
>> nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4
>> nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c nfnetlink ip6table_mangle
>> ip6table_raw ip6table_security iptable_mangle iptable_raw iptable_security
>> ip6table_filter ip6_tables iptable_filter btrfs xor zstd_decompress
>> zstd_compress xxhash lzo_compress zlib_deflate raid6_pq snd_hda_codec_idt
>> snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep
>> snd_seq iTCO_wdt iTCO_vendor_support snd_seq_device snd_timer lpc_ich
>> mfd_core tpm_tis snd_pcm tpm_tis_core i2c_i801 snd tpm auth_rpcgss oid_registry
>> [ 3987.717340] sunrpc ip_tables x_tables serio_raw e1000e ptp pps_core autofs4
>> [ 3987.717349] CPU: 1 PID: 31883 Comm: rm Not tainted 4.14.0-rc6-01076-gbb176f67090c #120
>> [ 3987.717350] Hardware name: Comptronic pczW1007/DX38BT, BIOS BTX3810J.86A.1893.2008.1009.1712 10/09/2008
>> [ 3987.717353] task: ffff880107f7c480 task.stack: ffff88012a830000
>> [ 3987.717358] RIP: 0010:evict+0x161/0x180
>> [ 3987.717360] RSP: 0018:ffff88012a833e80 EFLAGS: 00010202
>> [ 3987.717362] RAX: ffffffff81806110 RBX: ffff880100037800 RCX: ffff8801000378a0
>> [ 3987.717364] RDX: ffffffff81806110 RSI: ffff8801000378a0 RDI: ffffffff81806108
>> [ 3987.717366] RBP: ffff88012a833e98 R08: 0000000000000000 R09: ffff88012a03d6e8
>> [ 3987.717368] R10: ffff88012a03d6f8 R11: ffff88012a03d6f8 R12: ffff880100037910
>> [ 3987.717370] R13: ffffffffa0404540 R14: 00000000ffffff9c R15: 0000000000000000
>> [ 3987.717372] FS: 00007faddfd6e700(0000) GS:ffff88012fc80000(0000) knlGS:0000000000000000
>> [ 3987.717375] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 3987.717377] CR2: 00005653ab778470 CR3: 000000012840c000 CR4: 00000000000406e0
>> [ 3987.717378] Call Trace:
>> [ 3987.717382]  iput+0xf9/0x1c0
>> [ 3987.717385]  do_unlinkat+0x18d/0x2f0
>> [ 3987.717388]  SyS_unlinkat+0x16/0x30
>> [ 3987.717392]  entry_SYSCALL_64_fastpath+0x13/0x94
>> [ 3987.717394] RIP: 0033:0x7faddf8a0c17
>> [ 3987.717396] RSP: 002b:00007ffc44a11468 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
>> [ 3987.717399] RAX: ffffffffffffffda RBX: 000055ccd0723428 RCX: 00007faddf8a0c17
>> [ 3987.717401] RDX: 0000000000000000 RSI: 000055ccd07220f0 RDI: 00000000ffffff9c
>> [ 3987.717403] RBP: 000055ccd0723320 R08: 0000000000000003 R09: 0000000000000000
>> [ 3987.717404] R10: 0000000000000100 R11: 0000000000000246 R12: 000055ccd0722060
>> [ 3987.717406] R13: 00007ffc44a115a0 R14: 00007faddfb6a1e4 R15: 0000000000000000
>> [ 3987.717408] Code: 89 df e8 23 7d fe ff eb 8f 48 83 bb 20 02 00 00 00 74 85 48
>> 89 df e8 7f ad 01 00 0f b7 03 66 25 00 f0 e9 6b ff ff ff 0f 0b 0f 0b <0f> 0b 48
>> 8d bb 68 01 00 00 e8 a1 9e fa ff 48 89 df e8 a9 f1 ff
>> [ 3987.717443] RIP: evict+0x161/0x180 RSP: ffff88012a833e80
>> [ 3987.717445] ---[ end trace ca9a0f6be0e72301 ]---
>
> I've been seeing a similar panic. Does the attached patch help?
>
> -Scott
>
>> Looks like it happens in the post-test clean-up step (rm), though some processes
>> are still there:
>>
>> root 18489 18329 0 Oct25 pts/1 00:00:00
>> /data/git/xfstests-dev/ltp/aio-stress -t 20 -s 10 -O -I 1000
>> /mnt/test/aiostress.18329.3 /mnt/test/aiostress.18329.3.20
>> /mnt/test/aiostress.18329.3.19 /mnt/test/aiostress.18329.3.18
>> /mnt/test/aiostress.18329.3.17 /mnt/test/aiostress.18329.3.16
>> /mnt/test/aiostress.18329.3.15 /mnt/test/aiostress.18329.3.14
>> /mnt/test/aiostress.18329.3.13 /mnt/test/aiostress.18329.3.12
>> /mnt/test/aiostress.18329.3.11 /mnt/test/aiostress.18329.3.10
>> /mnt/test/aiostress.18329.3.9 /mnt/test/aiostress.18329.3.8
>> /mnt/test/aiostress.18329.3.7 /mnt/test/aiostress.18329.3.6
>> /mnt/test/aiostress.18329.3.5 /mnt/test/aiostress.18329.3.4
>> /mnt/test/aiostress.18329.3.3 /mnt/test/aiostress.18329.3.2
>>
>> I can always reproduce it.
>>
>> Tigran.
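For context on the oops above: in a 4.14-era tree, fs/inode.c:567 falls inside the sanity checks of clear_inode(), which runs during eviction (NFS reaches it via nfs4_evict_inode() and nfs_clear_inode()). The check that fires when an inode still has pages attached at final iput(), for example pNFS writes still waiting on a COMMIT, would be BUG_ON(inode->i_data.nrpages). A minimal, abridged sketch of those checks, reconstructed from memory of the 4.14 source; the exact line numbering and which BUG_ON fired here are assumptions:

```c
/* fs/inode.c (4.14-era), abridged sketch of clear_inode(); reconstructed
 * from memory, so treat the exact lines and ordering as an assumption. */
void clear_inode(struct inode *inode)
{
	spin_lock_irq(&inode->i_data.tree_lock);
	BUG_ON(inode->i_data.nrpages);       /* likely trigger: pages were   */
	BUG_ON(inode->i_data.nrexceptional); /* still attached at eviction   */
	spin_unlock_irq(&inode->i_data.tree_lock);
	BUG_ON(!list_empty(&inode->i_data.private_list));
	BUG_ON(!(inode->i_state & I_FREEING));
	BUG_ON(inode->i_state & I_CLEAR);
	/* no i_lock needed here: no concurrent modifiers of i_state remain */
	inode->i_state = I_FREEING | I_CLEAR;
}
```

The three consecutive ud2 opcodes (0f 0b) in the Code: dump are consistent with several adjacent BUG_ON checks inlined into evict().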
Hi Scott,

your change did indeed fix the issue. The hanging client was caused by a bug
in our server.

Tested-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>

Should this go into stable as well?

Thanks,
Tigran

----- Original Message -----
> From: "Tigran Mkrtchyan" <tigran.mkrtchyan@desy.de>
> To: "Scott Mayhew" <smayhew@redhat.com>
> Cc: "linux-nfs" <linux-nfs@vger.kernel.org>
> Sent: Wednesday, November 1, 2017 3:29:32 PM
> Subject: Re: Kernel bug triggered with xfstest generic/113
>
> Hi Scott,
>
> looks like I can't reproduce the BUG, but now it hangs. I will investigate and
> let you know.
>
> Tigran.
>
> ----- Original Message -----
>> From: "Scott Mayhew" <smayhew@redhat.com>
>> To: "Tigran Mkrtchyan" <tigran.mkrtchyan@desy.de>
>> Cc: "linux-nfs" <linux-nfs@vger.kernel.org>
>> Sent: Tuesday, October 31, 2017 6:39:45 PM
>> Subject: Re: Kernel bug triggered with xfstest generic/113
>>
>> On Thu, 26 Oct 2017, Mkrtchyan, Tigran wrote:
>>
>>> Against dCache nfs server:
>>>
>>> [ 3987.717284] ------------[ cut here ]------------
>>> [ 3987.717286] kernel BUG at fs/inode.c:567!
>>> [...]
>>
>> I've been seeing a similar panic. Does the attached patch help?
>>
>> -Scott
>>
>>> Looks like it happens in the post-test clean-up step (rm), though some processes
>>> are still there:
>>>
>>> [...]
>>>
>>> I can always reproduce it.
>>>
>>> Tigran.
diff --git a/fs/nfs/filelayout/filelayout.c b/fs/nfs/filelayout/filelayout.c
index 508126e..8845a73 100644
--- a/fs/nfs/filelayout/filelayout.c
+++ b/fs/nfs/filelayout/filelayout.c
@@ -169,7 +169,7 @@ static int filelayout_async_handle_error(struct rpc_task *task,
 		 * i/o and all i/o waiting on the slot table to the MDS until
 		 * layout is destroyed and a new valid layout is obtained.
 		 */
-		pnfs_destroy_layout(NFS_I(inode));
+		pnfs_destroy_layout(NFS_I(inode), 0);
 		rpc_wake_up(&tbl->slot_tbl_waitq);
 		goto reset;
 	/* RPC connection errors */
diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c b/fs/nfs/flexfilelayout/flexfilelayout.c
index b0fa83a..2f0f682 100644
--- a/fs/nfs/flexfilelayout/flexfilelayout.c
+++ b/fs/nfs/flexfilelayout/flexfilelayout.c
@@ -1088,7 +1088,7 @@ static int ff_layout_async_handle_error_v4(struct rpc_task *task,
 		 * i/o and all i/o waiting on the slot table to the MDS until
 		 * layout is destroyed and a new valid layout is obtained.
 		 */
-		pnfs_destroy_layout(NFS_I(inode));
+		pnfs_destroy_layout(NFS_I(inode), 0);
 		rpc_wake_up(&tbl->slot_tbl_waitq);
 		goto reset;
 	/* RPC connection errors */
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
index 6fb7cb6..3b4a063 100644
--- a/fs/nfs/nfs4super.c
+++ b/fs/nfs/nfs4super.c
@@ -95,7 +95,7 @@ static void nfs4_evict_inode(struct inode *inode)
 	nfs_inode_return_delegation_noreclaim(inode);
 	/* Note that above delegreturn would trigger pnfs return-on-close */
 	pnfs_return_layout(inode);
-	pnfs_destroy_layout(NFS_I(inode));
+	pnfs_destroy_layout(NFS_I(inode), FLUSH_SYNC);
 	/* First call standard NFS clear_inode() code */
 	nfs_clear_inode(inode);
 }
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 3bcd669..0884f37 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -678,7 +678,7 @@ pnfs_free_lseg_list(struct list_head *free_me)
 }
 
 void
-pnfs_destroy_layout(struct nfs_inode *nfsi)
+pnfs_destroy_layout(struct nfs_inode *nfsi, int how)
 {
 	struct pnfs_layout_hdr *lo;
 	LIST_HEAD(tmp_list);
@@ -692,7 +692,7 @@ pnfs_destroy_layout(struct nfs_inode *nfsi)
 		pnfs_layout_clear_fail_bit(lo, NFS_LAYOUT_RW_FAILED);
 		spin_unlock(&nfsi->vfs_inode.i_lock);
 		pnfs_free_lseg_list(&tmp_list);
-		nfs_commit_inode(&nfsi->vfs_inode, 0);
+		nfs_commit_inode(&nfsi->vfs_inode, how);
 		pnfs_put_layout_hdr(lo);
 	} else
 		spin_unlock(&nfsi->vfs_inode.i_lock);
@@ -1831,7 +1831,7 @@ pnfs_update_layout(struct inode *ino,
 		}
 		/* Destroy the existing layout and start over */
 		if (time_after(jiffies, giveup))
-			pnfs_destroy_layout(NFS_I(ino));
+			pnfs_destroy_layout(NFS_I(ino), 0);
 		/* Fallthrough */
 	case -EAGAIN:
 		break;
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 87f144f..51baf59 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -244,7 +244,7 @@ size_t pnfs_generic_pg_test(struct nfs_pageio_descriptor *pgio,
 void pnfs_set_lo_fail(struct pnfs_layout_segment *lseg);
 struct pnfs_layout_segment *pnfs_layout_process(struct nfs4_layoutget *lgp);
 void pnfs_free_lseg_list(struct list_head *tmp_list);
-void pnfs_destroy_layout(struct nfs_inode *);
+void pnfs_destroy_layout(struct nfs_inode *, int how);
 void pnfs_destroy_all_layouts(struct nfs_client *);
 int pnfs_destroy_layouts_byfsid(struct nfs_client *clp, struct nfs_fsid *fsid,
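The shape of the fix, for readers skimming the diff: pnfs_destroy_layout() gains a `how` argument that is forwarded to nfs_commit_inode(), so the eviction path (nfs4_evict_inode()) can pass FLUSH_SYNC and wait for outstanding COMMITs before clear_inode() runs its sanity checks, while the error-recovery and layout-retry call sites keep the old asynchronous behaviour by passing 0. Below is a sketch of how the patched function reads; the body between the two hunks is filled in from memory of the 4.14 source and is an assumption, not part of the posted patch:

```c
/* fs/nfs/pnfs.c with the patch applied. The lines between the two hunks
 * are reconstructed from memory of 4.14 and should be treated as an
 * approximation of the real context. */
void
pnfs_destroy_layout(struct nfs_inode *nfsi, int how)
{
	struct pnfs_layout_hdr *lo;
	LIST_HEAD(tmp_list);

	spin_lock(&nfsi->vfs_inode.i_lock);
	lo = nfsi->layout;
	if (lo) {
		pnfs_get_layout_hdr(lo);
		pnfs_mark_layout_stateid_invalid(lo, &tmp_list);
		pnfs_layout_clear_fail_bit(lo, NFS_LAYOUT_RO_FAILED);
		pnfs_layout_clear_fail_bit(lo, NFS_LAYOUT_RW_FAILED);
		spin_unlock(&nfsi->vfs_inode.i_lock);
		pnfs_free_lseg_list(&tmp_list);
		/* FLUSH_SYNC (from eviction) waits for the commit so no pages
		 * remain on the inode; 0 keeps the old asynchronous commit. */
		nfs_commit_inode(&nfsi->vfs_inode, how);
		pnfs_put_layout_hdr(lo);
	} else
		spin_unlock(&nfsi->vfs_inode.i_lock);
}
```

The nfs4super.c hunk carries the behavioural change; the filelayout, flexfilelayout, and pnfs_update_layout() hunks pass 0 and are mechanical signature updates.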