Message ID: 165516173293.21248.14587048046993234326.stgit@noble.brown (mailing list archive)
Series: Allow concurrent directory updates.
Neil,

Firstly, thank you for your work on this. I'm probably the main beneficiary of this (NFSD) effort atm, so I feel extra special and lucky!

I have done some quick artificial tests similar to before, where I am using an NFS server and client separated by an (extreme) 200ms of latency (great for testing parallelism). I am only using NFSv3 due to the NFSD_CACHE_SIZE_SLOTS_PER_SESSION parallelism limitations for NFSv4.

Firstly, a client direct to server (VFS) with 10 simultaneous create processes hitting the same directory:

client1 # for x in {1..1000}; do echo /srv/server1/data/touch.$x; done | xargs -n1 -P 10 -iX -t touch X 2>&1 | pv -l -a >|/dev/null

Without the patch (on the client), this reaches a steady state of 2.4 creates/s, and increasing the number of parallel create processes does not change this aggregate performance. With the patch, the creation rate increases to 15 creates/s, and with 100 processes it further scales up to 121 creates/s.

Now for the re-export case (NFSD), where an intermediary server re-exports the originating server (200ms away) to clients on its local LAN: there is no noticeable improvement for a single (not patched) client, but we do see an aggregate improvement when we use multiple clients at once.

# pdsh -Rssh -w 'client[1-10]' 'for x in {1..1000}; do echo /srv/reexport1/data/$(hostname -s).$x; done | xargs -n1 -P 10 -iX -t touch X 2>&1' | pv -l -a >|/dev/null

Without the patch applied to the reexport server, the aggregate is around 2.2 creates/s, which is similar to doing it directly to the originating server from a single client (above). With the patch, the aggregate increases to 15 creates/s for 10 clients, which again matches the results of a single patched client. Not quite a x10 increase, but a healthy improvement nonetheless.

However, it is at this point that I started to experience some stability issues with the re-export server that are not present with the vanilla unpatched v5.19-rc2 kernel. In particular, the knfsd threads start to lock up with stack traces like this:

[ 1234.460696] INFO: task nfsd:5514 blocked for more than 123 seconds.
[ 1234.461481] Tainted: G W E 5.19.0-1.dneg.x86_64 #1
[ 1234.462289] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1234.463227] task:nfsd state:D stack: 0 pid: 5514 ppid: 2 flags:0x00004000
[ 1234.464212] Call Trace:
[ 1234.464677] <TASK>
[ 1234.465104] __schedule+0x2a9/0x8a0
[ 1234.465663] schedule+0x55/0xc0
[ 1234.466183] ? nfs_lookup_revalidate_dentry+0x3a0/0x3a0 [nfs]
[ 1234.466995] __nfs_lookup_revalidate+0xdf/0x120 [nfs]
[ 1234.467732] ? put_prev_task_stop+0x170/0x170
[ 1234.468374] nfs_lookup_revalidate+0x15/0x20 [nfs]
[ 1234.469073] lookup_dcache+0x5a/0x80
[ 1234.469639] lookup_one_unlocked+0x59/0xa0
[ 1234.470244] lookup_one_len_unlocked+0x1d/0x20
[ 1234.470951] nfsd_lookup_dentry+0x190/0x470 [nfsd]
[ 1234.471663] nfsd_lookup+0x88/0x1b0 [nfsd]
[ 1234.472294] nfsd3_proc_lookup+0xb4/0x100 [nfsd]
[ 1234.473012] nfsd_dispatch+0x161/0x290 [nfsd]
[ 1234.473689] svc_process_common+0x48a/0x620 [sunrpc]
[ 1234.474402] ? nfsd_svc+0x330/0x330 [nfsd]
[ 1234.475038] ? nfsd_shutdown_threads+0xa0/0xa0 [nfsd]
[ 1234.475772] svc_process+0xbc/0xf0 [sunrpc]
[ 1234.476408] nfsd+0xda/0x190 [nfsd]
[ 1234.477011] kthread+0xf0/0x120
[ 1234.477522] ? kthread_complete_and_exit+0x20/0x20
[ 1234.478199] ret_from_fork+0x22/0x30
[ 1234.478755] </TASK>

For whatever reason, they seem to affect our Netapp mounts and re-exports rather than our originating Linux NFS servers (against which all tests were done).
This may be related to the fact that those Netapps serve our home directories, so there could be some unique locking patterns going on there. This issue made things a bit too unstable to test at larger scales or with our production workloads.

So all in all, the performance improvements in the knfsd re-export case are looking great and we have real world use cases that this helps with (batch processing workloads with latencies >10ms). If we can figure out the hanging knfsd threads, then I can test it more heavily.

Many thanks,

Daire

On Tue, 14 Jun 2022 at 00:19, NeilBrown <neilb@suse.de> wrote:
>
> VFS currently holds an exclusive lock on a directory during create,
> unlink, rename. This imposes serialisation on all filesystems though
> some may not benefit from it, and some may be able to provide finer
> grained locking internally, thus reducing contention.
>
> This series allows the filesystem to request that the inode lock be
> shared rather than exclusive. In that case an exclusive lock will be
> held on the dentry instead, much as is done for parallel lookup.
>
> The NFS filesystem can easily support concurrent updates (server does
> any needed serialisation) so it is converted.
>
> This series also converts nfsd to use the new interfaces so concurrent
> incoming NFS requests in the one directory can be handled concurrently.
>
> As a net result, if an NFS mounted filesystem is reexported over NFS,
> then multiple clients can create files in a single directory and all
> synchronisation will be handled on the final server. This helps hide
> latency on the link from client to server.
>
> I include a few nfsd patches that aren't strictly needed for this work,
> but seem to be a logical consequence of the changes that I did have to
> make.
>
> I have only tested this lightly. In particular the rename support is
> quite new and I haven't tried to break it yet.
>
> I post this for general review, and hopefully extra testing... Daire
> Byrne has expressed interest in the NFS re-export parallelism.
>
> NeilBrown
>
>
> ---
>
> NeilBrown (12):
>       VFS: support parallel updates in the one directory.
>       VFS: move EEXIST and ENOENT tests into lookup_hash_update()
>       VFS: move want_write checks into lookup_hash_update()
>       VFS: move dput() and mnt_drop_write() into done_path_update()
>       VFS: export done_path_update()
>       VFS: support concurrent renames.
>       NFS: support parallel updates in the one directory.
>       nfsd: allow parallel creates from nfsd
>       nfsd: support concurrent renames.
>       nfsd: reduce locking in nfsd_lookup()
>       nfsd: use (un)lock_inode instead of fh_(un)lock
>       nfsd: discard fh_locked flag and fh_lock/fh_unlock
>
>
>  fs/dcache.c            |  59 ++++-
>  fs/namei.c             | 578 ++++++++++++++++++++++++++++++++---------
>  fs/nfs/dir.c           |  29 ++-
>  fs/nfs/inode.c         |   2 +
>  fs/nfs/unlink.c        |   5 +-
>  fs/nfsd/nfs2acl.c      |   6 +-
>  fs/nfsd/nfs3acl.c      |   4 +-
>  fs/nfsd/nfs3proc.c     |  37 +--
>  fs/nfsd/nfs4acl.c      |   7 +-
>  fs/nfsd/nfs4proc.c     |  61 ++---
>  fs/nfsd/nfs4state.c    |   8 +-
>  fs/nfsd/nfsfh.c        |  10 +-
>  fs/nfsd/nfsfh.h        |  58 +----
>  fs/nfsd/nfsproc.c      |  31 +--
>  fs/nfsd/vfs.c          | 243 ++++++++---------
>  fs/nfsd/vfs.h          |   8 +-
>  include/linux/dcache.h |  27 ++
>  include/linux/fs.h     |   1 +
>  include/linux/namei.h  |  30 ++-
>  19 files changed, 791 insertions(+), 413 deletions(-)
>
> --
> Signature
>
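The locking model the cover letter describes (a shared lock on the directory inode, with per-name exclusion pushed down to the dentry, much like parallel lookup) can be pictured roughly as below. This is only a hedged sketch built from names that appear elsewhere in this thread (DCACHE_PAR_UPDATE, d_unlock_update()); in the series itself the VFS helpers named in the patch titles (lookup_hash_update(), done_path_update()) appear to do this work on behalf of a filesystem that asks for a shared lock, and the real code handles waiting, refcounting and error paths that are omitted here.

    /*
     * Sketch: one directory update (create/unlink of "name") with the
     * parent inode held only shared.  Exclusion against a concurrent
     * update of the same name comes from marking the dentry rather
     * than from an exclusive i_rwsem.
     */
    struct inode *dir = d_inode(parent);
    struct dentry *dentry;
    int err;

    inode_lock_shared(dir);                 /* series: shared instead of exclusive */
    dentry = lookup_one_len(name, parent, strlen(name));

    spin_lock(&dentry->d_lock);
    dentry->d_flags |= DCACHE_PAR_UPDATE;   /* claim this name; other updaters wait */
    spin_unlock(&dentry->d_lock);

    err = do_update(dir, dentry);           /* hypothetical stand-in for the real
                                             * ->create()/->unlink() work */

    d_unlock_update(dentry);                /* drop the claim and wake any waiters */
    dput(dentry);
    inode_unlock_shared(dir);

Multiple processes can then run this against the same directory at once, each serialised only against updates of the same name, which is what allows the create rates above to scale with the number of concurrent processes.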
On Wed, 15 Jun 2022, Daire Byrne wrote:
...
> With the patch, the aggregate increases to 15 creates/s for 10 clients
> which again matches the results of a single patched client. Not quite
> a x10 increase but a healthy improvement nonetheless.

Great!

>
> However, it is at this point that I started to experience some
> stability issues with the re-export server that are not present with
> the vanilla unpatched v5.19-rc2 kernel. In particular the knfsd
> threads start to lock up with stack traces like this:
>
> [ 1234.460696] INFO: task nfsd:5514 blocked for more than 123 seconds.
> [ 1234.461481] Tainted: G W E 5.19.0-1.dneg.x86_64 #1
> [ 1234.462289] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 1234.463227] task:nfsd state:D stack: 0 pid: 5514
> ppid: 2 flags:0x00004000
> [ 1234.464212] Call Trace:
> [ 1234.464677] <TASK>
> [ 1234.465104] __schedule+0x2a9/0x8a0
> [ 1234.465663] schedule+0x55/0xc0
> [ 1234.466183] ? nfs_lookup_revalidate_dentry+0x3a0/0x3a0 [nfs]
> [ 1234.466995] __nfs_lookup_revalidate+0xdf/0x120 [nfs]

I can see the cause of this - I forgot a wakeup. This patch should fix
it, though I hope to find a better solution.

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 54c2c7adcd56..072130d000c4 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -2483,17 +2483,16 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
 	if (!(dentry->d_flags & DCACHE_PAR_UPDATE)) {
 		/* Must have exclusive lock on parent */
 		did_set_par_update = true;
+		lock_acquire_exclusive(&dentry->d_update_map, 0,
+				       0, NULL, _THIS_IP_);
 		dentry->d_flags |= DCACHE_PAR_UPDATE;
 	}

 	spin_unlock(&dentry->d_lock);
 	error = nfs_safe_remove(dentry);
 	nfs_dentry_remove_handle_error(dir, dentry, error);
-	if (did_set_par_update) {
-		spin_lock(&dentry->d_lock);
-		dentry->d_flags &= ~DCACHE_PAR_UPDATE;
-		spin_unlock(&dentry->d_lock);
-	}
+	if (did_set_par_update)
+		d_unlock_update(dentry);
 out:
 	trace_nfs_unlink_exit(dir, dentry, error);
 	return error;

>
> So all in all, the performance improvements in the knfsd re-export
> case are looking great and we have real world use cases that this helps
> with (batch processing workloads with latencies >10ms). If we can
> figure out the hanging knfsd threads, then I can test it more heavily.

Hopefully the above patch will allow the more heavy testing to continue.
In any case, thanks a lot for the testing so far,

NeilBrown
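The hang reported above fits this diff: nfs_unlink() cleared DCACHE_PAR_UPDATE by hand, so a task already sleeping in __nfs_lookup_revalidate() waiting for that update to finish was apparently never woken. Switching to d_unlock_update() presumably adds the missing wake-up and releases the lockdep map taken by the new lock_acquire_exclusive() call. A rough sketch of the difference, using d_wake_updaters() as a hypothetical name for whatever wait primitive the series really uses:

    /* Old path: clears the flag, but nobody sleeping on it is woken. */
    spin_lock(&dentry->d_lock);
    dentry->d_flags &= ~DCACHE_PAR_UPDATE;
    spin_unlock(&dentry->d_lock);

    /*
     * What d_unlock_update() is assumed to do instead (sketch only,
     * not the real helper body): clear the flag, wake waiters, and
     * tell lockdep the per-dentry "update lock" has been released.
     */
    static inline void d_unlock_update_sketch(struct dentry *dentry)
    {
    	spin_lock(&dentry->d_lock);
    	dentry->d_flags &= ~DCACHE_PAR_UPDATE;
    	d_wake_updaters(dentry);        /* hypothetical: wake tasks blocked on this name */
    	spin_unlock(&dentry->d_lock);
    	lock_map_release(&dentry->d_update_map);  /* pairs with lock_acquire_exclusive() */
    }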
On Thu, 16 Jun 2022 at 01:56, NeilBrown <neilb@suse.de> wrote:
>
> On Wed, 15 Jun 2022, Daire Byrne wrote:
> ..
> > However, it is at this point that I started to experience some
> > stability issues with the re-export server that are not present with
> > the vanilla unpatched v5.19-rc2 kernel. In particular the knfsd
> > threads start to lock up with stack traces like this:
> >
> > [ 1234.460696] INFO: task nfsd:5514 blocked for more than 123 seconds.
> > [ 1234.461481] Tainted: G W E 5.19.0-1.dneg.x86_64 #1
> > [ 1234.462289] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [ 1234.463227] task:nfsd state:D stack: 0 pid: 5514
> > ppid: 2 flags:0x00004000
> > [ 1234.464212] Call Trace:
> > [ 1234.464677] <TASK>
> > [ 1234.465104] __schedule+0x2a9/0x8a0
> > [ 1234.465663] schedule+0x55/0xc0
> > [ 1234.466183] ? nfs_lookup_revalidate_dentry+0x3a0/0x3a0 [nfs]
> > [ 1234.466995] __nfs_lookup_revalidate+0xdf/0x120 [nfs]
>
> I can see the cause of this - I forgot a wakeup. This patch should fix
> it, though I hope to find a better solution.
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 54c2c7adcd56..072130d000c4 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -2483,17 +2483,16 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
>  	if (!(dentry->d_flags & DCACHE_PAR_UPDATE)) {
>  		/* Must have exclusive lock on parent */
>  		did_set_par_update = true;
> +		lock_acquire_exclusive(&dentry->d_update_map, 0,
> +				       0, NULL, _THIS_IP_);
>  		dentry->d_flags |= DCACHE_PAR_UPDATE;
>  	}
>
>  	spin_unlock(&dentry->d_lock);
>  	error = nfs_safe_remove(dentry);
>  	nfs_dentry_remove_handle_error(dir, dentry, error);
> -	if (did_set_par_update) {
> -		spin_lock(&dentry->d_lock);
> -		dentry->d_flags &= ~DCACHE_PAR_UPDATE;
> -		spin_unlock(&dentry->d_lock);
> -	}
> +	if (did_set_par_update)
> +		d_unlock_update(dentry);
>  out:
>  	trace_nfs_unlink_exit(dir, dentry, error);
>  	return error;
>
> >
> > So all in all, the performance improvements in the knfsd re-export
> > case are looking great and we have real world use cases that this helps
> > with (batch processing workloads with latencies >10ms). If we can
> > figure out the hanging knfsd threads, then I can test it more heavily.
>
> Hopefully the above patch will allow the more heavy testing to continue.
> In any case, thanks a lot for the testing so far,

Patch applied, but unfortunately I'm still getting the same trace. This
time I also captured a preceding stack for a hung process local to the
reexport server - I wonder if it's happening somewhere in the VFS
changes rather than in nfsd, which then exports the path?

[  373.930506] INFO: task XXXX:5072 blocked for more than 122 seconds.
[  373.931410] Tainted: G W E 5.19.0-3.dneg.x86_64 #1
[  373.932313] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  373.933442] task:XXXX state:D stack: 0 pid: 5072 ppid: 1 flags:0x00000000
[  373.934639] Call Trace:
[  373.935007] <TASK>
[  373.935306] __schedule+0x2a9/0x8a0
[  373.935844] schedule+0x55/0xc0
[  373.936294] ? nfs_lookup_revalidate_dentry+0x3a0/0x3a0 [nfs]
[  373.937137] __nfs_lookup_revalidate+0xdf/0x120 [nfs]
[  373.937875] ? put_prev_task_stop+0x170/0x170
[  373.938525] nfs_lookup_revalidate+0x15/0x20 [nfs]
[  373.939226] lookup_fast+0xda/0x150
[  373.939756] path_openat+0x12a/0x1090
[  373.940293] ? __filemap_fdatawrite_range+0x54/0x70
[  373.941100] do_filp_open+0xb2/0x120
[  373.941635] ? hashlen_string+0xd0/0xd0
[  373.942190] ? _raw_spin_unlock+0xe/0x30
[  373.942766] do_sys_openat2+0x245/0x320
[  373.943305] do_sys_open+0x46/0x80
[  373.943839] __x64_sys_open+0x21/0x30
[  373.944428] do_syscall_64+0x3b/0x90
[  373.944979] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  373.945688] RIP: 0033:0x7fcd80ceeeb0
[  373.946226] RSP: 002b:00007fff90fd8298 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[  373.947330] RAX: ffffffffffffffda RBX: 00007fcd81d6e981 RCX: 00007fcd80ceeeb0
[  373.947333] RDX: 00000000000001b6 RSI: 0000000000000000 RDI: 00007fff90fd8360
[  373.947334] RBP: 00007fff90fd82f0 R08: 00007fcd81d6e986 R09: 0000000000000000
[  373.947335] R10: 0000000000000024 R11: 0000000000000246 R12: 0000000000cd6110
[  373.947337] R13: 0000000000000008 R14: 00007fff90fd8360 R15: 00007fff90fdb580
[  373.947339] </TASK>
[  373.947421] INFO: task nfsd:5696 blocked for more than 122 seconds.
[  373.947423] Tainted: G W E 5.19.0-3.dneg.x86_64 #1
[  373.947424] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  373.947425] task:nfsd state:D stack: 0 pid: 5696 ppid: 2 flags:0x00004000
[  373.947428] Call Trace:
[  373.947429] <TASK>
[  373.947430] __schedule+0x2a9/0x8a0
[  373.947434] schedule+0x55/0xc0
[  373.947436] ? nfs_lookup_revalidate_dentry+0x3a0/0x3a0 [nfs]
[  373.947451] __nfs_lookup_revalidate+0xdf/0x120 [nfs]
[  373.947464] ? put_prev_task_stop+0x170/0x170
[  373.947466] nfs_lookup_revalidate+0x15/0x20 [nfs]
[  373.947478] lookup_dcache+0x5a/0x80
[  373.947481] lookup_one_unlocked+0x59/0xa0
[  373.947484] lookup_one_len_unlocked+0x1d/0x20
[  373.947487] nfsd_lookup_dentry+0x190/0x470 [nfsd]
[  373.947509] nfsd_lookup+0x88/0x1b0 [nfsd]
[  373.947522] nfsd3_proc_lookup+0xb4/0x100 [nfsd]
[  373.947537] nfsd_dispatch+0x161/0x290 [nfsd]
[  373.947551] svc_process_common+0x48a/0x620 [sunrpc]
[  373.947589] ? nfsd_svc+0x330/0x330 [nfsd]
[  373.947602] ? nfsd_shutdown_threads+0xa0/0xa0 [nfsd]
[  373.947621] svc_process+0xbc/0xf0 [sunrpc]
[  373.951088] nfsd+0xda/0x190 [nfsd]
[  373.951136] kthread+0xf0/0x120
[  373.951138] ? kthread_complete_and_exit+0x20/0x20
[  373.951140] ret_from_fork+0x22/0x30
[  373.951149] </TASK>

I double checked that the patch had been applied and I hadn't made a
mistake with installation.

I could perhaps try running with just the VFS patches to see if I can
still reproduce the "local" VFS hang without the nfsd patches? Your
previous VFS only patchset was stable for me.

Daire
Hi Neil, On Wed, Jun 15, 2022 at 9:23 PM NeilBrown <neilb@suse.de> wrote: > > On Wed, 15 Jun 2022, Daire Byrne wrote: > ... > > With the patch, the aggregate increases to 15 creates/s for 10 clients > > which again matches the results of a single patched client. Not quite > > a x10 increase but a healthy improvement nonetheless. > > Great! > > > > > However, it is at this point that I started to experience some > > stability issues with the re-export server that are not present with > > the vanilla unpatched v5.19-rc2 kernel. In particular the knfsd > > threads start to lock up with stack traces like this: > > > > [ 1234.460696] INFO: task nfsd:5514 blocked for more than 123 seconds. > > [ 1234.461481] Tainted: G W E 5.19.0-1.dneg.x86_64 #1 > > [ 1234.462289] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables this message. > > [ 1234.463227] task:nfsd state:D stack: 0 pid: 5514 > > ppid: 2 flags:0x00004000 > > [ 1234.464212] Call Trace: > > [ 1234.464677] <TASK> > > [ 1234.465104] __schedule+0x2a9/0x8a0 > > [ 1234.465663] schedule+0x55/0xc0 > > [ 1234.466183] ? nfs_lookup_revalidate_dentry+0x3a0/0x3a0 [nfs] > > [ 1234.466995] __nfs_lookup_revalidate+0xdf/0x120 [nfs] > > I can see the cause of this - I forget a wakeup. This patch should fix > it, though I hope to find a better solution. I've applied as far as the NFS client patch plus the extra fix and I'm seeing this stack trace on my client when I try to run cthon tests (I was seeing it without the extra change as well): anna@gouda ~ % virsh console client Connected to domain 'client' Escape character is ^] (Ctrl + ]) [ 148.977712] BUG: unable to handle page fault for address: ffffa56a03e2ff70 [ 148.978892] #PF: supervisor read access in kernel mode [ 148.979504] #PF: error_code(0x0000) - not-present page [ 148.980071] PGD 100000067 P4D 100000067 PUD 1001bb067 PMD 103164067 PTE 0 [ 148.980859] Oops: 0000 [#1] PREEMPT SMP PTI [ 148.981323] CPU: 2 PID: 683 Comm: test7 Tainted: G W 5.19.0-rc1-g3de25d30cb97-dirty #33479 5f05aa11df373b898aeb9103b7a17e0ee3946022 [ 148.982824] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 148.984016] RIP: 0010:d_alloc_parallel+0x1ad/0x5a0 [ 148.984609] Code: 48 c7 c2 ff ff ff ff 89 c1 48 d3 e2 48 f7 d2 4d 31 c1 49 85 d1 0f 84 e2 00 00 00 66 90 4d 8b 6d 00 4d 85 ed 0f 84 3e 03 00 00 <45> 39 95 70 ff ff ff 75 ea 4d 39 a5 68 ff ff ff 75 e1 4d 8d b5 50 [ 148.986724] RSP: 0018:ffffa56a03e1bc50 EFLAGS: 00010286 [ 148.987302] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9101b9400000 [ 148.988084] RDX: 00000006ebc26d0c RSI: ffff9100c30b9020 RDI: ffff9100cb523480 [ 148.988978] RBP: 00000000000009fc R08: 00000000000000c0 R09: ffff9100cb76d180 [ 148.989752] R10: 00000000ebc26d0c R11: ffffffff846e8750 R12: ffff9100cb523480 [ 148.990891] R13: ffffa56a03e30000 R14: ffff9100cb5234d8 R15: ffffa56a03e1bdc8 [ 148.991741] FS: 00007f0c00660540(0000) GS:ffff9101b9d00000(0000) knlGS:0000000000000000 [ 148.992995] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 148.993686] CR2: ffffa56a03e2ff70 CR3: 000000010eaa2004 CR4: 0000000000370ee0 [ 148.994494] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 148.995638] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 148.996462] Call Trace: [ 148.996752] <TASK> [ 148.997036] path_openat+0x290/0xd80 [ 148.997472] do_filp_open+0xbe/0x160 [ 148.998136] do_sys_openat2+0x8e/0x160 [ 148.998609] __x64_sys_creat+0x47/0x70 [ 148.999057] do_syscall_64+0x31/0x50 [ 148.999479] 
entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 149.000040] RIP: 0033:0x7f0c00502487 [ 149.000444] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 55 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 74 24 0c 48 89 3c 24 e8 [ 149.002576] RSP: 002b:00007ffc01181a98 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 [ 149.003610] RAX: ffffffffffffffda RBX: 00007ffc01184d28 RCX: 00007f0c00502487 [ 149.004511] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 00007ffc01181ae0 [ 149.005351] RBP: 00007ffc01182af0 R08: 0000000000000000 R09: 0000000000000064 [ 149.006160] R10: 00007f0c00406c68 R11: 0000000000000246 R12: 0000000000000000 [ 149.006939] R13: 00007ffc01184d40 R14: 0000000000000000 R15: 00007f0c0078b000 [ 149.007882] </TASK> [ 149.008135] Modules linked in: cbc cts rpcsec_gss_krb5 nfsv4 nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core cfg80211 rfkill 8021q garp stp mrp llc intel_rapl_msr intel_rapl_common kvm_intel kvm snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel iTCO_wdt joydev snd_intel_dspcfg intel_pmc_bxt mousedev iTCO_vendor_support vfat snd_hda_codec fat snd_hwdep snd_hda_core aesni_intel crypto_simd cryptd rapl snd_pcm intel_agp qxl psmouse snd_timer i2c_i801 pcspkr i2c_smbus lpc_ich snd soundcore usbhid drm_ttm_helper intel_gtt mac_hid ttm nfsd nfs_acl lockd auth_rpcgss usbip_host grace usbip_core fuse sunrpc qemu_fw_cfg ip_tables x_tables xfs libcrc32c crc32c_generic serio_raw atkbd libps2 9pnet_virtio virtio_rng 9pnet vivaldi_fmap rng_core virtio_balloon virtio_net virtio_blk net_failover virtio_console virtio_scsi failover crc32c_intel xhci_pci i8042 virtio_pci virtio_pci_legacy_dev xhci_pci_renesas virtio_pci_modern_dev serio [ 149.020778] CR2: ffffa56a03e2ff70 [ 149.021803] ---[ end trace 0000000000000000 ]--- [ 149.021810] RIP: 0010:d_alloc_parallel+0x1ad/0x5a0 [ 149.021815] Code: 48 c7 c2 ff ff ff ff 89 c1 48 d3 e2 48 f7 d2 4d 31 c1 49 85 d1 0f 84 e2 00 00 00 66 90 4d 8b 6d 00 4d 85 ed 0f 84 3e 03 00 00 <45> 39 95 70 ff ff ff 75 ea 4d 39 a5 68 ff ff ff 75 e1 4d 8d b5 50 [ 149.021817] RSP: 0018:ffffa56a03e1bc50 EFLAGS: 00010286 [ 149.021819] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9101b9400000 [ 149.021820] RDX: 00000006ebc26d0c RSI: ffff9100c30b9020 RDI: ffff9100cb523480 [ 149.021821] RBP: 00000000000009fc R08: 00000000000000c0 R09: ffff9100cb76d180 [ 149.021822] R10: 00000000ebc26d0c R11: ffffffff846e8750 R12: ffff9100cb523480 [ 149.021823] R13: ffffa56a03e30000 R14: ffff9100cb5234d8 R15: ffffa56a03e1bdc8 [ 149.021824] FS: 00007f0c00660540(0000) GS:ffff9101b9d00000(0000) knlGS:0000000000000000 [ 149.021826] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 149.021827] CR2: ffffa56a03e2ff70 CR3: 000000010eaa2004 CR4: 0000000000370ee0 [ 149.021831] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 149.021832] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 149.021833] note: test7[683] exited with preempt_count 1 [ 149.026665] BUG: unable to handle page fault for address: ffffa56a03e2ff70 [ 149.026668] #PF: supervisor read access in kernel mode [ 149.026669] #PF: error_code(0x0000) - not-present page [ 149.026670] PGD 100000067 P4D 100000067 PUD 1001bb067 PMD 103164067 PTE 0 [ 149.026674] Oops: 0000 [#2] PREEMPT SMP PTI [ 149.026674] Oops: 0000 [#2] PREEMPT SMP PTI [ 149.026676] CPU: 0 PID: 687 Comm: cthon.zsh Tainted: G D W 5.19.0-rc1-g3de25d30cb97-dirty #33479 5f05aa11df373b898aeb9103b7a17e0ee3946022 [ 
149.026679] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 149.026680] RIP: 0010:d_alloc_parallel+0x1ad/0x5a0 [ 149.026684] Code: 48 c7 c2 ff ff ff ff 89 c1 48 d3 e2 48 f7 d2 4d 31 c1 49 85 d1 0f 84 e2 00 00 00 66 90 4d 8b 6d 00 4d 85 ed 0f 84 3e 03 00 00 <45> 39 95 70 ff ff ff 75 ea 4d 39 a5 68 ff ff ff 75 e1 4d 8d b5 50 [ 149.026685] RSP: 0018:ffffa56a03e3bbf0 EFLAGS: 00010286 [ 149.026687] RAX: 000000000000002a RBX: 000000000000002a RCX: ffff9101b9400000 [ 149.026688] RDX: 000000047dd98e98 RSI: ffff9100c1da4030 RDI: ffff9100c5ea7540 [ 149.026689] RBP: 00000000000009fc R08: 00000000000000c0 R09: ffff9100cb481600 [ 149.026690] R10: 000000007dd98e98 R11: ffffffff846e7640 R12: ffff9100c5ea7540 [ 149.026691] R13: ffffa56a03e30000 R14: ffff9100c5ea7598 R15: ffffa56a03e3bda0 [ 149.026692] FS: 00007fa4232f9000(0000) GS:ffff9101b9c00000(0000) knlGS:0000000000000000 [ 149.026693] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 149.026694] CR2: ffffa56a03e2ff70 CR3: 0000000104b90002 CR4: 0000000000370ef0 [ 149.026698] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 149.026699] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 149.026700] Call Trace: [ 149.026702] <TASK> [ 149.026704] __lookup_slow+0x61/0x160 [ 149.026707] ? try_to_unlazy+0x14d/0x1f0 [ 149.026709] lookup_slow+0x33/0x50 [ 149.026711] walk_component+0x132/0x150 [ 149.026714] path_lookupat+0x4d/0x100 [ 149.026716] filename_lookup+0xeb/0x200 [ 149.026718] user_path_at_empty+0x36/0x90 [ 149.026721] do_faccessat+0x124/0x290 [ 149.026723] do_syscall_64+0x31/0x50 [ 149.026726] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 149.026729] RIP: 0033:0x7fa423101ceb [ 149.026731] Code: 77 05 c3 0f 1f 40 00 48 8b 15 a9 b0 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa b8 15 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 79 b0 0f 00 f7 d8 [ 149.026732] RSP: 002b:00007ffd7374d6a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000015 [ 149.026734] RAX: ffffffffffffffda RBX: 00007ffd7374d760 RCX: 00007fa423101ceb [ 149.026735] RDX: 000000000000002f RSI: 0000000000000001 RDI: 00007ffd7374d760 [ 149.026736] RBP: 00007ffd7374d760 R08: 0000000000000001 R09: 0000000000000000 [ 149.026737] R10: 0000000000000001 R11: 0000000000000202 R12: 000055b0c36db9d8 [ 149.026738] R13: 00007ffd7374e760 R14: 00007ffd7374d770 R15: 00007fa42326ebb0 [ 149.026740] </TASK> [ 149.026740] Modules linked in: cbc cts rpcsec_gss_krb5 nfsv4 nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core cfg80211 rfkill 8021q garp stp mrp llc intel_rapl_msr intel_rapl_common kvm_intel kvm snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel iTCO_wdt joydev snd_intel_dspcfg intel_pmc_bxt mousedev iTCO_vendor_support vfat snd_hda_codec fat snd_hwdep snd_hda_core aesni_intel crypto_simd cryptd rapl snd_pcm intel_agp qxl psmouse snd_timer i2c_i801 pcspkr i2c_smbus lpc_ich snd soundcore usbhid drm_ttm_helper intel_gtt mac_hid ttm nfsd nfs_acl lockd auth_rpcgss usbip_host grace usbip_core fuse sunrpc qemu_fw_cfg ip_tables x_tables xfs libcrc32c crc32c_generic serio_raw atkbd libps2 9pnet_virtio virtio_rng 9pnet vivaldi_fmap rng_core virtio_balloon virtio_net virtio_blk net_failover virtio_console virtio_scsi failover crc32c_intel xhci_pci i8042 virtio_pci virtio_pci_legacy_dev xhci_pci_renesas virtio_pci_modern_dev serio [ 149.026793] CR2: ffffa56a03e2ff70 [ 149.026795] ---[ end trace 0000000000000000 ]--- [ 149.026795] RIP: 0010:d_alloc_parallel+0x1ad/0x5a0 [ 
149.026797] Code: 48 c7 c2 ff ff ff ff 89 c1 48 d3 e2 48 f7 d2 4d 31 c1 49 85 d1 0f 84 e2 00 00 00 66 90 4d 8b 6d 00 4d 85 ed 0f 84 3e 03 00 00 <45> 39 95 70 ff ff ff 75 ea 4d 39 a5 68 ff ff ff 75 e1 4d 8d b5 50 [ 149.026798] RSP: 0018:ffffa56a03e1bc50 EFLAGS: 00010286 [ 149.026799] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9101b9400000 [ 149.026800] RDX: 00000006ebc26d0c RSI: ffff9100c30b9020 RDI: ffff9100cb523480 [ 149.026801] RBP: 00000000000009fc R08: 00000000000000c0 R09: ffff9100cb76d180 [ 149.026802] R10: 00000000ebc26d0c R11: ffffffff846e8750 R12: ffff9100cb523480 [ 149.026803] R13: ffffa56a03e30000 R14: ffff9100cb5234d8 R15: ffffa56a03e1bdc8 [ 149.026803] FS: 00007fa4232f9000(0000) GS:ffff9101b9c00000(0000) knlGS:0000000000000000 [ 149.026805] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 149.026806] CR2: ffffa56a03e2ff70 CR3: 0000000104b90002 CR4: 0000000000370ef0 [ 149.026807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 149.026807] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 149.026808] note: cthon.zsh[687] exited with preempt_count 1 [ 149.039191] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 149.039194] BUG: unable to handle page fault for address: ffff9100cb775fb0 [ 149.039196] #PF: supervisor instruction fetch in kernel mode [ 149.039197] #PF: error_code(0x0011) - permissions violation [ 149.039199] PGD 1cfe05067 P4D 1cfe05067 PUD 100b7a063 PMD 800000010b6001e3 [ 149.039203] Oops: 0011 [#3] PREEMPT SMP PTI [ 149.039206] CPU: 1 PID: 15 Comm: pr/tty0 Tainted: G D W 5.19.0-rc1-g3de25d30cb97-dirty #33479 5f05aa11df373b898aeb9103b7a17e0ee3946022 [ 149.039210] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 149.039212] RIP: 0010:0xffff9100cb775fb0 [ 149.039215] Code: ff ff 80 5f 77 cb 00 91 ff ff d0 5b 77 cb 00 91 ff ff 20 35 52 cb 00 91 ff ff a0 5f 77 cb 00 91 ff ff a0 5f 77 cb 00 91 ff ff <30> b5 d7 cb 00 91 ff ff b0 de 9b 82 ff ff ff ff 00 00 00 00 00 00 [ 149.039217] RSP: 0018:ffffa56a00110f28 EFLAGS: 00010286 [ 149.039219] RAX: ffff9100c04ab530 RBX: 0000000000000004 RCX: 0000000000000004 [ 149.039220] RDX: ffff9100c01c4c10 RSI: 0000000000000000 RDI: ffff9100c04ab530 [ 149.039221] RBP: 0000000000000015 R08: 0000000080100006 R09: 0000000000100006 [ 149.039223] R10: 7fffffffffffffff R11: ffff9100cb775fb0 R12: ffff9101b9cb25f8 [ 149.039224] R13: ffffa56a00110f38 R14: ffff9101b9cb2580 R15: ffff9100c0318000 [ 149.039226] FS: 0000000000000000(0000) GS:ffff9101b9c80000(0000) knlGS:0000000000000000 [ 149.039227] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 149.039229] CR2: ffff9100cb775fb0 CR3: 0000000103698005 CR4: 0000000000370ee0 [ 149.039234] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 149.039235] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 149.039236] Call Trace: [ 149.039238] <IRQ> [ 149.039240] ? rcu_core+0x2f7/0x770 [ 149.039245] ? __do_softirq+0x152/0x2d1 [ 149.039249] ? irq_exit_rcu+0x6c/0xb0 [ 149.039252] ? sysvec_apic_timer_interrupt+0x6d/0x80 [ 149.039255] </IRQ> [ 149.039256] <TASK> [ 149.039257] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 [ 149.039263] ? bit_putcs+0x503/0x7c0 [ 149.039271] ? fbcon_putcs+0x251/0x280 [ 149.039273] ? bit_clear+0x100/0x100 [ 149.039276] ? fbcon_redraw+0x152/0x220 [ 149.039279] ? fbcon_scroll+0xd4/0x1b0 [ 149.039282] ? con_scroll+0x1b1/0x240 [ 149.039286] ? vt_console_print+0x1f7/0x450 [ 149.039289] ? __console_emit_next_record+0x30b/0x3a0 [ 149.039293] ? 
printk_kthread_func+0x4a4/0x600 [ 149.039295] ? wake_bit_function+0x70/0x70 [ 149.039298] ? console_cpu_notify+0x80/0x80 [ 149.039300] ? kthread+0xd8/0xf0 [ 149.039303] ? kthread_blkcg+0x30/0x30 [ 149.039305] ? ret_from_fork+0x22/0x30 [ 149.039310] </TASK> [ 149.039311] Modules linked in: cbc cts rpcsec_gss_krb5 nfsv4 nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core cfg80211 rfkill 8021q garp stp mrp llc intel_rapl_msr intel_rapl_common kvm_intel kvm snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel iTCO_wdt joydev snd_intel_dspcfg intel_pmc_bxt mousedev iTCO_vendor_support vfat snd_hda_codec fat snd_hwdep snd_hda_core aesni_intel crypto_simd cryptd rapl snd_pcm intel_agp qxl psmouse snd_timer i2c_i801 pcspkr i2c_smbus lpc_ich snd soundcore usbhid drm_ttm_helper intel_gtt mac_hid ttm nfsd nfs_acl lockd auth_rpcgss usbip_host grace usbip_core fuse sunrpc qemu_fw_cfg ip_tables x_tables xfs libcrc32c crc32c_generic serio_raw atkbd libps2 9pnet_virtio virtio_rng 9pnet vivaldi_fmap rng_core virtio_balloon virtio_net virtio_blk net_failover virtio_console virtio_scsi failover crc32c_intel xhci_pci i8042 virtio_pci virtio_pci_legacy_dev xhci_pci_renesas virtio_pci_modern_dev serio [ 149.039392] CR2: ffff9100cb775fb0 [ 149.039402] ---[ end trace 0000000000000000 ]--- [ 149.039403] RIP: 0010:d_alloc_parallel+0x1ad/0x5a0 [ 149.039407] Code: 48 c7 c2 ff ff ff ff 89 c1 48 d3 e2 48 f7 d2 4d 31 c1 49 85 d1 0f 84 e2 00 00 00 66 90 4d 8b 6d 00 4d 85 ed 0f 84 3e 03 00 00 <45> 39 95 70 ff ff ff 75 ea 4d 39 a5 68 ff ff ff 75 e1 4d 8d b5 50 [ 149.039408] RSP: 0018:ffffa56a03e1bc50 EFLAGS: 00010286 [ 149.039410] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9101b9400000 [ 149.039411] RDX: 00000006ebc26d0c RSI: ffff9100c30b9020 RDI: ffff9100cb523480 [ 149.039412] RBP: 00000000000009fc R08: 00000000000000c0 R09: ffff9100cb76d180 [ 149.039413] R10: 00000000ebc26d0c R11: ffffffff846e8750 R12: ffff9100cb523480 [ 149.039414] R13: ffffa56a03e30000 R14: ffff9100cb5234d8 R15: ffffa56a03e1bdc8 [ 149.039415] FS: 0000000000000000(0000) GS:ffff9101b9c80000(0000) knlGS:0000000000000000 [ 149.039417] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 149.039418] CR2: ffff9100cb775fb0 CR3: 0000000103698005 CR4: 0000000000370ee0 [ 149.039421] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 149.039422] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 149.039423] Kernel panic - not syncing: Fatal exception in interrupt [ 149.039611] Kernel Offset: 0x1600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Hopefully something in there helps you figure out what's going on! 
Anna > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > index 54c2c7adcd56..072130d000c4 100644 > --- a/fs/nfs/dir.c > +++ b/fs/nfs/dir.c > @@ -2483,17 +2483,16 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry) > if (!(dentry->d_flags & DCACHE_PAR_UPDATE)) { > /* Must have exclusive lock on parent */ > did_set_par_update = true; > + lock_acquire_exclusive(&dentry->d_update_map, 0, > + 0, NULL, _THIS_IP_); > dentry->d_flags |= DCACHE_PAR_UPDATE; > } > > spin_unlock(&dentry->d_lock); > error = nfs_safe_remove(dentry); > nfs_dentry_remove_handle_error(dir, dentry, error); > - if (did_set_par_update) { > - spin_lock(&dentry->d_lock); > - dentry->d_flags &= ~DCACHE_PAR_UPDATE; > - spin_unlock(&dentry->d_lock); > - } > + if (did_set_par_update) > + d_unlock_update(dentry); > out: > trace_nfs_unlink_exit(dir, dentry, error); > return error; > > > > > So all in all, the performance improvements in the knfsd re-export > > case is looking great and we have real world use cases that this helps > > with (batch processing workloads with latencies >10ms). If we can > > figure out the hanging knfsd threads, then I can test it more heavily. > > Hopefully the above patch will allow the more heavy testing to continue. > In any case, thanks a lot for the testing so far, > > NeilBrown
On Thu, 16 Jun 2022, Daire Byrne wrote:
>
> I double checked that the patch had been applied and I hadn't made a
> mistake with installation.

:-) always worth double checking...

>
> I could perhaps try running with just the VFS patches to see if I can
> still reproduce the "local" VFS hang without the nfsd patches? Your
> previous VFS only patchset was stable for me.

I've made quite a few changes since those VFS-only patches. Almost
certainly the problem is not in the nfsd code.

I think the following has a reasonable chance of making things better,
both for the problem you hit and the problem Anna hit. I haven't tested
it at all yet so no promises - up to you if you try it.

Thanks to both of you for the help with testing.

NeilBrown

diff --git a/fs/namei.c b/fs/namei.c
index 31ba4dbedfcf..6d0c955d407a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1609,7 +1609,7 @@ static struct dentry *__lookup_hash(const struct qstr *name,
 	if (IS_ERR(dentry))
 		return dentry;

-	if (wq && d_in_lookup(dentry))
+	if (wq && !d_in_lookup(dentry))
 		/* Must have raced with another thread doing the lookup */
 		return dentry;

@@ -1664,6 +1664,7 @@ static struct dentry *lookup_hash_update(const struct qstr *name,
 	}
 	if (flags & LOOKUP_EXCL) {
 		if (d_is_positive(dentry)) {
+			d_lookup_done(dentry);
 			dput(dentry);
 			err = -EEXIST;
 			goto out_err;
@@ -1671,6 +1672,7 @@ static struct dentry *lookup_hash_update(const struct qstr *name,
 	}
 	if (!(flags & LOOKUP_CREATE)) {
 		if (!dentry->d_inode) {
+			d_lookup_done(dentry);
 			dput(dentry);
 			err = -ENOENT;
 			goto out_err;
@@ -1687,6 +1689,8 @@ static struct dentry *lookup_hash_update(const struct qstr *name,
 	}
 	if (err2) {
 		err = err2;
+		d_lookup_done(dentry);
+		d_unlock_update(dentry);
 		dput(dentry);
 		goto out_err;
 	}
@@ -3273,6 +3277,7 @@ static struct dentry *lock_rename_lookup(struct dentry *p1, struct dentry *p2,
 	}
 	return NULL;
 out_unlock_2:
+	d_lookup_done(d1);
 	dput(d1);
 	d1 = d2;
 out_unlock_1:
@@ -3315,6 +3320,7 @@ static struct dentry *lock_rename_lookup(struct dentry *p1, struct dentry *p2,
 	*d2p = d2;
 	return p;
 unlock_out_4:
+	d_lookup_done(d1);
 	dput(d1);
 	d1 = d2;
 unlock_out_3:
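Two things stand out in this patch. The __lookup_hash() hunk flips the test to "if (wq && !d_in_lookup(dentry))": with the original sense, a dentry that was still in-lookup would be returned early without ->lookup() ever completing or d_lookup_done() being called, plausibly leaving it in-lookup forever and stranding anyone waiting on it (which would match the schedule() inside __nfs_lookup_revalidate() in Daire's traces). The added d_lookup_done() calls on the race and error paths matter because d_alloc_parallel() links the dentry to a wait queue on the caller's stack; dropping such a dentry without ending the lookup lets other walkers find it after that stack frame is gone, which is one plausible reading of Anna's d_alloc_parallel() oops (the faulting address sits in a stack mapping, just below R13). A minimal sketch of the mainline contract these hunks restore, simplified and with error handling trimmed:

    /*
     * Sketch of the d_alloc_parallel() contract.  The wait queue lives
     * on our stack, so a dentry we put into the in-lookup state must be
     * taken out of that state on every path before we drop it.
     */
    DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
    struct dentry *dentry = d_alloc_parallel(parent, &name, &wq);

    if (IS_ERR(dentry))
    	return dentry;
    if (!d_in_lookup(dentry))
    	return dentry;          /* another thread finished the lookup; use its result */

    /* ... do the ->lookup() / create / unlink work on the in-lookup dentry ... */

    if (failed) {
    	d_lookup_done(dentry);  /* unhash from the in-lookup list, wake waiters */
    	dput(dentry);
    	return ERR_PTR(err);
    }
    d_lookup_done(dentry);
    return dentry;

The err2 hunk pairs d_lookup_done() with d_unlock_update() for the same reason on the want-write failure path.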
On Fri, 17 Jun 2022 at 06:49, NeilBrown <neilb@suse.de> wrote:
>
> On Thu, 16 Jun 2022, Daire Byrne wrote:
> >
> > I double checked that the patch had been applied and I hadn't made a
> > mistake with installation.
>
> :-) always worth double checking...
>
> >
> > I could perhaps try running with just the VFS patches to see if I can
> > still reproduce the "local" VFS hang without the nfsd patches? Your
> > previous VFS only patchset was stable for me.
>
> I've made quite a few changes since those VFS-only patches. Almost
> certainly the problem is not in the nfsd code.
>
> I think the following has a reasonable chance of making things better,
> both for the problem you hit and the problem Anna hit. I haven't tested
> it at all yet so no promises - up to you if you try it.
>
> Thanks to both of you for the help with testing.
>
> NeilBrown

This patch does the job for me - no more stack traces and things have
been stable all day. I'm going to run some production loads over the
weekend and then I'll do some more artificial scale testing next week.

Thanks again for this work! Improving the parallelism anywhere we can,
for single clients and then for nfsd, is great for re-export servers
(especially once you add some "cloud" latency).

Cheers,

Daire
On Fri, 17 Jun 2022 at 16:27, Daire Byrne <daire@dneg.com> wrote:
> This patch does the job for me - no more stack traces and things have
> been stable all day. I'm going to run some production loads over the
> weekend and then I'll do some more artificial scale testing next week.
>
> Thanks again for this work! Improving the parallelism anywhere we can,
> for single clients and then for nfsd, is great for re-export servers
> (especially once you add some "cloud" latency).
>
> Cheers,
>
> Daire

The patch ran without incident with our production re-export workloads
over the weekend (which also helps audit v5.19-rc2).

I ran a couple more synthetic tests and got up to 100 clients of a
re-export server and ~113 creates/s aggregate to a single directory,
with 200ms latency to the originating server. This compares well with
the ~121 creates/s when using 100 threads on a single patched client
direct to the remote NFS server.

In other words, the NFSD portion of this patch is delivering almost the
same performance as the underlying VFS NFS client performance when
re-exporting that path to hundreds of clients.

Again, without this patch we can only sustain just under 3 creates/s in
both cases (VFS/NFSD) with 200ms latency. This is a great improvement
for our batch workloads with varying amounts of latency >10ms (cloud
networking).

Tested-by: Daire Byrne <daire@dneg.com>

Daire
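As a rough cross-check of these figures (a back-of-the-envelope estimate, not taken from the thread, assuming each create needs about two serialised round trips to the far server, e.g. LOOKUP + CREATE, and that without the patch all creates in one directory queue behind the parent directory lock):

    serialised:       1 / (2 x 0.2 s)  =  2.5 creates/s   (observed: ~2.4-3 creates/s)
    ~100 in flight: 100 / (2 x 0.2 s)  =  250 creates/s   (observed ceiling: ~113-121 creates/s)

So the unpatched numbers are almost exactly latency-bound, and the patched numbers reach roughly half of the naive parallel limit, presumably because some serialisation remains on the originating server or elsewhere in the stack.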