Message ID: 20240515091727.22034-1-laoar.shao@gmail.com (mailing list archive)
State: New, archived
Series: vfs: Delete the associated dentry when deleting a file
Oliver,

is there any chance you could run this through the test robot
performance suite? The original full patch at

  https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/

and it would be interesting if the test robot could see if the patch
makes any difference on any other loads?

Thanks,
Linus

On Wed, 15 May 2024 at 02:17, Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Our applications, built on Elasticsearch[0], frequently create and delete
> files. These applications operate within containers, some with a memory
> limit exceeding 100GB. Over prolonged periods, the accumulation of negative
> dentries within these containers can amount to tens of gigabytes.
>
> Upon container exit, directories are deleted. However, due to the numerous
> associated dentries, this process can be time-consuming. Our users have
> expressed frustration with this prolonged exit duration, which constitutes
> our first issue.
hi, Linus,

On Wed, May 15, 2024 at 09:05:24AM -0700, Linus Torvalds wrote:
> Oliver,
> is there any chance you could run this through the test robot
> performance suite? The original full patch at
>
> https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/
>
> and it would be interesting if the test robot could see if the patch
> makes any difference on any other loads?

sure, it's our great pleasure. thanks a lot for involving us.

the tests are started, but due to resource constraints, it may take 2-3 days
to run enough tests. will keep you guys updated, thanks!

>
> Thanks,
> Linus
>
> On Wed, 15 May 2024 at 02:17, Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Our applications, built on Elasticsearch[0], frequently create and delete
> > files. These applications operate within containers, some with a memory
> > limit exceeding 100GB. Over prolonged periods, the accumulation of negative
> > dentries within these containers can amount to tens of gigabytes.
> >
> > Upon container exit, directories are deleted. However, due to the numerous
> > associated dentries, this process can be time-consuming. Our users have
> > expressed frustration with this prolonged exit duration, which constitutes
> > our first issue.
Hello,

kernel test robot noticed a 6.7% improvement of stress-ng.touch.ops_per_sec on:

commit: 3681ce364442ce2ec7c7fbe90ad0aca651db475b ("[PATCH] vfs: Delete the associated dentry when deleting a file")
url: https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/vfs-Delete-the-associated-dentry-when-deleting-a-file/20240516-144340
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 3c999d1ae3c75991902a1a7dad0cb62c2a3008b4
patch link: https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/
patch subject: [PATCH] vfs: Delete the associated dentry when deleting a file

testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

	nr_threads: 100%
	disk: 1HDD
	testtime: 60s
	fs: ext4
	test: touch
	cpufreq_governor: performance

Details are as below:
-------------------------------------------------------------------------------------------------->

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240522/202405221518.ecea2810-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-13/performance/1HDD/ext4/x86_64-rhel-8.3/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/touch/stress-ng/60s

commit:
  3c999d1ae3 ("Merge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")
  3681ce3644 ("vfs: Delete the associated dentry when deleting a file")

       3c999d1ae3c75991    3681ce364442ce2ec7c7fbe90ad
       ----------------    ---------------------------
            %stddev            %change        %stddev
                \                  |               \
      9.884e+08             -49.1%  5.027e+08 ± 4%   cpuidle..time
          23.46 ± 2%        -38.3%      14.47 ± 5%   iostat.cpu.idle
          75.71             +11.7%      84.60        iostat.cpu.system
         620100 ± 5%        -24.0%     471157 ± 6%   numa-numastat.node0.local_node
         650333 ± 3%        -22.3%     505122 ± 5%   numa-numastat.node0.numa_hit
         736690 ± 3%        -21.7%     577182 ± 4%   numa-numastat.node1.local_node
         772759 ± 3%        -21.1%     609537 ± 3%   numa-numastat.node1.numa_hit
          23.42             -38.1%      14.50 ± 5%   vmstat.cpu.id
          54.39 ± 3%        +13.4%      61.67 ± 2%   vmstat.procs.r
          75507 ± 6%        +24.6%      94113 ± 5%   vmstat.system.cs
         177863             +14.2%     203176        vmstat.system.in
          21.30 ± 2%          -9.5      11.76 ± 7%   mpstat.cpu.all.idle%
           0.37 ± 2%          +0.1       0.42 ± 2%   mpstat.cpu.all.irq%
           0.14               -0.0       0.11 ± 2%   mpstat.cpu.all.soft%
          77.37               +9.4      86.79        mpstat.cpu.all.sys%
           0.81               +0.1       0.92 ± 2%   mpstat.cpu.all.usr%
          68.24             -11.5%      60.40        stress-ng.time.elapsed_time
          68.24             -11.5%      60.40        stress-ng.time.elapsed_time.max
         175478              +5.3%     184765        stress-ng.time.involuntary_context_switches
           5038             +12.4%       5663        stress-ng.time.percent_of_cpu_this_job_got
         259551 ± 2%         +6.7%     277010        stress-ng.touch.ops_per_sec
         454143 ± 2%         -9.2%     412504 ± 2%   meminfo.Active
         445748 ± 2%         -9.3%     404154 ± 2%   meminfo.Active(anon)
        1714332 ± 2%        -92.8%     123761 ± 2%   meminfo.KReclaimable
        7111721             -23.7%    5423125        meminfo.Memused
        1714332 ± 2%        -92.8%     123761 ± 2%   meminfo.SReclaimable
         291963             -35.6%     188122        meminfo.SUnreclaim
        2006296             -84.5%     311883        meminfo.Slab
        7289014             -24.0%    5539852        meminfo.max_used_kB
         875347 ± 3%        -91.3%      75880 ± 30%  numa-meminfo.node0.KReclaimable
         875347 ± 3%        -91.3%      75880 ± 30%  numa-meminfo.node0.SReclaimable
         161663 ± 7%        -31.3%     111087 ± 12%  numa-meminfo.node0.SUnreclaim
        1037010 ± 3%        -82.0%     186968 ± 16%  numa-meminfo.node0.Slab
         838678 ± 3%        -94.3%      48016 ± 50%  numa-meminfo.node1.KReclaimable
         838678 ± 3%        -94.3%      48016 ± 50%  numa-meminfo.node1.SReclaimable
         130259 ± 8%        -40.8%      77055 ± 18%  numa-meminfo.node1.SUnreclaim
         968937 ± 3%        -87.1%     125072 ± 25%  numa-meminfo.node1.Slab
         218872 ± 3%        -91.3%      18961 ± 31%  numa-vmstat.node0.nr_slab_reclaimable
          40425 ± 7%        -31.3%      27772 ± 12%  numa-vmstat.node0.nr_slab_unreclaimable
         649973 ± 3%        -22.3%     504715 ± 5%   numa-vmstat.node0.numa_hit
         619740 ± 5%        -24.0%     470750 ± 6%   numa-vmstat.node0.numa_local
         209697 ± 3%        -94.3%      12019 ± 50%  numa-vmstat.node1.nr_slab_reclaimable
          32567 ± 8%        -40.8%      19264 ± 18%  numa-vmstat.node1.nr_slab_unreclaimable
         771696 ± 3%        -21.1%     608506 ± 3%   numa-vmstat.node1.numa_hit
         735626 ± 3%        -21.7%     576154 ± 4%   numa-vmstat.node1.numa_local
         111469 ± 2%         -9.3%     101103 ± 2%   proc-vmstat.nr_active_anon
         446.10             -36.6%     283.05        proc-vmstat.nr_dirtied
          18812              +2.5%      19287        proc-vmstat.nr_kernel_stack
         428165 ± 2%        -92.8%      30970 ± 2%   proc-vmstat.nr_slab_reclaimable
          72954             -35.5%      47032        proc-vmstat.nr_slab_unreclaimable
         111469 ± 2%         -9.3%     101103 ± 2%   proc-vmstat.nr_zone_active_anon
        1424982             -21.7%    1116423        proc-vmstat.numa_hit
        1358679             -22.7%    1050103 ± 2%   proc-vmstat.numa_local
        4261989 ± 3%        -14.9%    3625822 ± 3%   proc-vmstat.pgalloc_normal
         399826              -4.6%     381396        proc-vmstat.pgfault
        3996977 ± 3%        -15.9%    3359964 ± 3%   proc-vmstat.pgfree
          13500              -5.9%      12708        proc-vmstat.pgpgout
           0.13 ± 3%          -0.0       0.08 ± 4%   perf-profile.children.cycles-pp.d_lookup
           0.12 ± 4%          -0.0       0.08        perf-profile.children.cycles-pp.__d_lookup
           0.11 ± 2%          +0.0       0.13 ± 3%   perf-profile.children.cycles-pp.irq_exit_rcu
           0.06 ± 7%          +0.0       0.09 ± 2%   perf-profile.children.cycles-pp.__slab_free
           0.02 ±100%         +0.0       0.06 ± 8%   perf-profile.children.cycles-pp.__memcg_slab_free_hook
           0.07 ± 7%          +0.0       0.12 ± 4%   perf-profile.children.cycles-pp.run_ksoftirqd
           0.09 ± 5%          +0.0       0.14 ± 4%   perf-profile.children.cycles-pp.smpboot_thread_fn
           0.18 ± 3%          +0.1       0.24 ± 3%   perf-profile.children.cycles-pp.kmem_cache_free
           0.10 ± 7%          +0.1       0.16 ± 7%   perf-profile.children.cycles-pp.kthread
           0.10 ± 7%          +0.1       0.16 ± 7%   perf-profile.children.cycles-pp.ret_from_fork
           0.10 ± 7%          +0.1       0.16 ± 7%   perf-profile.children.cycles-pp.ret_from_fork_asm
           0.14 ± 4%          +0.1       0.20 ± 3%   perf-profile.children.cycles-pp.rcu_do_batch
           0.15 ± 3%          +0.1       0.21 ± 2%   perf-profile.children.cycles-pp.rcu_core
           0.18 ± 3%          +0.1       0.24 ± 2%   perf-profile.children.cycles-pp.handle_softirqs
           0.00               +0.1       0.06 ± 7%   perf-profile.children.cycles-pp.list_lru_del
           0.82 ± 2%          +0.1       0.89 ± 3%   perf-profile.children.cycles-pp._raw_spin_lock
           0.01 ±238%         +0.1       0.08 ± 4%   perf-profile.children.cycles-pp.__call_rcu_common
           0.00               +0.1       0.08 ± 3%   perf-profile.children.cycles-pp.d_lru_del
           0.00               +0.2       0.17 ± 3%   perf-profile.children.cycles-pp.__dentry_kill
           0.33 ± 11%         +0.2       0.50 ± 6%   perf-profile.children.cycles-pp.dput
           0.09 ± 4%          -0.0       0.05 ± 22%  perf-profile.self.cycles-pp.__d_lookup
           0.06 ± 5%          +0.0       0.09 ± 5%   perf-profile.self.cycles-pp.__slab_free
           0.74 ± 2%          +0.1       0.79        perf-profile.self.cycles-pp._raw_spin_lock
           2.55 ± 4%        -40.6%       1.51 ± 5%   perf-stat.i.MPKI
      1.219e+10             +12.8%  1.376e+10        perf-stat.i.branch-instructions
           0.27 ± 2%          -0.1       0.21 ± 2%   perf-stat.i.branch-miss-rate%
       22693772             +14.3%   25941855        perf-stat.i.branch-misses
          39.43               -4.6      34.83        perf-stat.i.cache-miss-rate%
       95834044 ± 3%        +10.9%  1.063e+08 ± 4%   perf-stat.i.cache-misses
      2.658e+08 ± 2%        +14.8%  3.051e+08 ± 4%   perf-stat.i.cache-references
          77622 ± 6%        +25.5%      97395 ± 5%   perf-stat.i.context-switches
           2.82              +3.3%       2.91        perf-stat.i.cpi
      1.821e+11             +12.5%  2.049e+11        perf-stat.i.cpu-cycles
           9159 ± 3%        +22.9%      11257 ± 3%   perf-stat.i.cpu-migrations
      6.223e+10             +12.8%  7.019e+10        perf-stat.i.instructions
           0.36              -4.8%       0.34        perf-stat.i.ipc
           1.19 ± 9%        +27.6%       1.51 ± 5%   perf-stat.i.metric.K/sec
           4716              +6.3%       5011        perf-stat.i.minor-faults
           4716              +6.3%       5011        perf-stat.i.page-faults
          35.93               -1.2      34.77        perf-stat.overall.cache-miss-rate%
      1.214e+10             +11.5%  1.353e+10        perf-stat.ps.branch-instructions
       22276184             +13.3%   25240906        perf-stat.ps.branch-misses
       95133798 ± 3%         +9.8%  1.045e+08 ± 4%   perf-stat.ps.cache-misses
      2.648e+08 ± 2%        +13.5%  3.004e+08 ± 4%   perf-stat.ps.cache-references
          77612 ± 6%        +23.8%      96101 ± 5%   perf-stat.ps.context-switches
      1.813e+11             +11.1%  2.015e+11        perf-stat.ps.cpu-cycles
           9093 ± 3%        +21.5%      11047 ± 2%   perf-stat.ps.cpu-migrations
      6.192e+10             +11.5%  6.901e+10        perf-stat.ps.instructions
           4639              +5.6%       4899        perf-stat.ps.minor-faults
           4640              +5.6%       4899        perf-stat.ps.page-faults

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
hi, Linus, hi, Yafang Shao,

On Wed, May 15, 2024 at 09:05:24AM -0700, Linus Torvalds wrote:
> Oliver,
> is there any chance you could run this through the test robot
> performance suite? The original full patch at
>
> https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/
>
> and it would be interesting if the test robot could see if the patch
> makes any difference on any other loads?
>

we just reported a stress-ng performance improvement by this patch [1]

test robot applied this patch upon
3c999d1ae3 ("Merge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")

filesystem is not our team's major domain, so we just made a limited review
of the results, and decided to send out the report FYI.

at the first stage, we decided to check the below categories of tests as a priority:

	stress-ng filesystem
	filebench mailserver
	reaim fileserver

we also picked sysbench-fileio and blogbench into coverage.

here is a summary.

for stress-ng, besides [1] which was reported, we got the below data comparing
this patch to 3c999d1ae3.

either there is no significant performance change, or the change is smaller
than the noise, which will make the test robot's bisect fail, so this
information is just FYI. if you have any doubt about any subtests, could you
let us know so we can check further?

(also included some net test results)

          12.87 ± 6%    -0.6%      12.79        stress-ng.xattr.ops_per_sec
           6721 ± 5%    +7.5%       7224 ± 27%  stress-ng.rawdev.ops_per_sec
           9002 ± 7%    -8.7%       8217        stress-ng.dirmany.ops_per_sec
        8594743 ± 4%    -3.0%    8337417        stress-ng.rawsock.ops_per_sec
           2056 ± 3%    +2.9%       2116        stress-ng.dirdeep.ops_per_sec
           4307 ± 21%   -6.9%       4009        stress-ng.dir.ops_per_sec
         137946 ± 18%   +5.8%     145942        stress-ng.fiemap.ops_per_sec
       22413006 ± 2%    +2.5%   22982512 ± 2%   stress-ng.sockdiag.ops_per_sec
         286714 ± 2%    -3.8%     275876 ± 5%   stress-ng.udp-flood.ops_per_sec
          82904 ± 46%  -31.6%      56716        stress-ng.sctp.ops_per_sec
        9853408         -0.3%    9826387        stress-ng.ping-sock.ops_per_sec
          84667 ± 12%  -26.7%      62050 ± 17%  stress-ng.dccp.ops_per_sec
          61750 ± 25%  -24.2%      46821 ± 38%  stress-ng.open.ops_per_sec
         583443 ± 3%    -3.4%     563822        stress-ng.file-ioctl.ops_per_sec
          11919 ± 28%  -34.3%       7833        stress-ng.dentry.ops_per_sec
          18.59 ± 12%  -23.9%      14.15 ± 27%  stress-ng.swap.ops_per_sec
         246.37 ± 2%   +15.9%     285.58 ± 12%  stress-ng.aiol.ops_per_sec
           7.45         -4.8%       7.10 ± 7%   stress-ng.fallocate.ops_per_sec
         207.97 ± 7%    +5.2%     218.70        stress-ng.copy-file.ops_per_sec
          69.87 ± 7%    +5.8%      73.93 ± 5%   stress-ng.fpunch.ops_per_sec
           0.25 ± 21%  +24.0%       0.31        stress-ng.inode-flags.ops_per_sec
         849.35 ± 6%    +1.4%     861.51        stress-ng.mknod.ops_per_sec
         926144 ± 4%    -5.2%     877558        stress-ng.lease.ops_per_sec
          82924         -2.1%      81220        stress-ng.fcntl.ops_per_sec
           6.19 ±124%  -50.7%       3.05        stress-ng.chattr.ops_per_sec
         676.90 ± 4%    -1.9%     663.94 ± 5%   stress-ng.iomix.ops_per_sec
           0.93 ± 6%    +5.6%       0.98 ± 7%   stress-ng.symlink.ops_per_sec
        1703608         -3.8%    1639057 ± 3%   stress-ng.eventfd.ops_per_sec
        1735861         -0.6%    1726072        stress-ng.sockpair.ops_per_sec
          85440         -2.0%      83705        stress-ng.rawudp.ops_per_sec
           6198         +0.6%       6236        stress-ng.sockabuse.ops_per_sec
          39226         +0.0%      39234        stress-ng.sock.ops_per_sec
           1358         +0.3%       1363        stress-ng.tun.ops_per_sec
        9794021         -1.7%    9623340        stress-ng.icmp-flood.ops_per_sec
        1324728         +0.3%    1328244        stress-ng.epoll.ops_per_sec
         146150         -2.0%     143231        stress-ng.rawpkt.ops_per_sec
        6381112         -0.4%    6352696        stress-ng.udp.ops_per_sec
        1234258         +0.2%    1236738        stress-ng.sockfd.ops_per_sec
          23954         -0.1%      23932        stress-ng.sockmany.ops_per_sec
         257030         -0.1%     256860        stress-ng.netdev.ops_per_sec
        6337097         +0.1%    6341130        stress-ng.flock.ops_per_sec
         173212         -0.3%     172728        stress-ng.rename.ops_per_sec
         199.69         +0.6%     200.82        stress-ng.sync-file.ops_per_sec
         606.57         +0.8%     611.53        stress-ng.chown.ops_per_sec
         183549         -0.9%     181975        stress-ng.handle.ops_per_sec
           1299         +0.0%       1299        stress-ng.hdd.ops_per_sec
       98371066         +0.2%   98571113        stress-ng.lockofd.ops_per_sec
          25.49         -4.3%      24.39        stress-ng.ioprio.ops_per_sec
       96745191         -1.5%   95333632        stress-ng.locka.ops_per_sec
         582.35         +0.1%     582.86        stress-ng.chmod.ops_per_sec
        2075897         -2.2%    2029552        stress-ng.getdent.ops_per_sec
          60.47         -1.9%      59.34        stress-ng.metamix.ops_per_sec
          14161         -0.3%      14123        stress-ng.io.ops_per_sec
          23.98         -1.5%      23.61        stress-ng.link.ops_per_sec
          27514         +0.0%      27528        stress-ng.filename.ops_per_sec
          44955         +1.6%      45678        stress-ng.dnotify.ops_per_sec
         160.94         +0.4%     161.51        stress-ng.inotify.ops_per_sec
        2452224         +4.0%    2549607        stress-ng.lockf.ops_per_sec
           6761         +0.3%       6779        stress-ng.fsize.ops_per_sec
         775083         -1.5%     763487        stress-ng.fanotify.ops_per_sec
         309124         -4.2%     296285        stress-ng.utime.ops_per_sec
          25567         -0.1%      25530        stress-ng.dup.ops_per_sec
           1858         +0.9%       1876        stress-ng.procfs.ops_per_sec
         105804         -3.9%     101658        stress-ng.access.ops_per_sec
           1.04         -1.9%       1.02        stress-ng.chdir.ops_per_sec
          82753         -0.3%      82480        stress-ng.fstat.ops_per_sec
         681128         +3.7%     706375        stress-ng.acl.ops_per_sec
          11892         -0.1%      11875        stress-ng.bind-mount.ops_per_sec

for filebench, similar results, but the data is less unstable than stress-ng,
so for most of them we regard them as not impacted by this patch.

for reaim/sysbench-fileio/blogbench, the data are quite stable, and we didn't
notice any significant performance changes. we even doubt whether they go
through the code path changed by this patch.

so for these, we don't list full results here.

BTW, besides filesystem tests, this patch is also piped into other performance
test categories such as net, scheduler, mm and others, and since it also goes
into our so-called hourly kernels, it can be run by the full set of other
performance test suites the test robot supports. so in the following 2-3 weeks,
it's still possible for us to report other results, including regressions.

thanks

[1] https://lore.kernel.org/all/202405221518.ecea2810-oliver.sang@intel.com/

> Thanks,
> Linus
>
> On Wed, 15 May 2024 at 02:17, Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Our applications, built on Elasticsearch[0], frequently create and delete
> > files. These applications operate within containers, some with a memory
> > limit exceeding 100GB. Over prolonged periods, the accumulation of negative
> > dentries within these containers can amount to tens of gigabytes.
> >
> > Upon container exit, directories are deleted. However, due to the numerous
> > associated dentries, this process can be time-consuming. Our users have
> > expressed frustration with this prolonged exit duration, which constitutes
> > our first issue.
On Wed, 22 May 2024 at 01:12, kernel test robot <oliver.sang@intel.com> wrote:
>
> kernel test robot noticed a 6.7% improvement of stress-ng.touch.ops_per_sec on:
>
> commit: 3681ce364442ce2ec7c7fbe90ad0aca651db475b ("[PATCH] vfs: Delete the associated dentry when deleting a file")

Ok, since everything else is at least tentatively in the noise, and the only
hard numbers we have are the ones from Yafang's Elasticsearch load and this -
both of which say that this is a good patch - I decided to just apply this
ASAP just to get more testing.

I just wanted to note very explicitly that this is very much tentative: this
will be reverted very aggressively if somebody reports some other real-world
load performance issues, and we'll have to look at other solutions. But I
just don't think we'll get much more actual testing of this without just
saying "let's try it".

Also, I ended up removing the part of the patch that stopped clearing the
DCACHE_CANT_MOUNT bit. I think it's right, but it's really unrelated to the
actual problem at hand, and there are other cleanups - like the unnecessary
dget/dput pair - in this path that could also be looked at.

Anyway, let's see if somebody notices any issues with this. And I think we
should look at the "shrink dentries" case anyway for other reasons, since
it's easy to create a ton of negative dentries with just lots of lookups
(rather than lots of unlinking of existing files).

Of course, if you do billions of lookups of different files that do not exist
in the same directory, I suspect you just have yourself to blame, so the
"lots of negative lookups" load doesn't sound particularly realistic.

TLDR; I applied it for testing because we're in the merge window and there's
no reason not to try.

Linus
On Wed, May 22, 2024 at 09:00:03AM -0700, Linus Torvalds wrote:
> Of course, if you do billions of lookups of different files that do
> not exist in the same directory, I suspect you just have yourself to
> blame, so the "lots of negative lookups" load doesn't sound
> particularly realistic.

Oh no. We have real customers that this hits and it's not even stupid.

Every time an "event" happens, they look up something like a hash in three
different directories (like $PATH or multiple -I flags to the compiler).
Order is important, so they can't just look it up in the directory that it's
most likely to exist in. It usually fails to exist in directory A and B, so
we create dentries that say it doesn't. And those dentries are literally
never referenced again. Then directory C has the file they're looking for
(or it doesn't and it gets created because the process has write access to C
and not A or B).

plan9 handles this so much better because it has that union-mount stuff
instead of search paths. So it creates one dentry that tells it which of
those directories it actually exists in. But we're stuck with unix-style
search paths, so life is pain.
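[ Editorial note: to make the access pattern above concrete, here is a small
user-space sketch of that kind of search-path probing. It is illustrative
only - the directory names, hash-like file names and loop bound are invented,
not taken from the thread. Each stat() that fails with ENOENT in directories
A and B leaves behind a negative dentry that is never consulted again. ]

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	/* Hypothetical search path: probe in order, like $PATH or -I dirs. */
	const char *dirs[] = { "/srv/cache/A", "/srv/cache/B", "/srv/cache/C" };
	char path[4096];
	struct stat st;

	for (unsigned long event = 0; event < 1000000; event++) {
		for (int i = 0; i < 3; i++) {
			/* e.g. a content hash, unique per event */
			snprintf(path, sizeof(path), "%s/%016lx", dirs[i], event);
			if (stat(path, &st) == 0)
				break;	/* found here; stop searching */
			/* ENOENT: the kernel now caches a negative dentry for path */
		}
	}
	return 0;
}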
On Wed, 22 May 2024 at 10:14, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, May 22, 2024 at 09:00:03AM -0700, Linus Torvalds wrote:
> > Of course, if you do billions of lookups of different files that do
> > not exist in the same directory, I suspect you just have yourself to
> > blame, so the "lots of negative lookups" load doesn't sound
> > particularly realistic.
>
> Oh no. We have real customers that this hits and it's not even stupid.

Oh, negative dentries exist, and yes, they are a major feature on some loads.
Exactly because of the kinds of situations you describe.

In fact, that's the _point_. Things like PATH lookup require negative
dentries for good performance, because otherwise all the missed cases would
force a lookup all the way out to the filesystem.

So having thousands of negative dentries is normal and expected. And it will
grow for bigger machines and loads, of course.

That said, I don't think we've had people on real loads complain about them
being in the hundreds of millions like Yafang's case.

> plan9 handles this so much better because it has that union-mount stuff
> instead of search paths. So it creates one dentry that tells it which of
> those directories it actually exists in. But we're stuck with unix-style
> search paths, so life is pain.

I suspect you are just not as aware of the downsides of the plan9 models.
People tend to think plan9 was great. It had huge and fundamental design
mistakes.

Union mounts à la plan9 aren't hugely wonderful, and there's a reason
overlayfs does things differently (not that that is hugely wonderful either).

Linus
On Wed, May 22, 2024 at 4:51 PM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Linus, hi, Yafang Shao,
>
> On Wed, May 15, 2024 at 09:05:24AM -0700, Linus Torvalds wrote:
> > Oliver,
> > is there any chance you could run this through the test robot
> > performance suite? The original full patch at
> >
> > https://lore.kernel.org/all/20240515091727.22034-1-laoar.shao@gmail.com/
> >
> > and it would be interesting if the test robot could see if the patch
> > makes any difference on any other loads?
> >
>
> we just reported a stress-ng performance improvement by this patch [1]

Awesome!

> test robot applied this patch upon
> 3c999d1ae3 ("Merge tag 'wq-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq")
>
> filesystem is not our team's major domain, so we just made a limited review
> of the results, and decided to send out the report FYI.
>
> [ ... full stress-ng/filebench/reaim/sysbench-fileio/blogbench summary
>   snipped; it is identical to Oliver's report above ... ]
>
> BTW, besides filesystem tests, this patch is also piped into other performance
> test categories such as net, scheduler, mm and others, and since it also goes
> into our so-called hourly kernels, it can be run by the full set of other
> performance test suites the test robot supports. so in the following 2-3 weeks,
> it's still possible for us to report other results, including regressions.

That's great. Many thanks for your help.
diff --git a/fs/dcache.c b/fs/dcache.c
index 71a8e943a0fa..2ffdb98e9166 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2360,19 +2360,17 @@ EXPORT_SYMBOL(d_hash_and_lookup);
  * - unhash this dentry and free it.
  *
  * Usually, we want to just turn this into
- * a negative dentry, but if anybody else is
- * currently using the dentry or the inode
- * we can't do that and we fall back on removing
- * it from the hash queues and waiting for
- * it to be deleted later when it has no users
+ * a negative dentry, but certain workloads can
+ * generate a large number of negative dentries.
+ * Therefore, it would be better to simply
+ * unhash it.
  */
 
 /**
  * d_delete - delete a dentry
  * @dentry: The dentry to delete
  *
- * Turn the dentry into a negative dentry if possible, otherwise
- * remove it from the hash queues so it can be deleted later
+ * Remove the dentry from the hash queues so it can be deleted later.
  */
 void d_delete(struct dentry * dentry)
@@ -2381,14 +2379,14 @@ void d_delete(struct dentry * dentry)
 	spin_lock(&inode->i_lock);
 	spin_lock(&dentry->d_lock);
+	__d_drop(dentry);
+
 	/*
 	 * Are we the only user?
 	 */
 	if (dentry->d_lockref.count == 1) {
-		dentry->d_flags &= ~DCACHE_CANT_MOUNT;
 		dentry_unlink_inode(dentry);
 	} else {
-		__d_drop(dentry);
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&inode->i_lock);
 	}
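[ Editorial note: reading the hunks above together, the resulting d_delete()
as posted in this patch (i.e. still including the DCACHE_CANT_MOUNT change
that Linus dropped when applying it) would look roughly like the sketch
below. It is pieced together from the diff context, not copied from the
kernel tree, and the inode declaration is assumed from the surrounding code. ]

void d_delete(struct dentry *dentry)
{
	struct inode *inode = d_inode(dentry);	/* assumed; not shown in the hunk */

	spin_lock(&inode->i_lock);
	spin_lock(&dentry->d_lock);
	/* Always unhash: an unlinked file no longer leaves a negative dentry behind. */
	__d_drop(dentry);

	/*
	 * Are we the only user?
	 */
	if (dentry->d_lockref.count == 1) {
		/* dentry_unlink_inode() drops both locks on its way out. */
		dentry_unlink_inode(dentry);
	} else {
		spin_unlock(&dentry->d_lock);
		spin_unlock(&inode->i_lock);
	}
}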
Our applications, built on Elasticsearch[0], frequently create and delete
files. These applications operate within containers, some with a memory
limit exceeding 100GB. Over prolonged periods, the accumulation of negative
dentries within these containers can amount to tens of gigabytes.

Upon container exit, directories are deleted. However, due to the numerous
associated dentries, this process can be time-consuming. Our users have
expressed frustration with this prolonged exit duration, which constitutes
our first issue.

Simultaneously, other processes may attempt to access the parent directory
of the Elasticsearch directories. Since the task responsible for deleting
the dentries holds the inode lock, processes attempting directory lookup
experience significant delays. This issue, our second problem, is easily
demonstrated:

- Task 1 generates negative dentries:

  $ pwd
  ~/test
  $ mkdir es && cd es/ && ./create_and_delete_files.sh
  [ After generating tens of GB dentries ]
  $ cd ~/test && rm -rf es
  [ It will take a long duration to finish ]

- Task 2 attempts to lookup the 'test/' directory:

  $ pwd
  ~/test
  $ ls

The 'ls' command in Task 2 experiences prolonged execution as Task 1 is
deleting the dentries.

We've devised a solution to address both issues by deleting the associated
dentry when removing a file. Interestingly, we've noted that a similar patch
was proposed years ago[1], although it was rejected citing the absence of
tangible issues caused by negative dentries. Given our current challenges,
we're resubmitting the proposal. All relevant stakeholders from previous
discussions have been included for reference.

Some alternative solutions are also under discussion[2][3], such as shrinking
child dentries outside of the parent inode lock or even asynchronously
shrinking child dentries. However, given the straightforward nature of the
current solution, I believe this approach is still necessary.

[0]. https://github.com/elastic/elasticsearch
[1]. https://patchwork.kernel.org/project/linux-fsdevel/patch/1502099673-31620-1-git-send-email-wangkai86@huawei.com
[2]. https://lore.kernel.org/linux-fsdevel/20240511200240.6354-2-torvalds@linux-foundation.org/
[3]. https://lore.kernel.org/linux-fsdevel/CAHk-=wjEMf8Du4UFzxuToGDnF3yLaMcrYeyNAaH1NJWa6fwcNQ@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Wangkai <wangkai86@huawei.com>
Cc: Colin Walters <walters@verbum.org>
---
 fs/dcache.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)
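[ Editorial note: the create_and_delete_files.sh script referenced in the
demonstration is not included in the submission. A minimal, hypothetical
stand-in for that kind of workload could be as simple as the loop below,
where on unpatched kernels every unlink() of a just-created file leaves a
cached negative dentry behind, which is how the tens of gigabytes of
dentries accumulate. File names and the default count are invented. ]

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned long count = argc > 1 ? strtoul(argv[1], NULL, 0) : 1000000;
	char name[64];

	for (unsigned long i = 0; i < count; i++) {
		snprintf(name, sizeof(name), "doc-%lu.tmp", i);
		int fd = open(name, O_CREAT | O_WRONLY, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);
		unlink(name);	/* dentry turns negative but stays in the dcache */
	}
	return 0;
}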