Message ID: 20200226161404.14136-1-longman@redhat.com (mailing list archive)
Series: fs/dcache: Limit # of negative dentries
On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: > A new sysctl parameter "dentry-dir-max" is introduced which accepts a > value of 0 (default) for no limit or a positive integer 256 and up. Small > dentry-dir-max numbers are forbidden to avoid excessive dentry count > checking which can impact system performance. This is always the wrong approach. A sysctl is just a way of blaming the sysadmin for us not being very good at programming. I agree that we need a way to limit the number of negative dentries. But that limit needs to be dynamic and depend on how the system is being used, not on how some overworked sysadmin has configured it. So we need an initial estimate for the number of negative dentries that we need for good performance. Maybe it's 1000. It doesn't really matter; it's going to change dynamically. Then we need a metric to let us know whether it needs to be increased. Perhaps that's "number of new negative dentries created in the last second". And we need to decide how much to increase it; maybe it's by 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on how high the recent rate of negative dentry creation has been. We also need a metric to let us know whether it needs to be decreased. I'm reluctant to say that memory pressure should be that metric because very large systems can let the number of dentries grow in an unbounded way. Perhaps that metric is "number of hits in the negative dentry cache in the last ten seconds". Again, we'll need to decide how much to shrink the target number by. If the number of negative dentries is at or above the target, then creating a new negative dentry means evicting an existing negative dentry. If the number of negative dentries is lower than the target, then we can just create a new one. Of course, memory pressure (and shrinking the target number) should cause negative dentries to be evicted from the old end of the LRU list. But memory pressure shouldn't cause us to change the target number; the target number is what we think we need to keep the system running smoothly.
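A minimal userspace sketch of the feedback loop described above, assuming hypothetical per-interval counters ("negative dentries created in the last second", "hits on negative dentries in the last ten seconds") fed in from the dcache. The 1000 starting point and the 10-100% growth range come from the email; the shrink step and every other constant are invented for illustration.

```c
/* Illustrative only: not kernel code.  The counters are assumed to be
 * maintained elsewhere; all constants are placeholders. */
#include <stdio.h>

static unsigned long target = 1000;	/* initial estimate; it self-corrects */

/* Run once per sampling interval. */
static void retune_negative_target(unsigned long created_last_sec,
				   unsigned long hits_last_10s)
{
	/* Heavy creation pressure: grow the target by 10%..100%. */
	if (created_last_sec > target / 10) {
		unsigned long pct = 10 + 90 * created_last_sec / (target + 1);

		if (pct > 100)
			pct = 100;
		target += target * pct / 100;
	}

	/* Few hits on negative dentries lately: shrink the target. */
	if (hits_last_10s < target / 100 && target > 1000)
		target -= target / 4;
}

/* Creation path: at or above the target, evict the oldest negative dentry
 * from the LRU instead of letting the count grow. */
static int must_evict_before_create(unsigned long nr_negative)
{
	return nr_negative >= target;
}

int main(void)
{
	retune_negative_target(600, 200);	/* a busy second */
	printf("target is now %lu\n", target);
	printf("evict before creating? %d\n", must_evict_before_create(1800));
	return 0;
}
```

Memory pressure would still walk the LRU from the old end as usual; only the creation path consults the target.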
On 2/26/20 11:29 AM, Matthew Wilcox wrote: > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a >> value of 0 (default) for no limit or a positive integer 256 and up. Small >> dentry-dir-max numbers are forbidden to avoid excessive dentry count >> checking which can impact system performance. > This is always the wrong approach. A sysctl is just a way of blaming > the sysadmin for us not being very good at programming. > > I agree that we need a way to limit the number of negative dentries. > But that limit needs to be dynamic and depend on how the system is being > used, not on how some overworked sysadmin has configured it. > > So we need an initial estimate for the number of negative dentries that > we need for good performance. Maybe it's 1000. It doesn't really matter; > it's going to change dynamically. > > Then we need a metric to let us know whether it needs to be increased. > Perhaps that's "number of new negative dentries created in the last > second". And we need to decide how much to increase it; maybe it's by > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on > how high the recent rate of negative dentry creation has been. > > We also need a metric to let us know whether it needs to be decreased. > I'm reluctant to say that memory pressure should be that metric because > very large systems can let the number of dentries grow in an unbounded > way. Perhaps that metric is "number of hits in the negative dentry > cache in the last ten seconds". Again, we'll need to decide how much > to shrink the target number by. > > If the number of negative dentries is at or above the target, then > creating a new negative dentry means evicting an existing negative dentry. > If the number of negative dentries is lower than the target, then we > can just create a new one. > > Of course, memory pressure (and shrinking the target number) should > cause negative dentries to be evicted from the old end of the LRU list. > But memory pressure shouldn't cause us to change the target number; > the target number is what we think we need to keep the system running > smoothly. > Thanks for the quick response. I agree that auto-tuning, so that the system administrator doesn't have to worry about it, will be the best approach if it is implemented in the right way. However, it is hard to do it right. How about letting users specify a cap on the amount of total system memory allowed for negative dentries, like one of my previous patches? Internally, there is a predefined minimum and maximum for dentry-dir-max. We sample the total negative dentry count periodically and adjust the dentry-dir-max accordingly. Specifying a percentage of total system memory is more intuitive than just specifying a hard number for dentry-dir-max. Still, some user input is required. What do you think? Cheers, Longman
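A rough sketch of what that could look like, assuming a hypothetical percentage knob (`neg_dentry_pct`), an approximate per-dentry slab size, and invented floor/ceiling values for dentry-dir-max; the sampling interval and step sizes are likewise made up.

```c
#include <stdio.h>

#define DENTRY_SIZE	192UL		/* rough slab object size, bytes */
#define DIR_MAX_FLOOR	256UL		/* matches the patch's minimum */
#define DIR_MAX_CEIL	(1UL << 20)	/* arbitrary upper clamp */

/* Turn "X percent of RAM" into a global negative-dentry budget. */
static unsigned long neg_dentry_budget(unsigned long totalram_bytes,
				       unsigned int neg_dentry_pct)
{
	return totalram_bytes / 100 * neg_dentry_pct / DENTRY_SIZE;
}

/* Periodic sampling: nudge dentry-dir-max toward staying within budget. */
static unsigned long resample_dir_max(unsigned long dir_max,
				      unsigned long nr_negative,
				      unsigned long budget)
{
	if (nr_negative > budget)
		dir_max -= dir_max / 4;
	else if (nr_negative < budget / 2)
		dir_max += dir_max / 4;

	if (dir_max < DIR_MAX_FLOOR)
		dir_max = DIR_MAX_FLOOR;
	if (dir_max > DIR_MAX_CEIL)
		dir_max = DIR_MAX_CEIL;
	return dir_max;
}

int main(void)
{
	unsigned long budget = neg_dentry_budget(16UL << 30, 2);  /* 2% of 16 GiB */

	printf("budget=%lu new dir_max=%lu\n", budget,
	       resample_dir_max(4096, budget + 1, budget));
	return 0;
}
```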
On Wed, Feb 26, 2020 at 02:19:59PM -0500, Waiman Long wrote: > On 2/26/20 11:29 AM, Matthew Wilcox wrote: > > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: > >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a > >> value of 0 (default) for no limit or a positive integer 256 and up. Small > >> dentry-dir-max numbers are forbidden to avoid excessive dentry count > >> checking which can impact system performance. > > This is always the wrong approach. A sysctl is just a way of blaming > > the sysadmin for us not being very good at programming. > > > > I agree that we need a way to limit the number of negative dentries. > > But that limit needs to be dynamic and depend on how the system is being > > used, not on how some overworked sysadmin has configured it. > > > > So we need an initial estimate for the number of negative dentries that > > we need for good performance. Maybe it's 1000. It doesn't really matter; > > it's going to change dynamically. > > > > Then we need a metric to let us know whether it needs to be increased. > > Perhaps that's "number of new negative dentries created in the last > > second". And we need to decide how much to increase it; maybe it's by > > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on > > how high the recent rate of negative dentry creation has been. > > > > We also need a metric to let us know whether it needs to be decreased. > > I'm reluctant to say that memory pressure should be that metric because > > very large systems can let the number of dentries grow in an unbounded > > way. Perhaps that metric is "number of hits in the negative dentry > > cache in the last ten seconds". Again, we'll need to decide how much > > to shrink the target number by. > > > > If the number of negative dentries is at or above the target, then > > creating a new negative dentry means evicting an existing negative dentry. > > If the number of negative dentries is lower than the target, then we > > can just create a new one. > > > > Of course, memory pressure (and shrinking the target number) should > > cause negative dentries to be evicted from the old end of the LRU list. > > But memory pressure shouldn't cause us to change the target number; > > the target number is what we think we need to keep the system running > > smoothly. > > > Thanks for the quick response. > > I agree that auto-tuning so that the system administrator don't have to > worry about it will be the best approach if it is implemented in the > right way. However, it is hard to do it right. > > How about letting users specify a cap on the amount of total system > memory allowed for negative dentries like one of my previous patchs. > Internally, there is a predefined minimum and maximum for > dentry-dir-max. We sample the total negative dentry counts periodically > and adjust the dentry-dir-max accordingly. > > Specifying a percentage of total system memory is more intuitive than > just specifying a hard number for dentry-dir-max. Still some user input > is required. If you want to base the whole thing on a per-directory target number, or a percentage of the system memory target (rather than my suggestion of a total # of negative dentries), that seems reasonable. What's not reasonable is expecting the sysadmin to be able to either predict the workload, or react to a changing workload in sufficient time. The system has to be self-tuning. Just look how long stale information stays around about how to tune your Linux system. 
Here's an article from 2018 suggesting using the 'intr' option for NFS mounts: https://kb.netapp.com/app/answers/answer_view/a_id/1004893/~/hard-mount-vs-soft-mount- I made that a no-op in 2007. Any tunable you add to Linux immediately becomes a cargo-cult solution to any problem people are having.
On Feb 26, 2020, at 9:29 AM, Matthew Wilcox <willy@infradead.org> wrote: > > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a >> value of 0 (default) for no limit or a positive integer 256 and up. Small >> dentry-dir-max numbers are forbidden to avoid excessive dentry count >> checking which can impact system performance. > > This is always the wrong approach. A sysctl is just a way of blaming > the sysadmin for us not being very good at programming. > > I agree that we need a way to limit the number of negative dentries. > But that limit needs to be dynamic and depend on how the system is being > used, not on how some overworked sysadmin has configured it. > > So we need an initial estimate for the number of negative dentries that > we need for good performance. Maybe it's 1000. It doesn't really matter; > it's going to change dynamically. > > Then we need a metric to let us know whether it needs to be increased. > Perhaps that's "number of new negative dentries created in the last > second". And we need to decide how much to increase it; maybe it's by > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on > how high the recent rate of negative dentry creation has been. > > We also need a metric to let us know whether it needs to be decreased. > I'm reluctant to say that memory pressure should be that metric because > very large systems can let the number of dentries grow in an unbounded > way. Perhaps that metric is "number of hits in the negative dentry > cache in the last ten seconds". Again, we'll need to decide how much > to shrink the target number by. OK, so now instead of a single tunable parameter we need three, because these numbers are totally made up and nobody knows the right values. :-) Defaulting the limit to "disabled/no limit" also has the problem that 99.99% of users won't even know this tunable exists, let alone how to set it correctly, so they will continue to see these problems, and the code may as well not exist (i.e. pure overhead), while Waiman has a better idea today of what would be reasonable defaults. I definitely agree that a single fixed value will be wrong for every system except the original developer's. Making the maximum default to some reasonable fraction of the system size, rather than a fixed value, is probably best to start. Something like this as a starting point: /* Allow a reasonable minimum number of negative entries, * but proportionately more if the directory/dcache is large. */ dir_negative_max = max(num_dir_entries / 16, 1024); total_negative_max = max(totalram_pages / 32, total_dentries / 8); (Waiman should decide actual values based on where the problem was hit previously), and include tunables to change the limits for testing. Ideally there would also be a dir ioctl that allows fetching the current positive/negative entry count on a directory (e.g. /usr/bin, /usr/lib64, /usr/share/man/man*) to see what these values are. Otherwise there is no way to determine whether the limits used are any good or not. Dynamic limits are hard to get right, and incorrect state machines can lead to wild swings in behaviour due to unexpected feedback. It isn't clear to me that adjusting the limit based on the current rate of negative dentry creation even makes sense. If there are a lot of negative entries being created, that is when you'd want to _stop_ allowing more to be added. 
We don't have any limit today, so imposing some large-but-still-reasonable upper limit on negative entries will catch the runaway negative dcache case that was the original need of this functionality without adding a lot of complexity that we may not need at all. > If the number of negative dentries is at or above the target, then > creating a new negative dentry means evicting an existing negative dentry. > If the number of negative dentries is lower than the target, then we > can just create a new one. > > Of course, memory pressure (and shrinking the target number) should > cause negative dentries to be evicted from the old end of the LRU list. > But memory pressure shouldn't cause us to change the target number; > the target number is what we think we need to keep the system running > smoothly. Cheers, Andreas
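Back-of-envelope numbers for the total_negative_max heuristic quoted above, assuming 4 KiB pages and the ~192-byte dentry slab objects seen later in the thread; they illustrate the scale that the follow-up objects to on terabyte-class machines.

```c
#include <stdio.h>

int main(void)
{
	unsigned long ram_bytes = 1UL << 40;			/* 1 TiB box */
	unsigned long totalram_pages = ram_bytes / 4096;	/* 4 KiB pages */
	unsigned long total_negative_max = totalram_pages / 32;

	/* 1 TiB -> ~8.4 million negative dentries, ~1.5 GiB of dentry slab. */
	printf("pages=%lu max=%lu (~%lu MiB of slab)\n",
	       totalram_pages, total_negative_max,
	       total_negative_max * 192 / (1024 * 1024));
	return 0;
}
```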
On Wed, Feb 26, 2020 at 02:28:50PM -0700, Andreas Dilger wrote: > On Feb 26, 2020, at 9:29 AM, Matthew Wilcox <willy@infradead.org> wrote: > > This is always the wrong approach. A sysctl is just a way of blaming > > the sysadmin for us not being very good at programming. > > > > I agree that we need a way to limit the number of negative dentries. > > But that limit needs to be dynamic and depend on how the system is being > > used, not on how some overworked sysadmin has configured it. > > > > So we need an initial estimate for the number of negative dentries that > > we need for good performance. Maybe it's 1000. It doesn't really matter; > > it's going to change dynamically. > > > > Then we need a metric to let us know whether it needs to be increased. > > Perhaps that's "number of new negative dentries created in the last > > second". And we need to decide how much to increase it; maybe it's by > > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on > > how high the recent rate of negative dentry creation has been. > > > > We also need a metric to let us know whether it needs to be decreased. > > I'm reluctant to say that memory pressure should be that metric because > > very large systems can let the number of dentries grow in an unbounded > > way. Perhaps that metric is "number of hits in the negative dentry > > cache in the last ten seconds". Again, we'll need to decide how much > > to shrink the target number by. > > OK, so now instead of a single tunable parameter we need three, because > these numbers are totally made up and nobody knows the right values. :-) > Defaulting the limit to "disabled/no limit" also has the problem that > 99.99% of users won't even know this tunable exists, let alone how to > set it correctly, so they will continue to see these problems, and the > code may as well not exist (i.e. pure overhead), while Waiman has a > better idea today of what would be reasonable defaults. I never said "no limit". I just said to start at some fairly random value and not worry about where you start because it'll correct to where this system needs it to be. As long as it converges like loadavg does, it'll be fine. It needs a fairly large "don't change the target" area, and it needs to react quickly to real changes in a system's workload. > I definitely agree that a single fixed value will be wrong for every > system except the original developer's. Making the maximum default to > some reasonable fraction of the system size, rather than a fixed value, > is probably best to start. Something like this as a starting point: > > /* Allow a reasonable minimum number of negative entries, > * but proportionately more if the directory/dcache is large. > */ > dir_negative_max = max(num_dir_entries / 16, 1024); > total_negative_max = max(totalram_pages / 32, total_dentries / 8); Those kinds of things are garbage on large machines. With a terabyte of RAM, you can end up with tens of millions of dentries clogging up the system. There _is_ an upper limit on the useful number of dentries to keep around. > (Waiman should decide actual values based on where the problem was hit > previously), and include tunables to change the limits for testing. > > Ideally there would also be a dir ioctl that allows fetching the current > positive/negative entry count on a directory (e.g. /usr/bin, /usr/lib64, > /usr/share/man/man*) to see what these values are. Otherwise there is > no way to determine whether the limits used are any good or not. 
It definitely needs to be instrumented for testing, but no, not ioctls. tracepoints, perhaps. > Dynamic limits are hard to get right, and incorrect state machines can lead > to wild swings in behaviour due to unexpected feedback. It isn't clear to > me that adjusting the limit based on the current rate of negative dentry > creation even makes sense. If there are a lot of negative entries being > created, that is when you'd want to _stop_ allowing more to be added. That doesn't make sense. What you really want to know is "If my dcache had twice as many entries in it, would that significantly reduce the thrash of new entries being created". In the page cache, we end up with a double LRU where once-used entries fall off the list quickly but twice-or-more used entries get to stay around for a bit longer. Maybe we could do something like that; keep a victim cache for recently evicted dentries, and if we get a large hit rate in the victim cache, expand the size of the primary cache.
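A sketch of the victim-cache idea, assuming it is enough to remember the name hashes of recently evicted negative dentries in a small fixed ring; the ring size, the sampling window and the 10% hit-rate trigger are invented, and a real implementation would not do a linear scan.

```c
#include <stdio.h>

#define VICTIM_SLOTS	4096		/* small ring of evicted-name hashes */

static unsigned long victim[VICTIM_SLOTS];
static unsigned int victim_head;
static unsigned long victim_hits, misses;
static unsigned long target = 1000;	/* primary-cache size target */

/* Remember a negative dentry we just evicted from the primary cache. */
static void victim_remember(unsigned long name_hash)
{
	victim[victim_head++ % VICTIM_SLOTS] = name_hash;
}

/* On a dcache miss: did we throw this name away recently? */
static void victim_check(unsigned long name_hash)
{
	misses++;
	for (unsigned int i = 0; i < VICTIM_SLOTS; i++) {
		if (victim[i] == name_hash) {
			victim_hits++;
			break;
		}
	}

	/* Many misses would have been hits with a bigger cache: grow it. */
	if (misses >= 1024) {
		if (victim_hits * 10 > misses)	/* >10% victim hit rate */
			target += target / 2;
		victim_hits = misses = 0;
	}
}

int main(void)
{
	victim_remember(42);			/* pretend this name was evicted */
	for (int i = 0; i < 1024; i++)
		victim_check(42);		/* every miss re-finds the victim */
	printf("target grew to %lu\n", target);
	return 0;
}
```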
On Wed, Feb 26, 2020 at 01:45:07PM -0800, Matthew Wilcox wrote: > had twice as many entries in it, would that significantly reduce the > thrash of new entries being created". In the page cache, we end up > with a double LRU where once-used entries fall off the list quickly > but twice-or-more used entries get to stay around for a bit longer. > Maybe we could do something like that; keep a victim cache for recently > evicted dentries, and if we get a large hit rate in the victim cache, > expand the size of the primary cache. You know, I've been saying exactly the same thing about the inode LRU in response to people trying to hack behaviour out of the shrinker that is triggered by the working set getting trashed by excessive creation of single use inodes (i.e. large scale directory traversal). IOWs, both have the same problem with working set retention in the face of excessive growth pressure. So, you know, perhaps two caches with the same problem, that use the same LRU implementation, could solve the same problem by enhancing the generic LRU code they use to an active/inactive style clocking LRU like the page LRUs? Cheers, Dave.
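A toy second-chance ("clock") model of the active/inactive behaviour being discussed: entries referenced again before the hand comes around survive another lap, while single-use entries are reclaimed first. Everything here is illustrative userspace C, not list_lru or page-LRU code.

```c
#include <stdbool.h>
#include <stdio.h>

#define SLOTS 8				/* tiny cache for demonstration */

struct slot {
	int id;				/* stand-in for a dentry or inode */
	bool in_use;
	bool referenced;		/* set on each cache hit */
};

static struct slot ring[SLOTS];
static unsigned int hand;

/* Cache hit: give the entry a second chance at the next sweep. */
static void touch(int id)
{
	for (unsigned int i = 0; i < SLOTS; i++)
		if (ring[i].in_use && ring[i].id == id)
			ring[i].referenced = true;
}

/* Claim a slot, evicting the first entry the hand finds unreferenced. */
static struct slot *claim_slot(void)
{
	for (;;) {
		struct slot *s = &ring[hand++ % SLOTS];

		if (!s->in_use)
			return s;
		if (s->referenced) {
			s->referenced = false;	/* demote; survives one more lap */
			continue;
		}
		return s;			/* single-use entry: reclaim it */
	}
}

static void insert(int id)
{
	struct slot *s = claim_slot();

	s->id = id;
	s->in_use = true;
	s->referenced = false;
}

int main(void)
{
	for (int i = 0; i < 20; i++)
		insert(i);		/* streaming single-use entries */
	touch(19);			/* 19 gets used a second time */
	insert(100);			/* evicts an unreferenced entry, not 19 */

	for (unsigned int i = 0; i < SLOTS; i++)
		printf("%d ", ring[i].id);
	printf("\n");
	return 0;
}
```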
On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: > As there is no limit for negative dentries, it is possible that a sizeable > portion of system memory can be tied up in dentry cache slabs. Dentry slabs > are generally recalimable if the dentries are in the LRUs. Still having > too much memory used up by dentries can be problematic: I don't get it. Why isn't the solution simply "constrain the application generating unbound numbers of dentries to a memcg"? Then when the memcg runs out of memory, it will start reclaiming the dentries that were allocated inside the memcg that are using all its resources, thereby preventing unbound growth of the dentry cache. I mean, this sort of resource control is exactly what memcgs are supposed to be used for and are already used for. I don't see why we need all this complexity for global dentry resource management when memcgs should already provide an effective means of managing and placing bounds on the amount of memory any specific application can use... Cheers, Dave.
On Wed, 2020-02-26 at 14:28 -0700, Andreas Dilger wrote: > On Feb 26, 2020, at 9:29 AM, Matthew Wilcox <willy@infradead.org> > wrote: > > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: > > > A new sysctl parameter "dentry-dir-max" is introduced which > > > accepts a > > > value of 0 (default) for no limit or a positive integer 256 and > > > up. Small > > > dentry-dir-max numbers are forbidden to avoid excessive dentry > > > count > > > checking which can impact system performance. > > > > This is always the wrong approach. A sysctl is just a way of > > blaming > > the sysadmin for us not being very good at programming. > > > > I agree that we need a way to limit the number of negative > > dentries. > > But that limit needs to be dynamic and depend on how the system is > > being > > used, not on how some overworked sysadmin has configured it. > > > > So we need an initial estimate for the number of negative dentries > > that > > we need for good performance. Maybe it's 1000. It doesn't really > > matter; > > it's going to change dynamically. > > > > Then we need a metric to let us know whether it needs to be > > increased. > > Perhaps that's "number of new negative dentries created in the last > > second". And we need to decide how much to increase it; maybe it's > > by > > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending > > on > > how high the recent rate of negative dentry creation has been. > > > > We also need a metric to let us know whether it needs to be > > decreased. > > I'm reluctant to say that memory pressure should be that metric > > because > > very large systems can let the number of dentries grow in an > > unbounded > > way. Perhaps that metric is "number of hits in the negative dentry > > cache in the last ten seconds". Again, we'll need to decide how > > much > > to shrink the target number by. > > OK, so now instead of a single tunable parameter we need three, > because > these numbers are totally made up and nobody knows the right values. > :-) > Defaulting the limit to "disabled/no limit" also has the problem that > 99.99% of users won't even know this tunable exists, let alone how to > set it correctly, so they will continue to see these problems, and > the > code may as well not exist (i.e. pure overhead), while Waiman has a > better idea today of what would be reasonable defaults. Why have these at all. Not all file systems even produce negative hashed dentries. The most beneficial use of them is to improve performance of rapid fire lookups for non-existent names. Longer lived negative hashed dentries don't give much benefit at all unless they suddenly have lots of hits and that would cost a single allocation on the first lookup if the dentry ttl expired and the dentry discarded. A ttl (say jiffies) set at appropriate times could be a better choice all round, no sysctl values at all. Ian
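A sketch of the TTL idea, using wall-clock time and an arbitrary 30-second window purely for illustration; in the kernel this would presumably be a jiffies stamp carried on the dentry (something like the existing d_time field) and checked from lookup and the pruning path.

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define NEG_DENTRY_TTL	30	/* seconds; the exact value should not be critical */

struct neg_dentry {
	time_t stamp;		/* set at creation, refreshed on each hit */
};

static void neg_dentry_touch(struct neg_dentry *d)
{
	d->stamp = time(NULL);
}

/* Called from lookup or from pruning: discard once the stamp is stale. */
static bool neg_dentry_expired(const struct neg_dentry *d)
{
	return time(NULL) - d->stamp > NEG_DENTRY_TTL;
}

int main(void)
{
	struct neg_dentry d;

	neg_dentry_touch(&d);
	printf("expired: %d\n", neg_dentry_expired(&d));	/* 0: still fresh */
	return 0;
}
```

A frequently hit negative dentry keeps getting its stamp refreshed and stays; a one-shot lookup costs at most one re-allocation after the TTL lapses, which is the trade-off described above.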
On 2/26/20 8:29 AM, Matthew Wilcox wrote: > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a >> value of 0 (default) for no limit or a positive integer 256 and up. Small >> dentry-dir-max numbers are forbidden to avoid excessive dentry count >> checking which can impact system performance. > > This is always the wrong approach. A sysctl is just a way of blaming > the sysadmin for us not being very good at programming. > > I agree that we need a way to limit the number of negative dentries. > But that limit needs to be dynamic and depend on how the system is being > used, not on how some overworked sysadmin has configured it. > > So we need an initial estimate for the number of negative dentries that > we need for good performance. Maybe it's 1000. It doesn't really matter; > it's going to change dynamically. > > Then we need a metric to let us know whether it needs to be increased. > Perhaps that's "number of new negative dentries created in the last > second". And we need to decide how much to increase it; maybe it's by > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on > how high the recent rate of negative dentry creation has been. There are pitfalls to this approach as well. Consider what libnss does every time it starts up (via curl in this case) # cat /proc/sys/fs/dentry-state 3154271 3131421 45 0 2863333 0 # for I in `seq 1 10`; do curl https://sandeen.net/ &>/dev/null; done # cat /proc/sys/fs/dentry-state 3170738 3147844 45 0 2879882 0 voila, 16k more negative dcache entries, thanks to: https://github.com/nss-dev/nss/blob/317cb06697d5b953d825e050c1d8c1ee0d647010/lib/softoken/sdb.c#L390 i.e. each time it inits, it will intentionally create up to 10,000 negative dentries which will never be looked up again. I /think/ the original intent of this work was to limit such rogue applications, so scaling with use probably isn't the way to go. -Eric
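A small helper for watching an experiment like the one above: on kernels new enough to report it, the fifth field of /proc/sys/fs/dentry-state is the negative dentry count (2,863,333 growing to 2,879,882 in the run above, i.e. the ~16k new entries mentioned).

```c
#include <stdio.h>

int main(void)
{
	long nr_dentry, nr_unused, age_limit, want_pages, nr_negative, pad;
	FILE *f = fopen("/proc/sys/fs/dentry-state", "r");

	if (!f || fscanf(f, "%ld %ld %ld %ld %ld %ld", &nr_dentry, &nr_unused,
			 &age_limit, &want_pages, &nr_negative, &pad) != 6) {
		perror("dentry-state");
		return 1;
	}
	fclose(f);
	printf("dentries=%ld unused=%ld negative=%ld\n",
	       nr_dentry, nr_unused, nr_negative);
	return 0;
}
```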
On Thu, Feb 27, 2020 at 11:04:40AM -0800, Eric Sandeen wrote: > On 2/26/20 8:29 AM, Matthew Wilcox wrote: > > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: > >> A new sysctl parameter "dentry-dir-max" is introduced which accepts a > >> value of 0 (default) for no limit or a positive integer 256 and up. Small > >> dentry-dir-max numbers are forbidden to avoid excessive dentry count > >> checking which can impact system performance. > > > > This is always the wrong approach. A sysctl is just a way of blaming > > the sysadmin for us not being very good at programming. > > > > I agree that we need a way to limit the number of negative dentries. > > But that limit needs to be dynamic and depend on how the system is being > > used, not on how some overworked sysadmin has configured it. > > > > So we need an initial estimate for the number of negative dentries that > > we need for good performance. Maybe it's 1000. It doesn't really matter; > > it's going to change dynamically. > > > > Then we need a metric to let us know whether it needs to be increased. > > Perhaps that's "number of new negative dentries created in the last > > second". And we need to decide how much to increase it; maybe it's by > > 50% or maybe by 10%. Perhaps somewhere between 10-100% depending on > > how high the recent rate of negative dentry creation has been. > > There are pitfalls to this approach as well. Consider what libnss > does every time it starts up (via curl in this case) > > # cat /proc/sys/fs/dentry-state > 3154271 3131421 45 0 2863333 0 > # for I in `seq 1 10`; do curl https://sandeen.net/ &>/dev/null; done > # cat /proc/sys/fs/dentry-state > 3170738 3147844 45 0 2879882 0 > > voila, 16k more negative dcache entries, thanks to: > > https://github.com/nss-dev/nss/blob/317cb06697d5b953d825e050c1d8c1ee0d647010/lib/softoken/sdb.c#L390 > > i.e. each time it inits, it will intentionally create up to 10,000 negative > dentries which will never be looked up again. Sandboxing via memcg restricted cgroups means users and/or applications cannot create unbound numbers of negative dentries, and that largely solves this problem. For a system daemons whose environment is controlled by a systemd unit file, this should be pretty trivial to do, and memcg directed memory reclaim will control negative dentry buildup. For short-lived applications, teardown of the cgroup will free all the negative dentries it created - they don't hang around forever. For long lived applications, negative dentries are bound by the application memcg limits, and buildup will only affect the applications own performance, not that of the whole system. IOWs, I'd expect this sort of resource control problem to be solved at the user, application and/or distro level, not with a huge kernel hammer. > I /think/ the original intent of this work was to limit such rogue > applications, so scaling with use probably isn't the way to go. The original intent was to prevent problems on old kernels that supported terabytes of memory but could not use cgroup/memcg infrastructure to isolate and contain negative dentry growth. That was a much simpler, targeted negative dentry limiting solution, not the ... craziness that can be found in this patchset. Cheers, Dave.
On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote: > Not all file systems even produce negative hashed dentries. > > The most beneficial use of them is to improve performance of rapid > fire lookups for non-existent names. Longer lived negative hashed > dentries don't give much benefit at all unless they suddenly have > lots of hits and that would cost a single allocation on the first > lookup if the dentry ttl expired and the dentry discarded. > > A ttl (say jiffies) set at appropriate times could be a better > choice all round, no sysctl values at all. The canonical argument in favour of negative dentries is to improve application startup time as every application searches the library path for the same libraries. Only they don't do that any more: $ strace -e file cat /dev/null execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars */) = 0 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3 So, are they still useful? Or should we, say, keep at most 100 around?
On Thu, 2020-02-27 at 19:34 -0800, Matthew Wilcox wrote: > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote: > > Not all file systems even produce negative hashed dentries. > > > > The most beneficial use of them is to improve performance of rapid > > fire lookups for non-existent names. Longer lived negative hashed > > dentries don't give much benefit at all unless they suddenly have > > lots of hits and that would cost a single allocation on the first > > lookup if the dentry ttl expired and the dentry discarded. > > > > A ttl (say jiffies) set at appropriate times could be a better > > choice all round, no sysctl values at all. > > The canonical argument in favour of negative dentries is to improve > application startup time as every application searches the library > path > for the same libraries. Only they don't do that any more: > > $ strace -e file cat /dev/null > execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars > */) = 0 > access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or > directory) > openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", > O_RDONLY|O_CLOEXEC) = 3 > openat(AT_FDCWD, "/usr/lib/locale/locale-archive", > O_RDONLY|O_CLOEXEC) = 3 > openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3 > > So, are they still useful? Or should we, say, keep at most 100 > around? Who knows how old apps will be on distros., ;) But I don't think it matters. The VFS will (should) work fine without a minimum negative hashed dentry count (and hashed since unhashed negative dentries are summarily executed on final dput()) and a ttl should keep frequently used ones around long enough to satisfy this sort of thing should it be needed. Even the ttl value should be resilient to a large range of values, just not so much very small ones. Ian
On Thu, Feb 27, 2020 at 07:34:12PM -0800, Matthew Wilcox wrote: > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote: > > Not all file systems even produce negative hashed dentries. > > > > The most beneficial use of them is to improve performance of rapid > > fire lookups for non-existent names. Longer lived negative hashed > > dentries don't give much benefit at all unless they suddenly have > > lots of hits and that would cost a single allocation on the first > > lookup if the dentry ttl expired and the dentry discarded. > > > > A ttl (say jiffies) set at appropriate times could be a better > > choice all round, no sysctl values at all. > > The canonical argument in favour of negative dentries is to improve > application startup time as every application searches the library path > for the same libraries. Only they don't do that any more: Tell that to scripts that keep looking through $PATH for binaries each time they are run. Tell that to cc(1) looking through include path, etc. Ian, autofs is deeply pathological in that respect; that's OK, since it has very unusual needs, but please don't use it as a model for anything else - its needs *are* unusual.
On Fri, 2020-02-28 at 12:16 +0800, Ian Kent wrote: > On Thu, 2020-02-27 at 19:34 -0800, Matthew Wilcox wrote: > > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote: > > > Not all file systems even produce negative hashed dentries. > > > > > > The most beneficial use of them is to improve performance of > > > rapid > > > fire lookups for non-existent names. Longer lived negative hashed > > > dentries don't give much benefit at all unless they suddenly have > > > lots of hits and that would cost a single allocation on the first > > > lookup if the dentry ttl expired and the dentry discarded. > > > > > > A ttl (say jiffies) set at appropriate times could be a better > > > choice all round, no sysctl values at all. > > > > The canonical argument in favour of negative dentries is to improve > > application startup time as every application searches the library > > path > > for the same libraries. Only they don't do that any more: > > > > $ strace -e file cat /dev/null > > execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars > > */) = 0 > > access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file > > or > > directory) > > openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 > > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", > > O_RDONLY|O_CLOEXEC) = 3 > > openat(AT_FDCWD, "/usr/lib/locale/locale-archive", > > O_RDONLY|O_CLOEXEC) = 3 > > openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3 > > > > So, are they still useful? Or should we, say, keep at most 100 > > around? > > Who knows how old apps will be on distros., ;) Or what admins put in the PATH, I've seen oddness in that a lot. > > But I don't think it matters. And I don't think I made my answer to the question clear. I don't think setting a minimum matters but there are other sources of a possibly significant number of lookups on paths that don't exist. I've seen evidence recently (although I suspect unfounded) that systemd can generate lots of these lookups at times. And let's not forget that file systems are the primary source of these and not all create them on lookups. I may be mistaken, but I think ext4 does not while xfs definitely does. The more important metric I think is calculating a sensible maximum to be pruned to prevent getting bogged down as there could be times when a lot of these are present. After all this is meant to be an iterative pruning measure. Ian
On Fri, 2020-02-28 at 04:22 +0000, Al Viro wrote: > On Thu, Feb 27, 2020 at 07:34:12PM -0800, Matthew Wilcox wrote: > > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote: > > > Not all file systems even produce negative hashed dentries. > > > > > > The most beneficial use of them is to improve performance of > > > rapid > > > fire lookups for non-existent names. Longer lived negative hashed > > > dentries don't give much benefit at all unless they suddenly have > > > lots of hits and that would cost a single allocation on the first > > > lookup if the dentry ttl expired and the dentry discarded. > > > > > > A ttl (say jiffies) set at appropriate times could be a better > > > choice all round, no sysctl values at all. > > > > The canonical argument in favour of negative dentries is to improve > > application startup time as every application searches the library > > path > > for the same libraries. Only they don't do that any more: > > Tell that to scripts that keep looking through $PATH for > binaries each time they are run. Tell that to cc(1) looking through > include path, etc. > > Ian, autofs is deeply pathological in that respect; that's OK, > since it has very unusual needs, but please don't use it as a model > for anything else - its needs *are* unusual. Ok, but my thoughts aren't based on autofs behaviours. But it sounds like you don't believe this is a sensible suggestion and you would know best so ... Ian
On Fri, Feb 28, 2020 at 12:36:09PM +0800, Ian Kent wrote: > And let's not forget that file systems are the primary > source of these and not all create them on lookups. > I may be mistaken, but I think ext4 does not while xfs > definitely does. Both ext4 and xfs bloody well *DO* create hashed negative dentries on lookups. There is a pathological case when they are trying to be case-insensitive (and in that situation we are SOL - if somebody insists upon mounting with -o make-it-suck, that's what they bloody well get). Casefondling idiocy aside, negative lookups are hashed. On all normal filesystems. Look for d_splice_alias() getting passed NULL inode - that's where ->lookup() instances normally create those.
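A schematic ->lookup() in the pattern Al points at; foo_find_ino() and foo_iget() are hypothetical stand-ins for a real filesystem's directory search and inode read, and error handling is omitted. The point is only the last line: when the name is not found, d_splice_alias() is handed a NULL inode and hashes the dentry as a negative entry.

```c
#include <linux/fs.h>
#include <linux/dcache.h>

/* Hypothetical helpers: search the on-disk directory, read an inode. */
static unsigned long foo_find_ino(struct inode *dir, const struct qstr *name);
static struct inode *foo_iget(struct super_block *sb, unsigned long ino);

static struct dentry *foo_lookup(struct inode *dir, struct dentry *dentry,
				 unsigned int flags)
{
	struct inode *inode = NULL;
	unsigned long ino = foo_find_ino(dir, &dentry->d_name);

	if (ino)
		inode = foo_iget(dir->i_sb, ino);

	/* inode == NULL here => the dentry is hashed as a negative entry */
	return d_splice_alias(inode, dentry);
}
```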
On 2/27/20 10:34 PM, Matthew Wilcox wrote: > On Thu, Feb 27, 2020 at 05:55:43PM +0800, Ian Kent wrote: >> Not all file systems even produce negative hashed dentries. >> >> The most beneficial use of them is to improve performance of rapid >> fire lookups for non-existent names. Longer lived negative hashed >> dentries don't give much benefit at all unless they suddenly have >> lots of hits and that would cost a single allocation on the first >> lookup if the dentry ttl expired and the dentry discarded. >> >> A ttl (say jiffies) set at appropriate times could be a better >> choice all round, no sysctl values at all. > The canonical argument in favour of negative dentries is to improve > application startup time as every application searches the library path > for the same libraries. Only they don't do that any more: > > $ strace -e file cat /dev/null > execve("/bin/cat", ["cat", "/dev/null"], 0x7ffd5f7ddda8 /* 44 vars */) = 0 > access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) > openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 > openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3 > openat(AT_FDCWD, "/dev/null", O_RDONLY) = 3 > > So, are they still useful? Or should we, say, keep at most 100 around? > It is the shell that does the path search, not the command itself. Cheers, Longman
On Fri, Feb 28, 2020 at 10:32:02AM -0500, Waiman Long wrote:
> On 2/27/20 10:34 PM, Matthew Wilcox wrote:
> > The canonical argument in favour of negative dentries is to improve
> > application startup time as every application searches the library path
                                                               ^^^^^^^
> > for the same libraries. Only they don't do that any more:
                 ^^^^^^^^^
>
> It is the shell that does the path search, not the command itself.
On 2/27/20 3:30 AM, Dave Chinner wrote: > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: >> As there is no limit for negative dentries, it is possible that a sizeable >> portion of system memory can be tied up in dentry cache slabs. Dentry slabs >> are generally recalimable if the dentries are in the LRUs. Still having >> too much memory used up by dentries can be problematic: > I don't get it. > > Why isn't the solution simply "constrain the application generating > unbound numbers of dentries to a memcg"? > > Then when the memcg runs out of memory, it will start reclaiming the > dentries that were allocated inside the memcg that are using all > it's resources, thereby preventing unbound growth of the dentry > cache. > > I mean, this sort of resource control is exactly what memcgs are > supposed to be used for and are already used for. I don't see why we > need all this complexity for global dentry resource management when > memcgs should already provide an effective means of managing and > placing bounds on the amount of memory any specific application can > use... Using memcg is one way to limit the damage. The argument that excessive negative dentries can push out existing memory objects that could be more useful if left alone still applies. Daemons that run in the root memcg have no limitation on how much memory they can use. There can also be memcgs with high memory limits and long running applications. memcg is certainly a useful tool in this regard, but it doesn't solve all the problems. Cheers, Longman
On Thu, Feb 27, 2020 at 07:34:12PM -0800, Matthew Wilcox wrote: > > The canonical argument in favour of negative dentries is to improve > application startup time as every application searches the library path > for the same libraries. The other canonical example is C compilers that need to search for header files along the include search path: % strace -o /tmp/st -f gcc -o /tmp/hello /tmp/hello.c -I.. -I../.. % grep open /tmp/st | grep stdio.h | grep ENOENT | wc -l 6 - Ted
On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: > As there is no limit for negative dentries, it is possible that a sizeable > portion of system memory can be tied up in dentry cache slabs. Dentry slabs > are generally recalimable if the dentries are in the LRUs. Still having > too much memory used up by dentries can be problematic: > > 1) When a filesystem with too many negative dentries is being unmounted, > the process of draining the dentries associated with the filesystem > can take some time. To users, the system may seem to hang for > a while. The long wait may also cause unexpected timeout error or > other warnings. This can happen when a long running container with > many negative dentries is being destroyed, for instance. > > 2) Tying up too much memory in unused negative dentries means there > are less memory available for other use. Even though the kernel is > able to reclaim unused dentries when running out of free memory, > it will still introduce additional latency to the application > reducing its performance. There's a third problem, which is that having a lot of negative dentries can clog the hash chains. I tried to quantify this, and found a weird result: root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m5.402s user 0m4.361s sys 0m1.230s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m5.572s user 0m4.337s sys 0m1.407s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m5.607s user 0m4.522s sys 0m1.342s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m5.599s user 0m4.472s sys 0m1.369s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m5.574s user 0m4.498s sys 0m1.300s Pretty consistent system time, between about 1.3 and 1.4 seconds. root@bobo-kvm:~# grep dentry /proc/slabinfo dentry 20394 21735 192 21 1 : tunables 0 0 0 : slabdata 1035 1035 0 root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m5.515s user 0m4.353s sys 0m1.359s At this point, we have 20k dentries allocated. Now, pollute the dcache with names that don't exist: root@bobo-kvm:~# for i in `seq 1 100000`; do cat /dev/null$i >/dev/zero; done 2>/dev/null root@bobo-kvm:~# grep dentry /proc/slabinfo dentry 20605 21735 192 21 1 : tunables 0 0 0 : slabdata 1035 1035 0 Huh. We've kept the number of dentries pretty constant. Still, maybe the bad dentries have pushed out the good ones. root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m6.644s user 0m4.921s sys 0m1.946s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m6.676s user 0m5.004s sys 0m1.909s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m6.662s user 0m4.980s sys 0m1.916s root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done real 0m6.714s user 0m4.973s sys 0m1.986s Well, we certainly made it suck. Up to a pretty consistent 1.9-2.0 seconds of kernel time, or 50% worse. We've also made user time worse, somehow. Anyhow, I should write a proper C program to measure this. But I thought I'd share this raw data with you now to demonstrate that dcache pollution is a real problem today, even on a machine with 2GB.
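A rough sketch of the "proper C program" mentioned at the end: pollute the dcache with names that don't exist, then time repeated lookups of a name that does. The counts and paths mirror the shell experiment above but are otherwise arbitrary.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	struct stat st;
	char name[64];
	double t;
	int i;

	/* Pollute: each failed lookup can leave a hashed negative dentry. */
	for (i = 0; i < 100000; i++) {
		snprintf(name, sizeof(name), "/dev/null%d", i);
		stat(name, &st);
	}

	/* Measure hot lookups of a name that exists in the same directory. */
	t = now();
	for (i = 0; i < 1000000; i++)
		stat("/dev/null", &st);
	printf("1M stat(\"/dev/null\"): %.3f s\n", now() - t);
	return 0;
}
```

Run it once with the pollution loop commented out and once as-is to compare hot-lookup times on a clean versus polluted dcache.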
On 15/03/2020 06.46, Matthew Wilcox wrote: > On Wed, Feb 26, 2020 at 11:13:53AM -0500, Waiman Long wrote: >> As there is no limit for negative dentries, it is possible that a sizeable >> portion of system memory can be tied up in dentry cache slabs. Dentry slabs >> are generally recalimable if the dentries are in the LRUs. Still having >> too much memory used up by dentries can be problematic: >> >> 1) When a filesystem with too many negative dentries is being unmounted, >> the process of draining the dentries associated with the filesystem >> can take some time. To users, the system may seem to hang for >> a while. The long wait may also cause unexpected timeout error or >> other warnings. This can happen when a long running container with >> many negative dentries is being destroyed, for instance. >> >> 2) Tying up too much memory in unused negative dentries means there >> are less memory available for other use. Even though the kernel is >> able to reclaim unused dentries when running out of free memory, >> it will still introduce additional latency to the application >> reducing its performance. > > There's a third problem, which is that having a lot of negative dentries > can clog the hash chains. I tried to quantify this, and found a weird result:

Yep. I've seen this in the wild. Server had too much unused memory.
https://lore.kernel.org/lkml/ff0993a2-9825-304c-6a5b-2e9d4b940032@yandex-team.ru/T/#u

---quote---
I've seen this problem on a large server where a horde of negative dentries slowed down all lookups significantly:

watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [atop:968884] at __d_lookup_rcu+0x6f/0x190

slabtop:
    OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
85118166 85116916   0%    0.19K 2026623       42  16212984K dentry
16577106 16371723   0%    0.10K  425054       39   1700216K buffer_head
  935850   934379   0%    1.05K   31195       30    998240K ext4_inode_cache
  663740   654967   0%    0.57K   23705       28    379280K radix_tree_node
  399987   380055   0%    0.65K    8163       49    261216K proc_inode_cache
  226380   168813   0%    0.19K    5390       42     43120K cred_jar
   70345    65721   0%    0.58K    1279       55     40928K inode_cache
  105927    43314   0%    0.31K    2077       51     33232K filp
  630972   601503   0%    0.04K    6186      102     24744K ext4_extent_status
    5848     4269   0%    3.56K     731        8     23392K task_struct
   16224    11531   0%    1.00K     507       32     16224K kmalloc-1024
    6752     5833   0%    2.00K     422       16     13504K kmalloc-2048
  199680   158086   0%    0.06K    3120       64     12480K anon_vma_chain
  156128   154751   0%    0.07K    2788       56     11152K Acpi-Operand

Total RAM is 256 GB

These dentries came from temporary files created and deleted by postgres. But this could be easily reproduced by lookups of non-existent files. Of course, memory pressure easily washes them away.

A similar problem happened before around proc sysctl entries: https://lkml.org/lkml/2017/2/10/47 This one does not concentrate in one bucket and needs much more memory.

Looks like the dcache needs some kind of background shrinker, started when the dcache size or the fraction of negative dentries exceeds some threshold.
---end--- > > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m5.402s > user 0m4.361s > sys 0m1.230s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m5.572s > user 0m4.337s > sys 0m1.407s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m5.607s > user 0m4.522s > sys 0m1.342s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m5.599s > user 0m4.472s > sys 0m1.369s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m5.574s > user 0m4.498s > sys 0m1.300s > > Pretty consistent system time, between about 1.3 and 1.4 seconds. > > root@bobo-kvm:~# grep dentry /proc/slabinfo > dentry 20394 21735 192 21 1 : tunables 0 0 0 : slabdata 1035 1035 0 > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m5.515s > user 0m4.353s > sys 0m1.359s > > At this point, we have 20k dentries allocated. > > Now, pollute the dcache with names that don't exist: > > root@bobo-kvm:~# for i in `seq 1 100000`; do cat /dev/null$i >/dev/zero; done 2>/dev/null > root@bobo-kvm:~# grep dentry /proc/slabinfo > dentry 20605 21735 192 21 1 : tunables 0 0 0 : slabdata 1035 1035 0 > > Huh. We've kept the number of dentries pretty constant. Still, maybe the > bad dentries have pushed out the good ones. > > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m6.644s > user 0m4.921s > sys 0m1.946s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m6.676s > user 0m5.004s > sys 0m1.909s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m6.662s > user 0m4.980s > sys 0m1.916s > root@bobo-kvm:~# time for i in `seq 1 10000`; do cat /dev/null >/dev/zero; done > real 0m6.714s > user 0m4.973s > sys 0m1.986s > > Well, we certainly made it suck. Up to a pretty consistent 1.9-2.0 seconds > of kernel time, or 50% worse. We've also made user time worse, somehow. > > Anyhow, I should write a proper C program to measure this. But I thought > I'd share this raw data with you now to demonstrate that dcache pollution > is a real problem today, even on a machine with 2GB. >
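A sketch of the background shrinker suggested at the end of the quoted report, kicking in once negative dentries exceed some fraction of the whole dcache; the 3/4 threshold, the batch size and the prune callback are all invented here.

```c
#include <stdbool.h>
#include <stdio.h>

#define NEG_NUM		3		/* start pruning above 3/4 negative */
#define NEG_DEN		4
#define PRUNE_BATCH	1024

static bool need_background_prune(unsigned long nr_dentry,
				  unsigned long nr_negative)
{
	return nr_negative * NEG_DEN > nr_dentry * NEG_NUM;
}

/* Run from a periodic worker; prune bounded batches so no single pass
 * stalls the way unmount-time draining of millions of dentries can. */
static unsigned long background_prune(unsigned long nr_dentry,
				      unsigned long nr_negative,
				      unsigned long (*prune)(unsigned long))
{
	unsigned long freed = 0;

	while (need_background_prune(nr_dentry - freed, nr_negative - freed)) {
		unsigned long n = prune(PRUNE_BATCH);

		if (!n)
			break;
		freed += n;
	}
	return freed;
}

static unsigned long fake_prune(unsigned long batch)
{
	return batch;			/* stand-in for real LRU pruning */
}

int main(void)
{
	/* Counts of the order seen in the slabtop output quoted above. */
	printf("would prune ~%lu dentries\n",
	       background_prune(85118166UL, 85000000UL, fake_prune));
	return 0;
}
```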