Message ID | 20210928125500.167943-1-atenart@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | Userspace spinning on net-sysfs access | expand |
On Tue 28-09-21 14:54:51, Antoine Tenart wrote: > Hello, Hi, thanks for posting this. Coincidentally we have come across a similar problem as well just recently. > What made those syscalls to spin is the following construction (which is > found a lot in net sysfs and sysctl code): > > if (!rtnl_trylock()) > return restart_syscall(); One of our customer is using Prometeus (https://github.com/prometheus/prometheus) for monitoring and they have noticed that running several instances of node-exporter can lead to a high CPU utilization. After some investigation it has turned out that most instances are busy looping on on of the sysfs files while one instance is processing sysfs speed file for mlx driver which performs quite a convoluted set of operations (send commands to the lower layers via workqueues) to talk to the device to get the information. The problem becomes more visible with more instance of node-exporter running at parallel. This results in some monitoring alarms at the said machine because the high CPU utilization is not expected. I would appreciate if you CC me on next versions of this patchset. Thankis for working on this!
Quoting Michal Hocko (2021-10-06 08:37:47) > On Tue 28-09-21 14:54:51, Antoine Tenart wrote: > thanks for posting this. Coincidentally we have come across a similar > problem as well just recently. > > > What made those syscalls to spin is the following construction (which is > > found a lot in net sysfs and sysctl code): > > > > if (!rtnl_trylock()) > > return restart_syscall(); > > One of our customer is using Prometeus (https://github.com/prometheus/prometheus) > for monitoring and they have noticed that running several instances of > node-exporter can lead to a high CPU utilization. After some > investigation it has turned out that most instances are busy looping on > on of the sysfs files while one instance is processing sysfs speed file > for mlx driver which performs quite a convoluted set of operations (send > commands to the lower layers via workqueues) to talk to the device to > get the information. > > The problem becomes more visible with more instance of node-exporter > running at parallel. This results in some monitoring alarms at the said > machine because the high CPU utilization is not expected. > > I would appreciate if you CC me on next versions of this patchset. Sure, will do! Nice to see this can help others. Any help on (extensively) testing is welcomed :-) Thanks, Antoine
On Wed 06-10-21 09:59:54, Antoine Tenart wrote: [...] > Nice to see this can help others. I find the timing amusing because this behavior was there for years just hitting us really hard just recently. > Any help on (extensively) testing is welcomed :-) We can help with that for sure.
With the approach taken in this thread not going too well[1], what next? I think we should discuss other possibilities and gather some ideas. Below are some early thoughts, that might not be acceptable. 1) Making an rtnl_lock helper that returns when needed The idea would be to replace rtnl_trylock/restart_syscall by this helper, which would try to grab the rtnl lock and return when needed. Something like the following: static rtnl_lock_ifalive(const struct net_device *dev) { while (!rtnl_trylock()) { if (!dev_isalive(dev)) return -EINVAL; /* Special case for queue files */ if (dev->drain_sysfs_queues) return restart_syscall(); /* something not to have the CPU spinning */ } } One way not to have the CPU spinning is to sleep, let's say with `usleep_range(500, 1000);` (range to be defined properly). The disadvantage is on net device unregistration as we might need to wait for all those loops to return first. (It's a trade-off though, we're not restarting syscalls over and over in other situations). Or would there be something better? Possible improvements: - Add an overall timeout and restart the syscall if we hit it, to have an upper bound. - Make it interruptible, check for need_resched, etc. Note that this approach could work for sysctl files as well; looping over all devices in a netns to make the checks. 2) Interrupt rtnl_lock when in the unregistration paths I'm wondering if using mutex_lock_interruptible in problematic areas (sysfs, sysctl), keeping track of their tasks and interrupting them in the unregistration paths would work and be acceptable. On paper this looks like a solution with not much overhead and not too invasive to implement. But is it acceptable? (Are there some side effects we really don't want?). Note that this would need some thinking to make it safe against sysfs accesses between the tasks interruption and the sysfs files draining (refcount? another lock?). 3) Other ideas? Also, I'm not sure about the -rt implications of all the above. Thanks, Antoine [1] https://lore.kernel.org/netdev/163549826664.3523.4140191764737040064@kwain/
On Fri, 29 Oct 2021 16:33:05 +0200 Antoine Tenart <atenart@kernel.org> wrote: > With the approach taken in this thread not going too well[1], what next? > I think we should discuss other possibilities and gather some ideas. > Below are some early thoughts, that might not be acceptable. > > 1) Making an rtnl_lock helper that returns when needed > > The idea would be to replace rtnl_trylock/restart_syscall by this > helper, which would try to grab the rtnl lock and return when needed. > Something like the following: > > static rtnl_lock_ifalive(const struct net_device *dev) > { > while (!rtnl_trylock()) { > if (!dev_isalive(dev)) > return -EINVAL; > > /* Special case for queue files */ > if (dev->drain_sysfs_queues) > return restart_syscall(); > > /* something not to have the CPU spinning */ > } > } > > One way not to have the CPU spinning is to sleep, let's say with > `usleep_range(500, 1000);` (range to be defined properly). The > disadvantage is on net device unregistration as we might need to wait > for all those loops to return first. (It's a trade-off though, we're not > restarting syscalls over and over in other situations). Or would there > be something better? > > Possible improvements: > - Add an overall timeout and restart the syscall if we hit it, to have > an upper bound. > - Make it interruptible, check for need_resched, etc. > > Note that this approach could work for sysctl files as well; looping > over all devices in a netns to make the checks. > > 2) Interrupt rtnl_lock when in the unregistration paths > > I'm wondering if using mutex_lock_interruptible in problematic areas > (sysfs, sysctl), keeping track of their tasks and interrupting them in > the unregistration paths would work and be acceptable. On paper this > looks like a solution with not much overhead and not too invasive to > implement. But is it acceptable? (Are there some side effects we really > don't want?). > > Note that this would need some thinking to make it safe against sysfs > accesses between the tasks interruption and the sysfs files draining > (refcount? another lock?). > > 3) Other ideas? > > Also, I'm not sure about the -rt implications of all the above. > > Thanks, > Antoine > > [1] https://lore.kernel.org/netdev/163549826664.3523.4140191764737040064@kwain/ As the originator of net-sysfs and the trylock, I appreciate your attempts to do something better. The best solution may actually not be just in the networking code but go all the way back up to restart_syscall() itself. Spinning a CPU itself is not a bad thing, see spin locks. The problem is if the spin causes the active CPU to not make progress.