Message ID | 20190911150537.19527-1-longman@redhat.com (mailing list archive) |
---|---|
Series | hugetlbfs: Disable PMD sharing for large systems |

On Wed, Sep 11, 2019 at 04:05:32PM +0100, Waiman Long wrote:
> A customer with large SMP systems (up to 16 sockets) with an application
> that uses a large amount of static hugepages (~500-1500GB) is experiencing
> random multisecond delays. These delays were caused by the long time it
> took to scan the VMA interval tree with mmap_sem held.
>
> To fix this problem while preserving existing behavior as much as
> possible, we need to allow a timeout in down_write() and disable PMD
> sharing when it is taking too long to do so. Since a transaction can
> involve touching multiple huge pages, timing out for each of the huge
> page interactions does not completely solve the problem. So a threshold
> is set to completely disable PMD sharing if too many timeouts happen.
>
> The first 4 patches of this 5-patch series add a new
> down_write_timedlock() API which accepts a timeout argument and returns
> true if locking is successful or false otherwise. It works more or less
> like a down_write_trylock() but the calling thread may sleep.

Just on general principle, this is a non-starter. If a lock is being
held too long, then whatever the lock is protecting needs fixing.
Adding timeouts to locks and sysctls to tune them is not a viable
solution to address latencies caused by algorithm scalability
issues.

Cheers,

Dave.

On Fri, Sep 13, 2019 at 11:50:43AM +1000, Dave Chinner wrote:
> On Wed, Sep 11, 2019 at 04:05:32PM +0100, Waiman Long wrote:
> > A customer with large SMP systems (up to 16 sockets) with an application
> > that uses a large amount of static hugepages (~500-1500GB) is experiencing
> > random multisecond delays. These delays were caused by the long time it
> > took to scan the VMA interval tree with mmap_sem held.
> >
> > To fix this problem while preserving existing behavior as much as
> > possible, we need to allow a timeout in down_write() and disable PMD
> > sharing when it is taking too long to do so. Since a transaction can
> > involve touching multiple huge pages, timing out for each of the huge
> > page interactions does not completely solve the problem. So a threshold
> > is set to completely disable PMD sharing if too many timeouts happen.
> >
> > The first 4 patches of this 5-patch series add a new
> > down_write_timedlock() API which accepts a timeout argument and returns
> > true if locking is successful or false otherwise. It works more or less
> > like a down_write_trylock() but the calling thread may sleep.
>
> Just on general principle, this is a non-starter. If a lock is being
> held too long, then whatever the lock is protecting needs fixing.
> Adding timeouts to locks and sysctls to tune them is not a viable
> solution to address latencies caused by algorithm scalability
> issues.

I'm very much agreeing here. Lock functions with timeouts are a sign of
horrific design.
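
For readers unfamiliar with the pattern being debated, here is a rough userspace analogy of the behaviour the cover letter describes: a write lock that gives up after a timeout, plus a counter that switches the optimisation off once too many timeouts have occurred. This is not the kernel patch. It uses POSIX `pthread_rwlock_timedwrlock()` in place of the proposed `down_write_timedlock()`, and `write_lock_timed()`, `try_share()`, `struct share_state` and `SHARE_TIMEOUT_LIMIT` are all hypothetical names invented for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical threshold: give up on sharing after this many timeouts. */
#define SHARE_TIMEOUT_LIMIT 3

/* Hypothetical state; stands in for whatever the real lock protects. */
struct share_state {
    pthread_rwlock_t lock;
    atomic_int timeouts;          /* number of timed-out lock attempts */
    atomic_bool sharing_disabled; /* set once the threshold is reached */
};

/*
 * Try to take the write lock, sleeping for at most timeout_ms.
 * Returns true on success, false if the lock could not be taken
 * (timeout or any other error).
 */
static bool write_lock_timed(pthread_rwlock_t *lock, long timeout_ms)
{
    struct timespec deadline;

    assert(timeout_ms >= 0);

    /* pthread_rwlock_timedwrlock() takes an absolute CLOCK_REALTIME time. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    return pthread_rwlock_timedwrlock(lock, &deadline) == 0;
}

/*
 * Attempt the "shared" path. Returns true if the lock was taken and the
 * shared work was done, false if the caller must fall back to the
 * unshared path.
 */
static bool try_share(struct share_state *s, long timeout_ms)
{
    if (atomic_load(&s->sharing_disabled))
        return false;

    if (!write_lock_timed(&s->lock, timeout_ms)) {
        /* Count the failure; disable sharing once the limit is hit. */
        if (atomic_fetch_add(&s->timeouts, 1) + 1 >= SHARE_TIMEOUT_LIMIT)
            atomic_store(&s->sharing_disabled, true);
        return false;
    }

    /* ... work that needs the write lock would go here ... */

    pthread_rwlock_unlock(&s->lock);
    return true;
}
```

The sketch also shows what the replies object to: the timeout and the threshold only limit how long a caller waits, while whatever makes the lock hold time long in the first place stays unchanged.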