Message ID | 20220612183301.981616-3-longman@redhat.com (mailing list archive) |
---|---|
State | New |
Series | mm/kmemleak: Avoid soft lockup in kmemleak_scan() |
On Sun, Jun 12, 2022 at 02:33:00PM -0400, Waiman Long wrote:
> With a debug kernel running on a 2-socket 96-thread x86-64 system
> (HZ=1000), the 2nd and 3rd iteration loops speedup with this patch on
> the first kmemleak_scan() call after bootup is shown in the table below.
>
>            Before patch                    After patch
>   Loop #  # of objects  Elapsed time   # of objects  Elapsed time
>   ------  ------------  ------------   ------------  ------------
>      2      2,599,850      2.392s        2,596,364      0.266s
>      3      2,600,176      2.171s        2,597,061      0.260s
>
> This patch reduces loop iteration times by about 88%. This will greatly
> reduce the chance of a soft lockup happening in the 2nd or 3rd iteration
> loops.

Nice numbers, thanks for digging into this.

But I'm slightly surprised that the first loop doesn't cause any
problems.

> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index dad9219c972c..7dd64139a7c7 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -1508,6 +1508,13 @@ static void kmemleak_scan(void)
>  	 */
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(object, &object_list, object_list) {
> +		/*
> +		 * This is racy but we can save the overhead of lock/unlock
> +		 * calls. The missed objects, if any, should be caught in
> +		 * the next scan.
> +		 */
> +		if (!color_white(object))
> +			continue;
>  		raw_spin_lock_irq(&object->lock);
>  		if (color_white(object) && (object->flags & OBJECT_ALLOCATED)
>  		    && update_checksum(object) && get_object(object)) {

It's not actually scanning (like tree look-ups) but only updating the
checksum of the potentially orphan objects. If the problem is caused by
object->lock, we should have seen it with the first loop as well.

It is possible that some large list is occasionally missed if there are
concurrent updates and a significant number of objects turn up "white",
forcing the checksum update. Otherwise this shouldn't be much different
from the first loop if there are no massive (false) leaks.

I think the race on color_white() can only be with a kmemleak_ignore()
or kmemleak_not_leak() call, otherwise the object colour shouldn't be
changed. So such objects can only turn from white to gray or black, so
the race I think is safe.

> @@ -1535,6 +1542,13 @@ static void kmemleak_scan(void)
>  	 */
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(object, &object_list, object_list) {
> +		/*
> +		 * This is racy but we can save the overhead of lock/unlock
> +		 * calls. The missed objects, if any, should be caught in
> +		 * the next scan.
> +		 */
> +		if (!color_white(object))
> +			continue;
>  		raw_spin_lock_irq(&object->lock);
>  		if (unreferenced_object(object) &&
>  		    !(object->flags & OBJECT_REPORTED)) {

Same here.

I did wonder whether it's worth keeping object->lock around, I even have
a stashed patch lying around from 2019. Instead we'd have the big
kmemleak_lock held for longer, though released periodically during
scanning. We can then move the lock outside the loop and traversal would
be faster but with an increased latency on slab allocation/freeing on
other CPUs. Right now we take the kmemleak_lock when scanning a single
block (e.g. object) to protect the rb-tree and rely on object->lock to
ensure the object isn't freed. Other concurrent allocs/frees would only
be blocked during single object scanning.

Anyway, I'm not entirely sure it's the lock causing the issue as we
don't see it with the first loop. I'm more inclined to think it's the
checksum and the skipping if !color_white() would do the trick.

Unless there's a better idea:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
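To make the trade-off Catalin sketches above more concrete, the snippet below is a minimal userspace C sketch (not kernel code, and not his stashed patch) contrasting a per-object lock taken on every iteration with a single coarse lock that is dropped periodically during the traversal so other threads can make progress. The names `coarse_lock`, `scan_batched()` and `BATCH_SIZE` are invented for illustration; the real code uses `object->lock`, `kmemleak_lock` and RCU list traversal as shown in the quoted diff.

```c
#include <pthread.h>
#include <stddef.h>

#define NR_OBJECTS 1024
#define BATCH_SIZE 64

struct obj {
	pthread_mutex_t lock;	/* analogue of object->lock */
	int interesting;	/* analogue of a color_white() result */
};

static struct obj objects[NR_OBJECTS];
/* analogue of the global kmemleak_lock */
static pthread_mutex_t coarse_lock = PTHREAD_MUTEX_INITIALIZER;

/* Current scheme: lock/unlock each object individually. */
static void scan_per_object(void)
{
	for (size_t i = 0; i < NR_OBJECTS; i++) {
		pthread_mutex_lock(&objects[i].lock);
		if (objects[i].interesting)
			objects[i].interesting = 0;	/* placeholder for real work */
		pthread_mutex_unlock(&objects[i].lock);
	}
}

/*
 * Alternative: hold one coarse lock across a batch of objects and drop
 * it periodically so other threads that also need coarse_lock (the
 * allocation/free paths in the kmemleak analogy) are not blocked for
 * the whole traversal.
 */
static void scan_batched(void)
{
	pthread_mutex_lock(&coarse_lock);
	for (size_t i = 0; i < NR_OBJECTS; i++) {
		if (objects[i].interesting)
			objects[i].interesting = 0;	/* placeholder for real work */

		if ((i + 1) % BATCH_SIZE == 0) {
			pthread_mutex_unlock(&coarse_lock);
			/* other threads may run here */
			pthread_mutex_lock(&coarse_lock);
		}
	}
	pthread_mutex_unlock(&coarse_lock);
}

int main(void)
{
	for (size_t i = 0; i < NR_OBJECTS; i++)
		pthread_mutex_init(&objects[i].lock, NULL);

	scan_per_object();
	scan_batched();
	return 0;
}
```

The first scheme blocks other threads only for the duration of a single object, at the cost of one lock/unlock pair per iteration; the batched scheme pays for the lock far less often but increases the worst-case latency seen by anyone else contending for the coarse lock.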
On 6/14/22 12:54, Catalin Marinas wrote:
> On Sun, Jun 12, 2022 at 02:33:00PM -0400, Waiman Long wrote:
>> With a debug kernel running on a 2-socket 96-thread x86-64 system
>> (HZ=1000), the 2nd and 3rd iteration loops speedup with this patch on
>> the first kmemleak_scan() call after bootup is shown in the table below.
>>
>>            Before patch                    After patch
>>   Loop #  # of objects  Elapsed time   # of objects  Elapsed time
>>   ------  ------------  ------------   ------------  ------------
>>      2      2,599,850      2.392s        2,596,364      0.266s
>>      3      2,600,176      2.171s        2,597,061      0.260s
>>
>> This patch reduces loop iteration times by about 88%. This will greatly
>> reduce the chance of a soft lockup happening in the 2nd or 3rd iteration
>> loops.
> Nice numbers, thanks for digging into this.
>
> But I'm slightly surprised that the first loop doesn't cause any
> problems.

The first loop is still problematic. It is just a bit faster with the
same number of objects. The corresponding elapsed time is about 1.7s.
The heuristics used in this patch cannot be applied to the first loop.
See patch 3 on how to avoid soft lockup in the first loop.

>
>> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
>> index dad9219c972c..7dd64139a7c7 100644
>> --- a/mm/kmemleak.c
>> +++ b/mm/kmemleak.c
>> @@ -1508,6 +1508,13 @@ static void kmemleak_scan(void)
>>  	 */
>>  	rcu_read_lock();
>>  	list_for_each_entry_rcu(object, &object_list, object_list) {
>> +		/*
>> +		 * This is racy but we can save the overhead of lock/unlock
>> +		 * calls. The missed objects, if any, should be caught in
>> +		 * the next scan.
>> +		 */
>> +		if (!color_white(object))
>> +			continue;
>>  		raw_spin_lock_irq(&object->lock);
>>  		if (color_white(object) && (object->flags & OBJECT_ALLOCATED)
>>  		    && update_checksum(object) && get_object(object)) {
> It's not actually scanning (like tree look-ups) but only updating the
> checksum of the potentially orphan objects. If the problem is caused by
> object->lock, we should have seen it with the first loop as well.

See above. Maybe I should clarify in the patch description that similar
change cannot be applied to the first loop.

>
> It is possible that some large list is occasionally missed if there are
> concurrent updates and a significant number of objects turn up "white",
> forcing the checksum update. Otherwise this shouldn't be much different
> from the first loop if there are no massive (false) leaks.
>
> I think the race on color_white() can only be with a kmemleak_ignore()
> or kmemleak_not_leak() call, otherwise the object colour shouldn't be
> changed. So such objects can only turn from white to gray or black, so
> the race I think is safe.
>
>> @@ -1535,6 +1542,13 @@ static void kmemleak_scan(void)
>>  	 */
>>  	rcu_read_lock();
>>  	list_for_each_entry_rcu(object, &object_list, object_list) {
>> +		/*
>> +		 * This is racy but we can save the overhead of lock/unlock
>> +		 * calls. The missed objects, if any, should be caught in
>> +		 * the next scan.
>> +		 */
>> +		if (!color_white(object))
>> +			continue;
>>  		raw_spin_lock_irq(&object->lock);
>>  		if (unreferenced_object(object) &&
>>  		    !(object->flags & OBJECT_REPORTED)) {
> Same here.
>
> I did wonder whether it's worth keeping object->lock around, I even have
> a stashed patch lying around from 2019. Instead we'd have the big
> kmemleak_lock held for longer, though released periodically during
> scanning. We can then move the lock outside the loop and traversal would
> be faster but with an increased latency on slab allocation/freeing on
> other CPUs. Right now we take the kmemleak_lock when scanning a single
> block (e.g. object) to protect the rb-tree and rely on object->lock to
> ensure the object isn't freed. Other concurrent allocs/frees would only
> be blocked during single object scanning.
>
> Anyway, I'm not entirely sure it's the lock causing the issue as we
> don't see it with the first loop. I'm more inclined to think it's the
> checksum and the skipping if !color_white() would do the trick.
>
> Unless there's a better idea:
>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

The lock is a problem because of lockdep. Once I disable lockdep, the
elapsed time can drop to about 0.7s. However, lockdep is normally
enabled in a debug kernel. I will try to investigate if there is a way
to optimize lockdep or such repeated lock/unlock loop.

Thanks,
Longman
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index dad9219c972c..7dd64139a7c7 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1508,6 +1508,13 @@ static void kmemleak_scan(void)
 	 */
 	rcu_read_lock();
 	list_for_each_entry_rcu(object, &object_list, object_list) {
+		/*
+		 * This is racy but we can save the overhead of lock/unlock
+		 * calls. The missed objects, if any, should be caught in
+		 * the next scan.
+		 */
+		if (!color_white(object))
+			continue;
 		raw_spin_lock_irq(&object->lock);
 		if (color_white(object) && (object->flags & OBJECT_ALLOCATED)
 		    && update_checksum(object) && get_object(object)) {
@@ -1535,6 +1542,13 @@ static void kmemleak_scan(void)
 	 */
 	rcu_read_lock();
 	list_for_each_entry_rcu(object, &object_list, object_list) {
+		/*
+		 * This is racy but we can save the overhead of lock/unlock
+		 * calls. The missed objects, if any, should be caught in
+		 * the next scan.
+		 */
+		if (!color_white(object))
+			continue;
 		raw_spin_lock_irq(&object->lock);
 		if (unreferenced_object(object) &&
 		    !(object->flags & OBJECT_REPORTED)) {
There are 3 RCU-based object iteration loops in kmemleak_scan(). Because
of the need to hold the RCU read lock, we can't insert cond_resched()
into the loops like in other parts of the function. As there can be
millions of objects to be scanned, it takes a while to iterate over all
of them. The kmemleak functionality is usually enabled in a debug kernel
which is much slower than a non-debug kernel. With a sufficient number
of kmemleak objects, the time to iterate them all may exceed 22s,
causing a soft lockup.

  watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kmemleak:625]

In this particular bug report, the soft lockup happens in the 2nd
iteration loop.

In the 2nd and 3rd loops, most of the objects are checked and then
skipped under the object lock. Only a selected few are modified. Those
objects certainly need lock protection. However, the lock/unlock
operation is slow, especially with the interrupt disabling and enabling
included.

We can actually do some basic checks like color_white() without taking
the lock and skip the object accordingly. Of course, this kind of check
is racy and may miss objects that are being modified concurrently. The
cost of missing an object, however, is just that it will be discovered
in the next scan instead. The advantage of doing so is that the
iteration can be done much faster, especially with LOCKDEP enabled in a
debug kernel.

With a debug kernel running on a 2-socket 96-thread x86-64 system
(HZ=1000), the speedup of the 2nd and 3rd iteration loops with this
patch on the first kmemleak_scan() call after bootup is shown in the
table below.

           Before patch                    After patch
  Loop #  # of objects  Elapsed time   # of objects  Elapsed time
  ------  ------------  ------------   ------------  ------------
     2      2,599,850      2.392s        2,596,364      0.266s
     3      2,600,176      2.171s        2,597,061      0.260s

This patch reduces loop iteration times by about 88%. This will greatly
reduce the chance of a soft lockup happening in the 2nd or 3rd iteration
loops.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/kmemleak.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
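The core idea in the commit message — do a cheap, racy check first and only take the lock when the object looks interesting, re-checking under the lock before acting — can be illustrated with the self-contained userspace C sketch below. This is not kmemleak code; `white`, `reported` and `scan_pass()` are made-up stand-ins for `color_white()` and the locked checks in the diff above.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct obj {
	pthread_mutex_t lock;
	atomic_int white;	/* stands in for the color_white() state */
	int reported;
};

/*
 * One scan pass: skip non-white objects without touching their lock.
 * The unlocked read is racy; an object flipped to white concurrently
 * may be skipped, but it will be seen by the next pass.
 */
static void scan_pass(struct obj *objs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!atomic_load_explicit(&objs[i].white, memory_order_relaxed))
			continue;	/* cheap skip, no lock/unlock overhead */

		pthread_mutex_lock(&objs[i].lock);
		/* Re-check under the lock before modifying the object. */
		if (atomic_load(&objs[i].white) && !objs[i].reported) {
			objs[i].reported = 1;
			printf("object %zu flagged\n", i);
		}
		pthread_mutex_unlock(&objs[i].lock);
	}
}

int main(void)
{
	struct obj objs[4];

	for (size_t i = 0; i < 4; i++) {
		pthread_mutex_init(&objs[i].lock, NULL);
		atomic_init(&objs[i].white, i % 2);	/* mark some objects "white" */
		objs[i].reported = 0;
	}

	scan_pass(objs, 4);	/* flags objects 1 and 3 */
	scan_pass(objs, 4);	/* already reported, nothing new */
	return 0;
}
```

As in the patch, the optimization only removes the lock/unlock pair for objects that would have been skipped anyway; any object that passes the unlocked check is still inspected and modified only under its lock.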