Message ID: 20240307031952.2123-1-laoar.shao@gmail.com (mailing list archive)
State:      New
Series:     mm: mglru: Fix soft lockup attributed to scanning folios
On Thu, 7 Mar 2024 11:19:52 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:

> After we enabled mglru on our 384C1536GB production servers, we
> encountered frequent soft lockups attributed to scanning folios.
>
> The soft lockup as follows,
>
> ...
>
> There were a total of 22 tasks waiting for this spinlock
> (RDI: ffff99d2b6ff9050):
>
> crash> foreach RU bt | grep -B 8 queued_spin_lock_slowpath | grep "RDI: ffff99d2b6ff9050" | wc -l
> 22

If we're holding the lock for this long then there's a possibility of
getting hit by the NMI watchdog also.

> Additionally, two other threads were also engaged in scanning folios, one
> with 19 waiters and the other with 15 waiters.
>
> To address this issue under heavy reclaim conditions, we introduced a
> hotfix version of the fix, incorporating cond_resched() in scan_folios().
> Following the application of this hotfix to our servers, the soft lockup
> issue ceased.
>
> ...
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4367,6 +4367,10 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>
>  		if (!--remaining || max(isolated, skipped_zone) >= MIN_LRU_BATCH)
>  			break;
> +
> +		spin_unlock_irq(&lruvec->lru_lock);
> +		cond_resched();
> +		spin_lock_irq(&lruvec->lru_lock);
>  	}

Presumably wrapping this with `if (need_resched())' will save some work.

This lock is held for a reason.  I'd like to see an analysis of why
this change is safe.
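In concrete terms, Andrew's suggestion would presumably look something like
the hunk below -- an untested sketch for illustration only, not a patch that
was posted in this thread:

		if (!--remaining || max(isolated, skipped_zone) >= MIN_LRU_BATCH)
			break;

		/* Only pay the unlock/relock cost when a reschedule is pending. */
		if (need_resched()) {
			spin_unlock_irq(&lruvec->lru_lock);
			cond_resched();
			spin_lock_irq(&lruvec->lru_lock);
		}

Compared with the hotfix above, the lock is kept across iterations unless
need_resched() reports that the scanning task should yield.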
On Fri, Mar 8, 2024 at 1:06 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 7 Mar 2024 11:19:52 +0800 Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > [...]
>
> If we're holding the lock for this long then there's a possibility of
> getting hit by the NMI watchdog also.

The NMI watchdog is disabled as these servers are KVM guest.

kernel.nmi_watchdog = 0
kernel.soft_watchdog = 1

> > [...]
>
> Presumably wrapping this with `if (need_resched())' will save some work.

good suggestion.

> This lock is held for a reason.  I'd like to see an analysis of why
> this change is safe.

I believe the key point here is whether we can reduce the scope of
this lock from:

evict_folios
    spin_lock_irq(&lruvec->lru_lock);
    scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
    scanned += try_to_inc_min_seq(lruvec, swappiness);
    if (get_nr_gens(lruvec, !swappiness) == MIN_NR_GENS)
        scanned = 0;
    spin_unlock_irq(&lruvec->lru_lock);

to:

evict_folios
    spin_lock_irq(&lruvec->lru_lock);
    scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
    spin_unlock_irq(&lruvec->lru_lock);

    spin_lock_irq(&lruvec->lru_lock);
    scanned += try_to_inc_min_seq(lruvec, swappiness);
    if (get_nr_gens(lruvec, !swappiness) == MIN_NR_GENS)
        scanned = 0;
    spin_unlock_irq(&lruvec->lru_lock);

In isolate_folios(), it merely utilizes the min_seq to retrieve the
generation without modifying it. If multiple tasks are running
evict_folios() concurrently, it seems inconsequential whether min_seq
is incremented by one task or another. I'd appreciate Yu's
confirmation on this matter.
On Fri, Mar 08, 2024 at 04:57:08PM +0800, Yafang Shao wrote:
> On Fri, Mar 8, 2024 at 1:06 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > [...]
> >
> > This lock is held for a reason.  I'd like to see an analysis of why
> > this change is safe.
>
> I believe the key point here is whether we can reduce the scope of
> this lock from:
>
> [...]
>
> In isolate_folios(), it merely utilizes the min_seq to retrieve the
> generation without modifying it. If multiple tasks are running
> evict_folios() concurrently, it seems inconsequential whether min_seq
> is incremented by one task or another. I'd appreciate Yu's
> confirmation on this matter.

Hi Yafang,

Thanks for the patch!

Yes, your second analysis is correct -- we can't just drop the lock
as the original patch does because min_seq can be updated in the mean
time. If this happens, the gen value becomes invalid, since it's based
on the expired min_seq:

sort_folio()
{
	..
	gen = lru_gen_from_seq(lrugen->min_seq[type]);
	..
}

The following might be a better approach (untested):

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4255619a1a31..6fe53cfa8ef8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4365,7 +4365,8 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 			skipped_zone += delta;
 		}
 
-		if (!--remaining || max(isolated, skipped_zone) >= MIN_LRU_BATCH)
+		if (!--remaining || max(isolated, skipped_zone) >= MIN_LRU_BATCH ||
+		    spin_is_contended(&lruvec->lru_lock))
 			break;
 	}
 
@@ -4375,7 +4376,8 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 			skipped += skipped_zone;
 		}
 
-		if (!remaining || isolated >= MIN_LRU_BATCH)
+		if (!remaining || isolated >= MIN_LRU_BATCH ||
+		    (scanned && spin_is_contended(&lruvec->lru_lock)))
 			break;
 	}
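As background on why a stale min_seq breaks sort_folio(): the generation
index is derived by folding the sequence number into the small ring of
generations, so once another task advances min_seq, an index computed from
the old value selects a different (wrong) generation list. A simplified
sketch of that mapping, paraphrasing the helper in include/linux/mm_inline.h
(check the tree for the authoritative definition):

	/* Each sequence number maps onto one of MAX_NR_GENS ring slots. */
	static inline int lru_gen_from_seq(unsigned long seq)
	{
		return seq % MAX_NR_GENS;
	}

This is why the suggestion above bails out of scan_folios() on contention
instead of dropping and retaking lru_lock in the middle of a scan.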
On Tue, Mar 12, 2024 at 02:29:48PM -0600, Yu Zhao wrote:
> On Fri, Mar 08, 2024 at 04:57:08PM +0800, Yafang Shao wrote:
> > [...]
>
> [...]
>
> The following might be a better approach (untested):
>
> [...]

A better way might be:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4255619a1a31..ac59f064c4e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4367,6 +4367,11 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 
 		if (!--remaining || max(isolated, skipped_zone) >= MIN_LRU_BATCH)
 			break;
+
+		if (need_resched() || spin_is_contended(&lruvec->lru_lock)) {
+			remaining = 0;
+			break;
+		}
 	}
 
 	if (skipped_zone) {
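A note on why this sketch zeroes remaining instead of only breaking: as the
first diff above shows, the enclosing per-zone loop in scan_folios() also
checks "if (!remaining || isolated >= MIN_LRU_BATCH)", so clearing the budget
makes the outer loop stop as well and the function returns promptly once a
reschedule is pending or the lock is contended. A minimal stand-alone
illustration of the pattern, using hypothetical helpers rather than kernel
code:

	/* Hypothetical sketch of a nested scan that bails out early. */
	static void scan_all_zones(int nr_zones, int budget)
	{
		int remaining = budget;

		for (int zone = 0; zone < nr_zones; zone++) {
			while (have_work(zone)) {
				process_one(zone);

				if (!--remaining)
					break;

				if (should_bail()) {
					/* Zeroing the shared budget trips the outer check too. */
					remaining = 0;
					break;
				}
			}

			if (!remaining)
				break;
		}
	}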
On Wed, Mar 13, 2024 at 6:12 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Mar 12, 2024 at 02:29:48PM -0600, Yu Zhao wrote:
> > [...]
> >
> > The following might be a better approach (untested):
> >
> > [...]
>
> A better way might be:
>
> [...]

It is better. Thanks for your suggestion. I will verify it on our
production servers, which may take several days.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f9c854ce6cc..8f2877285b9a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4367,6 +4367,10 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 
 		if (!--remaining || max(isolated, skipped_zone) >= MIN_LRU_BATCH)
 			break;
+
+		spin_unlock_irq(&lruvec->lru_lock);
+		cond_resched();
+		spin_lock_irq(&lruvec->lru_lock);
 	}
 
 	if (skipped_zone) {
After we enabled mglru on our 384C1536GB production servers, we
encountered frequent soft lockups attributed to scanning folios.

The soft lockup as follows,

[Sat Feb 24 02:29:42 2024] watchdog: BUG: soft lockup - CPU#215 stuck for 111s! [kworker/215:0:2200100]
[Sat Feb 24 02:29:42 2024] Call Trace:
[Sat Feb 24 02:29:42 2024]  <IRQ>
[Sat Feb 24 02:29:42 2024]  ? show_regs.cold+0x1a/0x1f
[Sat Feb 24 02:29:42 2024]  ? watchdog_timer_fn+0x1c4/0x220
[Sat Feb 24 02:29:42 2024]  ? softlockup_fn+0x30/0x30
[Sat Feb 24 02:29:42 2024]  ? __hrtimer_run_queues+0xa2/0x2b0
[Sat Feb 24 02:29:42 2024]  ? hrtimer_interrupt+0x109/0x220
[Sat Feb 24 02:29:42 2024]  ? __sysvec_apic_timer_interrupt+0x5e/0x110
[Sat Feb 24 02:29:42 2024]  ? sysvec_apic_timer_interrupt+0x7b/0x90
[Sat Feb 24 02:29:42 2024]  </IRQ>
[Sat Feb 24 02:29:42 2024]  <TASK>
[Sat Feb 24 02:29:42 2024]  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[Sat Feb 24 02:29:42 2024]  ? folio_end_writeback+0x73/0xa0
[Sat Feb 24 02:29:42 2024]  ? folio_rotate_reclaimable+0x8c/0x90
[Sat Feb 24 02:29:42 2024]  ? folio_rotate_reclaimable+0x57/0x90
[Sat Feb 24 02:29:42 2024]  ? folio_rotate_reclaimable+0x8c/0x90
[Sat Feb 24 02:29:42 2024]  folio_end_writeback+0x73/0xa0
[Sat Feb 24 02:29:42 2024]  iomap_finish_ioend+0x1d4/0x420
[Sat Feb 24 02:29:42 2024]  iomap_finish_ioends+0x5e/0xe0
[Sat Feb 24 02:29:42 2024]  xfs_end_ioend+0x65/0x150 [xfs]
[Sat Feb 24 02:29:42 2024]  xfs_end_io+0xbc/0xf0 [xfs]
[Sat Feb 24 02:29:42 2024]  process_one_work+0x1ec/0x3c0
[Sat Feb 24 02:29:42 2024]  worker_thread+0x4d/0x390
[Sat Feb 24 02:29:42 2024]  ? process_one_work+0x3c0/0x3c0
[Sat Feb 24 02:29:42 2024]  kthread+0xee/0x120
[Sat Feb 24 02:29:42 2024]  ? kthread_complete_and_exit+0x20/0x20
[Sat Feb 24 02:29:42 2024]  ret_from_fork+0x1f/0x30
[Sat Feb 24 02:29:42 2024]  </TASK>

From our analysis of the vmcore generated by the soft lockup, the
thread was waiting for the spinlock lruvec->lru_lock:

PID: 2200100  TASK: ffff9a221d8b4000  CPU: 215  COMMAND: "kworker/215:0"
 #0 [fffffe000319ae20] crash_nmi_callback at ffffffff8e055419
 #1 [fffffe000319ae58] nmi_handle at ffffffff8e0253c0
 #2 [fffffe000319aea0] default_do_nmi at ffffffff8eae5985
 #3 [fffffe000319aec8] exc_nmi at ffffffff8eae5b78
 #4 [fffffe000319aef0] end_repeat_nmi at ffffffff8ec015f0
    [exception RIP: queued_spin_lock_slowpath+59]
    RIP: ffffffff8eaf9b8b  RSP: ffffb58b01d4fc20  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffffb58b01d4fc90  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: 0000000000000001  RDI: ffff99d2b6ff9050
    RBP: ffffb58b01d4fc40   R8: 0000000000035b21   R9: 0000000000000040
    R10: 0000000000035b00  R11: 0000000000000001  R12: ffff99d2b6ff9050
    R13: 0000000000000046  R14: ffffffff8e28bd30  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #5 [ffffb58b01d4fc20] queued_spin_lock_slowpath at ffffffff8eaf9b8b
 #6 [ffffb58b01d4fc48] _raw_spin_lock_irqsave at ffffffff8eaf9b11
 #7 [ffffb58b01d4fc68] folio_lruvec_lock_irqsave at ffffffff8e337a82
 #8 [ffffb58b01d4fc88] folio_batch_move_lru at ffffffff8e28dbcf
 #9 [ffffb58b01d4fcd0] folio_batch_add_and_move at ffffffff8e28dce7
#10 [ffffb58b01d4fce0] folio_rotate_reclaimable at ffffffff8e28eee7
#11 [ffffb58b01d4fcf8] folio_end_writeback at ffffffff8e27bfb3
#12 [ffffb58b01d4fd10] iomap_finish_ioend at ffffffff8e3d9d04
#13 [ffffb58b01d4fd98] iomap_finish_ioends at ffffffff8e3d9fae
#14 [ffffb58b01d4fde0] xfs_end_ioend at ffffffffc0fae835 [xfs]
#15 [ffffb58b01d4fe20] xfs_end_io at ffffffffc0fae9dc [xfs]
#16 [ffffb58b01d4fe60] process_one_work at ffffffff8e0ae08c
#17 [ffffb58b01d4feb0] worker_thread at ffffffff8e0ae2ad
#18 [ffffb58b01d4ff10] kthread at ffffffff8e0b671e
#19 [ffffb58b01d4ff50] ret_from_fork at ffffffff8e002dcf

While the spinlock (RDI: ffff99d2b6ff9050) was held by a task which
was scanning folios:

PID: 2400713  TASK: ffff996be1d14000  CPU: 50  COMMAND: "chitu_main"
--- <NMI exception stack> ---
 #5 [ffffb58b14ef76e8] __mod_zone_page_state at ffffffff8e2a9c36
 #6 [ffffb58b14ef76f0] folio_inc_gen at ffffffff8e2990bd
 #7 [ffffb58b14ef7740] sort_folio at ffffffff8e29afbb
 #8 [ffffb58b14ef7748] sysvec_apic_timer_interrupt at ffffffff8eae79f0
 #9 [ffffb58b14ef77b0] scan_folios at ffffffff8e29b49b
#10 [ffffb58b14ef7878] evict_folios at ffffffff8e29bb53
#11 [ffffb58b14ef7968] lru_gen_shrink_lruvec at ffffffff8e29cb57
#12 [ffffb58b14ef7a28] shrink_lruvec at ffffffff8e29e135
#13 [ffffb58b14ef7af0] shrink_node at ffffffff8e29e78c
#14 [ffffb58b14ef7b88] do_try_to_free_pages at ffffffff8e29ec08
#15 [ffffb58b14ef7bf8] try_to_free_mem_cgroup_pages at ffffffff8e2a17a6
#16 [ffffb58b14ef7ca8] try_charge_memcg at ffffffff8e338879
#17 [ffffb58b14ef7d48] charge_memcg at ffffffff8e3394f8
#18 [ffffb58b14ef7d70] __mem_cgroup_charge at ffffffff8e33aded
#19 [ffffb58b14ef7d98] do_anonymous_page at ffffffff8e2c6523
#20 [ffffb58b14ef7dd8] __handle_mm_fault at ffffffff8e2cc27d
#21 [ffffb58b14ef7e78] handle_mm_fault at ffffffff8e2cc3ba
#22 [ffffb58b14ef7eb8] do_user_addr_fault at ffffffff8e073a99
#23 [ffffb58b14ef7f20] exc_page_fault at ffffffff8eae82f7
#24 [ffffb58b14ef7f50] asm_exc_page_fault at ffffffff8ec00bb7

There were a total of 22 tasks waiting for this spinlock
(RDI: ffff99d2b6ff9050):

crash> foreach RU bt | grep -B 8 queued_spin_lock_slowpath | grep "RDI: ffff99d2b6ff9050" | wc -l
22

Additionally, two other threads were also engaged in scanning folios, one
with 19 waiters and the other with 15 waiters.

To address this issue under heavy reclaim conditions, we introduced a
hotfix version of the fix, incorporating cond_resched() in scan_folios().
Following the application of this hotfix to our servers, the soft lockup
issue ceased.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: stable@vger.kernel.org # 6.1+
---
 mm/vmscan.c | 4 ++++
 1 file changed, 4 insertions(+)