Message ID | 20221223135207.2275317-1-fengwei.yin@intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [RFC] mm/thp: check and bail out if page in deferred queue already | expand |
On Fri, Dec 23, 2022 at 09:52:07PM +0800, Yin Fengwei wrote: > Kernel build regression with LLVM was reported here: > https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/ > with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP > boundaries"). And the commit f35b5d7d676e was reverted. > > It turned out the regression is related with madvise(MADV_DONTNEED) > was used by ld.lld. But with none PMD_SIZE aligned parameter len. > trace-bpfcc captured: > 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4 > 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4 > > If the underneath physical page is THP, the madvise(MADV_DONTNNED) can > trigger split_queue_lock contention raised significantly. perf showed > following data: > 14.85% 0.00% ld.lld [kernel.kallsyms] [k] > entry_SYSCALL_64_after_hwframe > 11.52% > entry_SYSCALL_64_after_hwframe > do_syscall_64 > __x64_sys_madvise > do_madvise.part.0 > zap_page_range > unmap_single_vma > unmap_page_range > page_remove_rmap > deferred_split_huge_page > __lock_text_start > native_queued_spin_lock_slowpath > > If THP can't be removed from rmap as whole THP, partial THP will be > removed from rmap by removing sub-pages from rmap. Even the THP > head page is added to deferred queue already, the split_queue_lock > will be acquired and check whether the THP head page is in the queue > already. Thus, the contention of split_queue_lock is raised. > > Before acquire split_queue_lock, check and bail out early if the THP > head page is in the queue already. The checking without holding > split_queue_lock could race with deferred_split_scan, but it doesn't > impact the correctness here. > > Test result of building kernel with ld.lld: > commit 7b5a0b664ebe (parent commit of f35b5d7d676e): > time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all > 6:07.99 real, 26367.77 user, 5063.35 sys > > commit f35b5d7d676e: > time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all > 7:22.15 real, 26235.03 user, 12504.55 sys > > commit f35b5d7d676e with the fixing patch: > time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all > 6:08.49 real, 26520.15 user, 5047.91 sys > > Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> I cannot say whether or not this is a good idea or not but it does resolve the regression I reported: Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00016-g7b5a0b664ebe Time (mean ± σ): 383.003 s ± 0.680 s [User: 34737.850 s, System: 7287.079 s] Range (min … max): 382.218 s … 383.413 s 3 runs Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e Time (mean ± σ): 437.886 s ± 1.030 s [User: 35888.658 s, System: 14048.871 s] Range (min … max): 436.865 s … 438.924 s 3 runs Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e-dirty Time (mean ± σ): 384.371 s ± 1.004 s [User: 35402.880 s, System: 6401.691 s] Range (min … max): 383.547 s … 385.489 s 3 runs Tested-by: Nathan Chancellor <nathan@kernel.org> > --- > My first thought was to change the per node deferred queue to per cpu. > It's complicated and may be overkill. > > For the race without lock acquired, I didn't see obvious issue here. But I > could miss something here. Let me know if I did. Thanks. > > > mm/huge_memory.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index abe6cfd92ffa..7cde9f702e63 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2837,6 +2837,9 @@ void deferred_split_huge_page(struct page *page) > if (PageSwapCache(page)) > return; > > + if (!list_empty(page_deferred_list(page))) > + return; > + > spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > if (list_empty(page_deferred_list(page))) { > count_vm_event(THP_DEFERRED_SPLIT_PAGE); > > base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be > -- > 2.34.1 > > >
On Fri, 23 Dec 2022, Yin Fengwei wrote: > Kernel build regression with LLVM was reported here: > https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/ > with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP > boundaries"). And the commit f35b5d7d676e was reverted. > > It turned out the regression is related with madvise(MADV_DONTNEED) > was used by ld.lld. But with none PMD_SIZE aligned parameter len. > trace-bpfcc captured: > 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4 > 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4 > > If the underneath physical page is THP, the madvise(MADV_DONTNNED) can > trigger split_queue_lock contention raised significantly. perf showed > following data: > 14.85% 0.00% ld.lld [kernel.kallsyms] [k] > entry_SYSCALL_64_after_hwframe > 11.52% > entry_SYSCALL_64_after_hwframe > do_syscall_64 > __x64_sys_madvise > do_madvise.part.0 > zap_page_range > unmap_single_vma > unmap_page_range > page_remove_rmap > deferred_split_huge_page > __lock_text_start > native_queued_spin_lock_slowpath > > If THP can't be removed from rmap as whole THP, partial THP will be > removed from rmap by removing sub-pages from rmap. Even the THP > head page is added to deferred queue already, the split_queue_lock > will be acquired and check whether the THP head page is in the queue > already. Thus, the contention of split_queue_lock is raised. > > Before acquire split_queue_lock, check and bail out early if the THP > head page is in the queue already. The checking without holding > split_queue_lock could race with deferred_split_scan, but it doesn't > impact the correctness here. > > Test result of building kernel with ld.lld: > commit 7b5a0b664ebe (parent commit of f35b5d7d676e): > time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all > 6:07.99 real, 26367.77 user, 5063.35 sys > > commit f35b5d7d676e: > time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all > 7:22.15 real, 26235.03 user, 12504.55 sys > > commit f35b5d7d676e with the fixing patch: > time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all > 6:08.49 real, 26520.15 user, 5047.91 sys > > Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Acked-by: David Rientjes <rientjes@google.com>
Yin Fengwei <fengwei.yin@intel.com> writes: > Kernel build regression with LLVM was reported here: > https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/ > with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP > boundaries"). And the commit f35b5d7d676e was reverted. > > It turned out the regression is related with madvise(MADV_DONTNEED) > was used by ld.lld. But with none PMD_SIZE aligned parameter len. > trace-bpfcc captured: > 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4 > 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4 > > If the underneath physical page is THP, the madvise(MADV_DONTNNED) can > trigger split_queue_lock contention raised significantly. perf showed > following data: > 14.85% 0.00% ld.lld [kernel.kallsyms] [k] > entry_SYSCALL_64_after_hwframe > 11.52% > entry_SYSCALL_64_after_hwframe > do_syscall_64 > __x64_sys_madvise > do_madvise.part.0 > zap_page_range > unmap_single_vma > unmap_page_range > page_remove_rmap > deferred_split_huge_page > __lock_text_start > native_queued_spin_lock_slowpath > > If THP can't be removed from rmap as whole THP, partial THP will be > removed from rmap by removing sub-pages from rmap. Even the THP > head page is added to deferred queue already, the split_queue_lock > will be acquired and check whether the THP head page is in the queue > already. Thus, the contention of split_queue_lock is raised. > > Before acquire split_queue_lock, check and bail out early if the THP > head page is in the queue already. The checking without holding > split_queue_lock could race with deferred_split_scan, but it doesn't > impact the correctness here. > > Test result of building kernel with ld.lld: > commit 7b5a0b664ebe (parent commit of f35b5d7d676e): > time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all > 6:07.99 real, 26367.77 user, 5063.35 sys > > commit f35b5d7d676e: > time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all > 7:22.15 real, 26235.03 user, 12504.55 sys > > commit f35b5d7d676e with the fixing patch: > time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all > 6:08.49 real, 26520.15 user, 5047.91 sys > > Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Thank you for fixing. Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > --- > My first thought was to change the per node deferred queue to per cpu. > It's complicated and may be overkill. > > For the race without lock acquired, I didn't see obvious issue here. But I > could miss something here. Let me know if I did. Thanks. > > > mm/huge_memory.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index abe6cfd92ffa..7cde9f702e63 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2837,6 +2837,9 @@ void deferred_split_huge_page(struct page *page) > if (PageSwapCache(page)) > return; > > + if (!list_empty(page_deferred_list(page))) > + return; > + > spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > if (list_empty(page_deferred_list(page))) { > count_vm_event(THP_DEFERRED_SPLIT_PAGE); > > base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be
On Fri, 23 Dec 2022 21:52:07 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote: > Kernel build regression with LLVM was reported here: > https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/ > with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP > boundaries"). And the commit f35b5d7d676e was reverted. > > It turned out the regression is related with madvise(MADV_DONTNEED) > was used by ld.lld. Are we able to identify a Fixes: target for this? Thanks.
On 12/28/2022 4:15 AM, Nathan Chancellor wrote: > On Fri, Dec 23, 2022 at 09:52:07PM +0800, Yin Fengwei wrote: >> Kernel build regression with LLVM was reported here: >> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/ >> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP >> boundaries"). And the commit f35b5d7d676e was reverted. >> >> It turned out the regression is related with madvise(MADV_DONTNEED) >> was used by ld.lld. But with none PMD_SIZE aligned parameter len. >> trace-bpfcc captured: >> 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4 >> 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4 >> >> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can >> trigger split_queue_lock contention raised significantly. perf showed >> following data: >> 14.85% 0.00% ld.lld [kernel.kallsyms] [k] >> entry_SYSCALL_64_after_hwframe >> 11.52% >> entry_SYSCALL_64_after_hwframe >> do_syscall_64 >> __x64_sys_madvise >> do_madvise.part.0 >> zap_page_range >> unmap_single_vma >> unmap_page_range >> page_remove_rmap >> deferred_split_huge_page >> __lock_text_start >> native_queued_spin_lock_slowpath >> >> If THP can't be removed from rmap as whole THP, partial THP will be >> removed from rmap by removing sub-pages from rmap. Even the THP >> head page is added to deferred queue already, the split_queue_lock >> will be acquired and check whether the THP head page is in the queue >> already. Thus, the contention of split_queue_lock is raised. >> >> Before acquire split_queue_lock, check and bail out early if the THP >> head page is in the queue already. The checking without holding >> split_queue_lock could race with deferred_split_scan, but it doesn't >> impact the correctness here. >> >> Test result of building kernel with ld.lld: >> commit 7b5a0b664ebe (parent commit of f35b5d7d676e): >> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all >> 6:07.99 real, 26367.77 user, 5063.35 sys >> >> commit f35b5d7d676e: >> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all >> 7:22.15 real, 26235.03 user, 12504.55 sys >> >> commit f35b5d7d676e with the fixing patch: >> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all >> 6:08.49 real, 26520.15 user, 5047.91 sys >> >> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> > > I cannot say whether or not this is a good idea or not but it does > resolve the regression I reported: > > Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00016-g7b5a0b664ebe > Time (mean ± σ): 383.003 s ± 0.680 s [User: 34737.850 s, System: 7287.079 s] > Range (min … max): 382.218 s … 383.413 s 3 runs > > Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e > Time (mean ± σ): 437.886 s ± 1.030 s [User: 35888.658 s, System: 14048.871 s] > Range (min … max): 436.865 s … 438.924 s 3 runs > > Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e-dirty > Time (mean ± σ): 384.371 s ± 1.004 s [User: 35402.880 s, System: 6401.691 s] > Range (min … max): 383.547 s … 385.489 s 3 runs > > Tested-by: Nathan Chancellor <nathan@kernel.org> Thanks for testing the patch. Regards Yin, Fengwei
On 12/28/2022 9:24 AM, David Rientjes wrote: > On Fri, 23 Dec 2022, Yin Fengwei wrote: > >> Kernel build regression with LLVM was reported here: >> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/ >> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP >> boundaries"). And the commit f35b5d7d676e was reverted. >> >> It turned out the regression is related with madvise(MADV_DONTNEED) >> was used by ld.lld. But with none PMD_SIZE aligned parameter len. >> trace-bpfcc captured: >> 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4 >> 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4 >> >> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can >> trigger split_queue_lock contention raised significantly. perf showed >> following data: >> 14.85% 0.00% ld.lld [kernel.kallsyms] [k] >> entry_SYSCALL_64_after_hwframe >> 11.52% >> entry_SYSCALL_64_after_hwframe >> do_syscall_64 >> __x64_sys_madvise >> do_madvise.part.0 >> zap_page_range >> unmap_single_vma >> unmap_page_range >> page_remove_rmap >> deferred_split_huge_page >> __lock_text_start >> native_queued_spin_lock_slowpath >> >> If THP can't be removed from rmap as whole THP, partial THP will be >> removed from rmap by removing sub-pages from rmap. Even the THP >> head page is added to deferred queue already, the split_queue_lock >> will be acquired and check whether the THP head page is in the queue >> already. Thus, the contention of split_queue_lock is raised. >> >> Before acquire split_queue_lock, check and bail out early if the THP >> head page is in the queue already. The checking without holding >> split_queue_lock could race with deferred_split_scan, but it doesn't >> impact the correctness here. >> >> Test result of building kernel with ld.lld: >> commit 7b5a0b664ebe (parent commit of f35b5d7d676e): >> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all >> 6:07.99 real, 26367.77 user, 5063.35 sys >> >> commit f35b5d7d676e: >> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all >> 7:22.15 real, 26235.03 user, 12504.55 sys >> >> commit f35b5d7d676e with the fixing patch: >> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all >> 6:08.49 real, 26520.15 user, 5047.91 sys >> >> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> > > Acked-by: David Rientjes <rientjes@google.com> Thanks for reviewing the patch. Regards Yin, Fengwei
On 12/28/2022 10:06 AM, Huang, Ying wrote: > Yin Fengwei <fengwei.yin@intel.com> writes: > >> Kernel build regression with LLVM was reported here: >> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/ >> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP >> boundaries"). And the commit f35b5d7d676e was reverted. >> >> It turned out the regression is related with madvise(MADV_DONTNEED) >> was used by ld.lld. But with none PMD_SIZE aligned parameter len. >> trace-bpfcc captured: >> 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4 >> 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4 >> >> If the underneath physical page is THP, the madvise(MADV_DONTNNED) can >> trigger split_queue_lock contention raised significantly. perf showed >> following data: >> 14.85% 0.00% ld.lld [kernel.kallsyms] [k] >> entry_SYSCALL_64_after_hwframe >> 11.52% >> entry_SYSCALL_64_after_hwframe >> do_syscall_64 >> __x64_sys_madvise >> do_madvise.part.0 >> zap_page_range >> unmap_single_vma >> unmap_page_range >> page_remove_rmap >> deferred_split_huge_page >> __lock_text_start >> native_queued_spin_lock_slowpath >> >> If THP can't be removed from rmap as whole THP, partial THP will be >> removed from rmap by removing sub-pages from rmap. Even the THP >> head page is added to deferred queue already, the split_queue_lock >> will be acquired and check whether the THP head page is in the queue >> already. Thus, the contention of split_queue_lock is raised. >> >> Before acquire split_queue_lock, check and bail out early if the THP >> head page is in the queue already. The checking without holding >> split_queue_lock could race with deferred_split_scan, but it doesn't >> impact the correctness here. >> >> Test result of building kernel with ld.lld: >> commit 7b5a0b664ebe (parent commit of f35b5d7d676e): >> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all >> 6:07.99 real, 26367.77 user, 5063.35 sys >> >> commit f35b5d7d676e: >> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all >> 7:22.15 real, 26235.03 user, 12504.55 sys >> >> commit f35b5d7d676e with the fixing patch: >> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all >> 6:08.49 real, 26520.15 user, 5047.91 sys >> >> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> > > Thank you for fixing. > > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Thanks for reviewing the patch. Regards Yin, Fengwei
On 12/29/2022 7:33 AM, Andrew Morton wrote: > On Fri, 23 Dec 2022 21:52:07 +0800 Yin Fengwei <fengwei.yin@intel.com> wrote: > >> Kernel build regression with LLVM was reported here: >> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/ >> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP >> boundaries"). And the commit f35b5d7d676e was reverted. >> >> It turned out the regression is related with madvise(MADV_DONTNEED) >> was used by ld.lld. > > Are we able to identify a Fixes: target for this? Thanks. I am not sure. The commit f35b5d7d676e didn't introduce this issue. It just exposed this issue and it's reverted already. The partial THP behavior was like this from the first day and not a regression. Regards Yin, Fengwei
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index abe6cfd92ffa..7cde9f702e63 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2837,6 +2837,9 @@ void deferred_split_huge_page(struct page *page) if (PageSwapCache(page)) return; + if (!list_empty(page_deferred_list(page))) + return; + spin_lock_irqsave(&ds_queue->split_queue_lock, flags); if (list_empty(page_deferred_list(page))) { count_vm_event(THP_DEFERRED_SPLIT_PAGE);
Kernel build regression with LLVM was reported here: https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/ with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP boundaries"). And the commit f35b5d7d676e was reverted. It turned out the regression is related with madvise(MADV_DONTNEED) was used by ld.lld. But with none PMD_SIZE aligned parameter len. trace-bpfcc captured: 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4 If the underneath physical page is THP, the madvise(MADV_DONTNNED) can trigger split_queue_lock contention raised significantly. perf showed following data: 14.85% 0.00% ld.lld [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe 11.52% entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_madvise do_madvise.part.0 zap_page_range unmap_single_vma unmap_page_range page_remove_rmap deferred_split_huge_page __lock_text_start native_queued_spin_lock_slowpath If THP can't be removed from rmap as whole THP, partial THP will be removed from rmap by removing sub-pages from rmap. Even the THP head page is added to deferred queue already, the split_queue_lock will be acquired and check whether the THP head page is in the queue already. Thus, the contention of split_queue_lock is raised. Before acquire split_queue_lock, check and bail out early if the THP head page is in the queue already. The checking without holding split_queue_lock could race with deferred_split_scan, but it doesn't impact the correctness here. Test result of building kernel with ld.lld: commit 7b5a0b664ebe (parent commit of f35b5d7d676e): time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all 6:07.99 real, 26367.77 user, 5063.35 sys commit f35b5d7d676e: time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all 7:22.15 real, 26235.03 user, 12504.55 sys commit f35b5d7d676e with the fixing patch: time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all 6:08.49 real, 26520.15 user, 5047.91 sys Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> --- My first thought was to change the per node deferred queue to per cpu. It's complicated and may be overkill. For the race without lock acquired, I didn't see obvious issue here. But I could miss something here. Let me know if I did. Thanks. mm/huge_memory.c | 3 +++ 1 file changed, 3 insertions(+) base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be