From patchwork Fri Dec 23 13:52:07 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yin Fengwei
X-Patchwork-Id: 13080998
From: Yin Fengwei
To: riel@surriel.com, nathan@kernel.org, akpm@linux-foundation.org,
 shy828301@gmail.com, willy@infradead.org, kirill.shutemov@intel.com,
 ying.huang@intel.com, feng.tang@intel.com, zhengjun.xing@linux.intel.com,
 linux-mm@kvack.org
Cc: fengwei.yin@intel.com
Subject: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
Date: Fri, 23 Dec 2022 21:52:07 +0800
Message-Id: <20221223135207.2275317-1-fengwei.yin@intel.com>
X-Mailer: git-send-email 2.30.2

Kernel build regression with LLVM was reported here:
https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
boundaries").
Commit f35b5d7d676e was then reverted.

It turned out that the regression is related to ld.lld calling
madvise(MADV_DONTNEED) with a len parameter that is not PMD_SIZE
aligned. trace-bpfcc captured:

    531607  531732  ld.lld  do_madvise.part.0  start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
    531607  531793  ld.lld  do_madvise.part.0  start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4

If the underlying physical page is a THP, madvise(MADV_DONTNEED) can
significantly increase split_queue_lock contention. perf showed the
following data:

    14.85%  0.00%  ld.lld  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
        11.52%
            entry_SYSCALL_64_after_hwframe
            do_syscall_64
            __x64_sys_madvise
            do_madvise.part.0
            zap_page_range
            unmap_single_vma
            unmap_page_range
            page_remove_rmap
            deferred_split_huge_page
            __lock_text_start
            native_queued_spin_lock_slowpath

If a THP can't be removed from rmap as a whole, the affected part is
removed from rmap one sub-page at a time. Even if the THP head page has
already been added to the deferred queue, split_queue_lock is still
acquired just to check whether the head page is in the queue. This
raises contention on split_queue_lock.

Before acquiring split_queue_lock, check whether the THP head page is
already in the queue and bail out early if it is. The check without
holding split_queue_lock could race with deferred_split_scan, but that
doesn't affect correctness here.
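
To make the alignment issue in the trace concrete, the following sketch
splits a traced madvise() range into whole PMD-sized ranges and a
partial tail. It assumes x86-64 with 4 KiB base pages and a 2 MiB
PMD_SIZE; the SKETCH_* constants and the helper names are illustrative
and are not kernel symbols.

```c
#include <stdint.h>

/* Assumed x86-64 values: 4 KiB base pages, 2 MiB PMD-mapped THPs. */
#define SKETCH_PAGE_SIZE 0x1000ULL
#define SKETCH_PMD_SIZE  0x200000ULL

/* Number of whole, PMD-aligned 2 MiB ranges inside [start, start + len). */
static uint64_t full_pmd_ranges(uint64_t start, uint64_t len)
{
	uint64_t first = (start + SKETCH_PMD_SIZE - 1) & ~(SKETCH_PMD_SIZE - 1);
	uint64_t last = (start + len) & ~(SKETCH_PMD_SIZE - 1);

	return last > first ? (last - first) / SKETCH_PMD_SIZE : 0;
}

/*
 * Number of base pages between the last PMD boundary and the end of the
 * range, i.e. the part that covers a huge page only partially.
 */
static uint64_t tail_pages(uint64_t start, uint64_t len)
{
	return ((start + len) % SKETCH_PMD_SIZE) / SKETCH_PAGE_SIZE;
}
```

For both traced calls (len 0x7fb000 with a 2 MiB aligned start) this
gives 3 whole 2 MiB ranges plus a 507-page tail (0x1fb000 bytes). If
that tail is backed by a THP, up to 507 sub-pages are removed from rmap
individually, and each removal may reach deferred_split_huge_page().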
Test result of building kernel with ld.lld:

commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
        6:07.99 real,   26367.77 user,  5063.35 sys

commit f35b5d7d676e:
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
        7:22.15 real,   26235.03 user,  12504.55 sys

commit f35b5d7d676e with this fix applied:
time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
        6:08.49 real,   26520.15 user,  5047.91 sys

Signed-off-by: Yin Fengwei
Tested-by: Nathan Chancellor
Acked-by: David Rientjes
Reviewed-by: "Huang, Ying"
---
My first thought was to change the per-node deferred queue to a per-CPU
one. That is complicated and may be overkill.

For the race without the lock held, I didn't see an obvious issue, but
I could be missing something. Let me know if I did. Thanks.

 mm/huge_memory.c | 3 +++
 1 file changed, 3 insertions(+)

base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index abe6cfd92ffa..7cde9f702e63 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2837,6 +2837,9 @@ void deferred_split_huge_page(struct page *page)
 	if (PageSwapCache(page))
 		return;
 
+	if (!list_empty(page_deferred_list(page)))
+		return;
+
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (list_empty(page_deferred_list(page))) {
 		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
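
The check-before-lock pattern added by the hunk above can be sketched
in userspace as follows. This is a minimal illustration and not the
kernel code: every name in it (list_node, sketch_queue, sketch_defer,
the atomic_flag spinlock) is hypothetical, and the unlocked read is a
plain load, whereas concurrent C11 code would need an atomic access
there.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Circular doubly linked list node; a node linked to itself is "empty". */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *n) { n->prev = n->next = n; }
static bool list_is_empty(const struct list_node *n) { return n->next == n; }

static void list_add_tail_node(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Hypothetical stand-in for the per-node deferred split queue. */
struct sketch_queue {
	atomic_flag lock;          /* simple spinlock */
	struct list_node head;
	unsigned long len;         /* number of queued items */
	unsigned long lock_taken;  /* how often the lock was acquired */
};

static void sketch_queue_init(struct sketch_queue *q)
{
	atomic_flag_clear(&q->lock);
	list_init(&q->head);
	q->len = 0;
	q->lock_taken = 0;
}

static void sketch_defer(struct sketch_queue *q, struct list_node *item)
{
	/* Unlocked fast path: already queued, so skip the lock entirely. */
	if (!list_is_empty(item))
		return;

	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		; /* spin until the lock is free */
	q->lock_taken++;
	/* Re-check under the lock; the unlocked check may have been stale. */
	if (list_is_empty(item)) {
		list_add_tail_node(item, &q->head);
		q->len++;
	}
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}
```

Calling sketch_defer() repeatedly for the same item takes the lock only
on the first call; later calls return from the unlocked check. The
re-check under the lock is what keeps a stale unlocked read from
queueing the item twice.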