From patchwork Tue Sep 26 06:09:02 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 13398710
From: Huang Ying <ying.huang@intel.com>
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven,
	Huang Ying, Mel Gorman, Vlastimil Babka, David Hildenbrand,
	Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin,
	Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V2 01/10] mm, pcp: avoid to drain PCP when process exit
Date: Tue, 26 Sep 2023 14:09:02 +0800
Message-Id: <20230926060911.266511-2-ying.huang@intel.com>
In-Reply-To: <20230926060911.266511-1-ying.huang@intel.com>
References: <20230926060911.266511-1-ying.huang@intel.com>
In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages
on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained when the
PCP is mostly used for freeing high-order pages, to improve the reuse of
cache-hot pages between the allocating and freeing CPUs.  But this
draining mechanism may be triggered unexpectedly when a process exits.
With a customized tracepoint, it was found that PCP draining (free_high
== true) was triggered by an order-1 page free with the following call
stack,

 => free_unref_page_commit
 => free_unref_page
 => __mmdrop
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

Checking the source code, this is the page table PGD freeing
(mm_free_pgd()).  It is an order-1 page free if
CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for
security.
Just before that, a page free with the following call stack was found,

 => free_unref_page_commit
 => free_unref_page_list
 => release_pages
 => tlb_batch_pages_flush
 => tlb_finish_mmu
 => exit_mmap
 => __mmput
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

So, when a process exits,

- a large number of user pages of the process are freed without page
  allocation, so it is highly possible that pcp->free_factor becomes
  greater than 0.

- after freeing all user pages, the PGD is freed, which is an order-1
  page free, so the PCP is drained.

All in all, when a process exits, it is highly possible that the PCP
will be drained.  This is an unexpected behavior.

To avoid this, with this patch, PCP draining is only triggered by 2
consecutive high-order page freeing operations.

On a 2-socket Intel server with 224 logical CPUs, we ran 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups.  This
simulates the kbuild servers that are used by the 0-Day kbuild service.
With the patch, the cycles% of the spinlock contention (mostly for zone
lock) decreases from 13.5% to 10.6% (with PCP size == 361).  The number
of PCP drains for high-order page freeing (free_high) decreases by
80.8%.

This helps network workloads too through reduced zone lock contention.
On a 2-socket Intel server with 128 logical CPUs, with the patch, the
network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test
suite with 16 pairs of processes increases by 17.1%.  The cycles% of
the spinlock contention (mostly for zone lock) decreases from 50.0% to
45.8%.  The number of PCP drains for high-order page freeing
(free_high) decreases by 27.4%.  The cache miss rate stays at 0.3%.
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 include/linux/mmzone.h |  5 ++++-
 mm/page_alloc.c        | 11 ++++++++---
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..64d5ed2bb724 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -676,12 +676,15 @@ enum zone_watermarks {
 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
+#define PCPF_PREV_FREE_HIGH_ORDER	0x01
+
 struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
-	short free_factor;	/* batch scaling factor during free */
+	u8 flags;		/* protected by pcp->lock */
+	u8 free_factor;		/* batch scaling factor during free */
 #ifdef CONFIG_NUMA
 	short expire;		/* When 0, remote pagesets are drained */
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95546f376302..295e61f0c49d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2370,7 +2370,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 {
 	int high;
 	int pindex;
-	bool free_high;
+	bool free_high = false;
 
 	__count_vm_events(PGFREE, 1 << order);
 	pindex = order_to_pindex(migratetype, order);
@@ -2383,8 +2383,13 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 * freeing without allocation. The remainder after bulk freeing
 	 * stops will be drained from vmstat refresh context.
 	 */
-	free_high = (pcp->free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER);
-
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+		free_high = (pcp->free_factor &&
+			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER));
+		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
+	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
+		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
+	}
 	high = nr_pcp_high(pcp, zone, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high),
 				   pcp, pindex);