From patchwork Tue Sep 26 06:09:02 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13398710
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V2 01/10] mm, pcp: avoid to drain PCP when process exit
Date: Tue, 26 Sep 2023 14:09:02 +0800
Message-Id: <20230926060911.266511-2-ying.huang@intel.com>
In-Reply-To: <20230926060911.266511-1-ying.huang@intel.com>
References: <20230926060911.266511-1-ying.huang@intel.com>
In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained when it is mostly used for freeing high-order pages, to improve the reuse of cache-hot pages between the page-allocating and page-freeing CPUs.

But this draining mechanism may be triggered unexpectedly when a process exits. With a customized trace point, it was found that PCP draining (free_high == true) was triggered by an order-1 page freeing with the following call stack,

 => free_unref_page_commit
 => free_unref_page
 => __mmdrop
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

Checking the source code, this is the page table PGD freeing (mm_free_pgd()). It is an order-1 page freeing if CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for security.
Just before that, page freeing with the following call stack was found,

 => free_unref_page_commit
 => free_unref_page_list
 => release_pages
 => tlb_batch_pages_flush
 => tlb_finish_mmu
 => exit_mmap
 => __mmput
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

So, when a process exits,

- a large number of the process's user pages are freed without any page allocation; it is highly likely that pcp->free_factor becomes > 0.

- after all user pages are freed, the PGD is freed, which is an order-1 page freeing, so the PCP is drained.

All in all, when a process exits, it is highly likely that the PCP will be drained. This is an unexpected behavior. To avoid this, with this patch, PCP draining is only triggered by two consecutive high-order page freeing operations.

On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild instances in parallel (each with `make -j 28`) in 8 cgroups. This simulates the kbuild servers used by the 0-Day kbuild service. With the patch, the cycles% of the spinlock contention (mostly for zone lock) decreases from 13.5% to 10.6% (with PCP size == 361). The number of PCP drainings for high-order page freeing (free_high) decreases by 80.8%.

This helps network workloads too, via reduced zone lock contention. On a 2-socket Intel server with 128 logical CPUs, with the patch, the network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test suite with 16-pair processes increases by 17.1%. The cycles% of the spinlock contention (mostly for zone lock) decreases from 50.0% to 45.8%. The number of PCP drainings for high-order page freeing (free_high) decreases by 27.4%. The cache miss rate stays at 0.3%.
Signed-off-by: "Huang, Ying"
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 include/linux/mmzone.h |  5 ++++-
 mm/page_alloc.c        | 11 ++++++++---
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..64d5ed2bb724 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -676,12 +676,15 @@ enum zone_watermarks {
 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
+#define PCPF_PREV_FREE_HIGH_ORDER 0x01
+
 struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
-	short free_factor;	/* batch scaling factor during free */
+	u8 flags;		/* protected by pcp->lock */
+	u8 free_factor;		/* batch scaling factor during free */
 #ifdef CONFIG_NUMA
 	short expire;		/* When 0, remote pagesets are drained */
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95546f376302..295e61f0c49d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2370,7 +2370,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 {
 	int high;
 	int pindex;
-	bool free_high;
+	bool free_high = false;
 
 	__count_vm_events(PGFREE, 1 << order);
 	pindex = order_to_pindex(migratetype, order);
@@ -2383,8 +2383,13 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 * freeing without allocation. The remainder after bulk freeing
 	 * stops will be drained from vmstat refresh context.
 	 */
-	free_high = (pcp->free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER);
-
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+		free_high = (pcp->free_factor &&
+			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER));
+		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
+	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
+		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
+	}
 	high = nr_pcp_high(pcp, zone, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high),
 				   pcp, pindex);

From patchwork Tue Sep 26 06:09:03 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13398711
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Sudeep Holla, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V2 02/10] cacheinfo: calculate per-CPU data cache size
Date: Tue, 26 Sep 2023 14:09:03 +0800
Message-Id: <20230926060911.266511-3-ying.huang@intel.com>
In-Reply-To: <20230926060911.266511-1-ying.huang@intel.com>
References: <20230926060911.266511-1-ying.huang@intel.com>
Per-CPU data cache size is useful information. For example, it can be used to estimate each CPU's share of the data cache. So, in this patch, the data cache size for each CPU is calculated as data_cache_size / shared_cpu_weight.
A brute-force algorithm that iterates over all online CPUs is used, to avoid allocating an extra cpumask, especially in the CPU offline callback.

Signed-off-by: "Huang, Ying"
Cc: Sudeep Holla
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 drivers/base/cacheinfo.c  | 42 ++++++++++++++++++++++++++++++++++++++-
 include/linux/cacheinfo.h |  1 +
 2 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index cbae8be1fe52..3e8951a3fbab 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
 	return rc;
 }
 
+static void update_data_cache_size_cpu(unsigned int cpu)
+{
+	struct cpu_cacheinfo *ci;
+	struct cacheinfo *leaf;
+	unsigned int i, nr_shared;
+	unsigned int size_data = 0;
+
+	if (!per_cpu_cacheinfo(cpu))
+		return;
+
+	ci = ci_cacheinfo(cpu);
+	for (i = 0; i < cache_leaves(cpu); i++) {
+		leaf = per_cpu_cacheinfo_idx(cpu, i);
+		if (leaf->type != CACHE_TYPE_DATA &&
+		    leaf->type != CACHE_TYPE_UNIFIED)
+			continue;
+		nr_shared = cpumask_weight(&leaf->shared_cpu_map);
+		if (!nr_shared)
+			continue;
+		size_data += leaf->size / nr_shared;
+	}
+	ci->size_data = size_data;
+}
+
+static void update_data_cache_size(bool cpu_online, unsigned int cpu)
+{
+	unsigned int icpu;
+
+	for_each_online_cpu(icpu) {
+		if (!cpu_online && icpu == cpu)
+			continue;
+		update_data_cache_size_cpu(icpu);
+	}
+}
+
 static int cacheinfo_cpu_online(unsigned int cpu)
 {
 	int rc = detect_cache_attributes(cpu);
@@ -906,7 +941,11 @@ static int cacheinfo_cpu_online(unsigned int cpu)
 		return rc;
 	rc = cache_add_dev(cpu);
 	if (rc)
-		free_cache_attributes(cpu);
+		goto err;
+	update_data_cache_size(true, cpu);
+	return 0;
+err:
+	free_cache_attributes(cpu);
 	return rc;
 }
 
@@ -916,6 +955,7 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)
 	cpu_cache_sysfs_exit(cpu);
 
 	free_cache_attributes(cpu);
+	update_data_cache_size(false, cpu);
 	return 0;
 }
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index a5cfd44fab45..4e7ccfa0c36d 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -73,6 +73,7 @@ struct cacheinfo {
 
 struct cpu_cacheinfo {
 	struct cacheinfo *info_list;
+	unsigned int size_data;
 	unsigned int num_levels;
 	unsigned int num_leaves;
 	bool cpu_map_populated;

From patchwork Tue Sep 26 06:09:04 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13398712
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V2 03/10] mm, pcp: reduce lock contention for draining high-order pages
Date: Tue, 26 Sep 2023 14:09:04 +0800
Message-Id: <20230926060911.266511-4-ying.huang@intel.com>
In-Reply-To: <20230926060911.266511-1-ying.huang@intel.com>
References: <20230926060911.266511-1-ying.huang@intel.com>
In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained when it is mostly used for freeing high-order pages, to improve the reuse of cache-hot pages between the page-allocating and page-freeing CPUs.
On system with small per-CPU data cache, pages shouldn't be cached before draining to guarantee cache-hot. But on a system with large per-CPU data cache, more pages can be cached before draining to reduce zone lock contention. So, in this patch, instead of draining without any caching, "batch" pages will be cached in PCP before draining if the per-CPU data cache size is more than "4 * batch". On a 2-socket Intel server with 128 logical CPU, with the patch, the network bandwidth of the UNIX (AF_UNIX) test case of lmbench test suite with 16-pair processes increase 72.2%. The cycles% of the spinlock contention (mostly for zone lock) decreases from 45.8% to 21.2%. The number of PCP draining for high order pages freeing (free_high) decreases 89.8%. The cache miss rate keeps 0.3%. Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Mel Gorman Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter --- drivers/base/cacheinfo.c | 2 ++ include/linux/gfp.h | 1 + include/linux/mmzone.h | 1 + mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++++++++- 4 files changed, 40 insertions(+), 1 deletion(-) diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c index 3e8951a3fbab..a55b2f83958b 100644 --- a/drivers/base/cacheinfo.c +++ b/drivers/base/cacheinfo.c @@ -943,6 +943,7 @@ static int cacheinfo_cpu_online(unsigned int cpu) if (rc) goto err; update_data_cache_size(true, cpu); + setup_pcp_cacheinfo(); return 0; err: free_cache_attributes(cpu); @@ -956,6 +957,7 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu) free_cache_attributes(cpu); update_data_cache_size(false, cpu); + setup_pcp_cacheinfo(); return 0; } diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 665f06675c83..665edc11fb9f 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -325,6 +325,7 @@ void drain_all_pages(struct zone *zone); void drain_local_pages(struct zone *zone); void 
page_alloc_init_late(void); +void setup_pcp_cacheinfo(void); /* * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 64d5ed2bb724..4132e7490b49 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -677,6 +677,7 @@ enum zone_watermarks { #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost) #define PCPF_PREV_FREE_HIGH_ORDER 0x01 +#define PCPF_FREE_HIGH_BATCH 0x02 struct per_cpu_pages { spinlock_t lock; /* Protects lists field */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 295e61f0c49d..e97814985710 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -52,6 +52,7 @@ #include #include #include +#include #include #include "internal.h" #include "shuffle.h" @@ -2385,7 +2386,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, */ if (order && order <= PAGE_ALLOC_COSTLY_ORDER) { free_high = (pcp->free_factor && - (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER)); + (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) && + (!(pcp->flags & PCPF_FREE_HIGH_BATCH) || + pcp->count >= READ_ONCE(pcp->batch))); pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER; } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) { pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER; @@ -5418,6 +5421,38 @@ static void zone_pcp_update(struct zone *zone, int cpu_online) mutex_unlock(&pcp_batch_high_lock); } +static void zone_pcp_update_cacheinfo(struct zone *zone) +{ + int cpu; + struct per_cpu_pages *pcp; + struct cpu_cacheinfo *cci; + + for_each_online_cpu(cpu) { + pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); + cci = get_cpu_cacheinfo(cpu); + /* + * If per-CPU data cache is large enough, up to + * "batch" high-order pages can be cached in PCP for + * consecutive freeing. This can reduce zone lock + * contention without hurting cache-hot pages sharing. 
+ */ + spin_lock(&pcp->lock); + if ((cci->size_data >> PAGE_SHIFT) > 4 * pcp->batch) + pcp->flags |= PCPF_FREE_HIGH_BATCH; + else + pcp->flags &= ~PCPF_FREE_HIGH_BATCH; + spin_unlock(&pcp->lock); + } +} + +void setup_pcp_cacheinfo(void) +{ + struct zone *zone; + + for_each_populated_zone(zone) + zone_pcp_update_cacheinfo(zone); +} + /* * Allocate per cpu pagesets and initialize them. * Before this call only boot pagesets were available. From patchwork Tue Sep 26 06:09:05 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 13398713 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id DB6FFE7D0C5 for ; Tue, 26 Sep 2023 06:09:46 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 72B5E8D0069; Tue, 26 Sep 2023 02:09:46 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 6DBD18D0005; Tue, 26 Sep 2023 02:09:46 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 57C3E8D0069; Tue, 26 Sep 2023 02:09:46 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 2F01F8D0005 for ; Tue, 26 Sep 2023 02:09:46 -0400 (EDT) Received: from smtpin20.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id D1A5C1A0FBC for ; Tue, 26 Sep 2023 06:09:45 +0000 (UTC) X-FDA: 81277722330.20.50CA0AE Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.100]) by imf19.hostedemail.com (Postfix) with ESMTP id BE0641A0008 for ; Tue, 26 Sep 2023 06:09:43 +0000 (UTC) Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel 
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V2 04/10] mm: restrict the pcp batch scale factor to avoid too long latency
Date: Tue, 26 Sep 2023 14:09:05 +0800
Message-Id: <20230926060911.266511-5-ying.huang@intel.com>
In-Reply-To: <20230926060911.266511-1-ying.huang@intel.com>
References: <20230926060911.266511-1-ying.huang@intel.com>
In page allocator, PCP (Per-CPU Pageset) is refilled and drained in batches to increase page allocation throughput, reduce page allocation/freeing latency per page, and reduce zone lock contention.
But a too-large batch size will cause overly long maximal allocation/freeing latency, which may punish arbitrary users. So the default batch size is chosen carefully (in zone_batchsize(), the value is 63 for zones > 1GB) to avoid that. In commit 3b12e7e97938 ("mm/page_alloc: scale the number of pages that are batch freed"), the batch size is scaled up when a large number of pages are freed, to improve page freeing performance and reduce zone lock contention. A similar optimization can be used when a large number of pages are allocated too. To find a suitable max batch scale factor (that is, max effective batch size), some tests and measurements were done on several machines, as follows.

A set of debug patches was implemented:

- Set PCP high to 2 * batch to reduce the effect of PCP high.
- Disable free batch size scaling to get the raw performance.
- Extract the code that runs with the zone lock held from rmqueue_bulk() and free_pcppages_bulk() into 2 separate functions, to make it easy to measure the function run time with the ftrace function_graph tracer.
- Hard code the batch size to 63 (default), 127, 255, 511, 1023, 2047, 4095.

Then will-it-scale/page_fault1 is used to generate the page allocation/freeing workload. The page allocation/freeing throughput (page/s) is measured via will-it-scale. The page allocation/freeing average latency (alloc/free latency avg, in us) and the allocation/freeing latency at the 99th percentile (alloc/free latency 99%, in us) are measured with the ftrace function_graph tracer.
The test results are as follows,

Sapphire Rapids Server
======================
Batch   throughput  free latency  free latency  alloc latency  alloc latency
            page/s      avg / us      99% / us       avg / us       99% / us
-----   ----------  ------------  ------------  -------------  -------------
   63     513633.4          2.33          3.57           2.67           6.83
  127     517616.7          4.35          6.65           4.22          13.03
  255     520822.8          8.29         13.32           7.52          25.24
  511     524122.0         15.79         23.42          14.02          49.35
 1023     525980.5         30.25         44.19          25.36          94.88
 2047     526793.6         59.39         84.50          45.22         140.81

Ice Lake Server
===============
Batch   throughput  free latency  free latency  alloc latency  alloc latency
            page/s      avg / us      99% / us       avg / us       99% / us
-----   ----------  ------------  ------------  -------------  -------------
   63     620210.3          2.21          3.68           2.02           4.35
  127     627003.0          4.09          6.86           3.51           8.28
  255     630777.5          7.70         13.50           6.17          15.97
  511     633651.5         14.85         22.62          11.66          31.08
 1023     637071.1         28.55         42.02          20.81          54.36
 2047     638089.7         56.54         84.06          39.28          91.68

Cascade Lake Server
===================
Batch   throughput  free latency  free latency  alloc latency  alloc latency
            page/s      avg / us      99% / us       avg / us       99% / us
-----   ----------  ------------  ------------  -------------  -------------
   63     404706.7          3.29          5.03           3.53           4.75
  127     422475.2          6.12          9.09           6.36           8.76
  255     411522.2         11.68         16.97          10.90          16.39
  511     428124.1         22.54         31.28          19.86          32.25
 1023     414718.4         43.39         62.52          40.00          66.33
 2047     429848.7         86.64        120.34          71.14         106.08

Comet Lake Desktop
==================
Batch   throughput  free latency  free latency  alloc latency  alloc latency
            page/s      avg / us      99% / us       avg / us       99% / us
-----   ----------  ------------  ------------  -------------  -------------
   63    795183.13          2.18          3.55           2.03           3.05
  127    803067.85          3.91          6.56           3.85           5.52
  255    812771.10          7.35         10.80           7.14          10.20
  511    817723.48         14.17         27.54          13.43          30.31
 1023    818870.19         27.72         40.10          27.89          46.28

Coffee Lake Desktop
===================
Batch   throughput  free latency  free latency  alloc latency  alloc latency
            page/s      avg / us      99% / us       avg / us       99% / us
-----   ----------  ------------  ------------  -------------  -------------
   63     510542.8          3.13          4.40           2.48           3.43
  127     514288.6          5.97          7.89           4.65           6.04
  255     516889.7         11.86         15.58           8.96          12.55
  511     519802.4         23.10         28.81          16.95          26.19
 1023     520802.7         45.30         52.51          33.19          45.95
 2047     519997.1         90.63        104.00          65.26          81.74

From the above data, to restrict the allocation/freeing latency to less than 100 us most of the time, the max batch scale factor needs to be less than or equal to 5. So, in this patch, the batch scale factor is restricted to be less than or equal to 5. Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Mel Gorman Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter --- mm/page_alloc.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e97814985710..4b601f505401 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -86,6 +86,9 @@ typedef int __bitwise fpi_t; */ #define FPI_TO_TAIL ((__force fpi_t)BIT(1)) +/* Maximum PCP batch scale factor to restrict max allocation/freeing latency */ +#define PCP_BATCH_SCALE_MAX 5 + /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ static DEFINE_MUTEX(pcp_batch_high_lock); #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8) @@ -2340,7 +2343,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high) * freeing of pages without any allocation.
*/ batch <<= pcp->free_factor; - if (batch < max_nr_free) + if (batch < max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX) pcp->free_factor++; batch = clamp(batch, min_nr_free, max_nr_free);

From patchwork Tue Sep 26 06:09:06 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13398714
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V2 05/10] mm, page_alloc: scale the number of pages that are batch allocated
Date: Tue, 26 Sep 2023 14:09:06 +0800
Message-Id: <20230926060911.266511-6-ying.huang@intel.com>
In-Reply-To: <20230926060911.266511-1-ying.huang@intel.com>
References: <20230926060911.266511-1-ying.huang@intel.com>
When a task is allocating a large number of order-0 pages, it may acquire the zone->lock multiple times allocating pages in batches. This may unnecessarily contend on the zone lock when allocating very large number of pages.
This patch adapts the size of the batch based on the recent pattern to scale the batch size for subsequent allocations. On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild instances in parallel (each with `make -j 28`) in 8 cgroups. This simulates the kbuild server that is used by the 0-Day kbuild service. With the patch, the cycles% of the spinlock contention (mostly for zone lock) decreases from 11.7% to 10.0% (with PCP size == 361). Signed-off-by: "Huang, Ying" Suggested-by: Mel Gorman Cc: Andrew Morton Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter --- include/linux/mmzone.h | 3 ++- mm/page_alloc.c | 52 ++++++++++++++++++++++++++++++++++-------- 2 files changed, 44 insertions(+), 11 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4132e7490b49..4f7420e35fbb 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -685,9 +685,10 @@ struct per_cpu_pages { int high; /* high watermark, emptying needed */ int batch; /* chunk size for buddy add/remove */ u8 flags; /* protected by pcp->lock */ + u8 alloc_factor; /* batch scaling factor during allocate */ u8 free_factor; /* batch scaling factor during free */ #ifdef CONFIG_NUMA - short expire; /* When 0, remote pagesets are drained */ + u8 expire; /* When 0, remote pagesets are drained */ #endif /* Lists of pages, one per migrate type stored on the pcp-lists */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4b601f505401..b9226845abf7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2376,6 +2376,12 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, int pindex; bool free_high = false; + /* + * On freeing, reduce the number of pages that are batch allocated. + * See nr_pcp_alloc() where alloc_factor is increased for subsequent + * allocations.
+ */ + pcp->alloc_factor >>= 1; __count_vm_events(PGFREE, 1 << order); pindex = order_to_pindex(migratetype, order); list_add(&page->pcp_list, &pcp->lists[pindex]); @@ -2682,6 +2688,41 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, return page; } +static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order) +{ + int high, batch, max_nr_alloc; + + high = READ_ONCE(pcp->high); + batch = READ_ONCE(pcp->batch); + + /* Check for PCP disabled or boot pageset */ + if (unlikely(high < batch)) + return 1; + + /* + * Double the number of pages allocated each time there is subsequent + * refilling of order-0 pages without drain. + */ + if (!order) { + max_nr_alloc = max(high - pcp->count - batch, batch); + batch <<= pcp->alloc_factor; + if (batch <= max_nr_alloc && pcp->alloc_factor < PCP_BATCH_SCALE_MAX) + pcp->alloc_factor++; + batch = min(batch, max_nr_alloc); + } + + /* + * Scale batch relative to order if batch implies free pages + * can be stored on the PCP. Batch can be 1 for small zones or + * for boot pagesets which should never store free pages as + * the pages may belong to arbitrary zones. + */ + if (batch > 1) + batch = max(batch >> order, 2); + + return batch; +} + /* Remove page from the per-cpu list, caller must protect the list */ static inline struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, @@ -2694,18 +2735,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, do { if (list_empty(list)) { - int batch = READ_ONCE(pcp->batch); + int batch = nr_pcp_alloc(pcp, order); int alloced; - /* - * Scale batch relative to order if batch implies - * free pages can be stored on the PCP. Batch can - * be 1 for small zones or for boot pagesets which - * should never store free pages as the pages may - * belong to arbitrary zones.
- */ - if (batch > 1) - batch = max(batch >> order, 2); alloced = rmqueue_bulk(zone, order, batch, list, migratetype, alloc_flags);

From patchwork Tue Sep 26 06:09:07 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13398715
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V2 06/10] mm: add framework for PCP high auto-tuning
Date: Tue, 26 Sep 2023 14:09:07 +0800
Message-Id: <20230926060911.266511-7-ying.huang@intel.com>
In-Reply-To: <20230926060911.266511-1-ying.huang@intel.com>
References: <20230926060911.266511-1-ying.huang@intel.com>
The page allocation performance requirements of different workloads are usually different. So, we need to tune PCP (per-CPU pageset) high to optimize the workload page allocation performance. Now, we have a system wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high by hand.
But it's hard to find the best value by hand. And one global configuration may not work best for the different workloads that run on the same system. One solution to these issues is to tune PCP high of each CPU automatically. This patch adds the framework for PCP high auto-tuning. With it, pcp->high of each CPU will be changed automatically by a tuning algorithm at runtime. The minimal high (pcp->high_min) is the original PCP high value calculated based on the low watermark pages, while the maximal high (pcp->high_max) is the PCP high value when the percpu_pagelist_high_fraction sysctl knob is set to MIN_PERCPU_PAGELIST_HIGH_FRACTION, that is, the maximal pcp->high that can be set via the sysctl knob by hand. It's possible that PCP high auto-tuning doesn't work well for some workloads. So, when PCP high is tuned by hand via the sysctl knob, the auto-tuning will be disabled. The PCP high set by hand will be used instead. This patch only adds the framework, so pcp->high will always be set to pcp->high_min (the original default). We will add the actual auto-tuning algorithm in the following patches in the series. Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Mel Gorman Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter --- Documentation/admin-guide/sysctl/vm.rst | 12 +++-- include/linux/mmzone.h | 5 +- mm/page_alloc.c | 71 ++++++++++++++++--------- 3 files changed, 58 insertions(+), 30 deletions(-) diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 45ba1f4dc004..7386366fe114 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -843,10 +843,14 @@ each zone between per-cpu lists. The batch value of each per-cpu page list remains the same regardless of the value of the high fraction so allocation latencies are unaffected. -The initial value is zero.
Kernel uses this value to set the high pcp->high -mark based on the low watermark for the zone and the number of local -online CPUs. If the user writes '0' to this sysctl, it will revert to -this default behavior. +The initial value is zero. With this value, kernel will tune pcp->high +automatically according to the requirements of workloads. The lower +limit of tuning is based on the low watermark for the zone and the +number of local online CPUs. The upper limit is the page number when +the sysctl is set to the minimal value (8). If the user writes '0' to +this sysctl, it will revert to this default behavior. In other +words, if the user writes another value, the auto-tuning will be disabled +and the user-specified pcp->high will be used. stat_interval diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4f7420e35fbb..d6cfb5023f3e 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -683,6 +683,8 @@ struct per_cpu_pages { spinlock_t lock; /* Protects lists field */ int count; /* number of pages in the list */ int high; /* high watermark, emptying needed */ + int high_min; /* min high watermark */ + int high_max; /* max high watermark */ int batch; /* chunk size for buddy add/remove */ u8 flags; /* protected by pcp->lock */ u8 alloc_factor; /* batch scaling factor during allocate */ @@ -842,7 +844,8 @@ struct zone { * the high and batch values are copied to individual pagesets for * faster access */ - int pageset_high; + int pageset_high_min; + int pageset_high_max; int pageset_batch; #ifndef CONFIG_SPARSEMEM diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b9226845abf7..df07580dbd53 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2353,7 +2353,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high) static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, bool free_high) { - int high = READ_ONCE(pcp->high); + int high = READ_ONCE(pcp->high_min); if (unlikely(!high || free_high)) return 0;
@@ -2692,7 +2692,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order) { int high, batch, max_nr_alloc; - high = READ_ONCE(pcp->high); + high = READ_ONCE(pcp->high_min); batch = READ_ONCE(pcp->batch); /* Check for PCP disabled or boot pageset */ @@ -5298,14 +5298,15 @@ static int zone_batchsize(struct zone *zone) } static int percpu_pagelist_high_fraction; -static int zone_highsize(struct zone *zone, int batch, int cpu_online) +static int zone_highsize(struct zone *zone, int batch, int cpu_online, + int high_fraction) { #ifdef CONFIG_MMU int high; int nr_split_cpus; unsigned long total_pages; - if (!percpu_pagelist_high_fraction) { + if (!high_fraction) { /* * By default, the high value of the pcp is based on the zone * low watermark so that if they are full then background @@ -5318,15 +5319,15 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online) * value is based on a fraction of the managed pages in the * zone. */ - total_pages = zone_managed_pages(zone) / percpu_pagelist_high_fraction; + total_pages = zone_managed_pages(zone) / high_fraction; } /* * Split the high value across all online CPUs local to the zone. Note * that early in boot that CPUs may not be online yet and that during * CPU hotplug that the cpumask is not yet updated when a CPU is being - * onlined. For memory nodes that have no CPUs, split pcp->high across - * all online CPUs to mitigate the risk that reclaim is triggered + * onlined. For memory nodes that have no CPUs, split the high value + * across all online CPUs to mitigate the risk that reclaim is triggered * prematurely due to pages stored on pcp lists. */ nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online; @@ -5354,19 +5355,21 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online) * However, guaranteeing these relations at all times would require e.g. 
write * barriers here but also careful usage of read barriers at the read side, and * thus be prone to error and bad for performance. Thus the update only prevents - * store tearing. Any new users of pcp->batch and pcp->high should ensure they - * can cope with those fields changing asynchronously, and fully trust only the - * pcp->count field on the local CPU with interrupts disabled. + * store tearing. Any new users of pcp->batch, pcp->high_min and pcp->high_max + * should ensure they can cope with those fields changing asynchronously, and + * fully trust only the pcp->count field on the local CPU with interrupts + * disabled. * * mutex_is_locked(&pcp_batch_high_lock) required when calling this function * outside of boot time (or some other assurance that no concurrent updaters * exist). */ -static void pageset_update(struct per_cpu_pages *pcp, unsigned long high, - unsigned long batch) +static void pageset_update(struct per_cpu_pages *pcp, unsigned long high_min, + unsigned long high_max, unsigned long batch) { WRITE_ONCE(pcp->batch, batch); - WRITE_ONCE(pcp->high, high); + WRITE_ONCE(pcp->high_min, high_min); + WRITE_ONCE(pcp->high_max, high_max); } static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats) @@ -5386,20 +5389,21 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta * need to be as careful as pageset_update() as nobody can access the * pageset yet. 
*/ - pcp->high = BOOT_PAGESET_HIGH; + pcp->high_min = BOOT_PAGESET_HIGH; + pcp->high_max = BOOT_PAGESET_HIGH; pcp->batch = BOOT_PAGESET_BATCH; pcp->free_factor = 0; } -static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high, - unsigned long batch) +static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min, + unsigned long high_max, unsigned long batch) { struct per_cpu_pages *pcp; int cpu; for_each_possible_cpu(cpu) { pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); - pageset_update(pcp, high, batch); + pageset_update(pcp, high_min, high_max, batch); } } @@ -5409,19 +5413,34 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h */ static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online) { - int new_high, new_batch; + int new_high_min, new_high_max, new_batch; new_batch = max(1, zone_batchsize(zone)); - new_high = zone_highsize(zone, new_batch, cpu_online); + if (percpu_pagelist_high_fraction) { + new_high_min = zone_highsize(zone, new_batch, cpu_online, + percpu_pagelist_high_fraction); + /* + * PCP high is tuned manually, disable auto-tuning via + * setting high_min and high_max to the manual value. 
+ */ + new_high_max = new_high_min; + } else { + new_high_min = zone_highsize(zone, new_batch, cpu_online, 0); + new_high_max = zone_highsize(zone, new_batch, cpu_online, + MIN_PERCPU_PAGELIST_HIGH_FRACTION); + } - if (zone->pageset_high == new_high && + if (zone->pageset_high_min == new_high_min && + zone->pageset_high_max == new_high_max && zone->pageset_batch == new_batch) return; - zone->pageset_high = new_high; + zone->pageset_high_min = new_high_min; + zone->pageset_high_max = new_high_max; zone->pageset_batch = new_batch; - __zone_set_pageset_high_and_batch(zone, new_high, new_batch); + __zone_set_pageset_high_and_batch(zone, new_high_min, new_high_max, + new_batch); } void __meminit setup_zone_pageset(struct zone *zone) @@ -5529,7 +5548,8 @@ __meminit void zone_pcp_init(struct zone *zone) */ zone->per_cpu_pageset = &boot_pageset; zone->per_cpu_zonestats = &boot_zonestats; - zone->pageset_high = BOOT_PAGESET_HIGH; + zone->pageset_high_min = BOOT_PAGESET_HIGH; + zone->pageset_high_max = BOOT_PAGESET_HIGH; zone->pageset_batch = BOOT_PAGESET_BATCH; if (populated_zone(zone)) @@ -6431,13 +6451,14 @@ EXPORT_SYMBOL(free_contig_range); void zone_pcp_disable(struct zone *zone) { mutex_lock(&pcp_batch_high_lock); - __zone_set_pageset_high_and_batch(zone, 0, 1); + __zone_set_pageset_high_and_batch(zone, 0, 0, 1); __drain_all_pages(zone, true); } void zone_pcp_enable(struct zone *zone) { - __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch); + __zone_set_pageset_high_and_batch(zone, zone->pageset_high_min, + zone->pageset_high_max, zone->pageset_batch); mutex_unlock(&pcp_batch_high_lock); } From patchwork Tue Sep 26 06:09:08 2023 X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 13398716
From: Huang Ying To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven , Huang Ying , Mel Gorman , Michal Hocko ,
Vlastimil Babka , David Hildenbrand , Johannes Weiner , Dave Hansen , Pavel Tatashin , Matthew Wilcox , Christoph Lameter Subject: [PATCH -V2 07/10] mm: tune PCP high automatically Date: Tue, 26 Sep 2023 14:09:08 +0800 Message-Id: <20230926060911.266511-8-ying.huang@intel.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: <20230926060911.266511-1-ying.huang@intel.com> References: <20230926060911.266511-1-ying.huang@intel.com> MIME-Version: 1.0
The targets of tuning PCP high automatically are as follows:

- Minimize allocation/freeing from/to the shared zone
- Minimize idle pages in the PCP
- Minimize pages in the PCP if the number of free pages in the system is too low

To reach these targets, the following tuning algorithm is designed:

- When we refill the PCP via allocating from the zone, increase PCP high, because with a larger PCP we could avoid allocating from the zone.
- In the periodic vmstat updating kworker (via refresh_cpu_vm_stats()), decrease PCP high to try to free possible idle PCP pages.
- When page reclaiming is active for the zone, stop increasing PCP high in the allocating path, and decrease PCP high and free some pages in the freeing path.

In this way, the PCP high can eventually be tuned to the page allocating/freeing depth of the workloads. One issue of the algorithm is that, if the number of pages allocated is much larger than the number of pages freed on a CPU, the PCP high may reach the maximal value even if the allocating/freeing depth is small. But this isn't a severe issue, because there are no idle pages in this case. One alternative choice is to increase PCP high when we drain the PCP via trying to free pages to the zone, but not to increase PCP high during PCP refilling. This can avoid the issue above.
But if the number of pages allocated is much smaller than the number of pages freed on a CPU, there will be many idle pages in the PCP and it may be hard to free these idle pages. On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild instances in parallel (each with `make -j 28`) in 8 cgroups. This simulates the kbuild servers used by the 0-Day kbuild service. With the patch, the build time decreases by 3.6%. The cycles% of the spinlock contention (mostly for zone lock) decreases from 10.0% to 0.7% (with PCP size == 361). The number of PCP drainings for high-order page freeing (free_high) decreases by 63.4%. The number of pages allocated from the zone (instead of from the PCP) decreases by 80.4%. Signed-off-by: "Huang, Ying" Suggested-by: Mel Gorman Suggested-by: Michal Hocko Cc: Andrew Morton Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter --- include/linux/gfp.h | 1 + mm/page_alloc.c | 118 ++++++++++++++++++++++++++++++++++---------- mm/vmstat.c | 8 +-- 3 files changed, 98 insertions(+), 29 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 665edc11fb9f..5b917e5b9350 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -320,6 +320,7 @@ extern void page_frag_free(void *addr); #define free_page(addr) free_pages((addr), 0) void page_alloc_init_cpuhp(void); +int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp); void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp); void drain_all_pages(struct zone *zone); void drain_local_pages(struct zone *zone); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index df07580dbd53..0d482a55235b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2160,6 +2160,40 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, return i; } +/* + * Called from the vmstat counter updater to decay the PCP high. + * Return whether there is additional work to do.
+ */ +int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp) +{ + int high_min, to_drain, batch; + int todo = 0; + + high_min = READ_ONCE(pcp->high_min); + batch = READ_ONCE(pcp->batch); + /* + * Decrease pcp->high periodically to try to free possible + * idle PCP pages. Avoid freeing too many pages at once to + * control latency. + */ + if (pcp->high > high_min) { + pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX), + pcp->high * 4 / 5, high_min); + if (pcp->high > high_min) + todo++; + } + + to_drain = pcp->count - pcp->high; + if (to_drain > 0) { + spin_lock(&pcp->lock); + free_pcppages_bulk(zone, to_drain, pcp, 0); + spin_unlock(&pcp->lock); + todo++; + } + + return todo; +} + #ifdef CONFIG_NUMA /* * Called from the vmstat counter updater to drain pagesets of this @@ -2321,14 +2355,13 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn, return true; } -static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high) +static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free_high) { int min_nr_free, max_nr_free; - int batch = READ_ONCE(pcp->batch); - /* Free everything if batch freeing high-order pages. */ + /* Free as much as possible if batch freeing high-order pages. */ if (unlikely(free_high)) - return pcp->count; + return min(pcp->count, batch << PCP_BATCH_SCALE_MAX); /* Check for PCP disabled or boot pageset */ if (unlikely(high < batch)) @@ -2343,7 +2376,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high) * freeing of pages without any allocation.
*/ batch <<= pcp->free_factor; - if (batch < max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX) + if (batch <= max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX) pcp->free_factor++; batch = clamp(batch, min_nr_free, max_nr_free); @@ -2351,28 +2384,47 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high) } static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, - bool free_high) + int batch, bool free_high) { - int high = READ_ONCE(pcp->high_min); + int high, high_min, high_max; - if (unlikely(!high || free_high)) + high_min = READ_ONCE(pcp->high_min); + high_max = READ_ONCE(pcp->high_max); + high = pcp->high = clamp(pcp->high, high_min, high_max); + + if (unlikely(!high)) return 0; - if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) - return high; + if (unlikely(free_high)) { + pcp->high = max(high - (batch << PCP_BATCH_SCALE_MAX), high_min); + return 0; + } /* * If reclaim is active, limit the number of pages that can be * stored on pcp lists */ - return min(READ_ONCE(pcp->batch) << 2, high); + if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) { + pcp->high = max(high - (batch << pcp->free_factor), high_min); + return min(batch << 2, pcp->high); + } + + if (pcp->count >= high && high_min != high_max) { + int need_high = (batch << pcp->free_factor) + batch; + + /* pcp->high should be large enough to hold batch freed pages */ + if (pcp->high < need_high) + pcp->high = clamp(need_high, high_min, high_max); + } + + return high; } static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, struct page *page, int migratetype, unsigned int order) { - int high; + int high, batch; int pindex; bool free_high = false; @@ -2387,6 +2439,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, list_add(&page->pcp_list, &pcp->lists[pindex]); pcp->count += 1 << order; + batch = READ_ONCE(pcp->batch); /* * As high-order pages other than THP's stored on PCP can contribute * to 
fragmentation, limit the number stored when PCP is heavily @@ -2397,14 +2450,15 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, free_high = (pcp->free_factor && (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) && (!(pcp->flags & PCPF_FREE_HIGH_BATCH) || - pcp->count >= READ_ONCE(pcp->batch))); + pcp->count >= batch)); pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER; } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) { pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER; } - high = nr_pcp_high(pcp, zone, free_high); + high = nr_pcp_high(pcp, zone, batch, free_high); if (pcp->count >= high) { - free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex); + free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high), + pcp, pindex); } } @@ -2688,24 +2742,38 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, return page; } -static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order) +static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order) { - int high, batch, max_nr_alloc; + int high, base_batch, batch, max_nr_alloc; + int high_max, high_min; - high = READ_ONCE(pcp->high_min); - batch = READ_ONCE(pcp->batch); + base_batch = READ_ONCE(pcp->batch); + high_min = READ_ONCE(pcp->high_min); + high_max = READ_ONCE(pcp->high_max); + high = pcp->high = clamp(pcp->high, high_min, high_max); /* Check for PCP disabled or boot pageset */ - if (unlikely(high < batch)) + if (unlikely(high < base_batch)) return 1; + if (order) + batch = base_batch; + else + batch = (base_batch << pcp->alloc_factor); + /* - * Double the number of pages allocated each time there is subsequent - * refiling of order-0 pages without drain. + * If we had larger pcp->high, we could avoid to allocate from + * zone.
*/ + if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) + high = pcp->high = min(high + batch, high_max); + if (!order) { - max_nr_alloc = max(high - pcp->count - batch, batch); - batch <<= pcp->alloc_factor; + max_nr_alloc = max(high - pcp->count - base_batch, base_batch); + /* + * Double the number of pages allocated each time there is + * subsequent refilling of order-0 pages without drain. + */ if (batch <= max_nr_alloc && pcp->alloc_factor < PCP_BATCH_SCALE_MAX) pcp->alloc_factor++; batch = min(batch, max_nr_alloc); @@ -2735,7 +2803,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, do { if (list_empty(list)) { - int batch = nr_pcp_alloc(pcp, order); + int batch = nr_pcp_alloc(pcp, zone, order); int alloced; alloced = rmqueue_bulk(zone, order, diff --git a/mm/vmstat.c b/mm/vmstat.c index 00e81e99c6ee..2f716ad14168 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -814,9 +814,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets) for_each_populated_zone(zone) { struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats; -#ifdef CONFIG_NUMA struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset; -#endif for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) { int v; @@ -832,10 +830,12 @@ static int refresh_cpu_vm_stats(bool do_pagesets) #endif } } -#ifdef CONFIG_NUMA if (do_pagesets) { cond_resched(); + + changes += decay_pcp_high(zone, this_cpu_ptr(pcp)); +#ifdef CONFIG_NUMA /* * Deal with draining the remote pageset of this * processor @@ -862,8 +862,8 @@ static int refresh_cpu_vm_stats(bool do_pagesets) drain_zone_pages(zone, this_cpu_ptr(pcp)); changes++; } - } #endif + } } for_each_online_pgdat(pgdat) { From patchwork Tue Sep 26 06:09:09 2023 X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 13398717
From: Huang Ying To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven , Huang
Ying , Mel Gorman , Vlastimil Babka , David Hildenbrand , Johannes Weiner , Dave Hansen , Michal Hocko , Pavel Tatashin , Matthew Wilcox , Christoph Lameter Subject: [PATCH -V2 08/10] mm, pcp: decrease PCP high if free pages < high watermark Date: Tue, 26 Sep 2023 14:09:09 +0800 Message-Id: <20230926060911.266511-9-ying.huang@intel.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: <20230926060911.266511-1-ying.huang@intel.com> References: <20230926060911.266511-1-ying.huang@intel.com> MIME-Version: 1.0
One target of PCP tuning is to minimize the number of pages in the PCP if the number of free pages in the system is too low. To reach that target, when page reclaiming is active for the zone (ZONE_RECLAIM_ACTIVE), we stop increasing PCP high in the allocating path, and decrease PCP high and free some pages in the freeing path. But this may be too late, because the background page reclaiming may already introduce latency for some workloads. So, in this patch, during page allocation we detect whether the number of free pages of the zone is below the high watermark. If so, we stop increasing PCP high in the allocating path, and decrease PCP high and free some pages in the freeing path. With this, we can reduce the possibility of premature background page reclaiming caused by an overly large PCP. The high watermark check is done in the allocating path to keep the overhead out of the hotter freeing path.
Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Mel Gorman Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter --- include/linux/mmzone.h | 1 + mm/page_alloc.c | 22 ++++++++++++++++++++-- 2 files changed, 21 insertions(+), 2 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index d6cfb5023f3e..8a19e2af89df 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1006,6 +1006,7 @@ enum zone_flags { * Cleared when kswapd is woken. */ ZONE_RECLAIM_ACTIVE, /* kswapd may be scanning the zone. */ + ZONE_BELOW_HIGH, /* zone is below high watermark. */ }; static inline unsigned long zone_managed_pages(struct zone *zone) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0d482a55235b..08b74c65b88a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2409,7 +2409,13 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, return min(batch << 2, pcp->high); } - if (pcp->count >= high && high_min != high_max) { + if (high_min == high_max) + return high; + + if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) { + pcp->high = max(high - (batch << pcp->free_factor), high_min); + high = max(pcp->count, high_min); + } else if (pcp->count >= high) { int need_high = (batch << pcp->free_factor) + batch; /* pcp->high should be large enough to hold batch freed pages */ @@ -2459,6 +2465,10 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, if (pcp->count >= high) { free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high), pcp, pindex); + if (test_bit(ZONE_BELOW_HIGH, &zone->flags) && + zone_watermark_ok(zone, 0, high_wmark_pages(zone), + ZONE_MOVABLE, 0)) + clear_bit(ZONE_BELOW_HIGH, &zone->flags); } } @@ -2765,7 +2775,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order) * If we had larger pcp->high, we could avoid to allocate from * zone. 
 	 */
-	if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
+	if (high_min != high_max && !test_bit(ZONE_BELOW_HIGH, &zone->flags))
 		high = pcp->high = min(high + batch, high_max);
 
 	if (!order) {
@@ -3226,6 +3236,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			}
 		}
 
+		mark = high_wmark_pages(zone);
+		if (zone_watermark_fast(zone, order, mark,
+					ac->highest_zoneidx, alloc_flags,
+					gfp_mask))
+			goto try_this_zone;
+		else if (!test_bit(ZONE_BELOW_HIGH, &zone->flags))
+			set_bit(ZONE_BELOW_HIGH, &zone->flags);
+
 		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
 		if (!zone_watermark_fast(zone, order, mark,
 					 ac->highest_zoneidx, alloc_flags,

From patchwork Tue Sep 26 06:09:10 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13398718
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven,
	Huang Ying, Mel Gorman, Vlastimil Babka, David Hildenbrand,
	Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin,
	Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V2 09/10] mm, pcp: avoid to reduce PCP high unnecessarily
Date: Tue, 26 Sep 2023 14:09:10 +0800
Message-Id: <20230926060911.266511-10-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230926060911.266511-1-ying.huang@intel.com>
References: <20230926060911.266511-1-ying.huang@intel.com>
MIME-Version: 1.0
In the PCP high auto-tuning algorithm, to minimize idle pages in PCP,
the periodic vmstat updating kworker (via refresh_cpu_vm_stats())
decreases PCP high to try to free possible idle PCP pages.
One issue is that even if the page allocating/freeing depth is larger
than the maximal PCP high, we may reduce PCP high unnecessarily.

To avoid the above issue, in this patch, we track the minimal PCP page
count.  And, the periodic PCP high decrement will be no more than the
recent minimal PCP page count.  So, only detected idle pages will be
freed.

On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups.  This
simulates the kbuild server that is used by the 0-Day kbuild service.
With the patch, the number of pages allocated from the zone (instead of
from the PCP) decreases by 21.4%.

Signed-off-by: "Huang, Ying"
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 15 ++++++++++-----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8a19e2af89df..35b78c7522a7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -682,6 +682,7 @@ enum zone_watermarks {
 struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
+	int count_min;		/* minimal number of pages in the list recently */
 	int high;		/* high watermark, emptying needed */
 	int high_min;		/* min high watermark */
 	int high_max;		/* max high watermark */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 08b74c65b88a..d7b602822ab3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2166,19 +2166,20 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  */
 int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 {
-	int high_min, to_drain, batch;
+	int high_min, decrease, to_drain, batch;
 	int todo = 0;
 
 	high_min = READ_ONCE(pcp->high_min);
 	batch = READ_ONCE(pcp->batch);
 	/*
-	 * Decrease pcp->high periodically to try to free possible
-	 * idle PCP pages. And, avoid to free too many pages to
-	 * control latency.
+	 * Decrease pcp->high periodically to free idle PCP pages counted
+	 * via pcp->count_min. And, avoid to free too many pages to
+	 * control latency. This caps pcp->high decrement too.
 	 */
 	if (pcp->high > high_min) {
+		decrease = min(pcp->count_min, pcp->high / 5);
 		pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX),
-				 pcp->high * 4 / 5, high_min);
+				 pcp->high - decrease, high_min);
 		if (pcp->high > high_min)
 			todo++;
 	}
@@ -2191,6 +2192,8 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 		todo++;
 	}
 
+	pcp->count_min = pcp->count;
+
 	return todo;
 }
 
@@ -2828,6 +2831,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 		page = list_first_entry(list, struct page, pcp_list);
 		list_del(&page->pcp_list);
 		pcp->count -= 1 << order;
+		if (pcp->count < pcp->count_min)
+			pcp->count_min = pcp->count;
 	} while (check_new_pages(page, order));
 
 	return page;

From patchwork Tue Sep 26 06:09:11 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13398719
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven,
	Huang Ying, Mel Gorman, Vlastimil Babka, David Hildenbrand,
	Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin,
	Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V2 10/10] mm, pcp: reduce detecting time of consecutive
 high order page freeing
Date: Tue, 26 Sep 2023 14:09:11 +0800
Message-Id: <20230926060911.266511-11-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230926060911.266511-1-ying.huang@intel.com>
References: <20230926060911.266511-1-ying.huang@intel.com>
MIME-Version: 1.0
In the current PCP auto-tuning design, if the number of pages allocated
is much larger than the number of pages freed on a CPU, PCP high may
reach the maximal value even if the allocating/freeing depth is small,
for example, in the sender of network workloads.
If a CPU is used as a sender originally, and then as a receiver after
context switching, we need to fill the whole PCP up to the maximal high
before triggering PCP draining for consecutive high-order freeing.
This will hurt the performance of some network workloads.

To solve the issue, in this patch, we track the consecutive page
freeing with a counter instead of relying on PCP draining.  So, we can
detect consecutive page freeing much earlier.

On a 2-socket Intel server with 128 logical CPUs, we tested the
SCTP_STREAM_MANY test case of the netperf test suite with 64 pairs of
processes.  With the patch, the network bandwidth improves by 3.1%.
This restores the performance drop caused by PCP auto-tuning.

Signed-off-by: "Huang, Ying"
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 include/linux/mmzone.h |  2 +-
 mm/page_alloc.c        | 23 +++++++++++------------
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 35b78c7522a7..44f6dc3cdeeb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -689,10 +689,10 @@ struct per_cpu_pages {
 	int batch;		/* chunk size for buddy add/remove */
 	u8 flags;		/* protected by pcp->lock */
 	u8 alloc_factor;	/* batch scaling factor during allocate */
-	u8 free_factor;		/* batch scaling factor during free */
 #ifdef CONFIG_NUMA
 	u8 expire;		/* When 0, remote pagesets are drained */
 #endif
+	short free_count;	/* consecutive free count */
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d7b602822ab3..206ab768ec23 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2375,13 +2375,10 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free
 	max_nr_free = high - batch;
 
 	/*
-	 * Double the number of pages freed each time there is subsequent
-	 * freeing of pages without any allocation.
+	 * Increase the batch number to the number of the consecutive
+	 * freed pages to reduce zone lock contention.
 	 */
-	batch <<= pcp->free_factor;
-	if (batch <= max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
-		pcp->free_factor++;
-	batch = clamp(batch, min_nr_free, max_nr_free);
+	batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);
 
 	return batch;
 }
@@ -2408,7 +2405,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 	 * stored on pcp lists
 	 */
 	if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) {
-		pcp->high = max(high - (batch << pcp->free_factor), high_min);
+		pcp->high = max(high - pcp->free_count, high_min);
 		return min(batch << 2, pcp->high);
 	}
 
@@ -2416,10 +2413,10 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 		return high;
 
 	if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
-		pcp->high = max(high - (batch << pcp->free_factor), high_min);
+		pcp->high = max(high - pcp->free_count, high_min);
 		high = max(pcp->count, high_min);
 	} else if (pcp->count >= high) {
-		int need_high = (batch << pcp->free_factor) + batch;
+		int need_high = pcp->free_count + batch;
 
 		/* pcp->high should be large enough to hold batch freed pages */
 		if (pcp->high < need_high)
@@ -2456,7 +2453,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 * stops will be drained from vmstat refresh context.
 	 */
 	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
-		free_high = (pcp->free_factor &&
+		free_high = (pcp->free_count >= batch &&
 			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
 			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
 			      pcp->count >= READ_ONCE(batch)));
@@ -2464,6 +2461,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
+	if (pcp->free_count < (batch << PCP_BATCH_SCALE_MAX))
+		pcp->free_count += (1 << order);
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2861,7 +2860,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	 * See nr_pcp_free() where free_factor is increased for subsequent
 	 * frees.
 	 */
-	pcp->free_factor >>= 1;
+	pcp->free_count >>= 1;
 	list = &pcp->lists[order_to_pindex(migratetype, order)];
 	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
 	pcp_spin_unlock(pcp);
@@ -5483,7 +5482,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 	pcp->high_min = BOOT_PAGESET_HIGH;
 	pcp->high_max = BOOT_PAGESET_HIGH;
 	pcp->batch = BOOT_PAGESET_BATCH;
-	pcp->free_factor = 0;
+	pcp->free_count = 0;
 }
 
 static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min,