From patchwork Mon Oct 16 05:29:56 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13422490
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven,
	Huang Ying, Mel Gorman, Sudeep Holla, Vlastimil Babka,
	David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko,
	Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages
Date: Mon, 16 Oct 2023 13:29:56 +0800
Message-Id: <20231016053002.756205-4-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20231016053002.756205-1-ying.huang@intel.com>
References: <20231016053002.756205-1-ying.huang@intel.com>
MIME-Version: 1.0
Since commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
pages on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained
when the PCP is mostly used for freeing high-order pages, to improve
the reuse of cache-hot pages between the allocating and the freeing
CPUs.

On a system with a small per-CPU data cache slice, no pages should be
cached in the PCP before draining, to guarantee that the transferred
pages are still cache-hot.  But on a system with a large per-CPU data
cache slice, some pages can be cached before draining, to reduce zone
lock contention.

So, with this patch, instead of draining without any caching,
"pcp->batch" pages are cached in the PCP before draining if the size
of the per-CPU data cache slice is more than "3 * batch".  In theory,
a per-CPU data cache slice larger than "2 * batch" is already enough
to reuse cache-hot pages between CPUs; "3 * batch" is used to leave
room for the other cache users (code, other data accesses, etc.).

Note: "3 * batch" was chosen to make sure the optimization works on
recent x86_64 server CPUs.  If you want to increase it, please check
whether it breaks the optimization.
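The heuristic above can be sketched as a small stand-alone model.  This is
illustrative only: `pcp_free_high_batch()`, `should_free_high()`,
`MODEL_PAGE_SHIFT` and the slice/batch values are hypothetical, mirroring
the patch's logic rather than reproducing kernel API:

```c
#include <stdbool.h>

/* Hypothetical stand-alone model of the patch's heuristic (not kernel code). */
#define MODEL_PAGE_SHIFT 12	/* 4KB pages, as on x86_64 */

#define PCPF_PREV_FREE_HIGH_ORDER	(1u << 0)
#define PCPF_FREE_HIGH_BATCH		(1u << 1)

/*
 * PCPF_FREE_HIGH_BATCH is set when the CPU's data cache slice holds
 * more than "3 * batch" pages, i.e. is large enough to keep one batch
 * of freed pages cache-hot alongside the other cache users.
 */
static bool pcp_free_high_batch(unsigned long slice_size_bytes,
				unsigned long batch)
{
	return (slice_size_bytes >> MODEL_PAGE_SHIFT) > 3 * batch;
}

/*
 * Model of the modified free_high check: drain only when high-order
 * frees are consecutive and, if batch preservation is enabled, the PCP
 * already holds at least one batch of pages.
 */
static bool should_free_high(unsigned int flags, int free_factor,
			     int count, int batch)
{
	return free_factor &&
	       (flags & PCPF_PREV_FREE_HIGH_ORDER) &&
	       (!(flags & PCPF_FREE_HIGH_BATCH) || count >= batch);
}
```

For example, with a 2MB data cache slice and batch = 63, the slice holds
512 pages and 512 > 3 * 63 = 189, so batch preservation would be enabled;
a 512KB slice (128 pages) would not qualify.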
On a 2-socket Intel server with 128 logical CPUs, with this patch, the
network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test
suite with 16-pair processes increases by 70.5%.  The cycles% of
spinlock contention (mostly on the zone lock) decreases from 46.1% to
21.3%.  The number of PCP drainings for high-order page freeing
(free_high) decreases by 89.9%.  The cache miss rate stays at about
0.2%.

Signed-off-by: "Huang, Ying"
Acked-by: Mel Gorman
Cc: Andrew Morton
Cc: Sudeep Holla
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 drivers/base/cacheinfo.c |  2 ++
 include/linux/gfp.h      |  1 +
 include/linux/mmzone.h   |  6 ++++++
 mm/page_alloc.c          | 38 +++++++++++++++++++++++++++++++++++++-
 4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index 585c66fce9d9..f1e79263fe61 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -950,6 +950,7 @@ static int cacheinfo_cpu_online(unsigned int cpu)
 	if (rc)
 		goto err;
 	update_per_cpu_data_slice_size(true, cpu);
+	setup_pcp_cacheinfo();
 	return 0;
 err:
 	free_cache_attributes(cpu);
@@ -963,6 +964,7 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)
 
 	free_cache_attributes(cpu);
 	update_per_cpu_data_slice_size(false, cpu);
+	setup_pcp_cacheinfo();
 	return 0;
 }
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 665f06675c83..665edc11fb9f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -325,6 +325,7 @@ void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
 
 void page_alloc_init_late(void);
+void setup_pcp_cacheinfo(void);
 
 /*
  * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 19c40a6f7e45..cdff247e8c6f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -682,8 +682,14 @@ enum zone_watermarks {
  * PCPF_PREV_FREE_HIGH_ORDER: a high-order page is freed in the
  * previous page freeing.  To avoid to drain PCP for an accident
  * high-order page freeing.
+ *
+ * PCPF_FREE_HIGH_BATCH: preserve "pcp->batch" pages in PCP before
+ * draining PCP for consecutive high-order pages freeing without
+ * allocation if data cache slice of CPU is large enough.  To reduce
+ * zone lock contention and keep cache-hot pages reusing.
  */
 #define	PCPF_PREV_FREE_HIGH_ORDER	BIT(0)
+#define	PCPF_FREE_HIGH_BATCH		BIT(1)
 
 struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 295e61f0c49d..ba2d8f06523e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -52,6 +52,7 @@
 #include
 #include
 #include
+#include <linux/cacheinfo.h>
 #include
 #include "internal.h"
 #include "shuffle.h"
@@ -2385,7 +2386,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 */
 	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
 		free_high = (pcp->free_factor &&
-			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER));
+			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
+			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
+			      pcp->count >= READ_ONCE(pcp->batch)));
 		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
@@ -5418,6 +5421,39 @@ static void zone_pcp_update(struct zone *zone, int cpu_online)
 	mutex_unlock(&pcp_batch_high_lock);
 }
 
+static void zone_pcp_update_cacheinfo(struct zone *zone)
+{
+	int cpu;
+	struct per_cpu_pages *pcp;
+	struct cpu_cacheinfo *cci;
+
+	for_each_online_cpu(cpu) {
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+		cci = get_cpu_cacheinfo(cpu);
+		/*
+		 * If data cache slice of CPU is large enough, "pcp->batch"
+		 * pages can be preserved in PCP before draining PCP for
+		 * consecutive high-order pages freeing without allocation.
+		 * This can reduce zone lock contention without hurting
+		 * cache-hot pages sharing.
+		 */
+		spin_lock(&pcp->lock);
+		if ((cci->per_cpu_data_slice_size >> PAGE_SHIFT) > 3 * pcp->batch)
+			pcp->flags |= PCPF_FREE_HIGH_BATCH;
+		else
+			pcp->flags &= ~PCPF_FREE_HIGH_BATCH;
+		spin_unlock(&pcp->lock);
+	}
+}
+
+void setup_pcp_cacheinfo(void)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone)
+		zone_pcp_update_cacheinfo(zone);
+}
+
 /*
  * Allocate per cpu pagesets and initialize them.
  * Before this call only boot pagesets were available.