From patchwork Tue Sep 26 06:09:01 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 13398709
From: Huang Ying <ying.huang@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Arjan Van De Ven <arjan@linux.intel.com>,
	Huang Ying <ying.huang@intel.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Vlastimil Babka <vbabka@suse.cz>,
	David Hildenbrand <david@redhat.com>,
	Johannes Weiner <jweiner@redhat.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Michal Hocko <mhocko@suse.com>,
	Pavel Tatashin <pasha.tatashin@soleen.com>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Lameter <cl@linux.com>
Subject: [PATCH -V2 00/10] mm: PCP high auto-tuning
Date: Tue, 26 Sep 2023 14:09:01 +0800
Message-Id: <20230926060911.266511-1-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

Different workloads often have different page allocation performance
requirements.  So, we need to tune the PCP (Per-CPU Pageset) high on
each CPU automatically to optimize page allocation performance.

The patches in this series are as follows:

 1 mm, pcp: avoid to drain PCP when process exit
 2 cacheinfo: calculate per-CPU data cache size
 3 mm, pcp: reduce lock contention for draining high-order pages
 4 mm: restrict the pcp batch scale factor to avoid too long latency
 5 mm, page_alloc: scale the number of pages that are batch allocated
 6 mm: add framework for PCP high auto-tuning
 7 mm: tune PCP high automatically
 8 mm, pcp: decrease PCP high if free pages < high watermark
 9 mm, pcp: avoid to reduce PCP high unnecessarily
10 mm, pcp: reduce detecting time of consecutive high order page freeing

Patches 1/2/3 optimize PCP draining for consecutive high-order page
freeing.

Patches 4/5 optimize batch freeing and allocating.

Patches 6/7/8/9 implement and optimize a PCP high auto-tuning method.

Patch 10 optimizes PCP draining for consecutive high-order page
freeing based on PCP high auto-tuning.

The test results for the patches with performance impact are as follows:

kbuild
======

On a 2-socket Intel server with 224 logical CPUs, we ran 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups.  This
simulates the kbuild server that is used by the 0-Day kbuild service.

         build time  lock contend%  free_high  alloc_zone
         ----------  -------------  ---------  ----------
base          100.0           13.5      100.0       100.0
patch1         99.2           10.6       19.2        95.6
patch3         99.2           11.7        7.1        95.6
patch5         98.4           10.0        8.2        97.1
patch7         94.9            0.7        3.0        19.0
patch9         94.9            0.6        2.7        15.0
patch10        94.9            0.9        8.8        18.6

The PCP draining optimization (patches 1/3) and the PCP batch
allocation optimization (patch 5) reduce zone lock contention a
little.  The PCP high auto-tuning (patches 7/9/10) reduces build time
visibly.  The tuning target, the number of pages allocated from the
zone, drops greatly.  So, the zone lock contention cycles% drops
greatly too.

With the PCP tuning patches (patches 7/9/10), the average used memory
during the test increases by up to 21.0% because more pages are cached
in PCP.  But at the end of the test, the amount of used memory
decreases to the same level as that of the base.  That is, the pages
cached in PCP are released to the zone once they are no longer used
actively.

netperf SCTP_STREAM_MANY
========================

On a 2-socket Intel server with 128 logical CPUs, we ran the
SCTP_STREAM_MANY test case of the netperf test suite with 64 pairs of
processes.

         score  lock contend%  free_high  alloc_zone  cache miss rate%
         -----  -------------  ---------  ----------  ----------------
base     100.0            2.0      100.0       100.0               1.3
patch1    99.7            2.0       99.7        99.7               1.3
patch3   105.5            1.2       13.2       105.4               1.2
patch5   106.9            1.2       13.4       106.9               1.3
patch7   103.5            1.8        6.8        90.8               7.6
patch9   103.7            1.8        6.6        89.8               7.7
patch10  106.9            1.2       13.5       106.9               1.2

The PCP draining optimization (patches 1+3) improves performance.  The
PCP high auto-tuning (patches 7/9) reduces performance a little
because PCP draining sometimes cannot be triggered in time.  So, the
cache miss rate% increases.  The further PCP draining optimization
(patch 10) based on PCP tuning restores the performance.

lmbench3 UNIX (AF_UNIX)
=======================

On a 2-socket Intel server with 128 logical CPUs, we ran the UNIX
(AF_UNIX socket) test case of the lmbench3 test suite with 16 pairs of
processes.

         score  lock contend%  free_high  alloc_zone  cache miss rate%
         -----  -------------  ---------  ----------  ----------------
base     100.0           50.0      100.0       100.0               0.3
patch1   117.1           45.8       72.6       108.9               0.2
patch3   201.6           21.2        7.4       111.5               0.2
patch5   201.9           20.9        7.5       112.7               0.3
patch7   194.2           19.3        7.3       111.5               2.9
patch9   193.1           19.2        7.2       110.4               2.9
patch10  196.8           21.0        7.4       111.2               2.1

The PCP draining optimization (patches 1/3) improves performance
considerably.  The PCP tuning (patches 7/9) reduces performance a
little because PCP draining sometimes cannot be triggered in time.
The further PCP draining optimization (patch 10) based on PCP tuning
partly restores the performance.

The patchset adds several fields to struct per_cpu_pages.  The struct
layout before and after the patchset is as follows:

base
====

struct per_cpu_pages {
	spinlock_t                 lock;                 /*     0     4 */
	int                        count;                /*     4     4 */
	int                        high;                 /*     8     4 */
	int                        batch;                /*    12     4 */
	short int                  free_factor;          /*    16     2 */
	short int                  expire;               /*    18     2 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head           lists[13];            /*    24   208 */

	/* size: 256, cachelines: 4, members: 7 */
	/* sum members: 228, holes: 1, sum holes: 4 */
	/* padding: 24 */
} __attribute__((__aligned__(64)));

patched
=======

struct per_cpu_pages {
	spinlock_t                 lock;                 /*     0     4 */
	int                        count;                /*     4     4 */
	int                        count_min;            /*     8     4 */
	int                        high;                 /*    12     4 */
	int                        high_min;             /*    16     4 */
	int                        high_max;             /*    20     4 */
	int                        batch;                /*    24     4 */
	u8                         flags;                /*    28     1 */
	u8                         alloc_factor;         /*    29     1 */
	u8                         expire;               /*    30     1 */

	/* XXX 1 byte hole, try to pack */

	short int                  free_count;           /*    32     2 */

	/* XXX 6 bytes hole, try to pack */

	struct list_head           lists[13];            /*    40   208 */

	/* size: 256, cachelines: 4, members: 12 */
	/* sum members: 241, holes: 2, sum holes: 7 */
	/* padding: 8 */
} __attribute__((__aligned__(64)));

The size of the struct doesn't change with the patchset.
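
The layout claim above can be cross-checked at compile time.  The
following is only a sketch, not kernel code: it assumes a 64-bit
(LP64) target and GCC or Clang, models spinlock_t as a 4-byte value
and list_head as two pointers to match the sizes in the pahole output,
and uses the hypothetical names pcp_base, pcp_patched and
SKETCH_NR_LISTS.  With other kernel configs (e.g. lock debugging
enabled) the real sizes may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types (assumptions): sizes chosen to match the pahole dump. */
typedef struct { uint32_t raw; } spinlock_t;	/* 4 bytes */
struct list_head { struct list_head *next, *prev; };	/* 16 bytes on LP64 */
typedef uint8_t u8;

#define SKETCH_NR_LISTS 13	/* lists[13] in both dumps */

/* Layout before the patchset, field order copied from the "base" dump. */
struct pcp_base {
	spinlock_t lock;
	int count;
	int high;
	int batch;
	short free_factor;
	short expire;
	struct list_head lists[SKETCH_NR_LISTS];
} __attribute__((__aligned__(64)));

/* Layout after the patchset, field order copied from the "patched" dump. */
struct pcp_patched {
	spinlock_t lock;
	int count;
	int count_min;
	int high;
	int high_min;
	int high_max;
	int batch;
	u8 flags;
	u8 alloc_factor;
	u8 expire;
	short free_count;
	struct list_head lists[SKETCH_NR_LISTS];
} __attribute__((__aligned__(64)));

static_assert(sizeof(void *) == 8, "sketch assumes a 64-bit target");

/* Offsets reported by pahole. */
static_assert(offsetof(struct pcp_base, lists) == 24, "base: lists offset");
static_assert(offsetof(struct pcp_patched, free_count) == 32,
	      "patched: 1-byte hole before free_count");
static_assert(offsetof(struct pcp_patched, lists) == 40,
	      "patched: 6-byte hole before lists");

/* The new fields fit into the former hole and tail padding. */
static_assert(sizeof(struct pcp_base) == 256, "base: 4 cachelines");
static_assert(sizeof(struct pcp_patched) == 256, "patched: 4 cachelines");
```

If this file compiles (e.g. with `cc -std=c11 -c`), the modeled layouts
agree with the numbers above: the new fields consume the old 4-byte
hole and 16 of the 24 bytes of tail padding, so both variants still
occupy 4 cachelines.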

Changelog:

v2:

- Fix the kbuild test configuration and results.  Thanks to Andrew for
  the reminder about the test results!

- Add documentation for the sysctl behavior extension in [06/10] per
  Andrew's comments.

Best Regards,
Huang, Ying