From patchwork Wed Sep 20 06:18:46 2023
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying,
	Andrew Morton, Mel Gorman, Vlastimil Babka, David Hildenbrand,
	Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin,
	Matthew Wilcox, Christoph Lameter
Subject: [PATCH 00/10] mm: PCP high auto-tuning
Date: Wed, 20 Sep 2023 14:18:46 +0800
Message-Id: <20230920061856.257597-1-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
MIME-Version: 1.0
The page allocation performance requirements of different workloads
are often different. So, we need to tune the PCP (Per-CPU Pageset)
high on each CPU automatically to optimize the page allocation
performance.

The list of patches in the series is as follows:

 1  mm, pcp: avoid to drain PCP when process exit
 2  cacheinfo: calculate per-CPU data cache size
 3  mm, pcp: reduce lock contention for draining high-order pages
 4  mm: restrict the pcp batch scale factor to avoid too long latency
 5  mm, page_alloc: scale the number of pages that are batch allocated
 6  mm: add framework for PCP high auto-tuning
 7  mm: tune PCP high automatically
 8  mm, pcp: decrease PCP high if free pages < high watermark
 9  mm, pcp: avoid to reduce PCP high unnecessarily
10  mm, pcp: reduce detecting time of consecutive high order page freeing

Patches 1/2/3 optimize PCP draining for consecutive high-order page
freeing. Patches 4/5 optimize batch freeing and allocation. Patches
6/7/8/9 implement and optimize a PCP high auto-tuning method. Patch 10
optimizes PCP draining for consecutive high-order page freeing based
on PCP high auto-tuning.

The test results for the patches with performance impact are as
follows.

kbuild
======

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`.

	 build time  zone lock%  free_high  alloc_zone
	 ----------  ----------  ---------  ----------
base	      100.0        43.6      100.0       100.0
patch1	       96.6        40.3       49.2        95.2
patch3	       96.4        40.5       11.3        95.1
patch5	       96.1        37.9       13.3        96.8
patch7	       86.4         9.8        6.2        22.0
patch9	       85.9         9.4        4.8        16.3
patch10	       87.7        12.6       29.0        32.3

The PCP draining optimization (patches 1/3) improves performance a
little. The PCP batch allocation optimization (patch 5) reduces zone
lock contention a little. The PCP high auto-tuning (patches 7/9)
improves performance significantly: the number of pages allocated from
the zone (the tuning target) drops greatly, so the zone lock
contention cycles% drops greatly too.
The further PCP draining optimization (patch 10) based on PCP tuning
reduces performance a little, but it benefits network workloads as
shown below. With the PCP tuning patches (patches 7/9/10), the maximum
memory used during the test increases by up to 50.6% because more
pages are cached in the PCP. But the used memory finally decreases to
the same level as that of the base kernel. That is, the pages cached
in the PCP are released back to the zone after not being used
actively.

netperf SCTP_STREAM_MANY
========================

On a 2-socket Intel server with 128 logical CPUs, we tested the
SCTP_STREAM_MANY test case of the netperf test suite with 64 pairs of
processes.

	 score  zone lock%  free_high  alloc_zone  cache miss rate%
	 -----  ----------  ---------  ----------  ----------------
base	 100.0         2.0      100.0       100.0               1.3
patch1	  99.7         2.0       99.7        99.7               1.3
patch3	 105.5         1.2       13.2       105.4               1.2
patch5	 106.9         1.2       13.4       106.9               1.3
patch7	 103.5         1.8        6.8        90.8               7.6
patch9	 103.7         1.8        6.6        89.8               7.7
patch10	 106.9         1.2       13.5       106.9               1.2

The PCP draining optimization (patches 1/3) improves performance. The
PCP high auto-tuning (patches 7/9) reduces performance a little
because PCP draining sometimes cannot be triggered in time, so the
cache miss rate% increases. The further PCP draining optimization
(patch 10) based on PCP tuning restores the performance.

lmbench3 UNIX (AF_UNIX)
=======================

On a 2-socket Intel server with 128 logical CPUs, we tested the UNIX
(AF_UNIX socket) test case of the lmbench3 test suite with 16 pairs of
processes.

	 score  zone lock%  free_high  alloc_zone  cache miss rate%
	 -----  ----------  ---------  ----------  ----------------
base	 100.0        50.0      100.0       100.0               0.3
patch1	 117.1        45.8       72.6       108.9               0.2
patch3	 201.6        21.2        7.4       111.5               0.2
patch5	 201.9        20.9        7.5       112.7               0.3
patch7	 194.2        19.3        7.3       111.5               2.9
patch9	 193.1        19.2        7.2       110.4               2.9
patch10	 196.8        21.0        7.4       111.2               2.1

The PCP draining optimization (patches 1/3) improves performance
greatly.
The PCP tuning (patches 7/9) reduces performance a little because PCP
draining sometimes cannot be triggered in time. The further PCP
draining optimization (patch 10) based on PCP tuning partly restores
the performance.

The patchset adds several fields to struct per_cpu_pages. The struct
layout before/after the patchset is as follows.

base
====

struct per_cpu_pages {
	spinlock_t                 lock;                 /*     0     4 */
	int                        count;                /*     4     4 */
	int                        high;                 /*     8     4 */
	int                        batch;                /*    12     4 */
	short int                  free_factor;          /*    16     2 */
	short int                  expire;               /*    18     2 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head           lists[13];            /*    24   208 */

	/* size: 256, cachelines: 4, members: 7 */
	/* sum members: 228, holes: 1, sum holes: 4 */
	/* padding: 24 */
} __attribute__((__aligned__(64)));

patched
=======

struct per_cpu_pages {
	spinlock_t                 lock;                 /*     0     4 */
	int                        count;                /*     4     4 */
	int                        count_min;            /*     8     4 */
	int                        high;                 /*    12     4 */
	int                        high_min;             /*    16     4 */
	int                        high_max;             /*    20     4 */
	int                        batch;                /*    24     4 */
	u8                         flags;                /*    28     1 */
	u8                         alloc_factor;         /*    29     1 */
	u8                         expire;               /*    30     1 */

	/* XXX 1 byte hole, try to pack */

	short int                  free_count;           /*    32     2 */

	/* XXX 6 bytes hole, try to pack */

	struct list_head           lists[13];            /*    40   208 */

	/* size: 256, cachelines: 4, members: 12 */
	/* sum members: 241, holes: 2, sum holes: 7 */
	/* padding: 8 */
} __attribute__((__aligned__(64)));

The size of the struct doesn't change with the patchset.

Best Regards,
Huang, Ying