From patchwork Wed Sep 20 06:18:47 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13392108
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Andrew Morton, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH 01/10] mm, pcp: avoid to drain PCP when process exit
Date: Wed, 20 Sep 2023 14:18:47 +0800
Message-Id: <20230920061856.257597-2-ying.huang@intel.com>
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>

In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained when the PCP is mostly being used for freeing high-order pages, to improve the reuse of cache-hot pages between the page-allocating and page-freeing CPUs. But this draining mechanism may be triggered unexpectedly when a process exits.
With some customized trace points, it was found that PCP draining (free_high == true) was triggered by an order-1 page free with the following call stack:

 => free_unref_page_commit
 => free_unref_page
 => __mmdrop
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

Checking the source code, this is the freeing of the page table PGD (mm_free_pgd()). It is an order-1 page free if CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for security. Just before that, page freeing with the following call stack was found:

 => free_unref_page_commit
 => free_unref_page_list
 => release_pages
 => tlb_batch_pages_flush
 => tlb_finish_mmu
 => exit_mmap
 => __mmput
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

So, when a process exits,

- a large number of user pages of the process are freed without any page allocation; it is highly possible that pcp->free_factor becomes > 0;

- after freeing all user pages, the PGD is freed, which is an order-1 page free, and the PCP is drained.

All in all, when a process exits, it is highly possible that the PCP will be drained. This is unexpected behavior. To avoid this, in this patch the PCP draining is only triggered by two consecutive high-order page freeings (a simplified sketch of the heuristic follows the patch below).

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on one socket with `make -j 112`. With the patch, the build time decreases 3.4% (from 206s to 199s). The cycles% of the spinlock contention (mostly for zone lock) decreases from 43.6% to 40.3% (with PCP size == 361). The number of PCP drains for high-order page freeing (free_high) decreases 50.8%.

This helps network workloads too, by reducing zone lock contention. On a 2-socket Intel server with 128 logical CPUs, with the patch, the network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test suite with 16-pair processes increases 17.1%. The cycles% of the spinlock contention (mostly for zone lock) decreases from 50.0% to 45.8%. The number of PCP drains for high-order page freeing (free_high) decreases 27.4%. The cache miss rate stays at 0.3%.
Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Mel Gorman Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter Acked-by: Mel Gorman --- include/linux/mmzone.h | 5 ++++- mm/page_alloc.c | 11 ++++++++--- 2 files changed, 12 insertions(+), 4 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4106fbc5b4b3..64d5ed2bb724 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -676,12 +676,15 @@ enum zone_watermarks { #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost) #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost) +#define PCPF_PREV_FREE_HIGH_ORDER 0x01 + struct per_cpu_pages { spinlock_t lock; /* Protects lists field */ int count; /* number of pages in the list */ int high; /* high watermark, emptying needed */ int batch; /* chunk size for buddy add/remove */ - short free_factor; /* batch scaling factor during free */ + u8 flags; /* protected by pcp->lock */ + u8 free_factor; /* batch scaling factor during free */ #ifdef CONFIG_NUMA short expire; /* When 0, remote pagesets are drained */ #endif diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0c5be12f9336..828dcc24b030 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2370,7 +2370,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, { int high; int pindex; - bool free_high; + bool free_high = false; __count_vm_events(PGFREE, 1 << order); pindex = order_to_pindex(migratetype, order); @@ -2383,8 +2383,13 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, * freeing without allocation. The remainder after bulk freeing * stops will be drained from vmstat refresh context. 
*/ - free_high = (pcp->free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER); - + if (order && order <= PAGE_ALLOC_COSTLY_ORDER) { + free_high = (pcp->free_factor && + (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER)); + pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER; + } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) { + pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER; + } high = nr_pcp_high(pcp, zone, free_high); if (pcp->count >= high) { free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex); From patchwork Wed Sep 20 06:18:48 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 13392109 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 31B15CE79AE for ; Wed, 20 Sep 2023 06:19:46 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id BA4D66B010D; Wed, 20 Sep 2023 02:19:45 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id B52AE6B010E; Wed, 20 Sep 2023 02:19:45 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9F4176B010F; Wed, 20 Sep 2023 02:19:45 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 89E7A6B010D for ; Wed, 20 Sep 2023 02:19:45 -0400 (EDT) Received: from smtpin19.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 618661A097E for ; Wed, 20 Sep 2023 06:19:45 +0000 (UTC) X-FDA: 81255974730.19.30BB63B Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.126]) by imf29.hostedemail.com (Postfix) with ESMTP id BFD8B12001A for ; Wed, 20 Sep 2023 06:19:41 +0000 (UTC) Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=cF9ll9AY; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf29.hostedemail.com: domain of ying.huang@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=ying.huang@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1695190782; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=36opLWR6CJOCDKHWsDafiwu7OGcLAly4+3YEuY6IXMM=; b=6JUktXmoqd/sLGBlMJkGZOvh4EJANW718PHvnKfDOEo9MqYAdX5zP/4N7xnv0vS+xRJEQK asmRFNOPamkgSBjfulv3grbvmoKd6mogQHSkyEVYFpFeKjXK0CWk5+yw+OgVoymb774k7X s/BcKfSph7UWPlYsqKdigdADgTlk17M= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=cF9ll9AY; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf29.hostedemail.com: domain of ying.huang@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=ying.huang@intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1695190782; a=rsa-sha256; cv=none; b=xbrku065KoV6tJVoO+Y/+9lgwgHM5ZpCKV+RU4MWWa4xbMxuxwDtRrbsITG568LOnCHgjB Dcp9MrJptQ78vAaZzcQaS4zdpMQapDEwW67KopaYSERtjLjb53xinzqhJ1agJHoRiOD9Pp BHOLV45VC61uxTIZMcaCgPhCa0d46Cc= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1695190782; x=1726726782; 
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Sudeep Holla, Andrew Morton, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH 02/10] cacheinfo: calculate per-CPU data cache size
Date: Wed, 20 Sep 2023 14:18:48 +0800
Message-Id: <20230920061856.257597-3-ying.huang@intel.com>
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>
Per-CPU data cache size is useful information. For example, it can be used to estimate how much of the data cache is effectively available to each CPU; a later patch in this series uses it to tune PCP (Per-CPU Pageset) behavior. So, in this patch, the data cache size for each CPU is calculated as the sum, over all data/unified cache leaves, of data_cache_size / shared_cpu_weight.

A brute-force iteration over all online CPUs is used, to avoid allocating an extra cpumask, especially in the CPU-offline callback.

Signed-off-by: "Huang, Ying"
Cc: Sudeep Holla
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 drivers/base/cacheinfo.c  | 42 ++++++++++++++++++++++++++++++++++++++-
 include/linux/cacheinfo.h |  1 +
 2 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index cbae8be1fe52..3e8951a3fbab 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
 	return rc;
 }
 
+static void update_data_cache_size_cpu(unsigned int cpu)
+{
+	struct cpu_cacheinfo *ci;
+	struct cacheinfo *leaf;
+	unsigned int i, nr_shared;
+	unsigned int size_data = 0;
+
+	if (!per_cpu_cacheinfo(cpu))
+		return;
+
+	ci = ci_cacheinfo(cpu);
+	for (i = 0; i < cache_leaves(cpu); i++) {
+		leaf = per_cpu_cacheinfo_idx(cpu, i);
+		if (leaf->type != CACHE_TYPE_DATA &&
+		    leaf->type != CACHE_TYPE_UNIFIED)
+			continue;
+		nr_shared = cpumask_weight(&leaf->shared_cpu_map);
+		if (!nr_shared)
+			continue;
+		size_data += leaf->size / nr_shared;
+	}
+	ci->size_data = size_data;
+}
+
+static void update_data_cache_size(bool cpu_online, unsigned int cpu)
+{
+	unsigned int icpu;
+
+	for_each_online_cpu(icpu) {
+		if (!cpu_online && icpu == cpu)
+			continue;
+		update_data_cache_size_cpu(icpu);
+	}
+}
+
 static int cacheinfo_cpu_online(unsigned int cpu)
 {
 	int rc = detect_cache_attributes(cpu);
@@ -906,7 +941,11 @@ static int cacheinfo_cpu_online(unsigned int cpu)
 		return rc;
 	rc = cache_add_dev(cpu);
 	if (rc)
-		free_cache_attributes(cpu);
+		goto err;
+	update_data_cache_size(true, cpu);
+	return 0;
+err:
+	free_cache_attributes(cpu);
 	return rc;
 }
 
@@ -916,6 +955,7 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)
 
 	cpu_cache_sysfs_exit(cpu);
 	free_cache_attributes(cpu);
+	update_data_cache_size(false, cpu);
 	return 0;
 }
 
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index a5cfd44fab45..4e7ccfa0c36d 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -73,6 +73,7 @@ struct cacheinfo {
 
 struct cpu_cacheinfo {
 	struct cacheinfo *info_list;
+	unsigned int size_data;
 	unsigned int num_levels;
 	unsigned int num_leaves;
 	bool cpu_map_populated;
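As a worked example of the formula, take a hypothetical CPU with a 48 KB private L1 data cache, a 1280 KB L2 shared by 2 SMT siblings, and a 61440 KB L3 shared by 48 CPUs (the cache sizes are invented for illustration; only data and unified leaves are counted):

  size_data = 48 KB / 1 + 1280 KB / 2 + 61440 KB / 48
            = 48 KB + 640 KB + 1280 KB
            = 1968 KB

So each CPU is credited with roughly 2 MB of data cache, even though the raw leaf sizes sum to far more; dividing each leaf by the number of CPUs sharing it avoids double counting shared caches.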
From patchwork Wed Sep 20 06:18:49 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13392110
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Andrew Morton, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH 03/10] mm, pcp: reduce lock contention for draining high-order pages
Date: Wed, 20 Sep 2023 14:18:49 +0800
Message-Id: <20230920061856.257597-4-ying.huang@intel.com>
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>

In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained when the PCP is mostly being used for freeing high-order pages, to improve the reuse of cache-hot pages between the page-allocating and page-freeing CPUs.

On a system with a small per-CPU data cache, pages shouldn't be cached before draining, to keep them cache-hot. But on a system with a large per-CPU data cache, more pages can be cached before draining, to reduce zone lock contention. So, in this patch, instead of draining without any caching, up to "batch" pages are cached in the PCP before draining if the per-CPU data cache size is larger than "4 * batch" pages (with the default batch of 63 and 4 KB pages, roughly 1 MB of data cache; a sketch follows the patch below).

On a 2-socket Intel server with 128 logical CPUs, with the patch, the network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test suite with 16-pair processes increases 72.2%. The cycles% of the spinlock contention (mostly for zone lock) decreases from 45.8% to 21.2%. The number of PCP drains for high-order page freeing (free_high) decreases 89.8%. The cache miss rate stays at 0.3%.
Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Mel Gorman Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter Acked-by: Mel Gorman --- drivers/base/cacheinfo.c | 2 ++ include/linux/gfp.h | 1 + include/linux/mmzone.h | 1 + mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++++++++- 4 files changed, 40 insertions(+), 1 deletion(-) diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c index 3e8951a3fbab..a55b2f83958b 100644 --- a/drivers/base/cacheinfo.c +++ b/drivers/base/cacheinfo.c @@ -943,6 +943,7 @@ static int cacheinfo_cpu_online(unsigned int cpu) if (rc) goto err; update_data_cache_size(true, cpu); + setup_pcp_cacheinfo(); return 0; err: free_cache_attributes(cpu); @@ -956,6 +957,7 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu) free_cache_attributes(cpu); update_data_cache_size(false, cpu); + setup_pcp_cacheinfo(); return 0; } diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 665f06675c83..665edc11fb9f 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -325,6 +325,7 @@ void drain_all_pages(struct zone *zone); void drain_local_pages(struct zone *zone); void page_alloc_init_late(void); +void setup_pcp_cacheinfo(void); /* * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 64d5ed2bb724..4132e7490b49 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -677,6 +677,7 @@ enum zone_watermarks { #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost) #define PCPF_PREV_FREE_HIGH_ORDER 0x01 +#define PCPF_FREE_HIGH_BATCH 0x02 struct per_cpu_pages { spinlock_t lock; /* Protects lists field */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 828dcc24b030..06aa9c5687e0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -52,6 +52,7 @@ #include #include #include +#include #include #include "internal.h" #include "shuffle.h" @@ -2385,7 +2386,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, */ if (order && order <= PAGE_ALLOC_COSTLY_ORDER) { free_high = (pcp->free_factor && - (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER)); + (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) && + (!(pcp->flags & PCPF_FREE_HIGH_BATCH) || + pcp->count >= READ_ONCE(pcp->batch))); pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER; } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) { pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER; @@ -5418,6 +5421,38 @@ static void zone_pcp_update(struct zone *zone, int cpu_online) mutex_unlock(&pcp_batch_high_lock); } +static void zone_pcp_update_cacheinfo(struct zone *zone) +{ + int cpu; + struct per_cpu_pages *pcp; + struct cpu_cacheinfo *cci; + + for_each_online_cpu(cpu) { + pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); + cci = get_cpu_cacheinfo(cpu); + /* + * If per-CPU data cache is large enough, up to + * "batch" high-order pages can be cached in PCP for + * consecutive freeing. This can reduce zone lock + * contention without hurting cache-hot pages sharing. + */ + spin_lock(&pcp->lock); + if ((cci->size_data >> PAGE_SHIFT) > 4 * pcp->batch) + pcp->flags |= PCPF_FREE_HIGH_BATCH; + else + pcp->flags &= ~PCPF_FREE_HIGH_BATCH; + spin_unlock(&pcp->lock); + } +} + +void setup_pcp_cacheinfo(void) +{ + struct zone *zone; + + for_each_populated_zone(zone) + zone_pcp_update_cacheinfo(zone); +} + /* * Allocate per cpu pagesets and initialize them. * Before this call only boot pagesets were available. 
From patchwork Wed Sep 20 06:18:50 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13392111
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Andrew Morton, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH 04/10] mm: restrict the pcp batch scale factor to avoid too long latency
Date: Wed, 20 Sep 2023 14:18:50 +0800
Message-Id: <20230920061856.257597-5-ying.huang@intel.com>
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>

In the page allocator, the PCP (Per-CPU Pageset) is refilled and drained in batches, to increase page allocation throughput, reduce the page allocation/freeing latency per page, and reduce zone lock contention. But too large a batch size causes too long a maximal allocation/freeing latency, which may punish arbitrary users. So the default batch size is chosen carefully (in zone_batchsize(); the value is 63 for zones > 1 GB) to avoid that.
In commit 3b12e7e97938 ("mm/page_alloc: scale the number of pages that are batch freed"), the batch size is scaled up for large numbers of page freeings, to improve page freeing performance and reduce zone lock contention. A similar optimization can be applied to large numbers of page allocations too.

To find a suitable maximal batch scale factor (that is, a maximal effective batch size), some tests were run and measurements taken on several machines, as follows.

A set of debug patches were implemented to

- set PCP high to 2 * batch, to reduce the influence of PCP high;

- disable free batch size scaling, to get the raw performance;

- extract the code that runs with the zone lock held from rmqueue_bulk() and free_pcppages_bulk() into 2 separate functions, to make it easy to measure the function run time with the ftrace function_graph tracer;

- hard-code the batch size to 63 (the default), 127, 255, 511, 1023, 2047, and 4095.

Then will-it-scale/page_fault1 was used to generate the page allocation/freeing workload. The page allocation/freeing throughput (page/s) was measured via will-it-scale. The page allocation/freeing average latency (alloc/free latency avg, in us) and the allocation/freeing latency at the 99th percentile (alloc/free latency 99%, in us) were measured with the ftrace function_graph tracer.

The test results are as follows.

Sapphire Rapids Server
======================
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	 page/s		 avg / us	 99% / us	 avg / us	 99% / us
-----	----------	------------	------------	-------------	-------------
   63	 513633.4	  2.33		  3.57		  2.67		  6.83
  127	 517616.7	  4.35		  6.65		  4.22		 13.03
  255	 520822.8	  8.29		 13.32		  7.52		 25.24
  511	 524122.0	 15.79		 23.42		 14.02		 49.35
 1023	 525980.5	 30.25		 44.19		 25.36		 94.88
 2047	 526793.6	 59.39		 84.50		 45.22		140.81

Ice Lake Server
===============
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	 page/s		 avg / us	 99% / us	 avg / us	 99% / us
-----	----------	------------	------------	-------------	-------------
   63	 620210.3	  2.21		  3.68		  2.02		  4.35
  127	 627003.0	  4.09		  6.86		  3.51		  8.28
  255	 630777.5	  7.70		 13.50		  6.17		 15.97
  511	 633651.5	 14.85		 22.62		 11.66		 31.08
 1023	 637071.1	 28.55		 42.02		 20.81		 54.36
 2047	 638089.7	 56.54		 84.06		 39.28		 91.68

Cascade Lake Server
===================
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	 page/s		 avg / us	 99% / us	 avg / us	 99% / us
-----	----------	------------	------------	-------------	-------------
   63	 404706.7	  3.29		  5.03		  3.53		  4.75
  127	 422475.2	  6.12		  9.09		  6.36		  8.76
  255	 411522.2	 11.68		 16.97		 10.90		 16.39
  511	 428124.1	 22.54		 31.28		 19.86		 32.25
 1023	 414718.4	 43.39		 62.52		 40.00		 66.33
 2047	 429848.7	 86.64		120.34		 71.14		106.08

Comet Lake Desktop
==================
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	 page/s		 avg / us	 99% / us	 avg / us	 99% / us
-----	----------	------------	------------	-------------	-------------
   63	795183.13	  2.18		  3.55		  2.03		  3.05
  127	803067.85	  3.91		  6.56		  3.85		  5.52
  255	812771.10	  7.35		 10.80		  7.14		 10.20
  511	817723.48	 14.17		 27.54		 13.43		 30.31
 1023	818870.19	 27.72		 40.10		 27.89		 46.28

Coffee Lake Desktop
===================
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	 page/s		 avg / us	 99% / us	 avg / us	 99% / us
-----	----------	------------	------------	-------------	-------------
   63	 510542.8	  3.13		  4.40		  2.48		  3.43
  127	 514288.6	  5.97		  7.89		  4.65		  6.04
  255	 516889.7	 11.86		 15.58		  8.96		 12.55
  511	 519802.4	 23.10		 28.81		 16.95		 26.19
 1023	 520802.7	 45.30		 52.51		 33.19		 45.95
 2047	 519997.1	 90.63		104.00		 65.26		 81.74
From the above data, to restrict the allocation/freeing latency to less than 100 us in most cases, the maximal batch scale factor needs to be less than or equal to 5 (an effective batch of at most 63 << 5 = 2016 pages). So, in this patch, the batch scale factor is restricted to be less than or equal to 5.

Signed-off-by: "Huang, Ying"
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
Acked-by: Mel Gorman
---
 mm/page_alloc.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 06aa9c5687e0..30554c674349 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -86,6 +86,9 @@ typedef int __bitwise fpi_t;
  */
 #define FPI_TO_TAIL ((__force fpi_t)BIT(1))
 
+/* Maximum PCP batch scale factor to restrict max allocation/freeing latency */
+#define PCP_BATCH_SCALE_MAX	5
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -2340,7 +2343,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 	 * freeing of pages without any allocation.
 	 */
 	batch <<= pcp->free_factor;
-	if (batch < max_nr_free)
+	if (batch < max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
 		pcp->free_factor++;
 	batch = clamp(batch, min_nr_free, max_nr_free);
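A quick sanity check of the cap, as a sketch (the default batch of 63 from zone_batchsize() and 4 KB pages are assumed; 2016 pages sits just below the 2047 test point, where the measured 99% latencies first exceed ~100 us on several of the machines above):

#include <stdio.h>

#define PCP_BATCH_SCALE_MAX	5

int main(void)
{
	int batch = 63;	/* zone_batchsize() default for zones > 1 GB */
	int factor;

	/* Effective free batch per scale step: 63, 126, ..., 2016 pages. */
	for (factor = 0; factor <= PCP_BATCH_SCALE_MAX; factor++)
		printf("free_factor %d -> batch %4d pages (%5d KB)\n",
		       factor, batch << factor, (batch << factor) * 4);
	return 0;
}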
From patchwork Wed Sep 20 06:18:51 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13392112
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Mel Gorman, Andrew Morton, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH 05/10] mm, page_alloc: scale the number of pages that are batch allocated
Date: Wed, 20 Sep 2023 14:18:51 +0800
Message-Id: <20230920061856.257597-6-ying.huang@intel.com>
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>
When a task is allocating a large number of order-0 pages, it may acquire the zone->lock multiple times, allocating pages in batches. This may unnecessarily contend on the zone lock when allocating a very large number of pages. This patch adapts the size of the batch based on the recent allocation pattern, scaling the batch size up for subsequent allocations (a sketch of the scaling follows the patch below).

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on one socket with `make -j 112`. With the patch, the cycles% of the spinlock contention (mostly for zone lock) decreases from 40.5% to 37.9% (with PCP size == 361).
Signed-off-by: "Huang, Ying" Suggested-by: Mel Gorman Cc: Andrew Morton Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter Acked-by: Mel Gorman --- include/linux/mmzone.h | 3 ++- mm/page_alloc.c | 52 ++++++++++++++++++++++++++++++++++-------- 2 files changed, 44 insertions(+), 11 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4132e7490b49..4f7420e35fbb 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -685,9 +685,10 @@ struct per_cpu_pages { int high; /* high watermark, emptying needed */ int batch; /* chunk size for buddy add/remove */ u8 flags; /* protected by pcp->lock */ + u8 alloc_factor; /* batch scaling factor during allocate */ u8 free_factor; /* batch scaling factor during free */ #ifdef CONFIG_NUMA - short expire; /* When 0, remote pagesets are drained */ + u8 expire; /* When 0, remote pagesets are drained */ #endif /* Lists of pages, one per migrate type stored on the pcp-lists */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 30554c674349..30bb05fa5353 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2376,6 +2376,12 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, int pindex; bool free_high = false; + /* + * On freeing, reduce the number of pages that are batch allocated. + * See nr_pcp_alloc() where alloc_factor is increased for subsequent + * allocations. + */ + pcp->alloc_factor >>= 1; __count_vm_events(PGFREE, 1 << order); pindex = order_to_pindex(migratetype, order); list_add(&page->pcp_list, &pcp->lists[pindex]); @@ -2682,6 +2688,41 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, return page; } +static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order) +{ + int high, batch, max_nr_alloc; + + high = READ_ONCE(pcp->high); + batch = READ_ONCE(pcp->batch); + + /* Check for PCP disabled or boot pageset */ + if (unlikely(high < batch)) + return 1; + + /* + * Double the number of pages allocated each time there is subsequent + * refiling of order-0 pages without drain. + */ + if (!order) { + max_nr_alloc = max(high - pcp->count - batch, batch); + batch <<= pcp->alloc_factor; + if (batch <= max_nr_alloc && pcp->alloc_factor < PCP_BATCH_SCALE_MAX) + pcp->alloc_factor++; + batch = min(batch, max_nr_alloc); + } + + /* + * Scale batch relative to order if batch implies free pages + * can be stored on the PCP. Batch can be 1 for small zones or + * for boot pagesets which should never store free pages as + * the pages may belong to arbitrary zones. + */ + if (batch > 1) + batch = max(batch >> order, 2); + + return batch; +} + /* Remove page from the per-cpu list, caller must protect the list */ static inline struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, @@ -2694,18 +2735,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, do { if (list_empty(list)) { - int batch = READ_ONCE(pcp->batch); + int batch = nr_pcp_alloc(pcp, order); int alloced; - /* - * Scale batch relative to order if batch implies - * free pages can be stored on the PCP. Batch can - * be 1 for small zones or for boot pagesets which - * should never store free pages as the pages may - * belong to arbitrary zones. 
- */ - if (batch > 1) - batch = max(batch >> order, 2); alloced = rmqueue_bulk(zone, order, batch, list, migratetype, alloc_flags); From patchwork Wed Sep 20 06:18:52 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 13392113 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 208DACE79AD for ; Wed, 20 Sep 2023 06:20:00 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id A94856B0115; Wed, 20 Sep 2023 02:19:59 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id A41FD6B0116; Wed, 20 Sep 2023 02:19:59 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8BB746B0117; Wed, 20 Sep 2023 02:19:59 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 75AF06B0115 for ; Wed, 20 Sep 2023 02:19:59 -0400 (EDT) Received: from smtpin24.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 4BA18B409E for ; Wed, 20 Sep 2023 06:19:59 +0000 (UTC) X-FDA: 81255975318.24.5DB2453 Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.126]) by imf29.hostedemail.com (Postfix) with ESMTP id 16741120015 for ; Wed, 20 Sep 2023 06:19:56 +0000 (UTC) Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=AYL8fzfH; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf29.hostedemail.com: domain of ying.huang@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=ying.huang@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1695190797; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=SdcYhsok387cDjvXpV2YEVBEz71gokOEV+r5pxgghIk=; b=KFQDPi0REeFujg+wId8x95skTVxhQH/UTorBnyXD7u8GtoLuC6WfPinjybnrMPDC5AEjXs xciRJAkFMDliCRUSO8Ry5XZDXgEPrnYsmqCC0WWq7j2deVrfgdw/Ht+ppmq6fCRAqPs/aM jZAUArG6n4ZUDd1FXlYyFMbiRL2Qw7g= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=AYL8fzfH; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf29.hostedemail.com: domain of ying.huang@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=ying.huang@intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1695190797; a=rsa-sha256; cv=none; b=eyZtn78wRETvSt14ZG1DiBEgLlJjeAklN8+9GSA6fHssl+RBaWdoJtFQqYnRXzMXRFOYvg 4gN+iJUvNYfMbY7BKgzYH3lZnLZXipE0is8jKhEKLZh5CqXKq+d62gWwgZc8wrZjpTVLuQ 3/Zlp4ooKsi65iaITHjNNJ4xi3tQ+o4= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1695190797; x=1726726797; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=mpRZqHQlbPyOoPdSIFxq6ijVhcRIT0jLaqXdR2d5F6E=; b=AYL8fzfHjOESLE0NQxAgJ5zTca3qnmYfV3n1q/p24G9FwQkonPew90gz QF7IJKLZ7M8I6izXYhRV6qo5tBFvX4TFwBQmU90Jg8q7lEw+hNf00gp0K dQfjCtKadZwIfgZhKwkm/Pof6j56kZW5n4/dxrzP/MtCSvW+y52KGpnMO 3/92lJwVufPag6kMjnqAl/aukSzbMwXojUWUbGyMOBxU5QPxISKrl/hzV 
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying,
    Andrew Morton, Mel Gorman, Vlastimil Babka, David Hildenbrand,
    Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin,
    Matthew Wilcox, Christoph Lameter
Subject: [PATCH 06/10] mm: add framework for PCP high auto-tuning
Date: Wed, 20 Sep 2023 14:18:52 +0800
Message-Id: <20230920061856.257597-7-ying.huang@intel.com>
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>

The page allocation performance requirements of different workloads are
usually different. So, we need to tune PCP (per-CPU pageset) high to
optimize the workload page allocation performance. Now, we have a
system-wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high
by hand. But it's hard to find out the best value by hand, and one
global configuration may not work best for the different workloads that
run on the same system.

One solution to these issues is to tune PCP high of each CPU
automatically. This patch adds the framework for PCP high auto-tuning.
With it, pcp->high of each CPU will be changed automatically by the
tuning algorithm at runtime. The minimal high (pcp->high_min) is the
original PCP high value calculated based on the low watermark pages.
The maximal high (pcp->high_max) is the PCP high value when the
percpu_pagelist_high_fraction sysctl knob is set to
MIN_PERCPU_PAGELIST_HIGH_FRACTION, that is, the maximal pcp->high that
can be set via the sysctl knob by hand.

It's possible that PCP high auto-tuning doesn't work well for some
workloads. So, when PCP high is tuned by hand via the sysctl knob, the
auto-tuning will be disabled and the PCP high set by hand will be used
instead.

This patch only adds the framework, so pcp->high will always be set to
pcp->high_min (the original default). The actual auto-tuning algorithm
will be added in the following patches in the series.
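The relation between the three values can be summarized with a short
sketch. This is not kernel code: the struct and helper below are
invented for illustration, and only the clamping relation and the
"manual tuning collapses the range" behavior come from the patch.

#include <stdio.h>

struct pcp_range {
    int high_min;   /* default high, derived from the low watermark */
    int high_max;   /* high for MIN_PERCPU_PAGELIST_HIGH_FRACTION */
    int high;       /* runtime value the tuner may move */
};

/* The auto-tuner may pick any value inside [high_min, high_max]. */
static int clamp_high(const struct pcp_range *r, int proposed)
{
    if (proposed < r->high_min)
        return r->high_min;
    if (proposed > r->high_max)
        return r->high_max;
    return proposed;
}

int main(void)
{
    struct pcp_range autotuned = { .high_min = 200, .high_max = 3000 };
    /* Manual sysctl tuning sets high_min == high_max: no room to tune. */
    struct pcp_range manual = { .high_min = 1000, .high_max = 1000 };

    printf("auto:   %d\n", clamp_high(&autotuned, 5000)); /* 3000 */
    printf("manual: %d\n", clamp_high(&manual, 5000));    /* 1000 */
    return 0;
}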
Signed-off-by: "Huang, Ying"
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 include/linux/mmzone.h |  5 ++-
 mm/page_alloc.c        | 71 +++++++++++++++++++++++++++---------------
 2 files changed, 50 insertions(+), 26 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4f7420e35fbb..d6cfb5023f3e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -683,6 +683,8 @@ struct per_cpu_pages {
     spinlock_t lock;    /* Protects lists field */
     int count;          /* number of pages in the list */
     int high;           /* high watermark, emptying needed */
+    int high_min;       /* min high watermark */
+    int high_max;       /* max high watermark */
     int batch;          /* chunk size for buddy add/remove */
     u8 flags;           /* protected by pcp->lock */
     u8 alloc_factor;    /* batch scaling factor during allocate */
@@ -842,7 +844,8 @@ struct zone {
      * the high and batch values are copied to individual pagesets for
      * faster access
      */
-    int pageset_high;
+    int pageset_high_min;
+    int pageset_high_max;
     int pageset_batch;
 
 #ifndef CONFIG_SPARSEMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 30bb05fa5353..38bfab562b44 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2353,7 +2353,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
                bool free_high)
 {
-    int high = READ_ONCE(pcp->high);
+    int high = READ_ONCE(pcp->high_min);
 
     if (unlikely(!high || free_high))
         return 0;
@@ -2692,7 +2692,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
 {
     int high, batch, max_nr_alloc;
 
-    high = READ_ONCE(pcp->high);
+    high = READ_ONCE(pcp->high_min);
     batch = READ_ONCE(pcp->batch);
 
     /* Check for PCP disabled or boot pageset */
@@ -5298,14 +5298,15 @@ static int zone_batchsize(struct zone *zone)
 }
 
 static int percpu_pagelist_high_fraction;
-static int zone_highsize(struct zone *zone, int batch, int cpu_online)
+static int zone_highsize(struct zone *zone, int batch, int cpu_online,
+             int high_fraction)
 {
 #ifdef CONFIG_MMU
     int high;
     int nr_split_cpus;
     unsigned long total_pages;
 
-    if (!percpu_pagelist_high_fraction) {
+    if (!high_fraction) {
         /*
          * By default, the high value of the pcp is based on the zone
          * low watermark so that if they are full then background
@@ -5318,15 +5319,15 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
          * value is based on a fraction of the managed pages in the
          * zone.
          */
-        total_pages = zone_managed_pages(zone) / percpu_pagelist_high_fraction;
+        total_pages = zone_managed_pages(zone) / high_fraction;
     }
 
     /*
      * Split the high value across all online CPUs local to the zone. Note
      * that early in boot that CPUs may not be online yet and that during
      * CPU hotplug that the cpumask is not yet updated when a CPU is being
-     * onlined. For memory nodes that have no CPUs, split pcp->high across
-     * all online CPUs to mitigate the risk that reclaim is triggered
+     * onlined. For memory nodes that have no CPUs, split the high value
+     * across all online CPUs to mitigate the risk that reclaim is triggered
      * prematurely due to pages stored on pcp lists.
      */
     nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
@@ -5354,19 +5355,21 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 * However, guaranteeing these relations at all times would require e.g. write
 * barriers here but also careful usage of read barriers at the read side, and
 * thus be prone to error and bad for performance. Thus the update only prevents
- * store tearing. Any new users of pcp->batch and pcp->high should ensure they
- * can cope with those fields changing asynchronously, and fully trust only the
- * pcp->count field on the local CPU with interrupts disabled.
+ * store tearing. Any new users of pcp->batch, pcp->high_min and pcp->high_max
+ * should ensure they can cope with those fields changing asynchronously, and
+ * fully trust only the pcp->count field on the local CPU with interrupts
+ * disabled.
 *
 * mutex_is_locked(&pcp_batch_high_lock) required when calling this function
 * outside of boot time (or some other assurance that no concurrent updaters
 * exist).
 */
-static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
-        unsigned long batch)
+static void pageset_update(struct per_cpu_pages *pcp, unsigned long high_min,
+        unsigned long high_max, unsigned long batch)
 {
     WRITE_ONCE(pcp->batch, batch);
-    WRITE_ONCE(pcp->high, high);
+    WRITE_ONCE(pcp->high_min, high_min);
+    WRITE_ONCE(pcp->high_max, high_max);
 }
 
 static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats)
@@ -5386,20 +5389,21 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
      * need to be as careful as pageset_update() as nobody can access the
      * pageset yet.
      */
-    pcp->high = BOOT_PAGESET_HIGH;
+    pcp->high_min = BOOT_PAGESET_HIGH;
+    pcp->high_max = BOOT_PAGESET_HIGH;
     pcp->batch = BOOT_PAGESET_BATCH;
     pcp->free_factor = 0;
 }
 
-static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
-        unsigned long batch)
+static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min,
+        unsigned long high_max, unsigned long batch)
 {
     struct per_cpu_pages *pcp;
     int cpu;
 
     for_each_possible_cpu(cpu) {
         pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-        pageset_update(pcp, high, batch);
+        pageset_update(pcp, high_min, high_max, batch);
     }
 }
 
@@ -5409,19 +5413,34 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
 */
static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online)
{
-    int new_high, new_batch;
+    int new_high_min, new_high_max, new_batch;

    new_batch = max(1, zone_batchsize(zone));
-    new_high = zone_highsize(zone, new_batch, cpu_online);
+    if (percpu_pagelist_high_fraction) {
+        new_high_min = zone_highsize(zone, new_batch, cpu_online,
+                         percpu_pagelist_high_fraction);
+        /*
+         * PCP high is tuned manually, disable auto-tuning via
+         * setting high_min and high_max to the manual value.
+         */
+        new_high_max = new_high_min;
+    } else {
+        new_high_min = zone_highsize(zone, new_batch, cpu_online, 0);
+        new_high_max = zone_highsize(zone, new_batch, cpu_online,
+                         MIN_PERCPU_PAGELIST_HIGH_FRACTION);
+    }

-    if (zone->pageset_high == new_high &&
+    if (zone->pageset_high_min == new_high_min &&
+        zone->pageset_high_max == new_high_max &&
        zone->pageset_batch == new_batch)
        return;

-    zone->pageset_high = new_high;
+    zone->pageset_high_min = new_high_min;
+    zone->pageset_high_max = new_high_max;
    zone->pageset_batch = new_batch;

-    __zone_set_pageset_high_and_batch(zone, new_high, new_batch);
+    __zone_set_pageset_high_and_batch(zone, new_high_min, new_high_max,
+                      new_batch);
}

void __meminit setup_zone_pageset(struct zone *zone)
@@ -5529,7 +5548,8 @@ __meminit void zone_pcp_init(struct zone *zone)
     */
    zone->per_cpu_pageset = &boot_pageset;
    zone->per_cpu_zonestats = &boot_zonestats;
-    zone->pageset_high = BOOT_PAGESET_HIGH;
+    zone->pageset_high_min = BOOT_PAGESET_HIGH;
+    zone->pageset_high_max = BOOT_PAGESET_HIGH;
    zone->pageset_batch = BOOT_PAGESET_BATCH;

    if (populated_zone(zone))
@@ -6431,13 +6451,14 @@ EXPORT_SYMBOL(free_contig_range);
void zone_pcp_disable(struct zone *zone)
{
    mutex_lock(&pcp_batch_high_lock);
-    __zone_set_pageset_high_and_batch(zone, 0, 1);
+    __zone_set_pageset_high_and_batch(zone, 0, 0, 1);
    __drain_all_pages(zone, true);
}

void zone_pcp_enable(struct zone *zone)
{
-    __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+    __zone_set_pageset_high_and_batch(zone, zone->pageset_high_min,
+                      zone->pageset_high_max, zone->pageset_batch);
    mutex_unlock(&pcp_batch_high_lock);
}
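For a feel of the numbers, the high_max computation in zone_highsize()
above can be reproduced in userspace. The zone size and CPU count below
are invented; the division by the fraction and the split across the
zone's CPUs follow the function, and 8 is used for
MIN_PERCPU_PAGELIST_HIGH_FRACTION (its value in mm/page_alloc.c at the
time of writing, treat it as an assumption here).

#include <stdio.h>

int main(void)
{
    unsigned long managed_pages = 4UL << 20; /* 16 GB zone, 4 KB pages */
    unsigned long nr_split_cpus = 16;        /* CPUs local to the zone */
    unsigned long fraction = 8; /* MIN_PERCPU_PAGELIST_HIGH_FRACTION */
    unsigned long high_max;

    /* total_pages = zone_managed_pages(zone) / high_fraction; */
    /* high = total_pages / nr_split_cpus; */
    high_max = managed_pages / fraction / nr_split_cpus;

    printf("pcp->high_max = %lu pages (%lu MB per CPU)\n",
           high_max, high_max * 4 / 1024); /* 32768 pages, 128 MB */
    return 0;
}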
From patchwork Wed Sep 20 06:18:53 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13392114
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying,
    Mel Gorman, Michal Hocko, Andrew Morton, Vlastimil Babka,
    David Hildenbrand, Johannes Weiner, Dave Hansen, Pavel Tatashin,
    Matthew Wilcox, Christoph Lameter
Subject: [PATCH 07/10] mm: tune PCP high automatically
Date: Wed, 20 Sep 2023 14:18:53 +0800
Message-Id: <20230920061856.257597-8-ying.huang@intel.com>
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>

The targets of tuning PCP high automatically are as follows:

- Minimize allocation/freeing from/to the shared zone
- Minimize idle pages in the PCP
- Minimize pages in the PCP if the system has too few free pages

To reach these targets, the following tuning algorithm is designed (a
compressed model is sketched after this list):

- When we refill the PCP via allocating from the zone, increase PCP
  high, because if we had a larger PCP, we could have avoided
  allocating from the zone.

- In the periodic vmstat updating kworker (via
  refresh_cpu_vm_stats()), decrease PCP high to try to free possibly
  idle PCP pages.

- When page reclaim is active for the zone, stop increasing PCP high
  in the allocation path, and decrease PCP high and free some pages in
  the freeing path.

So, PCP high is eventually tuned to the page allocating/freeing depth
of the workload.

One issue of the algorithm is that, if the number of pages allocated
is much larger than the number of pages freed on a CPU, PCP high may
reach the maximal value even if the allocating/freeing depth is small.
But this isn't a severe issue, because there are no idle pages in this
case.

One alternative choice is to increase PCP high when we drain the PCP
via trying to free pages to the zone, but not to increase PCP high
during PCP refilling. This can avoid the issue above. But if the
number of pages allocated is much smaller than the number of pages
freed on a CPU, there will be many idle pages in the PCP and it may be
hard to free these idle pages.

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`. With the patch, the build time
decreases by 10.1%. The cycles% of the spinlock contention (mostly for
zone lock) decreases from 37.9% to 9.8% (with PCP size == 361). The
number of PCP drains for high-order page freeing (free_high) decreases
by 53.4%. The number of pages allocated from the zone (instead of from
the PCP) decreases by 77.3%.
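The three rules can be condensed into the following model. Everything
here is a stand-in (no kernel types, and rule 3 is simplified to a
single batch step); the 4/5 decay in rule 2 matches decay_pcp_high()
in the diff below.

struct tuner {
    int high, high_min, high_max;
    int batch;
    int reclaim_active; /* models ZONE_RECLAIM_ACTIVE */
};

/* Rule 1: refilling from the zone hints that high is too small. */
static void on_refill(struct tuner *t)
{
    if (!t->reclaim_active && t->high + t->batch <= t->high_max)
        t->high += t->batch;
}

/* Rule 2: the periodic vmstat kworker decays high toward high_min. */
static void on_vmstat_tick(struct tuner *t)
{
    int decayed = t->high * 4 / 5;

    t->high = decayed > t->high_min ? decayed : t->high_min;
}

/* Rule 3: active reclaim pushes high down in the freeing path. */
static void on_free_under_reclaim(struct tuner *t)
{
    int lowered = t->high - t->batch;

    t->high = lowered > t->high_min ? lowered : t->high_min;
}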
Signed-off-by: "Huang, Ying"
Suggested-by: Mel Gorman
Suggested-by: Michal Hocko
Cc: Andrew Morton
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 include/linux/gfp.h |   1 +
 mm/page_alloc.c     | 118 ++++++++++++++++++++++++++++++++++----------
 mm/vmstat.c         |   8 +--
 3 files changed, 98 insertions(+), 29 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 665edc11fb9f..5b917e5b9350 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -320,6 +320,7 @@ extern void page_frag_free(void *addr);
 #define free_page(addr) free_pages((addr), 0)
 
 void page_alloc_init_cpuhp(void);
+int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 38bfab562b44..225abe56752c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2160,6 +2160,40 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
     return i;
 }
 
+/*
+ * Called from the vmstat counter updater to decay the PCP high.
+ * Return whether there is additional work to do.
+ */
+int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+{
+    int high_min, to_drain, batch;
+    int todo = 0;
+
+    high_min = READ_ONCE(pcp->high_min);
+    batch = READ_ONCE(pcp->batch);
+    /*
+     * Decrease pcp->high periodically to try to free possible
+     * idle PCP pages. And, avoid to free too many pages to
+     * control latency.
+     */
+    if (pcp->high > high_min) {
+        pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX),
+                 pcp->high * 4 / 5, high_min);
+        if (pcp->high > high_min)
+            todo++;
+    }
+
+    to_drain = pcp->count - pcp->high;
+    if (to_drain > 0) {
+        spin_lock(&pcp->lock);
+        free_pcppages_bulk(zone, to_drain, pcp, 0);
+        spin_unlock(&pcp->lock);
+        todo++;
+    }
+
+    return todo;
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Called from the vmstat counter updater to drain pagesets of this
@@ -2321,14 +2355,13 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
     return true;
 }
 
-static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
+static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free_high)
 {
     int min_nr_free, max_nr_free;
-    int batch = READ_ONCE(pcp->batch);
 
-    /* Free everything if batch freeing high-order pages. */
+    /* Free as much as possible if batch freeing high-order pages. */
     if (unlikely(free_high))
-        return pcp->count;
+        return min(pcp->count, batch << PCP_BATCH_SCALE_MAX);
 
     /* Check for PCP disabled or boot pageset */
     if (unlikely(high < batch))
@@ -2343,7 +2376,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
      * freeing of pages without any allocation.
      */
     batch <<= pcp->free_factor;
-    if (batch < max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
+    if (batch <= max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
         pcp->free_factor++;
     batch = clamp(batch, min_nr_free, max_nr_free);
 
@@ -2351,28 +2384,47 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 }
 
 static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
-               bool free_high)
+               int batch, bool free_high)
 {
-    int high = READ_ONCE(pcp->high_min);
+    int high, high_min, high_max;
 
-    if (unlikely(!high || free_high))
+    high_min = READ_ONCE(pcp->high_min);
+    high_max = READ_ONCE(pcp->high_max);
+    high = pcp->high = clamp(pcp->high, high_min, high_max);
+
+    if (unlikely(!high))
         return 0;
 
-    if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
-        return high;
+    if (unlikely(free_high)) {
+        pcp->high = max(high - (batch << PCP_BATCH_SCALE_MAX), high_min);
+        return 0;
+    }
 
     /*
      * If reclaim is active, limit the number of pages that can be
      * stored on pcp lists
      */
-    return min(READ_ONCE(pcp->batch) << 2, high);
+    if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) {
+        pcp->high = max(high - (batch << pcp->free_factor), high_min);
+        return min(batch << 2, pcp->high);
+    }
+
+    if (pcp->count >= high && high_min != high_max) {
+        int need_high = (batch << pcp->free_factor) + batch;
+
+        /* pcp->high should be large enough to hold batch freed pages */
+        if (pcp->high < need_high)
+            pcp->high = clamp(need_high, high_min, high_max);
+    }
+
+    return high;
 }
 
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
                    struct page *page, int migratetype,
                    unsigned int order)
 {
-    int high;
+    int high, batch;
     int pindex;
     bool free_high = false;
 
@@ -2387,6 +2439,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
     list_add(&page->pcp_list, &pcp->lists[pindex]);
     pcp->count += 1 << order;
 
+    batch = READ_ONCE(pcp->batch);
     /*
      * As high-order pages other than THP's stored on PCP can contribute
      * to fragmentation, limit the number stored when PCP is heavily
@@ -2397,14 +2450,15 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
         free_high = (pcp->free_factor &&
                  (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
                  (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
-                  pcp->count >= READ_ONCE(pcp->batch)));
+                  pcp->count >= READ_ONCE(batch)));
         pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
     } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
         pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
     }
 
-    high = nr_pcp_high(pcp, zone, free_high);
+    high = nr_pcp_high(pcp, zone, batch, free_high);
     if (pcp->count >= high) {
-        free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex);
+        free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
+                   pcp, pindex);
     }
 }
 
@@ -2688,24 +2742,38 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
     return page;
 }
 
-static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
+static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 {
-    int high, batch, max_nr_alloc;
+    int high, base_batch, batch, max_nr_alloc;
+    int high_max, high_min;
 
-    high = READ_ONCE(pcp->high_min);
-    batch = READ_ONCE(pcp->batch);
+    base_batch = READ_ONCE(pcp->batch);
+    high_min = READ_ONCE(pcp->high_min);
+    high_max = READ_ONCE(pcp->high_max);
+    high = pcp->high = clamp(pcp->high, high_min, high_max);
 
     /* Check for PCP disabled or boot pageset */
-    if (unlikely(high < batch))
+    if (unlikely(high < base_batch))
         return 1;
 
+    if (order)
+        batch = base_batch;
+    else
+        batch = (base_batch << pcp->alloc_factor);
+
     /*
-     * Double the number of pages allocated each time there is subsequent
-     * refilling of order-0 pages without drain.
+     * If we had larger pcp->high, we could avoid to allocate from
+     * zone.
      */
+    if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
+        high = pcp->high = min(high + batch, high_max);
+
     if (!order) {
-        max_nr_alloc = max(high - pcp->count - batch, batch);
-        batch <<= pcp->alloc_factor;
+        max_nr_alloc = max(high - pcp->count - base_batch, base_batch);
+        /*
+         * Double the number of pages allocated each time there is
+         * subsequent refilling of order-0 pages without drain.
+         */
         if (batch <= max_nr_alloc && pcp->alloc_factor < PCP_BATCH_SCALE_MAX)
             pcp->alloc_factor++;
         batch = min(batch, max_nr_alloc);
@@ -2735,7 +2803,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 
     do {
         if (list_empty(list)) {
-            int batch = nr_pcp_alloc(pcp, order);
+            int batch = nr_pcp_alloc(pcp, zone, order);
             int alloced;
 
             alloced = rmqueue_bulk(zone, order,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 00e81e99c6ee..2f716ad14168 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -814,9 +814,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 
     for_each_populated_zone(zone) {
         struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-#ifdef CONFIG_NUMA
         struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
-#endif
 
         for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
             int v;
@@ -832,10 +830,12 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 #endif
             }
         }
-#ifdef CONFIG_NUMA
 
         if (do_pagesets) {
             cond_resched();
+
+            changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+#ifdef CONFIG_NUMA
             /*
              * Deal with draining the remote pageset of this
              * processor
@@ -862,8 +862,8 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
                 drain_zone_pages(zone, this_cpu_ptr(pcp));
                 changes++;
             }
-        }
 #endif
+        }
     }
 
     for_each_online_pgdat(pgdat) {
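A worked example of one decay step in decay_pcp_high() above, with an
invented PCP snapshot; only the max3() formula and the draining of
count - high come from the patch.

#include <stdio.h>

#define PCP_BATCH_SCALE_MAX 5

static int max3i(int a, int b, int c)
{
    int m = a > b ? a : b;

    return m > c ? m : c;
}

int main(void)
{
    int high = 1000, high_min = 200, count = 900, batch = 63;

    high = max3i(count - (batch << PCP_BATCH_SCALE_MAX), /* -1116 */
             high * 4 / 5,                               /* 800 */
             high_min);                                  /* 200 */
    printf("decayed high = %d\n", high);                 /* 800 */

    if (count > high)   /* 100 idle pages go back to the zone */
        printf("drain %d pages\n", count - high);
    return 0;
}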
From patchwork Wed Sep 20 06:18:54 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13392115
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying,
    Andrew Morton, Mel Gorman, Vlastimil Babka, David Hildenbrand,
    Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin,
    Matthew Wilcox, Christoph Lameter
Subject: [PATCH 08/10] mm, pcp: decrease PCP high if free pages < high watermark
Date: Wed, 20 Sep 2023 14:18:54 +0800
Message-Id: <20230920061856.257597-9-ying.huang@intel.com>
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>

One target of the PCP high auto-tuning is to minimize pages in the PCP
if the system's free pages are too few. To reach that target, when page
reclaim is active for the zone (ZONE_RECLAIM_ACTIVE), we stop increasing
PCP high in the allocation path, and decrease PCP high and free some
pages in the freeing path. But this may be too late, because the
background page reclaim may already introduce latency for some
workloads.

So, in this patch, during page allocation we detect whether the number
of free pages of the zone is below the high watermark. If so, we stop
increasing PCP high in the allocation path, and decrease PCP high and
free some pages in the freeing path. With this, we can reduce the
possibility of premature background page reclaim caused by a too large
PCP.

The high watermark checking is done in the allocation path to reduce
the overhead in the hotter freeing path.
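The shape of the check added to the allocation fast path is roughly the
following; the struct and helper are invented stand-ins for
zone_watermark_fast() and the test/set_bit() operations on zone->flags.

#include <stdbool.h>

struct zone_model {
    long free_pages;
    long high_wmark;
    bool below_high;    /* models the new ZONE_BELOW_HIGH bit */
};

static void allocation_path_check(struct zone_model *z)
{
    if (z->free_pages >= z->high_wmark)
        return;     /* plenty of memory, PCP high may keep growing */

    /*
     * Free pages dipped below the high watermark: mark the zone so
     * the freeing path shrinks PCP high, before kswapd even starts.
     */
    z->below_high = true;
}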
Signed-off-by: "Huang, Ying"
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 22 ++++++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d6cfb5023f3e..8a19e2af89df 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1006,6 +1006,7 @@ enum zone_flags {
                  * Cleared when kswapd is woken.
                  */
     ZONE_RECLAIM_ACTIVE,    /* kswapd may be scanning the zone. */
+    ZONE_BELOW_HIGH,        /* zone is below high watermark. */
 };
 
 static inline unsigned long zone_managed_pages(struct zone *zone)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 225abe56752c..3f8c7dfeed23 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2409,7 +2409,13 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
         return min(batch << 2, pcp->high);
     }
 
-    if (pcp->count >= high && high_min != high_max) {
+    if (high_min == high_max)
+        return high;
+
+    if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
+        pcp->high = max(high - (batch << pcp->free_factor), high_min);
+        high = max(pcp->count, high_min);
+    } else if (pcp->count >= high) {
         int need_high = (batch << pcp->free_factor) + batch;
 
         /* pcp->high should be large enough to hold batch freed pages */
@@ -2459,6 +2465,10 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
     if (pcp->count >= high) {
         free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
                    pcp, pindex);
+        if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
+            zone_watermark_ok(zone, 0, high_wmark_pages(zone),
+                      ZONE_MOVABLE, 0))
+            clear_bit(ZONE_BELOW_HIGH, &zone->flags);
     }
 }
 
@@ -2765,7 +2775,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
      * If we had larger pcp->high, we could avoid to allocate from
      * zone.
      */
-    if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
+    if (high_min != high_max && !test_bit(ZONE_BELOW_HIGH, &zone->flags))
         high = pcp->high = min(high + batch, high_max);
 
     if (!order) {
@@ -3226,6 +3236,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
             }
         }
 
+        mark = high_wmark_pages(zone);
+        if (zone_watermark_fast(zone, order, mark,
+                    ac->highest_zoneidx, alloc_flags,
+                    gfp_mask))
+            goto try_this_zone;
+        else if (!test_bit(ZONE_BELOW_HIGH, &zone->flags))
+            set_bit(ZONE_BELOW_HIGH, &zone->flags);
+
         mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
         if (!zone_watermark_fast(zone, order, mark,
                        ac->highest_zoneidx, alloc_flags,

From patchwork Wed Sep 20 06:18:55 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13392116
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying,
    Andrew Morton, Mel Gorman, Vlastimil Babka, David Hildenbrand,
    Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin,
    Matthew Wilcox, Christoph Lameter
Subject: [PATCH 09/10] mm, pcp: avoid to reduce PCP high unnecessarily
Date: Wed, 20 Sep 2023 14:18:55 +0800
Message-Id: <20230920061856.257597-10-ying.huang@intel.com>
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>

In the PCP high auto-tuning algorithm, to minimize idle pages in the
PCP, we decrease PCP high in the periodic vmstat updating kworker (via
refresh_cpu_vm_stats()) to try to free possibly idle PCP pages. One
issue is that even if the page allocating/freeing depth is larger than
the maximal PCP high, we may reduce PCP high unnecessarily.

To avoid the above issue, in this patch, we track the minimal PCP page
count, and the periodic PCP high decrement will not exceed the recent
minimal PCP page count. So, only detected idle pages will be freed.

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`. With the patch, the number of pages
allocated from the zone (instead of from the PCP) decreases by 25.8%.
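The effect of the cap can be seen with a small example; the numbers are
invented, and decrease = min(count_min, high / 5) paraphrases the new
code in decay_pcp_high() below (the old high * 4 / 5 decay becomes
high - high / 5 when the cap does not bind).

#include <stdio.h>

int main(void)
{
    int high = 1000, high_min = 200;
    int count_min = 50; /* the PCP never dropped below 50 pages */
    int decrease, new_high;

    /* decrease = min(pcp->count_min, pcp->high / 5); */
    decrease = count_min < high / 5 ? count_min : high / 5;

    new_high = high - decrease;
    if (new_high < high_min)
        new_high = high_min;

    /* 950 with the cap; the old 4/5 decay would have given 800 */
    printf("new high = %d\n", new_high);
    return 0;
}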
Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Mel Gorman Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter --- include/linux/mmzone.h | 1 + mm/page_alloc.c | 15 ++++++++++----- 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 8a19e2af89df..35b78c7522a7 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -682,6 +682,7 @@ enum zone_watermarks { struct per_cpu_pages { spinlock_t lock; /* Protects lists field */ int count; /* number of pages in the list */ + int count_min; /* minimal number of pages in the list recently */ int high; /* high watermark, emptying needed */ int high_min; /* min high watermark */ int high_max; /* max high watermark */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3f8c7dfeed23..77e9b7b51688 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2166,19 +2166,20 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, */ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp) { - int high_min, to_drain, batch; + int high_min, decrease, to_drain, batch; int todo = 0; high_min = READ_ONCE(pcp->high_min); batch = READ_ONCE(pcp->batch); /* - * Decrease pcp->high periodically to try to free possible - * idle PCP pages. And, avoid to free too many pages to - * control latency. + * Decrease pcp->high periodically to free idle PCP pages counted + * via pcp->count_min. And, avoid to free too many pages to + * control latency. This caps pcp->high decrement too. */ if (pcp->high > high_min) { + decrease = min(pcp->count_min, pcp->high / 5); pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX), - pcp->high * 4 / 5, high_min); + pcp->high - decrease, high_min); if (pcp->high > high_min) todo++; } @@ -2191,6 +2192,8 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp) todo++; } + pcp->count_min = pcp->count; + return todo; } @@ -2828,6 +2831,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, page = list_first_entry(list, struct page, pcp_list); list_del(&page->pcp_list); pcp->count -= 1 << order; + if (pcp->count < pcp->count_min) + pcp->count_min = pcp->count; } while (check_new_pages(page, order)); return page; From patchwork Wed Sep 20 06:18:56 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 13392117 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id BDA30CE79AC for ; Wed, 20 Sep 2023 06:20:16 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 58A926B011D; Wed, 20 Sep 2023 02:20:16 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 53B1C6B011E; Wed, 20 Sep 2023 02:20:16 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3DB7C6B011F; Wed, 20 Sep 2023 02:20:16 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 28C3F6B011D for ; Wed, 20 Sep 2023 02:20:16 -0400 (EDT) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id F1F6C16046F for ; Wed, 20 Sep 2023 06:20:15 
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13392117
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying,
    Andrew Morton, Mel Gorman, Vlastimil Babka, David Hildenbrand,
    Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin,
    Matthew Wilcox, Christoph Lameter
Subject: [PATCH 10/10] mm, pcp: reduce detecting time of consecutive high order page freeing
Date: Wed, 20 Sep 2023 14:18:56 +0800
Message-Id: <20230920061856.257597-11-ying.huang@intel.com>
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>

In the current PCP auto-tuning design, if the number of pages allocated
is much larger than the number of pages freed on a CPU, PCP high may
become the maximal value even if the allocating/freeing depth is small,
for example, in the sender of network workloads. If a CPU that was
originally used as a sender is used as a receiver after context
switching, we need to fill the whole PCP up to the maximal high before
triggering PCP draining for consecutive high-order freeing. This will
hurt the performance of some network workloads.

To solve the issue, in this patch, we track consecutive page freeing
with a counter instead of relying on the PCP draining. So, we can
detect consecutive page freeing much earlier.

On a 2-socket Intel server with 128 logical CPUs, we tested the
SCTP_STREAM_MANY test case of the netperf test suite with 64-pair
processes. With the patch, the network bandwidth improves by 3.1%.
This restores the performance drop caused by the PCP auto-tuning.
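To see why the counter detects consecutive freeing earlier, consider
this model; the batch value and the burst pattern are invented, and
only the "free_count >= batch" trigger and the batch <<
PCP_BATCH_SCALE_MAX cap come from the patch.

#include <stdio.h>

#define PCP_BATCH_SCALE_MAX 5

int main(void)
{
    int batch = 63;
    short free_count = 0;
    int i, order = 2;   /* a burst of order-2 (4-page) frees */

    for (i = 1; i <= 64; i++) {
        if (free_count < (batch << PCP_BATCH_SCALE_MAX))
            free_count += 1 << order;
        if (free_count >= batch) {
            /*
             * free_high can trigger here, after 16 frees, instead
             * of after filling the whole (possibly huge) PCP.
             */
            printf("consecutive freeing detected at free %d\n", i);
            break;
        }
    }
    return 0;
}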
Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Mel Gorman Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter --- include/linux/mmzone.h | 2 +- mm/page_alloc.c | 23 +++++++++++------------ 2 files changed, 12 insertions(+), 13 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 35b78c7522a7..44f6dc3cdeeb 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -689,10 +689,10 @@ struct per_cpu_pages { int batch; /* chunk size for buddy add/remove */ u8 flags; /* protected by pcp->lock */ u8 alloc_factor; /* batch scaling factor during allocate */ - u8 free_factor; /* batch scaling factor during free */ #ifdef CONFIG_NUMA u8 expire; /* When 0, remote pagesets are drained */ #endif + short free_count; /* consecutive free count */ /* Lists of pages, one per migrate type stored on the pcp-lists */ struct list_head lists[NR_PCP_LISTS]; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 77e9b7b51688..6ae2a5ebf7a4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2375,13 +2375,10 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free max_nr_free = high - batch; /* - * Double the number of pages freed each time there is subsequent - * freeing of pages without any allocation. + * Increase the batch number to the number of the consecutive + * freed pages to reduce zone lock contention. */ - batch <<= pcp->free_factor; - if (batch <= max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX) - pcp->free_factor++; - batch = clamp(batch, min_nr_free, max_nr_free); + batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free); return batch; } @@ -2408,7 +2405,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, * stored on pcp lists */ if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) { - pcp->high = max(high - (batch << pcp->free_factor), high_min); + pcp->high = max(high - pcp->free_count, high_min); return min(batch << 2, pcp->high); } @@ -2416,10 +2413,10 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, return high; if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) { - pcp->high = max(high - (batch << pcp->free_factor), high_min); + pcp->high = max(high - pcp->free_count, high_min); high = max(pcp->count, high_min); } else if (pcp->count >= high) { - int need_high = (batch << pcp->free_factor) + batch; + int need_high = pcp->free_count + batch; /* pcp->high should be large enough to hold batch freed pages */ if (pcp->high < need_high) @@ -2456,7 +2453,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, * stops will be drained from vmstat refresh context. 
*/ if (order && order <= PAGE_ALLOC_COSTLY_ORDER) { - free_high = (pcp->free_factor && + free_high = (pcp->free_count >= batch && (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) && (!(pcp->flags & PCPF_FREE_HIGH_BATCH) || pcp->count >= READ_ONCE(batch))); @@ -2464,6 +2461,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) { pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER; } + if (pcp->free_count < (batch << PCP_BATCH_SCALE_MAX)) + pcp->free_count += (1 << order); high = nr_pcp_high(pcp, zone, batch, free_high); if (pcp->count >= high) { free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high), @@ -2861,7 +2860,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, * See nr_pcp_free() where free_factor is increased for subsequent * frees. */ - pcp->free_factor >>= 1; + pcp->free_count >>= 1; list = &pcp->lists[order_to_pindex(migratetype, order)]; page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list); pcp_spin_unlock(pcp); @@ -5483,7 +5482,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta pcp->high_min = BOOT_PAGESET_HIGH; pcp->high_max = BOOT_PAGESET_HIGH; pcp->batch = BOOT_PAGESET_BATCH; - pcp->free_factor = 0; + pcp->free_count = 0; } static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min,