From patchwork Mon Jul 10 06:53:24 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13306322
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Andrew Morton,
    Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner,
    Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox
Subject: [RFC 1/2] mm: add framework for PCP high auto-tuning
Date: Mon, 10 Jul 2023 14:53:24 +0800
Message-Id: <20230710065325.290366-2-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230710065325.290366-1-ying.huang@intel.com>
References: <20230710065325.290366-1-ying.huang@intel.com>
MIME-Version: 1.0

The page allocation performance requirements of different workloads are
usually different, so we often need to tune PCP (per-CPU pageset) high to
optimize page allocation performance for a given workload.  There is already
a system-wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high by
hand, but it is hard to find the best value manually, and one global setting
may not work well for all of the workloads running on the same system.  One
solution to these issues is to tune the PCP high of each CPU automatically.
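
For reference, the default PCP high value is derived either from the zone's
low watermark (when the sysctl knob is 0) or from a fraction of the zone's
managed pages, and is then split across the CPUs local to the zone.  The
following is a minimal user-space sketch of that calculation, not kernel
code: the function default_pcp_high() and the example numbers are purely
illustrative, following the logic of zone_highsize() in the patch below.

#include <stdio.h>

/*
 * Illustrative sketch of how the default PCP high value is derived.
 * The function and parameter names are made up for the example.
 */
static long default_pcp_high(long managed_pages, long low_wmark_pages,
                             int nr_local_cpus, int high_fraction)
{
        long total_pages;

        if (!high_fraction)
                /* Knob unset: base the per-zone budget on the low watermark. */
                total_pages = low_wmark_pages;
        else
                /* Knob set: use a fraction of the zone's managed pages. */
                total_pages = managed_pages / high_fraction;

        /* Split the budget across the CPUs local to the zone. */
        return total_pages / nr_local_cpus;
}

int main(void)
{
        /* Example only: a 16 GiB zone (4 KiB pages) with 8 local CPUs. */
        printf("watermark based: %ld pages per CPU\n",
               default_pcp_high(4L * 1024 * 1024, 32768, 8, 0));
        printf("fraction = 8:    %ld pages per CPU\n",
               default_pcp_high(4L * 1024 * 1024, 32768, 8, 8));
        return 0;
}
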
This patch adds the framework for PCP high auto-tuning.  With it, pcp->high
is changed automatically by the tuning algorithm at runtime.  Its default
value (pcp->high_def) is the original PCP high value, calculated from the low
watermark pages or from the percpu_pagelist_high_fraction sysctl knob.  To
avoid putting too many pages into the PCP, the original lower limit of the
percpu_pagelist_high_fraction sysctl knob, MIN_PERCPU_PAGELIST_HIGH_FRACTION,
is used to calculate the maximum PCP high value (pcp->high_max).

This patch only adds the framework, so pcp->high is always set to
pcp->high_def.  The actual auto-tuning algorithm is added in the next patch
in the series.

Signed-off-by: "Huang, Ying"
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
---
 include/linux/mmzone.h |  5 ++-
 mm/page_alloc.c        | 79 +++++++++++++++++++++++++++---------------
 2 files changed, 55 insertions(+), 29 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a4889c9d4055..7e2c1864a9ea 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -663,6 +663,8 @@ struct per_cpu_pages {
         spinlock_t lock;        /* Protects lists field */
         int count;              /* number of pages in the list */
         int high;               /* high watermark, emptying needed */
+        int high_def;           /* default high watermark */
+        int high_max;           /* max high watermark */
         int batch;              /* chunk size for buddy add/remove */
         short free_factor;      /* batch scaling factor during free */
 #ifdef CONFIG_NUMA
@@ -820,7 +822,8 @@ struct zone {
          * the high and batch values are copied to individual pagesets for
          * faster access
          */
-        int pageset_high;
+        int pageset_high_def;
+        int pageset_high_max;
         int pageset_batch;
 
 #ifndef CONFIG_SPARSEMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..dd83c19f25c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2601,7 +2601,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch,
 static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
                        bool free_high)
 {
-        int high = READ_ONCE(pcp->high);
+        int high = pcp->high;
 
         if (unlikely(!high || free_high))
                 return 0;
@@ -2616,14 +2616,22 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
         return min(READ_ONCE(pcp->batch) << 2, high);
 }
 
+static void tune_pcp_high(struct per_cpu_pages *pcp, int high_def)
+{
+        pcp->high = high_def;
+}
+
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
                                    struct page *page, int migratetype,
                                    unsigned int order)
 {
-        int high;
+        int high, high_def;
         int pindex;
         bool free_high;
 
+        high_def = READ_ONCE(pcp->high_def);
+        tune_pcp_high(pcp, high_def);
+
         __count_vm_events(PGFREE, 1 << order);
         pindex = order_to_pindex(migratetype, order);
         list_add(&page->pcp_list, &pcp->lists[pindex]);
@@ -5976,14 +5984,15 @@ static int zone_batchsize(struct zone *zone)
 #endif
 }
 
-static int zone_highsize(struct zone *zone, int batch, int cpu_online)
+static int zone_highsize(struct zone *zone, int batch, int cpu_online,
+                         int high_fraction)
 {
 #ifdef CONFIG_MMU
         int high;
         int nr_split_cpus;
         unsigned long total_pages;
 
-        if (!percpu_pagelist_high_fraction) {
+        if (!high_fraction) {
                 /*
                  * By default, the high value of the pcp is based on the zone
                  * low watermark so that if they are full then background
@@ -5996,15 +6005,15 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
                  * value is based on a fraction of the managed pages in the
                  * zone.
                  */
-                total_pages = zone_managed_pages(zone) / percpu_pagelist_high_fraction;
+                total_pages = zone_managed_pages(zone) / high_fraction;
         }
 
         /*
          * Split the high value across all online CPUs local to the zone. Note
          * that early in boot that CPUs may not be online yet and that during
          * CPU hotplug that the cpumask is not yet updated when a CPU is being
-         * onlined. For memory nodes that have no CPUs, split pcp->high across
-         * all online CPUs to mitigate the risk that reclaim is triggered
+         * onlined. For memory nodes that have no CPUs, split the high value
+         * across all online CPUs to mitigate the risk that reclaim is triggered
          * prematurely due to pages stored on pcp lists.
          */
         nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
@@ -6032,19 +6041,21 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
  * However, guaranteeing these relations at all times would require e.g. write
  * barriers here but also careful usage of read barriers at the read side, and
  * thus be prone to error and bad for performance. Thus the update only prevents
- * store tearing. Any new users of pcp->batch and pcp->high should ensure they
- * can cope with those fields changing asynchronously, and fully trust only the
- * pcp->count field on the local CPU with interrupts disabled.
+ * store tearing. Any new users of pcp->batch, pcp->high_def and pcp->high_max
+ * should ensure they can cope with those fields changing asynchronously, and
+ * fully trust only the pcp->count field on the local CPU with interrupts
+ * disabled.
  *
  * mutex_is_locked(&pcp_batch_high_lock) required when calling this function
  * outside of boot time (or some other assurance that no concurrent updaters
  * exist).
  */
-static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
-                unsigned long batch)
+static void pageset_update(struct per_cpu_pages *pcp, unsigned long high_def,
+                unsigned long high_max, unsigned long batch)
 {
         WRITE_ONCE(pcp->batch, batch);
-        WRITE_ONCE(pcp->high, high);
+        WRITE_ONCE(pcp->high_def, high_def);
+        WRITE_ONCE(pcp->high_max, high_max);
 }
 
 static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats)
@@ -6064,20 +6075,21 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
          * need to be as careful as pageset_update() as nobody can access the
          * pageset yet.
          */
-        pcp->high = BOOT_PAGESET_HIGH;
+        pcp->high_def = BOOT_PAGESET_HIGH;
+        pcp->high_max = BOOT_PAGESET_HIGH;
         pcp->batch = BOOT_PAGESET_BATCH;
         pcp->free_factor = 0;
 }
 
-static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
-                unsigned long batch)
+static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_def,
+                unsigned long high_max, unsigned long batch)
 {
         struct per_cpu_pages *pcp;
         int cpu;
 
         for_each_possible_cpu(cpu) {
                 pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-                pageset_update(pcp, high, batch);
+                pageset_update(pcp, high_def, high_max, batch);
         }
 }
 
@@ -6087,19 +6099,26 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
  */
 static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online)
 {
-        int new_high, new_batch;
+        int new_high_def, new_high_max, new_batch;
 
         new_batch = max(1, zone_batchsize(zone));
-        new_high = zone_highsize(zone, new_batch, cpu_online);
+        new_high_def = zone_highsize(zone, new_batch, cpu_online,
+                                     percpu_pagelist_high_fraction);
+        new_high_max = zone_highsize(zone, new_batch, cpu_online,
+                                     MIN_PERCPU_PAGELIST_HIGH_FRACTION);
+        new_high_def = min(new_high_def, new_high_max);
 
-        if (zone->pageset_high == new_high &&
+        if (zone->pageset_high_def == new_high_def &&
+            zone->pageset_high_max == new_high_max &&
             zone->pageset_batch == new_batch)
                 return;
 
-        zone->pageset_high = new_high;
+        zone->pageset_high_def = new_high_def;
+        zone->pageset_high_max = new_high_max;
         zone->pageset_batch = new_batch;
 
-        __zone_set_pageset_high_and_batch(zone, new_high, new_batch);
+        __zone_set_pageset_high_and_batch(zone, new_high_def, new_high_max,
+                                          new_batch);
 }
 
 void __meminit setup_zone_pageset(struct zone *zone)
@@ -6175,7 +6194,8 @@ __meminit void zone_pcp_init(struct zone *zone)
          */
         zone->per_cpu_pageset = &boot_pageset;
         zone->per_cpu_zonestats = &boot_zonestats;
-        zone->pageset_high = BOOT_PAGESET_HIGH;
+        zone->pageset_high_def = BOOT_PAGESET_HIGH;
+        zone->pageset_high_max = BOOT_PAGESET_HIGH;
         zone->pageset_batch = BOOT_PAGESET_BATCH;
 
         if (populated_zone(zone))
@@ -6619,9 +6639,11 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
 }
 
 /*
- * percpu_pagelist_high_fraction - changes the pcp->high for each zone on each
- * cpu. It is the fraction of total pages in each zone that a hot per cpu
- * pagelist can have before it gets flushed back to buddy allocator.
+ * percpu_pagelist_high_fraction - changes the pcp->high_def for each zone on
+ * each cpu. It is the fraction of total pages in each zone that a hot per cpu
+ * pagelist can have before it gets flushed back to buddy allocator. This
+ * only sets the default value; the actual value may be tuned automatically at
+ * runtime.
  */
 int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *table, int write,
                 void *buffer, size_t *length, loff_t *ppos)
@@ -7008,13 +7030,14 @@ EXPORT_SYMBOL(free_contig_range);
 void zone_pcp_disable(struct zone *zone)
 {
         mutex_lock(&pcp_batch_high_lock);
-        __zone_set_pageset_high_and_batch(zone, 0, 1);
+        __zone_set_pageset_high_and_batch(zone, 0, 0, 1);
         __drain_all_pages(zone, true);
 }
 
 void zone_pcp_enable(struct zone *zone)
 {
-        __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+        __zone_set_pageset_high_and_batch(zone, zone->pageset_high_def,
+                        zone->pageset_high_max, zone->pageset_batch);
         mutex_unlock(&pcp_batch_high_lock);
 }
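
The contract this framework suggests for any tuning policy can be summarized
as follows: pcp->high_def and pcp->high_max are written only when the
configuration changes, and the dynamically tuned pcp->high is expected to
stay within [high_def, high_max] (as the next patch does with clamp()).  The
following stand-alone C sketch illustrates that invariant; the struct and
function here are simplified stand-ins, not the kernel definitions.

#include <assert.h>

/* Simplified stand-in for struct per_cpu_pages; not the kernel layout. */
struct pcp_model {
        int high;       /* value actually used to trim the PCP lists */
        int high_def;   /* default, from the watermark or the sysctl knob */
        int high_max;   /* cap, from MIN_PERCPU_PAGELIST_HIGH_FRACTION */
};

/* A tuning policy is expected to keep high_def <= high <= high_max. */
static void set_tuned_high(struct pcp_model *pcp, int wanted)
{
        int high = wanted;

        if (high < pcp->high_def)
                high = pcp->high_def;   /* never cache less than the default */
        if (high > pcp->high_max)
                high = pcp->high_max;   /* never cache more than the cap */
        pcp->high = high;
}

int main(void)
{
        struct pcp_model pcp = { .high = 0, .high_def = 512, .high_max = 4096 };

        set_tuned_high(&pcp, 100000);   /* clamped down to high_max */
        assert(pcp.high == 4096);
        set_tuned_high(&pcp, 0);        /* clamped up to high_def */
        assert(pcp.high == 512);
        return 0;
}

With only this patch applied, tune_pcp_high() is the degenerate policy that
always requests high_def, so pcp->high never moves from its default.
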
From patchwork Mon Jul 10 06:53:25 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13306323
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Andrew Morton,
    Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner,
    Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox
Subject: [RFC 2/2] mm: alloc/free depth based PCP high auto-tuning
Date: Mon, 10 Jul 2023 14:53:25 +0800
Message-Id: <20230710065325.290366-3-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230710065325.290366-1-ying.huang@intel.com>
References: <20230710065325.290366-1-ying.huang@intel.com>
MIME-Version: 1.0
To tune PCP high for each CPU automatically, this patch implements an
allocation/freeing depth based PCP high auto-tuning algorithm.

The basic idea behind the algorithm is to detect repetitive allocation and
freeing patterns over a short enough period (about 1 second).  The period
needs to be short so that the algorithm responds quickly to changes in the
allocation and freeing pattern and limits the memory wasted by unnecessary
caching.

To detect repetitive allocation and freeing patterns, the alloc/free depth is
calculated for each tuning period (1 second) on each CPU.  To calculate the
alloc/free depth, we track an alloc count, which increases on page allocation
from the PCP and decreases on page freeing to the PCP.  The alloc depth is
the maximum difference between a later, larger alloc count and an earlier,
smaller one; the free depth is the maximum difference between an earlier,
larger alloc count and a later, smaller one.

Then the average alloc/free depth over multiple tuning periods is calculated,
with the old alloc/free depth decaying gradually in the average.  Finally,
the PCP high is set to the smaller of the average alloc depth and the average
free depth, clamped between the default and the maximum PCP high.  In this
way, pure allocation or pure freeing will not enlarge the PCP high, because
the PCP does not help in those cases.
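
To make the bookkeeping concrete, here is a minimal user-space model of the
per-CPU counters described above.  It is illustrative only: the names are
made up, the timing is compressed into an explicit end_of_period() call, and
the "reset after two idle periods" case is omitted.

#include <stdio.h>

/* Illustrative model of the per-CPU tuning state described above. */
struct depth_model {
        int alloc_count;        /* +1 per page allocated, -1 per page freed */
        int alloc_high;         /* max of alloc_count this period */
        int alloc_low;          /* min of alloc_count this period */
        int alloc_depth;        /* largest rise above an earlier low */
        int free_depth;         /* largest fall below an earlier high */
        int avg_alloc_depth;    /* decayed average of alloc_depth */
        int avg_free_depth;     /* decayed average of free_depth */
};

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

static void on_alloc(struct depth_model *m, int pages)
{
        m->alloc_count += pages;
        m->alloc_high = imax(m->alloc_high, m->alloc_count);
        m->alloc_depth = imax(m->alloc_depth, m->alloc_count - m->alloc_low);
}

static void on_free(struct depth_model *m, int pages)
{
        m->alloc_count -= pages;
        m->alloc_low = imin(m->alloc_low, m->alloc_count);
        m->free_depth = imax(m->free_depth, m->alloc_high - m->alloc_count);
}

/* At the end of each ~1 s period: decay the averages and pick a new high. */
static int end_of_period(struct depth_model *m, int high_def, int high_max)
{
        int high;

        m->avg_alloc_depth = (m->avg_alloc_depth + m->alloc_depth) / 2;
        m->avg_free_depth = (m->avg_free_depth + m->free_depth) / 2;
        m->alloc_high = m->alloc_low = m->alloc_count = 0;
        m->alloc_depth = m->free_depth = 0;

        /* Pure allocation or pure freeing leaves one of the depths at 0. */
        high = imin(m->avg_alloc_depth, m->avg_free_depth);
        return imin(imax(high, high_def), high_max);    /* clamp */
}

int main(void)
{
        struct depth_model m = { 0 };
        int period;

        /* A workload that allocates and then frees 32768 pages each period. */
        for (period = 0; period < 4; period++) {
                on_alloc(&m, 32768);
                on_free(&m, 32768);
                printf("period %d: high -> %d\n", period,
                       end_of_period(&m, 512, 65536));
        }
        return 0;
}

In this example the tuned high converges toward the workload's real
alloc/free depth over a few periods (16384, 24576, 28672, 30720 pages) rather
than jumping there immediately, which is the effect of the decayed average.
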
We have tested the algorithm with several workloads on Intel's 2-socket
server machines.

Will-it-scale/page_fault1
=========================

On one socket of the system with 56 cores, 56 workload processes are run to
stress the kernel memory allocator.  Each workload process is put in a
different memcg to eliminate the LRU lock contention.

                                       base     optimized
                                       ----     ---------
Throughput (GB/s)                      34.3          75.0
native_queued_spin_lock_slowpath%      60.9           0.2

This is a simple workload in which each process allocates 128 MB of pages and
then frees them, repetitively.  So it is quite easy to detect its allocation
and freeing pattern and adjust PCP high accordingly.  The optimized kernel
almost eliminates the lock contention cycles% (from 60.9% to 0.2%), and its
benchmark score increases by 118.7%.

Kbuild
======

"make -j 224" is used to build the kernel in parallel on the 2-socket server
system with 224 logical CPUs.

                                       base     optimized
                                       ----     ---------
Build time (s)                       162.67        162.67
native_queued_spin_lock_slowpath%     17.00         12.28
rmqueue%                              11.53          8.33
free_unref_page_list%                  3.85          0.54
folio_lruvec_lock_irqsave%             1.21          1.96

The optimized kernel reduces the cycles spent in the page allocation/freeing
functions from ~15.38% to ~8.87% by enlarging the PCP high when necessary.
The system-overhead lock contention cycles% decreases too, but the benchmark
score shows no visible change; there should be other bottlenecks.

We also captured /proc/meminfo during the test.  A short while (about 10 s)
after the beginning of the test, the Memused (MemTotal - MemFree) of the
optimized kernel is higher than that of the base kernel because of the
increased PCP high.  But in the second half of the test, the Memused of the
optimized kernel decreases back to the level of the base kernel.  That is,
PCP high is decreased effectively as the page allocation requirements drop.

Netperf/SCTP_STREAM_MANY
========================

On another 2-socket server with 128 logical CPUs, 64 netperf processes are
run, with netserver running on the same machine (that is, the loopback
network is used).

                                       base     optimized
                                       ----     ---------
Throughput (MB/s)                      7136          8489    +19.0%
vmstat.cpu.id%                        73.05         63.73     -9.3
vmstat.procs.r                         34.1          45.6    +33.5%
meminfo.Memused                     5479861       8492875    +55.0%
perf-stat.ps.cpu-cycles            1.04e+11      1.38e+11    +32.3%
perf-stat.ps.instructions          0.96e+11      1.14e+11    +17.8%
perf-profile.free_unref_page%          2.46          1.65     -0.8
latency.99%.__alloc_pages              4.28          2.21    -48.4%
latency.99%.__free_unref_page          4.11          0.87    -78.8%

From the test results, the throughput of the benchmark increases by 19.0%.
That comes from the increased CPU cycles and instructions per second
(perf-stat.ps.cpu-cycles and perf-stat.ps.instructions), that is, from
reduced CPU idle time.  And perf-profile shows that the page allocator
cycles% is not high.  So the reduced CPU idle time may come from the reduced
page allocation/freeing latency, which influences the network behavior.

Signed-off-by: "Huang, Ying"
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
---
 include/linux/mmzone.h |  8 +++++++
 mm/page_alloc.c        | 50 ++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7e2c1864a9ea..cd9b497cd596 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -670,6 +670,14 @@ struct per_cpu_pages {
 #ifdef CONFIG_NUMA
         short expire;           /* When 0, remote pagesets are drained */
 #endif
+        int alloc_count;        /* alloc/free count from tune period start */
+        int alloc_high;         /* max alloc count from tune period start */
+        int alloc_low;          /* min alloc count from tune period start */
+        int alloc_depth;        /* alloc depth from tune period start */
+        int free_depth;         /* free depth from tune period start */
+        int avg_alloc_depth;    /* average alloc depth */
+        int avg_free_depth;     /* average free depth */
+        unsigned long tune_start;       /* tune period start timestamp */
 
         /* Lists of pages, one per migrate type stored on the pcp-lists */
         struct list_head lists[NR_PCP_LISTS];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd83c19f25c6..4d627d96e41a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2616,9 +2616,38 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
         return min(READ_ONCE(pcp->batch) << 2, high);
 }
 
+#define PCP_AUTO_TUNE_PERIOD    HZ
+
 static void tune_pcp_high(struct per_cpu_pages *pcp, int high_def)
 {
-        pcp->high = high_def;
+        unsigned long now = jiffies;
+        int high_max, high;
+
+        if (likely(now - pcp->tune_start <= PCP_AUTO_TUNE_PERIOD))
+                return;
+
+        /* No alloc/free in last 2 tune period, reset */
+        if (now - pcp->tune_start > 2 * PCP_AUTO_TUNE_PERIOD) {
+                pcp->tune_start = now;
+                pcp->alloc_high = pcp->alloc_low = pcp->alloc_count = 0;
+                pcp->alloc_depth = pcp->free_depth = 0;
+                pcp->avg_alloc_depth = pcp->avg_free_depth = 0;
+                pcp->high = high_def;
+                return;
+        }
+
+        /* End of tune period, try to tune PCP high automatically */
+        pcp->tune_start = now;
+        /* The old alloc/free depth decay with time */
+        pcp->avg_alloc_depth = (pcp->avg_alloc_depth + pcp->alloc_depth) / 2;
+        pcp->avg_free_depth = (pcp->avg_free_depth + pcp->free_depth) / 2;
+        /* Reset for next tune period */
+        pcp->alloc_high = pcp->alloc_low = pcp->alloc_count = 0;
+        pcp->alloc_depth = pcp->free_depth = 0;
+        /* Pure alloc/free will not increase PCP high */
+        high = min(pcp->avg_alloc_depth, pcp->avg_free_depth);
+        high_max = READ_ONCE(pcp->high_max);
+        pcp->high = clamp(high, high_def, high_max);
 }
 
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
@@ -2630,7 +2659,19 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
         bool free_high;
 
         high_def = READ_ONCE(pcp->high_def);
-        tune_pcp_high(pcp, high_def);
+        /* PCP is disabled or boot pageset */
+        if (unlikely(!high_def)) {
+                pcp->high = high_def;
+                pcp->tune_start = 0;
+        } else {
+                /* free count as negative allocation */
+                pcp->alloc_count -= (1 << order);
+                pcp->alloc_low = min(pcp->alloc_low, pcp->alloc_count);
+                /* max free depth from the start of current tune period */
+                pcp->free_depth = max(pcp->free_depth,
+                                      pcp->alloc_high - pcp->alloc_count);
+                tune_pcp_high(pcp, high_def);
+        }
 
         __count_vm_events(PGFREE, 1 << order);
         pindex = order_to_pindex(migratetype, order);
@@ -2998,6 +3039,11 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
                 return NULL;
         }
 
+        pcp->alloc_count += (1 << order);
+        pcp->alloc_high = max(pcp->alloc_high, pcp->alloc_count);
+        /* max alloc depth from the start of current tune period */
+        pcp->alloc_depth = max(pcp->alloc_depth, pcp->alloc_count - pcp->alloc_low);
+
         /*
          * On allocation, reduce the number of pages that are batch freed.
          * See nr_pcp_free() where free_factor is increased for subsequent