From patchwork Thu May 11 06:56:07 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13237531
From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Andrew Morton,
 Huang Ying, Mel Gorman, Vlastimil Babka, David Hildenbrand,
 Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin,
 Matthew Wilcox
Subject: [RFC 6/6] mm: prefer different zone list on different logical CPU
Date: Thu, 11 May 2023 14:56:07 +0800
Message-Id: <20230511065607.37407-7-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230511065607.37407-1-ying.huang@intel.com>
References: <20230511065607.37407-1-ying.huang@intel.com>

Originally, each NUMA node has only one fallback and one no-fallback
zone list (pglist_data->node_zonelists).  That is, all logical CPUs of
a NUMA node use the same zone list during page allocation.  This wasn't
a problem before, because there was at most one zone instance for each
zone type.  Now that multiple zone instances may be created for one
zone type, different logical CPUs can prefer different zone instances
of the same zone type, which improves zone lock scalability.

So, in this patch, multiple fallback and multiple no-fallback zone
lists are built for each NUMA node.  The number of lists of each kind
is the maximum number of zone instances of any one zone type on that
node (pgdat->max_nr_zones_per_type).  Each logical CPU then selects a
zone list by its logical CPU number modulo that count.  List N orders
the instances of each zone type starting from instance N (wrapping
around), so different logical CPUs try different zone instances first.

Combined with the previous patches in the series, this effectively
reduces zone lock contention in the kbuild test case.  Details can be
found in the description of the previous patch of the series.

Signed-off-by: "Huang, Ying"
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
---
 include/linux/gfp.h    |  6 ++-
 include/linux/mmzone.h |  5 ++-
 mm/mempolicy.c         |  6 ++-
 mm/mm_init.c           |  2 +-
 mm/page_alloc.c        | 93 ++++++++++++++++++++++++++++--------------
 5 files changed, 77 insertions(+), 35 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 60b5f43792ec..12903098122f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -164,7 +164,11 @@ static inline int gfp_zonelist(gfp_t flags)
  */
 static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 {
-	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
+	pg_data_t *pgdat = NODE_DATA(nid);
+	int li;
+
+	li = raw_smp_processor_id() % pgdat->max_nr_zones_per_type;
+	return pgdat->node_zonelists[li] + gfp_zonelist(flags);
 }
 
 #ifndef HAVE_ARCH_FREE_PAGE
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1a9b47bfc71d..11481f921697 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1240,10 +1240,11 @@ typedef struct pglist_data {
 	 * Generally the first zones will be references to this node's
 	 * node_zones.
 	 */
-	struct zonelist node_zonelists[MAX_ZONELISTS];
+	struct zonelist node_zonelists[MAX_NR_ZONES_PER_TYPE][MAX_ZONELISTS];
 
 	int nr_zones; /* number of populated zones in this node */
 	int nr_zone_types;
+	int max_nr_zones_per_type;
 #ifdef CONFIG_FLATMEM	/* means !SPARSEMEM */
 	struct page *node_mem_map;
 #ifdef CONFIG_PAGE_EXTENSION
@@ -1699,7 +1700,7 @@ static inline bool movable_only_nodes(nodemask_t *nodes)
 	 * at least one zone that can satisfy kernel allocations.
 	 */
 	nid = first_node(*nodes);
-	zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];
+	zonelist = &NODE_DATA(nid)->node_zonelists[0][ZONELIST_FALLBACK];
 	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
 	return (!z->zone) ? true : false;
 }
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0e0ce31a623c..35d3793e6a19 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1924,7 +1924,11 @@ unsigned int mempolicy_slab_node(void)
 		 */
 		struct zonelist *zonelist;
 		enum zone_type highest_zone_type = gfp_zone(GFP_KERNEL);
-		zonelist = &NODE_DATA(node)->node_zonelists[ZONELIST_FALLBACK];
+		pg_data_t *pgdat = NODE_DATA(node);
+		int li;
+
+		li = raw_smp_processor_id() % pgdat->max_nr_zones_per_type;
+		zonelist = &pgdat->node_zonelists[li][ZONELIST_FALLBACK];
 		z = first_zones_zonelist(zonelist, highest_zone_type,
 							&policy->nodes);
 		return z->zone ? zone_to_nid(z->zone) : node;
diff --git a/mm/mm_init.c b/mm/mm_init.c
index c1883362e71d..b950bdfc43f3 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -40,7 +40,7 @@ void __init mminit_verify_zonelist(void)
 			/* Identify the zone and nodelist */
 			zoneid = i % MAX_NR_ZONES;
 			listid = i / MAX_NR_ZONES;
-			zonelist = &pgdat->node_zonelists[listid];
+			zonelist = &pgdat->node_zonelists[0][listid];
 			zone = &pgdat->node_zones[zoneid];
 			if (!populated_zone(zone))
 				continue;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d60fedc6961b..b03ea2f23d93 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6351,20 +6351,25 @@ static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
  *
  * Add all populated zones of a node to the zonelist.
  */
-static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
+static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs,
+			       int zidx_in_type)
 {
+	struct zone_type_struct *zts;
 	struct zone *zone;
-	int zid = MAX_NR_ZONES;
-	int nr_zones = 0;
+	int zt, i, nr, nr_zones = 0;
 
-	do {
-		zid--;
-		zone = pgdat->node_zones + zid;
-		if (populated_zone(zone)) {
+	for (zt = MAX_NR_ZONE_TYPES - 1; zt >= 0; zt--) {
+		zts = pgdat->node_zone_types + zt;
+		if (!zts->present_pages)
+			continue;
+		nr = zts->last_zone_idx - zts->start_zone_idx + 1;
+		for (i = 0; i < nr; i++) {
+			zone = pgdat->node_zones + zts->start_zone_idx;
+			zone += (zidx_in_type + i) % nr;
 			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
 			check_highest_zone(zone_type_num(zone));
 		}
-	} while (zid);
+	}
 
 	return nr_zones;
 }
@@ -6462,27 +6467,48 @@ int find_next_best_node(int node, nodemask_t *used_node_mask)
 }
 
 
+static void __build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
+					    unsigned nr_nodes, int zidx_in_type)
+{
+	struct zoneref *zonerefs;
+	int i;
+
+	zonerefs = pgdat->node_zonelists[zidx_in_type][ZONELIST_FALLBACK]._zonerefs;
+
+	for (i = 0; i < nr_nodes; i++) {
+		int nr_zones;
+
+		pg_data_t *node = NODE_DATA(node_order[i]);
+
+		nr_zones = build_zonerefs_node(node, zonerefs, zidx_in_type);
+		zonerefs += nr_zones;
+	}
+	zonerefs->zone = NULL;
+	zonerefs->zone_type = 0;
+}
+
 /*
  * Build zonelists ordered by node and zones within node.
  * This results in maximum locality--normal zone overflows into local
  * DMA zone, if any--but risks exhausting DMA zone.
  */
 static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
-		unsigned nr_nodes)
+					  unsigned nr_nodes)
 {
-	struct zoneref *zonerefs;
 	int i;
 
-	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
+	for (i = 0; i < pgdat->max_nr_zones_per_type; i++)
+		__build_zonelists_in_node_order(pgdat, node_order, nr_nodes, i);
+}
 
-	for (i = 0; i < nr_nodes; i++) {
-		int nr_zones;
+static void __build_thisnode_zonelists(pg_data_t *pgdat, int zidx_in_type)
+{
+	struct zoneref *zonerefs;
+	int nr_zones;
 
-		pg_data_t *node = NODE_DATA(node_order[i]);
-
-		nr_zones = build_zonerefs_node(node, zonerefs);
-		zonerefs += nr_zones;
-	}
+	zonerefs = pgdat->node_zonelists[zidx_in_type][ZONELIST_NOFALLBACK]._zonerefs;
+	nr_zones = build_zonerefs_node(pgdat, zonerefs, zidx_in_type);
+	zonerefs += nr_zones;
 	zonerefs->zone = NULL;
 	zonerefs->zone_type = 0;
 }
@@ -6492,14 +6518,10 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
  */
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
-	struct zoneref *zonerefs;
-	int nr_zones;
+	int i;
 
-	zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs;
-	nr_zones = build_zonerefs_node(pgdat, zonerefs);
-	zonerefs += nr_zones;
-	zonerefs->zone = NULL;
-	zonerefs->zone_type = 0;
+	for (i = 0; i < pgdat->max_nr_zones_per_type; i++)
+		__build_thisnode_zonelists(pgdat, i);
 }
 
 /*
@@ -6565,7 +6587,7 @@ static void setup_min_unmapped_ratio(void);
 static void setup_min_slab_ratio(void);
 #else	/* CONFIG_NUMA */
 
-static void build_zonelists(pg_data_t *pgdat)
+static void __build_zonelists(pg_data_t *pgdat, int zidx_in_type)
 {
 	int node, local_node;
 	struct zoneref *zonerefs;
@@ -6573,8 +6595,8 @@ static void build_zonelists(pg_data_t *pgdat)
 
 	local_node = pgdat->node_id;
 
-	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
-	nr_zones = build_zonerefs_node(pgdat, zonerefs);
+	zonerefs = pgdat->node_zonelists[zidx_in_type][ZONELIST_FALLBACK]._zonerefs;
+	nr_zones = build_zonerefs_node(pgdat, zonerefs, zidx_in_type);
 	zonerefs += nr_zones;
 
 	/*
@@ -6588,13 +6610,13 @@ static void build_zonelists(pg_data_t *pgdat)
 	for (node = local_node + 1; node < MAX_NUMNODES; node++) {
 		if (!node_online(node))
 			continue;
-		nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs);
+		nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs, zidx_in_type);
 		zonerefs += nr_zones;
 	}
 	for (node = 0; node < local_node; node++) {
 		if (!node_online(node))
 			continue;
-		nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs);
+		nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs, zidx_in_type);
 		zonerefs += nr_zones;
 	}
 
@@ -6602,6 +6624,14 @@ static void build_zonelists(pg_data_t *pgdat)
 	zonerefs->zone_type = 0;
 }
 
+static void build_zonelists(pg_data_t *pgdat)
+{
+	int i;
+
+	for (i = 0; i < pgdat->max_nr_zones_per_type; i++)
+		__build_zonelists(pgdat, i);
+}
+
 #endif	/* CONFIG_NUMA */
 
 /*
@@ -7899,6 +7929,7 @@ static void __init zones_init(struct pglist_data *pgdat)
 	int split_nr;
 
 	BUILD_BUG_ON(MAX_NR_ZONES_PER_TYPE > __MAX_NR_SPLIT_ZONES + 1);
+	pgdat->max_nr_zones_per_type = 1;
 
 	for (zt = 0; zt < MAX_NR_ZONE_TYPES; zt++) {
 		zts = pgdat->node_zone_types + zt;
@@ -7925,6 +7956,8 @@ static void __init zones_init(struct pglist_data *pgdat)
 				start_pfn = end_pfn;
 				zone++;
 			}
+			if (i > pgdat->max_nr_zones_per_type)
+				pgdat->max_nr_zones_per_type = i;
 		} else {
 			zone->type = zt;
 			zone->zone_start_pfn = zts->zts_start_pfn;