From: Huang Ying
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven, Andrew Morton, Huang Ying,
    Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner,
    Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox
Subject: [RFC 5/6] mm: create multiple zone instances for one zone type based on memory size
Date: Thu, 11 May 2023 14:56:06 +0800
Message-Id: <20230511065607.37407-6-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230511065607.37407-1-ying.huang@intel.com>
References: <20230511065607.37407-1-ying.huang@intel.com>
MIME-Version: 1.0

More and more cores are put into one physical CPU (usually one NUMA node as
well).  In 2023, a high-end server CPU has 56, 64, or more cores, and even
more cores per physical CPU are planned for future products.  Yet in most
cases all cores in one physical CPU contend for page allocation on a single
zone.  This causes heavy zone lock contention in some workloads, and the
situation will only get worse as core counts grow.
For example, on a 2-socket Intel server with 224 logical CPUs, if the kernel
is built with `make -j224`, the zone lock contention cycles% can reach about
12.7%.

To reduce zone lock contention and improve scalability, this patch generally
creates one zone instance for roughly every 256 GB of memory of a zone type.
In the next patch of the series, different logical CPUs will prefer different
zone instances based on the logical CPU number, so the number of logical CPUs
contending on each zone is reduced and scalability improves.

Combined with the next patch in the series ("mm: prefer different zone list
on different logical CPU"), the zone lock contention cycles% drops to less
than 1.6% in the above kbuild test case when 4 zone instances are created for
ZONE_NORMAL.  We also tested will-it-scale/page_fault1 with 16 processes.
With the optimization, the benchmark score increases by up to 18.2% and the
zone lock contention drops from 13.01% to 0.56%.

Another way to decide how many zone instances to create for a zone type is to
base the number on the total number of logical CPUs.  We choose memory size
because it is easier to implement.  In most cases, the more cores a system
has, the more memory it has; and on systems with more memory, the performance
requirements on the page allocator are usually higher.

Signed-off-by: "Huang, Ying"
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
---
 include/linux/mmzone.h            |  13 +++-
 include/linux/page-flags-layout.h |   2 +
 mm/page_alloc.c                   | 104 ++++++++++++++++++++++++++----
 3 files changed, 107 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 18d64cf1263c..1a9b47bfc71d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -800,7 +800,12 @@ enum zone_type {
 };
 
-#define __MAX_NR_ZONES __MAX_NR_ZONE_TYPES
+#ifdef CONFIG_64BIT
+#define __MAX_NR_SPLIT_ZONES 4
+#else
+#define __MAX_NR_SPLIT_ZONES 0
+#endif
+#define __MAX_NR_ZONES (__MAX_NR_ZONE_TYPES + __MAX_NR_SPLIT_ZONES)
 
 #ifndef __GENERATING_BOUNDS_H
 
@@ -1106,6 +1111,12 @@ static inline bool zone_intersects(struct zone *zone,
 	return true;
 }
 
+#ifdef CONFIG_64BIT
+#define MAX_NR_ZONES_PER_TYPE 4
+#else
+#define MAX_NR_ZONES_PER_TYPE 1
+#endif
+
 struct zone_type_struct {
 	int start_zone_idx;
 	int last_zone_idx;
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index fcf194125768..a1f307720bea 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -32,6 +32,8 @@
 #define ZONES_SHIFT 2
 #elif MAX_NR_ZONES <= 8
 #define ZONES_SHIFT 3
+#elif MAX_NR_ZONES <= 16
+#define ZONES_SHIFT 4
 #else
 #error ZONES_SHIFT "Too many zones configured"
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 11e7e182bf5c..d60fedc6961b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7836,24 +7836,106 @@ void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
 }
 #endif
 
+static void __init setup_zone_size(struct pglist_data *pgdat, struct zone *zone,
+				   enum zone_type zt, unsigned long start_pfn,
+				   unsigned long end_pfn)
+{
+	unsigned long spanned, absent;
+	unsigned long zstart_pfn, zend_pfn;
+
+	spanned = zone_spanned_pages_in_node(pgdat->node_id, zt,
+					     start_pfn,
+					     end_pfn,
+					     &zstart_pfn,
+					     &zend_pfn);
+	absent = zone_absent_pages_in_node(pgdat->node_id, zt,
+					   start_pfn,
+					   end_pfn);
+	zone->zone_start_pfn = zstart_pfn;
+	zone->spanned_pages = spanned;
+	zone->present_pages = spanned - absent;
+#if defined(CONFIG_MEMORY_HOTPLUG)
+	zone->present_early_pages = zone->present_pages;
+#endif
+}
+
+#define SPLIT_ZONE_ALIGN_PAGES ((1UL * 1024 * 1024 * 1024) >> PAGE_SHIFT)
+
+#ifdef CONFIG_64BIT
+/* 254GB instead of 256GB to deal with ZONE_DMA32 and small memory holes */
+#define SPLIT_ZONE_PAGES ((254UL * 1024 * 1024 * 1024) >> PAGE_SHIFT)
+
+static int split_zone_type_number(struct pglist_data *pgdat,
+				  struct zone_type_struct *zts,
+				  struct zone *zone)
+{
+	int nr, remaining;
+
+	if (zts->present_pages < SPLIT_ZONE_PAGES * 2)
+		return 1;
+
+	/* Remaining number of zones can be used for the zone type */
+	remaining = 1 + (MAX_NR_ZONES - MAX_NR_ZONE_TYPES) -
+		((zone - pgdat->node_zones) - (zts - pgdat->node_zone_types));
+	nr = zts->present_pages / SPLIT_ZONE_PAGES;
+	nr = min3(nr, remaining, MAX_NR_ZONES_PER_TYPE);
+
+	return nr;
+}
+#else
+static int split_zone_type_number(struct pglist_data *pgdat,
+				  struct zone_type_struct *zts,
+				  struct zone *zone)
+{
+	return 1;
+}
+#endif
+
 static void __init zones_init(struct pglist_data *pgdat)
 {
-	enum zone_type j;
+	enum zone_type zt;
 	struct zone_type_struct *zts;
-	struct zone *zone;
+	struct zone *zone = pgdat->node_zones;
+	int split_nr;
 
-	for (j = 0; j < MAX_NR_ZONE_TYPES; j++) {
-		zts = pgdat->node_zone_types + j;
-		zone = pgdat->node_zones + j;
+	BUILD_BUG_ON(MAX_NR_ZONES_PER_TYPE > __MAX_NR_SPLIT_ZONES + 1);
+	for (zt = 0; zt < MAX_NR_ZONE_TYPES; zt++) {
+		zts = pgdat->node_zone_types + zt;
 
-		zts->start_zone_idx = zts->last_zone_idx = zone - pgdat->node_zones;
-		zone->type = j;
-		zone->zone_start_pfn = zts->zts_start_pfn;
-		zone->spanned_pages = zts->spanned_pages;
-		zone->present_pages = zts->present_pages;
+		zts->start_zone_idx = zone - pgdat->node_zones;
+		split_nr = split_zone_type_number(pgdat, zts, zone);
+		if (split_nr > 1) {
+			unsigned long split_span = zts->spanned_pages / split_nr;
+			unsigned long start_pfn = zts->zts_start_pfn;
+			unsigned long end_pfn;
+			unsigned long zts_end_pfn = zts->zts_start_pfn + zts->spanned_pages;
+			int i;
+
+			for (i = 0; i < split_nr && start_pfn < zts_end_pfn; i++) {
+				if (i == split_nr - 1) {
+					end_pfn = zts_end_pfn;
+				} else {
+					end_pfn = ALIGN(start_pfn + split_span,
+							SPLIT_ZONE_ALIGN_PAGES);
+					if (end_pfn > zts_end_pfn)
+						end_pfn = zts_end_pfn;
+				}
+				setup_zone_size(pgdat, zone, zt, start_pfn, end_pfn);
+				zone->type = zt;
+				start_pfn = end_pfn;
+				zone++;
+			}
+		} else {
+			zone->type = zt;
+			zone->zone_start_pfn = zts->zts_start_pfn;
+			zone->spanned_pages = zts->spanned_pages;
+			zone->present_pages = zts->present_pages;
 #if defined(CONFIG_MEMORY_HOTPLUG)
-		zone->present_early_pages = zts->present_early_pages;
+			zone->present_early_pages = zts->present_early_pages;
 #endif
+			zone++;
+		}
+		zts->last_zone_idx = (zone - pgdat->node_zones) - 1;
 	}
 }
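
To illustrate the sizing policy described in the changelog, here is a minimal
user-space sketch of the splitting arithmetic; it is not part of the patch.
It assumes 4KB pages and a 64-bit host, and all demo_* names are invented for
the example.  It only mirrors SPLIT_ZONE_PAGES, SPLIT_ZONE_ALIGN_PAGES,
MAX_NR_ZONES_PER_TYPE and the split loop in zones_init(), and leaves out the
remaining-zone-index budget that split_zone_type_number() also applies.

#include <stdio.h>

/*
 * Illustrative constants only: 4KB pages, plus the 254GB split size,
 * 1GB alignment and per-type limit of 4 taken from the patch above.
 */
#define DEMO_PAGE_SHIFT		12
#define DEMO_MAX_PER_TYPE	4
#define DEMO_SPLIT_PAGES	((254UL << 30) >> DEMO_PAGE_SHIFT)
#define DEMO_ALIGN_PAGES	((1UL << 30) >> DEMO_PAGE_SHIFT)

/* Round x up to the next multiple of a. */
#define DEMO_ALIGN(x, a)	((((x) + (a) - 1) / (a)) * (a))

/* How many zone instances a zone type with this many pages would get. */
static int demo_split_count(unsigned long present_pages)
{
	unsigned long nr;

	/* Below ~2 * 254GB the zone type stays a single zone instance. */
	if (present_pages < DEMO_SPLIT_PAGES * 2)
		return 1;

	nr = present_pages / DEMO_SPLIT_PAGES;
	return nr < DEMO_MAX_PER_TYPE ? (int)nr : DEMO_MAX_PER_TYPE;
}

int main(void)
{
	/* Pretend the zone type spans 1TB starting at the 4GB boundary. */
	unsigned long start_pfn = (4UL << 30) >> DEMO_PAGE_SHIFT;
	unsigned long spanned = (1024UL << 30) >> DEMO_PAGE_SHIFT;
	unsigned long end_pfn = start_pfn + spanned;
	int nr = demo_split_count(spanned);
	unsigned long split_span = spanned / nr;
	int i;

	printf("%d zone instance(s)\n", nr);
	for (i = 0; i < nr && start_pfn < end_pfn; i++) {
		unsigned long stop;

		if (i == nr - 1) {
			stop = end_pfn;
		} else {
			/* 1GB-align the boundary, as the split loop does. */
			stop = DEMO_ALIGN(start_pfn + split_span, DEMO_ALIGN_PAGES);
			if (stop > end_pfn)
				stop = end_pfn;
		}
		printf("  instance %d: PFN %lu..%lu (%lu GB)\n", i,
		       start_pfn, stop,
		       (stop - start_pfn) >> (30 - DEMO_PAGE_SHIFT));
		start_pfn = stop;
	}
	return 0;
}

With these example inputs (a 1TB zone type starting at 4GB), the sketch
reports 4 instances of 256GB each, which matches the 4 ZONE_NORMAL instances
mentioned for the kbuild test in the changelog.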