From patchwork Fri Nov 5 20:40:21 2021
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 12605581
Date: Fri, 05 Nov 2021 13:40:21 -0700
From: Andrew Morton
To: akpm@linux-foundation.org, anshuman.khandual@arm.com, bharata@amd.com,
 kamezawa.hiroyu@jp.fujitsu.com, krupa.ramakrishnan@amd.com,
 lee.schermerhorn@hp.com, linux-mm@kvack.org, mgorman@suse.de,
 mm-commits@vger.kernel.org, Sadagopan.Srinivasan@amd.com,
 torvalds@linux-foundation.org
Subject: [patch 109/262] mm/page_alloc: use accumulated load when building
 node fallback list
Message-ID: <20211105204021.LOWopzfhI%akpm@linux-foundation.org>
In-Reply-To: <20211105133408.cccbb98b71a77d5e8430aba1@linux-foundation.org>

From: Krupa Ramakrishnan
Subject: mm/page_alloc: use accumulated load when building node fallback list

In build_zonelists(), when the fallback list is built for the nodes, the
node load gets reinitialized during each iteration. This results in nodes
at the same distance occupying the same slot in different node fallback
lists rather than appearing in the intended round-robin manner.
As a result, one node gets picked for allocation more often than other
nodes at the same distance.

As an example, consider a 4 node system with the following distance matrix.

Node 0  1  2  3
----------------
0    10 12 32 32
1    12 10 32 32
2    32 32 10 12
3    32 32 12 10

For this case, the node fallback list gets built like this:

Node Fallback list
---------------------
0    0 1 2 3
1    1 0 3 2
2    2 3 0 1
3    3 2 0 1 <-- Unexpected fallback order

In the fallback lists for nodes 2 and 3, nodes 0 and 1 appear in the same
order, which results in more allocations getting satisfied from node 0
than from node 1.

The effect of this on remote memory bandwidth as seen by the stream
benchmark is shown below:

Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
        (numactl -m 0,1 ./stream_lowOverhead ... --cores )
Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
        (numactl -m 2,3 ./stream_lowOverhead ... --cores )

----------------------------------------
                BANDWIDTH (MB/s)
    TEST        Case 1          Case 2
----------------------------------------
    COPY        57479.6         110791.8
   SCALE        55372.9         105685.9
     ADD        50460.6         96734.2
  TRIADD        50397.6         97119.1
----------------------------------------

The bandwidth drop in Case 1 occurs because most of the allocations get
satisfied by node 0, as it appears first in the fallback order for both
nodes 2 and 3.

This can be fixed by accumulating the node load in build_zonelists()
rather than reinitializing it during each iteration. With this, nodes at
the same distance correctly get assigned in a round-robin manner.

In fact this was how it was originally, until commit f0c0b2b808f2
("change zonelist order: zonelist order selection logic") dropped the load
accumulation and resorted to initializing the load during each iteration.
While zonelist ordering was removed by commit c9bff3eebc09 ("mm,
page_alloc: rip out ZONELIST_ORDER_ZONE"), the change to the node load
accumulation in build_zonelists() remained.
So essentially this patch reverts to the accumulated node load logic.

After this fix, the fallback order gets built like this:

Node Fallback list
------------------
0    0 1 2 3
1    1 0 3 2
2    2 3 0 1
3    3 2 1 0 <-- Note the change here

The bandwidth in Case 1 improves and matches Case 2 as shown below.

----------------------------------------
                BANDWIDTH (MB/s)
    TEST        Case 1          Case 2
----------------------------------------
    COPY        110438.9        110107.2
   SCALE        105930.5        105817.5
     ADD        97005.1         96159.8
  TRIADD        97441.5         96757.1
----------------------------------------

The correctness of the fallback list generation has been verified for the
above node configuration where node 3 starts as a memory-less node and
comes online only during memory hotplug.

[bharata@amd.com: Added changelog, review, test validation]
Link: https://lkml.kernel.org/r/20210830121603.1081-3-bharata@amd.com
Fixes: f0c0b2b808f2 ("change zonelist order: zonelist order selection logic")
Signed-off-by: Krupa Ramakrishnan
Co-developed-by: Sadagopan Srinivasan
Signed-off-by: Sadagopan Srinivasan
Signed-off-by: Bharata B Rao
Acked-by: Mel Gorman
Reviewed-by: Anshuman Khandual
Cc: KAMEZAWA Hiroyuki
Cc: Lee Schermerhorn
Signed-off-by: Andrew Morton
---

 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/page_alloc.c~mm-page_alloc-use-accumulated-load-when-building-node-fallback-list
+++ a/mm/page_alloc.c
@@ -6253,7 +6253,7 @@ static void build_zonelists(pg_data_t *p
 		 */
 		if (node_distance(local_node, node) !=
 		    node_distance(local_node, prev_node))
-			node_load[node] = load;
+			node_load[node] += load;
 
 		node_order[nr_nodes++] = node;
 		prev_node = node;
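
The effect of the one-line change above can be reproduced outside the
kernel with a small user-space model. The sketch below is not kernel code.
It is a simplified stand-in for find_next_best_node() and
build_zonelists() that assumes all four nodes are online and have both
CPUs and memory, so the kernel's penalty for nodes with CPUs would be
uniform and is left out. The names next_best_node(), build_order(),
build_all() and DEMO_MAIN, and the NR_NODES * NR_NODES scale factor, are
illustrative assumptions. The accumulate argument selects between the old
"=" behaviour (0) and the "+=" behaviour of this patch (1).

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

#define NR_NODES 4

/* Distance matrix taken from the changelog. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 12, 32, 32 },
	{ 12, 10, 32, 32 },
	{ 32, 32, 10, 12 },
	{ 32, 32, 12, 10 },
};

/* Per-node load, shared across all fallback list builds. */
static int node_load[NR_NODES];

/* Simplified model: pick the closest unused node, break ties by load. */
static int next_best_node(int local, int *used)
{
	int n, val, min_val = INT_MAX, best = -1;

	/* The local node always comes first in its own list. */
	if (!used[local]) {
		used[local] = 1;
		return local;
	}

	for (n = 0; n < NR_NODES; n++) {
		if (used[n])
			continue;
		val = distance[local][n];
		val += (n < local);		/* prefer the next node */
		val *= NR_NODES * NR_NODES;	/* assumed scale: distance dominates */
		val += node_load[n];		/* slight preference for less load */
		if (val < min_val) {
			min_val = val;
			best = n;
		}
	}
	if (best >= 0)
		used[best] = 1;
	return best;
}

/* Build the fallback order for one node. */
static void build_order(int local, int accumulate, int *order)
{
	int used[NR_NODES] = { 0 };
	int node, prev = local, load = NR_NODES, nr = 0;

	while ((node = next_best_node(local, used)) >= 0) {
		/* Penalize the first node of each new distance group. */
		if (distance[local][node] != distance[local][prev]) {
			if (accumulate)
				node_load[node] += load;	/* with this patch */
			else
				node_load[node] = load;		/* before this patch */
		}
		assert(nr < NR_NODES);
		order[nr++] = node;
		prev = node;
		load--;
	}
}

/* Build every node's fallback list, starting from zeroed loads. */
void build_all(int accumulate, int orders[NR_NODES][NR_NODES])
{
	int n;

	memset(node_load, 0, sizeof(node_load));
	for (n = 0; n < NR_NODES; n++)
		build_order(n, accumulate, orders[n]);
}

#ifdef DEMO_MAIN
int main(void)
{
	int orders[NR_NODES][NR_NODES], mode, n, i;

	for (mode = 0; mode < 2; mode++) {
		build_all(mode, orders);
		printf("%s\n", mode ? "accumulated load (+=):" : "reset load (=):");
		for (n = 0; n < NR_NODES; n++) {
			printf("%d   ", n);
			for (i = 0; i < NR_NODES; i++)
				printf(" %d", orders[n][i]);
			printf("\n");
		}
	}
	return 0;
}
#endif
```

Compiled with -DDEMO_MAIN, the model prints the fallback lists for both
modes. Under the assumptions above, node 3 should come out as 3 2 0 1 with
the reset load and as 3 2 1 0 with the accumulated load, matching the two
fallback tables in the changelog.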