From patchwork Sun Jul 23 19:09:06 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Hyeonggon Yoo <42.hyeyoo@gmail.com>
X-Patchwork-Id: 13323320
From: Hyeonggon Yoo <42.hyeyoo@gmail.com>
To: Vlastimil Babka, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
 David Rientjes, Andrew Morton
Cc: Roman Gushchin, Feng Tang, "Sang, Oliver", Jay Patel, Binder Makin,
 aneesh.kumar@linux.ibm.com, tsahu@linux.ibm.com, piyushs@linux.ibm.com,
 fengwei.yin@intel.com, ying.huang@intel.com, lkp,
 "oe-lkp@lists.linux.dev", linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Hyeonggon Yoo <42.hyeyoo@gmail.com>
Subject: [RFC 2/2] mm/slub: prefer NUMA locality over slight memory saving on
 NUMA machines
Date: Mon, 24 Jul 2023 04:09:06 +0900
Message-ID: <20230723190906.4082646-3-42.hyeyoo@gmail.com>
X-Mailer: git-send-email 2.41.0
In-Reply-To: <20230723190906.4082646-1-42.hyeyoo@gmail.com>
References: <20230723190906.4082646-1-42.hyeyoo@gmail.com>

By default, SLUB sets remote_node_defrag_ratio to 1000, which makes it
(in most cases) take slabs from remote nodes before trying to allocate
new folios on the
local node from buddy.

Documentation/ABI/testing/sysfs-kernel-slab says:

> The file remote_node_defrag_ratio specifies the percentage of
> times SLUB will attempt to refill the cpu slab with a partial
> slab from a remote node as opposed to allocating a new slab on
> the local node. This reduces the amount of wasted memory over
> the entire system but can be expensive.

Although this made sense when it was introduced, the share of per-node
partial lists in overall SLUB memory usage has decreased since the
introduction of per-CPU partial lists. Therefore, it's worth
reevaluating its overhead on performance and memory usage.

[
	XXX: Add performance data. I tried to measure its impact on
	hackbench with a 2-socket NUMA machine, but it seems hackbench is
	too synthetic to benefit from this, because the
	skbuff_head_cache's size fits into the last level cache.

	Probably more realistic workloads like netperf would benefit
	from this?
]

Set remote_node_defrag_ratio to zero by default, and the new behavior is:
	1) try refilling per CPU partial list from the local node
	2) try allocating new slabs from the local node without reclamation
	3) try refilling per CPU partial list from remote nodes
	4) try allocating new slabs from the local node or remote nodes

If the user has specified remote_node_defrag_ratio, it probabilistically
tries 3) first and then tries 2) and 4) in order, to avoid an unexpected
behavioral change from the user's perspective.
---
 mm/slub.c | 45 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 8 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 199d3d03d5b9..cfdea3e3e221 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2319,7 +2319,8 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 /*
  * Get a slab from somewhere. Search in increasing NUMA distances.
  */
-static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
+static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc,
+			     bool force_defrag)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
@@ -2347,8 +2348,8 @@ static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
 	 * may be expensive if we do it every time we are trying to find a slab
 	 * with available objects.
 	 */
-	if (!s->remote_node_defrag_ratio ||
-			get_cycles() % 1024 > s->remote_node_defrag_ratio)
+	if (!force_defrag && (!s->remote_node_defrag_ratio ||
+			get_cycles() % 1024 > s->remote_node_defrag_ratio))
 		return NULL;
 
 	do {
@@ -2382,7 +2383,8 @@ static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
 /*
  * Get a partial slab, lock it and return it.
  */
-static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc)
+static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc,
+			 bool force_defrag)
 {
 	void *object;
 	int searchnode = node;
@@ -2394,7 +2396,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
 	if (object || node != NUMA_NO_NODE)
 		return object;
 
-	return get_any_partial(s, pc);
+	return get_any_partial(s, pc, force_defrag);
 }
 
 #ifndef CONFIG_SLUB_TINY
@@ -3092,6 +3094,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	struct slab *slab;
 	unsigned long flags;
 	struct partial_context pc;
+	gfp_t local_flags;
 
 	stat(s, ALLOC_SLOWPATH);
 
@@ -3208,10 +3211,35 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	pc.flags = gfpflags;
 	pc.slab = &slab;
 	pc.orig_size = orig_size;
-	freelist = get_partial(s, node, &pc);
+
+	freelist = get_partial(s, node, &pc, false);
 	if (freelist)
 		goto check_new_slab;
 
+	/*
+	 * try allocating slab from the local node first before taking slabs
+	 * from remote nodes. If user specified remote_node_defrag_ratio,
+	 * try taking slabs from remote nodes first.
+	 */
+	slub_put_cpu_ptr(s->cpu_slab);
+	local_flags = (gfpflags | __GFP_NOWARN | __GFP_THISNODE);
+	local_flags &= ~(__GFP_NOFAIL | __GFP_RECLAIM);
+	slab = new_slab(s, local_flags, node);
+	c = slub_get_cpu_ptr(s->cpu_slab);
+
+	if (slab)
+		goto alloc_slab;
+
+	/*
+	 * At this point no memory can be allocated lightly.
+	 * Take slabs from remote nodes.
+	 */
+	if (node == NUMA_NO_NODE) {
+		freelist = get_any_partial(s, &pc, true);
+		if (freelist)
+			goto check_new_slab;
+	}
+
 	slub_put_cpu_ptr(s->cpu_slab);
 	slab = new_slab(s, gfpflags, node);
 	c = slub_get_cpu_ptr(s->cpu_slab);
@@ -3221,6 +3249,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		return NULL;
 	}
 
+alloc_slab:
 	stat(s, ALLOC_SLAB);
 
 	if (kmem_cache_debug(s)) {
@@ -3404,7 +3433,7 @@ static void *__slab_alloc_node(struct kmem_cache *s,
 	pc.flags = gfpflags;
 	pc.slab = &slab;
 	pc.orig_size = orig_size;
-	object = get_partial(s, node, &pc);
+	object = get_partial(s, node, &pc, false);
 	if (object)
 		return object;
 
@@ -4538,7 +4567,7 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
 	set_cpu_partial(s);
 
 #ifdef CONFIG_NUMA
-	s->remote_node_defrag_ratio = 1000;
+	s->remote_node_defrag_ratio = 0;
 #endif
 
 	/* Initialize the pre-computed randomized freelist if slab is up */
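
Note for reviewers: the fallback order described in the commit message
can be modeled in user space. The following is only a sketch under
stated assumptions, not kernel code. The names slab_source, model_state
and pick_source are hypothetical, the caller is assumed to pass
node == NUMA_NO_NODE, and the cycles argument stands in for the value
returned by get_cycles().

```c
#include <stdbool.h>

/* Hypothetical model of where ___slab_alloc() gets its slab from. */
enum slab_source {
	SRC_LOCAL_PARTIAL,		/* 1) partial slab on the local node */
	SRC_LOCAL_NEW_NORECLAIM,	/* 2) new local slab, no reclaim */
	SRC_REMOTE_PARTIAL,		/* 3) partial slab on a remote node */
	SRC_NEW_ANY,			/* 4) new slab, local or remote */
	SRC_NONE,			/* allocation failed */
};

/* Hypothetical snapshot of what each step would find. */
struct model_state {
	bool local_partial_avail;	/* local node has a partial slab */
	bool local_page_free;		/* local page available without reclaim */
	bool remote_partial_avail;	/* some remote node has a partial slab */
	bool page_after_reclaim;	/* unrestricted allocation succeeds */
};

/*
 * Mirrors the order in the patch, assuming node == NUMA_NO_NODE.
 * defrag_ratio models s->remote_node_defrag_ratio and cycles models
 * get_cycles().
 */
enum slab_source pick_source(const struct model_state *st,
			     unsigned int defrag_ratio, unsigned int cycles)
{
	/* 1) get_partial() looks at the local node first. */
	if (st->local_partial_avail)
		return SRC_LOCAL_PARTIAL;

	/*
	 * With a nonzero ratio, get_any_partial(force_defrag = false)
	 * still runs probabilistically before step 2).
	 */
	if (defrag_ratio && cycles % 1024 <= defrag_ratio &&
	    st->remote_partial_avail)
		return SRC_REMOTE_PARTIAL;

	/* 2) new_slab() with __GFP_THISNODE and without __GFP_RECLAIM. */
	if (st->local_page_free)
		return SRC_LOCAL_NEW_NORECLAIM;

	/* 3) get_any_partial(force_defrag = true). */
	if (st->remote_partial_avail)
		return SRC_REMOTE_PARTIAL;

	/* 4) new_slab() with the caller's original gfp flags. */
	if (st->page_after_reclaim)
		return SRC_NEW_ANY;

	return SRC_NONE;
}
```

With defrag_ratio == 0 the model never takes a remote partial slab while
a local page is available without reclaim. With a nonzero ratio it
takes the remote slab first whenever cycles % 1024 <= defrag_ratio,
which matches the old default behavior for a ratio of 1000 in most
cases.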