From patchwork Sat Mar 29 11:02:28 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 14032645
From: Nhat Pham <nphamcs@gmail.com>
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, yosry.ahmed@linux.dev,
    chengming.zhou@linux.dev, sj@kernel.org, kernel-team@meta.com,
    linux-kernel@vger.kernel.org, gourry@gourry.net, willy@infradead.org,
    ying.huang@linux.alibaba.com, jonathan.cameron@huawei.com,
    dan.j.williams@intel.com, linux-cxl@vger.kernel.org, minchan@kernel.org,
    senozhatsky@chromium.org
Subject: [RFC PATCH 0/2] zswap: fix placement inversion in memory tiering systems
Date: Sat, 29 Mar 2025 04:02:28 -0700
Message-ID: <20250329110230.2459730-1-nphamcs@gmail.com>

Currently, systems with CXL-based memory tiering can encounter the
following inversion with zswap: the coldest pages demoted to the CXL
tier can return to the high tier when they are zswapped out, creating memory
pressure on the high tier.

This happens because zsmalloc, zswap's backend memory allocator, does
not enforce any memory policy. If the task reclaiming memory follows
the local-first policy, for example, the memory requested for zswap can
be served by the upper tier, leading to the aforementioned inversion.

This RFC fixes this inversion by adding a new memory allocation mode
for zswap (exposed through a zswap sysfs knob), intended for hosts with
CXL, where the memory for the compressed object is requested
preferentially from the same node that the original page resides on.

With the new zswap allocation mode enabled, we should observe the
following dynamics:

1. When demotion is turned on, under reasonable conditions, zswap will
   prefer CXL memory by default, since top-tier memory being reclaimed
   will typically be demoted instead of swapped.

2. This should prevent reclaim on the lower tier from causing high-tier
   memory pressure due to new allocations.

3. This should avoid a quiet promotion of cold memory (memory being
   zswapped is cold, but is promoted when put into the zswap pool
   because the memory allocated for the compressed copy comes from the
   high tier).

4. However, this may cause pressure on the CXL tier, which may in turn
   result in further demotion (to swap, etc). This needs to be tested.

I'm still testing and collecting more data, but figured I should send
this out as an RFC to spark the discussion:

1. Is this the right policy? Do we need a more complicated policy?
   Should we instead go for the "lowest" node (which would require a new
   memory tiering API)? Or maybe try each node from the current node
   down to the lowest node in the hierarchy?

   Also, I hacked together this fix with CXL in mind, but if there are
   other cases that I should also address, we can explore a more
   general memory allocation strategy or interface.

2. Similarly, is this the right zsmalloc API?
   For instance, we can build a full-fledged mempolicy-based API for
   zsmalloc, but I haven't found a use case for it yet.

3. Assuming this is the right policy, what should be the semantics?
   Not very good at naming things, so same_node_mode might not be it :)

Nhat Pham (2):
  zsmalloc: let callers select NUMA node to store the compressed objects
  zswap: add sysfs knob for same node mode

 Documentation/admin-guide/mm/zswap.rst |  9 +++++++++
 include/linux/zpool.h                  |  4 ++--
 mm/zpool.c                             |  8 +++++---
 mm/zsmalloc.c                          | 28 +++++++++++++++++++-------
 mm/zswap.c                             | 10 +++++++--
 5 files changed, 45 insertions(+), 14 deletions(-)

base-commit: 4135040c342ba080328891f1b7e523c8f2f04c58
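
For illustration only, and not code from the patches: a small userspace
sketch of the node-selection policy described in this cover letter. It
models a two-tier host where node 0 is DRAM and node 1 is CXL. All names
(sketch_alloc_req, sketch_preferred_node, sketch_alloc_node,
SKETCH_NO_NODE) are hypothetical, and the fallback order when the
preferred node is full is an assumption; the real fallback behaviour
depends on the page allocator and is one of the open questions above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical "no node available" value, loosely akin to NUMA_NO_NODE. */
#define SKETCH_NO_NODE (-1)

/* Hypothetical description of one zswap store request. */
struct sketch_alloc_req {
	int page_node;      /* node the page being swapped out resides on */
	int reclaimer_node; /* node local to the task performing reclaim */
};

/*
 * Pick the node to try first for the compressed copy.
 *
 * same_node_mode off: model the local-first behaviour, where the
 * allocation follows the reclaiming task and may land on the top tier.
 * same_node_mode on: prefer the node the original page lives on.
 */
int sketch_preferred_node(const struct sketch_alloc_req *req,
			  bool same_node_mode)
{
	assert(req->page_node >= 0 && req->reclaimer_node >= 0);
	return same_node_mode ? req->page_node : req->reclaimer_node;
}

/*
 * "Preferentially" means the preferred node is tried first. As an
 * assumption of this sketch, fall back to the lowest-numbered node
 * that still has free pages, or SKETCH_NO_NODE if none does.
 */
int sketch_alloc_node(const struct sketch_alloc_req *req, bool same_node_mode,
		      const long *free_pages, int nr_nodes)
{
	int nid = sketch_preferred_node(req, same_node_mode);

	assert(nid < nr_nodes);
	if (free_pages[nid] > 0)
		return nid;

	for (int i = 0; i < nr_nodes; i++) {
		if (free_pages[i] > 0)
			return i;
	}
	return SKETCH_NO_NODE;
}
```

With a page on node 1 (CXL) and a reclaimer running on node 0 (DRAM),
the sketch picks node 0 with the mode off, which is the inversion, and
node 1 with the mode on.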