From patchwork Mon Nov 27 19:36:57 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 13470243
From: Nhat Pham
To: akpm@linux-foundation.org
Cc: hannes@cmpxchg.org, cerasuolodomenico@gmail.com, yosryahmed@google.com,
 sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com,
 mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com,
 muchun.song@linux.dev, chrisl@kernel.org, linux-mm@kvack.org,
 kernel-team@meta.com, linux-kernel@vger.kernel.org,
 cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-kselftest@vger.kernel.org, shuah@kernel.org
Subject: [PATCH v6 0/6] workload-specific and memory pressure-driven zswap
 writeback
Date: Mon, 27 Nov 2023 11:36:57 -0800
Message-Id: <20231127193703.1980089-1-nphamcs@gmail.com>
X-Mailer: git-send-email 2.34.1
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Changelog:
v6:
   * Rebase on top of latest mm-unstable.
   * Fix/improve the in-code documentation of the new list_lru
     manipulation functions (patch 1)
v5:
   * Replace reference getting with an rcu_read_lock() section for
     zswap lru modifications (suggested by Yosry)
   * Add a new prep patch that allows mem_cgroup_iter() to return
     online cgroup.
   * Add a callback that updates pool->next_shrink when the cgroup is
     offlined (suggested by Yosry Ahmed, Johannes Weiner)
v4:
   * Rename list_lru_add to list_lru_add_obj and __list_lru_add to
     list_lru_add (patch 1) (suggested by Johannes Weiner and
     Yosry Ahmed)
   * Some cleanups on the memcg aware LRU patch (patch 2)
     (suggested by Yosry Ahmed)
   * Use event interface for the new per-cgroup writeback counters.
     (patch 3) (suggested by Yosry Ahmed)
   * Abstract zswap's lruvec states and handling into
     zswap_lruvec_state (patch 5) (suggested by Yosry Ahmed)
v3:
   * Add a patch to export per-cgroup zswap writeback counters
   * Add a patch to update zswap's kselftest
   * Separate the new list_lru functions into their own prep patch
   * Do not start from the top of the hierarchy when encountering a
     memcg that is not online for the global limit zswap writeback
     (patch 2) (suggested by Yosry Ahmed)
   * Do not remove the swap entry from list_lru in
     __read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
   * Remove a redundant zswap pool getting (patch 2)
     (reported by Ryan Roberts)
   * Use atomic for the nr_zswap_protected (instead of lruvec's lock)
     (patch 5) (suggested by Yosry Ahmed)
   * Remove the per-cgroup zswap shrinker knob (patch 5)
     (suggested by Yosry Ahmed)
v2:
   * Fix loongarch compiler errors
   * Use pool stats instead of memcg stats when !CONFIG_MEMCG_KMEM

There are currently several issues with zswap writeback:

1. There is only a single global LRU for zswap, making it impossible to
   perform workload-specific shrinking - a memcg under memory pressure
   cannot determine which pages in the pool it owns, and often ends up
   writing pages from other memcgs.
   This issue has been previously observed in practice and mitigated by
   simply disabling memcg-initiated shrinking:

   https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u

   But this solution leaves a lot to be desired, as we still do not have
   an avenue for a memcg to free up its own memory locked up in the
   zswap pool.

2. We only shrink the zswap pool when the user-defined limit is hit.
   This means that if we set the limit too high, cold data that are
   unlikely to be used again will reside in the pool, wasting precious
   memory. It is hard to predict how much zswap space will be needed
   ahead of time, as this depends on the workload (specifically, on
   factors such as memory access patterns and compressibility of the
   memory pages).

This patch series solves these issues by separating the global zswap LRU
into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e.
memcg- and NUMA-aware) zswap writeback under memory pressure. The new
shrinker does not have any parameter that must be tuned by the user, and
can be opted in or out on a per-memcg basis.

As a proof of concept, we ran the following synthetic benchmark: build
the Linux kernel in a memory-limited cgroup, and allocate some cold data
in tmpfs to see if the shrinker could write them out and improve the
overall performance. Depending on the amount of cold data generated, we
observe a 14% to 35% reduction in kernel CPU time used in the kernel
builds.
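
For readers who want a concrete picture of the per-memcg, per-NUMA split
described above, here is a minimal userspace sketch. It is a toy model
under stated assumptions, not the code in this series: the names
toy_zswap, toy_lru, toy_store and toy_shrink, and the fixed sizes
NR_MEMCGS, NR_NODES and LRU_CAP, are all hypothetical. It only shows that
once every (memcg, node) pair has its own list, shrinking on behalf of
one memcg cannot write back another memcg's pages.

```c
#include <assert.h>
#include <stddef.h>

#define NR_MEMCGS 2   /* hypothetical: number of cgroups in the toy model */
#define NR_NODES  2   /* hypothetical: number of NUMA nodes */
#define LRU_CAP   8   /* hypothetical: max entries per list */

/* One LRU per (memcg, node) pair; index 0 holds the coldest entry. */
struct toy_lru {
	int entries[LRU_CAP];	/* page ids, oldest first */
	size_t nr;		/* number of valid entries */
};

struct toy_zswap {
	struct toy_lru lru[NR_MEMCGS][NR_NODES];
	size_t written_back[NR_MEMCGS];	/* per-memcg writeback counter */
};

/*
 * Append a page at the hot end of its owner's list.
 * Returns 0 on success, -1 if that list is full.
 */
static int toy_store(struct toy_zswap *z, int memcg, int node, int page)
{
	struct toy_lru *l;

	assert(memcg >= 0 && memcg < NR_MEMCGS);
	assert(node >= 0 && node < NR_NODES);

	l = &z->lru[memcg][node];
	if (l->nr == LRU_CAP)
		return -1;
	l->entries[l->nr++] = page;
	return 0;
}

/*
 * "Write back" up to nr_to_scan of the coldest pages from a single
 * (memcg, node) list. Other lists are never touched.
 * Returns the number of pages written back.
 */
static size_t toy_shrink(struct toy_zswap *z, int memcg, int node,
			 size_t nr_to_scan)
{
	struct toy_lru *l;
	size_t n, i;

	assert(memcg >= 0 && memcg < NR_MEMCGS);
	assert(node >= 0 && node < NR_NODES);

	l = &z->lru[memcg][node];
	n = nr_to_scan < l->nr ? nr_to_scan : l->nr;

	/* Drop the n oldest entries and slide the rest to the front. */
	for (i = n; i < l->nr; i++)
		l->entries[i - n] = l->entries[i];
	l->nr -= n;
	z->written_back[memcg] += n;
	return n;
}
```

In this model, a caller that stores pages for memcg 0 and memcg 1 and
then calls toy_shrink(&z, 0, 0, 2) evicts only memcg 0's two coldest
pages on node 0; memcg 1's list and its written_back counter are left
unchanged. With a single shared list there would be no such guarantee,
which is issue 1 above.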

Domenico Cerasuolo (3):
  zswap: make shrinking memcg-aware
  mm: memcg: add per-memcg zswap writeback stat
  selftests: cgroup: update per-memcg zswap writeback selftest

Nhat Pham (3):
  list_lru: allows explicit memcg and NUMA node selection
  memcontrol: allows mem_cgroup_iter() to check for onlineness
  zswap: shrinks zswap pool based on memory pressure

 Documentation/admin-guide/mm/zswap.rst      |   7 +
 drivers/android/binder_alloc.c              |   5 +-
 fs/dcache.c                                 |   8 +-
 fs/gfs2/quota.c                             |   6 +-
 fs/inode.c                                  |   4 +-
 fs/nfs/nfs42xattr.c                         |   8 +-
 fs/nfsd/filecache.c                         |   4 +-
 fs/xfs/xfs_buf.c                            |   6 +-
 fs/xfs/xfs_dquot.c                          |   2 +-
 fs/xfs/xfs_qm.c                             |   2 +-
 include/linux/list_lru.h                    |  54 ++-
 include/linux/memcontrol.h                  |   9 +-
 include/linux/mmzone.h                      |   2 +
 include/linux/vm_event_item.h               |   1 +
 include/linux/zswap.h                       |  27 +-
 mm/list_lru.c                               |  48 ++-
 mm/memcontrol.c                             |  20 +-
 mm/mmzone.c                                 |   1 +
 mm/shrinker.c                               |   4 +-
 mm/swap.h                                   |   3 +-
 mm/swap_state.c                             |  26 +-
 mm/vmscan.c                                 |  26 +-
 mm/vmstat.c                                 |   1 +
 mm/workingset.c                             |   4 +-
 mm/zswap.c                                  | 426 +++++++++++++++++---
 tools/testing/selftests/cgroup/test_zswap.c |  74 ++--
 26 files changed, 629 insertions(+), 149 deletions(-)

base-commit: 40b487ae2620fc9187fee68b09d2cb275de0d60e