From patchwork Mon Jun 17 17:05:43 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 13701089
Date: Mon, 17 Jun 2024 17:05:43 +0000
In-Reply-To: <20240617170545.3820912-1-jiaqiyan@google.com>
References: <20240617170545.3820912-1-jiaqiyan@google.com>
X-Mailer: git-send-email 2.45.2.627.g7a2c4fd464-goog
Message-ID: <20240617170545.3820912-2-jiaqiyan@google.com>
Subject: [PATCH v3 1/3] mm/memory-failure: userspace controls soft-offlining pages
From: Jiaqi Yan
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com, jane.chu@oracle.com,
 ioworker0@gmail.com
Cc: muchun.song@linux.dev, akpm@linux-foundation.org, shuah@kernel.org,
 corbet@lwn.net, osalvador@suse.de, rientjes@google.com, duenwen@google.com,
 fvdl@google.com, linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
 linux-doc@vger.kernel.org, Jiaqi Yan

Correctable memory errors are
very common on servers with a large amount of memory, and are corrected
by ECC. Soft offline is the kernel's additional recovery handling for
memory pages that have (excessive) corrected memory errors. The impacted
page is migrated to a healthy page if it is in use; the original page is
then discarded from any future use.

The actual policy on whether (and when) to soft offline should be
maintained by userspace, especially in the case of a 1G HugeTLB page.
Soft-offline dissolves the HugeTLB page, whether in use or free, into
chunks of 4K pages, reducing the HugeTLB pool capacity by 1 hugepage. If
userspace has not acknowledged this behavior, it may be surprised when a
later mmap of hugepages fails for lack of hugepages. A transparent
hugepage is split into 4K pages as well, and userspace stops enjoying
the transparent performance benefit.

In addition, discarding an entire 1G HugeTLB page only because of
corrected memory errors is very costly, and the kernel had better not do
it under the hood. But today there are at least 2 cases that do so:
1. The GHES driver sees both GHES_SEV_CORRECTED and
   CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
2. The RAS Correctable Errors Collector counts correctable errors per
   PFN, and the counter for a PFN reaches its threshold.

In both cases, userspace has no control over the soft offline performed
by the kernel's memory failure recovery.

This commit gives userspace control over soft-offlining any page: the
kernel only soft offlines a raw page / transparent hugepage / HugeTLB
hugepage if userspace has agreed to it.

The interface to userspace is a new sysctl at
/proc/sys/vm/enable_soft_offline. Its default value is 1, which
preserves the existing kernel behavior. When it is set to 0,
soft-offline (e.g. MADV_SOFT_OFFLINE) fails with EOPNOTSUPP.
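
For illustration only, the userspace side of this interface could look
roughly like the sketch below. It is not part of the patch. The helper
names (read_soft_offline_knob, try_soft_offline,
describe_soft_offline_result) are hypothetical, and the fallback value
101 for MADV_SOFT_OFFLINE is an assumption taken from the kernel's
asm-generic uapi headers for the case where libc does not expose the
constant. MADV_SOFT_OFFLINE requires CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

/* Assumption: libc may not define this; 101 is the asm-generic value. */
#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101
#endif

/* Path of the sysctl added by this patch. */
#define SOFT_OFFLINE_SYSCTL "/proc/sys/vm/enable_soft_offline"

/*
 * Hypothetical helper: read a 0/1 knob from @path.
 * Returns 0 or 1 on success, -errno if the file cannot be opened,
 * -EINVAL if it does not hold an integer, -ERANGE if it is not 0 or 1.
 */
int read_soft_offline_knob(const char *path)
{
	FILE *f = fopen(path, "r");
	int val;

	if (!f)
		return -errno;
	if (fscanf(f, "%d", &val) != 1) {
		fclose(f);
		return -EINVAL;
	}
	fclose(f);
	if (val != 0 && val != 1)
		return -ERANGE;
	return val;
}

/*
 * Hypothetical helper: ask the kernel to soft-offline the page(s)
 * backing [addr, addr + len). Returns 0 on success or -errno.
 */
int try_soft_offline(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_SOFT_OFFLINE) == 0)
		return 0;
	return -errno;
}

/* Map the return value of try_soft_offline() to a short explanation. */
const char *describe_soft_offline_result(int ret)
{
	switch (ret) {
	case 0:
		return "page soft-offlined";
	case -EOPNOTSUPP:
		return "refused: disabled by sysctl or filtered by hwpoison_filter()";
	case -EPERM:
		return "refused: caller lacks CAP_SYS_ADMIN";
	default:
		return "failed for another reason";
	}
}
```

A caller would typically check
read_soft_offline_knob(SOFT_OFFLINE_SYSCTL) first and then pass a
page-aligned mapping to try_soft_offline(). With the knob at 0, the
expected result under this patch is -EOPNOTSUPP. Note that a successful
call really retires the backing physical page, so this belongs on a
test machine.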

Signed-off-by: Jiaqi Yan
---
 mm/memory-failure.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d3c830e817e3..9eb216ed0b86 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -68,6 +68,8 @@ static int sysctl_memory_failure_early_kill __read_mostly;
 
 static int sysctl_memory_failure_recovery __read_mostly = 1;
 
+static int sysctl_enable_soft_offline __read_mostly = 1;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
@@ -141,6 +143,15 @@ static struct ctl_table memory_failure_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "enable_soft_offline",
+		.data		= &sysctl_enable_soft_offline,
+		.maxlen		= sizeof(sysctl_enable_soft_offline),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	}
 };
 
 /*
@@ -2746,8 +2757,9 @@ static int soft_offline_in_use_page(struct page *page)
  * @pfn: pfn to soft-offline
  * @flags: flags. Same as memory_failure().
  *
- * Returns 0 on success
- *         -EOPNOTSUPP for hwpoison_filter() filtered the error event
+ * Returns 0 on success,
+ *         -EOPNOTSUPP for hwpoison_filter() filtered the error event,
+ *         -EOPNOTSUPP if disabled by /proc/sys/vm/enable_soft_offline,
  *         < 0 otherwise negated errno.
  *
  * Soft offline a page, by migration or invalidation,
@@ -2783,6 +2795,12 @@ int soft_offline_page(unsigned long pfn, int flags)
 		return -EIO;
 	}
 
+	if (!sysctl_enable_soft_offline) {
+		pr_info("%#lx: OS-wide disabled\n", pfn);
+		put_ref_page(pfn, flags);
+		return -EOPNOTSUPP;
+	}
+
 	mutex_lock(&mf_mutex);
 
 	if (PageHWPoison(page)) {