From patchwork Tue Feb 11 11:13:26 2025
X-Patchwork-Submitter: Dev Jain
X-Patchwork-Id: 13969532
From: Dev Jain
To: akpm@linux-foundation.org, david@redhat.com, willy@infradead.org,
 kirill.shutemov@linux.intel.com
Cc: npache@redhat.com, ryan.roberts@arm.com, anshuman.khandual@arm.com,
 catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com,
 apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org,
 baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu,
 haowenchao22@gmail.com, hughd@google.com, aneesh.kumar@kernel.org,
 yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
 surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
 zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dev Jain
Subject: [PATCH v2 17/17] Documentation: transhuge: Define khugepaged mTHP collapse policy
Date: Tue, 11 Feb 2025 16:43:26 +0530
Message-Id: <20250211111326.14295-18-dev.jain@arm.com>
In-Reply-To: <20250211111326.14295-1-dev.jain@arm.com>
References: <20250211111326.14295-1-dev.jain@arm.com>

Update documentation to reflect the mTHP specific changes for khugepaged.

Signed-off-by: Dev Jain
---
 Documentation/admin-guide/mm/transhuge.rst | 49 +++++++++++++++++-----
 1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index dff8d5985f0f..6a513fa81005 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -63,7 +63,7 @@ often.
 THP can be enabled system wide or restricted to certain tasks or even
 memory ranges inside task's address space. Unless THP is completely
 disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into PMD-sized huge pages.
+collapses sequences of basic pages into huge pages.
 
 The THP behaviour is controlled via :ref:`sysfs `
 interface and using madvise(2) and prctl(2) system calls.
@@ -212,20 +212,16 @@ this behaviour by writing 0 to shrink_underused, and enable it by writing
 	echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
 	echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
 
-khugepaged will be automatically started when PMD-sized THP is enabled
+khugepaged will be automatically started when THP is enabled
 (either of the per-size anon control or the top-level control are set
 to "always" or "madvise"), and it'll be automatically shutdown when
-PMD-sized THP is disabled (when both the per-size anon control and the
-top-level control are "never")
+THP is disabled (when all of the per-size anon controls and the
+top-level control are "never"). mTHP collapse is supported only for
+private-anonymous memory.
 
 Khugepaged controls
 -------------------
 
-.. note::
-   khugepaged currently only searches for opportunities to collapse to
-   PMD-sized THP and no attempt is made to collapse to other THP
-   sizes.
-
 khugepaged runs usually at low frequency so while one may not want to
 invoke defrag algorithms synchronously during the page faults, it
 should be worth invoking defrag at least in khugepaged. However it's
@@ -254,8 +250,9 @@ The khugepaged progress can be seen in the number of pages collapsed (note
 that this counter may not be an exact count of the number of pages
 collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
 being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
-one 2M hugepage. Each may happen independently, or together, depending on
-the type of memory and the failures that occur. As such, this value should
+one 2M hugepage, or (3) A portion of the PTE mapping 4K pages replaced by
+a mapping to an mTHP. Each may happen independently, or together, depending
+on the type of memory and the failures that occur. As such, this value should
 be interpreted roughly as a sign of progress, and counters in /proc/vmstat
 consulted for more accurate accounting)::
 
@@ -294,6 +291,36 @@ that THP is shared. Exceeding the number would block the collapse::
 
 A higher value may increase memory footprint for some workloads.
 
+Khugepaged specifics for anon-mTHP collapse
+-------------------------------------------
+
+The objective of khugepaged is to collapse memory to the highest aligned order
+possible. If it fails on PMD order, it will greedily try the lower orders.
+
+The tunables max_ptes_shared and max_ptes_swap are considered to be zero for
+mTHP collapsing; i.e., the memory range must not have any shared or swap PTE
+for it to be eligible for mTHP collapse.
+
+The tunable max_ptes_none is scaled downwards, according to the order of
+the collapse. For example, if max_ptes_none = 511, and khugepaged tries to
+collapse to order 4, then the memory range under consideration will become
+a candidate for collapse only when the number of none PTEs (out of the 16 PTEs)
+does not exceed: 511 >> (9 - 4) = 15.
+
+mTHP collapse is supported only if max_ptes_none is either zero or 511 (one less
+than the number of entries in the PTE table). Any other value, given the scaling
+logic presented above, produces what we call the "creep" problem. Let the bitmask
+00110000 denote a memory range mapped by 8 consecutive pagetable entries, where 0
+denotes an empty pte and 1, a pte embedding a physical folio. Let max_ptes_none = 50%
+(i.e., max_ptes_none = 256, which scales to 256 >> (9 - 3) = 4 for order-3). If order-2 and
+order-3 are enabled, khugepaged may do the following: it scans the range for order-3, but
+since the percentage of none ptes = 6/8 * 100 = 75%, it drops down to order-2.
+It successfully collapses to order-2 for the first 4 PTEs, and the memory range becomes:
+11110000
+Now, from the order-3 point of view, the range has 4 out of 8 PTEs filled and has
+suddenly become eligible for order-3 collapse. So, we can creep into large-order
+collapses in a very inefficient manner.
+
 Boot parameters
 ===============
 
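
Illustration (not part of the patch): the max_ptes_none scaling rule and the
"creep" example from the new section can be checked with a small user-space
model. This is only a sketch under stated assumptions, not the khugepaged
implementation: it assumes 4K base pages with 512 PTEs per PMD (PMD order 9),
models a range as a list of 0/1 values, and all names below (PMD_ORDER,
scaled_max_ptes_none, eligible, collapse) are hypothetical helpers introduced
for this example.

```python
# Toy model of the documented scaling rule; not kernel code.
PMD_ORDER = 9  # assumption: 512 PTEs per PMD (4K base pages, 2M PMD)


def scaled_max_ptes_none(max_ptes_none, order):
    """Scale the PMD-level tunable down to a collapse of the given order."""
    return max_ptes_none >> (PMD_ORDER - order)


def eligible(ptes, start, order, max_ptes_none):
    """True if the 2**order window at `start` has few enough none PTEs."""
    window = ptes[start:start + (1 << order)]
    return window.count(0) <= scaled_max_ptes_none(max_ptes_none, order)


def collapse(ptes, start, order):
    """Model a successful collapse: every PTE in the window becomes populated."""
    ptes[start:start + (1 << order)] = [1] * (1 << order)


if __name__ == "__main__":
    # Scaling example from the text: order 4 with max_ptes_none = 511.
    print(scaled_max_ptes_none(511, 4))   # 15

    # Creep example: range 00110000 with max_ptes_none = 256 (50%).
    ptes = [int(c) for c in "00110000"]
    print(eligible(ptes, 0, 3, 256))      # False: 6 none PTEs, limit is 4
    print(eligible(ptes, 0, 2, 256))      # True: 2 none PTEs, limit is 2
    collapse(ptes, 0, 2)
    print("".join(map(str, ptes)))        # 11110000
    print(eligible(ptes, 0, 3, 256))      # True: the range crept into order-3
```

In this model the two supported values avoid the creep: with max_ptes_none = 511
the order-3 window is eligible on the first scan (limit 7, 6 none PTEs), and
with max_ptes_none = 0 neither order is eligible, so no lower-order collapse can
change the outcome for the higher order.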