From: Ryan Roberts
To: Andrew Morton, "Matthew Wilcox (Oracle)", "Yin, Fengwei", Yu Zhao
Cc: Ryan Roberts, linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: [RFC PATCH 0/6] variable-order, large folios for anonymous memory
Date: Fri, 17 Mar 2023 10:57:56 +0000
Message-Id: <20230317105802.2634004-1-ryan.roberts@arm.com>

Hi All,

This is an RFC for an initial, _very_ limited stab at implementing support for
using variable-order, large folios for anonymous memory. It intends to be the
minimal change, upon which additions can be made incrementally. That said, with
just this change, I achieve a 4% performance improvement when compiling the
kernel (more on that later).

My motivation for posting the RFC now is twofold:

- Get feedback on the approach I'm taking before I go too far down the path;
  does this fit with the direction of the community? Are there any bear traps
  that I've not considered (due to my being fairly new to mm and not having a
  complete understanding of its entirety)?

- Seek support for a bug I'm encountering when MADV_FREE is attempting to
  split_folio() one of these new variable-order anon folios. I've been poring
  over the source and can't find the root cause.
  For now I have a workaround, but hopefully someone can give me some pointers
  as to where the problem is likely to be (see details below).

The patches apply on top of v6.3-rc1 + patches 1-31 of [4] (which needed one
minor conflict resolution). And I have a tree at [5].

See [1], [2], [3] for more background.


Approach
========

For now, I'm only modifying the allocation path (do_anonymous_page()). I'm not
touching the CoW path. First, I determine the order of the folio to allocate
for the given fault. This is determined by:

- Folio must be naturally aligned within VA space
- Folio must not breach boundaries of vma
- Folio must be fully contained inside one pmd entry
- Folio must not overlap any non-none ptes
- Order must not be higher than a provided starting order

The "provided starting order" is currently hardcoded to 4, but the idea is
that this would eventually be a per-vma value that gets dynamically tuned.

We then try to allocate a large folio of the determined order, and keep trying
to allocate with successively smaller orders until we succeed. Once the folio
is allocated we can take the PTL and re-check that all the covered PTE entries
are still none. If not, we decrement the order and start again.

Next, the folio is added to the rmap using a new API,
folio_add_new_anon_rmap_range(), which is similar to Yin, Fengwei's
folio_add_file_rmap_range() at [4]. Finally, the ptes are set using Matthew
Wilcox's new set_ptes() API, also at [4].

Folio/page refcounts and mapcounts are managed in the same way as Yin, Fengwei
is doing in folio_add_file_rmap_range(): a reference is taken on the folio for
each pte, _mapcount is incremented on each page by 1, and
folio->_nr_pages_mapped is set to the number of pages in the folio (since every
page is initially mapped).

It is my assumption that the mm should be able to deal with these folios
correctly for CoW and reclaim etc., although perhaps not as optimally as we
would (eventually) like.

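To make the order-selection rules above concrete, here is a minimal userspace
sketch in C. It is not the code from the patches: the names (sketch_vma,
sketch_calc_order(), sketch_pte_none()), the 4 KiB page size, the 2 MiB pmd
coverage and the callback standing in for a pte_none() walk are all
assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u /* assumption: 4 KiB base pages */
#define SKETCH_PMD_SHIFT  21u /* assumption: one pmd entry covers 2 MiB */

/* Hypothetical stand-in for a vma: covers [start, end). */
struct sketch_vma {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical callback: true if the pte mapping addr is none. */
typedef bool (*sketch_pte_none_fn)(uint64_t addr, void *ctx);

/* Toy pte state: at most one page-aligned address is already populated. */
struct sketch_pte_ctx {
	uint64_t populated_addr; /* UINT64_MAX means nothing is populated */
};

static bool sketch_pte_none(uint64_t addr, void *ctx)
{
	const struct sketch_pte_ctx *c = ctx;

	return addr != c->populated_addr;
}

/*
 * Return the highest order <= start_order whose naturally aligned range
 * around fault_addr stays inside the vma, stays inside one pmd entry and
 * covers only none ptes. Order 0 (a single page) is the fallback.
 */
static int sketch_calc_order(const struct sketch_vma *vma, uint64_t fault_addr,
			     int start_order, sketch_pte_none_fn pte_is_none,
			     void *ctx)
{
	assert(start_order >= 0);

	for (int order = start_order; order > 0; order--) {
		uint64_t size = (uint64_t)1 << (SKETCH_PAGE_SHIFT + order);
		uint64_t base = fault_addr & ~(size - 1); /* natural alignment */
		uint64_t end = base + size;
		bool all_none = true;

		/* Must not breach the boundaries of the vma. */
		if (base < vma->start || end > vma->end)
			continue;

		/* Must be fully contained inside one pmd entry. */
		if ((base >> SKETCH_PMD_SHIFT) != ((end - 1) >> SKETCH_PMD_SHIFT))
			continue;

		/* Must not overlap any non-none pte. */
		for (uint64_t addr = base; addr < end;
		     addr += (uint64_t)1 << SKETCH_PAGE_SHIFT) {
			if (!pte_is_none(addr, ctx)) {
				all_none = false;
				break;
			}
		}

		if (all_none)
			return order;
	}

	return 0;
}
```

With a starting order of 4, a fault at 0x20000 in a vma covering
[0x10000, 0x30000) with no populated ptes yields order 4 in this sketch; if
the vma instead starts at 0x18000 and the fault is at 0x1f000, the order-4
range would begin before the vma, so the sketch falls back to order 3. The
real path additionally has to retry on allocation failure and re-check the
ptes under the PTL, which this sketch does not model.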

Bug(s)
======

When I run this code without the last (workaround) patch, with DEBUG_VM et al,
PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these relate
to invalid kernel addresses (which usually look like either NULL + small offset
or mostly zeros with a few mid-order bits set + a small offset) or lockdep
complaining about a bad unlock balance. Call stacks are often in
madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
email example oopses out separately if anyone wants to review them.) My hunch
is that struct pages adjacent to the folio are being corrupted, but I don't
have hard evidence.

When adding the workaround patch, which prevents madvise_free_pte_range() from
attempting to split a large folio, I never see any issues. However, I'm not
putting the system under memory pressure, so I guess I might see the same types
of problem crop up under swap, etc.

I've reviewed most of the code within split_folio() and can't find any smoking
gun, but I wonder if there are implicit assumptions about the large folio being
PMD sized that I'm obviously breaking now?

The code in madvise_free_pte_range():

	if (folio_test_large(folio)) {
		if (folio_mapcount(folio) != 1)
			goto out;
		folio_get(folio);
		if (!folio_trylock(folio)) {
			folio_put(folio);
			goto out;
		}
		pte_unmap_unlock(orig_pte, ptl);
		if (split_folio(folio)) {
			folio_unlock(folio);
			folio_put(folio);
			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
			goto out;
		}
		...
	}

will normally skip my large folios because they have a mapcount > 1, due to
incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it
will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
or CoW in this case?

I also see it's doing a dance to take the folio lock and drop the ptl. Perhaps
my large anon folio is not using the folio lock in the same way as a THP would
and we are therefore not getting the expected serialization?

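As an illustration of the mapcount observation above, here is a toy userspace
model in C. It is only a sketch under stated assumptions: the toy_* names are
hypothetical, the per-page counter is a plain count of mapping ptes (the
kernel's _mapcount is biased differently), and the folio-wide value is assumed
to be the sum over all pages. It shows one way the total could reach 1 in such
a model; it does not establish what actually happens in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_PAGES 16u /* assumption: order-4 folio, 16 pages */

/* Hypothetical model of a pte-mapped large folio. */
struct toy_folio {
	unsigned int nr_pages;
	int page_mapcount[TOY_MAX_PAGES]; /* ptes mapping each page */
};

/* Map every page once, as happens when the folio is first faulted in. */
static void toy_map_all(struct toy_folio *f, unsigned int nr_pages)
{
	assert(nr_pages <= TOY_MAX_PAGES);

	f->nr_pages = nr_pages;
	for (unsigned int i = 0; i < nr_pages; i++)
		f->page_mapcount[i] = 1;
}

/* Drop the single pte mapping page idx (e.g. a partial unmap). */
static void toy_unmap_page(struct toy_folio *f, unsigned int idx)
{
	assert(idx < f->nr_pages);
	assert(f->page_mapcount[idx] > 0);

	f->page_mapcount[idx]--;
}

/* Toy folio-wide mapcount: the sum of the per-page counts. */
static int toy_folio_mapcount(const struct toy_folio *f)
{
	int total = 0;

	for (unsigned int i = 0; i < f->nr_pages; i++)
		total += f->page_mapcount[i];

	return total;
}

/* Mirrors the "mapcount != 1 -> goto out" test quoted above. */
static bool toy_would_attempt_split(const struct toy_folio *f)
{
	return toy_folio_mapcount(f) == 1;
}
```

In this model a freshly faulted 16-page folio has a total of 16, so the check
skips it. If 15 of the 16 ptes are later removed, the total drops to 1 and the
split path would be entered with the folio only partially mapped.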
I'd really appreciate any suggestions for how to progress here!


Performance
===========

With the above bug worked around, I'm benchmarking kernel compilation, which is
known to be heavy on anonymous page faults. Overall, I see a 4% reduction in
wall time. This is in line with my predictions based on earlier experiments
summarised at [1]. I believe there is scope for future improvement on the CoW
and reclaim paths. I'd also expect to see performance improvements due to
reduced TLB pressure on CPUs that support HPA (I'm running on Ampere Altra
where HPA is not enabled).

Of the 4%, all of it is (obviously) in the kernel; overall kernel execution
time has reduced by 34%, more than halving the time spent servicing data
faults, and significantly speeding up sys_exit_group().

Thanks,
Ryan

[1] https://lore.kernel.org/linux-mm/4c991dcb-c5bb-86bb-5a29-05df24429607@arm.com/
[2] https://lore.kernel.org/linux-mm/a7cd938e-a86f-e3af-f56c-433c92ac69c2@arm.com/
[3] https://lore.kernel.org/linux-mm/Y%2FblF0GIunm+pRIC@casper.infradead.org/
[4] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[5] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc

Ryan Roberts (6):
  mm: Expose clear_huge_page() unconditionally
  mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  mm: Introduce try_vma_alloc_zeroed_movable_folio()
  mm: Implement folio_add_new_anon_rmap_range()
  mm: Allocate large folios for anonymous memory
  WORKAROUND: Don't split large folios on madvise

 arch/alpha/include/asm/page.h   |   5 +-
 arch/arm64/include/asm/page.h   |   3 +-
 arch/arm64/mm/fault.c           |   7 +-
 arch/ia64/include/asm/page.h    |   5 +-
 arch/m68k/include/asm/page_no.h |   7 +-
 arch/s390/include/asm/page.h    |   5 +-
 arch/x86/include/asm/page.h     |   5 +-
 include/linux/highmem.h         |  23 +++--
 include/linux/mm.h              |   3 +-
 include/linux/rmap.h            |   2 +
 mm/madvise.c                    |   8 ++
 mm/memory.c                     | 167 ++++++++++++++++++++++++++++----
 mm/rmap.c                       |  43 ++++++++
 13 files changed, 239 insertions(+), 44 deletions(-)

---
2.25.1