From patchwork Wed Sep 7 14:45:11 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zach O'Keefe
X-Patchwork-Id: 12969069
Date: Wed, 7 Sep 2022 07:45:11 -0700
X-Mailer: git-send-email 2.37.2.789.g6183377224-goog
Message-ID: <20220907144521.3115321-1-zokeefe@google.com>
Subject: [PATCH mm-unstable v3 00/10] mm: add file/shmem support to MADV_COLLAPSE
From: "Zach O'Keefe"
To: linux-mm@kvack.org
Cc: Andrew Morton, linux-api@vger.kernel.org, Axel Rasmussen,
 James Houghton, Hugh Dickins, Yang Shi, Miaohe Lin, David Hildenbrand,
 David Rientjes, Matthew Wilcox, Pasha Tatashin, Peter Xu, Rongwei Wang,
 SeongJae Park, Song Liu, Vlastimil Babka, Chris Kennelly,
 "Kirill A. Shutemov", Minchan Kim, Patrick Xia, "Zach O'Keefe"

v3 Forward

This version cleans up a few small issues in v2, expands selftest coverage,
rebases on some recent khugepaged changes and adds more details to commit
descriptions to help with review.

The three main cleanups made are:

(1) Patch 2: In hpage_collapse_scan_file() and collapse_file(), don't use
    the xa_state.xa_index to determine if the HPAGE_PMD_ORDER THP is
    properly aligned.  Instead, check the compound_head(page)->index.  Not
    only is it better to not rely on internal data in struct xa_state (as
    the comments above that struct's definition ask), but it is slightly
    more accurate / future proof in case we encounter an unaligned
    compound page of order HPAGE_PMD_ORDER (AFAIK not possible today).
    Moreover, especially for hpage_collapse_scan_file() where the RCU lock
    might be dropped as we traverse the XArray, we want to be checking the
    compound_head(), since otherwise we might erroneously be looking at a
    tail page if a collapse happened from under us.

(2) Patch 2: When hpage_collapse_scan_file() returns
    SCAN_PTE_MAPPED_HUGEPAGE in the khugepaged path, check that the pmd
    maps a pte table before adding the mm/address to the deferred collapse
    array.  The reason is: we will grab mmap_lock in write mode every time
    we attempt collapse_pte_mapped_thp(), so we should try to avoid this
    if possible.  This also prevents khugepaged from repeatedly adding the
    same mm/address pair to the deferred collapse array after the page
    cache has already been updated with the new hugepage, but before the
    memory has been refaulted.

(3) Patch 3: In find_pmd_thp_or_none(), check pmd_none() instead of
    !pmd_present() when detecting pmds that have been cleared.  This check
    exists because MADV_COLLAPSE might be operating on memory which was
    already collapsed by khugepaged, but before the memory had been
    refaulted.  In this case, khugepaged cleared the pmd, and so the
    correct pmd entry to look for is the "none" pmd.
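For readers who want to see the idea behind cleanup (1) in isolation, here
is a small userspace model.  It is only a sketch: struct model_page,
model_compound_head() and hpage_aligned() are hypothetical stand-ins, and
HPAGE_PMD_ORDER is assumed to be 9 (x86-64 with 4 KiB base pages).  It is
not the code from patch 2; it only illustrates why the check is made on
the head page's index rather than on whatever page the walk landed on.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: x86-64 with 4 KiB base pages, so a PMD maps 2^9 pages. */
#define HPAGE_PMD_ORDER	9
#define HPAGE_PMD_NR	(1UL << HPAGE_PMD_ORDER)

/* Hypothetical, heavily simplified stand-in for struct page. */
struct model_page {
	unsigned long index;		/* page cache index of this page */
	struct model_page *head;	/* compound head; NULL if this is the head */
	unsigned int order;		/* compound order; only valid on the head */
};

static struct model_page *model_compound_head(struct model_page *page)
{
	return page->head ? page->head : page;
}

/*
 * True if the page belongs to a PMD-order compound page whose head sits
 * at a PMD-aligned page cache index.  Resolving the head first gives the
 * same answer for head and tail pages alike.
 */
static bool hpage_aligned(struct model_page *page)
{
	struct model_page *head = model_compound_head(page);

	return head->order == HPAGE_PMD_ORDER &&
	       (head->index & (HPAGE_PMD_NR - 1)) == 0;
}
```

In this model, a tail page at index 1027 whose head is at index 1024 is
reported as aligned, while a PMD-order head at index 1000 is not.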

--------------------------------
v2 Forward

Mostly a RESEND: rebase on latest mm-unstable + minor bug fixes from kernel
test robot.
--------------------------------

This series builds on top of the previous "mm: userspace hugepage collapse"
series which introduced the MADV_COLLAPSE madvise mode and added support
for private, anonymous mappings[1], by adding support for file and shmem
backed memory to CONFIG_READ_ONLY_THP_FOR_FS=y kernels.

File and shmem support have been added with effort to align with existing
MADV_COLLAPSE semantics and policy decisions[2].  Collapse of shmem-backed
memory ignores kernel-guiding directives and heuristics including all
sysfs settings (transparent_hugepage/shmem_enabled), and tmpfs huge= mount
options (shmem always supports large folios).  Like anonymous mappings, on
successful return of MADV_COLLAPSE on file/shmem memory, the contents of
memory mapped by the addresses provided will be synchronously pmd-mapped
THPs.

This functionality unlocks two important uses:

(1) Immediately back executable text by THPs.  Current support provided
    by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system,
    which might prevent services from serving at their full rated load
    after (re)starting.  Tricks like mremap(2)'ing text onto anonymous
    memory to immediately realize iTLB performance prevent page sharing
    and demand paging, both of which increase steady state memory
    footprint.  Now, we can have the best of both worlds: Peak upfront
    performance and lower RAM footprints.

(2) userfaultfd-based live migration of virtual machines satisfies UFFD
    faults by fetching native-sized pages over the network (to avoid
    latency of transferring an entire hugepage).  However, after guest
    memory has been fully copied to the new host, MADV_COLLAPSE can be
    used to immediately increase guest performance.

khugepaged has received a small improvement by association and can now
detect and collapse pte-mapped THPs.
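To make the userspace side concrete, the sketch below requests a collapse
of one hugepage of shmem-backed memory.  It is an illustration under
stated assumptions rather than code from this series: the helper names
hpage_pmd_size() and collapse_shared_anon() are made up for the example,
shared anonymous memory is used as the shmem-backed mapping, the hugepage
size is read from sysfs with a 2 MiB fallback, and MADV_COLLAPSE is
defined by hand (25) in case the libc headers predate it.  On kernels
without file/shmem support, madvise() is expected to fail and the helper
simply reports the errno.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* uapi value; missing from older libc headers */
#endif

/* PMD hugepage size from sysfs; falls back to 2 MiB (an assumption). */
static size_t hpage_pmd_size(void)
{
	unsigned long sz = 0;
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

	if (f) {
		if (fscanf(f, "%lu", &sz) != 1)
			sz = 0;
		fclose(f);
	}
	return sz ? (size_t)sz : (size_t)2 << 20;
}

/*
 * Returns 0 if MADV_COLLAPSE succeeded, a positive errno if madvise()
 * refused (e.g. EINVAL on kernels without support), or -1 on setup
 * failure or if the memory contents changed.
 */
static int collapse_shared_anon(void)
{
	size_t hpsz = hpage_pmd_size();
	char *resv, *p;
	int rc = 0;

	/* Over-reserve so a hugepage-aligned address is guaranteed inside. */
	resv = mmap(NULL, 2 * hpsz, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (resv == MAP_FAILED)
		return -1;
	p = (char *)(((uintptr_t)resv + hpsz - 1) & ~((uintptr_t)hpsz - 1));

	/* Shared anonymous memory is shmem-backed; its offset 0 lands at p. */
	p = mmap(p, hpsz, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (p == MAP_FAILED) {
		munmap(resv, 2 * hpsz);
		return -1;
	}
	memset(p, 0x5a, hpsz);		/* fault in every base page */

	if (madvise(p, hpsz, MADV_COLLAPSE))
		rc = errno;
	/* Contents must be unchanged whether or not the collapse happened. */
	if (p[0] != 0x5a || p[hpsz - 1] != 0x5a)
		rc = -1;

	munmap(resv, 2 * hpsz);
	return rc;
}
```

The mapping is placed at a hugepage-aligned address so that both the
virtual address and the offset into the backing object are aligned.  A
caller would treat a return of 0 as "the range is now pmd-mapped" and any
positive value as a reason to fall back to native-sized pages.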

However, there is still work to be done along the file collapse path.
Compound pages of arbitrary order still need to be supported and THP
collapse needs to be converted to using folios in general.  Eventually,
we'd like to move away from the read-only and executable-mapped
constraints currently imposed on eligible files and support any inode
claiming huge folio support.  That said, I think the series as-is covers
enough to claim that MADV_COLLAPSE supports file/shmem memory.

Patches 1-3     Implement the guts of the series.
Patch 4         Is a tracepoint for debugging.
Patches 5-9     Refactor existing khugepaged selftests to work with new
                memory types + new collapse tests.
Patch 10        Adds a userfaultfd selftest mode to mimic a functional test
                of UFFDIO_REGISTER_MODE_MINOR+MADV_COLLAPSE live migration.
                (v3 note: "userfaultfd shmem" selftest is failing as of
                Sep 5 mm-unstable)

Applies against mm-unstable.

[1] https://lore.kernel.org/linux-mm/20220706235936.2197195-1-zokeefe@google.com/
[2] https://lore.kernel.org/linux-mm/YtBmhaiPHUTkJml8@google.com/

Previous versions:
v1: https://lore.kernel.org/linux-mm/20220812012843.3948330-1-zokeefe@google.com/
v2: https://lore.kernel.org/linux-mm/20220826220329.1495407-1-zokeefe@google.com/

v2 -> v3:
- The 3 changes mentioned in the v3 Forward
- Drop redundant PageTransCompound() check in collapse_pte_mapped_thp() in
  "mm/madvise: add file and shmem support to MADV_COLLAPSE" (it is covered
  by PageHead() and hugepage_vma_check() for !HugeTLB).
- In "selftests/vm: add thp collapse file and tmpfs testing", don't assume
  path used for file collapse testing will be on /dev/sda - instead, use
  the major/minor device numbers returned from stat(2) to traverse sysfs
  and find the correct block device.  Also only do stat() statfs() checks
  on user-supplied test directory once (instead of every time we create a
  test file).
- Added "selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared
  pmd" which tests a common case of MADV_COLLAPSE applied to file/shmem
  memory that has been "collapsed" (in the page cache) by khugepaged, but
  not yet refaulted by the process.

v1 -> v2:
- Add missing definition for khugepaged_add_pte_mapped_thp() in
  !CONFIG_SHMEM builds, in "mm/khugepaged: attempt to map
  file/shmem-backed pte-mapped THPs by pmds"
- Minor bugfixes in "mm/madvise: add file and shmem support to
  MADV_COLLAPSE" for !CONFIG_SHMEM, !CONFIG_TRANSPARENT_HUGEPAGE and some
  compiler settings.
- Rebased on latest mm-unstable

Zach O'Keefe (10):
  mm/shmem: add flag to enforce shmem THP in hugepage_vma_check()
  mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  selftests/vm: dedup THP helpers
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory

 include/linux/khugepaged.h                    |  13 +-
 include/linux/shmem_fs.h                      |  10 +-
 include/trace/events/huge_memory.h            |  36 +
 kernel/events/uprobes.c                       |   2 +-
 mm/huge_memory.c                              |   2 +-
 mm/khugepaged.c                               | 304 ++++--
 mm/shmem.c                                    |  18 +-
 tools/testing/selftests/vm/Makefile           |   2 +
 tools/testing/selftests/vm/khugepaged.c       | 904 +++++++++++++-----
 tools/testing/selftests/vm/soft-dirty.c       |   2 +-
 .../selftests/vm/split_huge_page_test.c       |  12 +-
 tools/testing/selftests/vm/userfaultfd.c      | 171 +++-
 tools/testing/selftests/vm/vm_util.c          |  36 +-
 tools/testing/selftests/vm/vm_util.h          |   5 +-
 14 files changed, 1143 insertions(+), 374 deletions(-)
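
As a footnote to the v2 -> v3 item about no longer assuming /dev/sda, the
lookup it describes can be sketched roughly as follows.  This is an
assumption-laden illustration, not the selftest code: sysfs_block_dir()
is a hypothetical helper, and it only builds the /sys/dev/block/MAJ:MIN
path from the st_dev reported by stat(2) without checking that the entry
exists.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/*
 * Hypothetical helper: write "/sys/dev/block/MAJ:MIN" for the device
 * holding `path` into buf.  Returns 0 on success, -1 if stat() fails or
 * buf is too small.
 */
static int sysfs_block_dir(const char *path, char *buf, size_t len)
{
	struct stat st;
	int n;

	if (stat(path, &st))
		return -1;

	/* st_dev identifies the device containing the file. */
	n = snprintf(buf, len, "/sys/dev/block/%u:%u",
		     major(st.st_dev), minor(st.st_dev));
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Filesystems that are not backed by a block device (tmpfs, overlayfs and
similar) report a device number with no matching sysfs entry, so a caller
would still need to check that the resulting directory exists before
reading anything beneath it.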