From patchwork Fri Oct 27 18:21:59 2023
X-Patchwork-Submitter: Sean Christopherson
X-Patchwork-Id: 13438944
Reply-To: Sean Christopherson
Date: Fri, 27 Oct 2023 11:21:59 -0700
In-Reply-To: <20231027182217.3615211-1-seanjc@google.com>
Mime-Version: 1.0
References: <20231027182217.3615211-1-seanjc@google.com>
X-Mailer: git-send-email 2.42.0.820.g83a721a137-goog
Message-ID: <20231027182217.3615211-18-seanjc@google.com>
Subject: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
From: Sean Christopherson
To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
 Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou, Sean Christopherson,
 Alexander Viro, Christian Brauner, "Matthew Wilcox (Oracle)", Andrew Morton
Cc: kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 kvmarm@lists.linux.dev, linux-mips@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org,
 linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Xiaoyao Li, Xu Yilun,
 Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, David Matlack,
 Yu Zhang, Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
 Vishal Annapurve, Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
 Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata,
 "Kirill A. Shutemov"
Shutemov" X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: 4F57440006 X-Stat-Signature: n3tkcjtd43ta4stzes1feuuooyhq56k7 X-Rspam-User: X-HE-Tag: 1698430980-866086 X-HE-Meta: U2FsdGVkX18XPHHRbFFR/SmPB1wnSms0RDgbcJRQ6LSmwpF3ZYW5+kbLvPxfoyhyB4aPPQbNY1t1LzO9YHT/C/cjX5c7oNA2lFIpflCb4ScA8B+sGjvmKaVjE17ji+WPvbLma4pqDiwIsidL41eRYdyiG7xlxKNmbClv9vjtOYzaCXK8z7W9uCnZiZrbVokSJcvu4yd8cD81TDxH2wVGwFsyqhl/aN9JHGevIaR8NU8PzVtAiB3Xr5LFigLc5xTDOs9QZFIMcsryrUNSduAKHPECr7kd2txXVCBBlYXgq4i9VfDRhpwaVriI8ZYixCi8y1HCV0H28M22oyuNgYf3KJoT0tcF6+1ECdVmPfOkMhAumnvQgY9aTmTCbw+X6kQllJAIYRE7g+srpDYYJdov3ztNIXS4HIm799vSQoHyYMC4C9ybeZF/lQaQQwqd6L/MjOVOVmVEwdR5TQue/K5O6l+82BiftLmWbT/PXfMER67s3LnlIx9P/DuribZTZ996sRymIhcysBohfW5DoTTiICtxvNiurpS1EyMmozT+9xuTYpg1iqld3zDsRyYoQAX0ClzcBkP2fxoWzTG7KS1uy5eltnSNMKWJ2uidGntDPSdgamJyu3TeEHS8DMTXu6MsldEZcS+wRl6kIXD86PfYGrl3Kom+Xx/eWeVEeesLsJRrdZbmCzY3I3zgVnOPGaNDewUrPf6bsWefeFug9N0RlgswFoPPJTEY9D7UyFxgT8KVswPGX4uamLRlFNtqAU1jQDicUB27scupYT+TxbG2ZNkm5CSLZhjrZcwnOKg0in0Mq310mibJBANcwTFQ7NdbVAOJjvXz82NigFWycFnwgbtp8oZ3k+sUfUTMRfmwBLH1kJD/vRKJ3P7eki5Ly3tEWdh7SxvX4XZRrtUtAxE8xe2tfZ43u05rVMSXRJK50dE3GcROIva8UPwGNSYT7jo7/mAhSVUyI21W1eerVAN eGP2X4Bw jfzbKf/lBcH96WvlSrp8VMMIeFr0G06YfR/0zycLAGTUBCTrpOs1hBS5qy2XAPLIxkus9VzfPg4fbP3jT49LldvkK1aeE7875y3xuP3PCk7TnJQcPF7S4+OYFlq7dFMf0zDfvZiZzzumS/KS5VWwhBL97cAq2d6XmOmVFWa1zhbLVGk8/zHCIrrXSSXJxpeCEnhKFeYvx5KREODMTgvN8ql78ulFJS47Y20zxMEm/wT1jorTUt3KWmp7L2g7Wqo08n5phJoj4KgkR8R4HPay9yMxzs6MGT8CDBc9vkVPc6sDvzCiw3fnoacMHs7cIZt26vR57yC4ujcpkYlU+fEwyD7ilJeTB16+K4gf3ruvib+0OLXit5r/CGRVyeU6rZUZLlzMxbpVWxd+RM2HrTnTsXNt5fmPiywh5mydEU70vJkAV+SQkdi/jOqEzGLQHyok062JGDN229twMf/LvXJ1UmC0OzcLsUVrFLDEerOhDBfJL2fRwSxVRtkHv4EihraMVqcXYlFOE5Mz36CuJlzfdgKjYGq1wIB5DpYDaXKnlTSl6QLrl0VDQ+lRZXaZOnvQGMtTv8fq8NJivJXisSiXcazvi9OzvAnSp0fTYx0vCIjWzvgnYRMbA34encHYjvbmzmhq4XL+XQalVb8uvZRwe+FOKjSHnn8FxYLS5f9wlng9Ld6HFkFHC9qFXMT7eqMEwu3OCQuLI85e/UtXitHgLJz6tYIPR4d2RZaVfPFFNXFo+z0Cd7gBXflemcqZPdgMiQSS4qCCoA6Bhj8E= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Extended guest_memfd to allow backing guest memory with transparent hugepages. Require userspace to opt-in via a flag even though there's no known/anticipated use case for forcing small pages as THP is optional, i.e. to avoid ending up in a situation where userspace is unaware that KVM can't provide hugepages. For simplicity, require the guest_memfd size to be a multiple of the hugepage size, e.g. so that KVM doesn't need to do bounds checking when deciding whether or not to allocate a huge folio. When reporting the max order when KVM gets a pfn from guest_memfd, force order-0 pages if the hugepage is not fully contained by the memslot binding, e.g. if userspace requested hugepages but punches a hole in the memslot bindings in order to emulate x86's VGA hole. Signed-off-by: Sean Christopherson Signed-off-by: Sean Christopherson --- Documentation/virt/kvm/api.rst | 7 ++++ include/uapi/linux/kvm.h | 2 + virt/kvm/guest_memfd.c | 73 ++++++++++++++++++++++++++++++---- 3 files changed, 75 insertions(+), 7 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index e82c69d5e755..7f00c310c24a 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6176,6 +6176,8 @@ and cannot be resized (guest_memfd files do however support PUNCH_HOLE). 
   __u64 reserved[6];
  };

+  #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE		(1ULL << 0)
+
 Conceptually, the inode backing a guest_memfd file represents physical memory,
 i.e. is coupled to the virtual machine as a thing, not to a "struct kvm".  The
 file itself, which is bound to a "struct kvm", is that instance's view of the
@@ -6192,6 +6194,11 @@ most one mapping per page, i.e. binding multiple memory regions to a single
 guest_memfd range is not allowed (any number of memory regions can be bound to
 a single guest_memfd file, but the bound ranges must not overlap).

+If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set in flags, KVM will attempt to allocate
+and map hugepages for the guest_memfd file.  This is currently best effort.  If
+KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set, the size must be aligned to the maximum
+transparent hugepage size supported by the kernel.
+
 See KVM_SET_USER_MEMORY_REGION2 for additional details.

 5. The kvm_run structure

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 25caee8d1a80..33d542de0a61 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2303,4 +2303,6 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };

+#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE		(1ULL << 0)
+
 #endif /* __LINUX_KVM_H */

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 98a12da80214..94bc478c26f3 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -13,14 +13,47 @@ struct kvm_gmem {
 	struct list_head entry;
 };

+static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long huge_index = round_down(index, HPAGE_PMD_NR);
+	unsigned long flags = (unsigned long)inode->i_private;
+	struct address_space *mapping = inode->i_mapping;
+	gfp_t gfp = mapping_gfp_mask(mapping);
+	struct folio *folio;
+
+	if (!(flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE))
+		return NULL;
+
+	if (filemap_range_has_page(mapping, huge_index << PAGE_SHIFT,
+				   (huge_index + HPAGE_PMD_NR - 1) << PAGE_SHIFT))
+		return NULL;
+
+	folio = filemap_alloc_folio(gfp, HPAGE_PMD_ORDER);
+	if (!folio)
+		return NULL;
+
+	if (filemap_add_folio(mapping, folio, huge_index, gfp)) {
+		folio_put(folio);
+		return NULL;
+	}
+
+	return folio;
+#else
+	return NULL;
+#endif
+}
+
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 {
 	struct folio *folio;

-	/* TODO: Support huge pages. */
-	folio = filemap_grab_folio(inode->i_mapping, index);
-	if (IS_ERR_OR_NULL(folio))
-		return NULL;
+	folio = kvm_gmem_get_huge_folio(inode, index);
+	if (!folio) {
+		folio = filemap_grab_folio(inode->i_mapping, index);
+		if (IS_ERR_OR_NULL(folio))
+			return NULL;
+	}

 	/*
 	 * Use the up-to-date flag to track whether or not the memory has been
@@ -373,6 +406,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	inode->i_mode |= S_IFREG;
 	inode->i_size = size;
 	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_large_folios(inode->i_mapping);
 	mapping_set_unmovable(inode->i_mapping);
 	/* Unmovable mappings are supposed to be marked unevictable as well.
	 */
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));

@@ -398,12 +432,21 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	u64 flags = args->flags;
 	u64 valid_flags = 0;

+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+
 	if (flags & ~valid_flags)
 		return -EINVAL;

 	if (size < 0 || !PAGE_ALIGNED(size))
 		return -EINVAL;

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
+	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
+		return -EINVAL;
+#endif
+
 	return __kvm_gmem_create(kvm, size, flags);
 }

@@ -501,7 +544,7 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
 {
-	pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+	pgoff_t index, huge_index;
 	struct kvm_gmem *gmem;
 	struct folio *folio;
 	struct page *page;
@@ -514,6 +557,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,

 	gmem = file->private_data;

+	index = gfn - slot->base_gfn + slot->gmem.pgoff;
 	if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
 		r = -EIO;
 		goto out_fput;
@@ -533,9 +577,24 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	page = folio_file_page(folio, index);

 	*pfn = page_to_pfn(page);
-	if (max_order)
+	if (!max_order)
+		goto success;
+
+	*max_order = compound_order(compound_head(page));
+	if (!*max_order)
+		goto success;
+
+	/*
+	 * The folio can be mapped with a hugepage if and only if the folio is
+	 * fully contained by the range the memslot is bound to.  Note, the
+	 * caller is responsible for handling gfn alignment, this only deals
+	 * with the file binding.
+	 */
+	huge_index = ALIGN(index, 1ull << *max_order);
+	if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
+	    huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages)
 		*max_order = 0;
-
+success:
 	r = 0;
 out_unlock:
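
As a rough illustration of how a VMM might opt in to the new flag, a minimal
userspace sketch follows. It is not part of the patch: the helper name, the
vm_fd parameter, and the 2MiB hugepage size are assumptions (the PMD size on
x86), and it presumes kernel headers that already carry this series, so that
<linux/kvm.h> provides KVM_CREATE_GUEST_MEMFD, struct kvm_create_guest_memfd,
and KVM_GUEST_MEMFD_ALLOW_HUGEPAGE.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: create a guest_memfd for nr_bytes of guest memory and
 * opt in to transparent hugepages.  The size is rounded up to an assumed 2MiB
 * PMD granule because KVM_GUEST_MEMFD_ALLOW_HUGEPAGE requires a hugepage
 * aligned size (see kvm_gmem_create() above).
 */
static int create_thp_gmem(int vm_fd, uint64_t nr_bytes)
{
	const uint64_t hpage_size = 2ULL << 20;	/* assumption: x86 PMD size */
	struct kvm_create_guest_memfd gmem = {
		.size  = (nr_bytes + hpage_size - 1) & ~(hpage_size - 1),
		.flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
	};

	/* Returns a new guest_memfd file descriptor on success, -1 on error. */
	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
}

The returned fd would then be handed to KVM_SET_USER_MEMORY_REGION2 via the
memslot's guest_memfd/guest_memfd_offset fields, per the api.rst additions
above.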
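
On the *max_order clamp at the end of kvm_gmem_get_pfn(): per the comment in
the hunk, a hugepage may only be mapped if the folio is fully contained by the
range the memslot is bound to. The standalone sketch below restates that
containment rule with made-up numbers; the names mirror the kernel code, but
it rounds the faulting index down to the folio start rather than copying the
kernel's ALIGN() arithmetic line for line, so treat it as an approximation of
the intent, not the implementation.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * A hugepage of (1 << order) pages containing file page 'index' is usable
 * only if the whole aligned range lies inside the binding [pgoff, pgoff + npages).
 */
static bool hugepage_fits_binding(uint64_t index, unsigned int order,
				  uint64_t pgoff, uint64_t npages)
{
	uint64_t nr = 1ULL << order;
	uint64_t huge_index = index & ~(nr - 1);		/* folio start */
	uint64_t first_bound = (pgoff + nr - 1) & ~(nr - 1);	/* first fully bound hugepage */

	return huge_index >= first_bound && huge_index + nr <= pgoff + npages;
}

int main(void)
{
	/* Binding covers file pages [0, 1024): an order-9 (2MiB) folio at 512 fits. */
	printf("%d\n", hugepage_fits_binding(512, 9, 0, 1024));	/* 1 */
	/* Binding covers [0, 768): the folio spanning [512, 1024) straddles the end. */
	printf("%d\n", hugepage_fits_binding(700, 9, 0, 768));	/* 0 -> force order-0 */
	return 0;
}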