From patchwork Sun Oct 27 14:21:00 2024
X-Patchwork-Submitter: Leon Romanovsky
X-Patchwork-Id: 13852552
From: Leon Romanovsky
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel, Will Deacon,
    Christoph Hellwig, Sagi Grimberg
Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
    Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
    Jérôme Glisse, Andrew Morton, Jonathan Corbet,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
    iommu@lists.linux.dev, linux-nvme@lists.infradead.org,
    linux-pci@vger.kernel.org, kvm@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 00/18] Provide a new two step DMA mapping API
Date: Sun, 27 Oct 2024 16:21:00 +0200

Currently the only efficient way to map a complex memory description
through the DMA API is by using the scatterlist APIs. The SG APIs are
unique in that they efficiently combine the two fundamental operations of
sizing and allocating a large IOVA window from the IOMMU and processing
all the per-address swiotlb/flushing/p2p/map details.

This uniqueness has been a long-standing pain point as the scatterlist API
is mandatory, but expensive to use. It prevents any kind of optimization
or feature improvement (such as avoiding struct page for P2P) due to the
impossibility of improving the scatterlist.

Several approaches have been explored to expand the DMA API with
additional scatterlist-like structures (BIO[1], rlist[2]). Instead, this
series splits up the DMA API to allow callers to bring their own data
structure.

The API is split up into parts:
 - Allocate IOVA space:
    To do any pre-allocation required. This is done based on the caller
    supplying some details about how much IOMMU address space it would
    need in the worst case.
 - Map and unmap relevant structures to pre-allocated IOVA space:
    Perform the actual mapping into the pre-allocated IOVA. This is very
    similar to dma_map_page().

In this and the next series [1], three example users are converted to the
new API to show its benefits and versatility. Each user has a unique
flow:
 1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
    dynamically map/unmap large numbers of single pages. This becomes
    significantly faster in the IOMMU case because map/unmap is now just
    a page table walk; the IOVA allocation is pre-computed once.
    Significant amounts of memory are saved as there is no longer a need
    to store the dma_addr_t of each page.
 2. VFIO PCI live migration code is building a very large "page list"
    for the device. Instead of allocating a scatter list entry per
    allocated page it can just allocate an array of 'struct page *',
    saving a large amount of memory.
 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
    list without having to allocate then populate an intermediate SG
    table.

To make the use of the new API easier, HMM and block subsystems are
extended to hide the optimization details from the caller. Among these
optimizations:
 * Memory reduction, as in most real use cases there is no need to store
   mapped DMA addresses and unmap them.
 * Reduced function call overhead, by using direct calls instead of
   calling through function pointers.

This is the first step along a path to provide alternatives to
scatterlist and to solve some of the abuses and design mistakes, for
instance in DMABUF's P2P support.

Thanks

[1] https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org

Christoph Hellwig (6):
  PCI/P2PDMA: refactor the p2pdma mapping helpers
  dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
  iommu: generalize the batched sync after map interface
  iommu/dma: Factor out a iommu_dma_map_swiotlb helper
  dma-mapping: add a dma_need_unmap helper
  docs: core-api: document the IOVA-based API

Leon Romanovsky (12):
  dma-mapping: Add check if IOVA can be used
  dma: Provide an interface to allow allocate IOVA
  dma-mapping: Implement link/unlink ranges API
  mm/hmm: let users to tag specific PFN with DMA mapped bit
  mm/hmm: provide generic DMA managing logic
  RDMA/umem: Store ODP access mask information in PFN
  RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage
  RDMA/umem: Separate implicit ODP initialization from explicit ODP
  vfio/mlx5: Explicitly use number of pages instead of allocated length
  vfio/mlx5: Rewrite create mkey flow to allow better code reuse
  vfio/mlx5: Explicitly store page list
  vfio/mlx5: Convert vfio to use DMA link API

 Documentation/core-api/dma-api.rst   |  70 +++++
 drivers/infiniband/core/umem_odp.c   | 250 +++++---------
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  12 +-
 drivers/infiniband/hw/mlx5/odp.c     |  65 ++--
 drivers/infiniband/hw/mlx5/umr.c     |  12 +-
 drivers/iommu/dma-iommu.c            | 455 +++++++++++++++++++++++----
 drivers/iommu/iommu.c                |  65 ++--
 drivers/pci/p2pdma.c                 |  38 +--
 drivers/vfio/pci/mlx5/cmd.c          | 312 +++++++++---------
 drivers/vfio/pci/mlx5/cmd.h          |  24 +-
 drivers/vfio/pci/mlx5/main.c         |  87 +++--
 include/linux/dma-map-ops.h          |  54 ----
 include/linux/dma-mapping.h          |  84 +++++
 include/linux/hmm-dma.h              |  32 ++
 include/linux/hmm.h                  |  16 +
 include/linux/iommu.h                |   4 +
 include/linux/pci-p2pdma.h           |  84 +++++
 include/rdma/ib_umem_odp.h           |  25 +-
 kernel/dma/direct.c                  |  43 ++-
 kernel/dma/mapping.c                 |  20 ++
 mm/hmm.c                             | 229 +++++++++++++-
 21 files changed, 1345 insertions(+), 636 deletions(-)
 create mode 100644 include/linux/hmm-dma.h
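
As an illustration of the split described in the cover letter (reserve
IOVA space once, then link individual ranges into it), here is a
user-space toy model in C. It is not the kernel API and does not use the
function names from this series: toy_iova_state, toy_iova_alloc(),
toy_iova_link(), toy_iova_unlink() and toy_demo() are hypothetical names
invented for this sketch, and the "page table" is a plain array. The
point it tries to show is that, once the window is reserved, each device
address is just base + offset, so the caller does not have to store a
per-page dma_addr_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_MAX_PAGES 8u

/* Hypothetical state: one reserved IOVA window plus a toy "page table". */
struct toy_iova_state {
	uint64_t base;              /* start of the reserved IOVA window */
	size_t size;                /* window size in bytes */
	uint64_t pt[TOY_MAX_PAGES]; /* slot -> physical address, 0 = unmapped */
};

/* Step 1: reserve a window sized for the worst case; done once. */
static int toy_iova_alloc(struct toy_iova_state *st, uint64_t base, size_t size)
{
	assert(st != NULL);
	if (size == 0 || size % TOY_PAGE_SIZE ||
	    size / TOY_PAGE_SIZE > TOY_MAX_PAGES)
		return -1;
	st->base = base;
	st->size = size;
	for (size_t i = 0; i < TOY_MAX_PAGES; i++)
		st->pt[i] = 0;
	return 0;
}

/* Step 2: link one physical page at a byte offset inside the window. */
static int toy_iova_link(struct toy_iova_state *st, uint64_t phys,
			 size_t offset, uint64_t *iova)
{
	if (offset % TOY_PAGE_SIZE || offset >= st->size)
		return -1;
	st->pt[offset / TOY_PAGE_SIZE] = phys;
	*iova = st->base + offset; /* derived from the offset, not stored */
	return 0;
}

/* Undo a single link; the window itself stays reserved. */
static void toy_iova_unlink(struct toy_iova_state *st, size_t offset)
{
	if (offset % TOY_PAGE_SIZE == 0 && offset < st->size)
		st->pt[offset / TOY_PAGE_SIZE] = 0;
}

/* Map three discontiguous pages into one window, then unmap the middle one. */
static int toy_demo(void)
{
	struct toy_iova_state st;
	const uint64_t phys[3] = { 0x7000, 0x2000, 0x9000 };
	uint64_t iova;

	if (toy_iova_alloc(&st, 0x100000, 3 * TOY_PAGE_SIZE))
		return -1;
	for (size_t i = 0; i < 3; i++) {
		if (toy_iova_link(&st, phys[i], i * TOY_PAGE_SIZE, &iova))
			return -1;
		if (iova != st.base + i * TOY_PAGE_SIZE)
			return -1;
	}
	toy_iova_unlink(&st, TOY_PAGE_SIZE);
	return st.pt[1] == 0 ? 0 : -1;
}
```

Calling toy_demo() from a main() exercises the whole flow. In this model
the allocation step runs once, while link and unlink only touch one table
slot each, which mirrors the reasoning given above for the RDMA ODP case.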