mbox series

[v1,00/17] Provide a new two step DMA mapping API

Message ID cover.1730298502.git.leon@kernel.org (mailing list archive)
Headers show
Series Provide a new two step DMA mapping API | expand

Message

Leon Romanovsky Oct. 30, 2024, 3:12 p.m. UTC
Changelog:
v1: 
 * Squashed two VFIO patches into one
 * Added Acked-by/Reviewed-by tags
 * Fix docs spelling errors
 * Simplified dma_iova_sync() API
 * Added extra check in dma_iova_destroy() if mapped size to make code more clear
 * Fixed checkpatch warnings in p2p patch
 * Changed implementation of VFIO mlx5 mlx5vf_add_migration_pages() to
   be more general
 * Reduced the number of changes in VFIO patch
v0: https://lore.kernel.org/all/cover.1730037276.git.leon@kernel.org

----------------------------------------------------------------------------
The code can be downloaded from:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git tag:dma-split-oct-30

----------------------------------------------------------------------------
Currently the only efficient way to map a complex memory description through
the DMA API is by using the scatterlist APIs. The SG APIs are unique in that
they efficiently combine the two fundamental operations of sizing and allocating
a large IOVA window from the IOMMU and processing all the per-address
swiotlb/flushing/p2p/map details.

This uniqueness has been a long standing pain point as the scatterlist API
is mandatory, but expensive to use. It prevents any kind of optimization or
feature improvement (such as avoiding struct page for P2P) due to the impossibility
of improving the scatterlist.

Several approaches have been explored to expand the DMA API with additional
scatterlist-like structures (BIO, rlist), instead split up the DMA API
to allow callers to bring their own data structure.

The API is split up into parts:
 - Allocate IOVA space:
    To do any pre-allocation required. This is done based on the caller
    supplying some details about how much IOMMU address space it would need
    in worst case.
 - Map and unmap relevant structures to pre-allocated IOVA space:
    Perform the actual mapping into the pre-allocated IOVA. This is very
    similar to dma_map_page().

In this and the next series [1], examples of three different users are converted
to the new API to show the benefits and its versatility. Each user has a unique
flow:
 1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
    dynamically map/unmap large numbers of single pages. This becomes
    significantly faster in the IOMMU case as the map/unmap is now just
    a page table walk, the IOVA allocation is pre-computed once. Significant
    amounts of memory are saved as there is no longer a need to store the
    dma_addr_t of each page.
 2. VFIO PCI live migration code is building a very large "page list"
    for the device. Instead of allocating a scatter list entry per allocated
    page it can just allocate an array of 'struct page *', saving a large
    amount of memory.
 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
    list without having to allocate then populate an intermediate SG table.

To make the use of the new API easier, HMM and block subsystems are extended
to hide the optimization details from the caller. Among these optimizations:
 * Memory reduction as in most real use cases there is no need to store mapped
   DMA addresses and unmap them.
 * Reducing the function call overhead by removing the need to call function
   pointers and use direct calls instead.

This step is first along a path to provide alternatives to scatterlist and
solve some of the abuses and design mistakes, for instance in DMABUF's P2P
support.

Thanks

[1] This still points to v0, as the change is just around handling dma_iova_sync():
https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org

Christoph Hellwig (6):
  PCI/P2PDMA: Refactor the p2pdma mapping helpers
  dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
  iommu: generalize the batched sync after map interface
  iommu/dma: Factor out a iommu_dma_map_swiotlb helper
  dma-mapping: add a dma_need_unmap helper
  docs: core-api: document the IOVA-based API

Leon Romanovsky (11):
  dma-mapping: Add check if IOVA can be used
  dma: Provide an interface to allow allocate IOVA
  dma-mapping: Implement link/unlink ranges API
  mm/hmm: let users to tag specific PFN with DMA mapped bit
  mm/hmm: provide generic DMA managing logic
  RDMA/umem: Store ODP access mask information in PFN
  RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page
    linkage
  RDMA/umem: Separate implicit ODP initialization from explicit ODP
  vfio/mlx5: Explicitly use number of pages instead of allocated length
  vfio/mlx5: Rewrite create mkey flow to allow better code reuse
  vfio/mlx5: Convert vfio to use DMA link API

 Documentation/core-api/dma-api.rst   |  70 ++++
 drivers/infiniband/core/umem_odp.c   | 250 +++++----------
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  12 +-
 drivers/infiniband/hw/mlx5/odp.c     |  65 ++--
 drivers/infiniband/hw/mlx5/umr.c     |  12 +-
 drivers/iommu/dma-iommu.c            | 459 +++++++++++++++++++++++----
 drivers/iommu/iommu.c                |  65 ++--
 drivers/pci/p2pdma.c                 |  38 +--
 drivers/vfio/pci/mlx5/cmd.c          | 373 +++++++++++-----------
 drivers/vfio/pci/mlx5/cmd.h          |  35 +-
 drivers/vfio/pci/mlx5/main.c         |  87 +++--
 include/linux/dma-map-ops.h          |  54 ----
 include/linux/dma-mapping.h          |  85 +++++
 include/linux/hmm-dma.h              |  32 ++
 include/linux/hmm.h                  |  16 +
 include/linux/iommu.h                |   4 +
 include/linux/pci-p2pdma.h           |  84 +++++
 include/rdma/ib_umem_odp.h           |  25 +-
 kernel/dma/direct.c                  |  44 +--
 kernel/dma/mapping.c                 |  20 ++
 mm/hmm.c                             | 231 +++++++++++++-
 21 files changed, 1377 insertions(+), 684 deletions(-)
 create mode 100644 include/linux/hmm-dma.h

Comments

Jens Axboe Oct. 31, 2024, 1:44 a.m. UTC | #1
On 10/30/24 9:12 AM, Leon Romanovsky wrote:
> Changelog:
> v1: 
>  * Squashed two VFIO patches into one
>  * Added Acked-by/Reviewed-by tags
>  * Fix docs spelling errors
>  * Simplified dma_iova_sync() API
>  * Added extra check in dma_iova_destroy() if mapped size to make code more clear
>  * Fixed checkpatch warnings in p2p patch
>  * Changed implementation of VFIO mlx5 mlx5vf_add_migration_pages() to
>    be more general
>  * Reduced the number of changes in VFIO patch
> v0: https://lore.kernel.org/all/cover.1730037276.git.leon@kernel.org
> 
> ----------------------------------------------------------------------------
> The code can be downloaded from:
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git tag:dma-split-oct-30

On Christoph's request, I tested this series last week and saw some
pretty significant performance regressions on my box. I don't know what
the status is in terms of that, just want to make sure something like
this doesn't get merged until that is both fully understood and sorted
out.
Christoph Hellwig Oct. 31, 2024, 8:34 a.m. UTC | #2
On Wed, Oct 30, 2024 at 07:44:13PM -0600, Jens Axboe wrote:
> On Christoph's request, I tested this series last week and saw some
> pretty significant performance regressions on my box. I don't know what
> the status is in terms of that, just want to make sure something like
> this doesn't get merged until that is both fully understood and sorted
> out.

Working on it, but I have way too many things going on at once.  Note
that the weird thing about your setup was that we apparently dropped into
the slow path, which still puzzles me.  But I should probably also look
into making that path a little less slow.
Leon Romanovsky Oct. 31, 2024, 9:05 a.m. UTC | #3
On Thu, Oct 31, 2024 at 09:34:50AM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 07:44:13PM -0600, Jens Axboe wrote:
> > On Christoph's request, I tested this series last week and saw some
> > pretty significant performance regressions on my box. I don't know what
> > the status is in terms of that, just want to make sure something like
> > this doesn't get merged until that is both fully understood and sorted
> > out.

This series is a subset of the series you tested and doesn't include the
block layer changes which most likely were the cause of the performance
regression.

This is why I separated the block layer changes from the rest of the series
and marked them as RFC.

The current patch set is viable for HMM and VFIO. Can you please retest
only this series and leave the block layer changes for later till Christoph
finds the answer for the performance regression?

Thanks

> 
> Working on it, but I have way too many things going on at once.  Note
> that the weird thing about your setup was that we apparently dropped into
> the slow path, which still puzzles me.  But I should probably also look
> into making that path a little less slow.
> 
>
Christoph Hellwig Oct. 31, 2024, 9:21 a.m. UTC | #4
On Thu, Oct 31, 2024 at 11:05:30AM +0200, Leon Romanovsky wrote:
> This series is a subset of the series you tested and doesn't include the
> block layer changes which most likely were the cause of the performance
> regression.
> 
> This is why I separated the block layer changes from the rest of the series
> and marked them as RFC.
> 
> The current patch set is viable for HMM and VFIO. Can you please retest
> only this series and leave the block layer changes for later till Christoph
> finds the answer for the performance regression?

As the subset doesn't touch block code or code called by block I don't
think we need Jens to benchmark it, unless he really wants to.
Leon Romanovsky Oct. 31, 2024, 9:37 a.m. UTC | #5
On Thu, Oct 31, 2024 at 10:21:13AM +0100, Christoph Hellwig wrote:
> On Thu, Oct 31, 2024 at 11:05:30AM +0200, Leon Romanovsky wrote:
> > This series is a subset of the series you tested and doesn't include the
> > block layer changes which most likely were the cause of the performance
> > regression.
> > 
> > This is why I separated the block layer changes from the rest of the series
> > and marked them as RFC.
> > 
> > The current patch set is viable for HMM and VFIO. Can you please retest
> > only this series and leave the block layer changes for later till Christoph
> > finds the answer for the performance regression?
> 
> As the subset doesn't touch block code or code called by block I don't
> think we need Jens to benchmark it, unless he really wants to.

He wrote this sentence in his email, while responding on subset which doesn't change
anything in block layer: "just want to make sure something like this doesn't get merged
until that is both fully understood and sorted out."

This series works like a charm for RDMA (HMM) and VFIO.

Thanks
Jens Axboe Oct. 31, 2024, 5:42 p.m. UTC | #6
On 10/31/24 2:34 AM, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 07:44:13PM -0600, Jens Axboe wrote:
>> On Christoph's request, I tested this series last week and saw some
>> pretty significant performance regressions on my box. I don't know what
>> the status is in terms of that, just want to make sure something like
>> this doesn't get merged until that is both fully understood and sorted
>> out.
> 
> Working on it, but I have way too many things going on at once.  Note
> that the weird thing about your setup was that we apparently dropped into
> the slow path, which still puzzles me.  But I should probably also look
> into making that path a little less slow.

That's fine, just wanted to ensure that no push was being done on this
before that was resolved.
Jens Axboe Oct. 31, 2024, 5:43 p.m. UTC | #7
On 10/31/24 3:37 AM, Leon Romanovsky wrote:
> On Thu, Oct 31, 2024 at 10:21:13AM +0100, Christoph Hellwig wrote:
>> On Thu, Oct 31, 2024 at 11:05:30AM +0200, Leon Romanovsky wrote:
>>> This series is a subset of the series you tested and doesn't include the
>>> block layer changes which most likely were the cause of the performance
>>> regression.
>>>
>>> This is why I separated the block layer changes from the rest of the series
>>> and marked them as RFC.
>>>
>>> The current patch set is viable for HMM and VFIO. Can you please retest
>>> only this series and leave the block layer changes for later till Christoph
>>> finds the answer for the performance regression?
>>
>> As the subset doesn't touch block code or code called by block I don't
>> think we need Jens to benchmark it, unless he really wants to.
> 
> He wrote this sentence in his email, while responding on subset which
> doesn't change anything in block layer: "just want to make sure
> something like this doesn't get merged until that is both fully
> understood and sorted out."
> 
> This series works like a charm for RDMA (HMM) and VFIO.

I don't care about rdma/vfio, nor do I test it, so you guys can do
whatever you want there, as long as it doesn't regress the iommu side.
The block series is separate, so we'll deal with that when we get there.

I don't know why you CC'ed linux-block on the series.
Leon Romanovsky Oct. 31, 2024, 8:43 p.m. UTC | #8
On Thu, Oct 31, 2024 at 11:43:50AM -0600, Jens Axboe wrote:
> On 10/31/24 3:37 AM, Leon Romanovsky wrote:
> > On Thu, Oct 31, 2024 at 10:21:13AM +0100, Christoph Hellwig wrote:
> >> On Thu, Oct 31, 2024 at 11:05:30AM +0200, Leon Romanovsky wrote:
> >>> This series is a subset of the series you tested and doesn't include the
> >>> block layer changes which most likely were the cause of the performance
> >>> regression.
> >>>
> >>> This is why I separated the block layer changes from the rest of the series
> >>> and marked them as RFC.
> >>>
> >>> The current patch set is viable for HMM and VFIO. Can you please retest
> >>> only this series and leave the block layer changes for later till Christoph
> >>> finds the answer for the performance regression?
> >>
> >> As the subset doesn't touch block code or code called by block I don't
> >> think we need Jens to benchmark it, unless he really wants to.
> > 
> > He wrote this sentence in his email, while responding on subset which
> > doesn't change anything in block layer: "just want to make sure
> > something like this doesn't get merged until that is both fully
> > understood and sorted out."
> > 
> > This series works like a charm for RDMA (HMM) and VFIO.
> 
> I don't care about rdma/vfio, nor do I test it, so you guys can do
> whatever you want there, as long as it doesn't regress the iommu side.
> The block series is separate, so we'll deal with that when we get there.
> 
> I don't know why you CC'ed linux-block on the series.

Because of the second part, which is marked as RFC and based on this
one. I think that it is better to present whole picture to everyone
interested in the discussion.

Thanks

> 
> -- 
> Jens Axboe
>
Robin Murphy Oct. 31, 2024, 9:17 p.m. UTC | #9
On 30/10/2024 3:12 pm, Leon Romanovsky wrote:
> Changelog:
> v1:
>   * Squashed two VFIO patches into one
>   * Added Acked-by/Reviewed-by tags
>   * Fix docs spelling errors
>   * Simplified dma_iova_sync() API
>   * Added extra check in dma_iova_destroy() if mapped size to make code more clear
>   * Fixed checkpatch warnings in p2p patch
>   * Changed implementation of VFIO mlx5 mlx5vf_add_migration_pages() to
>     be more general
>   * Reduced the number of changes in VFIO patch
> v0: https://lore.kernel.org/all/cover.1730037276.git.leon@kernel.org
> 
> ----------------------------------------------------------------------------
> The code can be downloaded from:
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git tag:dma-split-oct-30
> 
> ----------------------------------------------------------------------------
> Currently the only efficient way to map a complex memory description through
> the DMA API is by using the scatterlist APIs.

It's really not efficient... In most cases they're just wrappers for a 
bunch of dma_map_page() etc. calls for the convenience of callers who 
are using a scatterlist for their own reasons anyway. Even with 
iommu-dma, I expect that approach would likely perform better for most 
users as well, given that typical individual segment sizes are much more 
likely to be in scope of the IOVA caches.

The hilarious amount of work that iommu_dma_map_sg() does is pretty much 
entirely for the benefit of v4l2 and dma-buf importers who *depend* on 
being able to linearise a scatterlist in DMA address space. TBH I doubt 
there are many actual scatter-gather-capable devices with significant 
enough limitations to meaningfully benefit from DMA segment combining 
these days - I've often thought that by now it might be a good idea to 
turn that behaviour off by default and add an attribute for callers to 
explicitly request it.

> The SG APIs are unique in that
> they efficiently combine the two fundamental operations of sizing and allocating
> a large IOVA window from the IOMMU and processing all the per-address
> swiotlb/flushing/p2p/map details.

Except that's obviously not unique when the page APIs also combine the 
exact same operations? :/

> This uniqueness has been a long standing pain point as the scatterlist API
> is mandatory, but expensive to use.

Huh? When and where has anything ever called it mandatory? Nobody's 
getting sent to DMA jail for open-coding:

	for_each_sg(...)
		my_dma_addr = dma_map_page(..., sg_page());

if they do know the map_sg operation is unnecessarily expensive for 
their needs.

> It prevents any kind of optimization or
> feature improvement (such as avoiding struct page for P2P) due to the impossibility
> of improving the scatterlist.
> 
> Several approaches have been explored to expand the DMA API with additional
> scatterlist-like structures (BIO, rlist), instead split up the DMA API
> to allow callers to bring their own data structure.

And this line of reasoning is still "2 + 2 = Thursday" - what is to say 
those two notions in any way related? We literally already have one 
generic DMA operation which doesn't operate on struct page, yet needed 
nothing "split up" to be possible. Fair enough if callers want some 
alternative interfaces for mapping memory as well, but to be a common 
DMA API it has to be usable everywhere and cover all the DMA operations 
that the current page-based APIs provide, otherwise those callers 
obviously can't stop using struct pages. What precludes a 
straightforward dma_map_phys() etc. to parallel the existing API? What's 
the justification for an IOMMU-specific design when surely if anyone can 
benefit from more memory-efficient structures across drivers and 
subsystems it's the little embedded platforms, not the big servers 
already happy to spend tens to hundreds of megabytes on IOMMU pagetables?

> The API is split up into parts:
>   - Allocate IOVA space:
>      To do any pre-allocation required. This is done based on the caller
>      supplying some details about how much IOMMU address space it would need
>      in worst case.
>   - Map and unmap relevant structures to pre-allocated IOVA space:
>      Perform the actual mapping into the pre-allocated IOVA. This is very
>      similar to dma_map_page().
 >
> In this and the next series [1], examples of three different users are converted
> to the new API to show the benefits and its versatility. Each user has a unique
> flow:
>   1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
>      dynamically map/unmap large numbers of single pages. This becomes
>      significantly faster in the IOMMU case as the map/unmap is now just
>      a page table walk, the IOVA allocation is pre-computed once. Significant
>      amounts of memory are saved as there is no longer a need to store the
>      dma_addr_t of each page.

I particularly enjoy the comment in patch #11 calling out how this 
"unique flow" is fundamentally incompatible with the API it's supposed 
to show off and has to rely on a sketchy hack to abuse its 
"versatility". Great stuff.

>   2. VFIO PCI live migration code is building a very large "page list"
>      for the device. Instead of allocating a scatter list entry per allocated
>      page it can just allocate an array of 'struct page *', saving a large
>      amount of memory.

VFIO already assumes a coherent device with (realistically) an IOMMU 
which it explicitly manages - why is it even pretending to need a 
generic DMA API?

>   3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
>      list without having to allocate then populate an intermediate SG table.

As above, given that a bio_vec still deals in struct pages, that could 
seemingly already be done by just mapping the pages, so how is it 
proving any benefit of a fragile new interface?

Heck, not that I really want to encourage it, but we also already have 
network drivers who don't have the space to stash both a DMA address 
*and* a page address in their descriptors, and economise on shadow 
storage by instead grovelling into the default IOMMU domain with 
iova_to_phys(). I mean, I'd _kinda_ like to send them to DMA jail, but 
it's not an absolutely unreasonable trick to play... also DMA jail 
doesn't exist.

> To make the use of the new API easier, HMM and block subsystems are extended
> to hide the optimization details from the caller. Among these optimizations:
>   * Memory reduction as in most real use cases there is no need to store mapped
>     DMA addresses and unmap them.
>   * Reducing the function call overhead by removing the need to call function
>     pointers and use direct calls instead.
> 
> This step is first along a path to provide alternatives to scatterlist and
> solve some of the abuses and design mistakes, for instance in DMABUF's P2P
> support.

My big concern here is that a thin and vaguely-defined wrapper around 
the IOMMU API is itself a step which smells strongly of "abuse and 
design mistake", given that the basic notion of allocating DMA addresses 
in advance clearly cannot generalise. Thus it really demands some 
considered justification beyond "We must do something; This is 
something; Therefore we must do this." to be convincing.

Thanks,
Robin.

> 
> Thanks
> 
> [1] This still points to v0, as the change is just around handling dma_iova_sync():
> https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org
> 
> Christoph Hellwig (6):
>    PCI/P2PDMA: Refactor the p2pdma mapping helpers
>    dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
>    iommu: generalize the batched sync after map interface
>    iommu/dma: Factor out a iommu_dma_map_swiotlb helper
>    dma-mapping: add a dma_need_unmap helper
>    docs: core-api: document the IOVA-based API
> 
> Leon Romanovsky (11):
>    dma-mapping: Add check if IOVA can be used
>    dma: Provide an interface to allow allocate IOVA
>    dma-mapping: Implement link/unlink ranges API
>    mm/hmm: let users to tag specific PFN with DMA mapped bit
>    mm/hmm: provide generic DMA managing logic
>    RDMA/umem: Store ODP access mask information in PFN
>    RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page
>      linkage
>    RDMA/umem: Separate implicit ODP initialization from explicit ODP
>    vfio/mlx5: Explicitly use number of pages instead of allocated length
>    vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>    vfio/mlx5: Convert vfio to use DMA link API
> 
>   Documentation/core-api/dma-api.rst   |  70 ++++
>   drivers/infiniband/core/umem_odp.c   | 250 +++++----------
>   drivers/infiniband/hw/mlx5/mlx5_ib.h |  12 +-
>   drivers/infiniband/hw/mlx5/odp.c     |  65 ++--
>   drivers/infiniband/hw/mlx5/umr.c     |  12 +-
>   drivers/iommu/dma-iommu.c            | 459 +++++++++++++++++++++++----
>   drivers/iommu/iommu.c                |  65 ++--
>   drivers/pci/p2pdma.c                 |  38 +--
>   drivers/vfio/pci/mlx5/cmd.c          | 373 +++++++++++-----------
>   drivers/vfio/pci/mlx5/cmd.h          |  35 +-
>   drivers/vfio/pci/mlx5/main.c         |  87 +++--
>   include/linux/dma-map-ops.h          |  54 ----
>   include/linux/dma-mapping.h          |  85 +++++
>   include/linux/hmm-dma.h              |  32 ++
>   include/linux/hmm.h                  |  16 +
>   include/linux/iommu.h                |   4 +
>   include/linux/pci-p2pdma.h           |  84 +++++
>   include/rdma/ib_umem_odp.h           |  25 +-
>   kernel/dma/direct.c                  |  44 +--
>   kernel/dma/mapping.c                 |  20 ++
>   mm/hmm.c                             | 231 +++++++++++++-
>   21 files changed, 1377 insertions(+), 684 deletions(-)
>   create mode 100644 include/linux/hmm-dma.h
>
Christoph Hellwig Nov. 4, 2024, 9:58 a.m. UTC | #10
On Thu, Oct 31, 2024 at 09:17:45PM +0000, Robin Murphy wrote:
> The hilarious amount of work that iommu_dma_map_sg() does is pretty much 
> entirely for the benefit of v4l2 and dma-buf importers who *depend* on 
> being able to linearise a scatterlist in DMA address space. TBH I doubt 
> there are many actual scatter-gather-capable devices with significant 
> enough limitations to meaningfully benefit from DMA segment combining these 
> days - I've often thought that by now it might be a good idea to turn that 
> behaviour off by default and add an attribute for callers to explicitly 
> request it.

Even when devices are not limited they often perform significantly better
when IOVA space is not completely fragmented.  While the dma_map_sg code
is a bit gross due to the fact that it has to deal with unaligned segments,
the coalescing itself often is a big win.

Note that dma_map_sg also has two other very useful features:  batching
of the iotlb flushing, and support for P2P, which to be efficient also
requires batching the lookups.

>> This uniqueness has been a long standing pain point as the scatterlist API
>> is mandatory, but expensive to use.
>
> Huh? When and where has anything ever called it mandatory? Nobody's getting 
> sent to DMA jail for open-coding:

You don't get sent to jail.  But you do not get batched iotlb sync, you
don't get properly working P2P, and you don't get IOVA coalescing.

>> Several approaches have been explored to expand the DMA API with additional
>> scatterlist-like structures (BIO, rlist), instead split up the DMA API
>> to allow callers to bring their own data structure.
>
> And this line of reasoning is still "2 + 2 = Thursday" - what is to say 
> those two notions in any way related? We literally already have one generic 
> DMA operation which doesn't operate on struct page, yet needed nothing 
> "split up" to be possible.

Yeah, I don't really get the struct page argument.  In fact if we look
at the nitty-gritty details of dma_map_page it doesn't really need a
page at all.  I've been looking at cleaning some of this up and providing
a dma_map_phys/paddr which would be quite handy in a few places.  Note
because we don't have a struct page for the memory, but because converting
to/from it all the time is not very efficient.

>>   2. VFIO PCI live migration code is building a very large "page list"
>>      for the device. Instead of allocating a scatter list entry per allocated
>>      page it can just allocate an array of 'struct page *', saving a large
>>      amount of memory.
>
> VFIO already assumes a coherent device with (realistically) an IOMMU which 
> it explicitly manages - why is it even pretending to need a generic DMA 
> API?

AFAIK that does isn't really vfio as we know it but the control device
for live migration.  But Leon or Jason might fill in more.

The point is that quite a few devices have these page list based APIs
(RDMA where mlx5 comes from, NVMe with PRPs, AHCI, GPUs).

>
>>   3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
>>      list without having to allocate then populate an intermediate SG table.
>
> As above, given that a bio_vec still deals in struct pages, that could 
> seemingly already be done by just mapping the pages, so how is it proving 
> any benefit of a fragile new interface?

Because we only need to preallocate the tiny constant sized dma_iova_state
as part of the request instead of an additional scatterlist that requires
sizeof(struct page *) + sizeof(dma_addr_t) + 3 * sizeof(unsigned int)
per segment, including a memory allocation per I/O for that.

> My big concern here is that a thin and vaguely-defined wrapper around the 
> IOMMU API is itself a step which smells strongly of "abuse and design 
> mistake", given that the basic notion of allocating DMA addresses in 
> advance clearly cannot generalise. Thus it really demands some considered 
> justification beyond "We must do something; This is something; Therefore we 
> must do this." to be convincing.

At least for the block code we have a nice little core wrapper that is
very easy to use, and provides a great reduction of memory use and
allocations.  The HMM use case I'll let others talk about.
Leon Romanovsky Nov. 4, 2024, 11:39 a.m. UTC | #11
On Mon, Nov 04, 2024 at 10:58:31AM +0100, Christoph Hellwig wrote:
> On Thu, Oct 31, 2024 at 09:17:45PM +0000, Robin Murphy wrote:

<...>

> >>   2. VFIO PCI live migration code is building a very large "page list"
> >>      for the device. Instead of allocating a scatter list entry per allocated
> >>      page it can just allocate an array of 'struct page *', saving a large
> >>      amount of memory.
> >
> > VFIO already assumes a coherent device with (realistically) an IOMMU which 
> > it explicitly manages - why is it even pretending to need a generic DMA 
> > API?
> 
> AFAIK that does isn't really vfio as we know it but the control device
> for live migration.  But Leon or Jason might fill in more.

Yes, you are right, as it is written above "VFIO PCI live migration ...".
That piece of code directly connected to the underlying real HW device
and uses DMA API to provide live migration functionality to/from that
device.

> 
> The point is that quite a few devices have these page list based APIs
> (RDMA where mlx5 comes from, NVMe with PRPs, AHCI, GPUs).
> 
> >
> >>   3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
> >>      list without having to allocate then populate an intermediate SG table.
> >
> > As above, given that a bio_vec still deals in struct pages, that could 
> > seemingly already be done by just mapping the pages, so how is it proving 
> > any benefit of a fragile new interface?
> 
> Because we only need to preallocate the tiny constant sized dma_iova_state
> as part of the request instead of an additional scatterlist that requires
> sizeof(struct page *) + sizeof(dma_addr_t) + 3 * sizeof(unsigned int)
> per segment, including a memory allocation per I/O for that.
> 
> > My big concern here is that a thin and vaguely-defined wrapper around the 
> > IOMMU API is itself a step which smells strongly of "abuse and design 
> > mistake", given that the basic notion of allocating DMA addresses in 
> > advance clearly cannot generalise. Thus it really demands some considered 
> > justification beyond "We must do something; This is something; Therefore we 
> > must do this." to be convincing.
> 
> At least for the block code we have a nice little core wrapper that is
> very easy to use, and provides a great reduction of memory use and
> allocations.  The HMM use case I'll let others talk about.

I'm not sure about which wrappers Robin talks, but if we are talking
about HMM wrappers, they gave us perfect combination of usability,
performance and maintenance. All HMM users use same pattern, same
structures and don't need to worry about internal DMA/IOMMU details.

Thanks
Jason Gunthorpe Nov. 5, 2024, 6:51 p.m. UTC | #12
On Wed, Oct 30, 2024 at 05:12:46PM +0200, Leon Romanovsky wrote:

>  Documentation/core-api/dma-api.rst   |  70 ++++
>  drivers/infiniband/core/umem_odp.c   | 250 +++++----------
>  drivers/infiniband/hw/mlx5/mlx5_ib.h |  12 +-
>  drivers/infiniband/hw/mlx5/odp.c     |  65 ++--
>  drivers/infiniband/hw/mlx5/umr.c     |  12 +-
>  drivers/iommu/dma-iommu.c            | 459 +++++++++++++++++++++++----
>  drivers/iommu/iommu.c                |  65 ++--
>  drivers/pci/p2pdma.c                 |  38 +--
>  drivers/vfio/pci/mlx5/cmd.c          | 373 +++++++++++-----------
>  drivers/vfio/pci/mlx5/cmd.h          |  35 +-
>  drivers/vfio/pci/mlx5/main.c         |  87 +++--
>  include/linux/dma-map-ops.h          |  54 ----
>  include/linux/dma-mapping.h          |  85 +++++
>  include/linux/hmm-dma.h              |  32 ++
>  include/linux/hmm.h                  |  16 +
>  include/linux/iommu.h                |   4 +
>  include/linux/pci-p2pdma.h           |  84 +++++
>  include/rdma/ib_umem_odp.h           |  25 +-
>  kernel/dma/direct.c                  |  44 +--
>  kernel/dma/mapping.c                 |  20 ++
>  mm/hmm.c                             | 231 +++++++++++++-

This is touching alot of subsystems, at least two are mine :)

Who is in the hot seat to merge this? Are we expecting it this merge
window?

I've read through past versions and am happy with the general
concept. Would like to read it again in details.

Thanks,
Jason
Jason Gunthorpe Nov. 5, 2024, 7:53 p.m. UTC | #13
On Mon, Nov 04, 2024 at 10:58:31AM +0100, Christoph Hellwig wrote:
> On Thu, Oct 31, 2024 at 09:17:45PM +0000, Robin Murphy wrote:
> > The hilarious amount of work that iommu_dma_map_sg() does is pretty much 
> > entirely for the benefit of v4l2 and dma-buf importers who *depend* on 
> > being able to linearise a scatterlist in DMA address space. TBH I doubt 
> > there are many actual scatter-gather-capable devices with significant 
> > enough limitations to meaningfully benefit from DMA segment combining these 
> > days - I've often thought that by now it might be a good idea to turn that 
> > behaviour off by default and add an attribute for callers to explicitly 
> > request it.
> 
> Even when devices are not limited they often perform significantly better
> when IOVA space is not completely fragmented.  While the dma_map_sg code
> is a bit gross due to the fact that it has to deal with unaligned segments,
> the coalescing itself often is a big win.

RDMA is like this too, Almost all the MR HW gets big wins if the
entire scatter list is IOVA contiguous. One of the future steps I'd
like to see on top of this is to fine tune the IOVA allocation backing
MRs to exactly match the HW needs. Having proper alignment and
contiguity can be huge reduction in device overhead, like a 100MB MR
may need to store 200K of mapping information on-device, but with a
properly aligned IOVA this can be reduced to only 16 bytes.

Avoiding a double translation tax when the iommu HW is enabled is
potentially significant. We have some RDMA workloads with VMs where
the NIC is holding ~1GB of memory just for translations, but the iommu
is active as the S2. ie we are paying a double tax on translation.

It could be a very interesting trade off to reduce the NIC side to
nothing and rely on the CPU IOMMU with nested translation instead.

> Note that dma_map_sg also has two other very useful features:  batching
> of the iotlb flushing, and support for P2P, which to be efficient also
> requires batching the lookups.

This is the main point, and I think, is the uniqueness Leon is talking
about. We don't get those properties through any other API and this
one series preserves them.

In fact I would say that is the entire point of this series: preserve
everything special about dma_map_sg() compared to dma_map_page() but
don't require a scatterlist.

> >> Several approaches have been explored to expand the DMA API with additional
> >> scatterlist-like structures (BIO, rlist), instead split up the DMA API
> >> to allow callers to bring their own data structure.
> >
> > And this line of reasoning is still "2 + 2 = Thursday" - what is to say 
> > those two notions in any way related? We literally already have one generic 
> > DMA operation which doesn't operate on struct page, yet needed nothing 
> > "split up" to be possible.
> 
> Yeah, I don't really get the struct page argument.  In fact if we look
> at the nitty-gritty details of dma_map_page it doesn't really need a
> page at all. 

Today, if you want to map a P2P address you must have a struct page,
because page->pgmap is the only source of information on the P2P
topology.

So the logic is, to get P2P without struct page we need a way to have
all the features of dma_map_sg() but without a mandatory scatterlist
because we cannot remove struct page from scatterlist.

This series gets to the first step - no scatterlist. There will need
to be another series to provide an alternative to page->pgmap to get
the P2P information. Then we really won't have struct page dependence
in the DMA API.

I actually once looked at how to enhance dma_map_resource() to support
P2P and it was not very nice, the unmap side became quite complex. I
think this is a more elgant solution than what I was sketching.

> >>      for the device. Instead of allocating a scatter list entry per allocated
> >>      page it can just allocate an array of 'struct page *', saving a large
> >>      amount of memory.
> >
> > VFIO already assumes a coherent device with (realistically) an IOMMU which 
> > it explicitly manages - why is it even pretending to need a generic DMA 
> > API?
> 
> AFAIK that does isn't really vfio as we know it but the control device
> for live migration.  But Leon or Jason might fill in more.

Yes, this is the control side of the VFIO live migration driver that
needs rather a lot of memory to store the migration blob. There is
definitely an iommu, and the VF function is definitely translating,
but it doesn't mean the PF function is using dma-iommu.c, it is often
in iommu passthrough/identity and using DMA direct.

It was done as an alternative example on how to use the API. Again
there are more improvements possible there, the driver does not take
advantage of contiguity or alignment when programming the HW.

> Because we only need to preallocate the tiny constant sized dma_iova_state
> as part of the request instead of an additional scatterlist that requires
> sizeof(struct page *) + sizeof(dma_addr_t) + 3 * sizeof(unsigned int)
> per segment, including a memory allocation per I/O for that.

Right, eliminating scatterlist entirely on fast paths is a big
point. I recall Chuck was keen on the same thing for NFSoRDMA as well.

> At least for the block code we have a nice little core wrapper that is
> very easy to use, and provides a great reduction of memory use and
> allocations.  The HMM use case I'll let others talk about.

I saw the Intel XE team make a complicated integration with the DMA
API that wasn't so good. They were looking at an earlier version of
this and I think the feedback was positive. It should make a big
difference, but we will need to see what they come up and possibly
tweak things.

Jason
Christoph Hellwig Nov. 7, 2024, 8:32 a.m. UTC | #14
On Tue, Nov 05, 2024 at 03:53:57PM -0400, Jason Gunthorpe wrote:
> > Yeah, I don't really get the struct page argument.  In fact if we look
> > at the nitty-gritty details of dma_map_page it doesn't really need a
> > page at all. 
> 
> Today, if you want to map a P2P address you must have a struct page,
> because page->pgmap is the only source of information on the P2P
> topology.
> 
> So the logic is, to get P2P without struct page we need a way to have
> all the features of dma_map_sg() but without a mandatory scatterlist
> because we cannot remove struct page from scatterlist.

Well, that is true but also not the point.  The hard part is to
find the P2P routing information without the page.  After that
any physical address based interface will work, including a trivial
dma_map_phys.

> > At least for the block code we have a nice little core wrapper that is
> > very easy to use, and provides a great reduction of memory use and
> > allocations.  The HMM use case I'll let others talk about.
> 
> I saw the Intel XE team make a complicated integration with the DMA
> API that wasn't so good. They were looking at an earlier version of
> this and I think the feedback was positive. It should make a big
> difference, but we will need to see what they come up and possibly
> tweak things.

Not even sure what XE is, but do you have a pointer to it?  It would
really be great if people having DMA problems talked to the dma-mapping
and iommu maintaines / list..
Jason Gunthorpe Nov. 7, 2024, 1:28 p.m. UTC | #15
On Thu, Nov 07, 2024 at 09:32:56AM +0100, Christoph Hellwig wrote:
> On Tue, Nov 05, 2024 at 03:53:57PM -0400, Jason Gunthorpe wrote:
> > > Yeah, I don't really get the struct page argument.  In fact if we look
> > > at the nitty-gritty details of dma_map_page it doesn't really need a
> > > page at all. 
> > 
> > Today, if you want to map a P2P address you must have a struct page,
> > because page->pgmap is the only source of information on the P2P
> > topology.
> > 
> > So the logic is, to get P2P without struct page we need a way to have
> > all the features of dma_map_sg() but without a mandatory scatterlist
> > because we cannot remove struct page from scatterlist.
> 
> Well, that is true but also not the point.  The hard part is to
> find the P2P routing information without the page.  After that
> any physical address based interface will work, including a trivial
> dma_map_phys.

Once we are freed from scatterlist we can explore a design that would
pass the P2P routing information directly. For instance imagine
something like:

   dma_map_p2p(dev, phys, p2p_provider);

Then dma_map_page(dev, page) could be something like

   if (is_pci_p2pdma_page(page))
      dev_map_p2p(dev, page_to_phys(page), page->pgmap->p2p_provider)

From there we could then go into DRM/VFIO/etc and give them
p2p_providers without pgmaps. p2p_provider is some light refactoring
of what is already in drivers/pci/p2pdma.c

For the dmabuf use cases it is not actually hard to find the P2P
routing information - the driver constructing the dmabuf has it. The
challenge is carrying that information from the originating driver,
through the dmabuf apis to the final place that does the dma mapping.

So I'm thinking of a datastructure for things like dmabuf/rdma MR
that is sort of like this:

   struct phys_list {
         enum type; // CPU, p2p, encrypted, whatever
         struct p2p_provider *p2p_provider;
         struct phys_list *next;
         struct phys_range frags[];
   }

Where each phys_list would be a single uniform dma operation and
easily carries the extra meta data. No struct page, no serious issue
transfering the P2P routing information.

> > I saw the Intel XE team make a complicated integration with the DMA
> > API that wasn't so good. They were looking at an earlier version of
> > this and I think the feedback was positive. It should make a big
> > difference, but we will need to see what they come up and possibly
> > tweak things.
> 
> Not even sure what XE is, but do you have a pointer to it?  It would
> really be great if people having DMA problems talked to the dma-mapping
> and iommu maintaines / list..

GPU driver

https://lore.kernel.org/dri-devel/20240117221223.18540-7-oak.zeng@intel.com/

Jason
Christoph Hellwig Nov. 7, 2024, 1:50 p.m. UTC | #16
On Thu, Nov 07, 2024 at 09:28:08AM -0400, Jason Gunthorpe wrote:
> Once we are freed from scatterlist we can explore a design that would
> pass the P2P routing information directly. For instance imagine
> something like:
> 
>    dma_map_p2p(dev, phys, p2p_provider);
> 
> Then dma_map_page(dev, page) could be something like
> 
>    if (is_pci_p2pdma_page(page))
>       dev_map_p2p(dev, page_to_phys(page), page->pgmap->p2p_provider)

One thing that this series does is to move the P2P mapping decisions out
of the low-level dma mapping helpers and into the caller (again) for
the non-sg callers and moves the special switch based bus mapping into
a routine that can be called directly.

Take a look at blk_rq_dma_map_iter_start, which now literally uses
dma_map_page for the no-iommu, no-switch P2P case.  It also is a good
use case for the proposed dma_map_phys.

> GPU driver
> 
> https://lore.kernel.org/dri-devel/20240117221223.18540-7-oak.zeng@intel.com/

Eww, that's horrible.  Converting this to Leon's new hmm helpers
would be really nice (and how that they are useful for more than
mlx5).
Jason Gunthorpe Nov. 8, 2024, 3:02 p.m. UTC | #17
On Thu, Nov 07, 2024 at 02:50:25PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 07, 2024 at 09:28:08AM -0400, Jason Gunthorpe wrote:
> > Once we are freed from scatterlist we can explore a design that would
> > pass the P2P routing information directly. For instance imagine
> > something like:
> > 
> >    dma_map_p2p(dev, phys, p2p_provider);
> > 
> > Then dma_map_page(dev, page) could be something like
> > 
> >    if (is_pci_p2pdma_page(page))
> >       dev_map_p2p(dev, page_to_phys(page), page->pgmap->p2p_provider)
> 
> One thing that this series does is to move the P2P mapping decisions out
> of the low-level dma mapping helpers and into the caller (again) for
> the non-sg callers and moves the special switch based bus mapping into
> a routine that can be called directly.
> 
> Take a look at blk_rq_dma_map_iter_start, which now literally uses
> dma_map_page for the no-iommu, no-switch P2P case.  It also is a good
> use case for the proposed dma_map_phys.

It is fully OK? Can't dma_map_page() trigger swiotlb? It must not do
that for P2P. How does it know the difference if it just gets a phys?

Jason
Christoph Hellwig Nov. 8, 2024, 3:05 p.m. UTC | #18
On Fri, Nov 08, 2024 at 11:02:26AM -0400, Jason Gunthorpe wrote:
> It is fully OK? Can't dma_map_page() trigger swiotlb? It must not do
> that for P2P. How does it know the difference if it just gets a phys?

dma_direct_map_page checks for p2p pages in the swiotlb bounce
path already in the current kernel, and dma_map_sg relies on exactly
that check to prevent bouncing for p2p.
Jason Gunthorpe Nov. 8, 2024, 3:25 p.m. UTC | #19
On Fri, Nov 08, 2024 at 04:05:00PM +0100, Christoph Hellwig wrote:
> On Fri, Nov 08, 2024 at 11:02:26AM -0400, Jason Gunthorpe wrote:
> > It is fully OK? Can't dma_map_page() trigger swiotlb? It must not do
> > that for P2P. How does it know the difference if it just gets a phys?
> 
> dma_direct_map_page checks for p2p pages in the swiotlb bounce
> path already in the current kernel, and dma_map_sg relies on exactly
> that check to prevent bouncing for p2p.

I'm asking how it will work if you change the struct page argument to
physical, because today dma_direct_map_page() has:

		if (is_pci_p2pdma_page(page))
			return DMA_MAPPING_ERROR;

Which is exactly the sorts of things I'm looking at when when I say to
get rid of struct page.

What I'm thinking about is replacing code like the above with something like:

		if (p2p_provider)
			return DMA_MAPPING_ERROR;

And the caller is the one that would have done is_pci_p2pdma_page()
and either passes p2p_provider=NULL or page->pgmap->p2p_provider.

Anyhow, I hope Leon will attempt this once this is settled and it will
make more sense in patches. I'm just brainstorming how I've been
thinking of it.

Another option would be some 'is_pci_p2pdma_page_phys(phys)', but I
think that is going to be worse performance than managing a
p2p_provider pointer in the mapping call chain explicitly.

Jason
Christoph Hellwig Nov. 8, 2024, 3:29 p.m. UTC | #20
On Fri, Nov 08, 2024 at 11:25:37AM -0400, Jason Gunthorpe wrote:
> I'm asking how it will work if you change the struct page argument to
> physical, because today dma_direct_map_page() has:
> 
> 		if (is_pci_p2pdma_page(page))
> 			return DMA_MAPPING_ERROR;
> 
> Which is exactly the sorts of things I'm looking at when when I say to
> get rid of struct page.

It will have to look up the page from the physical address obviously.
But at least only in the error path.

> What I'm thinking about is replacing code like the above with something like:
> 
> 		if (p2p_provider)
> 			return DMA_MAPPING_ERROR;
> 
> And the caller is the one that would have done is_pci_p2pdma_page()
> and either passes p2p_provider=NULL or page->pgmap->p2p_provider.

And where do you get that one from?
Jason Gunthorpe Nov. 8, 2024, 3:38 p.m. UTC | #21
On Fri, Nov 08, 2024 at 04:29:56PM +0100, Christoph Hellwig wrote:
> On Fri, Nov 08, 2024 at 11:25:37AM -0400, Jason Gunthorpe wrote:
> > I'm asking how it will work if you change the struct page argument to
> > physical, because today dma_direct_map_page() has:
> > 
> > 		if (is_pci_p2pdma_page(page))
> > 			return DMA_MAPPING_ERROR;
> > 
> > Which is exactly the sorts of things I'm looking at when when I say to
> > get rid of struct page.
> 
> It will have to look up the page from the physical address obviously.
> But at least only in the error path.

I'm thinking we can largely avoid searching on physical, or at least
we can optimize this so there is only one search on physical at the
start of the DMA mapping. (since we are now saying all pages are the
same type)

> > What I'm thinking about is replacing code like the above with something like:
> > 
> > 		if (p2p_provider)
> > 			return DMA_MAPPING_ERROR;
> > 
> > And the caller is the one that would have done is_pci_p2pdma_page()
> > and either passes p2p_provider=NULL or page->pgmap->p2p_provider.
> 
> And where do you get that one from?

Which one?

The caller must know the p2p properties of what it is doing because it
is driving all the P2P logic around what APIs to call.

Either because it is already working with struct page and gets it out
of the pgmap.

Or it is working with non-struct page memory and has a (MMIO address,
p2p_provider) tuple that it got from the original driver that gave it
the MMIO address.

Or it really does have a naked phys_addr_t and it did the search on
physical, but only once.

Jason
Christoph Hellwig Nov. 12, 2024, 6:01 a.m. UTC | #22
On Fri, Nov 08, 2024 at 11:38:46AM -0400, Jason Gunthorpe wrote:
> > > What I'm thinking about is replacing code like the above with something like:
> > > 
> > > 		if (p2p_provider)
> > > 			return DMA_MAPPING_ERROR;
> > > 
> > > And the caller is the one that would have done is_pci_p2pdma_page()
> > > and either passes p2p_provider=NULL or page->pgmap->p2p_provider.
> > 
> > And where do you get that one from?
> 
> Which one?

The p2p_provider thing (whatever that will actually be).
Jason Gunthorpe Nov. 13, 2024, 6:41 p.m. UTC | #23
On Tue, Nov 12, 2024 at 07:01:08AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 08, 2024 at 11:38:46AM -0400, Jason Gunthorpe wrote:
> > > > What I'm thinking about is replacing code like the above with something like:
> > > > 
> > > > 		if (p2p_provider)
> > > > 			return DMA_MAPPING_ERROR;
> > > > 
> > > > And the caller is the one that would have done is_pci_p2pdma_page()
> > > > and either passes p2p_provider=NULL or page->pgmap->p2p_provider.
> > > 
> > > And where do you get that one from?
> > 
> > Which one?
> 
> The p2p_provider thing (whatever that will actually be).

p2p_provider would be splitting out the information in
pci_p2pdma_pagemap to it's own type:

struct pci_p2pdma_pagemap {
	struct pci_dev *provider;
	u64 bus_offset;

That is the essential information to compute PCI_P2PDMA_MAP_*.

For example when blk_rq_dma_map_iter_start() calls pci_p2pdma_state(),
it has this information from page->pgmap. It would still have the
information via the pgmap when we split it out of the
pci_p2pdma_pagemap.

Since everything doing a dma map has to do the pci_p2pdma_state() to
compute PCI_P2PDMA_MAP_* every dma mapping operation has already got
the provider. Since everything is uniform within a mapping operation
the provider is constant for the whole map.

For future non-struct page cases the provider comes along with the
address list from whatever created the address list in the first
place.

Looking at dmabuf for example, I expect dmabuf to provide a new data
structure which is a list of lists:

 [[provider GPU: [mmio_addr1,mmio_addr2,mmio_addr3],
  [provider NULL: [cpu_addr1, cpu_addr2, ...],
   ..
 ]

And each uniform group would be dma map'd on its own using the
embedded provider instead of page->pgmap.

Jason