
[RFC,v2,0/3] RDMA: add dma-buf support

Message ID 1593451903-30959-1-git-send-email-jianxin.xiong@intel.com (mailing list archive)

Message

Xiong, Jianxin June 29, 2020, 5:31 p.m. UTC
When peer-to-peer transactions are enabled, an RDMA-capable NIC can
access the local memory of another device directly over PCIe. This
often performs better than using a system memory buffer for RDMA and
copying data between that buffer and the device memory.

The current kernel RDMA stack uses get_user_pages() to pin the physical
pages backing the user buffer and dma_map_sg_attrs() to obtain the DMA
addresses for memory access. This usually doesn't work for peer device
memory because such memory lacks associated page structures.
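For reference, that conventional pinning path looks roughly like the
sketch below (simplified from what ib_umem_get() does; partial-pin
cleanup and accounting are omitted):

#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/* Pin a user buffer and DMA-map it, as described above (sketch). */
static int pin_and_map(struct device *dev, unsigned long start,
		       unsigned int npages, struct page **pages,
		       struct sg_table *sgt)
{
	if (get_user_pages_fast(start, npages, FOLL_WRITE, pages) != npages)
		return -EFAULT;

	if (sg_alloc_table_from_pages(sgt, pages, npages, 0,
				      (size_t)npages << PAGE_SHIFT,
				      GFP_KERNEL))
		return -ENOMEM;

	if (!dma_map_sg_attrs(dev, sgt->sgl, sgt->orig_nents,
			      DMA_BIDIRECTIONAL, 0))
		return -ENOMEM;

	return 0;
}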

Several mechanisms exist today to facilitate device memory access.

ZONE_DEVICE is a new zone for device memory in the memory management
subsystem. It allows pages from device memory to be described with
specialized page structures. As a result, calls like get_user_pages()
can succeed, but what can be done with these page structures may be
different from system memory. ZONE_DEVICE is further specialized into
multiple memory types, such as one type for PCI p2pmem/p2pdma and one
type for HMM.

PCI p2pmem/p2pdma uses ZONE_DEVICE to represent device memory residing
in a PCI BAR and provides a set of calls to publish, discover, allocate,
and map such memory for peer-to-peer transactions. One feature of the
API is that the buffer is allocated by the side that does the DMA
transfer. This works well for the storage use case, but is awkward for
GPU-NIC communication, where the buffer is typically allocated by the
GPU driver rather than the NIC driver.
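A rough illustration of that API (a minimal sketch only; the BAR index
and sizes are made up, and error handling is trimmed):

#include <linux/pci-p2pdma.h>
#include <linux/sizes.h>

/* The device that owns the BAR publishes part of it as p2p memory,
 * and the side doing the DMA allocates its buffer from that pool. */
static void *p2p_setup(struct pci_dev *provider)
{
	if (pci_p2pdma_add_resource(provider, 4 /* BAR, made up */, SZ_1M, 0))
		return NULL;
	pci_p2pmem_publish(provider, true);

	return pci_alloc_p2pmem(provider, SZ_64K);
}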

Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier
and ZONE_DEVICE to support shared virtual address space and page
migration between system memory and device memory. HMM doesn't support
pinning device memory because pages located on a device must be able to
migrate to system memory when accessed by the CPU. Peer-to-peer access
is possible if the peer can handle page faults. For RDMA, that means
the NIC must support on-demand paging.
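For context, an on-demand-paging style consumer of HMM looks roughly
like the sketch below (it assumes an mmu_interval_notifier 'ni' already
registered on the address range; the mmu_interval_read_retry() dance
done under the driver lock is omitted):

#include <linux/mm.h>
#include <linux/hmm.h>
#include <linux/mmu_notifier.h>

/* Fault in a user address range and collect its pfns. */
static int fault_user_range(struct mmu_interval_notifier *ni,
			    unsigned long start, unsigned long npages,
			    unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier	= ni,
		.start		= start,
		.end		= start + (npages << PAGE_SHIFT),
		.hmm_pfns	= pfns,
		.default_flags	= HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
	};
	int ret;

	do {
		range.notifier_seq = mmu_interval_read_begin(ni);
		mmap_read_lock(ni->mm);
		ret = hmm_range_fault(&range);
		mmap_read_unlock(ni->mm);
	} while (ret == -EBUSY);

	return ret;
}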

Dma-buf is a standard mechanism for sharing buffers among different
device drivers. The buffer to be shared is exported by the owning
driver and imported by the driver that wants to use it. The exporter
provides a set of ops that the importer can call to pin and map the
buffer. In addition, a file descriptor can be associated with a
dma-buf object as the handle that can be passed to user space.
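In kernel terms, a (non-dynamic) importer's life cycle looks roughly
like this sketch (error handling omitted):

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>

/* Import a dma-buf from a file descriptor and map it for DMA. */
static struct sg_table *import_dmabuf(struct device *dev, int fd,
				      struct dma_buf_attachment **attach)
{
	struct dma_buf *dmabuf = dma_buf_get(fd);
	struct sg_table *sgt;

	*attach = dma_buf_attach(dmabuf, dev);
	sgt = dma_buf_map_attachment(*attach, DMA_BIDIRECTIONAL);

	/* ... DMA to/from sgt; the teardown path later does:
	 *	dma_buf_unmap_attachment(*attach, sgt, DMA_BIDIRECTIONAL);
	 *	dma_buf_detach(dmabuf, *attach);
	 *	dma_buf_put(dmabuf);
	 */
	return sgt;
}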

This patch series adds a dma-buf importer role to the RDMA drivers in
an attempt to support RDMA using device memory such as GPU VRAM.
Dma-buf is chosen for a few reasons: first, the API is relatively
simple and allows a lot of flexibility in implementing the buffer
manipulation ops. Second, it doesn't require page structures. Third,
dma-buf is already supported in many GPU drivers. However, we are
aware that existing GPU drivers don't allow pinning device memory via
the dma-buf interface: pinning and mapping a dma-buf would cause the
backing storage to migrate to system RAM. This is due to the lack of
knowledge about whether the importer can perform peer-to-peer access
and the lack of resource limit control measures for the GPU. For the
first part, the latest dma-buf driver has a peer-to-peer flag for the
importer, but the flag is currently tied to dynamic mapping support,
which requires on-demand paging support from the NIC to work. There
are a few possible ways to address these issues, such as decoupling
the peer-to-peer flag from dynamic mapping, allowing more leeway for
individual drivers to make the pinning decision, and adding GPU
resource limit control via cgroup. We would like to get comments on
this patch series under the assumption that device memory pinning via
dma-buf is supported by some GPU drivers, and at the same time we
welcome open discussion on how to address the aforementioned issues as
well as GPU-NIC peer-to-peer access solutions in general.
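The peer-to-peer flag mentioned above lives in the importer ops used
for dynamic attachments. A minimal sketch of how an importer would use
it (the move_notify callback is a placeholder; being able to service it
is exactly what requires ODP-like support on the NIC):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

static void nic_move_notify(struct dma_buf_attachment *attach)
{
	/* The exporter is about to move the buffer: invalidate the NIC
	 * mappings here and re-map later, which needs on-demand paging. */
}

static const struct dma_buf_attach_ops nic_importer_ops = {
	.allow_peer2peer = true,	/* the peer-to-peer flag */
	.move_notify	 = nic_move_notify,
};

static struct sg_table *dyn_map(struct dma_buf *dmabuf, struct device *dev)
{
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;

	attach = dma_buf_dynamic_attach(dmabuf, dev, &nic_importer_ops, NULL);

	/* Dynamic importers map while holding the reservation lock. */
	dma_resv_lock(dmabuf->resv, NULL);
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	dma_resv_unlock(dmabuf->resv);

	return sgt;
}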

This is the second version of the patch series. Here are the changes
from the previous version:
* The Kconfig option is removed. There is no dependency issue since
the dma-buf driver is always enabled.
* The declarations of the new data structures and functions are
reorganized to minimize the visibility of the changes.
* The new uverbs command now goes through ioctl() instead of write().
* The rereg functionality is removed.
* Instead of adding a new device method for dma-buf specific
registration, the existing method is extended to accept an extra
parameter.
* The correct function is now used for address range checking.

This series is organized as follows. The first patch adds the common
code for importing a dma-buf from a file descriptor and pinning and
mapping the dma-buf pages. Patch 2 extends the reg_user_mr() method of
the ib_device structure to accept a dma-buf file descriptor as an extra
parameter; vendor drivers are updated for the change. Patch 3 adds a
new uverbs command for registering a dma-buf based memory region.
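For illustration only, the extended driver method would have roughly
the following shape (the exact prototype, including where the new
argument sits, is whatever patch 2 defines):

/* include/rdma/ib_verbs.h, sketch of the extended ib_device_ops entry */
struct ib_mr *(*reg_user_mr)(struct ib_pd *pd, u64 start, u64 length,
			     u64 virt_addr, int dmabuf_fd, int access_flags,
			     struct ib_udata *udata);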

Related user space RDMA library changes will be provided as a separate
patch series.

Jianxin Xiong (3):
  RDMA/umem: Support importing dma-buf as user memory region
  RDMA/core: Expand the driver method 'reg_user_mr' to support dma-buf
  RDMA/uverbs: Add uverbs command for dma-buf based MR registration

 drivers/infiniband/core/Makefile                |   2 +-
 drivers/infiniband/core/umem.c                  |   4 +
 drivers/infiniband/core/umem_dmabuf.c           | 105 ++++++++++++++++++++++
 drivers/infiniband/core/umem_dmabuf.h           |  11 +++
 drivers/infiniband/core/uverbs_cmd.c            |   2 +-
 drivers/infiniband/core/uverbs_std_types_mr.c   | 112 ++++++++++++++++++++++++
 drivers/infiniband/core/verbs.c                 |   2 +-
 drivers/infiniband/hw/bnxt_re/ib_verbs.c        |   7 +-
 drivers/infiniband/hw/bnxt_re/ib_verbs.h        |   2 +-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |   3 +-
 drivers/infiniband/hw/cxgb4/mem.c               |   8 +-
 drivers/infiniband/hw/efa/efa.h                 |   2 +-
 drivers/infiniband/hw/efa/efa_verbs.c           |   7 +-
 drivers/infiniband/hw/hns/hns_roce_device.h     |   2 +-
 drivers/infiniband/hw/hns/hns_roce_mr.c         |   7 +-
 drivers/infiniband/hw/i40iw/i40iw_verbs.c       |   6 ++
 drivers/infiniband/hw/mlx4/mlx4_ib.h            |   2 +-
 drivers/infiniband/hw/mlx4/mr.c                 |   7 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h            |   2 +-
 drivers/infiniband/hw/mlx5/mr.c                 |  45 +++++++++-
 drivers/infiniband/hw/mthca/mthca_provider.c    |   8 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c     |   9 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h     |   3 +-
 drivers/infiniband/hw/qedr/verbs.c              |   8 +-
 drivers/infiniband/hw/qedr/verbs.h              |   3 +-
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c    |   8 +-
 drivers/infiniband/hw/usnic/usnic_ib_verbs.h    |   2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c    |   6 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h |   2 +-
 drivers/infiniband/sw/rdmavt/mr.c               |   6 +-
 drivers/infiniband/sw/rdmavt/mr.h               |   2 +-
 drivers/infiniband/sw/rxe/rxe_verbs.c           |   6 ++
 drivers/infiniband/sw/siw/siw_verbs.c           |   8 +-
 drivers/infiniband/sw/siw/siw_verbs.h           |   3 +-
 include/rdma/ib_umem.h                          |  14 ++-
 include/rdma/ib_verbs.h                         |   4 +-
 include/uapi/rdma/ib_user_ioctl_cmds.h          |  14 +++
 37 files changed, 410 insertions(+), 34 deletions(-)
 create mode 100644 drivers/infiniband/core/umem_dmabuf.c
 create mode 100644 drivers/infiniband/core/umem_dmabuf.h

Comments

Xiong, Jianxin June 30, 2020, 5:21 p.m. UTC | #1
> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Monday, June 29, 2020 11:52 AM
> To: Xiong, Jianxin <jianxin.xiong@intel.com>
> Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>
> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> 
> On Mon, Jun 29, 2020 at 10:31:40AM -0700, Jianxin Xiong wrote:
> 
> > ZONE_DEVICE is a new zone for device memory in the memory management
> > subsystem. It allows pages from device memory being described with
> > specialized page structures. As the result, calls like
> > get_user_pages() can succeed, but what can be done with these page
> > structures may be
> 
> get_user_pages() does not succeed with ZONE_DEVICE_PAGEs

I stand corrected.

> 
> > Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier
> > and ZONE_DEVICE to support shared virtual address space and page
> > migration between system memory and device memory. HMM doesn't support
> > pinning device memory because pages located on device must be able to
> > migrate to system memory when accessed by CPU. Peer-to-peer access is
> > possible if the peer can handle page fault. For RDMA, that means the
> > NIC must support on-demand paging.
> 
> peer-peer access is currently not possible with hmm_range_fault().

Currently hmm_range_fault() always sets the CPU access flag, and device
private pages are migrated to system RAM in the fault handler. However,
it's possible to have a modified code flow that keeps the device private
page info for use with peer-to-peer access.

> 
> > This patch series adds dma-buf importer role to the RDMA driver in
> > attempt to support RDMA using device memory such as GPU VRAM. Dma-buf
> > is chosen for a few reasons: first, the API is relatively simple and
> > allows a lot of flexibility in implementing the buffer manipulation ops.
> > Second, it doesn't require page structure. Third, dma-buf is already
> > supported in many GPU drivers. However, we are aware that existing GPU
> > drivers don't allow pinning device memory via the dma-buf interface.
> 
> So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?

That's right, the patch alone is just half of the story. The
functionality depends on the availability of a dma-buf exporter that can
pin the device memory.

> 
> > Pinning and mapping a dma-buf would cause the backing storage to
> > migrate to system RAM. This is due to the lack of knowledge about
> > whether the importer can perform peer-to-peer access and the lack of
> > resource limit control measure for GPU. For the first part, the latest
> > dma-buf driver has a peer-to-peer flag for the importer, but the flag
> > is currently tied to dynamic mapping support, which requires on-demand
> > paging support from the NIC to work.
> 
> ODP for DMA buf?

Right.

> 
> > There are a few possible ways to address these issues, such as
> > decoupling peer-to-peer flag from dynamic mapping, allowing more
> > leeway for individual drivers to make the pinning decision and adding
> > GPU resource limit control via cgroup. We would like to get comments
> > on this patch series with the assumption that device memory pinning
> > via dma-buf is supported by some GPU drivers, and at the same time
> > welcome open discussions on how to address the aforementioned issues
> > as well as GPU-NIC peer-to-peer access solutions in general.
> 
> These seem like DMA buf problems, not RDMA problems, why are you asking these questions with a RDMA patch set? The usual DMA buf
> people are not even Cc'd here.

The intention is to have people from both the RDMA and DMA buffer sides
comment. Sumit Semwal is the DMA buffer maintainer according to the
MAINTAINERS file. I agree more people could be invited to the
discussion. Just added Christian Koenig to the cc-list.

> 
> > This is the second version of the patch series. Here are the changes
> > from the previous version:
> > * Instead of adding new device method for dma-buf specific
> > registration, existing method is extended to accept an extra parameter.
> 
> I think the comment was the extra parameter should have been a umem or maybe a new umem_description struct, not blindly adding a fd
> as a parameter and a wack of EOPNOTSUPPS

Passing a 'umem' leads to some difficulties. For example, the mlx4
driver needs to modify the access flags before getting the umem; the
mlx5 driver needs to pass driver-specific ops to get the ODP umem.

If the umem_description you mentioned is for information used to create
the umem (e.g. a structure for all the parameters), then this would work
better.
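For concreteness, such a parameter bundle might look something like the
hypothetical struct below (name and fields are made up; nothing like
this exists in the current series):

/* Hypothetical description of a umem to be created. */
struct ib_umem_description {
	u64	addr;		/* user virtual address */
	u64	length;		/* length of the region */
	int	access_flags;	/* IB_ACCESS_* flags */
	int	dmabuf_fd;	/* dma-buf fd, or -1 for regular memory */
};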

> 
> > This series is organized as follows. The first patch adds the common
> > code for importing dma-buf from a file descriptor and pinning and
> > mapping the dma-buf pages. Patch 2 extends the reg_user_mr() method of
> > the ib_device structure to accept dma-buf file descriptor as an extra
> > parameter. Vendor drivers are updated with the change. Patch 3 adds a
> > new uverbs command for registering dma-buf based memory region.
> 
> The ioctl stuff seems OK, but this doesn't seem to bring any new functionality?

Thanks.

> 
> Jason
Jason Gunthorpe June 30, 2020, 5:34 p.m. UTC | #2
On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier
> > > and ZONE_DEVICE to support shared virtual address space and page
> > > migration between system memory and device memory. HMM doesn't support
> > > pinning device memory because pages located on device must be able to
> > > migrate to system memory when accessed by CPU. Peer-to-peer access is
> > > possible if the peer can handle page fault. For RDMA, that means the
> > > NIC must support on-demand paging.
> > 
> > peer-peer access is currently not possible with hmm_range_fault().
> 
> Currently hmm_range_fault() always sets the cpu access flag and device
> private pages are migrated to the system RAM in the fault handler. However, 
> it's possible to have a modified code flow to keep the device private page info
> for use with peer to peer access.

Sort of, but only within the same device, RDMA or anything else
generic can't reach inside a DEVICE_PRIVATE and extract anything
useful.

> > So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
> 
> That's right, the patch alone is just half of the story. The functionality
> depends on availability of dma-buf exporter that can pin the device
> memory.

Well, what do you want to happen here? The RDMA parts are reasonable,
but I don't want to add new functionality without a purpose - the
other parts need to be settled out first.

The need for the dynamic mapping support for even the current DMA Buf
hacky P2P users is really too bad. Can you get any GPU driver to
support non-dynamic mapping?

> > > migrate to system RAM. This is due to the lack of knowledge about
> > > whether the importer can perform peer-to-peer access and the lack of
> > > resource limit control measure for GPU. For the first part, the latest
> > > dma-buf driver has a peer-to-peer flag for the importer, but the flag
> > > is currently tied to dynamic mapping support, which requires on-demand
> > > paging support from the NIC to work.
> > 
> > ODP for DMA buf?
> 
> Right.

Hum. This is not actually so hard to do. The whole dma buf proposal
would make a lot more sense if the 'dma buf MR' had to be the dynamic
kind and the driver had to provide the faulting. It would not be so
hard to change mlx5 to be able to work like this, perhaps. (the
locking might be a bit tricky though)

> > > There are a few possible ways to address these issues, such as
> > > decoupling peer-to-peer flag from dynamic mapping, allowing more
> > > leeway for individual drivers to make the pinning decision and adding
> > > GPU resource limit control via cgroup. We would like to get comments
> > > on this patch series with the assumption that device memory pinning
> > > via dma-buf is supported by some GPU drivers, and at the same time
> > > welcome open discussions on how to address the aforementioned issues
> > > as well as GPU-NIC peer-to-peer access solutions in general.
> > 
> > These seem like DMA buf problems, not RDMA problems, why are you asking these questions with a RDMA patch set? The usual DMA buf
> > people are not even Cc'd here.
> 
> The intention is to have people from both RDMA and DMA buffer side to
> comment. Sumit Semwal is the DMA buffer maintainer according to the
> MAINTAINERS file. I agree more people could be invited to the discussion.
> Just added Christian Koenig to the cc-list.

Would be good to have added the drm lists too

> If the umem_description you mentioned is for information used to create the
> umem (e.g. a structure for all the parameters), then this would work better.

It would make some more sense, and avoid all these weird EOPNOTSUPPS.

Jason
Xiong, Jianxin June 30, 2020, 6:46 p.m. UTC | #3
> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, June 30, 2020 10:35 AM
> To: Xiong, Jianxin <jianxin.xiong@intel.com>
> Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> 
> On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > Heterogeneous Memory Management (HMM) utilizes
> > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > address space and page migration between system memory and device
> > > > memory. HMM doesn't support pinning device memory because pages
> > > > located on device must be able to migrate to system memory when
> > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > >
> > > peer-peer access is currently not possible with hmm_range_fault().
> >
> > Currently hmm_range_fault() always sets the cpu access flag and device
> > private pages are migrated to the system RAM in the fault handler.
> > However, it's possible to have a modified code flow to keep the device
> > private page info for use with peer to peer access.
> 
> Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.

But pfn is supposed to be all that is needed.

> 
> > > So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
> >
> > That's right, the patch alone is just half of the story. The
> > functionality depends on availability of dma-buf exporter that can pin
> > the device memory.
> 
> Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
> other parts need to be settled out first.

On the RDMA side, we mainly want to check whether the changes are
acceptable, for example the part about adding 'fd' to the device ops
and the ioctl interface. All the previous comments are very helpful for
us to refine the patch so that we can be ready when GPU-side support
becomes available.

> 
> The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
> support non-dynamic mapping?

We are working in that direction.

> 
> > > > migrate to system RAM. This is due to the lack of knowledge about
> > > > whether the importer can perform peer-to-peer access and the lack
> > > > of resource limit control measure for GPU. For the first part, the
> > > > latest dma-buf driver has a peer-to-peer flag for the importer,
> > > > but the flag is currently tied to dynamic mapping support, which
> > > > requires on-demand paging support from the NIC to work.
> > >
> > > ODP for DMA buf?
> >
> > Right.
> 
> Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
> dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
> locking might be a bit tricky though)

The main issue is that not all NICs support ODP.

> 
> > > > There are a few possible ways to address these issues, such as
> > > > decoupling peer-to-peer flag from dynamic mapping, allowing more
> > > > leeway for individual drivers to make the pinning decision and
> > > > adding GPU resource limit control via cgroup. We would like to get
> > > > comments on this patch series with the assumption that device
> > > > memory pinning via dma-buf is supported by some GPU drivers, and
> > > > at the same time welcome open discussions on how to address the
> > > > aforementioned issues as well as GPU-NIC peer-to-peer access solutions in general.
> > >
> > > These seem like DMA buf problems, not RDMA problems, why are you
> > > asking these questions with a RDMA patch set? The usual DMA buf people are not even Cc'd here.
> >
> > The intention is to have people from both RDMA and DMA buffer side to
> > comment. Sumit Semwal is the DMA buffer maintainer according to the
> > MAINTAINERS file. I agree more people could be invited to the discussion.
> > Just added Christian Koenig to the cc-list.
> 
> Would be good to have added the drm lists too

Thanks, cc'd dri-devel here, and will also do the same for the previous part of the thread.

> 
> > If the umem_description you mentioned is for information used to
> > create the umem (e.g. a structure for all the parameters), then this would work better.
> 
> It would make some more sense, and avoid all these weird EOPNOTSUPPS.

Good, thanks for the suggestion.

> 
> Jason
Xiong, Jianxin June 30, 2020, 6:56 p.m. UTC | #4
Added to cc-list:
Christian Koenig <christian.koenig@amd.com>
dri-devel@lists.freedesktop.org

> -----Original Message-----
> From: Xiong, Jianxin <jianxin.xiong@intel.com>
> Sent: Monday, June 29, 2020 10:32 AM
> To: linux-rdma@vger.kernel.org
> Cc: Xiong, Jianxin <jianxin.xiong@intel.com>; Doug Ledford <dledford@redhat.com>; Jason Gunthorpe <jgg@ziepe.ca>; Sumit Semwal
> <sumit.semwal@linaro.org>; Leon Romanovsky <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>
> Subject: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> 
> When enabled, an RDMA capable NIC can perform peer-to-peer transactions
> over PCIe to access the local memory located on another device. This can
> often lead to better performance than using a system memory buffer for
> RDMA and copying data between the buffer and device memory.
> 
> Current kernel RDMA stack uses get_user_pages() to pin the physical
> pages backing the user buffer and uses dma_map_sg_attrs() to get the
> dma addresses for memory access. This usually doesn't work for peer
> device memory due to the lack of associated page structures.
> 
> Several mechanisms exist today to facilitate device memory access.
> 
> ZONE_DEVICE is a new zone for device memory in the memory management
> subsystem. It allows pages from device memory being described with
> specialized page structures. As the result, calls like get_user_pages()
> can succeed, but what can be done with these page structures may be
> different from system memory. It is further specialized into multiple
> memory types, such as one type for PCI p2pmem/p2pdma and one type for
> HMM.
> 
> PCI p2pmem/p2pdma uses ZONE_DEVICE to represent device memory residing
> in a PCI BAR and provides a set of calls to publish, discover, allocate,
> and map such memory for peer-to-peer transactions. One feature of the
> API is that the buffer is allocated by the side that does the DMA
> transfer. This works well with the storage usage case, but is awkward
> with GPU-NIC communication, where typically the buffer is allocated by
> the GPU driver rather than the NIC driver.
> 
> Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier
> and ZONE_DEVICE to support shared virtual address space and page
> migration between system memory and device memory. HMM doesn't support
> pinning device memory because pages located on device must be able to
> migrate to system memory when accessed by CPU. Peer-to-peer access
> is possible if the peer can handle page fault. For RDMA, that means
> the NIC must support on-demand paging.
> 
> Dma-buf is a standard mechanism for sharing buffers among different
> device drivers. The buffer to be shared is exported by the owning
> driver and imported by the driver that wants to use it. The exporter
> provides a set of ops that the importer can call to pin and map the
> buffer. In addition, a file descriptor can be associated with a dma-
> buf object as the handle that can be passed to user space.
> 
> This patch series adds dma-buf importer role to the RDMA driver in
> attempt to support RDMA using device memory such as GPU VRAM. Dma-buf is
> chosen for a few reasons: first, the API is relatively simple and allows
> a lot of flexibility in implementing the buffer manipulation ops.
> Second, it doesn't require page structure. Third, dma-buf is already
> supported in many GPU drivers. However, we are aware that existing GPU
> drivers don't allow pinning device memory via the dma-buf interface.
> Pinning and mapping a dma-buf would cause the backing storage to migrate
> to system RAM. This is due to the lack of knowledge about whether the
> importer can perform peer-to-peer access and the lack of resource limit
> control measure for GPU. For the first part, the latest dma-buf driver
> has a peer-to-peer flag for the importer, but the flag is currently tied
> to dynamic mapping support, which requires on-demand paging support from
> the NIC to work. There are a few possible ways to address these issues,
> such as decoupling peer-to-peer flag from dynamic mapping, allowing more
> leeway for individual drivers to make the pinning decision and adding
> GPU resource limit control via cgroup. We would like to get comments on
> this patch series with the assumption that device memory pinning via
> dma-buf is supported by some GPU drivers, and at the same time welcome
> open discussions on how to address the aforementioned issues as well as
> GPU-NIC peer-to-peer access solutions in general.
> 
> This is the second version of the patch series. Here are the changes
> from the previous version:
> * The Kconfig option is removed. There is no dependence issue since
> dma-buf driver is always enabled.
> * The declaration of new data structure and functions is reorganized to
> minimize the visibility of the changes.
> * The new uverbs command now goes through ioctl() instead of write().
> * The rereg functionality is removed.
> * Instead of adding new device method for dma-buf specific registration,
> existing method is extended to accept an extra parameter.
> * The correct function is now used for address range checking.
> 
> This series is organized as follows. The first patch adds the common
> code for importing dma-buf from a file descriptor and pinning and
> mapping the dma-buf pages. Patch 2 extends the reg_user_mr() method
> of the ib_device structure to accept dma-buf file descriptor as an extra
> parameter. Vendor drivers are updated with the change. Patch 3 adds a
> new uverbs command for registering dma-buf based memory region.
> 
> Related user space RDMA library changes will be provided as a separate
> patch series.
> 
> Jianxin Xiong (3):
>   RDMA/umem: Support importing dma-buf as user memory region
>   RDMA/core: Expand the driver method 'reg_user_mr' to support dma-buf
>   RDMA/uverbs: Add uverbs command for dma-buf based MR registration
> 
>  drivers/infiniband/core/Makefile                |   2 +-
>  drivers/infiniband/core/umem.c                  |   4 +
>  drivers/infiniband/core/umem_dmabuf.c           | 105 ++++++++++++++++++++++
>  drivers/infiniband/core/umem_dmabuf.h           |  11 +++
>  drivers/infiniband/core/uverbs_cmd.c            |   2 +-
>  drivers/infiniband/core/uverbs_std_types_mr.c   | 112 ++++++++++++++++++++++++
>  drivers/infiniband/core/verbs.c                 |   2 +-
>  drivers/infiniband/hw/bnxt_re/ib_verbs.c        |   7 +-
>  drivers/infiniband/hw/bnxt_re/ib_verbs.h        |   2 +-
>  drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |   3 +-
>  drivers/infiniband/hw/cxgb4/mem.c               |   8 +-
>  drivers/infiniband/hw/efa/efa.h                 |   2 +-
>  drivers/infiniband/hw/efa/efa_verbs.c           |   7 +-
>  drivers/infiniband/hw/hns/hns_roce_device.h     |   2 +-
>  drivers/infiniband/hw/hns/hns_roce_mr.c         |   7 +-
>  drivers/infiniband/hw/i40iw/i40iw_verbs.c       |   6 ++
>  drivers/infiniband/hw/mlx4/mlx4_ib.h            |   2 +-
>  drivers/infiniband/hw/mlx4/mr.c                 |   7 +-
>  drivers/infiniband/hw/mlx5/mlx5_ib.h            |   2 +-
>  drivers/infiniband/hw/mlx5/mr.c                 |  45 +++++++++-
>  drivers/infiniband/hw/mthca/mthca_provider.c    |   8 +-
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c     |   9 +-
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.h     |   3 +-
>  drivers/infiniband/hw/qedr/verbs.c              |   8 +-
>  drivers/infiniband/hw/qedr/verbs.h              |   3 +-
>  drivers/infiniband/hw/usnic/usnic_ib_verbs.c    |   8 +-
>  drivers/infiniband/hw/usnic/usnic_ib_verbs.h    |   2 +-
>  drivers/infiniband/hw/vmw_pvrdma/pvrdma_mr.c    |   6 +-
>  drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.h |   2 +-
>  drivers/infiniband/sw/rdmavt/mr.c               |   6 +-
>  drivers/infiniband/sw/rdmavt/mr.h               |   2 +-
>  drivers/infiniband/sw/rxe/rxe_verbs.c           |   6 ++
>  drivers/infiniband/sw/siw/siw_verbs.c           |   8 +-
>  drivers/infiniband/sw/siw/siw_verbs.h           |   3 +-
>  include/rdma/ib_umem.h                          |  14 ++-
>  include/rdma/ib_verbs.h                         |   4 +-
>  include/uapi/rdma/ib_user_ioctl_cmds.h          |  14 +++
>  37 files changed, 410 insertions(+), 34 deletions(-)
>  create mode 100644 drivers/infiniband/core/umem_dmabuf.c
>  create mode 100644 drivers/infiniband/core/umem_dmabuf.h
> 
> --
> 1.8.3.1
Jason Gunthorpe June 30, 2020, 7:17 p.m. UTC | #5
On Tue, Jun 30, 2020 at 06:46:17PM +0000, Xiong, Jianxin wrote:
> > From: Jason Gunthorpe <jgg@ziepe.ca>
> > Sent: Tuesday, June 30, 2020 10:35 AM
> > To: Xiong, Jianxin <jianxin.xiong@intel.com>
> > Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> > <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > 
> > On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > > address space and page migration between system memory and device
> > > > > memory. HMM doesn't support pinning device memory because pages
> > > > > located on device must be able to migrate to system memory when
> > > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > > handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > > >
> > > > peer-peer access is currently not possible with hmm_range_fault().
> > >
> > > Currently hmm_range_fault() always sets the cpu access flag and device
> > > private pages are migrated to the system RAM in the fault handler.
> > > However, it's possible to have a modified code flow to keep the device
> > > private page info for use with peer to peer access.
> > 
> > Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
> 
> But pfn is supposed to be all that is needed.

Needed for what? The PFN of the DEVICE_PRIVATE pages is useless for
anything.

> > Well, what do you want to happen here? The RDMA parts are
> > reasonable, but I don't want to add new functionality without a
> > purpose - the other parts need to be settled out first.
> 
> At the RDMA side, we mainly want to check if the changes are
> acceptable. For example, the part about adding 'fd' to the device
> ops and the ioctl interface. All the previous comments are very
> helpful for us to refine the patch so that we can be ready when GPU
> side support becomes available.

Well, I'm not totally happy with the way the umem and the fd is
handled so roughly and incompletely..

> > Hum. This is not actually so hard to do. The whole dma buf
> > proposal would make a lot more sense if the 'dma buf MR' had to be
> > the dynamic kind and the driver had to provide the faulting. It
> > would not be so hard to change mlx5 to be able to work like this,
> > perhaps. (the locking might be a bit tricky though)
> 
> The main issue is that not all NICs support ODP.

Sure, but there is lots of infrastructure work here to be done on dma
buf, having a correct consumer in the form of ODP might be helpful to
advance it.

Jason
Xiong, Jianxin June 30, 2020, 8:08 p.m. UTC | #6
> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, June 30, 2020 12:17 PM
> To: Xiong, Jianxin <jianxin.xiong@intel.com>
> Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>; dri-
> devel@lists.freedesktop.org
> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> 
> > >
> > > On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared
> > > > > > virtual address space and page migration between system memory
> > > > > > and device memory. HMM doesn't support pinning device memory
> > > > > > because pages located on device must be able to migrate to
> > > > > > system memory when accessed by CPU. Peer-to-peer access is
> > > > > > possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > > > >
> > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > >
> > > > Currently hmm_range_fault() always sets the cpu access flag and
> > > > device private pages are migrated to the system RAM in the fault handler.
> > > > However, it's possible to have a modified code flow to keep the
> > > > device private page info for use with peer to peer access.
> > >
> > > Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything
> useful.
> >
> > But pfn is supposed to be all that is needed.
> 
> Needed for what? The PFN of the DEVICE_PRIVATE pages is useless for anything.

Hmm. I thought the pfn corresponds to the address in the BAR range. I could be
wrong here. 

> 
> > > Well, what do you want to happen here? The RDMA parts are
> > > reasonable, but I don't want to add new functionality without a
> > > purpose - the other parts need to be settled out first.
> >
> > At the RDMA side, we mainly want to check if the changes are
> > acceptable. For example, the part about adding 'fd' to the device ops
> > and the ioctl interface. All the previous comments are very helpful
> > for us to refine the patch so that we can be ready when GPU side
> > support becomes available.
> 
> Well, I'm not totally happy with the way the umem and the fd is handled so roughly and incompletely..

Yes, this feedback is very helpful. Will work on improving the code.

> 
> > > Hum. This is not actually so hard to do. The whole dma buf proposal
> > > would make a lot more sense if the 'dma buf MR' had to be the
> > > dynamic kind and the driver had to provide the faulting. It would
> > > not be so hard to change mlx5 to be able to work like this, perhaps.
> > > (the locking might be a bit tricky though)
> >
> > The main issue is that not all NICs support ODP.
> 
> Sure, but there is lots of infrastructure work here to be done on dma buf, having a correct consumer in the form of ODP might be helpful to
> advance it.

Good point. Thanks.

> 
> Jason
Christian König July 1, 2020, 9:03 a.m. UTC | #7
Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
>> -----Original Message-----
>> From: Jason Gunthorpe <jgg@ziepe.ca>
>> Sent: Tuesday, June 30, 2020 10:35 AM
>> To: Xiong, Jianxin <jianxin.xiong@intel.com>
>> Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
>> <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
>> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
>>
>> On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
>>>>> Heterogeneous Memory Management (HMM) utilizes
>>>>> mmu_interval_notifier and ZONE_DEVICE to support shared virtual
>>>>> address space and page migration between system memory and device
>>>>> memory. HMM doesn't support pinning device memory because pages
>>>>> located on device must be able to migrate to system memory when
>>>>> accessed by CPU. Peer-to-peer access is possible if the peer can
>>>>> handle page fault. For RDMA, that means the NIC must support on-demand paging.
>>>> peer-peer access is currently not possible with hmm_range_fault().
>>> Currently hmm_range_fault() always sets the cpu access flag and device
>>> private pages are migrated to the system RAM in the fault handler.
>>> However, it's possible to have a modified code flow to keep the device
>>> private page info for use with peer to peer access.
>> Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
> But pfn is supposed to be all that is needed.
>
>>>> So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
>>> That's right, the patch alone is just half of the story. The
>>> functionality depends on availability of dma-buf exporter that can pin
>>> the device memory.
>> Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
>> other parts need to be settled out first.
> At the RDMA side, we mainly want to check if the changes are acceptable. For example,
> the part about adding 'fd' to the device ops and the ioctl interface. All the previous
> comments are very helpful for us to refine the patch so that we can be ready when
> GPU side support becomes available.
>
>> The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
>> support non-dynamic mapping?
> We are working on direct direction.
>
>>>>> migrate to system RAM. This is due to the lack of knowledge about
>>>>> whether the importer can perform peer-to-peer access and the lack
>>>>> of resource limit control measure for GPU. For the first part, the
>>>>> latest dma-buf driver has a peer-to-peer flag for the importer,
>>>>> but the flag is currently tied to dynamic mapping support, which
>>>>> requires on-demand paging support from the NIC to work.
>>>> ODP for DMA buf?
>>> Right.
>> Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
>> dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
>> locking might be a bit tricky though)
> The main issue is that not all NICs support ODP.

You don't need on-demand paging support from the NIC for dynamic mapping 
to work.

All you need is the ability to stop and wait for ongoing accesses to end
and make sure that new ones grab a new mapping.

Apart from that this is a rather interesting work.

Regards,
Christian.

>
>>>>> There are a few possible ways to address these issues, such as
>>>>> decoupling peer-to-peer flag from dynamic mapping, allowing more
>>>>> leeway for individual drivers to make the pinning decision and
>>>>> adding GPU resource limit control via cgroup. We would like to get
>>>>> comments on this patch series with the assumption that device
>>>>> memory pinning via dma-buf is supported by some GPU drivers, and
>>>>> at the same time welcome open discussions on how to address the
>>>>> aforementioned issues as well as GPU-NIC peer-to-peer access solutions in general.
>>>> These seem like DMA buf problems, not RDMA problems, why are you
>>>> asking these questions with a RDMA patch set? The usual DMA buf people are not even Cc'd here.
>>> The intention is to have people from both RDMA and DMA buffer side to
>>> comment. Sumit Semwal is the DMA buffer maintainer according to the
>>> MAINTAINERS file. I agree more people could be invited to the discussion.
>>> Just added Christian Koenig to the cc-list.
>> Would be good to have added the drm lists too
> Thanks, cc'd dri-devel here, and will also do the same for the previous part of the thread.
>
>>> If the umem_description you mentioned is for information used to
>>> create the umem (e.g. a structure for all the parameters), then this would work better.
>> It would make some more sense, and avoid all these weird EOPNOTSUPPS.
> Good, thanks for the suggestion.
>
>> Jason
Daniel Vetter July 1, 2020, 12:07 p.m. UTC | #8
Either my mailer ate half the thread or it's still stuck somewhere, so
jumping in the middle a bit.

On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
> Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
> > > -----Original Message-----
> > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > Sent: Tuesday, June 30, 2020 10:35 AM
> > > To: Xiong, Jianxin <jianxin.xiong@intel.com>
> > > Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> > > <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> > > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > > 
> > > On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > > > address space and page migration between system memory and device
> > > > > > memory. HMM doesn't support pinning device memory because pages
> > > > > > located on device must be able to migrate to system memory when
> > > > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > > > handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > > Currently hmm_range_fault() always sets the cpu access flag and device
> > > > private pages are migrated to the system RAM in the fault handler.
> > > > However, it's possible to have a modified code flow to keep the device
> > > > private page info for use with peer to peer access.
> > > Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
> > But pfn is supposed to be all that is needed.
> > 
> > > > > So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
> > > > That's right, the patch alone is just half of the story. The
> > > > functionality depends on availability of dma-buf exporter that can pin
> > > > the device memory.
> > > Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
> > > other parts need to be settled out first.
> > At the RDMA side, we mainly want to check if the changes are acceptable. For example,
> > the part about adding 'fd' to the device ops and the ioctl interface. All the previous
> > comments are very helpful for us to refine the patch so that we can be ready when
> > GPU side support becomes available.
> > 
> > > The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
> > > support non-dynamic mapping?
> > We are working on direct direction.
> > 
> > > > > > migrate to system RAM. This is due to the lack of knowledge about
> > > > > > whether the importer can perform peer-to-peer access and the lack
> > > > > > of resource limit control measure for GPU. For the first part, the
> > > > > > latest dma-buf driver has a peer-to-peer flag for the importer,
> > > > > > but the flag is currently tied to dynamic mapping support, which
> > > > > > requires on-demand paging support from the NIC to work.
> > > > > ODP for DMA buf?
> > > > Right.
> > > Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
> > > dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
> > > locking might be a bit tricky though)
> > The main issue is that not all NICs support ODP.
> 
> You don't need on-demand paging support from the NIC for dynamic mapping to
> work.
> 
> All you need is the ability to stop wait for ongoing accesses to end and
> make sure that new ones grab a new mapping.

So, having not much of a clue about rdma myself, this sounds rather interesting.
Sure it would result in immediately re-acquiring the pages, but that's
also really all we need to be able to move buffers around on the gpu side.
And with dma_resv_lock there's no livelock risk if the NIC immediately
starts a kthread/work_struct which reacquires all the dma-buf and
everything else it needs. Plus also with the full ww_mutex deadlock
backoff dance there's no locking issues with having to acquire an entire
pile of dma_resv_lock, that's natively supported (gpus very much need to
be able to lock arbitrary set of buffers).

And I think if that would allow us to avoid the entire "avoid random
drivers pinning dma-buf into vram" discussion, it would be much better
and quicker to land something like that.

I guess the big question is going to be how to fit this into rdma, since
the ww_mutex deadlock backoff dance needs to be done at a fairly high
level. For gpu drivers it's always done at the top level ioctl entry
point.
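For reference, that backoff dance looks roughly like the sketch below
when two reservation locks have to be taken together (generalizing to a
whole list of buffers is what gpu drivers do in their execbuf paths):

#include <linux/dma-resv.h>
#include <linux/ww_mutex.h>

/* Run 'work' with the reservation locks of both objects held. */
static void with_both_locked(struct dma_resv *a, struct dma_resv *b,
			     void (*work)(void))
{
	struct dma_resv *contended = NULL;
	struct ww_acquire_ctx ctx;

	ww_acquire_init(&ctx, &reservation_ww_class);
retry:
	if (contended)
		dma_resv_lock_slow(contended, &ctx);	/* cannot fail */

	/* dma_resv_lock() with a ctx returns 0 or -EDEADLK */
	if (a != contended && dma_resv_lock(a, &ctx) == -EDEADLK) {
		if (contended)
			dma_resv_unlock(contended);
		contended = a;
		goto retry;
	}

	if (b != contended && dma_resv_lock(b, &ctx) == -EDEADLK) {
		dma_resv_unlock(a);
		contended = b;
		goto retry;
	}

	ww_acquire_done(&ctx);
	work();
	dma_resv_unlock(a);
	dma_resv_unlock(b);
	ww_acquire_fini(&ctx);
}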

> Apart from that this is a rather interesting work.
> 
> Regards,
> Christian.
> 
> > 
> > > > > > There are a few possible ways to address these issues, such as
> > > > > > decoupling peer-to-peer flag from dynamic mapping, allowing more
> > > > > > leeway for individual drivers to make the pinning decision and
> > > > > > adding GPU resource limit control via cgroup. We would like to get
> > > > > > comments on this patch series with the assumption that device
> > > > > > memory pinning via dma-buf is supported by some GPU drivers, and
> > > > > > at the same time welcome open discussions on how to address the
> > > > > > aforementioned issues as well as GPU-NIC peer-to-peer access solutions in general.
> > > > > These seem like DMA buf problems, not RDMA problems, why are you
> > > > > asking these questions with a RDMA patch set? The usual DMA buf people are not even Cc'd here.
> > > > The intention is to have people from both RDMA and DMA buffer side to
> > > > comment. Sumit Semwal is the DMA buffer maintainer according to the
> > > > MAINTAINERS file. I agree more people could be invited to the discussion.
> > > > Just added Christian Koenig to the cc-list.


MAINTAINERS also says to cc an entire pile of mailing lists, where the
usual suspects (including Christian and me) hang around. Is that the
reason I got only like half the thread here?

For next time around, really include everyone relevant here please.
-Daniel

> > > Would be good to have added the drm lists too
> > Thanks, cc'd dri-devel here, and will also do the same for the previous part of the thread.
> > 
> > > > If the umem_description you mentioned is for information used to
> > > > create the umem (e.g. a structure for all the parameters), then this would work better.
> > > It would make some more sense, and avoid all these weird EOPNOTSUPPS.
> > Good, thanks for the suggestion.
> > 
> > > Jason
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Daniel Vetter July 1, 2020, 12:14 p.m. UTC | #9
On Wed, Jul 01, 2020 at 02:07:44PM +0200, Daniel Vetter wrote:
> Either my mailer ate half the thread or it's still stuck somewhere, so
> jumping in the middle a bit.
> 
> On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
> > Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
> > > > -----Original Message-----
> > > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > > Sent: Tuesday, June 30, 2020 10:35 AM
> > > > To: Xiong, Jianxin <jianxin.xiong@intel.com>
> > > > Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> > > > <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> > > > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > > > 
> > > > On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > > > > address space and page migration between system memory and device
> > > > > > > memory. HMM doesn't support pinning device memory because pages
> > > > > > > located on device must be able to migrate to system memory when
> > > > > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > > > > handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > > > Currently hmm_range_fault() always sets the cpu access flag and device
> > > > > private pages are migrated to the system RAM in the fault handler.
> > > > > However, it's possible to have a modified code flow to keep the device
> > > > > private page info for use with peer to peer access.
> > > > Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
> > > But pfn is supposed to be all that is needed.
> > > 
> > > > > > So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
> > > > > That's right, the patch alone is just half of the story. The
> > > > > functionality depends on availability of dma-buf exporter that can pin
> > > > > the device memory.
> > > > Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
> > > > other parts need to be settled out first.
> > > At the RDMA side, we mainly want to check if the changes are acceptable. For example,
> > > the part about adding 'fd' to the device ops and the ioctl interface. All the previous
> > > comments are very helpful for us to refine the patch so that we can be ready when
> > > GPU side support becomes available.
> > > 
> > > > The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
> > > > support non-dynamic mapping?
> > > We are working on direct direction.
> > > 
> > > > > > > migrate to system RAM. This is due to the lack of knowledge about
> > > > > > > whether the importer can perform peer-to-peer access and the lack
> > > > > > > of resource limit control measure for GPU. For the first part, the
> > > > > > > latest dma-buf driver has a peer-to-peer flag for the importer,
> > > > > > > but the flag is currently tied to dynamic mapping support, which
> > > > > > > requires on-demand paging support from the NIC to work.
> > > > > > ODP for DMA buf?
> > > > > Right.
> > > > Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
> > > > dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
> > > > locking might be a bit tricky though)
> > > The main issue is that not all NICs support ODP.
> > 
> > You don't need on-demand paging support from the NIC for dynamic mapping to
> > work.
> > 
> > All you need is the ability to stop wait for ongoing accesses to end and
> > make sure that new ones grab a new mapping.
> 
> So having no clue about rdma myself much, this sounds rather interesting.
> Sure it would result in immediately re-acquiring the pages, but that's
> also really all we need to be able to move buffers around on the gpu side.
> And with dma_resv_lock there's no livelock risk if the NIC immediately
> starts a kthread/work_struct which reacquires all the dma-buf and
> everything else it needs. Plus also with the full ww_mutex deadlock
> backoff dance there's no locking issues with having to acquire an entire
> pile of dma_resv_lock, that's natively supported (gpus very much need to
> be able to lock arbitrary set of buffers).
> 
> And I think if that would allow us to avoid the entire "avoid random
> drivers pinning dma-buf into vram" discussions, much better and quicker to
> land something like that.
> 
> I guess the big question is going to be how to fit this into rdma, since
> the ww_mutex deadlock backoff dance needs to be done at a fairly high
> level. For gpu drivers it's always done at the top level ioctl entry
> point.

Also, just to alleviate fears: I think all that dynamic dma-buf stuff for
rdma should be doable this way _without_ having to interact with
dma_fence. Avoiding that I think is the biggest request Jason has in this
area :-)

Furthermore, it is officially ok to allocate memory while holding a
dma_resv_lock. What is not ok (and might cause issues if you somehow mix
up things in strange ways) is taking a userspace fault, because gpu
drivers must be able to take the dma_resv_lock in their fault handlers.
That might pose a problem.

Also, all these rules are now enforced by lockdep, might_fault() and
similar checks.
-Daniel

> 
> > Apart from that this is a rather interesting work.
> > 
> > Regards,
> > Christian.
> > 
> > > 
> > > > > > > There are a few possible ways to address these issues, such as
> > > > > > > decoupling peer-to-peer flag from dynamic mapping, allowing more
> > > > > > > leeway for individual drivers to make the pinning decision and
> > > > > > > adding GPU resource limit control via cgroup. We would like to get
> > > > > > > comments on this patch series with the assumption that device
> > > > > > > memory pinning via dma-buf is supported by some GPU drivers, and
> > > > > > > at the same time welcome open discussions on how to address the
> > > > > > > aforementioned issues as well as GPU-NIC peer-to-peer access solutions in general.
> > > > > > These seem like DMA buf problems, not RDMA problems, why are you
> > > > > > asking these questions with a RDMA patch set? The usual DMA buf people are not even Cc'd here.
> > > > > The intention is to have people from both RDMA and DMA buffer side to
> > > > > comment. Sumit Semwal is the DMA buffer maintainer according to the
> > > > > MAINTAINERS file. I agree more people could be invited to the discussion.
> > > > > Just added Christian Koenig to the cc-list.
> 
> 
> MAINTAINERS also says to cc and entire pile of mailing lists, where the
> usual suspects (including Christian and me) hang around. Is that the
> reason I got only like half the thread here?
> 
> For next time around, really include everyone relevant here please.
> -Daniel
> 
> > > > Would be good to have added the drm lists too
> > > Thanks, cc'd dri-devel here, and will also do the same for the previous part of the thread.
> > > 
> > > > > If the umem_description you mentioned is for information used to
> > > > > create the umem (e.g. a structure for all the parameters), then this would work better.
> > > > It would make some more sense, and avoid all these weird EOPNOTSUPPS.
> > > Good, thanks for the suggestion.
> > > 
> > > > Jason
> > 
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Jason Gunthorpe July 1, 2020, 12:39 p.m. UTC | #10
On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
> Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
> > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > Sent: Tuesday, June 30, 2020 10:35 AM
> > > To: Xiong, Jianxin <jianxin.xiong@intel.com>
> > > Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> > > <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> > > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > > 
> > > On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > > > address space and page migration between system memory and device
> > > > > > memory. HMM doesn't support pinning device memory because pages
> > > > > > located on device must be able to migrate to system memory when
> > > > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > > > handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > > Currently hmm_range_fault() always sets the cpu access flag and device
> > > > private pages are migrated to the system RAM in the fault handler.
> > > > However, it's possible to have a modified code flow to keep the device
> > > > private page info for use with peer to peer access.
> > > Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
> > But pfn is supposed to be all that is needed.
> > 
> > > > > So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
> > > > That's right, the patch alone is just half of the story. The
> > > > functionality depends on availability of dma-buf exporter that can pin
> > > > the device memory.
> > > Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
> > > other parts need to be settled out first.
> > At the RDMA side, we mainly want to check if the changes are acceptable. For example,
> > the part about adding 'fd' to the device ops and the ioctl interface. All the previous
> > comments are very helpful for us to refine the patch so that we can be ready when
> > GPU side support becomes available.
> > 
> > > The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
> > > support non-dynamic mapping?
> > We are working in that direction.
> > 
> > > > > > migrate to system RAM. This is due to the lack of knowledge about
> > > > > > whether the importer can perform peer-to-peer access and the lack
> > > > > > of resource limit control measure for GPU. For the first part, the
> > > > > > latest dma-buf driver has a peer-to-peer flag for the importer,
> > > > > > but the flag is currently tied to dynamic mapping support, which
> > > > > > requires on-demand paging support from the NIC to work.
> > > > > ODP for DMA buf?
> > > > Right.
> > > Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
> > > dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
> > > locking might be a bit tricky though)
> > The main issue is that not all NICs support ODP.
> 
> You don't need on-demand paging support from the NIC for dynamic mapping to
> work.
> 
> All you need is the ability to stop wait for ongoing accesses to end and
> make sure that new ones grab a new mapping.

Swap and flush isn't a general HW ability either..

I'm unclear how this could be useful; it is guaranteed to corrupt
in-progress writes?

Did you mean pause, swap and resume? That's ODP.

Jason
Christian König July 1, 2020, 12:55 p.m. UTC | #11
Am 01.07.20 um 14:39 schrieb Jason Gunthorpe:
> On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
>> Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
>>>> From: Jason Gunthorpe <jgg@ziepe.ca>
>>>> Sent: Tuesday, June 30, 2020 10:35 AM
>>>> To: Xiong, Jianxin <jianxin.xiong@intel.com>
>>>> Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
>>>> <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
>>>> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
>>>>
>>>> On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
>>>>>>> Heterogeneous Memory Management (HMM) utilizes
>>>>>>> mmu_interval_notifier and ZONE_DEVICE to support shared virtual
>>>>>>> address space and page migration between system memory and device
>>>>>>> memory. HMM doesn't support pinning device memory because pages
>>>>>>> located on device must be able to migrate to system memory when
>>>>>>> accessed by CPU. Peer-to-peer access is possible if the peer can
>>>>>>> handle page fault. For RDMA, that means the NIC must support on-demand paging.
>>>>>> peer-peer access is currently not possible with hmm_range_fault().
>>>>> Currently hmm_range_fault() always sets the cpu access flag and device
>>>>> private pages are migrated to the system RAM in the fault handler.
>>>>> However, it's possible to have a modified code flow to keep the device
>>>>> private page info for use with peer to peer access.
>>>> Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
>>> But pfn is supposed to be all that is needed.
>>>
>>>>>> So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
>>>>> That's right, the patch alone is just half of the story. The
>>>>> functionality depends on availability of dma-buf exporter that can pin
>>>>> the device memory.
>>>> Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
>>>> other parts need to be settled out first.
>>> At the RDMA side, we mainly want to check if the changes are acceptable. For example,
>>> the part about adding 'fd' to the device ops and the ioctl interface. All the previous
>>> comments are very helpful for us to refine the patch so that we can be ready when
>>> GPU side support becomes available.
>>>
>>>> The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
>>>> support non-dynamic mapping?
>>> We are working in that direction.
>>>
>>>>>>> migrate to system RAM. This is due to the lack of knowledge about
>>>>>>> whether the importer can perform peer-to-peer access and the lack
>>>>>>> of resource limit control measure for GPU. For the first part, the
>>>>>>> latest dma-buf driver has a peer-to-peer flag for the importer,
>>>>>>> but the flag is currently tied to dynamic mapping support, which
>>>>>>> requires on-demand paging support from the NIC to work.
>>>>>> ODP for DMA buf?
>>>>> Right.
>>>> Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
>>>> dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
>>>> locking might be a bit tricky though)
>>> The main issue is that not all NICs support ODP.
>> You don't need on-demand paging support from the NIC for dynamic mapping to
>> work.
>>
>> All you need is the ability to stop wait for ongoing accesses to end and
>> make sure that new ones grab a new mapping.
> Swap and flush isn't a general HW ability either..
>
> I'm unclear how this could be useful, it is guarenteed to corrupt
> in-progress writes?
>
> Did you mean pause, swap and resume? That's ODP.

Yes, something like this. And good to know, never heard of ODP.


On the GPU side we can pipeline things, e.g. you can program the
hardware so that page tables are changed at a certain point in time.

So what we do when we get a notification that a buffer will move
around is to mark this buffer in our structures as invalid and return a
fence so that the exporter is able to wait for ongoing stuff to finish.

The actual move then happens only after the ongoing operations on the
GPU are finished, and on the next operation we grab the new location of
the buffer and re-program the page tables to it.

This way all the CPU does is really just plan asynchronous page
table changes which are executed on the GPU later on.

You can of course do it synchronously as well, but this would hurt our
performance pretty badly.
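
In the dynamic dma-buf API this notification is the move_notify hook of dma_buf_attach_ops. A rough, hypothetical importer-side sketch along these lines (the my_bo structure and its worker are made up, not taken from any driver):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/workqueue.h>

struct my_bo {
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;			/* cached mapping, may go stale */
	bool sgt_valid;
	struct work_struct revalidate_work;	/* re-maps before the next submission */
};

/* Called by the exporter with the reservation lock already held. */
static void my_move_notify(struct dma_buf_attachment *attach)
{
	struct my_bo *bo = attach->importer_priv;

	dma_resv_assert_held(attach->dmabuf->resv);

	/*
	 * Only mark the cached mapping invalid here.  The exporter waits on
	 * the fences in the reservation object, so the actual move happens
	 * after in-flight GPU work has finished; the next submission grabs
	 * the buffer's new location and re-programs the page tables.
	 */
	bo->sgt_valid = false;
	schedule_work(&bo->revalidate_work);
}

static const struct dma_buf_attach_ops my_attach_ops = {
	.allow_peer2peer = true,	/* this importer can do P2P */
	.move_notify = my_move_notify,
};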

Regards,
Christian.

>
> Jason
Daniel Vetter July 1, 2020, 3:42 p.m. UTC | #12
On Wed, Jul 1, 2020 at 2:56 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 01.07.20 um 14:39 schrieb Jason Gunthorpe:
> > On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
> >> Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
> >>>> From: Jason Gunthorpe <jgg@ziepe.ca>
> >>>> Sent: Tuesday, June 30, 2020 10:35 AM
> >>>> To: Xiong, Jianxin <jianxin.xiong@intel.com>
> >>>> Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> >>>> <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
> >>>> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> >>>>
> >>>> On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> >>>>>>> Heterogeneous Memory Management (HMM) utilizes
> >>>>>>> mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> >>>>>>> address space and page migration between system memory and device
> >>>>>>> memory. HMM doesn't support pinning device memory because pages
> >>>>>>> located on device must be able to migrate to system memory when
> >>>>>>> accessed by CPU. Peer-to-peer access is possible if the peer can
> >>>>>>> handle page fault. For RDMA, that means the NIC must support on-demand paging.
> >>>>>> peer-peer access is currently not possible with hmm_range_fault().
> >>>>> Currently hmm_range_fault() always sets the cpu access flag and device
> >>>>> private pages are migrated to the system RAM in the fault handler.
> >>>>> However, it's possible to have a modified code flow to keep the device
> >>>>> private page info for use with peer to peer access.
> >>>> Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
> >>> But pfn is supposed to be all that is needed.
> >>>
> >>>>>> So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
> >>>>> That's right, the patch alone is just half of the story. The
> >>>>> functionality depends on availability of dma-buf exporter that can pin
> >>>>> the device memory.
> >>>> Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the
> >>>> other parts need to be settled out first.
> >>> At the RDMA side, we mainly want to check if the changes are acceptable. For example,
> >>> the part about adding 'fd' to the device ops and the ioctl interface. All the previous
> >>> comments are very helpful for us to refine the patch so that we can be ready when
> >>> GPU side support becomes available.
> >>>
> >>>> The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to
> >>>> support non-dynamic mapping?
> >>> We are working in that direction.
> >>>
> >>>>>>> migrate to system RAM. This is due to the lack of knowledge about
> >>>>>>> whether the importer can perform peer-to-peer access and the lack
> >>>>>>> of resource limit control measure for GPU. For the first part, the
> >>>>>>> latest dma-buf driver has a peer-to-peer flag for the importer,
> >>>>>>> but the flag is currently tied to dynamic mapping support, which
> >>>>>>> requires on-demand paging support from the NIC to work.
> >>>>>> ODP for DMA buf?
> >>>>> Right.
> >>>> Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the
> >>>> dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the
> >>>> locking might be a bit tricky though)
> >>> The main issue is that not all NICs support ODP.
> >> You don't need on-demand paging support from the NIC for dynamic mapping to
> >> work.
> >>
> >> All you need is the ability to stop wait for ongoing accesses to end and
> >> make sure that new ones grab a new mapping.
> > Swap and flush isn't a general HW ability either..
> >
> > I'm unclear how this could be useful, it is guarenteed to corrupt
> > in-progress writes?
> >
> > Did you mean pause, swap and resume? That's ODP.
>
> Yes, something like this. And good to know, never heard of ODP.

Hm I thought ODP was full hw page faults at an individual page level,
and this stop&resume is for the entire nic. Under the hood both apply
back-pressure on the network if a transmission can't be received, but
I thought the ODP one is a lot more fine-grained. For this dma_buf
stop-the-world approach we'd also need to guarantee _all_ buffers are
present again (without hw page faults on the nic side of things),
which has some additional locking implications.

> On the GPU side we can pipeline things, e.g. you can program the
> hardware that page tables are changed at a certain point in time.
>
> So what we do is when we get a notification that a buffer will move
> around is to mark this buffer in our structures as invalid and return a
> fence to so that the exporter is able to wait for ongoing stuff to finish.
>
> The actual move then happens only after the ongoing operations on the
> GPU are finished and on the next operation we grab the new location of
> the buffer and re-program the page tables to it.
>
> This way all the CPU does is really just planning asynchronous page
> table changes which are executed on the GPU later on.
>
> You can of course do it synchronized as well, but this would hurt our
> performance pretty badly.

So since Jason really doesn't like dma_fence much I think for rdma
synchronous it is. And it shouldn't really matter, since waiting for a
small transaction to complete at rdma wire speed isn't really that
long an operation. Should be much faster than waiting for the gpu to
complete quite sizeable amounts of rendering first. Plus with real hw
page faults it's really just pte write plus tlb flush, that should
definitely be possible from the ->move_notify hook.
But even the global stop-the-world approach I expect is going to be
much faster than a preempt request on a gpu.
-Daniel

>
> Regards,
> Christian.
>
> >
> > Jason
>
Jason Gunthorpe July 1, 2020, 5:15 p.m. UTC | #13
On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > >> All you need is the ability to stop wait for ongoing accesses to end and
> > >> make sure that new ones grab a new mapping.
> > > Swap and flush isn't a general HW ability either..
> > >
> > > I'm unclear how this could be useful, it is guarenteed to corrupt
> > > in-progress writes?
> > >
> > > Did you mean pause, swap and resume? That's ODP.
> >
> > Yes, something like this. And good to know, never heard of ODP.
> 
> Hm I thought ODP was full hw page faults at an individual page
> level,

Yes

> and this stop&resume is for the entire nic. Under the hood both apply
> back-pressure on the network if a transmission can't be received,
> but

NICs don't do stop and resume; blocking the Rx pipe is very
problematic and performance destroying.

The strategy for something like ODP is more complex, and so far no NIC
has deployed it at any granularity larger than per-page.

> So since Jason really doesn't like dma_fence much I think for rdma
> synchronous it is. And it shouldn't really matter, since waiting for a
> small transaction to complete at rdma wire speed isn't really that
> long an operation. 

Even if DMA fence were to somehow be involved, how would it look?

Jason
Jason Gunthorpe July 2, 2020, 12:27 p.m. UTC | #14
On Tue, Jun 30, 2020 at 08:08:46PM +0000, Xiong, Jianxin wrote:
> > From: Jason Gunthorpe <jgg@ziepe.ca>
> > Sent: Tuesday, June 30, 2020 12:17 PM
> > To: Xiong, Jianxin <jianxin.xiong@intel.com>
> > Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky
> > <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>; dri-
> > devel@lists.freedesktop.org
> > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > 
> > > >
> > > > On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
> > > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared
> > > > > > > virtual address space and page migration between system memory
> > > > > > > and device memory. HMM doesn't support pinning device memory
> > > > > > > because pages located on device must be able to migrate to
> > > > > > > system memory when accessed by CPU. Peer-to-peer access is
> > > > > > > possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.
> > > > > >
> > > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > > >
> > > > > Currently hmm_range_fault() always sets the cpu access flag and
> > > > > device private pages are migrated to the system RAM in the fault handler.
> > > > > However, it's possible to have a modified code flow to keep the
> > > > > device private page info for use with peer to peer access.
> > > >
> > > > Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything
> > useful.
> > >
> > > But pfn is supposed to be all that is needed.
> > 
> > Needed for what? The PFN of the DEVICE_PRIVATE pages is useless for anything.
> 
> Hmm. I thought the pfn corresponds to the address in the BAR range. I could be
> wrong here. 

No, DEVICE_PRIVATE is a dummy pfn pointing into empty address space.

Jason
Daniel Vetter July 2, 2020, 1:10 p.m. UTC | #15
On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > > >> All you need is the ability to stop wait for ongoing accesses to end and
> > > >> make sure that new ones grab a new mapping.
> > > > Swap and flush isn't a general HW ability either..
> > > >
> > > > I'm unclear how this could be useful, it is guarenteed to corrupt
> > > > in-progress writes?
> > > >
> > > > Did you mean pause, swap and resume? That's ODP.
> > >
> > > Yes, something like this. And good to know, never heard of ODP.
> > 
> > Hm I thought ODP was full hw page faults at an individual page
> > level,
> 
> Yes
> 
> > and this stop&resume is for the entire nic. Under the hood both apply
> > back-pressure on the network if a transmission can't be received,
> > but
> 
> NIC's don't do stop and resume, blocking the Rx pipe is very
> problematic and performance destroying.
> 
> The strategy for something like ODP is more complex, and so far no NIC
> has deployed it at any granularity larger than per-page.
> 
> > So since Jason really doesn't like dma_fence much I think for rdma
> > synchronous it is. And it shouldn't really matter, since waiting for a
> > small transaction to complete at rdma wire speed isn't really that
> > long an operation. 
> 
> Even if DMA fence were to somehow be involved, how would it look?

Well above you're saying it would be performance destroying, but let's
pretend that's not a problem :-) Also, I have no clue about rdma, so this
is really just the flow we have on the gpu side.

0. rdma driver maintains a list of all dma-bufs that it has mapped
somewhere and is currently using for transactions

1. rdma driver gets a dma_buf->notify_move callback on one of these
buffers. To handle that it:
	1. stops hw access somehow at the rx
	2. flushes caches and whatever else is needed
	3. moves the unmapped buffer on a special list or marks it in some
	different way as unavailable
	4. launches the kthread/work_struct to fix everything back up

2. dma-buf exporter (gpu driver) can now issue the commands to move the
buffer around

3. rdma driver worker gets busy to restart rx:
	1. lock all dma-buf that are currently in use (dma_resv_lock).
	thanks to ww_mutex deadlock avoidance this is possible
	2. for any buffers which have been marked as unavailable in 1.3
	grab a new mapping (which might now be in system memory, or again
	peer2peer but different address)
	3. restart hw and rx
	4. unlock all dma-buf locks (dma_resv_unlock)

There is a minor problem because step 2 only queues the buffer moves up
behind a pile of dma_fences, and atm we haven't made it absolutely
clear who's responsible for waiting for those to complete. For gpu drivers
it's the importer since gpu drivers don't have big qualms about
dma_fences, so 3.2. would perhaps also include a dma_fence_wait to make
sure the buffer move has actually completed.

Above flow is more or less exactly what happens for gpu workloads where we
can preempt running computations. Instead of stopping rx we preempt the
compute job and remove it from the scheduler queues, and instead of
restarting rx we just put the compute job back onto the scheduler queue as
eligible for gpu time. Otherwise it's exactly the same stuff.

Of course if you only have a single compute job and too many such
interruptions, then performance is going to tank. Don't do that, instead
make sure you have enough vram or system memory or whatever :-)
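
To make step 3.1 concrete, here is a rough sketch of the ww_mutex backoff dance over a set of dma-bufs, modelled on what gpu drivers (e.g. drm_gem_lock_reservations()) do; the array-based helper is hypothetical and error handling is trimmed to the essentials:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/ww_mutex.h>

/* Hypothetical helper: lock the reservation objects of 'count' dma-bufs. */
static int lock_all_dmabufs(struct dma_buf **bufs, int count,
			    struct ww_acquire_ctx *ctx)
{
	int contended = -1;
	int i, ret;

	ww_acquire_init(ctx, &reservation_ww_class);
retry:
	if (contended != -1) {
		/* We lost a deadlock race; sleep-wait on the contended lock. */
		ret = dma_resv_lock_slow_interruptible(bufs[contended]->resv,
						       ctx);
		if (ret) {
			ww_acquire_fini(ctx);
			return ret;
		}
	}

	for (i = 0; i < count; i++) {
		if (i == contended)
			continue;

		ret = dma_resv_lock_interruptible(bufs[i]->resv, ctx);
		if (ret) {
			int j;

			/* Back off: drop everything we managed to lock. */
			for (j = 0; j < i; j++)
				dma_resv_unlock(bufs[j]->resv);
			if (contended >= i)
				dma_resv_unlock(bufs[contended]->resv);

			if (ret == -EDEADLK) {
				contended = i;
				goto retry;
			}
			ww_acquire_fini(ctx);
			return ret;
		}
	}

	ww_acquire_done(ctx);
	return 0;
}

/* Unlock path: dma_resv_unlock() each buffer, then ww_acquire_fini(ctx). */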

Cheers, Daniel
Jason Gunthorpe July 2, 2020, 1:29 p.m. UTC | #16
On Thu, Jul 02, 2020 at 03:10:00PM +0200, Daniel Vetter wrote:
> On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > > > >> All you need is the ability to stop wait for ongoing accesses to end and
> > > > >> make sure that new ones grab a new mapping.
> > > > > Swap and flush isn't a general HW ability either..
> > > > >
> > > > > I'm unclear how this could be useful, it is guarenteed to corrupt
> > > > > in-progress writes?
> > > > >
> > > > > Did you mean pause, swap and resume? That's ODP.
> > > >
> > > > Yes, something like this. And good to know, never heard of ODP.
> > > 
> > > Hm I thought ODP was full hw page faults at an individual page
> > > level,
> > 
> > Yes
> > 
> > > and this stop&resume is for the entire nic. Under the hood both apply
> > > back-pressure on the network if a transmission can't be received,
> > > but
> > 
> > NIC's don't do stop and resume, blocking the Rx pipe is very
> > problematic and performance destroying.
> > 
> > The strategy for something like ODP is more complex, and so far no NIC
> > has deployed it at any granularity larger than per-page.
> > 
> > > So since Jason really doesn't like dma_fence much I think for rdma
> > > synchronous it is. And it shouldn't really matter, since waiting for a
> > > small transaction to complete at rdma wire speed isn't really that
> > > long an operation. 
> > 
> > Even if DMA fence were to somehow be involved, how would it look?
> 
> Well above you're saying it would be performance destroying, but let's
> pretend that's not a problem :-) Also, I have no clue about rdma, so this
> is really just the flow we have on the gpu side.

I see, no, this is not workable; the command flow in RDMA is not at
all like GPU - what you are proposing is a global 'stop the whole
chip' of both the Tx and Rx flows for an undetermined time. Not feasible.

What we can do is use ODP techniques and pause only the MR attached to
the DMA buf with the process you outline below. This is not so hard to
implement.

> 3. rdma driver worker gets busy to restart rx:
> 	1. lock all dma-buf that are currently in use (dma_resv_lock).
> 	thanks to ww_mutex deadlock avoidance this is possible

Why all? Why not just lock the one that was invalidated to restore the
mappings? That is some artifact of the GPU approach?

And why is this done with work queues and locking instead of a
callback saying the buffer is valid again?

Jason
Christian König July 2, 2020, 2:50 p.m. UTC | #17
Am 02.07.20 um 15:29 schrieb Jason Gunthorpe:
> On Thu, Jul 02, 2020 at 03:10:00PM +0200, Daniel Vetter wrote:
>> On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
>>> On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
>>>>>>> All you need is the ability to stop wait for ongoing accesses to end and
>>>>>>> make sure that new ones grab a new mapping.
>>>>>> Swap and flush isn't a general HW ability either..
>>>>>>
>>>>>> I'm unclear how this could be useful, it is guarenteed to corrupt
>>>>>> in-progress writes?
>>>>>>
>>>>>> Did you mean pause, swap and resume? That's ODP.
>>>>> Yes, something like this. And good to know, never heard of ODP.
>>>> Hm I thought ODP was full hw page faults at an individual page
>>>> level,
>>> Yes
>>>
>>>> and this stop&resume is for the entire nic. Under the hood both apply
>>>> back-pressure on the network if a transmission can't be received,
>>>> but
>>> NIC's don't do stop and resume, blocking the Rx pipe is very
>>> problematic and performance destroying.
>>>
>>> The strategy for something like ODP is more complex, and so far no NIC
>>> has deployed it at any granularity larger than per-page.
>>>
>>>> So since Jason really doesn't like dma_fence much I think for rdma
>>>> synchronous it is. And it shouldn't really matter, since waiting for a
>>>> small transaction to complete at rdma wire speed isn't really that
>>>> long an operation.
>>> Even if DMA fence were to somehow be involved, how would it look?
>> Well above you're saying it would be performance destroying, but let's
>> pretend that's not a problem :-) Also, I have no clue about rdma, so this
>> is really just the flow we have on the gpu side.
> I see, no, this is not workable, the command flow in RDMA is not at
> all like GPU - what you are a proposing is a global 'stop the whole
> chip' Tx and Rx flows for an undetermined time. Not feasible
>
> What we can do is use ODP techniques and pause only the MR attached to
> the DMA buf with the process you outline below. This is not so hard to
> implement.

Well it boils down to only two requirements:

1. You can stop accessing the memory or addresses exported by the DMA-buf.

2. Before the next access you need to acquire a new mapping.

How you do this is perfectly up to you. E.g. you can stop everything, 
just prevent access to this DMA-buf, or just pause the users of this 
DMA-buf....
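
In dma-buf terms, requirement 2 is essentially an unmap/map cycle under the reservation lock once access has stopped. A minimal sketch, assuming a dynamic importer (one that registered a move_notify callback) and hypothetical importer state:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

/* Hypothetical: re-acquire a mapping after the exporter has moved the buffer. */
static struct sg_table *remap_after_move(struct dma_buf_attachment *attach,
					 struct sg_table *old_sgt)
{
	struct dma_buf *dmabuf = attach->dmabuf;
	struct sg_table *sgt;

	dma_resv_lock(dmabuf->resv, NULL);

	/* Requirement 1: accesses are already stopped, so drop the old map. */
	if (old_sgt)
		dma_buf_unmap_attachment(attach, old_sgt, DMA_BIDIRECTIONAL);

	/*
	 * Requirement 2: grab a new mapping.  It may now point at system
	 * memory or at a different peer-to-peer address.
	 */
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);

	dma_resv_unlock(dmabuf->resv);
	return sgt;
}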

>
>> 3. rdma driver worker gets busy to restart rx:
>> 	1. lock all dma-buf that are currently in use (dma_resv_lock).
>> 	thanks to ww_mutex deadlock avoidance this is possible
> Why all? Why not just lock the one that was invalidated to restore the
> mappings? That is some artifact of the GPU approach?

No, but you must make sure that mapping one doesn't invalidate others 
you need.

Otherwise you can end up in a nice live lock :)

> And why is this done with work queues and locking instead of a
> callback saying the buffer is valid again?

You can do this as well, but a work queue is usually easier to handle 
than a notification in an interrupt context of a foreign driver.

Regards,
Christian.

>
> Jason
Daniel Vetter July 2, 2020, 6:15 p.m. UTC | #18
On Thu, Jul 02, 2020 at 04:50:32PM +0200, Christian König wrote:
> Am 02.07.20 um 15:29 schrieb Jason Gunthorpe:
> > On Thu, Jul 02, 2020 at 03:10:00PM +0200, Daniel Vetter wrote:
> > > On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > > > > > > > All you need is the ability to stop wait for ongoing accesses to end and
> > > > > > > > make sure that new ones grab a new mapping.
> > > > > > > Swap and flush isn't a general HW ability either..
> > > > > > > 
> > > > > > > I'm unclear how this could be useful, it is guarenteed to corrupt
> > > > > > > in-progress writes?
> > > > > > > 
> > > > > > > Did you mean pause, swap and resume? That's ODP.
> > > > > > Yes, something like this. And good to know, never heard of ODP.
> > > > > Hm I thought ODP was full hw page faults at an individual page
> > > > > level,
> > > > Yes
> > > > 
> > > > > and this stop&resume is for the entire nic. Under the hood both apply
> > > > > back-pressure on the network if a transmission can't be received,
> > > > > but
> > > > NIC's don't do stop and resume, blocking the Rx pipe is very
> > > > problematic and performance destroying.
> > > > 
> > > > The strategy for something like ODP is more complex, and so far no NIC
> > > > has deployed it at any granularity larger than per-page.
> > > > 
> > > > > So since Jason really doesn't like dma_fence much I think for rdma
> > > > > synchronous it is. And it shouldn't really matter, since waiting for a
> > > > > small transaction to complete at rdma wire speed isn't really that
> > > > > long an operation.
> > > > Even if DMA fence were to somehow be involved, how would it look?
> > > Well above you're saying it would be performance destroying, but let's
> > > pretend that's not a problem :-) Also, I have no clue about rdma, so this
> > > is really just the flow we have on the gpu side.
> > I see, no, this is not workable, the command flow in RDMA is not at
> > all like GPU - what you are a proposing is a global 'stop the whole
> > chip' Tx and Rx flows for an undetermined time. Not feasible

Yeah, I said I have no clue about rdma :-)

> > What we can do is use ODP techniques and pause only the MR attached to
> > the DMA buf with the process you outline below. This is not so hard to
> > implement.
> 
> Well it boils down to only two requirements:
> 
> 1. You can stop accessing the memory or addresses exported by the DMA-buf.
> 
> 2. Before the next access you need to acquire a new mapping.
> 
> How you do this is perfectly up to you. E.g. you can stop everything, just
> prevent access to this DMA-buf, or just pause the users of this DMA-buf....

Yeah in a gpu we also don't stop the entire world, only the context that
needs the buffer. If there's other stuff to run, we do keep running that.
Usually the reason for a buffer move is that we do actually have other
stuff that needs to be run, and which needs more vram for itself, so we
might have to throw out a few buffers from vram that can also be placed in
system memory.

Note that a modern gpu has multiple engines and most have or are gaining
hw scheduling of some sort, so there's a pile of concurrent gpu context
execution going on at any moment. The days where we just push gpu work
into a fifo are (mostly) long gone.

> > > 3. rdma driver worker gets busy to restart rx:
> > > 	1. lock all dma-buf that are currently in use (dma_resv_lock).
> > > 	thanks to ww_mutex deadlock avoidance this is possible
> > Why all? Why not just lock the one that was invalidated to restore the
> > mappings? That is some artifact of the GPU approach?
> 
> No, but you must make sure that mapping one doesn't invalidate others you
> need.
> 
> Otherwise you can end up in a nice live lock :)

Also if you don't have pagefaults, but have to track busy memory at a
context level, you do need to grab all locks of all buffers you need, or
you'd race. There's nothing stopping a concurrent ->notify_move on some
other buffer you'll need otherwise, and if you try to be clever and roll
your own locking, you'll anger lockdep - your own lock will have to be
on both sides of ww_mutex or it won't work, and that deadlocks.

So ww_mutex multi-lock dance or bust.

> > And why is this done with work queues and locking instead of a
> > callback saying the buffer is valid again?
> 
> You can do this as well, but a work queue is usually easier to handle than a
> notification in an interrupt context of a foreign driver.

Yeah you can just install a dma_fence callback but
- that's hardirq context
- if you don't have per-page hw faults you need the multi-lock ww_mutex
  dance anyway to avoid races.
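
A minimal sketch of the first point, with hypothetical names: the fence callback can run from hardirq context, so all it can safely do is kick a worker that then does the sleeping dma_resv_lock dance:

#include <linux/kernel.h>
#include <linux/dma-fence.h>
#include <linux/workqueue.h>

struct remap_work {
	struct dma_fence_cb cb;
	struct work_struct work;	/* sleeps: dma_resv_lock + re-map */
};

/* May run in hardirq context when the fence signals, so no sleeping here. */
static void buffer_move_done(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct remap_work *rw = container_of(cb, struct remap_work, cb);

	schedule_work(&rw->work);
}

/*
 * Registration (from the importer's move handling), with rw->work already
 * initialised via INIT_WORK():
 *
 *	if (dma_fence_add_callback(fence, &rw->cb, buffer_move_done))
 *		schedule_work(&rw->work);	// fence was already signalled
 */
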

On top of that the exporter has no idea whether the importer still really
needs the mapping, or whether it was just kept around opportunistically.
pte setup isn't free, so by default gpus keep everything mapped. At least
for the classic gl/vk execution model, compute like rocm/amdkfd is a bit
different here.

Cheers, Daniel

> 
> Regards,
> Christian.
> 
> > 
> > Jason
>
Jason Gunthorpe July 3, 2020, 12:03 p.m. UTC | #19
On Thu, Jul 02, 2020 at 08:15:40PM +0200, Daniel Vetter wrote:
> > > > 3. rdma driver worker gets busy to restart rx:
> > > > 	1. lock all dma-buf that are currently in use (dma_resv_lock).
> > > > 	thanks to ww_mutex deadlock avoidance this is possible
> > > Why all? Why not just lock the one that was invalidated to restore the
> > > mappings? That is some artifact of the GPU approach?
> > 
> > No, but you must make sure that mapping one doesn't invalidate others you
> > need.
> > 
> > Otherwise you can end up in a nice live lock :)
> 
> Also if you don't have pagefaults, but have to track busy memory at a
> context level, you do need to grab all locks of all buffers you need, or
> you'd race. There's nothing stopping a concurrent ->notify_move on some
> other buffer you'll need otherwise, and if you try to be clever and roll
> you're own locking, you'll anger lockdep - you're own lock will have to be
> on both sides of ww_mutex or it wont work, and that deadlocks.

So you are worried about atomically building some multi buffer
transaction? I don't think this applies to RDMA which isn't going to
be transactional here.

> > > And why is this done with work queues and locking instead of a
> > > callback saying the buffer is valid again?
> > 
> > You can do this as well, but a work queue is usually easier to handle than a
> > notification in an interrupt context of a foreign driver.
> 
> Yeah you can just install a dma_fence callback but
> - that's hardirq context
> - if you don't have per-page hw faults you need the multi-lock ww_mutex
>   dance anyway to avoid races.

It is still best to avoid the per-page faults and preload the new
mapping once it is ready.

Jason
Daniel Vetter July 3, 2020, 12:52 p.m. UTC | #20
On Fri, Jul 3, 2020 at 2:03 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Jul 02, 2020 at 08:15:40PM +0200, Daniel Vetter wrote:
> > > > > 3. rdma driver worker gets busy to restart rx:
> > > > >         1. lock all dma-buf that are currently in use (dma_resv_lock).
> > > > >         thanks to ww_mutex deadlock avoidance this is possible
> > > > Why all? Why not just lock the one that was invalidated to restore the
> > > > mappings? That is some artifact of the GPU approach?
> > >
> > > No, but you must make sure that mapping one doesn't invalidate others you
> > > need.
> > >
> > > Otherwise you can end up in a nice live lock :)
> >
> > Also if you don't have pagefaults, but have to track busy memory at a
> > context level, you do need to grab all locks of all buffers you need, or
> > you'd race. There's nothing stopping a concurrent ->notify_move on some
> > other buffer you'll need otherwise, and if you try to be clever and roll
> > you're own locking, you'll anger lockdep - you're own lock will have to be
> > on both sides of ww_mutex or it wont work, and that deadlocks.
>
> So you are worried about atomically building some multi buffer
> transaction? I don't think this applies to RDMA which isn't going to
> be transcational here..

So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might
happen whenever. Kernel driver doesn't have much if any insight into
when memory isn't needed anymore. I think in the rdma world that's
called registering memory, but not sure.

- for hw with hw faults you can pull in the memory when it's needed,
and if concurrently another cpu is taking away pages and invalidating
rdma hw mappings, then that's no problem.

So far so good, 0 need for atomic transactions anything.

But the answer I gave here is for when you don't have per-page hw
faulting on the rdma nic, but something that works at a much larger
level. For a gpu it would be a compute context, no idea what the
equivalent for rdma is. This could go up to and including the entire
nic stalling all rx with link level back pressure, but not
necessarily.

Once you go above a single page, or well, at least a single dma-buf
object, you need some way to know when you have all buffers and memory
mappings present again, because you can't restart with only partial
memory.

Now if the rdma memory programming is more like traditional
networking, where you have a single rx or tx buffer, then yeah you
don't need fancy multi-buffer locking. Like I said I have no idea how
many of the buffers you need to restart the rdma stuff for hw which
doesn't have per-page hw faulting.

> > > > And why is this done with work queues and locking instead of a
> > > > callback saying the buffer is valid again?
> > >
> > > You can do this as well, but a work queue is usually easier to handle than a
> > > notification in an interrupt context of a foreign driver.
> >
> > Yeah you can just install a dma_fence callback but
> > - that's hardirq context
> > - if you don't have per-page hw faults you need the multi-lock ww_mutex
> >   dance anyway to avoid races.
>
> It is still best to avoid the per-page faults and preload the new
> mapping once it is ready.

Sure, but I think that's entirely orthogonal.

Also even if you don't need the multi-buffer dance (either because of hw
faults, or because the rdma rx/tx can be stopped at least at a
per-buffer level through some other means) you still need the
dma_resv_lock to serialize with the exporter. So it needs a sleeping
context either way. Hence some worker is needed.
-Daniel
Jason Gunthorpe July 3, 2020, 1:14 p.m. UTC | #21
On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:

> So maybe I'm just totally confused about the rdma model. I thought:
> - you bind a pile of memory for various transactions, that might
> happen whenever. Kernel driver doesn't have much if any insight into
> when memory isn't needed anymore. I think in the rdma world that's
> called registering memory, but not sure.

Sure, but once registered the memory is able to be used at any moment with
no visibility from the kernel.

Unlike GPU the transactions that trigger memory access do not go
through the kernel - so there is no ability to interrupt a command
flow and fiddle with mappings.

Jason
Christian König July 3, 2020, 1:21 p.m. UTC | #22
Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
> On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
>
>> So maybe I'm just totally confused about the rdma model. I thought:
>> - you bind a pile of memory for various transactions, that might
>> happen whenever. Kernel driver doesn't have much if any insight into
>> when memory isn't needed anymore. I think in the rdma world that's
>> called registering memory, but not sure.
> Sure, but once registered the memory is able to be used at any moment with
> no visibilty from the kernel.
>
> Unlike GPU the transactions that trigger memory access do not go
> through the kernel - so there is no ability to interrupt a command
> flow and fiddle with mappings.

This is the same for GPUs with user space queues as well.

But we can still tell, for a given process, that it is using a
DMA-buf which has been moved out and so can't run any more unless the DMA-buf
is accessible again.

In other words you somehow need to make sure that the hardware is not 
accessing a piece of memory any more when you want to move it.

Christian.

>
> Jason
Xiong, Jianxin July 7, 2020, 9:58 p.m. UTC | #23
> -----Original Message-----
> From: Christian König <christian.koenig@amd.com>
> Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
> > On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
> >
> >> So maybe I'm just totally confused about the rdma model. I thought:
> >> - you bind a pile of memory for various transactions, that might
> >> happen whenever. Kernel driver doesn't have much if any insight into
> >> when memory isn't needed anymore. I think in the rdma world that's
> >> called registering memory, but not sure.
> > Sure, but once registered the memory is able to be used at any moment
> > with no visibilty from the kernel.
> >
> > Unlike GPU the transactions that trigger memory access do not go
> > through the kernel - so there is no ability to interrupt a command
> > flow and fiddle with mappings.
> 
> This is the same for GPUs with user space queues as well.
> 
> But we can still say for a process if that this process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is
> accessible again.
> 
> In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
> 

While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or with any other conditions). 

So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.

Jianxin

> Christian.
> 
> >
> > Jason
Christian König July 8, 2020, 9:38 a.m. UTC | #24
Am 07.07.20 um 23:58 schrieb Xiong, Jianxin:
>> -----Original Message-----
>> From: Christian König <christian.koenig@amd.com>
>> Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
>>> On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
>>>
>>>> So maybe I'm just totally confused about the rdma model. I thought:
>>>> - you bind a pile of memory for various transactions, that might
>>>> happen whenever. Kernel driver doesn't have much if any insight into
>>>> when memory isn't needed anymore. I think in the rdma world that's
>>>> called registering memory, but not sure.
>>> Sure, but once registered the memory is able to be used at any moment
>>> with no visibilty from the kernel.
>>>
>>> Unlike GPU the transactions that trigger memory access do not go
>>> through the kernel - so there is no ability to interrupt a command
>>> flow and fiddle with mappings.
>> This is the same for GPUs with user space queues as well.
>>
>> But we can still say for a process if that this process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is
>> accessible again.
>>
>> In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
>>
> While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or with any other conditions).
>
> So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.

And that's exactly the reason why I introduced explicit pin()/unpin() 
functions into the DMA-buf API: 
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L811
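
For context, a minimal sketch of how an importer could use that API (the helpers are hypothetical; dma_buf_pin()/dma_buf_unpin() must be called with the reservation lock held, and the exporter is free to refuse or to migrate the buffer before pinning):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

/* Hypothetical helpers around the explicit pin API. */
static int importer_pin(struct dma_buf_attachment *attach)
{
	struct dma_buf *dmabuf = attach->dmabuf;
	int ret;

	ret = dma_resv_lock_interruptible(dmabuf->resv, NULL);
	if (ret)
		return ret;

	ret = dma_buf_pin(attach);	/* the exporter decides where it pins */

	dma_resv_unlock(dmabuf->resv);
	return ret;
}

static void importer_unpin(struct dma_buf_attachment *attach)
{
	struct dma_buf *dmabuf = attach->dmabuf;

	dma_resv_lock(dmabuf->resv, NULL);
	dma_buf_unpin(attach);
	dma_resv_unlock(dmabuf->resv);
}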

It's just that at least our device drivers currently prevent P2P with
pinned DMA-bufs for two main reasons:

a) To prevent denial-of-service attacks, because P2P BARs are a rather rare
resource.

b) To prevent failures in configurations where P2P is not always possible
between all devices which want to access a buffer.

Regards,
Christian.

>
> Jianxin
>
>> Christian.
>>
>>> Jason
Daniel Vetter July 8, 2020, 9:49 a.m. UTC | #25
On Wed, Jul 08, 2020 at 11:38:31AM +0200, Christian König wrote:
> Am 07.07.20 um 23:58 schrieb Xiong, Jianxin:
> > > -----Original Message-----
> > > From: Christian König <christian.koenig@amd.com>
> > > Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
> > > > On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
> > > > 
> > > > > So maybe I'm just totally confused about the rdma model. I thought:
> > > > > - you bind a pile of memory for various transactions, that might
> > > > > happen whenever. Kernel driver doesn't have much if any insight into
> > > > > when memory isn't needed anymore. I think in the rdma world that's
> > > > > called registering memory, but not sure.
> > > > Sure, but once registered the memory is able to be used at any moment
> > > > with no visibilty from the kernel.
> > > > 
> > > > Unlike GPU the transactions that trigger memory access do not go
> > > > through the kernel - so there is no ability to interrupt a command
> > > > flow and fiddle with mappings.
> > > This is the same for GPUs with user space queues as well.
> > > 
> > > But we can still say for a process if that this process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is
> > > accessible again.
> > > 
> > > In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
> > > 
> > While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or with any other conditions).
> > 
> > So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.
> 
> And that's exactly the reason why I introduced explicit pin()/unpin()
> functions into the DMA-buf API:
> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L811
> 
> It's just that at least our devices drivers currently prevent P2P with
> pinned DMA-buf's for two main reasons:
> 
> a) To prevent deny of service attacks because P2P BARs are a rather rare
> resource.
> 
> b) To prevent failures in configuration where P2P is not always possible
> between all devices which want to access a buffer.

So the above is more or less the question in the cover letter (which
didn't make it to dri-devel). Can we somehow throw that limitation out, or
is that simply not a good idea?

Simply moving buffers to system memory when they're pinned does simplify a
lot of headaches. For a specific custom built system we can avoid that
maybe, but I think upstream is kinda a different thing.

Cheers, Daniel

> Regards,
> Christian.
> 
> > 
> > Jianxin
> > 
> > > Christian.
> > > 
> > > > Jason
>
Christian König July 8, 2020, 2:20 p.m. UTC | #26
Am 08.07.20 um 11:49 schrieb Daniel Vetter:
> On Wed, Jul 08, 2020 at 11:38:31AM +0200, Christian König wrote:
>> Am 07.07.20 um 23:58 schrieb Xiong, Jianxin:
>>>> -----Original Message-----
>>>> From: Christian König <christian.koenig@amd.com>
>>>> Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
>>>>> On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
>>>>>
>>>>>> So maybe I'm just totally confused about the rdma model. I thought:
>>>>>> - you bind a pile of memory for various transactions, that might
>>>>>> happen whenever. Kernel driver doesn't have much if any insight into
>>>>>> when memory isn't needed anymore. I think in the rdma world that's
>>>>>> called registering memory, but not sure.
>>>>> Sure, but once registered the memory is able to be used at any moment
>>>>> with no visibilty from the kernel.
>>>>>
>>>>> Unlike GPU the transactions that trigger memory access do not go
>>>>> through the kernel - so there is no ability to interrupt a command
>>>>> flow and fiddle with mappings.
>>>> This is the same for GPUs with user space queues as well.
>>>>
>>>> But we can still say for a process if that this process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is
>>>> accessible again.
>>>>
>>>> In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
>>>>
>>> While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or with any other conditions).
>>>
>>> So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.
>> And that's exactly the reason why I introduced explicit pin()/unpin()
>> functions into the DMA-buf API:
>> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L811
>>
>> It's just that at least our devices drivers currently prevent P2P with
>> pinned DMA-buf's for two main reasons:
>>
>> a) To prevent deny of service attacks because P2P BARs are a rather rare
>> resource.
>>
>> b) To prevent failures in configuration where P2P is not always possible
>> between all devices which want to access a buffer.
> So the above is more or less the question in the cover letter (which
> didn't make it to dri-devel). Can we somehow throw that limitation out, or
> is that simply not a good idea?

At least for the AMD graphics drivers that's most certainly not a good idea.

We do have a use case where buffers need to be in system memory because
P2P doesn't work.

And by pinning them to VRAM you can create a really nice denial-of-service
attack against the X system.

> Simply moving buffers to system memory when they're pinned does simplify a
> lot of headaches. For a specific custom built system we can avoid that
> maybe, but I think upstream is kinda a different thing.

Yes, agree completely on that. Customers who are willing to take the
risk can easily do this themselves.

But that is probably not something we should do for upstream.

Regards,
Christian.

>
> Cheers, Daniel
>
>> Regards,
>> Christian.
>>
>>> Jianxin
>>>
>>>> Christian.
>>>>
>>>>> Jason
Alex Deucher July 8, 2020, 2:33 p.m. UTC | #27
On Wed, Jul 8, 2020 at 10:20 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 08.07.20 um 11:49 schrieb Daniel Vetter:
> > On Wed, Jul 08, 2020 at 11:38:31AM +0200, Christian König wrote:
> >> Am 07.07.20 um 23:58 schrieb Xiong, Jianxin:
> >>>> -----Original Message-----
> >>>> From: Christian König <christian.koenig@amd.com>
> >>>> Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
> >>>>> On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
> >>>>>
> >>>>>> So maybe I'm just totally confused about the rdma model. I thought:
> >>>>>> - you bind a pile of memory for various transactions, that might
> >>>>>> happen whenever. Kernel driver doesn't have much if any insight into
> >>>>>> when memory isn't needed anymore. I think in the rdma world that's
> >>>>>> called registering memory, but not sure.
> >>>>> Sure, but once registered the memory is able to be used at any moment
> >>>>> with no visibilty from the kernel.
> >>>>>
> >>>>> Unlike GPU the transactions that trigger memory access do not go
> >>>>> through the kernel - so there is no ability to interrupt a command
> >>>>> flow and fiddle with mappings.
> >>>> This is the same for GPUs with user space queues as well.
> >>>>
> >>>> But we can still say for a process if that this process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is
> >>>> accessible again.
> >>>>
> >>>> In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
> >>>>
> >>> While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or with any other conditions).
> >>>
> >>> So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.
> >> And that's exactly the reason why I introduced explicit pin()/unpin()
> >> functions into the DMA-buf API:
> >> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L811
> >>
> >> It's just that at least our devices drivers currently prevent P2P with
> >> pinned DMA-buf's for two main reasons:
> >>
> >> a) To prevent deny of service attacks because P2P BARs are a rather rare
> >> resource.
> >>
> >> b) To prevent failures in configuration where P2P is not always possible
> >> between all devices which want to access a buffer.
> > So the above is more or less the question in the cover letter (which
> > didn't make it to dri-devel). Can we somehow throw that limitation out, or
> > is that simply not a good idea?
>
> At least for the AMD graphics drivers that's most certain not a good idea.
>
> We do have an use case where buffers need to be in system memory because
> P2P doesn't work.
>
> And by pinning them to VRAM you can create a really nice deny of service
> attack against the X system.
>

On the other hand, on modern platforms with large or resizable BARs,
you may end up with systems with more vram than system ram.

Alex


> > Simply moving buffers to system memory when they're pinned does simplify a
> > lot of headaches. For a specific custom built system we can avoid that
> > maybe, but I think upstream is kinda a different thing.
>
> Yes, agree completely on that. Customers which are willing to take the
> risk can easily do this themselves.
>
> But that is not something we should probably do for upstream.
>
> Regards,
> Christian.
>
> >
> > Cheers, Daniel
> >
> >> Regards,
> >> Christian.
> >>
> >>> Jianxin
> >>>
> >>>> Christian.
> >>>>
> >>>>> Jason
>