[v1,0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages

Message ID 20241015152348.3055360-1-ymaman@nvidia.com (mailing list archive)

Yonatan Maman Oct. 15, 2024, 3:23 p.m. UTC
From: Yonatan Maman <Ymaman@Nvidia.com>

This patch series aims to enable Peer-to-Peer (P2P) DMA access in
GPU-centric applications that utilize RDMA and private device pages. This
enhancement is crucial for minimizing data transfer overhead by allowing
the GPU to directly expose device private page data to devices such as
NICs, eliminating the need to traverse system RAM, which is the native
method for exposing device private page data.

To fully support Peer-to-Peer for device private pages, the following
changes are proposed:

`Memory Management (MM)`
 * Extend struct pagemap_ops with a P2P page operation, allowing the GPU
driver to map its device private pages directly for P2P DMA.
 * Extend hmm_range_fault to return device private pages for P2P mapping
(instead of faulting the data back to system RAM).

`IB Drivers`
Add a TRY_P2P_REQ flag to the hmm_range_fault call: the flag indicates that
a P2P mapping is acceptable to the caller, enabling IB drivers to handle
P2P DMA requests efficiently.

`Nouveau driver`
Add support for the Nouveau p2p_page callback function: this integrates P2P
DMA support into the Nouveau driver, allowing it to handle P2P page
operations.

`MLX5 Driver`
Optimize PCI Peer-to-Peer for device private pages by enabling the Address
Translation Service (ATS) for ODP memory.

Yonatan Maman (4):
  mm/hmm: HMM API for P2P DMA to device zone pages
  nouveau/dmem: HMM P2P DMA for private dev pages
  IB/core: P2P DMA for device private pages
  RDMA/mlx5: Enabling ATS for ODP memory

 drivers/gpu/drm/nouveau/nouveau_dmem.c | 117 ++++++++++++++++++++++++-
 drivers/infiniband/core/umem_odp.c     |   2 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h   |   6 +-
 include/linux/hmm.h                    |   2 +
 include/linux/memremap.h               |   7 ++
 mm/hmm.c                               |  28 ++++++
 6 files changed, 156 insertions(+), 6 deletions(-)

Comments

Yonatan Maman Oct. 16, 2024, 3:16 p.m. UTC | #1
On 16/10/2024 7:23, Christoph Hellwig wrote:
> On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>
>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>> GPU-centric applications that utilize RDMA and private device pages. This
>> enhancement is crucial for minimizing data transfer overhead by allowing
>> the GPU to directly expose device private page data to devices such as
>> NICs, eliminating the need to traverse system RAM, which is the native
>> method for exposing device private page data.
> 
> Please tone down your marketing language and explain your factual
> changes.  If you make performance claims back them by numbers.
> 

Got it, thanks! I'll fix that. Regarding performance, we're achieving over 
10x higher bandwidth and 10x lower latency with perftest-rdma, 
particularly for workloads with a high rate of GPU memory access.
Alistair Popple Oct. 16, 2024, 10:22 p.m. UTC | #2
Yonatan Maman <ymaman@nvidia.com> writes:

> On 16/10/2024 7:23, Christoph Hellwig wrote:
>> On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
>>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>>
>>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>>> GPU-centric applications that utilize RDMA and private device pages. This
>>> enhancement is crucial for minimizing data transfer overhead by allowing
>>> the GPU to directly expose device private page data to devices such as
>>> NICs, eliminating the need to traverse system RAM, which is the native
>>> method for exposing device private page data.
>> Please tone down your marketing language and explain your factual
>> changes.  If you make performance claims back them by numbers.
>> 
>
> Got it, thanks! I'll fix that. Regarding performance, we’re achieving
> over 10x higher bandwidth and 10x lower latency using perftest-rdma,
> especially (with a high rate of GPU memory access).

The performance claims still sound a bit vague. Please make sure you
include actual perftest-rdma performance numbers from before and after
applying this series when you repost.
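For the requested before/after numbers, a typical measurement could look like the following perftest invocations. Flag support (in particular GPU-memory staging via --use_cuda) varies between perftest builds, so treat these as a sketch rather than exact commands; they also require two RDMA-connected hosts, so they are shown as a command fragment only.

```shell
# Bandwidth, buffers staged in GPU memory (device 0); run the same
# command on the client with the server's IP appended:
ib_write_bw -d mlx5_0 --use_cuda=0 --report_gbits            # server
ib_write_bw -d mlx5_0 --use_cuda=0 --report_gbits <server-ip> # client

# Latency, same setup:
ib_write_lat -d mlx5_0 --use_cuda=0                          # server
ib_write_lat -d mlx5_0 --use_cuda=0 <server-ip>              # client
```

Running the same pair of tests before and after the series (and with host-memory buffers as a baseline) would give the comparison Alistair asked for.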
Zhu Yanjun Oct. 18, 2024, 7:26 a.m. UTC | #3
On 2024/10/16 17:16, Yonatan Maman wrote:
> 
> 
> On 16/10/2024 7:23, Christoph Hellwig wrote:
>> On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
>>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>>
>>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>>> GPU-centric applications that utilize RDMA and private device pages. 
>>> This
>>> enhancement is crucial for minimizing data transfer overhead by allowing
>>> the GPU to directly expose device private page data to devices such as
>>> NICs, eliminating the need to traverse system RAM, which is the native
>>> method for exposing device private page data.
>>
>> Please tone down your marketing language and explain your factual
>> changes.  If you make performance claims back them by numbers.
>>
> 
> Got it, thanks! I'll fix that. Regarding performance, we’re achieving 
> over 10x higher bandwidth and 10x lower latency using perftest-rdma, 
> especially (with a high rate of GPU memory access).

If I understand this patch series correctly, it is based on ODP (On-Demand 
Paging). A non-ODP approach also exists; per the following links, it is 
implemented for efa, irdma and mlx5.
1. iRDMA
https://lore.kernel.org/all/20230217011425.498847-1-yanjun.zhu@intel.com/

2. efa
https://lore.kernel.org/lkml/20211007114018.GD2688930@ziepe.ca/t/

3. mlx5
https://lore.kernel.org/all/1608067636-98073-5-git-send-email-jianxin.xiong@intel.com/

Since both methods are implemented on mlx5, have you compared the test 
results of the two methods on mlx5?

The most important results should be latency and bandwidth. Please let 
us know the test results.

Thanks a lot.
Zhu Yanjun
Yonatan Maman Oct. 20, 2024, 3:26 p.m. UTC | #4
On 18/10/2024 10:26, Zhu Yanjun wrote:
> 
> On 2024/10/16 17:16, Yonatan Maman wrote:
>>
>>
>> On 16/10/2024 7:23, Christoph Hellwig wrote:
>>> On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
>>>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>>>
>>>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>>>> GPU-centric applications that utilize RDMA and private device pages.
>>>> This
>>>> enhancement is crucial for minimizing data transfer overhead by 
>>>> allowing
>>>> the GPU to directly expose device private page data to devices such as
>>>> NICs, eliminating the need to traverse system RAM, which is the native
>>>> method for exposing device private page data.
>>>
>>> Please tone down your marketing language and explain your factual
>>> changes.  If you make performance claims back them by numbers.
>>>
>>
>> Got it, thanks! I'll fix that. Regarding performance, we’re achieving
>> over 10x higher bandwidth and 10x lower latency using perftest-rdma,
>> especially (with a high rate of GPU memory access).
> 
> If I got this patch series correctly, this is based on ODP (On Demand
> Paging). And a way also exists which is based on non-ODP. From the
> following links, this way is implemented on efa, irdma and mlx5.
> 1. iRDMA
> https://lore.kernel.org/all/20230217011425.498847-1-yanjun.zhu@intel.com/
> 
> 2. efa
> https://lore.kernel.org/lkml/20211007114018.GD2688930@ziepe.ca/t/
> 
> 3. mlx5
> https://lore.kernel.org/all/1608067636-98073-5-git-send-email-jianxin.xiong@intel.com/
> 
> Because these 2 methods are both implemented on mlx5, have you compared
> the test results with the 2 methods on mlx5?
> 
> The most important results should be latency and bandwidth. Please let
> us know the test results.
> 
> Thanks a lot.
> Zhu Yanjun
> 

This patch set aims to support GPU Direct RDMA for HMM ODP memory. 
Compared to the dma-buf method, we achieve the same performance (bandwidth 
and latency) for GPU-intensive test cases (no CPU accesses during the test).