
[RFC] Make use of non-dynamic dmabuf in RDMA

Message ID 20210818074352.29950-1-galpress@amazon.com (mailing list archive)
State RFC
Series: [RFC] Make use of non-dynamic dmabuf in RDMA

Commit Message

Gal Pressman Aug. 18, 2021, 7:43 a.m. UTC
Hey all,

Currently, the RDMA subsystem can only work with dynamic dmabuf
attachments, which requires the RDMA device to support on-demand paging
(ODP), a capability most devices lack (currently only mlx5 supports it).

While the dynamic requirement makes sense for certain GPUs, some devices
(such as habanalabs) have device memory that is always "pinned" and do
not need/use the move_notify operation.

The motivation of this RFC is to use habanalabs as the dmabuf exporter,
and EFA as the importer to allow for peer2peer access through libibverbs.

This draft patch changes the dmabuf driver to differentiate between
static/dynamic attachments by looking at the move_notify op instead of
the importer_ops struct, and allowing the peer2peer flag to be enabled
in case of a static exporter.
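
For illustration only (this is not part of the patch), the importer side
could then attach with importer ops that simply omit move_notify; the
efa_* names below are made up:

	static const struct dma_buf_attach_ops efa_dmabuf_attach_ops = {
		.allow_peer2peer = true,
		/* no .move_notify: with this patch the attachment is static */
	};

	static struct ib_umem_dmabuf *efa_umem_dmabuf_get(struct ib_device *ibdev,
							  unsigned long offset,
							  size_t size, int fd,
							  int access)
	{
		/* With this RFC, ib_umem_dmabuf_get() no longer insists on move_notify. */
		return ib_umem_dmabuf_get(ibdev, offset, size, fd, access,
					  &efa_dmabuf_attach_ops);
	}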

Thanks

Signed-off-by: Gal Pressman <galpress@amazon.com>
---
 drivers/dma-buf/dma-buf.c             | 5 +++--
 drivers/infiniband/core/umem_dmabuf.c | 2 +-
 include/linux/dma-buf.h               | 2 +-
 3 files changed, 5 insertions(+), 4 deletions(-)

Comments

Christian König Aug. 18, 2021, 8 a.m. UTC | #1
Am 18.08.21 um 09:43 schrieb Gal Pressman:
> Hey all,
>
> Currently, the RDMA subsystem can only work with dynamic dmabuf
> attachments, which requires the RDMA device to support on-demand-paging
> (ODP) which is not common on most devices (only supported by mlx5).
>
> While the dynamic requirement makes sense for certain GPUs, some devices
> (such as habanalabs) have device memory that is always "pinned" and do
> not need/use the move_notify operation.
>
> The motivation of this RFC is to use habanalabs as the dmabuf exporter,
> and EFA as the importer to allow for peer2peer access through libibverbs.
>
> This draft patch changes the dmabuf driver to differentiate between
> static/dynamic attachments by looking at the move_notify op instead of
> the importer_ops struct, and allowing the peer2peer flag to be enabled
> in case of a static exporter.

Well, NAK to the general approach; this can be solved much more easily.

If you can't support dynamic moves while using the buffer, then just pin
all buffers during import/export.

This avoids the move notification, and the framework/exporter can still
correctly account for pinned buffers.

But please note that at least amdgpu never uses P2P support for pinned
buffers, since we want to avoid unmovable buffers cluttering video
memory and creating conflicts with V4L and scanout.

If you don't have such concerns in habanalabs then you can implement the 
pinning there while keeping P2P still enabled.
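
A minimal sketch of what that could look like on the exporter side,
assuming device memory that is never moved, so pinning migrates nothing
and P2P can stay enabled (the hl_* names are made up, not the actual
habanalabs code):

	static int hl_dmabuf_pin(struct dma_buf_attachment *attach)
	{
		/* Device memory never moves, so pinning is pure bookkeeping. */
		return 0;
	}

	static void hl_dmabuf_unpin(struct dma_buf_attachment *attach)
	{
	}

	static const struct dma_buf_ops hl_dmabuf_ops = {
		.pin	= hl_dmabuf_pin,
		.unpin	= hl_dmabuf_unpin,
		/* .attach, .map_dma_buf, .unmap_dma_buf, .release, ... omitted */
	};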

Regards,
Christian.

>
> Thanks
>
> Signed-off-by: Gal Pressman <galpress@amazon.com>
> ---
>   drivers/dma-buf/dma-buf.c             | 5 +++--
>   drivers/infiniband/core/umem_dmabuf.c | 2 +-
>   include/linux/dma-buf.h               | 2 +-
>   3 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 511fe0d217a0..e3faad8f492c 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -727,7 +727,8 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
>   	if (WARN_ON(!dmabuf || !dev))
>   		return ERR_PTR(-EINVAL);
>   
> -	if (WARN_ON(importer_ops && !importer_ops->move_notify))
> +	if (WARN_ON(importer_ops && !importer_ops->move_notify &&
> +		    dma_buf_is_dynamic(attach->dmabuf)))
>   		return ERR_PTR(-EINVAL);
>   
>   	attach = kzalloc(sizeof(*attach), GFP_KERNEL);
> @@ -1048,7 +1049,7 @@ void dma_buf_move_notify(struct dma_buf *dmabuf)
>   	dma_resv_assert_held(dmabuf->resv);
>   
>   	list_for_each_entry(attach, &dmabuf->attachments, node)
> -		if (attach->importer_ops)
> +		if (attach->importer_ops && attach->importer_ops->move_notify)
>   			attach->importer_ops->move_notify(attach);
>   }
>   EXPORT_SYMBOL_GPL(dma_buf_move_notify);
> diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c
> index c6e875619fac..c502ae828bd3 100644
> --- a/drivers/infiniband/core/umem_dmabuf.c
> +++ b/drivers/infiniband/core/umem_dmabuf.c
> @@ -118,7 +118,7 @@ struct ib_umem_dmabuf *ib_umem_dmabuf_get(struct ib_device *device,
>   	if (check_add_overflow(offset, (unsigned long)size, &end))
>   		return ret;
>   
> -	if (unlikely(!ops || !ops->move_notify))
> +	if (unlikely(!ops))
>   		return ret;
>   
>   	dmabuf = dma_buf_get(fd);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index efdc56b9d95f..4b2e99012cb1 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -473,7 +473,7 @@ static inline bool dma_buf_is_dynamic(struct dma_buf *dmabuf)
>   static inline bool
>   dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
>   {
> -	return !!attach->importer_ops;
> +	return !!attach->importer_ops->move_notify;
>   }
>   
>   struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
Gal Pressman Aug. 18, 2021, 8:37 a.m. UTC | #2
On 18/08/2021 11:00, Christian König wrote:
> Am 18.08.21 um 09:43 schrieb Gal Pressman:
>> Hey all,
>>
>> Currently, the RDMA subsystem can only work with dynamic dmabuf
>> attachments, which requires the RDMA device to support on-demand-paging
>> (ODP) which is not common on most devices (only supported by mlx5).
>>
>> While the dynamic requirement makes sense for certain GPUs, some devices
>> (such as habanalabs) have device memory that is always "pinned" and do
>> not need/use the move_notify operation.
>>
>> The motivation of this RFC is to use habanalabs as the dmabuf exporter,
>> and EFA as the importer to allow for peer2peer access through libibverbs.
>>
>> This draft patch changes the dmabuf driver to differentiate between
>> static/dynamic attachments by looking at the move_notify op instead of
>> the importer_ops struct, and allowing the peer2peer flag to be enabled
>> in case of a static exporter.
> 
> Well NAK to the general approach, this can be solved much easier.
> 
> If you can't support dynamic moves while using the buffer then just pin all
> buffers during import/export.
> 
> This avoids the move notification and the framework/exporter can still correctly
> account for pinned buffers.
> 
> But please note that at least amdgpu never uses P2P support for pinned buffers
> since we want to avoid that unmoveable buffers clutter video memory and create
> conflicts with V4L and scanout.
> 
> If you don't have such concerns in habanalabs then you can implement the pinning
> there while keeping P2P still enabled.
Thanks Christian!

Are you suggesting to pass an empty move_notify callback instead of passing NULL?
Also, doesn't the pin operation move the memory from the device to host memory?
Daniel Vetter Aug. 18, 2021, 9:34 a.m. UTC | #3
On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <galpress@amazon.com> wrote:
>
> Hey all,
>
> Currently, the RDMA subsystem can only work with dynamic dmabuf
> attachments, which requires the RDMA device to support on-demand-paging
> (ODP) which is not common on most devices (only supported by mlx5).
>
> While the dynamic requirement makes sense for certain GPUs, some devices
> (such as habanalabs) have device memory that is always "pinned" and do
> not need/use the move_notify operation.
>
> The motivation of this RFC is to use habanalabs as the dmabuf exporter,
> and EFA as the importer to allow for peer2peer access through libibverbs.
>
> This draft patch changes the dmabuf driver to differentiate between
> static/dynamic attachments by looking at the move_notify op instead of
> the importer_ops struct, and allowing the peer2peer flag to be enabled
> in case of a static exporter.
>
> Thanks
>
> Signed-off-by: Gal Pressman <galpress@amazon.com>

Given that habanalabs dma-buf support is very firmly in limbo (at
least it's not yet in linux-next or anywhere else), I think you want to
solve that problem first before we tackle the additional issue of
making p2p work without dynamic dma-buf. Without that, it just doesn't
really make a lot of sense to talk about solutions here.
-Daniel

> ---
>  drivers/dma-buf/dma-buf.c             | 5 +++--
>  drivers/infiniband/core/umem_dmabuf.c | 2 +-
>  include/linux/dma-buf.h               | 2 +-
>  3 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 511fe0d217a0..e3faad8f492c 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -727,7 +727,8 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
>         if (WARN_ON(!dmabuf || !dev))
>                 return ERR_PTR(-EINVAL);
>
> -       if (WARN_ON(importer_ops && !importer_ops->move_notify))
> +       if (WARN_ON(importer_ops && !importer_ops->move_notify &&
> +                   dma_buf_is_dynamic(attach->dmabuf)))
>                 return ERR_PTR(-EINVAL);
>
>         attach = kzalloc(sizeof(*attach), GFP_KERNEL);
> @@ -1048,7 +1049,7 @@ void dma_buf_move_notify(struct dma_buf *dmabuf)
>         dma_resv_assert_held(dmabuf->resv);
>
>         list_for_each_entry(attach, &dmabuf->attachments, node)
> -               if (attach->importer_ops)
> +               if (attach->importer_ops && attach->importer_ops->move_notify)
>                         attach->importer_ops->move_notify(attach);
>  }
>  EXPORT_SYMBOL_GPL(dma_buf_move_notify);
> diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c
> index c6e875619fac..c502ae828bd3 100644
> --- a/drivers/infiniband/core/umem_dmabuf.c
> +++ b/drivers/infiniband/core/umem_dmabuf.c
> @@ -118,7 +118,7 @@ struct ib_umem_dmabuf *ib_umem_dmabuf_get(struct ib_device *device,
>         if (check_add_overflow(offset, (unsigned long)size, &end))
>                 return ret;
>
> -       if (unlikely(!ops || !ops->move_notify))
> +       if (unlikely(!ops))
>                 return ret;
>
>         dmabuf = dma_buf_get(fd);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index efdc56b9d95f..4b2e99012cb1 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -473,7 +473,7 @@ static inline bool dma_buf_is_dynamic(struct dma_buf *dmabuf)
>  static inline bool
>  dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
>  {
> -       return !!attach->importer_ops;
> +       return !!attach->importer_ops->move_notify;
>  }
>
>  struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
> --
> 2.32.0
>
Jason Gunthorpe Aug. 19, 2021, 11:06 p.m. UTC | #4
On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote:
> On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <galpress@amazon.com> wrote:
> >
> > Hey all,
> >
> > Currently, the RDMA subsystem can only work with dynamic dmabuf
> > attachments, which requires the RDMA device to support on-demand-paging
> > (ODP) which is not common on most devices (only supported by mlx5).
> >
> > While the dynamic requirement makes sense for certain GPUs, some devices
> > (such as habanalabs) have device memory that is always "pinned" and do
> > not need/use the move_notify operation.
> >
> > The motivation of this RFC is to use habanalabs as the dmabuf exporter,
> > and EFA as the importer to allow for peer2peer access through libibverbs.
> >
> > This draft patch changes the dmabuf driver to differentiate between
> > static/dynamic attachments by looking at the move_notify op instead of
> > the importer_ops struct, and allowing the peer2peer flag to be enabled
> > in case of a static exporter.
> >
> > Thanks
> >
> > Signed-off-by: Gal Pressman <galpress@amazon.com>
> 
> Given that habanalabs dma-buf support is very firmly in limbo (at
> least it's not yet in linux-next or anywhere else) I think you want to
> solve that problem first before we tackle the additional issue of
> making p2p work without dynamic dma-buf. Without that it just doesn't
> make a lot of sense really to talk about solutions here.

I have been thinking about adding a dmabuf exporter to VFIO, for
basically the same reason habana labs wants to do it.

In that situation we'd want to see an approach similar to this as well,
to have broad usability.

The GPU drivers also want this for certain sophisticated scenarios
with RDMA; the in-tree drivers just haven't quite got there yet.

So, I think it is worthwhile to start thinking about this regardless
of habana labs.

Jason
Daniel Vetter Aug. 20, 2021, 7:25 a.m. UTC | #5
On Fri, Aug 20, 2021 at 1:06 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote:
> > On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <galpress@amazon.com> wrote:
> > >
> > > Hey all,
> > >
> > > Currently, the RDMA subsystem can only work with dynamic dmabuf
> > > attachments, which requires the RDMA device to support on-demand-paging
> > > (ODP) which is not common on most devices (only supported by mlx5).
> > >
> > > While the dynamic requirement makes sense for certain GPUs, some devices
> > > (such as habanalabs) have device memory that is always "pinned" and do
> > > not need/use the move_notify operation.
> > >
> > > The motivation of this RFC is to use habanalabs as the dmabuf exporter,
> > > and EFA as the importer to allow for peer2peer access through libibverbs.
> > >
> > > This draft patch changes the dmabuf driver to differentiate between
> > > static/dynamic attachments by looking at the move_notify op instead of
> > > the importer_ops struct, and allowing the peer2peer flag to be enabled
> > > in case of a static exporter.
> > >
> > > Thanks
> > >
> > > Signed-off-by: Gal Pressman <galpress@amazon.com>
> >
> > Given that habanalabs dma-buf support is very firmly in limbo (at
> > least it's not yet in linux-next or anywhere else) I think you want to
> > solve that problem first before we tackle the additional issue of
> > making p2p work without dynamic dma-buf. Without that it just doesn't
> > make a lot of sense really to talk about solutions here.
>
> I have been thinking about adding a dmabuf exporter to VFIO, for
> basically the same reason habana labs wants to do it.
>
> In that situation we'd want to see an approach similar to this as well
> to have a broad usability.
>
> The GPU drivers also want this for certain sophisticated scenarios
> with RDMA, the intree drivers just haven't quite got there yet.
>
> So, I think it is worthwhile to start thinking about this regardless
> of habana labs.

Oh sure, I've been having these for a while. I think there are two options:
- some kind of soft-pin, where the contract is that we only revoke
when absolutely necessary, and it's expected to be catastrophic on the
importer's side. The use-case would be a single user that fully controls
all accelerator local memory, so there is no kernel driver evicting stuff. I
haven't implemented it, but the idea is that essentially in eviction we
check whom we're evicting for (by checking owners of buffers maybe;
atm those are not tracked in generic code, but that's not hard to add),
and if it's the same userspace owner we don't ever pick these buffers
as victims for eviction, preferring -ENOMEM/-ENOSPC. If a new user
comes around then we'd still throw these out to avoid abuse, and it
would be up to sysadmins to make sure this doesn't happen at an
untimely moment, maybe with the next thing.

- cgroups for (pinned) buffers. Mostly because cgroups for local
memory is somewhere in the plans anyway, but that one could also take
forever, since there are questions about overlap with memcg and things
like that, plus thus far everyone who cares has made an incompatible
proposal about how it should be done :-/

A variant of the first one would be device-level revoke, which is a
concept we already have in drm for the modesetting side and also for
like 20 year old gpu drivers. We could brush that off and close some
of the gaps (a student is fixing the locking right now, the thing left
to do is mmap revoke), and I think that model of exclusive device
ownership with the option to revoke fits pretty well for at least some
of the accelerators floating around. In that case importers would
never get a move_notify (maybe we should call this revoke_notify to
make it clear it's a bit different) callback, except when the entire
thing has been yanked. I think that would fit pretty well for VFIO,
and I think we should be able to make it work for rdma too as some
kind of auto-deregister. The locking might be fun with both of these,
since I expect some inversions compared to the register path; we'll
have to figure those out.
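
To make the device-level revoke idea a bit more concrete, here is a purely
hypothetical sketch (there is no revoke_notify callback in the dma-buf API
today, and all names below are made up):

	/* Hypothetical: a one-shot, catastrophic revoke instead of move_notify. */
	static void my_importer_revoke_notify(struct dma_buf_attachment *attach)
	{
		struct my_mr *mr = attach->importer_priv;	/* hypothetical MR type */

		/*
		 * Permanently fence DMA, comparable to an automatic deregister;
		 * any later use of the MR returns an error instead of remapping.
		 */
		my_device_fence_mr(mr);				/* hypothetical helper */
		mr->revoked = true;
	}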
-Daniel
Jason Gunthorpe Aug. 20, 2021, 12:33 p.m. UTC | #6
On Fri, Aug 20, 2021 at 09:25:30AM +0200, Daniel Vetter wrote:
> On Fri, Aug 20, 2021 at 1:06 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote:
> > > On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <galpress@amazon.com> wrote:
> > > >
> > > > Hey all,
> > > >
> > > > Currently, the RDMA subsystem can only work with dynamic dmabuf
> > > > attachments, which requires the RDMA device to support on-demand-paging
> > > > (ODP) which is not common on most devices (only supported by mlx5).
> > > >
> > > > While the dynamic requirement makes sense for certain GPUs, some devices
> > > > (such as habanalabs) have device memory that is always "pinned" and do
> > > > not need/use the move_notify operation.
> > > >
> > > > The motivation of this RFC is to use habanalabs as the dmabuf exporter,
> > > > and EFA as the importer to allow for peer2peer access through libibverbs.
> > > >
> > > > This draft patch changes the dmabuf driver to differentiate between
> > > > static/dynamic attachments by looking at the move_notify op instead of
> > > > the importer_ops struct, and allowing the peer2peer flag to be enabled
> > > > in case of a static exporter.
> > > >
> > > > Thanks
> > > >
> > > > Signed-off-by: Gal Pressman <galpress@amazon.com>
> > >
> > > Given that habanalabs dma-buf support is very firmly in limbo (at
> > > least it's not yet in linux-next or anywhere else) I think you want to
> > > solve that problem first before we tackle the additional issue of
> > > making p2p work without dynamic dma-buf. Without that it just doesn't
> > > make a lot of sense really to talk about solutions here.
> >
> > I have been thinking about adding a dmabuf exporter to VFIO, for
> > basically the same reason habana labs wants to do it.
> >
> > In that situation we'd want to see an approach similar to this as well
> > to have a broad usability.
> >
> > The GPU drivers also want this for certain sophisticated scenarios
> > with RDMA, the intree drivers just haven't quite got there yet.
> >
> > So, I think it is worthwhile to start thinking about this regardless
> > of habana labs.
> 
> Oh sure, I've been having these for a while. I think there's two options:
> - some kind of soft-pin, where the contract is that we only revoke
> when absolutely necessary, and it's expected to be catastrophic on the
> importer's side. 

Honestly, I'm not very keen on this. We don't really have HW support
in several RDMA scenarios for even catastrophic unpin.

Gal, can EFA even do this for an MR? You basically have to resize the
rkey/lkey to zero length (or invalidate it like an FMR) under the
catastrophic revoke. The rkey/lkey cannot just be destroyed, as that
opens a security problem with rkey/lkey re-use.

I think I saw that EFA's current out-of-tree implementations had this bug.

> to do is mmap revoke), and I think that model of exclusive device
> ownership with the option to revoke fits pretty well for at least some
> of the accelerators floating around. In that case importers would
> never get a move_notify (maybe we should call this revoke_notify to
> make it clear it's a bit different) callback, except when the entire
> thing has been yanked. I think that would fit pretty well for VFIO,
> and I think we should be able to make it work for rdma too as some
> kind of auto-deregister. The locking might be fun with both of these
> since I expect some inversions compared to the register path, we'll
> have to figure these out.

It fits semantically nicely; VFIO also has a revoke semantic for BAR
mappings.

The challenge is the RDMA side, which doesn't have a 'dma disabled
error state' for objects as part of the spec.

Some HW, like mlx5, can implement this for MR objects (see revoke_mr),
but I don't know if anything else can, and even mlx5 currently can't
do a revoke for any other object type.

I don't know how useful it would be, need to check on some of the use
cases.

The locking is tricky as we have to issue a device command, but that
device command cannot run concurrently with destruction or the tail
part of creation.

Jason
Gal Pressman Aug. 20, 2021, 12:58 p.m. UTC | #7
On 20/08/2021 15:33, Jason Gunthorpe wrote:
> On Fri, Aug 20, 2021 at 09:25:30AM +0200, Daniel Vetter wrote:
>> On Fri, Aug 20, 2021 at 1:06 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>> On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote:
>>>> On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <galpress@amazon.com> wrote:
>>>>>
>>>>> Hey all,
>>>>>
>>>>> Currently, the RDMA subsystem can only work with dynamic dmabuf
>>>>> attachments, which requires the RDMA device to support on-demand-paging
>>>>> (ODP) which is not common on most devices (only supported by mlx5).
>>>>>
>>>>> While the dynamic requirement makes sense for certain GPUs, some devices
>>>>> (such as habanalabs) have device memory that is always "pinned" and do
>>>>> not need/use the move_notify operation.
>>>>>
>>>>> The motivation of this RFC is to use habanalabs as the dmabuf exporter,
>>>>> and EFA as the importer to allow for peer2peer access through libibverbs.
>>>>>
>>>>> This draft patch changes the dmabuf driver to differentiate between
>>>>> static/dynamic attachments by looking at the move_notify op instead of
>>>>> the importer_ops struct, and allowing the peer2peer flag to be enabled
>>>>> in case of a static exporter.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Signed-off-by: Gal Pressman <galpress@amazon.com>
>>>>
>>>> Given that habanalabs dma-buf support is very firmly in limbo (at
>>>> least it's not yet in linux-next or anywhere else) I think you want to
>>>> solve that problem first before we tackle the additional issue of
>>>> making p2p work without dynamic dma-buf. Without that it just doesn't
>>>> make a lot of sense really to talk about solutions here.
>>>
>>> I have been thinking about adding a dmabuf exporter to VFIO, for
>>> basically the same reason habana labs wants to do it.
>>>
>>> In that situation we'd want to see an approach similar to this as well
>>> to have a broad usability.
>>>
>>> The GPU drivers also want this for certain sophisticated scenarios
>>> with RDMA, the intree drivers just haven't quite got there yet.
>>>
>>> So, I think it is worthwhile to start thinking about this regardless
>>> of habana labs.
>>
>> Oh sure, I've been having these for a while. I think there's two options:
>> - some kind of soft-pin, where the contract is that we only revoke
>> when absolutely necessary, and it's expected to be catastrophic on the
>> importer's side. 
> 
> Honestly, I'm not very keen on this. We don't really have HW support
> in several RDMA scenarios for even catastrophic unpin.
> 
> Gal, can EFA even do this for a MR? You basically have to resize the
> rkey/lkey to zero length (or invalidate it like a FMR) under the
> catstrophic revoke. The rkey/lkey cannot just be destroyed as that
> opens a security problem with rkey/lkey re-use.

I had some discussions with our hardware guys about such functionality in the
past, and I think it should be doable (not necessarily the same way that FMR does it).

Though it would've been nicer if we could agree on a solution that could work
for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem has.
That's why I tried to approach this by denying such attachments for non-ODP
importers instead of exposing a "limited" dynamic importer.
Jason Gunthorpe Aug. 20, 2021, 2:32 p.m. UTC | #8
On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:

> Though it would've been nicer if we could agree on a solution that could work
> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem has.

I don't think it can really be done; revoke is necessary, and it isn't a
primitive we have today.

Revoke is sort of like rereg MR, but with a guaranteed no-change to
the lkey/rkey.

Then there is the locking complexity of linking the MR creation and
destruction to the lifecycle of the pages, which is messy and maybe
not general. For instance, mlx5 would call its revoke_mr, disconnect
the dmabuf, then destroy the mkey - but this is only safe because mlx5
HW can handle concurrent revokes.
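
A rough sketch of that ordering, with hypothetical names rather than the
actual mlx5 code:

	static void destroy_dmabuf_mr(struct my_mr *mr)		/* hypothetical */
	{
		/*
		 * 1. Revoke: keep the lkey/rkey allocated but make the key
		 *    useless to the HW, so no further DMA happens and the key
		 *    cannot be reused.
		 */
		my_hw_revoke_mkey(mr);				/* hypothetical helper */

		/* 2. Disconnect the dmabuf: drop the mapping and the attachment. */
		dma_resv_lock(mr->umem_dmabuf->attach->dmabuf->resv, NULL);
		ib_umem_dmabuf_unmap_pages(mr->umem_dmabuf);
		dma_resv_unlock(mr->umem_dmabuf->attach->dmabuf->resv);
		ib_umem_release(&mr->umem_dmabuf->umem);

		/* 3. Only now destroy the mkey itself. */
		my_hw_destroy_mkey(mr);				/* hypothetical helper */
	}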

> That's why I tried to approach this by denying such attachments for non-ODP
> importers instead of exposing a "limited" dynamic importer.

That is fine if there is no revoke - once revoke exists we must have
driver and HW support.

Jason
Gal Pressman Aug. 21, 2021, 9:16 a.m. UTC | #9
On 20/08/2021 17:32, Jason Gunthorpe wrote:
> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
> 
>> Though it would've been nicer if we could agree on a solution that could work
>> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem has.
> 
> I don't think it can really be done, revoke is necessary, and isn't a
> primitive we have today.
> 
> Revoke is sort of like rereg MR, but with a guaranteed no-change to
> the lkey/rkey
> 
> Then there is the locking complexity of linking the mr creation and
> destruction to the lifecycle of the pages, which is messy and maybe
> not general. For instance mlx5 would call its revoke_mr, disconnect
> the dmabuf then destroy the mkey - but this is only safe because mlx5
> HW can handle concurrent revokes.

Thanks, that makes sense.

>> That's why I tried to approach this by denying such attachments for non-ODP
>> importers instead of exposing a "limited" dynamic importer.
> 
> That is fine if there is no revoke - once revoke exists we must have
> driver and HW support.

Agree.
IIUC, we're talking about three different exporter "types":
- Dynamic with move_notify (requires ODP)
- Dynamic with revoke_notify
- Static

Which changes do we need to make the third one work?
Christian König Aug. 23, 2021, 10:43 a.m. UTC | #10
Am 21.08.21 um 11:16 schrieb Gal Pressman:
> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>>
>>> Though it would've been nicer if we could agree on a solution that could work
>>> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem has.
>> I don't think it can really be done, revoke is necessary, and isn't a
>> primitive we have today.
>>
>> Revoke is sort of like rereg MR, but with a guaranteed no-change to
>> the lkey/rkey
>>
>> Then there is the locking complexity of linking the mr creation and
>> destruction to the lifecycle of the pages, which is messy and maybe
>> not general. For instance mlx5 would call its revoke_mr, disconnect
>> the dmabuf then destroy the mkey - but this is only safe because mlx5
>> HW can handle concurrent revokes.
> Thanks, that makes sense.
>
>>> That's why I tried to approach this by denying such attachments for non-ODP
>>> importers instead of exposing a "limited" dynamic importer.
>> That is fine if there is no revoke - once revoke exists we must have
>> driver and HW support.
> Agree.
> IIUC, we're talking about three different exporter "types":
> - Dynamic with move_notify (requires ODP)
> - Dynamic with revoke_notify
> - Static
>
> Which changes do we need to make the third one work?

Basically none at all in the framework.

You just need to properly use the dma_buf_pin() function when you start 
using a buffer (e.g. before you create an attachment) and the 
dma_buf_unpin() function after you are done with the DMA-buf.
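
A rough sketch of that sequence from the importer's side (untested, error
handling trimmed; note that dma_buf_pin() currently WARNs unless the
attachment is dynamic, so importer_ops are still passed here and
move_notify should simply never fire while the buffer stays pinned):

	attach = dma_buf_dynamic_attach(dmabuf, dma_device, &importer_ops, priv);
	if (IS_ERR(attach))
		return PTR_ERR(attach);

	dma_resv_lock(dmabuf->resv, NULL);
	ret = dma_buf_pin(attach);
	if (!ret)
		sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	dma_resv_unlock(dmabuf->resv);

	/* ... program the HW with sgt and do the RDMA transfers ... */

	dma_resv_lock(dmabuf->resv, NULL);
	dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
	dma_buf_unpin(attach);
	dma_resv_unlock(dmabuf->resv);
	dma_buf_detach(dmabuf, attach);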

Regards,
Christian.
Gal Pressman Aug. 24, 2021, 9:06 a.m. UTC | #11
On 23/08/2021 13:43, Christian König wrote:
> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>>>
>>>> Though it would've been nicer if we could agree on a solution that could work
>>>> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem
>>>> has.
>>> I don't think it can really be done, revoke is necessary, and isn't a
>>> primitive we have today.
>>>
>>> Revoke is sort of like rereg MR, but with a guaranteed no-change to
>>> the lkey/rkey
>>>
>>> Then there is the locking complexity of linking the mr creation and
>>> destruction to the lifecycle of the pages, which is messy and maybe
>>> not general. For instance mlx5 would call its revoke_mr, disconnect
>>> the dmabuf then destroy the mkey - but this is only safe because mlx5
>>> HW can handle concurrent revokes.
>> Thanks, that makes sense.
>>
>>>> That's why I tried to approach this by denying such attachments for non-ODP
>>>> importers instead of exposing a "limited" dynamic importer.
>>> That is fine if there is no revoke - once revoke exists we must have
>>> driver and HW support.
>> Agree.
>> IIUC, we're talking about three different exporter "types":
>> - Dynamic with move_notify (requires ODP)
>> - Dynamic with revoke_notify
>> - Static
>>
>> Which changes do we need to make the third one work?
> 
> Basically none at all in the framework.
> 
> You just need to properly use the dma_buf_pin() function when you start using a
> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
> after you are done with the DMA-buf.

I replied to your previous mail, but I'll ask again.
Doesn't the pin operation migrate the memory to host memory?
Christian König Aug. 24, 2021, 9:32 a.m. UTC | #12
Am 24.08.21 um 11:06 schrieb Gal Pressman:
> On 23/08/2021 13:43, Christian König wrote:
>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>>>>
>>>>> Though it would've been nicer if we could agree on a solution that could work
>>>>> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem
>>>>> has.
>>>> I don't think it can really be done, revoke is necessary, and isn't a
>>>> primitive we have today.
>>>>
>>>> Revoke is sort of like rereg MR, but with a guaranteed no-change to
>>>> the lkey/rkey
>>>>
>>>> Then there is the locking complexity of linking the mr creation and
>>>> destruction to the lifecycle of the pages, which is messy and maybe
>>>> not general. For instance mlx5 would call its revoke_mr, disconnect
>>>> the dmabuf then destroy the mkey - but this is only safe because mlx5
>>>> HW can handle concurrent revokes.
>>> Thanks, that makes sense.
>>>
>>>>> That's why I tried to approach this by denying such attachments for non-ODP
>>>>> importers instead of exposing a "limited" dynamic importer.
>>>> That is fine if there is no revoke - once revoke exists we must have
>>>> driver and HW support.
>>> Agree.
>>> IIUC, we're talking about three different exporter "types":
>>> - Dynamic with move_notify (requires ODP)
>>> - Dynamic with revoke_notify
>>> - Static
>>>
>>> Which changes do we need to make the third one work?
>> Basically none at all in the framework.
>>
>> You just need to properly use the dma_buf_pin() function when you start using a
>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
>> after you are done with the DMA-buf.
> I replied to your previous mail, but I'll ask again.
> Doesn't the pin operation migrate the memory to host memory?

Sorry, I missed your previous reply.

And yes, at least for the amdgpu driver we migrate the memory to host 
memory as soon as it is pinned, and I would expect that other GPU drivers 
do something similar.

This is intentional, since we don't want any P2P to video memory with 
pinned objects and want to avoid running into a situation where one 
device is doing P2P to video memory while another device needs the 
DMA-buf in host memory.

You can still do P2P with pinned objects; it's just up to the exporting 
driver whether it is allowed or not.

The other option is what Daniel suggested: that we have some kind of 
revoke. This is essentially what our KFD is doing as well when doing 
interop with 3D GFX, but from Jason's responses I have a bit of doubt 
that this will actually work on the hardware level for RDMA.

Regards,
Christian.
John Hubbard Aug. 24, 2021, 5:27 p.m. UTC | #13
On 8/24/21 2:32 AM, Christian König wrote:
> Am 24.08.21 um 11:06 schrieb Gal Pressman:
>> On 23/08/2021 13:43, Christian König wrote:
>>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
...
>>>> IIUC, we're talking about three different exporter "types":
>>>> - Dynamic with move_notify (requires ODP)
>>>> - Dynamic with revoke_notify
>>>> - Static
>>>>
>>>> Which changes do we need to make the third one work?
>>> Basically none at all in the framework.
>>>
>>> You just need to properly use the dma_buf_pin() function when you start using a
>>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
>>> after you are done with the DMA-buf.
>> I replied to your previous mail, but I'll ask again.
>> Doesn't the pin operation migrate the memory to host memory?
> 
> Sorry missed your previous reply.
> 
> And yes at least for the amdgpu driver we migrate the memory to host memory as soon as it is pinned 
> and I would expect that other GPU drivers do something similar.

Well...for many topologies, migrating to host memory will result in a
dramatically slower p2p setup. For that reason, some GPU drivers may
want to allow pinning of video memory in some situations.

Ideally, you've got modern ODP devices and you don't even need to pin.
But if not, and you still hope to do high performance p2p between a GPU
and a non-ODP Infiniband device, then you would need to leave the pinned
memory in vidmem.

So I think we don't want to rule out that behavior, right? Or is the
thinking more like, "you're lucky that this old non-ODP setup works at
all, and we'll make it work by routing through host/cpu memory, but it
will be slow"?


thanks,
Jason Gunthorpe Aug. 24, 2021, 5:32 p.m. UTC | #14
On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote:
> On 8/24/21 2:32 AM, Christian König wrote:
> > Am 24.08.21 um 11:06 schrieb Gal Pressman:
> > > On 23/08/2021 13:43, Christian König wrote:
> > > > Am 21.08.21 um 11:16 schrieb Gal Pressman:
> > > > > On 20/08/2021 17:32, Jason Gunthorpe wrote:
> > > > > > On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
> ...
> > > > > IIUC, we're talking about three different exporter "types":
> > > > > - Dynamic with move_notify (requires ODP)
> > > > > - Dynamic with revoke_notify
> > > > > - Static
> > > > > 
> > > > > Which changes do we need to make the third one work?
> > > > Basically none at all in the framework.
> > > > 
> > > > You just need to properly use the dma_buf_pin() function when you start using a
> > > > buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
> > > > after you are done with the DMA-buf.
> > > I replied to your previous mail, but I'll ask again.
> > > Doesn't the pin operation migrate the memory to host memory?
> > 
> > Sorry missed your previous reply.
> > 
> > And yes at least for the amdgpu driver we migrate the memory to host
> > memory as soon as it is pinned and I would expect that other GPU drivers
> > do something similar.
> 
> Well...for many topologies, migrating to host memory will result in a
> dramatically slower p2p setup. For that reason, some GPU drivers may
> want to allow pinning of video memory in some situations.
> 
> Ideally, you've got modern ODP devices and you don't even need to pin.
> But if not, and you still hope to do high performance p2p between a GPU
> and a non-ODP Infiniband device, then you would need to leave the pinned
> memory in vidmem.
> 
> So I think we don't want to rule out that behavior, right? Or is the
> thinking more like, "you're lucky that this old non-ODP setup works at
> all, and we'll make it work by routing through host/cpu memory, but it
> will be slow"?

I think it depends on the user: if the user creates memory which is
permanently located on the GPU, then it should be pinnable in this way
without forced migration. But if the memory is inherently migratable
then it just cannot be pinned in the GPU at all, as we can't
indefinitely block migration from happening, e.g. if the CPU touches it
later or something.

Jason
John Hubbard Aug. 24, 2021, 5:35 p.m. UTC | #15
On 8/24/21 10:32 AM, Jason Gunthorpe wrote:
...
>>> And yes at least for the amdgpu driver we migrate the memory to host
>>> memory as soon as it is pinned and I would expect that other GPU drivers
>>> do something similar.
>>
>> Well...for many topologies, migrating to host memory will result in a
>> dramatically slower p2p setup. For that reason, some GPU drivers may
>> want to allow pinning of video memory in some situations.
>>
>> Ideally, you've got modern ODP devices and you don't even need to pin.
>> But if not, and you still hope to do high performance p2p between a GPU
>> and a non-ODP Infiniband device, then you would need to leave the pinned
>> memory in vidmem.
>>
>> So I think we don't want to rule out that behavior, right? Or is the
>> thinking more like, "you're lucky that this old non-ODP setup works at
>> all, and we'll make it work by routing through host/cpu memory, but it
>> will be slow"?
> 
> I think it depends on the user, if the user creates memory which is
> permanently located on the GPU then it should be pinnable in this way
> without force migration. But if the memory is inherently migratable
> then it just cannot be pinned in the GPU at all as we can't
> indefinately block migration from happening eg if the CPU touches it
> later or something.
> 

OK. I just want to avoid creating any API-level assumptions that dma_buf_pin()
necessarily implies or requires migrating to host memory.

thanks,
Dave Airlie Aug. 24, 2021, 7:15 p.m. UTC | #16
On Wed, 25 Aug 2021 at 03:36, John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 8/24/21 10:32 AM, Jason Gunthorpe wrote:
> ...
> >>> And yes at least for the amdgpu driver we migrate the memory to host
> >>> memory as soon as it is pinned and I would expect that other GPU drivers
> >>> do something similar.
> >>
> >> Well...for many topologies, migrating to host memory will result in a
> >> dramatically slower p2p setup. For that reason, some GPU drivers may
> >> want to allow pinning of video memory in some situations.
> >>
> >> Ideally, you've got modern ODP devices and you don't even need to pin.
> >> But if not, and you still hope to do high performance p2p between a GPU
> >> and a non-ODP Infiniband device, then you would need to leave the pinned
> >> memory in vidmem.
> >>
> >> So I think we don't want to rule out that behavior, right? Or is the
> >> thinking more like, "you're lucky that this old non-ODP setup works at
> >> all, and we'll make it work by routing through host/cpu memory, but it
> >> will be slow"?
> >
> > I think it depends on the user, if the user creates memory which is
> > permanently located on the GPU then it should be pinnable in this way
> > without force migration. But if the memory is inherently migratable
> > then it just cannot be pinned in the GPU at all as we can't
> > indefinately block migration from happening eg if the CPU touches it
> > later or something.
> >
>
> OK. I just want to avoid creating any API-level assumptions that dma_buf_pin()
> necessarily implies or requires migrating to host memory.

I'm not sure we should be allowing dma_buf_pin at all on
non-migratable memory; what's to stop someone just pinning all the
VRAM and making the GPU unusable?

I understand not considering more than a single user in these
situations is enterprise thinking, but I do worry about "pinning is
always fine" type of thinking when things are shared or multi-user.

My impression from this is we've designed hardware that didn't
consider the problem, and now to let us use that hardware in horrible
ways we should just allow it to pin all the things.

Dave.
Jason Gunthorpe Aug. 24, 2021, 7:30 p.m. UTC | #17
On Wed, Aug 25, 2021 at 05:15:52AM +1000, Dave Airlie wrote:
> On Wed, 25 Aug 2021 at 03:36, John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 8/24/21 10:32 AM, Jason Gunthorpe wrote:
> > ...
> > >>> And yes at least for the amdgpu driver we migrate the memory to host
> > >>> memory as soon as it is pinned and I would expect that other GPU drivers
> > >>> do something similar.
> > >>
> > >> Well...for many topologies, migrating to host memory will result in a
> > >> dramatically slower p2p setup. For that reason, some GPU drivers may
> > >> want to allow pinning of video memory in some situations.
> > >>
> > >> Ideally, you've got modern ODP devices and you don't even need to pin.
> > >> But if not, and you still hope to do high performance p2p between a GPU
> > >> and a non-ODP Infiniband device, then you would need to leave the pinned
> > >> memory in vidmem.
> > >>
> > >> So I think we don't want to rule out that behavior, right? Or is the
> > >> thinking more like, "you're lucky that this old non-ODP setup works at
> > >> all, and we'll make it work by routing through host/cpu memory, but it
> > >> will be slow"?
> > >
> > > I think it depends on the user, if the user creates memory which is
> > > permanently located on the GPU then it should be pinnable in this way
> > > without force migration. But if the memory is inherently migratable
> > > then it just cannot be pinned in the GPU at all as we can't
> > > indefinately block migration from happening eg if the CPU touches it
> > > later or something.
> > >
> >
> > OK. I just want to avoid creating any API-level assumptions that dma_buf_pin()
> > necessarily implies or requires migrating to host memory.
> 
> I'm not sure we should be allowing dma_buf_pin at all on
> non-migratable memory, what's to stop someone just pinning all the
> VRAM and making the GPU unuseable?

IMHO the same thinking that prevents pinning all of system RAM and
making the system unusable? GPU isn't so special here. The main
restriction is the pinned memory ulimit. For most out-of-the-box cases
this is set to something like 64k.

For the single-user HPC use cases it is made unlimited.
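
For reference, this is roughly the accounting that enforces that limit for
pinned pages (simplified from what e.g. ib_umem_get() does; mm is the
owning mm_struct):

	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	new_pinned = atomic64_add_return(npages, &mm->pinned_vm);
	if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
		/* Over the ulimit and no CAP_IPC_LOCK: undo and fail. */
		atomic64_sub(npages, &mm->pinned_vm);
		return -ENOMEM;
	}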

> My impression from this is we've designed hardware that didn't
> consider the problem, and now to let us use that hardware in horrible
> ways we should just allow it to pin all the things.

It is more complex than that: HW that can support dynamic memory under
*everything* is complicated (and in some cases slow!). As there is
only a weak rationale to do this, we don't see it often in the
market.

Jason
Alex Deucher Aug. 24, 2021, 7:43 p.m. UTC | #18
On Tue, Aug 24, 2021 at 3:16 PM Dave Airlie <airlied@gmail.com> wrote:
>
> On Wed, 25 Aug 2021 at 03:36, John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 8/24/21 10:32 AM, Jason Gunthorpe wrote:
> > ...
> > >>> And yes at least for the amdgpu driver we migrate the memory to host
> > >>> memory as soon as it is pinned and I would expect that other GPU drivers
> > >>> do something similar.
> > >>
> > >> Well...for many topologies, migrating to host memory will result in a
> > >> dramatically slower p2p setup. For that reason, some GPU drivers may
> > >> want to allow pinning of video memory in some situations.
> > >>
> > >> Ideally, you've got modern ODP devices and you don't even need to pin.
> > >> But if not, and you still hope to do high performance p2p between a GPU
> > >> and a non-ODP Infiniband device, then you would need to leave the pinned
> > >> memory in vidmem.
> > >>
> > >> So I think we don't want to rule out that behavior, right? Or is the
> > >> thinking more like, "you're lucky that this old non-ODP setup works at
> > >> all, and we'll make it work by routing through host/cpu memory, but it
> > >> will be slow"?
> > >
> > > I think it depends on the user, if the user creates memory which is
> > > permanently located on the GPU then it should be pinnable in this way
> > > without force migration. But if the memory is inherently migratable
> > > then it just cannot be pinned in the GPU at all as we can't
> > > indefinately block migration from happening eg if the CPU touches it
> > > later or something.
> > >
> >
> > OK. I just want to avoid creating any API-level assumptions that dma_buf_pin()
> > necessarily implies or requires migrating to host memory.
>
> I'm not sure we should be allowing dma_buf_pin at all on
> non-migratable memory, what's to stop someone just pinning all the
> VRAM and making the GPU unuseable?

In a lot of cases we have GPUs with more VRAM than system memory, but
we allow pinning in system memory.

Alex

>
> I understand not considering more than a single user in these
> situations is enterprise thinking, but I do worry about pinning is
> always fine type of thinking when things are shared or multi-user.
>
> My impression from this is we've designed hardware that didn't
> consider the problem, and now to let us use that hardware in horrible
> ways we should just allow it to pin all the things.
>
> Dave.
Xiong, Jianxin Aug. 24, 2021, 8 p.m. UTC | #19
> -----Original Message-----
> From: Alex Deucher <alexdeucher@gmail.com>
> Sent: Tuesday, August 24, 2021 12:44 PM
> To: Dave Airlie <airlied@gmail.com>
> Cc: John Hubbard <jhubbard@nvidia.com>; Jason Gunthorpe <jgg@ziepe.ca>; Christian König <christian.koenig@amd.com>; Gal Pressman
> <galpress@amazon.com>; Daniel Vetter <daniel@ffwll.ch>; Sumit Semwal <sumit.semwal@linaro.org>; Doug Ledford
> <dledford@redhat.com>; open list:DMA BUFFER SHARING FRAMEWORK <linux-media@vger.kernel.org>; dri-devel <dri-
> devel@lists.freedesktop.org>; Linux Kernel Mailing List <linux-kernel@vger.kernel.org>; linux-rdma <linux-rdma@vger.kernel.org>; Gabbay,
> Oded (Habana) <ogabbay@habana.ai>; Tayar, Tomer (Habana) <ttayar@habana.ai>; Yossi Leybovich <sleybo@amazon.com>; Alexander
> Matushevsky <matua@amazon.com>; Leon Romanovsky <leonro@nvidia.com>; Xiong, Jianxin <jianxin.xiong@intel.com>
> Subject: Re: [RFC] Make use of non-dynamic dmabuf in RDMA
> 
> On Tue, Aug 24, 2021 at 3:16 PM Dave Airlie <airlied@gmail.com> wrote:
> >
> > On Wed, 25 Aug 2021 at 03:36, John Hubbard <jhubbard@nvidia.com> wrote:
> > >
> > > On 8/24/21 10:32 AM, Jason Gunthorpe wrote:
> > > ...
> > > >>> And yes at least for the amdgpu driver we migrate the memory to
> > > >>> host memory as soon as it is pinned and I would expect that
> > > >>> other GPU drivers do something similar.
> > > >>
> > > >> Well...for many topologies, migrating to host memory will result
> > > >> in a dramatically slower p2p setup. For that reason, some GPU
> > > >> drivers may want to allow pinning of video memory in some situations.
> > > >>
> > > >> Ideally, you've got modern ODP devices and you don't even need to pin.
> > > >> But if not, and you still hope to do high performance p2p between
> > > >> a GPU and a non-ODP Infiniband device, then you would need to
> > > >> leave the pinned memory in vidmem.
> > > >>
> > > >> So I think we don't want to rule out that behavior, right? Or is
> > > >> the thinking more like, "you're lucky that this old non-ODP setup
> > > >> works at all, and we'll make it work by routing through host/cpu
> > > >> memory, but it will be slow"?
> > > >
> > > > I think it depends on the user, if the user creates memory which
> > > > is permanently located on the GPU then it should be pinnable in
> > > > this way without force migration. But if the memory is inherently
> > > > migratable then it just cannot be pinned in the GPU at all as we
> > > > can't indefinately block migration from happening eg if the CPU
> > > > touches it later or something.
> > > >
> > >
> > > OK. I just want to avoid creating any API-level assumptions that
> > > dma_buf_pin() necessarily implies or requires migrating to host memory.
> >
> > I'm not sure we should be allowing dma_buf_pin at all on
> > non-migratable memory, what's to stop someone just pinning all the
> > VRAM and making the GPU unuseable?
> 
> In a lot of cases we have GPUs with more VRAM than system memory, but we allow pinning in system memory.
> 
> Alex
> 

In addition, the dma-buf exporter can be a non-GPU device.

Jianxin

> >
> > I understand not considering more than a single user in these
> > situations is enterprise thinking, but I do worry about pinning is
> > always fine type of thinking when things are shared or multi-user.
> >
> > My impression from this is we've designed hardware that didn't
> > consider the problem, and now to let us use that hardware in horrible
> > ways we should just allow it to pin all the things.
> >
> > Dave.
Christian König Aug. 25, 2021, 6:17 a.m. UTC | #20
Am 24.08.21 um 19:32 schrieb Jason Gunthorpe:
> On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote:
>> On 8/24/21 2:32 AM, Christian König wrote:
>>> Am 24.08.21 um 11:06 schrieb Gal Pressman:
>>>> On 23/08/2021 13:43, Christian König wrote:
>>>>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>> ...
>>>>>> IIUC, we're talking about three different exporter "types":
>>>>>> - Dynamic with move_notify (requires ODP)
>>>>>> - Dynamic with revoke_notify
>>>>>> - Static
>>>>>>
>>>>>> Which changes do we need to make the third one work?
>>>>> Basically none at all in the framework.
>>>>>
>>>>> You just need to properly use the dma_buf_pin() function when you start using a
>>>>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
>>>>> after you are done with the DMA-buf.
>>>> I replied to your previous mail, but I'll ask again.
>>>> Doesn't the pin operation migrate the memory to host memory?
>>> Sorry missed your previous reply.
>>>
>>> And yes at least for the amdgpu driver we migrate the memory to host
>>> memory as soon as it is pinned and I would expect that other GPU drivers
>>> do something similar.
>> Well...for many topologies, migrating to host memory will result in a
>> dramatically slower p2p setup. For that reason, some GPU drivers may
>> want to allow pinning of video memory in some situations.
>>
>> Ideally, you've got modern ODP devices and you don't even need to pin.
>> But if not, and you still hope to do high performance p2p between a GPU
>> and a non-ODP Infiniband device, then you would need to leave the pinned
>> memory in vidmem.
>>
>> So I think we don't want to rule out that behavior, right? Or is the
>> thinking more like, "you're lucky that this old non-ODP setup works at
>> all, and we'll make it work by routing through host/cpu memory, but it
>> will be slow"?
> I think it depends on the user, if the user creates memory which is
> permanently located on the GPU then it should be pinnable in this way
> without force migration. But if the memory is inherently migratable
> then it just cannot be pinned in the GPU at all as we can't
> indefinately block migration from happening eg if the CPU touches it
> later or something.

Yes, that's exactly the point. GPUs especially have a great variety of 
setups.

For example, we have APUs where the local memory is just stolen system 
memory and all buffers must be migratable, because you might need all 
of this stolen memory for scanout or page tables. In this case P2P only 
makes sense to avoid the migration overhead in the first place.

Then you have dGPUs where only a fraction of the VRAM is accessible from 
the PCIe bus. Here you also absolutely don't want to pin any buffers, 
because that can easily crash when we need to migrate something into the 
visible window for CPU access.

The only real option where you could do P2P with buffer pinning is 
those compute boards where we know that everything is always accessible 
to everybody and we will never need to migrate anything. But even then 
you want some mechanism like cgroups to take care of limiting this. 
Otherwise any runaway process can bring down your whole system.

The key question, at least for me as a GPU maintainer, is whether we are going to see 
modern compute boards together with old non-ODP setups. Since those 
compute boards are usually used with new hardware (like PCIe v4 for 
example), the answer I think is most likely "no".

Christian.

>
> Jason
John Hubbard Aug. 25, 2021, 6:47 a.m. UTC | #21
On 8/24/21 11:17 PM, Christian König wrote:
...
>> I think it depends on the user, if the user creates memory which is
>> permanently located on the GPU then it should be pinnable in this way
>> without force migration. But if the memory is inherently migratable
>> then it just cannot be pinned in the GPU at all as we can't
>> indefinately block migration from happening eg if the CPU touches it
>> later or something.
> 
> Yes, exactly that's the point. Especially GPUs have a great variety of setups.
> 
> For example we have APUs where the local memory is just stolen system memory and all buffers must be 
> migrate-able because you might need all of this stolen memory for scanout or page tables. In this 
> case P2P only makes sense to avoid the migration overhead in the first place.
> 
> Then you got dGPUs where only a fraction of the VRAM is accessible from the PCIe BUS. Here you also 
> absolutely don't want to pin any buffers because that can easily crash when we need to migrate 
> something into the visible window for CPU access.
> 
> The only real option where you could do P2P with buffer pinning are those compute boards where we 
> know that everything is always accessible to everybody and we will never need to migrate anything. 
> But even then you want some mechanism like cgroups to take care of limiting this. Otherwise any 
> runaway process can bring down your whole system.
> 
> Key question at least for me as GPU maintainer is if we are going to see modern compute boards 
> together with old non-ODP setups. Since those compute boards are usually used with new hardware 
> (like PCIe v4 for example) the answer I think is most likely "no".
> 

That is a really good point. Times have changed and I guess ODP is on most (all?) of
the new Infiniband products now, and maybe we don't need to worry so much about
providing first-class support for non-ODP setups.

I've got to drag my brain into 2021+! :)

thanks,
Jason Gunthorpe Aug. 25, 2021, 12:18 p.m. UTC | #22
On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:

> The only real option where you could do P2P with buffer pinning are those
> compute boards where we know that everything is always accessible to
> everybody and we will never need to migrate anything. But even then you want
> some mechanism like cgroups to take care of limiting this. Otherwise any
> runaway process can bring down your whole system.
 
Why? It is not the pin that is the problem; it was allocating GPU
dedicated memory in the first place. Pinning it just changes the
sequence to free it. No different than CPU memory.

> Key question at least for me as GPU maintainer is if we are going to see
> modern compute boards together with old non-ODP setups. 

You should stop thinking about it as 'old non-ODP setups'.  ODP is
very much a special case that allows a narrow set of functionality to
work in a narrow situation. It has a high performance penalty and
narrow HW support.

So yes, compute boards are routinely used in scenarios where ODP is
not available, today and for the foreseeable future.

Jason
Christian König Aug. 25, 2021, 12:27 p.m. UTC | #23
Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
> On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
>
>> The only real option where you could do P2P with buffer pinning are those
>> compute boards where we know that everything is always accessible to
>> everybody and we will never need to migrate anything. But even then you want
>> some mechanism like cgroups to take care of limiting this. Otherwise any
>> runaway process can bring down your whole system.
>   
> Why? It is not the pin that is the problem, it was allocating GPU
> dedicated memory in the first place. pinning it just changes the
> sequence to free it. No different than CPU memory.

Pinning makes the memory un-evictable.

In other words, as long as we don't pin anything, we can support as many 
processes as we want until we run out of swap space. Swapping sucks 
badly because your applications become pretty much unusable, but you 
can easily recover from it by killing some process.

With pinning, on the other hand, somebody sooner or later receives an 
-ENOMEM or -ENOSPC, and there is no guarantee that this goes to the right 
process.

>> Key question at least for me as GPU maintainer is if we are going to see
>> modern compute boards together with old non-ODP setups.
> You should stop thinking about it as 'old non-ODP setups'.  ODP is
> very much a special case that allows a narrow set of functionality to
> work in a narrow situation. It has a high performance penalty and
> narrow HW support.
>
> So yes, compute boards are routinely used in scenarios where ODP is
> not available, today and for the foreseeable future.

Yeah, that's the stuff I'm not sure about.

Christian.

>
> Jason
Jason Gunthorpe Aug. 25, 2021, 12:38 p.m. UTC | #24
On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote:
> Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
> > On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
> > 
> > > The only real option where you could do P2P with buffer pinning are those
> > > compute boards where we know that everything is always accessible to
> > > everybody and we will never need to migrate anything. But even then you want
> > > some mechanism like cgroups to take care of limiting this. Otherwise any
> > > runaway process can bring down your whole system.
> > Why? It is not the pin that is the problem, it was allocating GPU
> > dedicated memory in the first place. pinning it just changes the
> > sequence to free it. No different than CPU memory.
> 
> Pinning makes the memory un-evictable.
> 
> In other words as long as we don't pin anything we can support as many
> processes as we want until we run out of swap space. Swapping sucks badly
> because your applications become pretty much unuseable, but you can easily
> recover from it by killing some process.
> 
> With pinning on the other hand somebody sooner or later receives an -ENOMEM
> or -ENOSPC and there is no guarantee that this goes to the right process.

It is not really different - you have the same failure mode once the
system runs out of swap.

This is really the kernel side trying to push a policy onto the user
side that the user side doesn't want.

Dedicated systems are a significant use case here and should be
supported, even if the same solution wouldn't be applicable to someone
running a desktop.

Jason
Christian König Aug. 25, 2021, 1:51 p.m. UTC | #25
Am 25.08.21 um 14:38 schrieb Jason Gunthorpe:
> On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote:
>> Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
>>> On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
>>>
>>>> The only real option where you could do P2P with buffer pinning are those
>>>> compute boards where we know that everything is always accessible to
>>>> everybody and we will never need to migrate anything. But even then you want
>>>> some mechanism like cgroups to take care of limiting this. Otherwise any
>>>> runaway process can bring down your whole system.
>>> Why? It is not the pin that is the problem, it was allocating GPU
>>> dedicated memory in the first place. pinning it just changes the
>>> sequence to free it. No different than CPU memory.
>> Pinning makes the memory un-evictable.
>>
>> In other words as long as we don't pin anything we can support as many
>> processes as we want until we run out of swap space. Swapping sucks badly
>> because your applications become pretty much unuseable, but you can easily
>> recover from it by killing some process.
>>
>> With pinning on the other hand somebody sooner or later receives an -ENOMEM
>> or -ENOSPC and there is no guarantee that this goes to the right process.
> It is not really different - you have the same failure mode once the
> system runs out of swap.
>
> This is really the kernel side trying to push a policy to the user
> side that the user side doesn't want..

But that is still the right thing to do as far as I can see. Note that 
userspace also doesn't want proper process isolation, since it takes 
extra time.

Kernel development is driven by exposing the hardware functionality in a 
safe and manageable manner to userspace, and not by fulfilling userspace 
requirements.

This is very important because otherwise you create a specialized 
system and not a general-purpose kernel.

> Dedicated systems are a significant use case here and should be
> supported, even if the same solution wouldn't be applicable to someone
> running a desktop.

And exactly that approach is not acceptable.

Christian.

>
> Jason
Jason Gunthorpe Aug. 25, 2021, 2:47 p.m. UTC | #26
On Wed, Aug 25, 2021 at 03:51:14PM +0200, Christian König wrote:
> Am 25.08.21 um 14:38 schrieb Jason Gunthorpe:
> > On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote:
> > > Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
> > > > On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
> > > > 
> > > > > The only real option where you could do P2P with buffer pinning are those
> > > > > compute boards where we know that everything is always accessible to
> > > > > everybody and we will never need to migrate anything. But even then you want
> > > > > some mechanism like cgroups to take care of limiting this. Otherwise any
> > > > > runaway process can bring down your whole system.
> > > > Why? It is not the pin that is the problem, it was allocating GPU
> > > > dedicated memory in the first place. pinning it just changes the
> > > > sequence to free it. No different than CPU memory.
> > > Pinning makes the memory un-evictable.
> > > 
> > > In other words as long as we don't pin anything we can support as many
> > > processes as we want until we run out of swap space. Swapping sucks badly
> > > because your applications become pretty much unuseable, but you can easily
> > > recover from it by killing some process.
> > > 
> > > With pinning on the other hand somebody sooner or later receives an -ENOMEM
> > > or -ENOSPC and there is no guarantee that this goes to the right process.
> > It is not really different - you have the same failure mode once the
> > system runs out of swap.
> > 
> > This is really the kernel side trying to push a policy to the user
> > side that the user side doesn't want..
> 
> But which is still the right thing to do as far as I can see. See userspace
> also doesn't want proper process isolation since it takes extra time.

Why? You are pushing a policy of resource allocation/usage which
more properly belongs in userspace.

> Kernel development is driven by exposing the hardware functionality in a
> save and manageable manner to userspace, and not by fulfilling userspace
> requirements.

I don't agree with this; that is a 1980s view of OS design. So much
these days in the kernel is driven entirely by boutique userspace
requirements and is very much not about the classical abstract role of
an OS.

> > Dedicated systems are a significant use case here and should be
> > supported, even if the same solution wouldn't be applicable to someone
> > running a desktop.
> 
> And exactly that approach is not acceptable.

We have knobs and settings all over the place to allow Linux to
support a broad set of use cases from Android to servers, to HPC. So
long as they can co-exist and the various optional modes do not
critically undermine the security of the kernel, it is well in line
with how things have been evolving in the last 15 years.

Here you are talking entirely about policy to control memory
allocation, which is already well trodden ground for CPU memory. 

There are now endless boutique ways to deal with this, it is a very
narrow view to say that GPU memory is so special and different that
only one way can be the correct/allowed way.

Jason
Christian König Aug. 25, 2021, 3:14 p.m. UTC | #27
Am 25.08.21 um 16:47 schrieb Jason Gunthorpe:
> On Wed, Aug 25, 2021 at 03:51:14PM +0200, Christian König wrote:
>> Am 25.08.21 um 14:38 schrieb Jason Gunthorpe:
>>> On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote:
>>>> Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
>>>>> On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
>>>>>
>>>>>> The only real option where you could do P2P with buffer pinning are those
>>>>>> compute boards where we know that everything is always accessible to
>>>>>> everybody and we will never need to migrate anything. But even then you want
>>>>>> some mechanism like cgroups to take care of limiting this. Otherwise any
>>>>>> runaway process can bring down your whole system.
>>>>> Why? It is not the pin that is the problem, it was allocating GPU
>>>>> dedicated memory in the first place. pinning it just changes the
>>>>> sequence to free it. No different than CPU memory.
>>>> Pinning makes the memory un-evictable.
>>>>
>>>> In other words as long as we don't pin anything we can support as many
>>>> processes as we want until we run out of swap space. Swapping sucks badly
>>>> because your applications become pretty much unuseable, but you can easily
>>>> recover from it by killing some process.
>>>>
>>>> With pinning on the other hand somebody sooner or later receives an -ENOMEM
>>>> or -ENOSPC and there is no guarantee that this goes to the right process.
>>> It is not really different - you have the same failure mode once the
>>> system runs out of swap.
>>>
>>> This is really the kernel side trying to push a policy to the user
>>> side that the user side doesn't want..
>> But which is still the right thing to do as far as I can see. See userspace
>> also doesn't want proper process isolation since it takes extra time.
> Why? You are pushing a policy of resource allocation/usage which
> more properly belongs in userspace.
>
>> Kernel development is driven by exposing the hardware functionality in a
>> save and manageable manner to userspace, and not by fulfilling userspace
>> requirements.
> I don't agree with this, that is a 1980's view of OS design. So much
> these days in the kernel is driven entirely by boutique userspace
> requirements and is very much not about the classical abstract role of
> an OS.

But it's still true nevertheless. Otherwise you would have libraries 
for filesystem access and no system security to speak of.

>>> Dedicated systems are a significant use case here and should be
>>> supported, even if the same solution wouldn't be applicable to someone
>>> running a desktop.
>> And exactly that approach is not acceptable.
> We have knobs and settings all over the place to allow Linux to
> support a broad set of use cases from Android to servers, to HPC. So
> long as they can co-exist and the various optional modes do not
> critically undermine the security of the kernel, it is well in line
> with how things have been evolving in the last 15 years.

Yeah, that's exactly what I'm talking about: adding cgroups or something 
similar. You need a knob to control this.

> Here you are talking entirely about policy to control memory
> allocation, which is already well trodden ground for CPU memory.
>
> There are now endless boutique ways to deal with this, it is a very
> narrow view to say that GPU memory is so special and different that
> only one way can be the correct/allowed way.

Well, I'm not talking about GPU memory in particular here. This is 
mandatory for any memory, or more generally for any resource.

E.g. you are not allowed to pin large amounts of system memory on a 
default installation for exactly those reasons as well.

That you can have a knob to disable this behavior for your HPC system is 
perfectly fine, but I think, as Dave notes here as well, that this is 
most likely not the default behavior we want.

Christian.

>
> Jason
Jason Gunthorpe Aug. 25, 2021, 3:49 p.m. UTC | #28
On Wed, Aug 25, 2021 at 05:14:06PM +0200, Christian König wrote:

> Yeah, that's exactly what I'm talking about by adding cgroup or similar. You
> need a knob to control this.

We have the pinned memory ulimit today.

A pinned memory cgroup might be interesting, but even containers are
covered under the ulimit (IIRC), so the driver to do this work might
not be so strong.
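
Concretely, that ulimit is RLIMIT_MEMLOCK: pinned pages get charged
against mm->pinned_vm and the pin fails once the limit is exceeded.
Roughly the check that ib_umem_get() does today looks like this (the
helper name is made up here, details vary by kernel version):

#include <linux/mm.h>
#include <linux/sched/signal.h>
#include <linux/capability.h>

/* Charge npages against the caller's pinned-memory ulimit. */
static int charge_pinned_pages(struct mm_struct *mm, unsigned long npages)
{
        unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
        unsigned long new_pinned;

        new_pinned = atomic64_add_return(npages, &mm->pinned_vm);
        if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
                /* Over the per-process pin limit, undo the charge and fail. */
                atomic64_sub(npages, &mm->pinned_vm);
                return -ENOMEM;
        }

        return 0;
}

The matching atomic64_sub() runs when the pages are unpinned.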

Jason
Oded Gabbay Aug. 25, 2021, 4:02 p.m. UTC | #29
On Wed, Aug 25, 2021 at 6:14 PM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 25.08.21 um 16:47 schrieb Jason Gunthorpe:
> > On Wed, Aug 25, 2021 at 03:51:14PM +0200, Christian König wrote:
> >> Am 25.08.21 um 14:38 schrieb Jason Gunthorpe:
> >>> On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote:
> >>>> Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
> >>>>> On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
> >>>>>
> >>>>>> The only real option where you could do P2P with buffer pinning are those
> >>>>>> compute boards where we know that everything is always accessible to
> >>>>>> everybody and we will never need to migrate anything. But even then you want
> >>>>>> some mechanism like cgroups to take care of limiting this. Otherwise any
> >>>>>> runaway process can bring down your whole system.
> >>>>> Why? It is not the pin that is the problem, it was allocating GPU
> >>>>> dedicated memory in the first place. pinning it just changes the
> >>>>> sequence to free it. No different than CPU memory.
> >>>> Pinning makes the memory un-evictable.
> >>>>
> >>>> In other words as long as we don't pin anything we can support as many
> >>>> processes as we want until we run out of swap space. Swapping sucks badly
> >>>> because your applications become pretty much unuseable, but you can easily
> >>>> recover from it by killing some process.
> >>>>
> >>>> With pinning on the other hand somebody sooner or later receives an -ENOMEM
> >>>> or -ENOSPC and there is no guarantee that this goes to the right process.
> >>> It is not really different - you have the same failure mode once the
> >>> system runs out of swap.
> >>>
> >>> This is really the kernel side trying to push a policy to the user
> >>> side that the user side doesn't want..
> >> But which is still the right thing to do as far as I can see. See userspace
> >> also doesn't want proper process isolation since it takes extra time.
> > Why? You are pushing a policy of resource allocation/usage which
> > more properly belongs in userspace.
> >
> >> Kernel development is driven by exposing the hardware functionality in a
> >> save and manageable manner to userspace, and not by fulfilling userspace
> >> requirements.
> > I don't agree with this, that is a 1980's view of OS design. So much
> > these days in the kernel is driven entirely by boutique userspace
> > requirements and is very much not about the classical abstract role of
> > an OS.
>
> But it's still true never the less. Otherwise you would have libraries
> for filesystem accesses and no system security to speak of.
>
> >>> Dedicated systems are a significant use case here and should be
> >>> supported, even if the same solution wouldn't be applicable to someone
> >>> running a desktop.
> >> And exactly that approach is not acceptable.
> > We have knobs and settings all over the place to allow Linux to
> > support a broad set of use cases from Android to servers, to HPC. So
> > long as they can co-exist and the various optional modes do not
> > critically undermine the security of the kernel, it is well in line
> > with how things have been evolving in the last 15 years.
>
> Yeah, that's exactly what I'm talking about by adding cgroup or similar.
> You need a knob to control this.
>
> > Here you are talking entirely about policy to control memory
> > allocation, which is already well trodden ground for CPU memory.
> >
> > There are now endless boutique ways to deal with this, it is a very
> > narrow view to say that GPU memory is so special and different that
> > only one way can be the correct/allowed way.
>
> Well I'm not talking about GPU memory in particular here. This is
> mandatory for any memory or saying more general any resource.
>
> E.g. you are not allowed to pin large amount of system memory on a
> default installation for exactly those reasons as well.

Unless you are working on a h/w architecture that is designed
specifically for a single user.
At least in our h/w architecture, we aim for 100% utilization of the
h/w when you are running DL training workloads, as opposed to what is
happening inside a GPU when you are running training.
Therefore, it is counter-productive to serve multiple users from a
performance perspective.
In fact, the training problem is so large that a single ASIC is
hardly sufficient, and you need to run on anywhere from 8 to
64-128 ASICs to get a reasonable training time.
So there is no point in serving multiple users in that case and that's
why our device memory (HBM) is always pinned.

This is totally different from the usual role of a GPU, where you
serve the desktop and multiple other applications that draw on the
screen.

I wouldn't want some knob in the kernel to control the limit on how
much memory I can pin. I just don't think it is useful and/or
user-friendly.

btw, regarding ODP, we don't support it in h/w (yet) because 99% of
the use-cases in DL training will suffer from performance issues if we
don't pin the host memory.
There are extreme topologies, with 10TB of parameters, that may require
this, but like I said, it's very rare and not worth the effort.

Thanks,
Oded

>
> That you can have a knob to disable this behavior for your HPC system is
> perfectly fine, but I thing what Dave notes here as well that this is
> most likely not the default behavior we want.
>
> Christian.
>
> >
> > Jason
>
Gal Pressman Sept. 1, 2021, 11:20 a.m. UTC | #30
On 24/08/2021 20:32, Jason Gunthorpe wrote:
> On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote:
>> On 8/24/21 2:32 AM, Christian König wrote:
>>> Am 24.08.21 um 11:06 schrieb Gal Pressman:
>>>> On 23/08/2021 13:43, Christian König wrote:
>>>>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>> ...
>>>>>> IIUC, we're talking about three different exporter "types":
>>>>>> - Dynamic with move_notify (requires ODP)
>>>>>> - Dynamic with revoke_notify
>>>>>> - Static
>>>>>>
>>>>>> Which changes do we need to make the third one work?
>>>>> Basically none at all in the framework.
>>>>>
>>>>> You just need to properly use the dma_buf_pin() function when you start using a
>>>>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
>>>>> after you are done with the DMA-buf.
>>>> I replied to your previous mail, but I'll ask again.
>>>> Doesn't the pin operation migrate the memory to host memory?
>>>
>>> Sorry missed your previous reply.
>>>
>>> And yes at least for the amdgpu driver we migrate the memory to host
>>> memory as soon as it is pinned and I would expect that other GPU drivers
>>> do something similar.
>>
>> Well...for many topologies, migrating to host memory will result in a
>> dramatically slower p2p setup. For that reason, some GPU drivers may
>> want to allow pinning of video memory in some situations.
>>
>> Ideally, you've got modern ODP devices and you don't even need to pin.
>> But if not, and you still hope to do high performance p2p between a GPU
>> and a non-ODP Infiniband device, then you would need to leave the pinned
>> memory in vidmem.
>>
>> So I think we don't want to rule out that behavior, right? Or is the
>> thinking more like, "you're lucky that this old non-ODP setup works at
>> all, and we'll make it work by routing through host/cpu memory, but it
>> will be slow"?
> 
> I think it depends on the user, if the user creates memory which is
> permanently located on the GPU then it should be pinnable in this way
> without force migration. But if the memory is inherently migratable
> then it just cannot be pinned in the GPU at all as we can't
> indefinately block migration from happening eg if the CPU touches it
> later or something.

So are we OK with exporters implementing dma_buf_pin() without migrating the memory?
If so, do we still want a move_notify callback for non-dynamic importers? A noop?
Christian König Sept. 1, 2021, 11:24 a.m. UTC | #31
Am 01.09.21 um 13:20 schrieb Gal Pressman:
> On 24/08/2021 20:32, Jason Gunthorpe wrote:
>> On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote:
>>> On 8/24/21 2:32 AM, Christian König wrote:
>>>> Am 24.08.21 um 11:06 schrieb Gal Pressman:
>>>>> On 23/08/2021 13:43, Christian König wrote:
>>>>>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>>>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>>> ...
>>>>>>> IIUC, we're talking about three different exporter "types":
>>>>>>> - Dynamic with move_notify (requires ODP)
>>>>>>> - Dynamic with revoke_notify
>>>>>>> - Static
>>>>>>>
>>>>>>> Which changes do we need to make the third one work?
>>>>>> Basically none at all in the framework.
>>>>>>
>>>>>> You just need to properly use the dma_buf_pin() function when you start using a
>>>>>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
>>>>>> after you are done with the DMA-buf.
>>>>> I replied to your previous mail, but I'll ask again.
>>>>> Doesn't the pin operation migrate the memory to host memory?
>>>> Sorry missed your previous reply.
>>>>
>>>> And yes at least for the amdgpu driver we migrate the memory to host
>>>> memory as soon as it is pinned and I would expect that other GPU drivers
>>>> do something similar.
>>> Well...for many topologies, migrating to host memory will result in a
>>> dramatically slower p2p setup. For that reason, some GPU drivers may
>>> want to allow pinning of video memory in some situations.
>>>
>>> Ideally, you've got modern ODP devices and you don't even need to pin.
>>> But if not, and you still hope to do high performance p2p between a GPU
>>> and a non-ODP Infiniband device, then you would need to leave the pinned
>>> memory in vidmem.
>>>
>>> So I think we don't want to rule out that behavior, right? Or is the
>>> thinking more like, "you're lucky that this old non-ODP setup works at
>>> all, and we'll make it work by routing through host/cpu memory, but it
>>> will be slow"?
>> I think it depends on the user, if the user creates memory which is
>> permanently located on the GPU then it should be pinnable in this way
>> without force migration. But if the memory is inherently migratable
>> then it just cannot be pinned in the GPU at all as we can't
>> indefinately block migration from happening eg if the CPU touches it
>> later or something.
> So are we OK with exporters implementing dma_buf_pin() without migrating the memory?

I think so, yes.

> If so, do we still want a move_notify callback for non-dynamic importers? A noop?

Well we could make the move_notify callback optional, e.g. so that you 
get the new locking approach but still pin the buffers manually with 
dma_buf_pin().
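
E.g. the importer side could then look roughly like this (just a 
sketch under that assumption; the ops/function names are made up and 
error handling is trimmed):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

static const struct dma_buf_attach_ops importer_attach_ops = {
        .allow_peer2peer = true,
        /* no .move_notify: this importer cannot handle invalidations */
};

static struct sg_table *importer_attach_and_pin(struct dma_buf *dmabuf,
                                                struct device *dev,
                                                struct dma_buf_attachment **pattach)
{
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;
        int ret;

        attach = dma_buf_dynamic_attach(dmabuf, dev, &importer_attach_ops,
                                        NULL);
        if (IS_ERR(attach))
                return ERR_CAST(attach);

        dma_resv_lock(dmabuf->resv, NULL);
        /* Keep the buffer resident instead of handling move_notify. */
        ret = dma_buf_pin(attach);
        if (ret) {
                sgt = ERR_PTR(ret);
                goto unlock;
        }

        sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt))
                dma_buf_unpin(attach);
unlock:
        dma_resv_unlock(dmabuf->resv);

        if (IS_ERR(sgt))
                dma_buf_detach(dmabuf, attach);
        else
                *pattach = attach;

        return sgt;
}

Teardown would then roughly be dma_buf_unmap_attachment() and 
dma_buf_unpin() under the reservation lock, followed by 
dma_buf_detach().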

Regards,
Christian.
Gal Pressman Sept. 2, 2021, 6:56 a.m. UTC | #32
On 01/09/2021 14:24, Christian König wrote:
> 
> 
> Am 01.09.21 um 13:20 schrieb Gal Pressman:
>> On 24/08/2021 20:32, Jason Gunthorpe wrote:
>>> On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote:
>>>> On 8/24/21 2:32 AM, Christian König wrote:
>>>>> Am 24.08.21 um 11:06 schrieb Gal Pressman:
>>>>>> On 23/08/2021 13:43, Christian König wrote:
>>>>>>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>>>>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>>>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>>>> ...
>>>>>>>> IIUC, we're talking about three different exporter "types":
>>>>>>>> - Dynamic with move_notify (requires ODP)
>>>>>>>> - Dynamic with revoke_notify
>>>>>>>> - Static
>>>>>>>>
>>>>>>>> Which changes do we need to make the third one work?
>>>>>>> Basically none at all in the framework.
>>>>>>>
>>>>>>> You just need to properly use the dma_buf_pin() function when you start
>>>>>>> using a
>>>>>>> buffer (e.g. before you create an attachment) and the dma_buf_unpin()
>>>>>>> function
>>>>>>> after you are done with the DMA-buf.
>>>>>> I replied to your previous mail, but I'll ask again.
>>>>>> Doesn't the pin operation migrate the memory to host memory?
>>>>> Sorry missed your previous reply.
>>>>>
>>>>> And yes at least for the amdgpu driver we migrate the memory to host
>>>>> memory as soon as it is pinned and I would expect that other GPU drivers
>>>>> do something similar.
>>>> Well...for many topologies, migrating to host memory will result in a
>>>> dramatically slower p2p setup. For that reason, some GPU drivers may
>>>> want to allow pinning of video memory in some situations.
>>>>
>>>> Ideally, you've got modern ODP devices and you don't even need to pin.
>>>> But if not, and you still hope to do high performance p2p between a GPU
>>>> and a non-ODP Infiniband device, then you would need to leave the pinned
>>>> memory in vidmem.
>>>>
>>>> So I think we don't want to rule out that behavior, right? Or is the
>>>> thinking more like, "you're lucky that this old non-ODP setup works at
>>>> all, and we'll make it work by routing through host/cpu memory, but it
>>>> will be slow"?
>>> I think it depends on the user, if the user creates memory which is
>>> permanently located on the GPU then it should be pinnable in this way
>>> without force migration. But if the memory is inherently migratable
>>> then it just cannot be pinned in the GPU at all as we can't
>>> indefinately block migration from happening eg if the CPU touches it
>>> later or something.
>> So are we OK with exporters implementing dma_buf_pin() without migrating the
>> memory?
> 
> I think so, yes.
> 
>> If so, do we still want a move_notify callback for non-dynamic importers? A noop?
> 
> Well we could make the move_notify callback optional, e.g. so that you get the
> new locking approach but still pin the buffers manually with dma_buf_pin().

Thanks Christian!
So the end result will look similar to the original patch I posted, where
peer2peer can be enabled without providing move_notify, correct?
diff mbox series

Patch

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 511fe0d217a0..e3faad8f492c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -727,7 +727,8 @@  dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
 	if (WARN_ON(!dmabuf || !dev))
 		return ERR_PTR(-EINVAL);
 
-	if (WARN_ON(importer_ops && !importer_ops->move_notify))
+	if (WARN_ON(importer_ops && !importer_ops->move_notify &&
+		    dma_buf_is_dynamic(attach->dmabuf)))
 		return ERR_PTR(-EINVAL);
 
 	attach = kzalloc(sizeof(*attach), GFP_KERNEL);
@@ -1048,7 +1049,7 @@  void dma_buf_move_notify(struct dma_buf *dmabuf)
 	dma_resv_assert_held(dmabuf->resv);
 
 	list_for_each_entry(attach, &dmabuf->attachments, node)
-		if (attach->importer_ops)
+		if (attach->importer_ops && attach->importer_ops->move_notify)
 			attach->importer_ops->move_notify(attach);
 }
 EXPORT_SYMBOL_GPL(dma_buf_move_notify);
diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c
index c6e875619fac..c502ae828bd3 100644
--- a/drivers/infiniband/core/umem_dmabuf.c
+++ b/drivers/infiniband/core/umem_dmabuf.c
@@ -118,7 +118,7 @@  struct ib_umem_dmabuf *ib_umem_dmabuf_get(struct ib_device *device,
 	if (check_add_overflow(offset, (unsigned long)size, &end))
 		return ret;
 
-	if (unlikely(!ops || !ops->move_notify))
+	if (unlikely(!ops))
 		return ret;
 
 	dmabuf = dma_buf_get(fd);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index efdc56b9d95f..4b2e99012cb1 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -473,7 +473,7 @@  static inline bool dma_buf_is_dynamic(struct dma_buf *dmabuf)
 static inline bool
 dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
 {
-	return !!attach->importer_ops;
+	return !!attach->importer_ops->move_notify;
 }
 
 struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,