
[rdma-next,00/13] Elastic Fabric Adapter (EFA) driver

Message ID 1543925069-8838-1-git-send-email-galpress@amazon.com
Series Elastic Fabric Adapter (EFA) driver

Message

Gal Pressman Dec. 4, 2018, 12:04 p.m. UTC
Hello all,
The following patchset introduces the Elastic Fabric Adapter (EFA) driver, that
was pre-announced by Amazon [1].

EFA is a networking adapter designed to support user space network
communication, initially offered in the Amazon EC2 environment. First release
of EFA supports datagram send/receive operations and does not support
connection-oriented or read/write operations.

EFA supports unreliable datagrams (UD) as well as a new unordered, scalable
reliable datagram protocol (SRD). SRD provides support for reliable datagrams
and more complete error handling than typical RD, but, unlike RD, it does not
support ordering or segmentation. A new queue pair type, IB_QPT_SRD, is added
to expose it.
User verbs are supported via a dedicated userspace libfabric provider.
Kernel verbs and in-kernel services are initially not supported.

EFA enabled EC2 instances have two different devices allocated, one for ENA
(netdev) and one for EFA, the two are separate pci devices with no in-kernel
communication between them.

Thanks,
Gal

[1] https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-elastic-fabric-adapter/

Gal Pressman (13):
  RDMA: Add EFA related definitions
  RDMA/efa: Add EFA device definitions
  RDMA/efa: Add the PCI device id definitions
  RDMA/efa: Add the efa.h header file
  RDMA/efa: Add the efa_com.h file
  RDMA/efa: Add the com service API definitions
  RDMA/efa: Add the ABI definitions
  RDMA/efa: Implement functions that submit and complete admin commands
  RDMA/efa: Add com command handlers
  RDMA/efa: Add bitmap allocation service
  RDMA/efa: Add EFA verbs implementation
  RDMA/efa: Add the efa module
  RDMA/efa: Add driver to Kconfig/Makefile

 MAINTAINERS                                     |    8 +
 drivers/infiniband/Kconfig                      |    2 +
 drivers/infiniband/core/verbs.c                 |    2 +
 drivers/infiniband/hw/Makefile                  |    1 +
 drivers/infiniband/hw/efa/Kconfig               |   14 +
 drivers/infiniband/hw/efa/Makefile              |    8 +
 drivers/infiniband/hw/efa/efa.h                 |  191 +++
 drivers/infiniband/hw/efa/efa_admin_cmds_defs.h |  783 ++++++++++
 drivers/infiniband/hw/efa/efa_admin_defs.h      |  135 ++
 drivers/infiniband/hw/efa/efa_bitmap.c          |   76 +
 drivers/infiniband/hw/efa/efa_com.c             | 1122 ++++++++++++++
 drivers/infiniband/hw/efa/efa_com.h             |  139 ++
 drivers/infiniband/hw/efa/efa_com_cmd.c         |  544 +++++++
 drivers/infiniband/hw/efa/efa_com_cmd.h         |  217 +++
 drivers/infiniband/hw/efa/efa_common_defs.h     |   17 +
 drivers/infiniband/hw/efa/efa_main.c            |  669 +++++++++
 drivers/infiniband/hw/efa/efa_pci_id_tbl.h      |   25 +
 drivers/infiniband/hw/efa/efa_regs_defs.h       |  117 ++
 drivers/infiniband/hw/efa/efa_verbs.c           | 1827 +++++++++++++++++++++++
 include/rdma/ib_verbs.h                         |    9 +-
 include/uapi/rdma/efa-abi.h                     |   89 ++
 21 files changed, 5993 insertions(+), 2 deletions(-)
 create mode 100644 drivers/infiniband/hw/efa/Kconfig
 create mode 100644 drivers/infiniband/hw/efa/Makefile
 create mode 100644 drivers/infiniband/hw/efa/efa.h
 create mode 100644 drivers/infiniband/hw/efa/efa_admin_cmds_defs.h
 create mode 100644 drivers/infiniband/hw/efa/efa_admin_defs.h
 create mode 100644 drivers/infiniband/hw/efa/efa_bitmap.c
 create mode 100644 drivers/infiniband/hw/efa/efa_com.c
 create mode 100644 drivers/infiniband/hw/efa/efa_com.h
 create mode 100644 drivers/infiniband/hw/efa/efa_com_cmd.c
 create mode 100644 drivers/infiniband/hw/efa/efa_com_cmd.h
 create mode 100644 drivers/infiniband/hw/efa/efa_common_defs.h
 create mode 100644 drivers/infiniband/hw/efa/efa_main.c
 create mode 100644 drivers/infiniband/hw/efa/efa_pci_id_tbl.h
 create mode 100644 drivers/infiniband/hw/efa/efa_regs_defs.h
 create mode 100644 drivers/infiniband/hw/efa/efa_verbs.c
 create mode 100644 include/uapi/rdma/efa-abi.h

Comments

Jason Gunthorpe Dec. 4, 2018, 3:33 p.m. UTC | #1
On Tue, Dec 04, 2018 at 02:04:16PM +0200, Gal Pressman wrote:
> Hello all,
> The following patchset introduces the Elastic Fabric Adapter (EFA) driver, that
> was pre-announced by Amazon [1].
> 
> EFA is a networking adapter designed to support user space network
> communication, initially offered in the Amazon EC2 environment. First release
> of EFA supports datagram send/receive operations and does not support
> connection-oriented or read/write operations.
> 
> EFA supports unreliable datagrams (UD) as well as a new unordered, scalable
> reliable datagram protocol (SRD). SRD provides support for reliable datagrams
> and more complete error handling than typical RD, but, unlike RD, it does not
> support ordering nor segmentation. A new queue pair type, IB_QPT_SRD, is added
> to expose this new queue pair type.
> User verbs are supported via a dedicated userspace libfabric provider.
> Kernel verbs and in-kernel services are initially not supported.

There are some general expectations for new drivers in rdma you should
be aware of..

1) Accepting usnic to RDMA is widely regarded as a mistake. This means
   that future drivers that do not implement 'enough' common verbs are
   not likely to be accepted, as they are not RDMA devices. I'm
   worried this driver doesn't cross the threshold.

2) Any change to common verbs must be supported by full public
   documentation good enough to allow another implementation. So
   most likely the IB_QPT_SRD has to go away (this is a NAK). Driver
   specific QP types can be implemented via driver udata.

3) We need to see the userspace for new drivers. A RDMA driver that
   doesn't provide a useful rdma-core provider is deeply suspect as
   not crossing the #1 threshold.

Jason
Gal Pressman Dec. 5, 2018, 8:55 a.m. UTC | #2
On 04-Dec-18 17:33, Jason Gunthorpe wrote:
> On Tue, Dec 04, 2018 at 02:04:16PM +0200, Gal Pressman wrote:
>> Hello all,
>> The following patchset introduces the Elastic Fabric Adapter (EFA) driver, that
>> was pre-announced by Amazon [1].
>>
>> EFA is a networking adapter designed to support user space network
>> communication, initially offered in the Amazon EC2 environment. First release
>> of EFA supports datagram send/receive operations and does not support
>> connection-oriented or read/write operations.
>>
>> EFA supports unreliable datagrams (UD) as well as a new unordered, scalable
>> reliable datagram protocol (SRD). SRD provides support for reliable datagrams
>> and more complete error handling than typical RD, but, unlike RD, it does not
>> support ordering nor segmentation. A new queue pair type, IB_QPT_SRD, is added
>> to expose this new queue pair type.
>> User verbs are supported via a dedicated userspace libfabric provider.
>> Kernel verbs and in-kernel services are initially not supported.
> 
> There are some general expectations for new drivers in rdma you should
> be aware of..
> 
> 1) Accepting usnic to RDMA is widely regarded as a mistake. This means
>    that future drivers that do not implement 'enough' common verbs are
>    not likely to be accepted, as they are not RDMA devices. I'm
>    worried this driver doesn't cross the threshold.

Hi Jason,
What do you mean by enough common verbs? We implement everything our device can
support for now - and we provide a functional RDMA device. Can you please be
more specific as to what is missing?
We can also implement a basic set of kernel verbs, but since we don't support RC
QP type there will not be any use for them - I wouldn't want to add dead code to
the driver just for the sake of having an implementation.

> 
> 2) Any change to common verbs must be supported by full public
>    documentation good enough to allow another implementation. So
>    most likely the IB_QPT_SRD has to go away (this is a NAK). Driver
>    specific QP types can be implemented via driver udata.

We can use a driver-specific QP type for now; perhaps in the future we'll be able
to expose more detailed documentation about SRD.
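
For reference, a minimal sketch of what the udata approach could look like on the
kernel side; the efa_ibv_create_qp layout, the EFA_QP_* values and the
efa_create_qp_from_type() helper below are hypothetical, not the actual EFA ABI
or code:

#include <linux/err.h>
#include <rdma/ib_verbs.h>

/* Hypothetical driver-private values and ABI struct (would live in
 * include/uapi/rdma/efa-abi.h). */
enum { EFA_QP_UD = 0, EFA_QP_SRD = 1 };

struct efa_ibv_create_qp {
	__u32 comp_mask;
	__u32 driver_qp_type;
};

/* Hypothetical device-specific create path. */
struct ib_qp *efa_create_qp_from_type(struct ib_pd *ibpd,
				      struct ib_qp_init_attr *init_attr,
				      u32 driver_qp_type);

static struct ib_qp *efa_create_qp(struct ib_pd *ibpd,
				   struct ib_qp_init_attr *init_attr,
				   struct ib_udata *udata)
{
	struct efa_ibv_create_qp cmd = {};
	int err;

	/* Common verbs sees only a generic QP type; the real type is
	 * carried in the driver-private udata blob. */
	err = ib_copy_from_udata(&cmd, udata,
				 min(sizeof(cmd), udata->inlen));
	if (err)
		return ERR_PTR(err);

	if (cmd.driver_qp_type != EFA_QP_UD &&
	    cmd.driver_qp_type != EFA_QP_SRD)
		return ERR_PTR(-EINVAL);

	return efa_create_qp_from_type(ibpd, init_attr, cmd.driver_qp_type);
}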

> 
> 3) We need to see the userspace for new drivers. A RDMA driver that
>    doesn't provide a useful rdma-core provider is deeply suspect as
>    not crossing the #1 threshold.

We'll be publishing our libfabric userspace provider soon, I'll add a link once
it's available.

> 
> Jason
>
Jason Gunthorpe Dec. 5, 2018, 3:15 p.m. UTC | #3
On Wed, Dec 05, 2018 at 10:55:59AM +0200, Gal Pressman wrote:
> On 04-Dec-18 17:33, Jason Gunthorpe wrote:
> > On Tue, Dec 04, 2018 at 02:04:16PM +0200, Gal Pressman wrote:
> >> Hello all,
> >> The following patchset introduces the Elastic Fabric Adapter (EFA) driver, that
> >> was pre-announced by Amazon [1].
> >>
> >> EFA is a networking adapter designed to support user space network
> >> communication, initially offered in the Amazon EC2 environment. First release
> >> of EFA supports datagram send/receive operations and does not support
> >> connection-oriented or read/write operations.
> >>
> >> EFA supports unreliable datagrams (UD) as well as a new unordered, scalable
> >> reliable datagram protocol (SRD). SRD provides support for reliable datagrams
> >> and more complete error handling than typical RD, but, unlike RD, it does not
> >> support ordering nor segmentation. A new queue pair type, IB_QPT_SRD, is added
> >> to expose this new queue pair type.
> >> User verbs are supported via a dedicated userspace libfabric provider.
> >> Kernel verbs and in-kernel services are initially not supported.
> > 
> > There are some general expectations for new drivers in rdma you should
> > be aware of..
> > 
> > 1) Accepting usnic to RDMA is widely regarded as a mistake. This means
> >    that future drivers that do not implement 'enough' common verbs are
> >    not likely to be accepted, as they are not RDMA devices. I'm
> >    worried this driver doesn't cross the threshold.
> 
> Hi Jason,
> What do you mean by enough common verbs? 

No one has really come up with a guideline, other than usnic was a
mistake.

> we implement everything our device can support for now - and we
> provide a functional RDMA device. 

Is it a 'functional RDMA device'? It doesn't work with any kernel ULP,
it doesn't have a libibverbs provider, it doesn't implement any
pre-existing protocol... It seems just like usnic.

.. and there is no documentation, what is the format for AH's on UD?
does it do the verbs memory layout with the 40 byte offset? etc. Just
like usnic :|

> Can you please be more specific as to what is missing?  We can also
> implement a basic set of kernel verbs, but since we don't support RC
> QP type there will not be any use for them - I wouldn't want to add
> dead code to the driver just for the sake of having an
> implementation.

I agree you shouldn't implement kernel verbs for UD. What is missing
is that your HW does not seem to be RDMA hardware. Implementing an
undocumented, UD-only, no-kverbs, proprietary-protocol, libfabric-only
driver is exactly what usnic did, and is exactly why it has
been viewed as a mistake.

Using rdma as a hacky way to do shared-process vfio is not
really appropriate.

> > 2) Any change to common verbs must be supported by full public
> >    documentation good enough to allow another implementation. So
> >    most likely the IB_QPT_SRD has to go away (this is a NAK). Driver
> >    specific QP types can be implemented via driver udata.
> 
> We can use driver specific QP type for now, perhaps in the future we'll be able
> to expose more detailed documentation about SRD.

Without kverbs or rdma-core it is irrelevant what you call it.
 
> > 3) We need to see the userspace for new drivers. A RDMA driver that
> >    doesn't provide a useful rdma-core provider is deeply suspect as
> >    not crossing the #1 threshold.
> 
> We'll be publishing our libfabric userspace provider soon, I'll add
> a link once it's available.

libfabric does not do the checks on the ABI construction that
rdma-core does, so that still has to be done somehow
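
To illustrate the kind of ABI construction checks meant here (a generic sketch
only, not rdma-core's actual macros, and the struct below is hypothetical):
command structs shared between userspace and the kernel are built from
fixed-width types and pinned down with compile-time asserts so the layout
cannot drift between the two sides.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver command struct shared between userspace and kernel. */
struct efa_create_qp_cmd {
	uint32_t comp_mask;
	uint32_t driver_qp_type;
	uint64_t buf_addr;
};

/* Compile-time layout checks: stable total size and no implicit padding. */
_Static_assert(sizeof(struct efa_create_qp_cmd) == 16,
	       "ABI struct size must not change");
_Static_assert(offsetof(struct efa_create_qp_cmd, buf_addr) == 8,
	       "no implicit hole before buf_addr");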
 
Jason
Dennis Dalessandro Dec. 5, 2018, 3:54 p.m. UTC | #4
On 12/5/2018 10:15 AM, Jason Gunthorpe wrote:
> On Wed, Dec 05, 2018 at 10:55:59AM +0200, Gal Pressman wrote:
>> On 04-Dec-18 17:33, Jason Gunthorpe wrote:
>>> On Tue, Dec 04, 2018 at 02:04:16PM +0200, Gal Pressman wrote:
>>>> Hello all,
>>>> The following patchset introduces the Elastic Fabric Adapter (EFA) driver, that
>>>> was pre-announced by Amazon [1].
>>>>
>>>> EFA is a networking adapter designed to support user space network
>>>> communication, initially offered in the Amazon EC2 environment. First release
>>>> of EFA supports datagram send/receive operations and does not support
>>>> connection-oriented or read/write operations.
>>>>
>>>> EFA supports unreliable datagrams (UD) as well as a new unordered, scalable
>>>> reliable datagram protocol (SRD). SRD provides support for reliable datagrams
>>>> and more complete error handling than typical RD, but, unlike RD, it does not
>>>> support ordering nor segmentation. A new queue pair type, IB_QPT_SRD, is added
>>>> to expose this new queue pair type.
>>>> User verbs are supported via a dedicated userspace libfabric provider.
>>>> Kernel verbs and in-kernel services are initially not supported.
>>>
>>> There are some general expectations for new drivers in rdma you should
>>> be aware of..
>>>
>>> 1) Accepting usnic to RDMA is widely regarded as a mistake. This means
>>>     that future drivers that do not implement 'enough' common verbs are
>>>     not likely to be accepted, as they are not RDMA devices. I'm
>>>     worried this driver doesn't cross the threshold.
>>
>> Hi Jason,
>> What do you mean by enough common verbs?
> 
> No one has really come up with a guideline, other than usnic was a
> mistake.
> 
>> we implement everything our device can support for now - and we
>> provide a functional RDMA device.
> 
> Is it a 'functional RDMA device'? It doesn't work with any kernel ULP,
> it doesn't have a libibverbs provider, it doesn't implement any
> pre-existing protocol... It seems just like usnic.
> 
> .. and there is no documentation, what is the format for AH's on UD?
> does it do the verbs memory layout with the 40 byte offset? etc. Just
> like usnic :|
> 
>> Can you please be more specific as to what is missing?  We can also
>> implement a basic set of kernel verbs, but since we don't support RC
>> QP type there will not be any use for them - I wouldn't want to add
>> dead code to the driver just for the sake of having an
>> implementation.
> 
> I agree you shouldn't implement kernel verbs for UD. What is missing
> is that your HW does not seem to be RDMA hardware. Implementing a
> undocumented UD-only, no kverbs, a propritary protocol and
> libfabric-only is exactly what usnic did, and is exactly why it has
> been viewed as a mistake.
> 
> Using rdma has a hacky way to do shared process vfio is not
> really appropriate.
> 
>>> 2) Any change to common verbs must be supported by full public
>>>     documentation good enough to allow another implementation. So
>>>     most likely the IB_QPT_SRD has to go away (this is a NAK). Driver
>>>     specific QP types can be implemented via driver udata.
>>
>> We can use driver specific QP type for now, perhaps in the future we'll be able
>> to expose more detailed documentation about SRD.
> 
> Without kverbs or rdma-core it is irrelevant what you call it.
>   
>>> 3) We need to see the userspace for new drivers. A RDMA driver that
>>>     doesn't provide a useful rdma-core provider is deeply suspect as
>>>     not crossing the #1 threshold.
>>
>> We'll be publishing our libfabric userspace provider soon, I'll add
>> a link once it's available.
> 
> libfabric does not do the checks on the ABI construction that
> rdma-core does, so that still has to be done somehow
>   

Are you wanting to draw a line in the sand here and say rdma-core user 
space support is required for a driver to exist in the RDMA tree? I 
don't think that should be a requirement, but certainly there needs to 
be a consumer of this. Be it libfabric or kernel ULP, something.

-Denny
Jason Gunthorpe Dec. 5, 2018, 4:05 p.m. UTC | #5
On Wed, Dec 05, 2018 at 10:54:38AM -0500, Dennis Dalessandro wrote:

> > > > 3) We need to see the userspace for new drivers. A RDMA driver that
> > > >     doesn't provide a useful rdma-core provider is deeply suspect as
> > > >     not crossing the #1 threshold.
> > > 
> > > We'll be publishing our libfabric userspace provider soon, I'll add
> > > a link once it's available.
> > 
> > libfabric does not do the checks on the ABI construction that
> > rdma-core does, so that still has to be done somehow
> 
> Are you wanting to draw a line in the sand here and say rdma-core user space
> support is required for a driver to exist in the RDMA tree? I don't think
> that should be a requirement, but certainly there needs to be a consumer of
> this. Be it libfabric or kernel ULP, something.

I'm saying someone *at least* needs to do all the driver ABI structure
checks that we do in rdma-core.

.. and absolutely, if the driver implements kernel verbs/common verbs
there must be a rdma-core verbs provider for it as well. I don't think
that is unreasonable.

The question here is if we should even allow drivers under
drivers/infiniband that don't even implement common verbs, like usnic.

Jason
Dennis Dalessandro Dec. 5, 2018, 4:28 p.m. UTC | #6
On 12/5/2018 11:05 AM, Jason Gunthorpe wrote:
> On Wed, Dec 05, 2018 at 10:54:38AM -0500, Dennis Dalessandro wrote:
> 
>>>>> 3) We need to see the userspace for new drivers. A RDMA driver that
>>>>>      doesn't provide a useful rdma-core provider is deeply suspect as
>>>>>      not crossing the #1 threshold.
>>>>
>>>> We'll be publishing our libfabric userspace provider soon, I'll add
>>>> a link once it's available.
>>>
>>> libfabric does not do the checks on the ABI construction that
>>> rdma-core does, so that still has to be done somehow
>>
>> Are you wanting to draw a line in the sand here and say rdma-core user space
>> support is required for a driver to exist in the RDMA tree? I don't think
>> that should be a requirement, but certainly there needs to be a consumer of
>> this. Be it libfabric or kernel ULP, something.
> 
> I'm saying someone *at least* needs to do all the driver ABI structure
> checks that we do in rdma-core.
> 
> .. and absolutely, if the driver implements kernel verbs/common verbs
> there must be a rdma-core verbs provider for it as well. I don't think
> that is unreasonable.

Completely agree on that.

> The question here is if we should even allow drivers under
> drivers/infiniband that don't even implement common verbs, like usnic.

I would have been on board with that in the past but these days I'm not 
so sure. I would like to see us moving away from an "infiniband" 
subsystem to an "rdma" subsystem and embrace other technologies. We 
really do need to rename the subsystem at some point.

As far as usnic, it seems like it's the driver that has been forgotten
about. I see 3 commits in 2018. I'm not saying it's important or not;
that's another debate for another thread.

But for the EFA driver I'm happy to see it. I say treat this as RFC for 
now until we see a libfabric or kernel ULP, or even some other open and 
available user space solution.

-Denny
Jason Gunthorpe Dec. 5, 2018, 5 p.m. UTC | #7
On Wed, Dec 05, 2018 at 11:28:39AM -0500, Dennis Dalessandro wrote:

> > The question here is if we should even allow drivers under
> > drivers/infiniband that don't even implement common verbs, like usnic.
> 
> I would have been on board with that in the past but these days I'm not so
> sure. I would like to see us moving away from an "infiniband" subsystem to
> an "rdma" subsystem and embrace other technologies. We really do need to
> rename the subsystem at some point.

The counterpoint is that an 'RDMA' subsystem is not just 'expose some
DMA queues to user space and let userspace sort it all out'.

That is properly called vfio, or perhaps matches this WarpDrive thing
Kenneth was trying to build.

> As far as usnic, it seems like its the driver that has been forgotten about.
> I see 3 commits in 2018. I'm not saying it's important or not, that's
> another debate for another thread.

There are a lot more than three, and all of them are from non-Cisco
people trying to maintain this driver. Many are mine. 

This is not a plus, it shows why this approach is a burden on the rest
of the community.

> But for the EFA driver I'm happy to see it. I say treat this as RFC for now
> until we see a libfabric or kernel ULP, or even some other open and
> available user space solution.

A non-verbs kernel ULP to a RDMA driver??? That is an entirely
different and less happy discussion!

Subsystems exist to put common things under. They are not a marketing
term. If a driver doesn't use the subsystem's support code, or
implement the subsystems APIs, why should it be in the subsystem?

Jason
Dennis Dalessandro Dec. 5, 2018, 6:06 p.m. UTC | #8
On 12/5/2018 12:00 PM, Jason Gunthorpe wrote:
> On Wed, Dec 05, 2018 at 11:28:39AM -0500, Dennis Dalessandro wrote:
> 
>>> The question here is if we should even allow drivers under
>>> drivers/infiniband that don't even implement common verbs, like usnic.
>>
>> I would have been on board with that in the past but these days I'm not so
>> sure. I would like to see us moving away from an "infiniband" subsystem to
>> an "rdma" subsystem and embrace other technologies. We really do need to
>> rename the subsystem at some point.
> 
> The counterpoint is that a 'RDMA' subsystem is not just 'expose some
> DMA queues to user space and let userspace sort it all out'.

It is a valid point. I'm not saying just expose some DMA queues; there
has to be a valid, well-thought-out API.

> That is properly called vfio, or perhaps matches this WarpDrive thing
> Kenneth was trying to build.
> 
>> As far as usnic, it seems like its the driver that has been forgotten about.
>> I see 3 commits in 2018. I'm not saying it's important or not, that's
>> another debate for another thread.
> 
> There are alot more than three, and all of them are from non-Cisco
> people trying to maintain this driver. Many are mine.
> 
> This is not a plus, it shows why this approach is a burden on the rest
> of the community.

That's the point I was beating around the bush about. As a community we
don't need any fire-and-forget drivers that get upstreamed and then
abandoned.

>> But for the EFA driver I'm happy to see it. I say treat this as RFC for now
>> until we see a libfabric or kernel ULP, or even some other open and
>> available user space solution.
> 
> A non-verbs kernel ULP to a RDMA driver??? That is an entirely
> different and less happy discussion!

Agree, but a valid discussion. However, there isn't anything like that on
the table here, so it's not a rat hole we need to go down right now.

> subsystems exist to put common things under. They are not a marketing
> term. If a driver doesn't use the subsystem's support code, or
> implement the subsystems APIs, why should it be in the subsystem?

Why should we be against the subsystem evolving though? I'm not saying 
we allow just anything, just that we leave it open for discussion, *if* 
something comes up.

Now if we do want to say a driver must support verbs it can't be 
wishy-washy. We should specify exactly the set of verbs that must be 
supported. The IBTA has the notion of some mandatory verbs right, so 
should that be the low bar? Should we whittle that down further? Perhaps 
to the point where there is enough support for the core to manage the 
device from a high level, and bigger picture things like SELinux and 
containers, but leave the data path up to the driver/userspace to go 
figure out?

-Denny
Jason Gunthorpe Dec. 5, 2018, 6:39 p.m. UTC | #9
On Wed, Dec 05, 2018 at 01:06:50PM -0500, Dennis Dalessandro wrote:

> > subsystems exist to put common things under. They are not a marketing
> > term. If a driver doesn't use the subsystem's support code, or
> > implement the subsystems APIs, why should it be in the subsystem?
> 
> Why should we be against the subsystem evolving though? I'm not saying we
> allow just anything, just that we leave it open for discussion, *if*
> something comes up.

Where is the evolution here though? No new services were added to the
core to support this unique driver.

> Now if we do want to say a driver must support verbs it can't be
> wishy-washy. We should specify exactly the set of verbs that must be
> supported. The IBTA has the notion of some mandatory verbs right, so should
> that be the low bar?

I think the unambiguous low bar is enough implemented to run
NVMEoF. That is definitely 'RDMA common verbs' hardware, IMHO.

Some random proprietary MPI acceleration HW is probably something
else. We aren't trying to squeeze Aries/Gemini/PAMI/etc into the RDMA
verbs model after all.

If you want to evolve the RDMA subsystem into some kind of home for
'dynamic process protected direct IO' - then sure, go do that. Get
Kenneth, Gal, etc together and define a suitable user API for creating
totally opaque QPs and we can put a proper core subsystem API together
for this.

This could be interesting, and this would be evolution.

Defining yet another driver private QP type and ENOTSUPPing half the
subsystem is just another usnic hack.

> point where there is enough support for the core to manage the
> device from a high level, and bigger picture things like SELinux and
> containers, but leave the data path up to the driver/userspace to go
> figure out?

So, we just ignored usnic for containers, SELinux, and other modern
stuff. This is another reason why it is a bad idea.

Jason
Dennis Dalessandro Dec. 5, 2018, 6:57 p.m. UTC | #10
On 12/5/2018 1:39 PM, Jason Gunthorpe wrote:
> On Wed, Dec 05, 2018 at 01:06:50PM -0500, Dennis Dalessandro wrote:
> 
>>> subsystems exist to put common things under. They are not a marketing
>>> term. If a driver doesn't use the subsystem's support code, or
>>> implement the subsystems APIs, why should it be in the subsystem?
>>
>> Why should we be against the subsystem evolving though? I'm not saying we
>> allow just anything, just that we leave it open for discussion, *if*
>> something comes up.
> 
> Where is the evolution here though? No new services were added to the
> core to support this unique driver.

Which is why I think we treat this as RFC. It's a start, but there is a 
ways to go.

>> Now if we do want to say a driver must support verbs it can't be
>> wishy-washy. We should specify exactly the set of verbs that must be
>> supported. The IBTA has the notion of some mandatory verbs right, so should
>> that be the low bar?
> 
> I think the unambiguous low bar is enough implemented to run
> NVMEoF. That is definitely 'RDMA common verbs' hardware, IMHO.

Sounds reasonable to me.

> Some random proprietary MPI acceleration HW is probably something
> else. We aren't trying to squeeze Aries/Gemini/PAMI/etc into the RDMA
> verbs model after all.
> 
> If you want to evolve the RDMA subsystem into some kind of home for
> 'dynamic process protected direct IO' - then sure, go do that. Get
> Kenneth, Gal, etc together and define a suitable user API for creating
> totally opaque QPs and we can put a proper core subsystem API together
> for this.
> 
> This could be interesting, and this would be evolution.
> 
> Defining yet another driver private QP type and ENOTSUPPing half the
> subsystem is just a another usnic hack.

Not arguing that. And with the guideline above of being enough to run
NVMEoF, Amazon has a bar to shoot for to be compatible with the subsystem.

>> point where there is enough support for the core to manage the
>> device from a high level, and bigger picture things like SELinux and
>> containers, but leave the data path up to the driver/userspace to go
>> figure out?
> 
> So, we just ignored usnic for containers SELinux and other modern
> stuff. This is another reason why it is a bad idea.

Regardless I think it's clear the low bar for the subsystem going 
forward is verbs support sufficient to run NVMEoF. I don't have any 
objections.

Now the question becomes, does rdma-core (user tool) need to be 
supported? Or is having a libfabric provider going to be sufficient? We
have danced around the topic, but let's make it clear.

-Denny
Hefty, Sean Dec. 5, 2018, 7:14 p.m. UTC | #11
> >> As far as usnic, it seems like its the driver that has been
> forgotten about.
> >> I see 3 commits in 2018. I'm not saying it's important or not,
> that's
> >> another debate for another thread.
> >
> > There are alot more than three, and all of them are from non-Cisco
> > people trying to maintain this driver. Many are mine.
> >
> > This is not a plus, it shows why this approach is a burden on the
> rest
> > of the community.
> 
> That's the point I was beating around the bush about. As a community
> we don't need any fire and forget drivers where it gets upstreamed and
> abandoned.

usNIC is a very simple device.  I'm not sure you can assume that it isn't used or was forgotten just because it's done.  I do know that it is actively maintained on the user space side, even if it doesn't need anything more from the kernel.

I agree that exposing usNIC through verbs was wrong.  However, as a device, I don't see that it's that functionally different than QIB, truescale, or hfi1.  None of those are verbs devices, and throwing a software verbs implementation over them doesn't really change that.

> Now if we do want to say a driver must support verbs it can't be
> wishy-washy. We should specify exactly the set of verbs that must be
> supported. The IBTA has the notion of some mandatory verbs right, so
> should that be the low bar? Should we whittle that down further?

If you go this path, rename the subsystem to linux-verbs or linux-ibverbs.

However, if the subsystem will support exposing non-verbs interfaces (usnic, psm, psm2), then I disagree with placing a requirement that the driver must also provide a software implementation of verbs over it.

- Sean
Jason Gunthorpe Dec. 5, 2018, 7:23 p.m. UTC | #12
On Wed, Dec 05, 2018 at 01:57:56PM -0500, Dennis Dalessandro wrote:

> > > point where there is enough support for the core to manage the
> > > device from a high level, and bigger picture things like SELinux and
> > > containers, but leave the data path up to the driver/userspace to go
> > > figure out?
> > 
> > So, we just ignored usnic for containers SELinux and other modern
> > stuff. This is another reason why it is a bad idea.
> 
> Regardless I think it's clear the low bar for the subsystem going forward is
> verbs support sufficient to run NVMEoF. I don't have any objections.
> 
> Now the question becomes, does rdma-core (user tool) need to be supported?
> Or is having a libfabrics provider going to be sufficient? We have danced
> around the topic, but let's make it clear.

As I said, I think kernel verbs must be accompanied by user verbs in
libibverbs. Otherwise it is impossible to test the implementation
properly.

If someone does this and really doesn't want user verbs then they
should refine the generic non-bypass verbs path we have (ie post send
syscall, etc) and write a generic rdma-core provider to test their
kernel verbs with.

But I can't really see a kernel verbs provider without a userspace one
as an acceptable driver..

Jason
Jason Gunthorpe Dec. 5, 2018, 8:02 p.m. UTC | #13
On Wed, Dec 05, 2018 at 07:14:28PM +0000, Hefty, Sean wrote:
> > >> As far as usnic, it seems like its the driver that has been
> > forgotten about.
> > >> I see 3 commits in 2018. I'm not saying it's important or not,
> > that's
> > >> another debate for another thread.
> > >
> > > There are alot more than three, and all of them are from non-Cisco
> > > people trying to maintain this driver. Many are mine.
> > >
> > > This is not a plus, it shows why this approach is a burden on the
> > rest
> > > of the community.
> > 
> > That's the point I was beating around the bush about. As a community
> > we don't need any fire and forget drivers where it gets upstreamed and
> > abandoned.
> 
> usNIC is a very simple device.  I'm not sure you can assume that it
> isn't used or was forgotten just because it's done.  I do know that
> it is actively maintained on the user space side, even if it doesn't
> need anything more from the kernel.

The driver still had security bugs I had to fix.

I'm not sure it follows our security model (I have this guess it
actually needs CAP_NET_RAW)

This is all bad.

> I agree that exposing usNIC through verbs was wrong.  However, as a
> device, I don't see that it's that functionally different than QIB,
> truescale, or hfi1.  None of those are verbs devices, and throwing a
> software verbs implementation over them doesn't really change that.

The Linux standard that has evolved is that a common verbs driver can
provide extensions beyond verbs through the driver private interface
and still be 'drivers/infiniband'

This makes sense in all cases so far as the extensions lean heavily on
the core code to work properly, ie for addressing/security/etc. 

QIB/HFI are sort of marginal exceptions since their extension was not
designed well as a proper extension, but that is more of a technical
failing than a philosophical one.

So at the userspace level psm/psm2/ucx may be distinct non-verbs
things, but in the kernel we largely treat them as verbs extensions.

> > Now if we do want to say a driver must support verbs it can't be
> > wishy-washy. We should specify exactly the set of verbs that must be
> > supported. The IBTA has the notion of some mandatory verbs right, so
> > should that be the low bar? Should we whittle that down further?
> 
> If you go this path, rename the subsystem to linux-verbs or
> linux-ibverbs.

It *is* the verbs subsystem - that is the only API it provides to the
rest of the kernel today.

> However, if the subsystem will support exposing non-verbs interfaces
> (usnic, psm, psm2), then I disagree with placing a requirement that
> the driver must also provide a software implementation of verbs over
> it.

This is what I mean when I talked about evolution earlier. If you want to
do non-verbs then do it right, not as a special QP hack like usnic
did.

Jason
Parav Pandit Dec. 5, 2018, 8:45 p.m. UTC | #14
> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org <linux-rdma-
> owner@vger.kernel.org> On Behalf Of Jason Gunthorpe
> Sent: Wednesday, December 5, 2018 2:03 PM
> To: Hefty, Sean <sean.hefty@intel.com>
> Cc: Dalessandro, Dennis <dennis.dalessandro@intel.com>; Gal Pressman
> <galpress@amazon.com>; Doug Ledford <dledford@redhat.com>; Alexander
> Matushevsky <matua@amazon.com>; Yossi Leybovich
> <sleybo@amazon.com>; linux-rdma@vger.kernel.org; Tom Tucker
> <tom@opengridcomputing.com>
> Subject: Re: [PATCH rdma-next 00/13] Elastic Fabric Adapter (EFA) driver
> 
> On Wed, Dec 05, 2018 at 07:14:28PM +0000, Hefty, Sean wrote:
> > > >> As far as usnic, it seems like its the driver that has been
> > > forgotten about.
> > > >> I see 3 commits in 2018. I'm not saying it's important or not,
> > > that's
> > > >> another debate for another thread.
> > > >
> > > > There are alot more than three, and all of them are from non-Cisco
> > > > people trying to maintain this driver. Many are mine.
> > > >
> > > > This is not a plus, it shows why this approach is a burden on the
> > > rest
> > > > of the community.
> > >
> > > That's the point I was beating around the bush about. As a community
> > > we don't need any fire and forget drivers where it gets upstreamed
> > > and abandoned.
> >
> > usNIC is a very simple device.  I'm not sure you can assume that it
> > isn't used or was forgotten just because it's done.  I do know that it
> > is actively maintained on the user space side, even if it doesn't need
> > anything more from the kernel.
> 
> The driver still had security bugs I had to fix.
> 
> I'm not sure it follows our security model (I have this guess it actually needs
> CAP_NET_RAW)
> 
> This is all bad.

I was about to post the same comment on the 9th patch, for the rdma_link_layer function efa_port_link_layer(), which reports the link layer as Ethernet.
But I see your reply already, so replying here.
The Ethernet link layer has well-defined L2 headers.
So if there is an intention to create such headers in user space without an actual netdev->ifindex attached to the UD or SRD QP, CAP_NET_RAW must be added.
The kernel cannot depend on the external network's robustness, even though it may well be robust.
The same applies to the RX path.

> 
> > I agree that exposing usNIC through verbs was wrong.  However, as a
> > device, I don't see that it's that functionally different than QIB,
> > truescale, or hfi1.  None of those are verbs devices, and throwing a
> > software verbs implementation over them doesn't really change that.
> 
> The Linux standard that has evolved is that a common verbs driver can
> provide extensions beyond verbs through the driver private interface and
> still be 'drivers/infiniband'
> 
> This makes sense in all cases so far as the extensions lean heavily on the
> core code to work properly, ie for addressing/security/etc.
> 
> QIB/HFI are sort of marginal exceptions since their extension was not
> designed well as a proper extension, but that is more of a technical failing
> than a philosophical one.
> 
> So at the userspace level psm/psm2/ucx may be distinct non-verbs things,
> but in the kernel we largely treat them as verbs extensions.
> 
> > > Now if we do want to say a driver must support verbs it can't be
> > > wishy-washy. We should specify exactly the set of verbs that must be
> > > supported. The IBTA has the notion of some mandatory verbs right, so
> > > should that be the low bar? Should we whittle that down further?
> >
> > If you go this path, rename the subsystem to linux-verbs or
> > linux-ibverbs.
> 
> It *is* the verbs subsystem - that is the only API it provides to the rest of the
> kernel today.
> 
> > However, if the subsystem will support exposing non-verbs interfaces
> > (usnic, psm, psm2), then I disagree with placing a requirement that
> > the driver must also provide a software implementation of verbs over
> > it.
> 
> This is what I mean when I talked about evolution earlier. If want to do non-
> verbs then do it right, not as a special QP hack like usnic did.
> 
> Jason
Hefty, Sean Dec. 5, 2018, 10:37 p.m. UTC | #15
> Ethernet link layer has well defined L2 headers.
> So there is intention to create such headers in user space without
> actually netdev->ifindex attached to UD or SRC QP, CAP_NET_RAW must be
> added.

I haven't seen anything in the reviews that indicate that user space is able to create its own headers over Ethernet.  It looks like it uses a proprietary addressing scheme with one address per device, and my guess is it relies on the SRD/UD QP to generate the headers. 

> > This is what I mean when I talked about evolution earlier. If want
> to
> > do non- verbs then do it right, not as a special QP hack like usnic
> did.

These are the verbs that EFA appears to support based on what is actually being passed to the HW:

Create/destroy QP
Create/destroy CQ
Create/destroy MR
Create/destroy AH

It supports PD calls, but I'm unsure if that's really a thing.  On the surface it looks like EFA is a simplified verbs-like device that supports a proprietary transport, somewhat equivalent to an IB/RoCE device that only supports UD QPs.  Adding a new QP type looks sensible.

I think the mistake in the tree is either having different protocols map to the same QP type value, or not having a separate protocol field provided as part of create QP.

- Sean
Jason Gunthorpe Dec. 5, 2018, 11:12 p.m. UTC | #16
On Wed, Dec 05, 2018 at 10:37:08PM +0000, Hefty, Sean wrote:
> > Ethernet link layer has well defined L2 headers.
> > So there is intention to create such headers in user space without
> > actually netdev->ifindex attached to UD or SRC QP, CAP_NET_RAW must be
> > added.
> 
> I haven't seen anything in the reviews that indicate that user space
> is able to create its own headers over Ethernet.  It looks like it
> uses a proprietary addressing scheme with one address per device,
> and my guess is it relies on the SRD/UD QP to generate the headers.

It has a 16 byte opaque blob for the address (see struct
efa_admin_create_ah_cmd) - who knows what is in there.

If it includes a MAC address then this driver needs CAP_NET_RAW.

If it includes an IP address, and the MAC address is shared with a
netdev in the system, then it also needs CAP_NET_RAW.

Who knows what the format of the UD WRs is; maybe they wrongly
include headers to transmit.

And the same questions belatedly apply to usnic.

> It supports PD calls, but I'm unsure if that's really a thing.  On
> the surface it looks like EFA is a simplified verbs-like device that
> supports a proprietary transport, somewhat equivalent to an IB/RoCE
> device that only supports UD QPs.

The admin mailboxes are passing around a PD #, so to me it looks like
there is no HW object backing the PD but the HW has all the
information to do required per-PD security enforcement..

> Adding a new QP type looks sensible.

We are absolutely not adding new things to common verbs that are not
well specified to the level that someone else can implement them (let
alone use them). I've been adament about this point for a long, long
time now.

> I think the mistake in the tree is either having different protocols
> map to the same QP type value, or not having a separate protocol
> field provided as part of create QP.

The QP type value specifies the API contract to the verbs
consumer. UD, RC, etc all have well defined semantics that must be
adhered to. IIRC usnic broke this too by using a UD QP type that did
not meet all the requirements. 

Like usnic, I have no idea if EFA follows the contract for UD or
not. There is no documentation and no verbs provider to inspect.

Yes, the wire protocol behind the API contract cannot be specified
with existing verbs, that need has yet to come up, and could be met
someday with a create_qp extension.

At least in all present cases the various wire protocols are so
different as to need a different 'struct ib_device' anyhow.

Jason
Hefty, Sean Dec. 6, 2018, 1:37 a.m. UTC | #17
> > I think the mistake in the tree is either having different protocols
> > map to the same QP type value, or not having a separate protocol
> field
> > provided as part of create QP.
> 
> The QP type value specifies the API contract to the verbs consumer.

I disagree.  The primary API contract is determined by the device attributes and capabilities.  The QP type isn't known or used until well after that.

The QP type should only determine the behavior of QP related calls: create/modify/query qp, post_send, post_recv, and ib_wc data.  And even with these, we risk making assumptions.  For example, QP type is independent of memory registration, which affects the input to send/recv.

An issue we've had is linking attributes with orthogonal APIs/behaviors.  Some of these were fixed by adding the rdma_protocol_xxx() and rdma_cap_xxx() calls.  We shouldn't have QP type carrying broad API implications.  IMO, selecting the wire protocol seems ideal, so I still feel that introducing a new QP type is sensible in this case.  Add documentation for which QP/AH attributes are valid for each QP type if that helps.

QPs could be classified at a higher level to define a *general* API flow that's independent of the protocol.  E.g. provide helper functions to mark a QP as either reliable/unreliable and connected/unconnected.  That would provide a minimal list of what verbs should be supported, though details on which fields are used still end up device/protocol specific.
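
For illustration, a minimal sketch of such classification helpers (these do not
exist in the kernel today, and the grouping shown is only one possible choice):

#include <rdma/ib_verbs.h>

/* Hypothetical helpers marking a QP type as connected/reliable. */
static inline bool rdma_qp_is_connected(enum ib_qp_type qp_type)
{
	return qp_type == IB_QPT_RC || qp_type == IB_QPT_UC ||
	       qp_type == IB_QPT_XRC_INI || qp_type == IB_QPT_XRC_TGT;
}

static inline bool rdma_qp_is_reliable(enum ib_qp_type qp_type)
{
	return qp_type == IB_QPT_RC || qp_type == IB_QPT_XRC_INI ||
	       qp_type == IB_QPT_XRC_TGT;
}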

- Sean
Parav Pandit Dec. 6, 2018, 2:37 a.m. UTC | #18
> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Wednesday, December 5, 2018 5:13 PM
> To: Hefty, Sean <sean.hefty@intel.com>
> Cc: Parav Pandit <parav@mellanox.com>; Dalessandro, Dennis
> <dennis.dalessandro@intel.com>; Gal Pressman <galpress@amazon.com>;
> Doug Ledford <dledford@redhat.com>; Alexander Matushevsky
> <matua@amazon.com>; Yossi Leybovich <sleybo@amazon.com>; linux-
> rdma@vger.kernel.org; Tom Tucker <tom@opengridcomputing.com>
> Subject: Re: [PATCH rdma-next 00/13] Elastic Fabric Adapter (EFA) driver
> 
> On Wed, Dec 05, 2018 at 10:37:08PM +0000, Hefty, Sean wrote:
> > > Ethernet link layer has well defined L2 headers.
> > > So there is intention to create such headers in user space without
> > > actually netdev->ifindex attached to UD or SRC QP, CAP_NET_RAW must
> > > be added.
> >
> > I haven't seen anything in the reviews that indicate that user space
> > is able to create its own headers over Ethernet.  It looks like it
> > uses a proprietary addressing scheme with one address per device, and
> > my guess is it relies on the SRD/UD QP to generate the headers.
> 
> It has a 16 byte opaque blob for the address (see struct
> efa_admin_create_ah_cmd) - who knows what is in there.
> 
> If it includes a MAC address then this driver needs CAP_NET_RAW.
> 
> If it includes an IP address, and the MAC address is shared with a netdev in
> the system, then it also needs CAP_NET_RAW.
> 
> Who knows what the format of the UD WR's are, maybe they wrongly
> include headers to transmit.
>

efa_port_link_layer() needs to return a more appropriate link layer, one that reflects the addresses defined in the AH.

 
> And the same questions belatedly apply to usnic.
> 
> > It supports PD calls, but I'm unsure if that's really a thing.  On the
> > surface it looks like EFA is a simplified verbs-like device that
> > supports a proprietary transport, somewhat equivalent to an IB/RoCE
> > device that only supports UD QPs.
> 
> The admin mailboxes are passing around a PD #, so to me it looks like there
> is no HW object backing the PD but the HW has all the information to do
> required per-PD security enforcement..
> 
> > Adding a new QP type looks sensible.
> 
> We are absolutely not adding new things to common verbs that are not well
> specified to the level that someone else can implement them (let alone use
> them). I've been adament about this point for a long, long time now.
> 
> > I think the mistake in the tree is either having different protocols
> > map to the same QP type value, or not having a separate protocol field
> > provided as part of create QP.
> 
> The QP type value specifies the API contract to the verbs consumer. UD, RC,
> etc all have well defined semantics that must be adhered to. IIRC usnic broke
> this too by using a UD QP type that did not meet all the requirements.
> 
> Like usnic, I have no idea if EFA follows the contract for UD or not. There is
> no documentation and no verbs provider to inspect.
> 
> Yes, the wire protocol behind the API contract cannot be specified with
> existing verbs, that need has yet to come up, and could be met someday
> with a create_qp extension.
> 
> At least in all present cases the various wire protocols are so different as to
> need a different 'struct ib_device' anyhow.
> 
> Jason
Gal Pressman Dec. 6, 2018, 10:40 a.m. UTC | #19
On 05-Dec-18 17:15, Jason Gunthorpe wrote:
> On Wed, Dec 05, 2018 at 10:55:59AM +0200, Gal Pressman wrote:
>> On 04-Dec-18 17:33, Jason Gunthorpe wrote:
>>> On Tue, Dec 04, 2018 at 02:04:16PM +0200, Gal Pressman wrote:
>>>> Hello all,
>>>> The following patchset introduces the Elastic Fabric Adapter (EFA) driver, that
>>>> was pre-announced by Amazon [1].
>>>>
>>>> EFA is a networking adapter designed to support user space network
>>>> communication, initially offered in the Amazon EC2 environment. First release
>>>> of EFA supports datagram send/receive operations and does not support
>>>> connection-oriented or read/write operations.
>>>>
>>>> EFA supports unreliable datagrams (UD) as well as a new unordered, scalable
>>>> reliable datagram protocol (SRD). SRD provides support for reliable datagrams
>>>> and more complete error handling than typical RD, but, unlike RD, it does not
>>>> support ordering nor segmentation. A new queue pair type, IB_QPT_SRD, is added
>>>> to expose this new queue pair type.
>>>> User verbs are supported via a dedicated userspace libfabric provider.
>>>> Kernel verbs and in-kernel services are initially not supported.
>>>
>>> There are some general expectations for new drivers in rdma you should
>>> be aware of..
>>>
>>> 1) Accepting usnic to RDMA is widely regarded as a mistake. This means
>>>    that future drivers that do not implement 'enough' common verbs are
>>>    not likely to be accepted, as they are not RDMA devices. I'm
>>>    worried this driver doesn't cross the threshold.
>>
>> Hi Jason,
>> What do you mean by enough common verbs? 
> 
> No one has really come up with a guideline, other than usnic was a
> mistake.
> 
>> we implement everything our device can support for now - and we
>> provide a functional RDMA device. 
> 
> Is it a 'functional RDMA device'? It doesn't work with any kernel ULP,
> it doesn't have a libibverbs provider, it doesn't implement any
> pre-existing protocol... It seems just like usnic.
> 
> .. and there is no documentation, what is the format for AH's on UD?
> does it do the verbs memory layout with the 40 byte offset? etc. Just
> like usnic :|

The usage of AHs is similar to the IB spec: a GID is used to create the AH, and
an opaque handle is returned from the device to be used in TX work requests.

The opaque handle is reported in RX work request completions; in case the
address handle was not yet registered, the device reports the GID in an extended
completion entry. Therefore, we do not need the 40 byte offset memory layout.
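
As a rough illustration of that flow in generic verbs terms (EFA's actual
userspace is a libfabric provider, so the libibverbs calls below are only a
stand-in; pd, qp, mr, buf, len, remote_gid, remote_qpn and qkey are assumed to
be set up elsewhere):

#include <stdint.h>
#include <infiniband/verbs.h>

static int post_ud_send(struct ibv_pd *pd, struct ibv_qp *qp,
			struct ibv_mr *mr, void *buf, uint32_t len,
			union ibv_gid remote_gid, uint32_t remote_qpn,
			uint32_t qkey)
{
	/* Create the AH from the destination GID; the device hands back
	 * an opaque handle used in TX work requests. */
	struct ibv_ah_attr ah_attr = {
		.is_global = 1,
		.grh.dgid  = remote_gid,
		.port_num  = 1,
	};
	struct ibv_ah *ah = ibv_create_ah(pd, &ah_attr);
	if (!ah)
		return -1;

	struct ibv_sge sge = {
		.addr   = (uintptr_t)buf,
		.length = len,
		.lkey   = mr->lkey,
	};
	struct ibv_send_wr wr = {
		.sg_list    = &sge,
		.num_sge    = 1,
		.opcode     = IBV_WR_SEND,
		.send_flags = IBV_SEND_SIGNALED,
		.wr.ud = {
			.ah          = ah,
			.remote_qpn  = remote_qpn,
			.remote_qkey = qkey,
		},
	};
	struct ibv_send_wr *bad_wr;

	return ibv_post_send(qp, &wr, &bad_wr);
}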

> 
>> Can you please be more specific as to what is missing?  We can also
>> implement a basic set of kernel verbs, but since we don't support RC
>> QP type there will not be any use for them - I wouldn't want to add
>> dead code to the driver just for the sake of having an
>> implementation.
> 
> I agree you shouldn't implement kernel verbs for UD. What is missing
> is that your HW does not seem to be RDMA hardware. Implementing a
> undocumented UD-only, no kverbs, a propritary protocol and
> libfabric-only is exactly what usnic did, and is exactly why it has
> been viewed as a mistake.
> 
> Using rdma has a hacky way to do shared process vfio is not
> really appropriate.

EFA is much more similar to the IB spec than to vfio: it enforces PD protection,
uses remote QP number addressing and standard AH/MR registration, does not
allow userspace to send arbitrary packets, and packet header construction is done
on the device.

> 
>>> 2) Any change to common verbs must be supported by full public
>>>    documentation good enough to allow another implementation. So
>>>    most likely the IB_QPT_SRD has to go away (this is a NAK). Driver
>>>    specific QP types can be implemented via driver udata.
>>
>> We can use driver specific QP type for now, perhaps in the future we'll be able
>> to expose more detailed documentation about SRD.
> 
> Without kverbs or rdma-core it is irrelevant what you call it.
>  
>>> 3) We need to see the userspace for new drivers. A RDMA driver that
>>>    doesn't provide a useful rdma-core provider is deeply suspect as
>>>    not crossing the #1 threshold.
>>
>> We'll be publishing our libfabric userspace provider soon, I'll add
>> a link once it's available.
> 
> libfabric does not do the checks on the ABI construction that
> rdma-core does, so that still has to be done somehow
>  
> Jason
>
Gal Pressman Dec. 6, 2018, 10:40 a.m. UTC | #20
On 05-Dec-18 22:45, Parav Pandit wrote:
> 
> 
>> -----Original Message-----
>> From: linux-rdma-owner@vger.kernel.org <linux-rdma-
>> owner@vger.kernel.org> On Behalf Of Jason Gunthorpe
>> Sent: Wednesday, December 5, 2018 2:03 PM
>> To: Hefty, Sean <sean.hefty@intel.com>
>> Cc: Dalessandro, Dennis <dennis.dalessandro@intel.com>; Gal Pressman
>> <galpress@amazon.com>; Doug Ledford <dledford@redhat.com>; Alexander
>> Matushevsky <matua@amazon.com>; Yossi Leybovich
>> <sleybo@amazon.com>; linux-rdma@vger.kernel.org; Tom Tucker
>> <tom@opengridcomputing.com>
>> Subject: Re: [PATCH rdma-next 00/13] Elastic Fabric Adapter (EFA) driver
>>
>> On Wed, Dec 05, 2018 at 07:14:28PM +0000, Hefty, Sean wrote:
>>>>>> As far as usnic, it seems like its the driver that has been
>>>> forgotten about.
>>>>>> I see 3 commits in 2018. I'm not saying it's important or not,
>>>> that's
>>>>>> another debate for another thread.
>>>>>
>>>>> There are alot more than three, and all of them are from non-Cisco
>>>>> people trying to maintain this driver. Many are mine.
>>>>>
>>>>> This is not a plus, it shows why this approach is a burden on the
>>>> rest
>>>>> of the community.
>>>>
>>>> That's the point I was beating around the bush about. As a community
>>>> we don't need any fire and forget drivers where it gets upstreamed
>>>> and abandoned.
>>>
>>> usNIC is a very simple device.  I'm not sure you can assume that it
>>> isn't used or was forgotten just because it's done.  I do know that it
>>> is actively maintained on the user space side, even if it doesn't need
>>> anything more from the kernel.
>>
>> The driver still had security bugs I had to fix.
>>
>> I'm not sure it follows our security model (I have this guess it actually needs
>> CAP_NET_RAW)
>>
>> This is all bad.
> 
> I was about to reply same comment for 9th patch for function rdma_link_layer efa_port_link_layer() which says its Ethernet.
> But I see your reply already,  replying here.
> Ethernet link layer has well defined L2 headers.
> So there is intention to create such headers in user space without actually netdev->ifindex attached to UD or SRC QP, CAP_NET_RAW must be added.

There is no intention to create headers in userspace; header construction is done
by the device, hence there is no need to add CAP_NET_RAW.
The physical link layer is Ethernet, but for all practical purposes it shouldn't
matter to the driver/userspace provider.

> Kernel cannot depend on external network's robustness, even though it may be.
> Same for the rx path too.
> 
>>
>>> I agree that exposing usNIC through verbs was wrong.  However, as a
>>> device, I don't see that it's that functionally different than QIB,
>>> truescale, or hfi1.  None of those are verbs devices, and throwing a
>>> software verbs implementation over them doesn't really change that.
>>
>> The Linux standard that has evolved is that a common verbs driver can
>> provide extensions beyond verbs through the driver private interface and
>> still be 'drivers/infiniband'
>>
>> This makes sense in all cases so far as the extensions lean heavily on the
>> core code to work properly, ie for addressing/security/etc.
>>
>> QIB/HFI are sort of marginal exceptions since their extension was not
>> designed well as a proper extension, but that is more of a technical failing
>> than a philosophical one.
>>
>> So at the userspace level psm/psm2/ucx may be distinct non-verbs things,
>> but in the kernel we largely treat them as verbs extensions.
>>
>>>> Now if we do want to say a driver must support verbs it can't be
>>>> wishy-washy. We should specify exactly the set of verbs that must be
>>>> supported. The IBTA has the notion of some mandatory verbs right, so
>>>> should that be the low bar? Should we whittle that down further?
>>>
>>> If you go this path, rename the subsystem to linux-verbs or
>>> linux-ibverbs.
>>
>> It *is* the verbs subsystem - that is the only API it provides to the rest of the
>> kernel today.
>>
>>> However, if the subsystem will support exposing non-verbs interfaces
>>> (usnic, psm, psm2), then I disagree with placing a requirement that
>>> the driver must also provide a software implementation of verbs over
>>> it.
>>
>> This is what I mean when I talked about evolution earlier. If want to do non-
>> verbs then do it right, not as a special QP hack like usnic did.
>>
>> Jason
Gal Pressman Dec. 6, 2018, 10:41 a.m. UTC | #21
On 06-Dec-18 04:37, Parav Pandit wrote:
> 
> 
>> -----Original Message-----
>> From: Jason Gunthorpe <jgg@ziepe.ca>
>> Sent: Wednesday, December 5, 2018 5:13 PM
>> To: Hefty, Sean <sean.hefty@intel.com>
>> Cc: Parav Pandit <parav@mellanox.com>; Dalessandro, Dennis
>> <dennis.dalessandro@intel.com>; Gal Pressman <galpress@amazon.com>;
>> Doug Ledford <dledford@redhat.com>; Alexander Matushevsky
>> <matua@amazon.com>; Yossi Leybovich <sleybo@amazon.com>; linux-
>> rdma@vger.kernel.org; Tom Tucker <tom@opengridcomputing.com>
>> Subject: Re: [PATCH rdma-next 00/13] Elastic Fabric Adapter (EFA) driver
>>
>> On Wed, Dec 05, 2018 at 10:37:08PM +0000, Hefty, Sean wrote:
>>>> Ethernet link layer has well defined L2 headers.
>>>> So if there is an intention to create such headers in user space, without a
>>>> netdev->ifindex actually attached to the UD or SRD QP, CAP_NET_RAW must
>>>> be added.
>>>
>>> I haven't seen anything in the reviews that indicate that user space
>>> is able to create its own headers over Ethernet.  It looks like it
>>> uses a proprietary addressing scheme with one address per device, and
>>> my guess is it relies on the SRD/UD QP to generate the headers.
>>
>> It has a 16 byte opaque blob for the address (see struct
>> efa_admin_create_ah_cmd) - who knows what is in there.
>>
>> If it includes a MAC address then this driver needs CAP_NET_RAW.
>>
>> If it includes an IP address, and the MAC address is shared with a netdev in
>> the system, then it also needs CAP_NET_RAW.
>>
>> Who knows what the format of the UD WR's are, maybe they wrongly
>> include headers to transmit.
>>
> 
> efa_port_link_layer() needs to return a more appropriate link layer that reflects the addresses defined in the AH.
> 

We can add a new EFA link layer if it makes sense; the address format is
proprietary (it does not match existing link types).
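
For reference, the callback under discussion is a one-liner. A sketch of the two
options (IB_LINK_LAYER_ETHERNET is the existing enum value in
include/rdma/ib_verbs.h; IB_LINK_LAYER_EFA is hypothetical and does not exist
today):

/* Sketch only: the get_link_layer port callback as it could look in
 * efa_verbs.c. Returning the physical link layer is what the posted
 * driver does; a dedicated value would need a new enum entry first.
 */
static enum rdma_link_layer efa_port_link_layer(struct ib_device *ibdev,
                                                u8 port_num)
{
        return IB_LINK_LAYER_ETHERNET;  /* or a future IB_LINK_LAYER_EFA */
}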

>  
>> And the same questions belatedly apply to usnic.
>>
>>> It supports PD calls, but I'm unsure if that's really a thing.  On the
>>> surface it looks like EFA is a simplified verbs-like device that
>>> supports a proprietary transport, somewhat equivalent to an IB/RoCE
>>> device that only supports UD QPs.
>>
>> The admin mailboxes are passing around a PD #, so to me it looks like there
>> is no HW object backing the PD but the HW has all the information to do
>> required per-PD security enforcement..
>>
>>> Adding a new QP type looks sensible.
>>
>> We are absolutely not adding new things to common verbs that are not well
>> specified to the level that someone else can implement them (let alone use
>> them). I've been adamant about this point for a long, long time now.
>>
>>> I think the mistake in the tree is either having different protocols
>>> map to the same QP type value, or not having a separate protocol field
>>> provided as part of create QP.
>>
>> The QP type value specifies the API contract to the verbs consumer. UD, RC,
>> etc all have well defined semantics that must be adhered to. IIRC usnic broke
>> this too by using a UD QP type that did not meet all the requirements.
>>
>> Like usnic, I have no idea if EFA follows the contract for UD or not. There is
>> no documentation and no verbs provider to inspect.
>>
>> Yes, the wire protocol behind the API contract cannot be specified with
>> existing verbs, that need has yet to come up, and could be met someday
>> with a create_qp extension.
>>
>> At least in all present cases the various wire protocols are so different as to
>> need a different 'struct ib_device' anyhow.
>>
>> Jason
Gal Pressman Dec. 6, 2018, 10:42 a.m. UTC | #22
On 06-Dec-18 00:37, Hefty, Sean wrote:
>> Ethernet link layer has well defined L2 headers.
>> So if there is an intention to create such headers in user space, without a
>> netdev->ifindex actually attached to the UD or SRD QP, CAP_NET_RAW must be
>> added.
> 
> I haven't seen anything in the reviews that indicate that user space is able to create its own headers over Ethernet.  It looks like it uses a proprietary addressing scheme with one address per device, and my guess is it relies on the SRD/UD QP to generate the headers. 

Correct.

> 
>>> This is what I meant when I talked about evolution earlier. If you want to
>>> do non-verbs then do it right, not as a special QP hack like usnic did.
> 
> These are the verbs that EFA appears to support based on what is actually being passed to the HW:
> 
> Create/destroy QP
> Create/destroy CQ
> Create/destroy MR
> Create/destroy AH
> 
> It supports PD calls, but I'm unsure if that's really a thing.  On the surface it looks like EFA is a simplified verbs-like device that supports a proprietary transport, somewhat equivalent to an IB/RoCE device that only supports UD QPs.  Adding a new QP type looks sensible.

PD number is allocated by the driver and enforced by the device.
I think that adding a new QP type makes sense as well, but we can use the driver
specific QP type if that's the preferred way.

> 
> I think the mistake in the tree is either having different protocols map to the same QP type value, or not having a separate protocol field provided as part of create QP.
> 
> - Sean
>
Gal Pressman Dec. 6, 2018, 10:42 a.m. UTC | #23
On 06-Dec-18 01:12, Jason Gunthorpe wrote:
> On Wed, Dec 05, 2018 at 10:37:08PM +0000, Hefty, Sean wrote:
>>> Ethernet link layer has well defined L2 headers.
>>> So if there is an intention to create such headers in user space, without a
>>> netdev->ifindex actually attached to the UD or SRD QP, CAP_NET_RAW must be
>>> added.
>>
>> I haven't seen anything in the reviews that indicate that user space
>> is able to create its own headers over Ethernet.  It looks like it
>> uses a proprietary addressing scheme with one address per device,
>> and my guess is it relies on the SRD/UD QP to generate the headers.
> 
> It has a 16 byte opaque blob for the address (see struct
> efa_admin_create_ah_cmd) - who knows what is in there.

It's a proprietary GID specific to EC2 networks.

> 
> If it includes a MAC address then this driver needs CAP_NET_RAW.

The GID is passed on AH creation; on work request submission the AH index is
used. The headers are constructed on the device, hence there is no need for
CAP_NET_RAW.
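
To make the flow concrete, a minimal libibverbs-style sketch of the consumer
side (the context/PD/CQ/QP/MR setup is assumed; all values are placeholders):

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* The destination is described only by the opaque 16-byte GID at AH
 * creation time and by the AH handle afterwards; userspace never builds
 * a wire header.
 */
static int send_to_peer(struct ibv_pd *pd, struct ibv_qp *qp,
                        struct ibv_mr *mr, void *buf, uint32_t len,
                        const uint8_t *peer_gid, uint32_t peer_qpn,
                        uint32_t qkey)
{
        struct ibv_ah_attr ah_attr = {
                .is_global = 1,
                .port_num  = 1,
        };
        struct ibv_sge sge = {
                .addr   = (uintptr_t)buf,
                .length = len,
                .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
                .opcode     = IBV_WR_SEND,
                .sg_list    = &sge,
                .num_sge    = 1,
                .send_flags = IBV_SEND_SIGNALED,
        }, *bad_wr;
        struct ibv_ah *ah;

        memcpy(ah_attr.grh.dgid.raw, peer_gid, 16);
        ah = ibv_create_ah(pd, &ah_attr);

        wr.wr.ud.ah          = ah;      /* only the AH handle goes on the WR */
        wr.wr.ud.remote_qpn  = peer_qpn;
        wr.wr.ud.remote_qkey = qkey;
        return ibv_post_send(qp, &wr, &bad_wr);
}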

> 
> If it includes an IP address, and the MAC address is shared with a
> netdev in the system, then it also needs CAP_NET_RAW.

It does not include an IP address.

> 
> Who knows what the format of the UD WR's are, maybe they wrongly
> include headers to transmit.

UD/SRD WRs do not include headers; this will be visible in our open source
userspace libfabric provider.

> 
> And the same questions belatedly apply to usnic.
> 
>> It supports PD calls, but I'm unsure if that's really a thing.  On
>> the surface it looks like EFA is a simplified verbs-like device that
>> supports a proprietary transport, somewhat equivalent to an IB/RoCE
>> device that only supports UD QPs.
> 
> The admin mailboxes are passing around a PD #, so to me it looks like
> there is no HW object backing the PD but the HW has all the
> information to do required per-PD security enforcement..

Correct.

> 
>> Adding a new QP type looks sensible.
> 
> We are absolutely not adding new things to common verbs that are not
> well specified to the level that someone else can implement them (let
> alone use them). I've been adamant about this point for a long, long
> time now.
> 
>> I think the mistake in the tree is either having different protocols
>> map to the same QP type value, or not having a separate protocol
>> field provided as part of create QP.
> 
> The QP type value specifies the API contract to the verbs
> consumer. UD, RC, etc all have well defined semantics that must be
> adhered to. IIRC usnic broke this too by using a UD QP type that did
> not meet all the requirements. 
> 
> Like usnic, I have no idea if EFA follows the contract for UD or
> not. There is no documentation and no verbs provider to inspect.
> 
> Yes, the wire protocol behind the API contract cannot be specified
> with existing verbs, that need has yet to come up, and could be met
> someday with a create_qp extension.
> 
> At least in all present cases the various wire protocols are so
> different as to need a different 'struct ib_device' anyhow.
> 
> Jason
>
Jason Gunthorpe Dec. 6, 2018, 4:09 p.m. UTC | #24
On Thu, Dec 06, 2018 at 12:40:26PM +0200, Gal Pressman wrote:

> The opaque handle is reported in the RX work requests completions;
> in case the address handle was not yet registered, the device
> reports the GID in an extended completion entry. Therefore, we do
> not need the 40 byte offset memory layout.

If the protocol doesn't have the 40 byte offset then it isn't
IBV_QPT_UD, and you'll need to also use a driver QP for this as well.
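
For anyone following along, the convention in question: IBV_QPT_UD receive
buffers always reserve the first 40 bytes for a GRH, whether or not one
arrives. A minimal sketch with placeholder names (the MR/QP setup and the
polled completion 'wc' are assumed elsewhere):

#include <infiniband/verbs.h>
#include <stdint.h>

static void ud_recv_example(struct ibv_qp *qp, struct ibv_mr *mr,
                            char *buf, uint32_t mtu)
{
        struct ibv_sge sge = {
                .addr   = (uintptr_t)buf,
                .length = 40 + mtu,     /* GRH space + payload */
                .lkey   = mr->lkey,
        };
        struct ibv_recv_wr rwr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
        struct ibv_recv_wr *bad;

        ibv_post_recv(qp, &rwr, &bad);

        /* After a completion 'wc' is polled for this WR:
         *   wc.wc_flags & IBV_WC_GRH -> the GRH space actually holds a GRH
         *   buf + 40                 -> payload, at a fixed offset either way
         */
}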

Jason
Hefty, Sean Dec. 6, 2018, 4:31 p.m. UTC | #25
> > efa_port_link_layer() needs to return a more appropriate link layer
> > that reflects the addresses defined in the AH.
> >
> 
> We can add a new EFA link layer if it makes sense, the address format
> is proprietary (does not match existing link types).

If the link layer is truly Ethernet, I don’t think it makes sense to return anything else.  The link layer should not define the network addressing used in the AH.  Introducing a new address format field for this purpose makes more sense.

- Sean
Jason Gunthorpe Dec. 6, 2018, 5:16 p.m. UTC | #26
On Thu, Dec 06, 2018 at 04:31:40PM +0000, Hefty, Sean wrote:
> > > efa_port_link_layer() needs to return a more appropriate link layer
> > > that reflects the addresses defined in the AH.
> > >
> > 
> > We can add a new EFA link layer if it makes sense, the address format
> > is proprietary (does not match existing link types).
> 
> If the link layer is truly Ethernet, I don’t think it makes sense to
> return anything else.  The link layer should not define the network
> addressing used in the AH.  Introducing a new address format field
> for this purpose makes more sense.

Yah, we have the crazy if (rdma_cap*) bleck stuff for figuring out
what the AH is in the kernel.

But none of this really matters for a non-verbs driver.

Jason
Jason Gunthorpe Dec. 6, 2018, 7:04 p.m. UTC | #27
On Thu, Dec 06, 2018 at 01:37:07AM +0000, Hefty, Sean wrote:
> > > I think the mistake in the tree is either having different protocols
> > > map to the same QP type value, or not having a separate protocol
> > field
> > > provided as part of create QP.
> > 
> > The QP type value specifies the API contract to the verbs consumer.
> 
> I disagree.  The primary API contract is determined by the device
> attributes and capabilities.  The QP type isn't known or used until
> well after that.
> 
> The QP type should only determine the behavior of QP related calls:
> create/modify/query qp, post_send, post_recv, and ib_wc data.  And
> even with these, we risk making assumptions.  For example, QP type
> is independent of memory registration, which affects the input to
> send/recv.

This is what I meant: the QP type defines the API contract for QP
related parts (ie QPT_UD has the 40 byte thing, etc)

We only have some fairly small variations due to iWarp, but for the
most part this should be enforced.

> An issue we've had is linking attributes with orthogonal
> APIs/behaviors.  Some of these were fixed by adding the
> rdma_protocol_xxx() and rdma_cap_xxx() calls.  We shouldn't have QP
> type carrying broad API implications.  IMO, selecting the wire
> protocol seems ideal, so I still feel that introducing a new QP type
> is sensible in this case.  Add documentation for which QP/AH
> attributes are valid for each QP type if that helps.

QP type doesn't select wire protocol, it fundamentally selects how the
send and recv queue works.

We have the various cap and related calls in-kernel primarily to deal
with the AH differences, with a few related to the MR problem on iWarp.

> QPs could be classified at a higher level to define a *general* API
> flow that's independent of the protocol.  E.g. provide helper
> functions to mark a QP as either reliable/unreliable and
> connected/unconnected.  That would provide a minimal list of what
> verbs should be supported, though details on which fields are used
> still end up device/protocol specific.

This is all implied by the QPT today. You can't have a QPT_RC that
isn't reliable, or in-order, or doesn't support RDMA. That would be
some other QPT.

You can't do UD without the 40 bytes and the usual other things. UD
can't start doing RDMA, etc.

Jason
Gal Pressman Dec. 9, 2018, 9:24 a.m. UTC | #28
On 06-Dec-18 19:16, Jason Gunthorpe wrote:
> On Thu, Dec 06, 2018 at 04:31:40PM +0000, Hefty, Sean wrote:
>>>> efa_port_link_layer() needs to return a more appropriate link layer
>>>> that reflects the addresses defined in the AH.
>>>>
>>>
>>> We can add a new EFA link layer if it makes sense, the address format
>>> is proprietary (does not match existing link types).
>>
>> If the link layer is truly Ethernet, I don’t think it makes sense to
>> return anything else.  The link layer should not define the network
>> addressing used in the AH.  Introducing a new address format field
>> for this purpose makes more sense.
> 
> Yah, we have the crazy if (rdma_cap*) bleck stuff for figuring out
> what the AH is in the kernel.
> 
> But none of this really matters for a non-verbs driver.
> 
> Jason
> 

We can add a cap for EFA AHs to clarify that the format is different.
Gal Pressman Dec. 9, 2018, 9:26 a.m. UTC | #29
On 06-Dec-18 18:09, Jason Gunthorpe wrote:
> On Thu, Dec 06, 2018 at 12:40:26PM +0200, Gal Pressman wrote:
> 
>> The opaque handle is reported in the RX work requests completions;
>> in case the address handle was not yet registered, the device
>> reports the GID in an extended completion entry. Therefore, we do
>> not need the 40 byte offset memory layout.
> 
> If the protocol doesn't have the 40 byte offset then it isn't
> IBV_QPT_UD, and you'll need to also use a driver QP for this as well.
> 
> Jason
> 

I need to figure out the exact details, but we can probably configure the device
to do the 40 bytes offset, although we don't really need it.

Is this the only thing we're missing for the IBV_QPT_UD?
Jason Gunthorpe Dec. 9, 2018, 4:48 p.m. UTC | #30
On Sun, Dec 09, 2018 at 11:26:02AM +0200, Gal Pressman wrote:
> On 06-Dec-18 18:09, Jason Gunthorpe wrote:
> > On Thu, Dec 06, 2018 at 12:40:26PM +0200, Gal Pressman wrote:
> > 
> >> The opaque handle is reported in the RX work requests completions;
> >> in case the address handle was not yet registered, the device
> >> reports the GID in an extended completion entry. Therefore, we do
> >> not need the 40 byte offset memory layout.
> > 
> > If the protocol doesn't have the 40 byte offset then it isn't
> > IBV_QPT_UD, and you'll need to also use a driver QP for this as well.
> > 
> > Jason
> > 
> 
> I need to figure out the exact details, but we can probably configure the device
> to do the 40 bytes offset, although we don't really need it.
> 
> Is this the only thing we're missing for the IBV_QPT_UD?

You're asking me? It is your responsibility to make sure the device
implements all mandatory IBTA verbs semantics for using IBV_QPT_UD

Jason
Gal Pressman Dec. 20, 2018, 9:28 a.m. UTC | #31
On 04-Dec-18 14:04, Gal Pressman wrote:
> Hello all,
> The following patchset introduces the Elastic Fabric Adapter (EFA) driver, that
> was pre-announced by Amazon [1].
> 
> EFA is a networking adapter designed to support user space network
> communication, initially offered in the Amazon EC2 environment. First release
> of EFA supports datagram send/receive operations and does not support
> connection-oriented or read/write operations.
> 
> EFA supports unreliable datagrams (UD) as well as a new unordered, scalable
> reliable datagram protocol (SRD). SRD provides support for reliable datagrams
> and more complete error handling than typical RD, but, unlike RD, it does not
> support ordering nor segmentation. A new queue pair type, IB_QPT_SRD, is added
> to expose this new queue pair type.
> User verbs are supported via a dedicated userspace libfabric provider.
> Kernel verbs and in-kernel services are initially not supported.
> 
> EFA enabled EC2 instances have two different devices allocated, one for ENA
> (netdev) and one for EFA, the two are separate pci devices with no in-kernel
> communication between them.
> 
> Thanks,
> Gal
> 
> [1] https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-elastic-fabric-adapter/
> 

Hi Jason,

It looks like the discussion didn't come to a conclusion; I'm trying to come up
with a plan going forward and would like to get your opinion.

I followed the comments and your concerns and I'll try to address them all:
Let me start by making clear that EFA is not an InfiniBand device, nor does it
aspire to be one, but I think it does fit the verbs model.

All technical comments will be fixed.

We can implement an rdma-core (libibverbs) userspace provider with support for
standard UD (including the 40 bytes offset) and SRD QPs through direct verbs.
I'll also add documentation for SRD QP type, even if we end up using it as a
driver QP type.

The create/destroy AH issue will be solved with the sleepable flag:
EFA can return -EOPNOTSUPP when called in an atomic context. When we add
kernel verbs we can solve that the same way the bnxt driver did (polling for
completion).
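
A rough sketch of what that looks like on the driver side (the callback
prototype and the flag name track whatever the core ends up with, and
efa_ah_create_cmd() is a made-up helper, so treat this as illustrative only):

/* Reject AH creation in atomic context instead of sleeping in the admin
 * command path; callers that can sleep get the normal flow.
 */
struct ib_ah *efa_create_ah(struct ib_pd *ibpd,
                            struct rdma_ah_attr *ah_attr,
                            u32 flags,
                            struct ib_udata *udata)
{
        if (!(flags & RDMA_CREATE_AH_SLEEPABLE))
                return ERR_PTR(-EOPNOTSUPP);

        /* Sleeps: issues a device admin command and waits for completion. */
        return efa_ah_create_cmd(ibpd, ah_attr, udata);
}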

The EFA wire protocol is tightly coupled to the wire protocol for EC2’s VPC
software defined network, which Amazon considers one of its proprietary
differentiating features.
We can’t share many of the details of the wire protocol as part of open sourcing
the kernel driver, but are happy to share details on any customer-visible
features, such as guarantees around our SRD protocol.
Since EFA is not designed to be used independently of EC2’s VPC data plane, we
don’t believe the lack of a well-documented wire protocol impacts customers in
any meaningful way.

Kernel verbs are not supported right now, but we do have future plans to support
that.
I know future plans are probably not something you care for, and I can't give it
a time frame right now - but it's not overlooked.
We are driven by our customers, and they have shown interest in this.

We are focused on our customer base, and due to our product offering we haven't
seen customer demand for nvmeof support, which you have set as the bar for the
RDMA subsystem.
I'd really like to avoid implementing things that do not interest our customers
and will not have actual use.

I genuinely believe that EFA belongs in the RDMA subsystem, a lot more than
vfio/anywhere else.
We enforce PDs and MRs in the device, use standard AH registration and remote QP
number addressing, construct packet headers on the device, etc.
We are not simply hacking our way to userspace through the subsystem.

We can implement the driver in a different subsystem, but I truly believe that
no one will benefit from that.

Thanks,
Gal
Hefty, Sean Dec. 20, 2018, 10:28 p.m. UTC | #32
> I followed the comments and your concerns and I'll try to address them all:
> Let me start by making clear that EFA is not an infiniband device, nor it
> aspires to being one, but I think it does fit the verbs model.

I agree.  It has the general abstract hardware semantics that I associated as 'verbs' - PD, MR, AH, QP, CQ.

> We can implement an rdma-core (libibverbs) userspace provider with support for
> standard UD (including the 40 bytes offset) and SRD QPs through direct verbs.
> I'll also add documentation for SRD QP type, even if we end up using it as a
> driver QP type.

IMO, IB hardware semantic quirks should not carry over to more generic QP type definitions.  EFA and IB have different addressing formats, so an app can't just switch between one UD QP and another.

For comparison, RC QPs refer to a transport neutral reliable-connected QP.  It went even further in that two RC QPs could use completely different CMs to setup the connection, or in the IB case, none at all!  The feature set (e.g. atomics) supported by different RC transports aren't even the same.

I still view SRD as a generic QP type for reliable-unconnected send and receive messages.  In the EFA case, the max message size is limited to a single MTU, but other implementations may not have this limit.  I don't see any benefit to hiding the real QP type behind a single driver defined value.

Along this line, given the constraints of the existing APIs, if a reliable-unconnected QP that supported RDMA read/write were ever introduced, I would also add that as a new QP type.

- Sean
Jason Gunthorpe Dec. 20, 2018, 10:48 p.m. UTC | #33
On Thu, Dec 20, 2018 at 10:28:41PM +0000, Hefty, Sean wrote:
> > I followed the comments and your concerns and I'll try to address them all:
> > Let me start by making clear that EFA is not an infiniband device, nor it
> > aspires to being one, but I think it does fit the verbs model.
> 
> I agree.  It has the general abstract hardware semantics that I
> associated as 'verbs' - PD, MR, AH, QP, CQ.

Almost all modern high performance hardware has some variation on
QP's and CQ's.

I'm not sure exactly what a device that can't do RDMA is doing with
MRs. 

It doesn't have rkeys, right? So the PD&MR is just some variation of
lkey gather/scatter from user memory??

A MR without an rkey capability is not a MR, we've been calling that
object a 'umem' in verbs lately.

This is my objection to these drivers, the device does not implement
anything close to the expected verbs semantics for any of the objects
it seems to use. 

What value is a PD if there are no rkeys? The only real purpose of a
PD is to provide security via rkey groupings - so why does EFA make a
fake PD? To forcibly fit into verbs, it seems.

> > We can implement an rdma-core (libibverbs) userspace provider with
> > support for standard UD (including the 40 bytes offset) and SRD
> > QPs through direct verbs.  I'll also add documentation for SRD QP
> > type, even if we end up using it as a driver QP type.
> 
> IMO, IB hardware semantic quirks should not carry over to more
> generic QP type definitions.  EFA and IB have different addressing
> formats, so an app can't just switch between one UD QP and another.

The QP type must specify broadly how the WR's and WC's work; anything
else is madness. Each protocol doesn't get to overload and redefine
the existing well standardized definitions. If it doesn't conform to
the WC and WR model from IBTA it does not get to use the
IBV_QPT_UD/RC/UC/RD values. That's verbs. Even iWarp puts in the 40
byte header and doesn't need it.

> I still view SRD as a generic QP type for reliable-unconnected send
> and receive messages.  

Something proprietary and undocumented cannot be part of common
verbs. I am absolutely adamant on this. We cannot maintain the core
code as a patchwork of undocumented driver-specific crap.

Driver crap goes in drivers, where it is the driver's problem to
maintain and keep working.

> Along this line, given the constraints of the existing APIs, if a
> reliable-unconnected QP that supported RDMA read/write were ever
> introduced, I would also add that as a new QP type.

Mellanox already has a reliable delivery unconnected QP type that
supports RDMA. It is exposed out of the driver as IBV_QPT_DRIVER
because the protocol and behavior isn't documented.

I don't see any difference with EFA's proprietary protocol.

Jason
Jason Gunthorpe Dec. 20, 2018, 11:10 p.m. UTC | #34
On Thu, Dec 20, 2018 at 11:28:34AM +0200, Gal Pressman wrote:

> It looks like the discussion didn't come to a conclusion, I'm trying
> to come up with a plan going forward and would like to get your
> opinion.

I haven't heard any of the people who complained about usnic chime in
here..

> We can implement an rdma-core (libibverbs) userspace provider with
> support for standard UD (including the 40 bytes offset) and SRD QPs
> through direct verbs.  I'll also add documentation for SRD QP type,
> even if we end up using it as a driver QP type.

There is no reason to implement code you don't intend to maintain and
use. But if you are using the kernel QPT's you must follow the
protocols in the kernel driver. You should probably just give up on
the UD QPT if the HW doesn't follow verbs and use QPT_DRIVER2 for it.

You also need to pass the rdma-core ABI checks, but that does not
require implementing a provider.

> The create/destroy AH issue will be solved with the sleepable flag,
> EFA can return -EOPNOTSUPP when called in an atomic context. When
> we'll add kernel verbs we can solve that the same way bnxt driver
> did (polling for completion).

This seems fine
 
> The EFA wire protocol is tightly coupled to the wire protocol for
> EC2’s VPC software defined network, which Amazon considers one of
> its proprietary differentiating features.

This is a problem, we have never allowed proprietary wire protocols to
set a standard for multi-vendor verbs.

We've allowed a lot of stuff in extended verbs, but never that.

So that alone says the EFA proprietary stuff cannot be part of common
verbs at this point. But this is what QPT_DRIVER is for, so it isn't
insurmountable.

But what you are left with is a driver that implements QPT_DRIVER and
nothing else.

> Kernel verbs are not supported right now, but we do have future
> plans to support that.

This is a problem, as it was a big part of the objection to usnic.

I'm not sure what kernel support you imagine, but nearly all existing
kernel drivers require actual RC RDMA. 

Frankly, if your device and driver would implement actual 'verbs' RC
RDMA then the 'usnic' concern goes away.

> We are focused on our customer base, and due to our product offering
> we haven't seen customer demand for nvmeof support, which you have
> set as the bar for the RDMA subsystem.

nvmeof uses the same feature set as all the standard storage ULPs and
represents the base level of functionality to enable a lot of in-kernel
functionality.

If you are imagining making new EFA specific kernel drivers using
proprietary protocols, then again, no, this is not what the RDMA
subsystem is for.

> I'd really like to avoid implementing things that do not interest
> our customers and will not have actual use.

You should not do this, we don't want to maintain such dead code.

> I genuinely believe that EFA belongs in the RDMA subsystem, a lot
> more than vfio/anywhere else.

And yet there are other teams saying the opposite about their devices
and are working to try and build a new subsystem that generically
provides WR/WC rings and userspace DMA..

> We enforce PDs and MRs in the device, use standard AH registration,
> remote QP numbers addressing, packet headers are constructed on the
> device, etc..  We are not simply hacking our way to userspace
> through the subsystem.

What part of the ib_core support code does EFA actually use besides
the UAPI conduit? umem? anything else?

Jason
Hefty, Sean Dec. 21, 2018, 1:28 a.m. UTC | #35
> > I agree.  It has the general abstract hardware semantics that I
> > associated as 'verbs' - PD, MR, AH, QP, CQ.
> 
> Almost all modern high performance hardware has some variation on
> QP's and CQ's.

I draw a distinction between command queues and QPs and whether the addressing is closely coupled with it.

> I'm not sure exactly what a device that can't do RDMA is doing with
> MRs.

It is still required for pinning memory buffers for send/receive operations.

> It doesn't have rkeys, right? So the PD&MR is just some variation of
> lkey gather/scatter from user memory??

The PD also controls the AH.  You could argue that it's unnecessary to group QPs, MRs, and AHs by PDs, but that does match verb semantics.  The PD provides process isolation.

> A MR without an rkey capability is not a MR, we've been calling that
> object a 'umem' in verbs lately.

Well, the IB spec still calls this a MR.  :b

> This is my objection to these drivers, the device does not implement
> anything close to the expected verbs semantics for any of the objects
> it seems to use.

It looks similar to an IB device that only supports UD QPs.  It doesn't give the impression of being so far off from verbs as to be something completely unrelated.

You honestly don't think you could write an app against SRD knowing nothing more than you know now?

Yes, the documentation could be improved to clarify the semantics, but I do think a fair attempt could be made to program an app to use SRD.

> > IMO, IB hardware semantic quirks should not carry over to more
> > generic QP type definitions.  EFA and IB have different addressing
> > formats, so an app can't just switch between one UD QP and another.
> 
> The QP type must specify broadly how the WR's and WC's work

SRD appears to do this at a broad level.  What additional semantics are you wanting to know?
 
> > I still view SRD as a generic QP type for reliable-unconnected send
> > and receive messages.
> 
> Something proprietary and undocumented cannot be part of common
> verbs. I am absolutely adamant on this. We cannot maintain the core
> code as a patchwork of undocumented driver-specific crap.

I'm not sure why something proprietary is considered crap.  That seems unnecessarily judgmental.  The behavior of SRD can be described in more detail, even keeping the protocol proprietary.
 
> Mellanox already has a reliable delivery unconnected QP type that
> supports RDMA. It is exposed out of the driver as IBV_QPT_DRIVER
> because the protocol and behavior isn't documented.

I would like to see this be defined generically as well, and really anything that supports verbs in an 'intuitive' way.

> I don't see any difference with EFA's proprietary protocol.

I still don't see the point of having every driver use value 255 to indicate that the real value is some other location.  We're not constrained on enum values, and this just makes it more likely that a request can get forwarded to the wrong driver.  For me, this is a separate issue than defining SRD in such a way that it can be used generically outside of EFA.

- Sean
Jason Gunthorpe Dec. 21, 2018, 3:15 a.m. UTC | #36
On Fri, Dec 21, 2018 at 01:28:14AM +0000, Hefty, Sean wrote:
> > > I agree.  It has the general abstract hardware semantics that I
> > > associated as 'verbs' - PD, MR, AH, QP, CQ.
> > 
> > Almost all modern high performance hardware has some variation on
> > QP's and CQ's.
> 
> I draw a distinction between command queues and QPs and whether the
> addressing is closely coupled with it.

It doesn't sound like SRD has addressing associated with the QP, as it
is unconnected. Addressing is surely associated with the individual
command?

What's the difference between a command queue and a verbs QP? Verbs QPs
have the verbs defined properties and states. (notice EFA
hardwired/ignored a lot of these). It has the verbs state machines. QPs
execute *verbs* actions (SEND/RECV/READ/WRITE/etc).
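
(Concretely, even a plain UD consumer has to drive the RESET -> INIT -> RTR ->
RTS machine through ibv_modify_qp - a sketch with placeholder values:)

#include <infiniband/verbs.h>

static int ud_qp_to_rts(struct ibv_qp *qp)
{
        struct ibv_qp_attr attr = {
                .qp_state   = IBV_QPS_INIT,
                .pkey_index = 0,
                .port_num   = 1,
                .qkey       = 0x11111111,       /* placeholder qkey */
        };
        int ret;

        ret = ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                                       IBV_QP_PORT | IBV_QP_QKEY);
        if (ret)
                return ret;

        attr.qp_state = IBV_QPS_RTR;            /* no path info needed for UD */
        ret = ibv_modify_qp(qp, &attr, IBV_QP_STATE);
        if (ret)
                return ret;

        attr.qp_state = IBV_QPS_RTS;
        attr.sq_psn = 0;
        return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_SQ_PSN);
}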

Should 'rdma' be a subsystem that allows any arbitrary process isolated
command queues to be exposed to user space?? That is what USNIC is
doing. But others have said no to this, they want to build a new
subsystem more like vfio..

> > It doesn't have rkeys, right? So the PD&MR is just some variation of
> > lkey gather/scatter from user memory??
> 
> The PD also controls the AH.

Hmm? Not really in verbs. Is EFA doing something different?

> You could argue that it's unnecessary to group QPs, MRs, and AHs by
> PDs, but that does match verb semantics.  The PD provides process
> isolation.

verbs semantics require the PD grouping because it is a security boundary
for the rkey. The verbs user must have control over the PD to ensure
that multiple remote connections in the same process have isolated
rkeys from each other.

If it wasn't for the rkey then all those objects would be simply
grouped to the verbs FD with no PD.
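
(In code terms, the boundary a verbs consumer sees - standard libibverbs, with
ctx/buf/len as placeholders:)

#include <infiniband/verbs.h>
#include <stddef.h>

static struct ibv_mr *reg_example(struct ibv_context *ctx, void *buf, size_t len)
{
        struct ibv_pd *pd = ibv_alloc_pd(ctx);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);

        /* mr->lkey: used in local SGEs by this process.
         * mr->rkey: handed to a peer, usable only through a QP on this pd.
         */
        return mr;
}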

> > A MR without an rkey capability is not a MR, we've been calling that
> > object a 'umem' in verbs lately.
> 
> Well, the IB spec still calls this a MR.  :b

And AF_XDP calls it a umem and vfio calls it a 'dma map'.

> > This is my objection to these drivers, the device does not implement
> > anything close to the expected verbs semantics for any of the objects
> > it seems to use.
> 
> It looks similar to an IB device that only supports UD QPs.

Gal already said it doesn't support 'verbs' UD. It supports two kinds
of proprietary QPs, as far as I can tell. One is sort of like UD,
maybe.

> You honestly don't think you could write an app against SRD knowing
> nothing more than you know now?

More than *I* know? Certainly not! I assume you've been reviewing
their libfabric driver, so I think you know a lot more than I do..

.. and that isn't the standard for defining a verb extension anyhow.

> Yes, the documentation could be improved to clarify the semantics,
> but I do think a fair attempt could be made to program an app to use
> SRD.

The semantics would need to be documented to a level that another
vendor could make a compatible implementation - at a *minimum*.

.. and that ignores the issue that the wire protocol is proprietary,
which we haven't ignored in the past.

.. and that they have stated no intention to actually provide a verbs
API for this functionality, only doing libfabric, it seems.

So I see this the same as usnic. A non-verbs proprietary protocol that
is used to build something more complicated in userspace.

This is in contrast to QIB/HFI that used their non-verbs HW protocol
to build verbs in SW, in the kernel.

> > > I still view SRD as a generic QP type for reliable-unconnected send
> > > and receive messages.
> > 
> > Something proprietary and undocumented cannot be part of common
> > verbs. I am absolutely adamant on this. We cannot maintain the core
> > code as a patchwork of undocumented driver-specific crap.
> 
> I'm not sure why something proprietary is considered crap.  That
> seems unnecessarily judgmental.  

Have you looked at the usnic kernel driver? :(

> The behavior of SRD can be described in more detail, even keeping
> the protocol proprietary.

Sure, but this is a line we have not yet crossed in RDMA.

> > Mellanox already has a reliable delivery unconnected QP type that
> > supports RDMA. It is exposed out of the driver as IBV_QPT_DRIVER
> > because the protocol and behavior isn't documented.
> 
> I would like to see this be defined generically as well, and really
> anything that supports verbs in an 'intuitive' way.

Generally someone should use UCX to access this protocol, not
verbs.. Just like you should use (apparently) libfabric to access SRD,
not verbs.

I'm not sure who it benefits to document SRD just to justify being
verbs to get the uapi when nothing in the kernel or userspace will use
the 'SRD verbs'.

> > I don't see any difference with EFA's proprietary protocol.
> 
> I still don't see the point of having every driver use value 255 to
> indicate that the real value is some other location.  

It indicates it is not supported by the core stack, and will not be
supported by any other driver. Seeing QPT_DRIVER outside the driver/hw
directory is a huge red flag that patches will be NAK'd - ie ULPs
can't use QPT_DRIVER. It is the appropriate placement for something
expected to be proprietary.

If there is an industry will to standardize then the definition can
change.
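
(The shape that takes in practice - a sketch of a direct-verbs create, where
IBV_QPT_DRIVER is the existing libibverbs value and everything efadv_* is a
hypothetical provider extension, not an existing API:)

#include <infiniband/verbs.h>
#include <stdint.h>

/* Hypothetical efadv_* names for illustration; only IBV_QPT_DRIVER and the
 * ibv_qp_init_attr_ex fields are real libibverbs definitions.
 */
struct efadv_qp_init_attr {
        uint32_t driver_qp_type;                /* e.g. an SRD selector */
};

struct ibv_qp *efadv_create_qp_ex(struct ibv_context *ctx,
                                  struct ibv_qp_init_attr_ex *attr,
                                  struct efadv_qp_init_attr *efa_attr,
                                  uint32_t inlen);  /* hypothetical entry point */

static struct ibv_qp *create_srd_qp(struct ibv_context *ctx, struct ibv_pd *pd,
                                    struct ibv_cq *cq)
{
        struct ibv_qp_init_attr_ex attr = {
                .qp_type   = IBV_QPT_DRIVER,    /* opaque to the core stack */
                .comp_mask = IBV_QP_INIT_ATTR_PD,
                .pd        = pd,
                .send_cq   = cq,
                .recv_cq   = cq,
                .cap       = { .max_send_wr = 256, .max_recv_wr = 256,
                               .max_send_sge = 1,  .max_recv_sge = 1 },
        };
        struct efadv_qp_init_attr efa_attr = { .driver_qp_type = 1 /* SRD */ };

        return efadv_create_qp_ex(ctx, &attr, &efa_attr, sizeof(efa_attr));
}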

> We're not constrained on enum values, and this just makes it more
> likely that a request can get forwarded to the wrong driver.  

ioctl has protections against this already.

Look, this is a very strange situation. Since QIB/HFI PSM we have been
allowing drivers to expose their own command channels to userspace
that are proprietary. That isn't a problem, per se.

The objection is using the RDMA subsystem to access the uapi channel
when the device isn't an RDMA device and doesn't provide verbs ops
toward the kernel.

This has confused/complicated a lot of the kernel work because we see
bad/confusing/non-standard implementations of verbs objects that don't
act the way they should. In the past this kernel work has just said
'ignore usnic, it shouldn't even be here' - and honestly there is a bit
of a "shrug, it is cisco's problem if usnic somehow breaks" because it
is so incredibly far from all the other drivers.

I said it the last time this came up, if we want to go in this
direction we should consider that these devices are somehow entirely
different from the verbs oriented stack, and maybe they don't abuse
the kernel facing verbs APIs in their quest to get access to user
space.

But I haven't seen a proposal for something like this.. Even something
fairly simple like declaring them a NON_VERBS device and hiding them
from the in-kernel APIs might be enough. But then I'm not sure how we
justify using the common verbs uAPI..

Mellanox and HFI both defined their own non-verbs uAPIs, maybe that is
more appropriate here. It is hard to say when there is so little
public information <shrug> 

Jason
Gal Pressman Dec. 23, 2018, 2:48 p.m. UTC | #37
On 21-Dec-18 05:15, Jason Gunthorpe wrote:
> On Fri, Dec 21, 2018 at 01:28:14AM +0000, Hefty, Sean wrote:
>>>> I agree.  It has the general abstract hardware semantics that I
>>>> associated as 'verbs' - PD, MR, AH, QP, CQ.
>>>
>>> Almost all modern high performance hardware has some variation on
>>> QP's and CQ's.
>>
>> I draw a distinction between command queues and QPs and whether the
>> addressing is closely coupled with it.
> 
> It doesn't sound like SRD has addressing associated with the QP, as it
> is unconnected. Addressing is surely associated with the individual
> command?
> 
> What's the difference between a command queue and a verbs QP? Verbs QPs
> have the verbs defined properties and states. (notice EFA
> hardwired/ignored a lot of these). It has the verbs state machines. QPs
> execute *verbs* actions (SEND/RECV/READ/WRITE/etc).

EFA queues are not command queues, these are verbs queues that support SEND/RECV
operations using AH and QP numbers for addressing and MR lkeys for local buffer
access.
We are missing the QP states at the moment, but support for these will be added
in the firmware.

> 
> Should 'rdma' be a subsystem that allows any arbitrary process isolated
> command queues to be exposed to user space?? That is what USNIC is
> doing. But others have said no to this, they want to build a new
> subsystem more like vfio..
>
>>> It doesn't have rkeys, right? So the PD&MR is just some variation of
>>> lkey gather/scatter from user memory??
>>
>> The PD also controls the AH.
> 
> Hmm? Not really in verbs. Is EFA doing something different?
> 
>> You could argue that it's unnecessary to group QPs, MRs, and AHs by
>> PDs, but that does match verb semantics.  The PD provides process
>> isolation.
> 
> verbs semantics require the PD grouping because it is a security boundary
> for the rkey. The verbs user must have control over the PD to ensure
> that multiple remote connections in the same process have isolated
> rkeys from each other.
> 
> If it wasn't for the rkey then all those objects would be simply
> grouped to the verbs FD with no PD.

We don't support RC/READ/WRITE so rkeys don't exist in EFA, but we do use
PDs (and lkeys) to group the different objects according to the verbs semantics.

> 
>>> A MR without an rkey capability is not a MR, we've been calling that
>>> object a 'umem' in verbs lately.
>>
>> Well, the IB spec still calls this a MR.  :b
> 
> And AF_XDP calls it a umem and vfio calls it a 'dma map'.
> 
>>> This is my objection to these drivers, the device does not implement
>>> anything close to the expected verbs semantics for any of the objects
>>> it seems to use.
>>
>> It looks similar to an IB device that only supports UD QPs.
> 
> Gal already said it doesn't support 'verbs' UD. It supports two kinds
> of proprietary QPs, as far as I can tell. One is sort of like UD,
> maybe.

Sorry for not being clear about this, verbs UD will be supported.

> 
>> You honestly don't think you could write an app against SRD knowing
>> nothing more than you know now?
> 
> More than *I* know? Certainly not! I assume you've been reviewing
> their libfabric driver, so I think you know a lot more than I do..
> 
> .. and that isn't the standard for defining a verb extension anyhow.

You're right, in addition to our userspace provider we will provide a functional
specification of SRD.

> 
>> Yes, the documentation could be improved to clarify the semantics,
>> but I do think a fair attempt could be made to program an app to use
>> SRD.
> 
> The semantics would need to be documented to a level that another
> vendor could make a compatible implementation - at a *minimum*.

I agree, will add.

> 
> .. and that ignores the issue that the wire protocol is proprietary,
> which we haven't ignored in the past.

I get your concerns, but how does the lack of documentation for the wire
protocol in EC2 network impact our users? What information that affects EFA's
usage is missing?
Every customer visible feature will be documented.

> 
> .. and that they have stated no intention to actually provide a verbs
> API for this functionality, only doing libfabric, it seems.

Sorry for not being clear about this as well, we do have intentions for a verbs
API :).
We have a libfabric provider which uses the verbs API at the moment, but we can
implement this through libibverbs as well.

> 
> So I see this the same as usnic. A non-verbs proprietary protocol that
> is used to build something more complicated in userspace.

I'm not familiar with what usnic is doing, but this is not what EFA is about.
This will be more clear in our userspace bits.

> 
> This is in contrast to QIB/HFI that used their non-verbs HW protocol
> to build verbs in SW, in the kernel.

I don't think our case is the same as qib/hfi; I do consider our device a verbs
device, but we can add a software implementation for missing verbs functionality
(like RC) - it's functionality our customers aren't interested in at the moment.

> 
>>>> I still view SRD as a generic QP type for reliable-unconnected send
>>>> and receive messages.
>>>
>>> Something proprietary and undocumented cannot be part of common
>>> verbs. I am absolutely adamant on this. We cannot maintain the core
>>> code as a patchwork of undocumented driver-specific crap.
>>
>> I'm not sure why something proprietary is considered crap.  That
>> seems unnecessarily judgmental.  
> 
> Have you looked at the usnic kernel driver? :(
> 
>> The behavior of SRD can be described in more detail, even keeping
>> the protocol proprietary.
> 
> Sure, but this is a line we have not yet crossed in RDMA.

Does proprietary implementation mean proprietary protocol?

> 
>>> Mellanox already has a reliable delivery unconnected QP type that
>>> supports RDMA. It is exposed out of the driver as IBV_QPT_DRIVER
>>> because the protocol and behavior isn't documented.
>>
>> I would like to see this be defined generically as well, and really
>> anything that supports verbs in an 'intuitive' way.
> 
> Generally someone should use UCX to access this protocol, not
> verbs.. Just like you should use (apparently) libfabric to access SRD,
> not verbs.
> 
> I'm not sure who it benefits to document SRD just to justify being
> verbs to get the uapi when nothing in the kernel or userspace will use
> the 'SRD verbs'.

How is SRD different than DC? We'll expose a direct verb to introduce the new QP
type.
If the specification for SRD will suffice, we can implement it through standard
ibv_create_qp.

> 
>>> I don't see any difference with EFA's proprietary protocol.
>>
>> I still don't see the point of having every driver use value 255 to
>> indicate that the real value is some other location.  
> 
> It indicates it is not supported by the core stack, and will not be
> supported by any other driver. Seeing QPT_DRIVER outside the driver/hw
> directory is a huge red flag that patches will be NAK'd - ie ULPs
> can't use QPT_DRIVER. It is the appropriate placement for something
> expected to be proprietary.
> 
> If there is an industry will to standardize then the definition can
> change.
> 
>> We're not constrained on enum values, and this just makes it more
>> likely that a request can get forwarded to the wrong driver.  
> 
> ioctl has protections against this already.
> 
> Look, this is a very strange situation. Since QIB/HFI PSM we have been
> allowing drivers to expose their own command channels to userspace
> that are proprietary. That isn't a problem, per se.
> 
> The objection is using the RDMA subsystem to access the uapi channel
> when the device isn't an RDMA device and doesn't provide verbs ops
> toward the kernel.
> 
> This has confused/complicated a lot of the kernel work because we see
> bad/confusing/non-standard implementations of verbs objects that don't
> act the way they should. In the past this kernel work has just said
> 'ignore usnic, it shouldn't even be here' - and honestly there is a bit
> of a "shrug, it is cisco's problem if usnic somehow breaks" because it
> is so incredibly far from all the other drivers.
> 
> I said it the last time this came up, if we want to go in this
> direction we should consider that these devices are somehow entirely
> different from the verbs oriented stack, and maybe they don't abuse
> the kernel facing verbs APIs in their quest to get access to user
> space.
> 
> But I haven't seen a proposal for something like this.. Even something
> fairly simple like declaring them a NON_VERBS device and hiding them
> from the in-kernel APIs might be enough. But then I'm not sure how we
> justify using the common verbs uAPI..

I can come up with a proposal for hiding EFA from in-kernel APIs, although I do
believe we will support some sort of kernel API in the future.
Regardless, I don't think that means we should declare EFA as a NON_VERBS device.

Also, we can implement a UD API for the kernel; it's not a smart thing to do
since there are no kernel users for that, but it would provide verbs ops
towards the kernel and clear up the non-standard implementation suspicion.

> 
> Mellanox and HFI both defined their own non-verbs uAPIs, maybe that is
> more appropriate here. It is hard to say when there is so little
> public information <shrug> 
> 
> Jason
>
Gal Pressman Dec. 23, 2018, 3:14 p.m. UTC | #38
On 21-Dec-18 01:10, Jason Gunthorpe wrote:
> On Thu, Dec 20, 2018 at 11:28:34AM +0200, Gal Pressman wrote:
> 
>> It looks like the discussion didn't come to a conclusion, I'm trying
>> to come up with a plan going forward and would like to get your
>> opinion.
> 
> I haven't heard any of the people who complained about usnic chime in
> here..
> 
>> We can implement an rdma-core (libibverbs) userspace provider with
>> support for standard UD (including the 40 bytes offset) and SRD QPs
>> through direct verbs.  I'll also add documentation for SRD QP type,
>> even if we end up using it as a driver QP type.
> 
> There is no reason to implement code you don't intend to maintain and
> use. But if you are using the kernel QPT's you must follow the
> protocols in the kernel driver. You should probably just give up on
> the UD QPT if the HW doesn't follow verbs and use QPT_DRIVER2 for it.

The ways in which we deviate from the UD QPT are very small; we're 90% there and
the remaining 10% is simple enough.
As I said, the 40 byte thing will be changed.
We will implement the missing QP state handling for UD QPs in firmware, without
any driver/provider hacks.

> 
> You also need to pass the rdma-core ABI checks, but that does not
> require implementing a provider.

Of course.

> 
>> The create/destroy AH issue will be solved with the sleepable flag,
>> EFA can return -EOPNOTSUPP when called in an atomic context. When
>> we'll add kernel verbs we can solve that the same way bnxt driver
>> did (polling for completion).
> 
> This seems fine

ACK.

>  
>> The EFA wire protocol is tightly coupled to the wire protocol for
>> EC2’s VPC software defined network, which Amazon considers one of
>> its proprietary differentiating features.
> 
> This is a problem, we have never allowed proprietary wire protocols to
> set a standard for multi-vendor verbs.
> 
> We've allowed a lot of stuff in extended verbs, but never that.
> 
> So that alone says the EFA proprietary stuff cannot be part of common
> verbs at this point. But this is what QPT_DRIVER is for, so it isn't
> insurmountable.

I agree, EFA proprietary stuff should not be part of common verbs.

> 
> But what you are left with is a driver that implements QPT_DRIVER and
> nothing else.

Sorry for not being clear: we will provide verbs QPT_UD in addition to
QPT_DRIVER (SRD).
In addition, we will provide a functional specification for SRD, so its usage will
be clear.

> 
>> Kernel verbs are not supported right now, but we do have future
>> plans to support that.
> 
> This is a problem, as it was a big part of the objection to usnic.

Is kernel API more important than user API?

> 
> I'm not sure what kernel support you imagine, but nearly all existing
> kernel drivers require actual RC RDMA. 
> 
> Frankly, if your device and driver would implement actual 'verbs' RC
> RDMA then the 'usnic' concern goes away.
> 
>> We are focused on our customer base, and due to our product offering
>> we haven't seen customer demand for nvmeof support, which you have
>> set as the bar for the RDMA subsystem.
> 
> nvmeof uses the same feature set as all the standard storage ULPs and
> represents the base level of functionality to enable alot of in-kernel
> functionality.
> 
> If you are imagining making new EFA specific kernel drivers using
> proprietary protocols, then again, no, this is not what the RDMA
> subsystem is for.

Not sure what kernel verbs for EFA are going to look like, but I agree it should
be supported through a standard IB_QPT.

> 
>> I'd really like to avoid implementing things that do not interest
>> our customers and will not have actual use.
> 
> You should not do this, we don't want to maintain such dead code.

Exactly.

> 
>> I genuinely believe that EFA belongs in the RDMA subsystem, a lot
>> more than vfio/anywhere else.
> 
> And yet there are other teams saying the opposite about their devices
> and are working to try and build a new subsystem that generically
> provides WR/WC rings and userspace DMA..
> 
>> We enforce PDs and MRs in the device, use standard AH registration,
>> remote QP numbers addressing, packet headers are constructed on the
>> device, etc..  We are not simply hacking our way to userspace
>> through the subsystem.
> 
> What part of the ib_core support code does EFA actually use besides
> the UAPI conduit? umem? anything else?

Yes: core reference counts, resource tracking, ib_core command
validation/security checks, device registration, the sysfs API.
In the future, performance statistics (following Ariel's RFC), driver-specific
resource tracking (dump), kernel verbs.
The feature set will naturally increase over time.

> 
> Jason
> 

I'd really like to get to a point where we know what our next submission is going
to look like. I understand that the lack of user code accompanying the driver
is an issue - that won't be the case in my next submission.

As I said, my proposal is to add libibverbs support (with standard QPT_UD and QP
states) and specification for SRD.
In addition, I can come up with an RFC for hiding EFA from in-kernel API as you
suggested, although we will probably revisit this in the future.
Will this be acceptable?

Thanks,
Gal
Sagi Grimberg Dec. 25, 2018, 9:17 a.m. UTC | #39
Hi, just happened to see this thread,

> I think the unambiguous low bar is enough implemented to run
> NVMEoF. That is definitely 'RDMA common verbs' hardware, IMHO.

Does this still hold Jason? Would be impossible to meet without
implementing RC though... Also, note that nvmeof and the others have
pretty much the same requirements from the rdma core.
Jason Gunthorpe Jan. 2, 2019, 5:56 p.m. UTC | #40
On Tue, Dec 25, 2018 at 01:17:28AM -0800, Sagi Grimberg wrote:
> Hi, just happened to see this thread,
> 
> > I think the unambiguous low bar is enough implemented to run
> > NVMEoF. That is definitely 'RDMA common verbs' hardware, IMHO.
> 
> Does this still hold Jason? Would be impossible to meet without
> implementing RC though... 

A driver needs RC + RDMA to be useful to our kernel ULPs.

So the question really is should a driver be merged that can't work
with any existing ULP? 

> Also, note that nvmeof and the others have pretty much the same
> requirements from the rdma core.

Yes, which is why I pointed at it..

Jason
Jason Gunthorpe Jan. 2, 2019, 6:04 p.m. UTC | #41
On Sun, Dec 23, 2018 at 04:48:36PM +0200, Gal Pressman wrote:
> On 21-Dec-18 05:15, Jason Gunthorpe wrote:
> > On Fri, Dec 21, 2018 at 01:28:14AM +0000, Hefty, Sean wrote:
> >>>> I agree.  It has the general abstract hardware semantics that I
> >>>> associated as 'verbs' - PD, MR, AH, QP, CQ.
> >>>
> >>> Almost all modern high performance hardware has some variation on
> >>> QP's and CQ's.
> >>
> >> I draw a distinction between command queues and QPs and whether the
> >> addressing is closely coupled with it.
> > 
> > It doesn't sound like SRD has addressing associated with the QP, as it
> > is unconnected. Addressing is surely associated with the individual
> > command?
> > 
> > What's the difference between a command queue and a verbs QP? Verbs QPs
> > have the verbs defined properties and states. (notice EFA
> > hardwired/ignored a lot of these). It has the verbs state machines. QPs
> > execute *verbs* actions (SEND/RECV/READ/WRITE/etc).
> 
> EFA queues are not command queues, these are verbs queues that
> support SEND/RECV operations using AH and QP numbers for addressing
> and MR lkeys for local buffer access.  We are missing the QP states
> at the moment, but support for these will be added in the firmware.

verbs QP's are command queues.. The question is whether the EFA command
queue even does verbs, so as to be a verbs QP. If the device can implement
IBTA UD QP's then probably yes.

> > .. and that ignores the issue that the wire protocol is proprietary,
> > which we haven't ignored in the past.
> 
> I get your concerns, but how does the lack of documentation for the
> wire protocol in EC2 network impact our users? What information that
> affects EFA's usage is missing?  Every customer visible feature will
> be documented.

It is not something we've done before, so there is no canned answer or
precedent.

> > So I see this the same as usnic. A non-verbs proprietary protocol that
> > is used to build something more complicated in userspace.
> 
> I'm not familiar with what usnic is doing, but this is not what EFA
> is about.  This will be more clear in our userspace bits.

It is doing what this first EFA driver was doing :) Almost exactly,
right down to not having a libibverbs provider.  

Except it used a different approach to creating umem..

> >> The behavior of SRD can be described in more detail, even keeping
> >> the protocol proprietary.
> > 
> > Sure, but this is a line we have not yet crossed in RDMA.
> 
> Does proprietary implementation mean proprietary protocol?

Proprietary protocol means the wire protocol is not documented.

> > I'm not sure who it benefits to document SRD just to justify being
> > verbs to get the uapi when nothing in the kernel or userspace will use
> > the 'SRD verbs'.
> 
> How is SRD different than DC? We'll expose a direct verb to
> introduce the new QP type.

DC isn't documented at all :) Otherwise it looks pretty much the same
to me.

> If the specification for SRD will suffice, we can implement it
> through standard ibv_create_qp.

We are trying not to accumulate endless one-off implementations in
verbs. If the implementation is expected to be a one-off then the
appropriate place is in the DV side of the verbs world.
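
[ Purely illustrative of the 'DV side' idea: the efadv_* names below are
  hypothetical, sketched by analogy with the existing mlx5dv direct-verbs
  interfaces, and do not exist at the time of this thread: ]

#include <infiniband/verbs.h>

/* Hypothetical device-specific entry point for a one-off QP type such as
 * SRD, instead of adding a new QP type to common verbs.  The returned QP
 * would still be used through the normal verbs data-path calls. */
struct efadv_qp_init_attr {
	uint64_t comp_mask;
	uint32_t driver_qp_type;	/* e.g. a hypothetical EFADV_QPT_SRD */
};

struct ibv_qp *efadv_create_qp(struct ibv_pd *pd,
			       struct ibv_qp_init_attr *attr,
			       struct efadv_qp_init_attr *efa_attr);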

> Also, we can implement a UD API for the kernel, it's not a smart
> thing to do since there are no kernel users for that, but that would
> be providing verbs ops towards the kernel and clear the non-standard
> implementation suspicion.

I wouldn't recommend adding dead code.

Jason
Jason Gunthorpe Jan. 2, 2019, 6:12 p.m. UTC | #42
On Sun, Dec 23, 2018 at 05:14:34PM +0200, Gal Pressman wrote:

> I'd really like to get to a point where we know how our next
> submission is going to look like. I understand that the lack of user
> code accompanied to the driver is an issue - that won't be the case
> in my next submission.

Well, your submission is up to you :) There seems to have been quite a
lot missing from the first go around, so I think you'll get more
comments.

You definitely need some kind of userspace though, we are not merging
any kernel uapi stuff without some kind of userspace.

> As I said, my proposal is to add libibverbs support (with standard
> QPT_UD and QP states) and specification for SRD.  In addition, I can
> come up with an RFC for hiding EFA from in-kernel API as you
> suggested, although we will probably revisit this in the future.
> Will this be acceptable?

I have a feeling EFA will hit the same problems as usnic - libibverbs
will be difficult to extend to expose the various special things (how
will AH's even work? What about MTU?).

Although we are in a better place now, so maybe those problems are now
solvable.

Unless the libfabric EFA provider will use libibverbs for the kernel
interface I'm not sure I see the point in making something for
libibverbs that no-one is expected to use??

Jason
Gal Pressman Jan. 3, 2019, 9:54 a.m. UTC | #43
On 02-Jan-19 20:12, Jason Gunthorpe wrote:
> On Sun, Dec 23, 2018 at 05:14:34PM +0200, Gal Pressman wrote:
> 
>> I'd really like to get to a point where we know how our next
>> submission is going to look like. I understand that the lack of user
>> code accompanied to the driver is an issue - that won't be the case
>> in my next submission.
> 
> Well, your submission is up to you :) There seems to have been quite a
> lot missing from the first go around, so I think you'll get more
> comments.
> 
> You definitely need some kind of userspace though, we are not merging
> any kernel uapi stuff without some kind of userspace.

Agreed.

> 
>> As I said, my proposal is to add libibverbs support (with standard
>> QPT_UD and QP states) and specification for SRD.  In addition, I can
>> come up with an RFC for hiding EFA from in-kernel API as you
>> suggested, although we will probably revisit this in the future.
>> Will this be acceptable?
> 
> I have a feeling EFA will hit the same problems as usnic - libibverbs
> will be difficult to extend to expose the various special things (how
> will AH's even work? What about MTU?).
> 
> Although we are in a better place now, so maybe those problems are now
> solvable.
> 
> Unless the libfabric EFA provider will use libibverbs for the kernel
> interface I'm not sure I see the point in making something for
> libibverbs that no-one is expected to use??
> 
> Jason
> 

The libfabric EFA provider doesn't use libibverbs, but it does use the same
kernel verbs interface and kABI.

If a libibverbs provider must accompany each driver in the subsystem,
we'll do our best to comply. You know this library better than I do (obviously :)),
so there might be small issues we'll have to face. I guess we'll still have to
use an EFA libfabric provider that calls libibverbs instead of using the
existing libfabric verbs provider directly.

If a libfabric provider that uses the same verbs kernel ABI is acceptable,
that's what we currently have and that's what fits our customers' needs the most
and does not force them to use two different libraries for EFA usage.

Whatever we do, I'll make sure to submit the userspace bits for review as well.
Leon Romanovsky Jan. 3, 2019, 10:27 a.m. UTC | #44
On Thu, Jan 03, 2019 at 11:54:01AM +0200, Gal Pressman wrote:
> On 02-Jan-19 20:12, Jason Gunthorpe wrote:
> > On Sun, Dec 23, 2018 at 05:14:34PM +0200, Gal Pressman wrote:
> >
> >> I'd really like to get to a point where we know how our next
> >> submission is going to look like. I understand that the lack of user
> >> code accompanied to the driver is an issue - that won't be the case
> >> in my next submission.
> >
> > Well, your submission is up to you :) There seems to have been quite a
> > lot missing from the first go around, so I think you'll get more
> > comments.
> >
> > You definitely need some kind of userspace though, we are not merging
> > any kernel uapi stuff without some kind of userspace.
>
> Agreed.
>
> >
> >> As I said, my proposal is to add libibverbs support (with standard
> >> QPT_UD and QP states) and specification for SRD.  In addition, I can
> >> come up with an RFC for hiding EFA from in-kernel API as you
> >> suggested, although we will probably revisit this in the future.
> >> Will this be acceptable?
> >
> > I have a feeling EFA will hit the same problems as usnic - libibverbs
> > will be difficult to extend to expose the various special things (how
> > will AH's even work? What about MTU?).
> >
> > Although we are in a better place now, so maybe those problems are now
> > solvable.
> >
> > Unless the libfabric EFA provider will use libibverbs for the kernel
> > interface I'm not sure I see the point in making something for
> > libibverbs that no-one is expected to use??
> >
> > Jason
> >
>
> The libfabric EFA provider doesn't use libibverbs, but it does use the same
> kernel verbs interface and kABI.
>
> If a libibverbs provider must be accompanied to each driver in the subsystem
> we'll do our best to comply. You know this library better than I (obviously :)),
> so there might be small issues we'll have to face. I guess we'll still have to
> use an EFA libfabric provider that calls libibverbs instead of using the
> existing libfabric verbs provider directly.
>
> If a libfabric provider that uses the same verbs kernel ABI is acceptable,
> that's what we currently have and that's what fits our customers needs the most
> and does not force them to use two different libraries for EFA usage.
>
> Whatever we do, I'll make sure to submit the userspace bits for review as well.

I'm slightly lost here, are you saying that EFA doesn't use the libfabric
verbs interface and implements the ibv_* calls by itself?

Because libfabric verbs provider uses libibverbs behind the scenes.

Thanks
Gal Pressman Jan. 3, 2019, noon UTC | #45
On 03-Jan-19 12:27, Leon Romanovsky wrote:
> On Thu, Jan 03, 2019 at 11:54:01AM +0200, Gal Pressman wrote:
>> On 02-Jan-19 20:12, Jason Gunthorpe wrote:
>>> On Sun, Dec 23, 2018 at 05:14:34PM +0200, Gal Pressman wrote:
>>>
>>>> I'd really like to get to a point where we know how our next
>>>> submission is going to look like. I understand that the lack of user
>>>> code accompanied to the driver is an issue - that won't be the case
>>>> in my next submission.
>>>
>>> Well, your submission is up to you :) There seems to have been quite a
>>> lot missing from the first go around, so I think you'll get more
>>> comments.
>>>
>>> You definitely need some kind of userspace though, we are not merging
>>> any kernel uapi stuff without some kind of userspace.
>>
>> Agreed.
>>
>>>
>>>> As I said, my proposal is to add libibverbs support (with standard
>>>> QPT_UD and QP states) and specification for SRD.  In addition, I can
>>>> come up with an RFC for hiding EFA from in-kernel API as you
>>>> suggested, although we will probably revisit this in the future.
>>>> Will this be acceptable?
>>>
>>> I have a feeling EFA will hit the same problems as usnic - libibverbs
>>> will be difficult to extend to expose the various special things (how
>>> will AH's even work? What about MTU?).
>>>
>>> Although we are in a better place now, so maybe those problems are now
>>> solvable.
>>>
>>> Unless the libfabric EFA provider will use libibverbs for the kernel
>>> interface I'm not sure I see the point in making something for
>>> libibverbs that no-one is expected to use??
>>>
>>> Jason
>>>
>>
>> The libfabric EFA provider doesn't use libibverbs, but it does use the same
>> kernel verbs interface and kABI.
>>
>> If a libibverbs provider must be accompanied to each driver in the subsystem
>> we'll do our best to comply. You know this library better than I (obviously :)),
>> so there might be small issues we'll have to face. I guess we'll still have to
>> use an EFA libfabric provider that calls libibverbs instead of using the
>> existing libfabric verbs provider directly.
>>
>> If a libfabric provider that uses the same verbs kernel ABI is acceptable,
>> that's what we currently have and that's what fits our customers needs the most
>> and does not force them to use two different libraries for EFA usage.
>>
>> Whatever we do, I'll make sure to submit the userspace bits for review as well.
> 
> I'm slightly lost here, are you saying that EFA doesn't use libfabric
> verbs interface and implemented ibv_* calls by itself?
> 
> Because libfabric verbs provider uses libibverbs behind the scenes.
> 
> Thanks
> 

Correct, since there's no libibverbs EFA provider, libfabric EFA provider cannot
make use of ibv_* calls/verbs provider.
Jason Gunthorpe Jan. 3, 2019, 5:59 p.m. UTC | #46
On Thu, Jan 03, 2019 at 02:00:48PM +0200, Gal Pressman wrote:

> Correct, since there's no libibverbs EFA provider, libfabric EFA provider cannot
> make use of ibv_* calls/verbs provider.

Only usnic has ever bypassed libibverbs like this, so you are not in
good company.

I assume all the kabi implementation is outdated too?

Jason
Doug Ledford Jan. 3, 2019, 6:52 p.m. UTC | #47
On Thu, 2019-01-03 at 10:59 -0700, Jason Gunthorpe wrote:
> On Thu, Jan 03, 2019 at 02:00:48PM +0200, Gal Pressman wrote:
> 
> > Correct, since there's no libibverbs EFA provider, libfabric EFA provider cannot
> > make use of ibv_* calls/verbs provider.
> 
> Only usnic has ever bypassed libibverbs like this, so you are not in
> good company.

I don't think that's true at all.  Just a quick glance shows providers
for psm and psm2, and I'm pretty sure they don't use libibverbs at all
but go direct to the ipath char device files.

> I assume all the kabi implementation is outdated too?
> 
> Jason
Doug Ledford Jan. 3, 2019, 7:13 p.m. UTC | #48
On Wed, 2019-01-02 at 10:56 -0700, Jason Gunthorpe wrote:
> On Tue, Dec 25, 2018 at 01:17:28AM -0800, Sagi Grimberg wrote:
> > Hi, just happened to see this thread,
> > 
> > > I think the unambiguous low bar is enough implemented to run
> > > NVMEoF. That is definitely 'RDMA common verbs' hardware, IMHO.
> > 
> > Does this still hold Jason? Would be impossible to meet without
> > implementing RC though... 
> 
> A driver needs RC + RDMA to be useful to our kernel ULPs.
> 
> So the question really is should a driver be merged that can't work
> with any existing ULP? 

All of our ULPs present in drivers/infiniband (but not all of them in
the kernel) have been programmed to the RC model.  That's a matter of
implementation, not of requirement.  They could have been programmed to
UD if the people wanted.  I would argue that to say that today a driver
needs to implement something that was never planned as a functional
requirement but is merely a consequence of chance is arbitrary.

Secondly, RDMA was *always* about user space memcopy savings, and only
secondarily about kernel protocol support.  I gave a talk all the way
back in 2006 at Red Hat Summit about the benefits of RDMA, and the
benefits to the kernel were always minimal compared to the benefits to
user space.  Having user space command queues and completion queues and
pre-registered/posted memory buffers completely avoids the memcpy from
kernel to user space associated with all other network technologies. 
The kernel didn't get that advantage because the core networking code
can hand a regular tcp skbuff to the nfs code (for instance) without a
memcpy.  For that reason, kernel benefits are secondary to user space
benefits when it comes to RDMA.  And yet you are arguing that the
kernel's ability to benefit from the driver/hardware should be the
gating factor as to whether or not it should be merged.  I have to
disagree.

> > Also, note that nvmeof and the others have pretty much the same
> > requirements from the rdma core.
> 
> Yes, which is why I pointed at it..
> 
> Jason
Leon Romanovsky Jan. 3, 2019, 7:50 p.m. UTC | #49
On Thu, Jan 03, 2019 at 02:13:23PM -0500, Doug Ledford wrote:
> On Wed, 2019-01-02 at 10:56 -0700, Jason Gunthorpe wrote:
> > On Tue, Dec 25, 2018 at 01:17:28AM -0800, Sagi Grimberg wrote:
> > > Hi, just happened to see this thread,
> > >
> > > > I think the unambiguous low bar is enough implemented to run
> > > > NVMEoF. That is definitely 'RDMA common verbs' hardware, IMHO.
> > >
> > > Does this still hold Jason? Would be impossible to meet without
> > > implementing RC though...
> >
> > A driver needs RC + RDMA to be useful to our kernel ULPs.
> >
> > So the question really is should a driver be merged that can't work
> > with any existing ULP?
>
> All of our ULPs present in drivers/infiniband (but not all of them in
> the kernel) have been programmed to the RC model.  That's a matter of
> implementation, not of requirement.  They could have been programmed to
> UD if the people wanted.  I would argue that to say that today a driver
> needs to implement something that was never planned as a functional
> requirement but is merely a consequence of chance is arbitrary.
>
> Secondly, RDMA was *always* about user space memcopy savings, and only
> secondarily about kernel protocol support.  I gave a talk all the way
> back in 2006 at Red Hat Summit about the benefits of RDMA, and the
> benefits to the kernel were always minimal compared to the benefits to
> user space.  Having user space command queues and completion queues and
> pre-registered/posted memory buffers completely avoids the memcpy from
> kernel to user space associated with all other network technologies.
> The kernel didn't get that advantage because the core networking code
> can hand a regular tcp skbuff to the nfs code (for instance) without a
> memcpy.  For that reason, kernel benefits are secondary to user space
> benefits when it comes to RDMA.  And yet you are arguing that the
> kernel's ability to benefit from the driver/hardware should be the
> gating factor as to whether or not it should be merged.  I have to
> disagree.

I assume that NVMe over fabric users/developers will disagree with
you about such primary/secondary separation.

Thanks

>
> > > Also, note that nvmeof and the others have pretty much the same
> > > requirements from the rdma core.
> >
> > Yes, which is why I pointed at it..
> >
> > Jason
>
> --
> Doug Ledford <dledford@redhat.com>
>     GPG KeyID: B826A3330E572FDD
>     Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD
Jason Gunthorpe Jan. 3, 2019, 7:50 p.m. UTC | #50
On Thu, Jan 03, 2019 at 01:52:50PM -0500, Doug Ledford wrote:
> On Thu, 2019-01-03 at 10:59 -0700, Jason Gunthorpe wrote:
> > On Thu, Jan 03, 2019 at 02:00:48PM +0200, Gal Pressman wrote:
> > 
> > > Correct, since there's no libibverbs EFA provider, libfabric EFA provider cannot
> > > make use of ibv_* calls/verbs provider.
> > 
> > Only usnic has ever bypassed libibverbs like this, so you are not in
> > good company.
> 
> I don't think that's true at all.  Just a quick glance shows providers
> for psm and psm2, and I'm pretty sure they don't use libibverbs at all
> but go direct to the ipath char device files.

The psm providers are not using uverbs, they use their ipath char
device. AFAIK only usnic invokes *uverbs* directly without calling
through libibverbs.

Jason
Hefty, Sean Jan. 3, 2019, 8:14 p.m. UTC | #51
> I assume that NVMe over fabric users/developers will disagree with
> you about such primary/secondary separation.

The points both Sagi and Doug are making are valid.  The primary users of hardware in the linux-rdma subsystem are user space applications.  That's why usNIC doesn't care about kernel users.  That's why EFA has no kernel support.  The qib and hfi1 drivers consider kernel support secondary and optimize for user space.  Neither of these optimizes for libibverbs, and I doubt any apps actually use those devices through libibverbs.

Requiring support for a specific user space library is basically pointless and unenforceable.  Once the kABI is exposed, any software can write to it.

- Sean
Jason Gunthorpe Jan. 3, 2019, 9:22 p.m. UTC | #52
On Thu, Jan 03, 2019 at 08:14:54PM +0000, Hefty, Sean wrote:
> > I assume that NVMe over fabric users/developers will disagree with
> > you about such primary/secondary separation.
> 
> The point both Sagi and Doug are making are valid.  The primary
> users of hardware in the linux-rdma subsystem are user space
> applications.

That is only true in certain verticals, particularly HPC. More
generally there are lots of verticals that care about RDMA only to run
kernel-based things like NVMeoF or SMB-Direct (windows) and never run
a user application. Arguably there actually are more customers by
number in the latter (though possibly more CPU cores/adaptors in the
former).

The original point is still un-addressed. We, as a community, have
identified usnic as a bad idea that does not belong in
drivers/infiniband. This has been a pretty much universal position of
everyone who has worked on the core RDMA stack.

Do we want to reverse course on that and accept that usnic-like things
are actually OK? (ie no kverbs, no libibverbs and totally proprietary
everything)

I still haven't actually yet heard people saying yes to the above..

The discussion about kernel support started because I said supporting
the existing RC ULPs *clearly* means the device is not a 'usnic'
class driver.

Nobody has presented another criteria that makes EFA and usnic any
different, so we are back to the start. Was usnic a bad idea or not?
How do we support 'usnic' style devices without wrecking the rest of
the stack?

verbs is supposed to be a multi-vendor standard, not an enumeration of
every kind of proprietary device-specific behavior.

I'm kind of half wondering if the idea to hide 'usnic' devices from
the kernel ULPs doesn't go far enough - maybe we should have a
/dev/urdma for them as well? That would also go a long way to address
some of my past annoyance that hfi/qib have a private cdev..

> That's why usNIC doesn't care about kernel users.  That's why EFA
> has no kernel support.  

I'd say they don't care because they aren't actually classic RDMA
devices, they are something new.

> Requiring support for a specific user space library is basically
> pointless and unenforceable.  Once the kABI is exposed, any software
> can write to it.

It is probably a bad idea to implement the uverbs abi without copying
the new machinery for doing so from verbs, it is complicated, and the
ioctl support makes it even more tricky.

Let alone the issue that libibverbs assumes it has providers for all
the system devices and pukes if it doesn't - that will need fixing too
if we actually want to properly support this model.

Jason
Doug Ledford Jan. 3, 2019, 11:32 p.m. UTC | #53
On Thu, 2019-01-03 at 14:22 -0700, Jason Gunthorpe wrote:
> On Thu, Jan 03, 2019 at 08:14:54PM +0000, Hefty, Sean wrote:
> > > I assume that NVMe over fabric users/developers will disagree with
> > > you about such primary/secondary separation.
> > 
> > The point both Sagi and Doug are making are valid.  The primary
> > users of hardware in the linux-rdma subsystem are user space
> > applications.
> 
> That is only true in certain verticals, particularly HPC. More
> generally there are lots of verticals that care about RDMA only to run
> kernel-based things like NVMeoF or SMB-Direct (windows) and never run
> a user applcation. Arguably there actually are more customers by
> number in the latter (though possibly more CPU cores/adaptors in the
> former)

I doubt that.  After all, everything Jump Trading does is user space
just like HPC.  So is everything based on WebSphere low latency messaging.
I don't know an easy way to get an accurate analysis of this point, but
I don't really think it matters all that much.  Clearly, user space
applications that don't need kernel support do exist and are a valid
segment of our user base.  Whether they are the majority, they are still
valid.

> The original point is still un-addressed. We, as a community, have
> identified usnic as a bad idea that does not belong in
> drivers/infiniband. This has been a pretty much universal position of
> everyone who has worked on the core RDMA stack.

I'll generally agree with this.  It turns out usnic is not so much about
access to a shared device with PD, QP, CQ, MR and AH support, but an
SRIOV network device attached to a process with minimal (but sufficient
I think) security measures to at least make sure that process can't
listen in on other traffic.  From there, it's just UDP on a private
network adapter.

> Do we want to reverse course on that and accept that usnic-like things
> are actually OK? (ie no kverbs, no libibverbs and totally proprietary
> everything)

Even if we say the above is true, there are other factors to consider. 
Unlike usnic, efa follows much more closely to the existing rdma device
model.  It's not a separate device per process, it's a single shared
device.  It actually *does* have the concept of PD, CQ, QP, and AH.  It
doesn't do UDP, it actually does post_send and post_recv.  So even if it
appears usnic-like in having no kverbs and no libibverbs (at the moment), it
differs greatly from usnic in these other traits.

> I still haven't actually yet heard people saying yes to the above..

I think usnic falls too far outside of the RDMA model for my tastes. 
I'm not inclined to say we should reverse course on something that far
apart being a good thing.  But, the fact that efa follows the semantics
of the RDMA subsystem when it comes to all the other traits I listed
above makes me inclined to be more accepting of efa.  If you add in a
libibverbs provider and actual official, working UD support, then even
more so.

> The discussion about kernel support started because I said supporting
> the existing RC ULPs *clearly* means the devices is not a 'usnic'
> class driver.

Agreed.  That would be a clear indicator.  But just because this is a
clear differentiator does not mean lack of this is sufficient to
indicate the opposite.  See above for the various traits that make me
think the efa adapter is not the same as usnic.

> Nobody has presented another criteria that makes EFA and usnic any
> different, so we are back to the start. Was usnic a bad idea or not?
> How do we support 'usnic' style devices without wrecking the rest of
> the stack?
> 
> verbs is supposed to be a multi-vendor standard, not an enumeration of
> every kind of proprietary device-specific behavior.

What it's supposed to be...and what it is...well, that's evolving on a
day to day basis, and it's already got plenty of proprietary, device-
specific behavior.

To be honest, I don't even consider libibverbs a viable programming
interface anymore for the most part.  What I mean by this is that
libibverbs exists, and there are already existing consumers of
libibverbs.  I expect those things to continue.  But, for the most part,
people are encouraged nowadays to use another interface at a higher
level that then uses libibverbs behind the scenes: libfabric, UCX, MPIs,
existing low latency middlewares, etc.  Nobody I know of is telling
people to write new code to libibverbs.

When thought of that way, I'm very happy that when we pulled rdma-core
together, we used the notion that "anything that interacts with kernel device
files and talks to the core RDMA kernel ABI" was appropriate for rdma-
core.  That way libibverbs can be the common layer all of the other
things use to talk to the kernel while providing a more friendly API to
the user than just raw verbs.

But by that same token, if we are the common, device file opening,
kernel interacting layer, then we must also be the place where all of
the proprietary vendor specific stuff is implemented.  And we already
are.  I cite all of the recent Mellanox Direct Verbs stuff as exactly
that.  It does raise the question of whether, to be consistent, we
should have requested psm/psm2 also be part of rdma-core.  Mellanox uses
the driver direct model, Intel uses a separate cdev, and efa is
apparently intending to follow Mellanox's example.  I don't see a
problem with that. 

> I'm kind of half wondering if the idea to hide 'usnic' devices from
> the kernel ULPs doesn't go far enough - maybe we should have a
> /dev/urdma for them as well? That would also go a long way to address
> some of my past annoyance that hfi/qib have a private cdev..
> 
> > That's why usNIC doesn't care about kernel users.  That's why EFA
> > has no kernel support.  
> 
> I'd say they don't care because they aren't actually classic RDMA
> devices, they are something new.
> 
> > Requiring support for a specific user space library is basically
> > pointless and unenforceable.  Once the kABI is exposed, any software
> > can write to it.
> 
> It is probably a bad idea to implement the uverbs abi without copying
> the new machinery for doing so from verbs, it is complicted, and the
> ioctl support makes it even more tricky.

I second the idea that trying to do the uverbs kernel API outside
libibverbs is a bad idea.  My preference would be to have a new
libibverbs provider that supports what efa needs and then have libfabric
use that, simply because of the issue you mention here.

> Let alone the issue that libibverbs, assumes it has providers for all
> the system devices and pukes if it doesn't - that will need fixing too
> if we actually want to properly support this model.

We can fix libibverbs if we ever do take a driver without a verbs
driver, but I don't know that we are there yet, and I still think
libibverbs as the shim to interface with the kernel is a good idea.
Hefty, Sean Jan. 3, 2019, 11:34 p.m. UTC | #54
> Nobody has presented another criteria that makes EFA and usnic any
> different, so we are back to the start. Was usnic a bad idea or not?
> How do we support 'usnic' style devices without wrecking the rest of
> the stack?

Let's start by agreeing on what verbs is: an abstract definition of the functionality that a NIC supports.  From there, define the uverbs objects in terms of generic behavior, with addressing kept separate.

I thought the primary requirement for a driver was to have at least one user.  It doesn't matter if the user is another kernel driver or a Linux-ABI.

If we want to be more specific, does EFA support some subset of the commands and data structures defined in ib_user_verbs.h in a manner that's similar to other devices?  Can the uverbs driver manage allocated resources (PDs, MRs, QPs, CQs, AHs) the same as other devices?  Based on patch 11 and responses to my questions, it looks that way to me.  There are a couple areas where the functionality is missing (e.g. QP state transitions), but the submission looks closer than not.

- Sean
Hefty, Sean Jan. 4, 2019, 12:53 a.m. UTC | #55
> > It is probably a bad idea to implement the uverbs abi without copying
> > the new machinery for doing so from verbs, it is complicted, and the
> > ioctl support makes it even more tricky.
> 
> I second the idea that trying to do the uverbs kernel API outside
> libibverbs is a bad idea.  My preference would be to have a new
> libibverbs provider that supports what efa needs and then have libfabric
> use that, simply because of the issue you mention here.

I would agree if we can figure out a way to avoid the libibverbs API overhead for fast path operations.  I don't believe this would be noticeable for EFA, specifically, but it shows up on other NICs.

- Sean
Jason Gunthorpe Jan. 4, 2019, 3:39 a.m. UTC | #56
On Fri, Jan 04, 2019 at 12:53:03AM +0000, Hefty, Sean wrote:
> > > It is probably a bad idea to implement the uverbs abi without copying
> > > the new machinery for doing so from verbs, it is complicted, and the
> > > ioctl support makes it even more tricky.
> > 
> > I second the idea that trying to do the uverbs kernel API outside
> > libibverbs is a bad idea.  My preference would be to have a new
> > libibverbs provider that supports what efa needs and then have libfabric
> > use that, simply because of the issue you mention here.
> 
> I would agree if we can figure out a way to avoid the libibverbs API
> overhead for fast path operations.  

This has been available for years now, the 'dv' interface does exactly
this sort of bypass. The Mellanox drivers in ucx, dpdk, vma, etc all
use the dv path for datapath operations.

I think there is a consensus that EFA should build its libfabric
provider on libibverbs & 'efadv'..
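
[ To make the 'dv' bypass concrete, a rough mlx5dv sketch: the generic
  ibv_qp is translated into raw queue/doorbell pointers once, and the
  provider's own WQE-building code then drives the data path without
  going through ibv_post_send().  A hypothetical 'efadv' equivalent would
  have the same shape: ]

#include <infiniband/mlx5dv.h>

/* Expose the device-level layout of a verbs QP so user space can build
 * work requests directly. */
static int dv_expose_qp(struct ibv_qp *qp, struct mlx5dv_qp *raw_qp)
{
	struct mlx5dv_obj obj = {
		.qp = { .in = qp, .out = raw_qp },
	};

	return mlx5dv_init_obj(&obj, MLX5DV_OBJ_QP);
}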

Jason
Jason Gunthorpe Jan. 4, 2019, 4:11 a.m. UTC | #57
On Thu, Jan 03, 2019 at 06:32:56PM -0500, Doug Ledford wrote:

> > The original point is still un-addressed. We, as a community, have
> > identified usnic as a bad idea that does not belong in
> > drivers/infiniband. This has been a pretty much universal position of
> > everyone who has worked on the core RDMA stack.
> 
> I'll generally agree with this.  It turns out usnic is not so much about
> access to a shared device with PD, QP, CQ, MR and AH support, but an
> SRIOV network device attached to a process with minimal (but sufficient
> I think) security measures to at least make sure that process can't
> listen in on other traffic.  From there, it's just UDP on a private
> network adapter.

EFA looks the same to me, except instead of UDP it is doing a bit
more, but it is still just datagram processing. There are no verbs here
other than SEND and RECV. There is no RMA..

The functions usnic and EFA stub to EOPNOTSUPP/null look super similar
to me.

The fake PD construction looks similar, except EFA uses something
outside the driver to create it.

Neither supports req_notify_cq, so they don't really support verbs CQ
semantics. I guess polling only?
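
[ For context, this is the CQ arming/event flow that req_notify_cq backs
  in standard verbs; a polling-only device can still loop on ibv_poll_cq()
  but cannot block on a completion channel.  Sketch with error paths
  trimmed: ]

#include <infiniband/verbs.h>

static int wait_and_drain_cq(struct ibv_comp_channel *ch, struct ibv_cq *cq)
{
	struct ibv_cq *ev_cq;
	void *ev_ctx;
	struct ibv_wc wc;

	ibv_req_notify_cq(cq, 0);		/* arm the CQ */
	ibv_get_cq_event(ch, &ev_cq, &ev_ctx);	/* sleep until it fires */
	ibv_ack_cq_events(ev_cq, 1);
	ibv_req_notify_cq(ev_cq, 0);		/* re-arm before draining */
	while (ibv_poll_cq(ev_cq, 1, &wc) > 0)
		;				/* handle each wc here */
	return 0;
}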

> > Do we want to reverse course on that and accept that usnic-like things
> > are actually OK? (ie no kverbs, no libibverbs and totally proprietary
> > everything)
> 
> Even if we say the above is true, there are other factors to consider. 
> Unlike usnic, efa follows much more closely to the existing rdma device
> model.  It's not a separate device per process, it's a single shared
> device. 

I've never used usnic, is this true? It sure looks like it can
allocate ucontexts dynamically?? I think that is what the vnic stuff
is for?? Maybe the ucontext limit is pretty small, but there sure is
a lot of code here if the limit is just 1.. We also don't know the EFA
ucontext limit ...

> It actually *does* have the concept of PD, CQ, QP, and AH.  It
> doesn't do UDP, it actually does post_send and post_recv.  

I think it is the same, usnic once had a post_send/recv implementation
in its libibverbs provider, pretty much anything that has a command
queue for datagram could implement basic (but maybe non-conforming)
post_send/recv verbs for UD.

So if there is an EFA provider it would be in the same place as where
usnic started before libfabric <shrug>

> above makes me inclined to be more accepting of efa.  If you add in a
> libibverbs provider and actual official, working UD support, then even
> more so.

Running through libibverbs DV and having enough to be able to do the
standard ud_ping_pong would be better.

But, IIRC, usnic could do ud_ping_pong at one point too.

> > verbs is supposed to be a multi-vendor standard, not an enumeration of
> > every kind of proprietary device-specific behavior.
> 
> What it's supposed to be...and what it is...well, that's evolving on a
> day to day basis, and it's already got plenty of proprietary, device-
> specific behavior.

It certainly does..
 
> To be honest, I don't even consider libibverbs a viable programming
> interface anymore for the most part.  What I mean by this is that
> libibverbs exists, and there are already existing consumers of
> libibverbs.  I expect those things to continue.  But, for the most part,
> people are encouraged now a days to use another interface at a higher
> level that then uses libibverbs behind the scenes: libfabric, UCX, MPIs,
> existing low latency middlewares, etc.  Nobody I know if is telling
> people to write new code to libibverbs.

I'm still seeing/hearing that libibverbs is the go-to choice for things
that are not HPC-style focused, as all the alternatives are currently
unsuitable in some way.

It is also still the only choice if you want heterogenous
interoperability.

But libibverbs is a horrible mess of an API at this point..

> But by that same token, if we are the common, device file opening,
> kernel interacting layer, then we must also be the place where all of
> the proprietary vendor specific stuff is implemented.  And we already
> are.  I cite all of the recent Mellanox Direct Verbs stuff as exactly
> that.  

We definitely have allowed a side-car of proprietary stuff. The qib
chardev started it, and the Mellanox DV/devx stuff is a more
integrated take on the exact same idea..

> It does raise the question of whether, to be consistent, we should
> have requested psm/psm2 also be part of rdma-core.

I did ask, was told 'not interested' :)

> Mellanox uses the driver direct model, Intel uses a separate cdev,
> and efa is apparently intending to follow Mellanox's example.  I
> don't see a problem with that.

Nor I..

> > Let alone the issue that libibverbs, assumes it has providers for all
> > the system devices and pukes if it doesn't - that will need fixing too
> > if we actually want to properly support this model.
> 
> We can fix libibverbs if we ever do take a driver without a verbs
> driver, but I don't know that we are there yet, and I still think
> libibverbs as the shim to interface with the kernel is a good idea.

Well we already have usnic without a verbs provider, everyone just
ignores the breakage it inflicts.

I think that is a really bad place for a new driver to start out at..

Jason
Jason Gunthorpe Jan. 4, 2019, 4:25 a.m. UTC | #58
On Thu, Jan 03, 2019 at 11:34:54PM +0000, Hefty, Sean wrote:
> > Nobody has presented another criteria that makes EFA and usnic any
> > different, so we are back to the start. Was usnic a bad idea or not?
> > How do we support 'usnic' style devices without wrecking the rest of
> > the stack?
> 
> Let's start by agreeing on what verbs is: an abstract definition of
> the functionality that a NIC supports.  

No, I don't think that at all. Linux common verbs is a concrete
realization of the IBA and iWarp specifications - with well
defined behaviors from the application layer right down to exactly
which packets are formed on the wire.

The entire point of the verbs ecosystem was to enable heterogenous
interoperability - by being extremely well specified. We now have
many, many vendors implementing interoperable verbs to that spec.

So, to me, anything that is not in that multi-vendor interoperable
space is really NOT common verbs. Extended verbs, direct verbs, maybe,
but certainly not common verbs.

I think you are transferring libfabric thinking to Linux verbs..

> From there, define the uverbs objects in terms of generic behavior,
> with addressing kept separate.

This is what libfabric did :)

> I thought the primary requirement for a driver was to have at least
> one user.  It doesn't matter if the user is another kernel driver or
> a Linux-ABI.

It is a hard minimum requirement, not the only one.. Drivers still
have to fit into the subsystem. Ie you could use the GPU subsystem to
expose EFA, that would be totally nutz, but one could warp the APIs to
fit :(

> If we want to be more specific, does EFA support some subset of the
> commands and data structures defined ib_user_verbs.h in a manner
> that's similar to other devices?  Can the uverbs driver manage
> allocated resources (PDs, MRs, QPs, CQs, AHs) the same as other
> devices.  Based on patch 11 and responses to my questions, it looks
> that way to me.  There are a couple areas where the functionality is
> missing (e.g. QP state transitions), but the submission looks closer
> than not.

I think you are stretching into your 'abstract verbs' idea to claim
the objects are the same as other devices:

A CQ without notification is not really a CQ.

A QP without the verbs state machine is not really a QP.

A PD without rkeys is not really a PD.

Maybe this is OK, but it clearly isn't HW that is implementing
verbs. It is some pick and choose of a convenient subset of common
verbs..

Jason
Gal Pressman Jan. 4, 2019, 10:12 a.m. UTC | #59
On 04-Jan-19 05:39, Jason Gunthorpe wrote:
> On Fri, Jan 04, 2019 at 12:53:03AM +0000, Hefty, Sean wrote:
>>>> It is probably a bad idea to implement the uverbs abi without copying
>>>> the new machinery for doing so from verbs, it is complicted, and the
>>>> ioctl support makes it even more tricky.
>>>
>>> I second the idea that trying to do the uverbs kernel API outside
>>> libibverbs is a bad idea.  My preference would be to have a new
>>> libibverbs provider that supports what efa needs and then have libfabric
>>> use that, simply because of the issue you mention here.
>>
>> I would agree if we can figure out a way to avoid the libibverbs API
>> overhead for fast path operations.  
> 
> This has been available for years now, the 'dv' interface does exactly
> this sort of bypass. The Mellanox drivers in ucx, dpdk, vma, etc all
> use the dv path for datapath operations.
> 
> I think that is a consensus that EFA should build its libfabric
> provider on libibverbs & 'efadv'..

As long as this won't hurt EFA's performance, sounds good to me.
We will do that.
Gal Pressman Jan. 4, 2019, 10:31 a.m. UTC | #60
On 04-Jan-19 06:11, Jason Gunthorpe wrote:
> On Thu, Jan 03, 2019 at 06:32:56PM -0500, Doug Ledford wrote:
> 
>>> The original point is still un-addressed. We, as a community, have
>>> identified usnic as a bad idea that does not belong in
>>> drivers/infiniband. This has been a pretty much universal position of
>>> everyone who has worked on the core RDMA stack.
>>
>> I'll generally agree with this.  It turns out usnic is not so much about
>> access to a shared device with PD, QP, CQ, MR and AH support, but an
>> SRIOV network device attached to a process with minimal (but sufficient
>> I think) security measures to at least make sure that process can't
>> listen in on other traffic.  From there, it's just UDP on a private
>> network adapter.
> 
> EFA looks the same to me, except instead of UDP it is doing a bit
> more, but it still just datagram processing. There are no verbs here
> other than SEND and RECV. There is no RMA..
> 
> The functions usnic and EFA stub to EOPNOTSUPP/null look super similar
> to me.

The stubs are there because of IB_MANDATORY_FUNC.

> 
> The fake PD construction looks similar, except EFA uses something
> outside the driver to create it.

I wouldn't call it fake. Just because the PD isn't backed up by an object in the
device doesn't mean it's not used by the device.
We can pass alloc_pd through our command interface to the device, it's just not
needed :).

> 
> Neither support req_notify_cq, so they don't really support verbs CQ
> semantics. I guess polling only?

This one is lack of implementation, not lack of support.
We can and will implement CQ notifications.

> 
>>> Do we want to reverse course on that and accept that usnic-like things
>>> are actually OK? (ie no kverbs, no libibverbs and totally proprietary
>>> everything)
>>
>> Even if we say the above is true, there are other factors to consider. 
>> Unlike usnic, efa follows much more closely to the existing rdma device
>> model.  It's not a separate device per process, it's a single shared
>> device. 
> 
> I've never used usnic, is this true? It sure looks like it can
> allocate ucontexts dynamically?? I think that is what the vnic stuff
> is for?? Maybe the ucontext limit is pretty small, but there sure is
> alot of code here if the limit is just 1.. We also don't know the EFA
> ucontext limit ...

There is no ucontext limit.

> 
>> It actually *does* have the concept of PD, CQ, QP, and AH.  It
>> doesn't do UDP, it actually does post_send and post_recv.  
> 
> I think it is the same, usnic once had a post_send/recv implementation
> in it's libibverbs provider, pretty much anything that has a command
> queue for datagram could implement basic (but maybe non-conforming)
> post_send/recv verbs for UD.
> 
> So if there is a EFA provider it would be in the same place as where
> usnic started at before libfabric <shrug>
> 
>> above makes me inclined to be more accepting of efa.  If you add in a
>> libibverbs provider and actual official, working UD support, then even
>> more so.
> 
> running through libibverbs DV and having enough to be an to do the
> standard ud_ping_pong would be better.
> 
> But, IIRC, usnic could do ud_ping_pong at one point too.
> 
>>> verbs is supposed to be a multi-vendor standard, not an enumeration of
>>> every kind of proprietary device-specific behavior.
>>
>> What it's supposed to be...and what it is...well, that's evolving on a
>> day to day basis, and it's already got plenty of proprietary, device-
>> specific behavior.
> 
> It certainly does..
>  
>> To be honest, I don't even consider libibverbs a viable programming
>> interface anymore for the most part.  What I mean by this is that
>> libibverbs exists, and there are already existing consumers of
>> libibverbs.  I expect those things to continue.  But, for the most part,
>> people are encouraged now a days to use another interface at a higher
>> level that then uses libibverbs behind the scenes: libfabric, UCX, MPIs,
>> existing low latency middlewares, etc.  Nobody I know if is telling
>> people to write new code to libibverbs.
> 
> I'm still seeing/hearing that libibverbs is the goto choice for things
> that are not HPC-style focused as all the alternatives are currently
> unsuitable in some way.
> 
> It is also still the only choice if you want heterogenous
> interoperability.
> 
> But libibverbs is a horrible mess of an API at this point..
> 
>> But by that same token, if we are the common, device file opening,
>> kernel interacting layer, then we must also be the place where all of
>> the proprietary vendor specific stuff is implemented.  And we already
>> are.  I cite all of the recent Mellanox Direct Verbs stuff as exactly
>> that.  
> 
> We definately have allowed a side-car of proprietary stuff. The qib
> chardev started it, and the Mellanox DV/devx stuff is a more
> integrated take on the exact same idea..
> 
>> It does raise the question of whether, to be consistent, we should
>> have requested psm/psm2 also be part of rdma-core.
> 
> I did ask, was told 'not interested' :)
> 
>> Mellanox uses the driver direct model, Intel uses a separate cdev,
>> and efa is apparently intending to follow Mellanox's example.  I
>> don't see a problem with that.
> 
> Nor I..
> 
>>> Let alone the issue that libibverbs, assumes it has providers for all
>>> the system devices and pukes if it doesn't - that will need fixing too
>>> if we actually want to properly support this model.
>>
>> We can fix libibverbs if we ever do take a driver without a verbs
>> driver, but I don't know that we are there yet, and I still think
>> libibverbs as the shim to interface with the kernel is a good idea.
> 
> Well we already have usnic without a verbs provider, everyone just
> ignores the breakage it inflicts.
> 
> I think that is a really bad place for a new driver to start out at..
> 
> Jason
>
Jason Gunthorpe Jan. 4, 2019, 4:35 p.m. UTC | #61
On Fri, Jan 04, 2019 at 12:31:21PM +0200, Gal Pressman wrote:
> On 04-Jan-19 06:11, Jason Gunthorpe wrote:
> > On Thu, Jan 03, 2019 at 06:32:56PM -0500, Doug Ledford wrote:
> > 
> >>> The original point is still un-addressed. We, as a community, have
> >>> identified usnic as a bad idea that does not belong in
> >>> drivers/infiniband. This has been a pretty much universal position of
> >>> everyone who has worked on the core RDMA stack.
> >>
> >> I'll generally agree with this.  It turns out usnic is not so much about
> >> access to a shared device with PD, QP, CQ, MR and AH support, but an
> >> SRIOV network device attached to a process with minimal (but sufficient
> >> I think) security measures to at least make sure that process can't
> >> listen in on other traffic.  From there, it's just UDP on a private
> >> network adapter.
> > 
> > EFA looks the same to me, except instead of UDP it is doing a bit
> > more, but it still just datagram processing. There are no verbs here
> > other than SEND and RECV. There is no RMA..
> > 
> > The functions usnic and EFA stub to EOPNOTSUPP/null look super similar
> > to me.
> 
> The stubs are there because of IB_MANDATORY_FUNC.

Yes, because those functions are *MANDATORY* to be an actual verbs
device :) 

That is another thing that should be revised for these no-kverbs
things, I don't want to see stubs..

In fact seeing null in any mandatory function pointers is a good way
to trigger the 'clients cannot be attached' behavior in your earlier
RFC patch.
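
[ Illustrative only, not the actual RFC patch: if the mandatory kverbs
  callbacks were simply left NULL instead of stubbed to EOPNOTSUPP, the
  core could key the 'don't attach in-kernel clients' behavior off that.
  The helper name below is made up: ]

#include <rdma/ib_verbs.h>

/* A device with any of these left NULL would be registered for uverbs
 * only and skipped when attaching kernel clients such as the RC ULPs. */
static bool ib_device_supports_kverbs(const struct ib_device *dev)
{
	return dev->post_send && dev->post_recv &&
	       dev->poll_cq && dev->req_notify_cq;
}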

> > I've never used usnic, is this true? It sure looks like it can
> > allocate ucontexts dynamically?? I think that is what the vnic stuff
> > is for?? Maybe the ucontext limit is pretty small, but there sure is
> > alot of code here if the limit is just 1.. We also don't know the EFA
> > ucontext limit ...
> 
> There is no ucontext limit.

Oh? EFA is sharing BAR pages between user processes? You have a
security proof that is OK?

Jason
Leon Romanovsky Jan. 4, 2019, 5:15 p.m. UTC | #62
On Fri, Jan 04, 2019 at 09:35:47AM -0700, Jason Gunthorpe wrote:
> On Fri, Jan 04, 2019 at 12:31:21PM +0200, Gal Pressman wrote:
> > On 04-Jan-19 06:11, Jason Gunthorpe wrote:
> > > On Thu, Jan 03, 2019 at 06:32:56PM -0500, Doug Ledford wrote:
> > >
> > >>> The original point is still un-addressed. We, as a community, have
> > >>> identified usnic as a bad idea that does not belong in
> > >>> drivers/infiniband. This has been a pretty much universal position of
> > >>> everyone who has worked on the core RDMA stack.
> > >>
> > >> I'll generally agree with this.  It turns out usnic is not so much about
> > >> access to a shared device with PD, QP, CQ, MR and AH support, but an
> > >> SRIOV network device attached to a process with minimal (but sufficient
> > >> I think) security measures to at least make sure that process can't
> > >> listen in on other traffic.  From there, it's just UDP on a private
> > >> network adapter.
> > >
> > > EFA looks the same to me, except instead of UDP it is doing a bit
> > > more, but it still just datagram processing. There are no verbs here
> > > other than SEND and RECV. There is no RMA..
> > >
> > > The functions usnic and EFA stub to EOPNOTSUPP/null look super similar
> > > to me.
> >
> > The stubs are there because of IB_MANDATORY_FUNC.
>
> Yes, because those functions are *MANDATORY* to be an actual verbs
> device :)
>
> That is another thing that should be revised for these no-kverbs
> things, I don't want to see stubs..
>
> In fact seeing null in any mandatory function pointers is a good way
> to trigger the 'clients cannot be attached' behavior in your earlier
> RFC patch.
>
> > > I've never used usnic, is this true? It sure looks like it can
> > > allocate ucontexts dynamically?? I think that is what the vnic stuff
> > > is for?? Maybe the ucontext limit is pretty small, but there sure is
> > > alot of code here if the limit is just 1.. We also don't know the EFA
> > > ucontext limit ...
> >
> > There is no ucontext limit.
>
> Oh? EFA is sharing BAR pages between user processes? You have a
> security proof that is OK?

For a proprietary device with a proprietary protocol on a proprietary network?

Thanks

>
> Jason
Hefty, Sean Jan. 4, 2019, 7:45 p.m. UTC | #63
> > Let's start by agreeing on what verbs is: an abstract definition of
> > the functionality that a NIC supports.
> 
> No, I don't think that at all. Linux common verbs is a concrete
> realization of the the IBA and iWarp specifications - with well
> defined behaviors from the application layer right down to exactly
> which packets are formed on the wire.

I was using the definition provided by the IB spec, only replacing HCA with NIC.

> The entire point of the verbs ecosystem was to enable heterogenous
> interoperability - by being extremely well specified. We now have
> many, many vendors implementing interoperable verbs to that spec.

Nearly every single command carried across the uverbs ABI carries device-specific data.  It's not generic at all.  Even the values carried in common structures depend not only on the device, but its current state.  Is it IB? RoCEv1? RoCEv2? iWarp?  What is the current link? IB? Ethernet?

It seems like we're trying to apply a requirement that is at odds with reality.  Uverbs is a common framework for carrying data and commands between user space and a *specific* kernel device driver.  It doesn't guarantee interoperability and user space has to know exactly what device it is communicating with.

> > If we want to be more specific, does EFA support some subset of the
> > commands and data structures defined ib_user_verbs.h in a manner
> > that's similar to other devices?  Can the uverbs driver manage
> > allocated resources (PDs, MRs, QPs, CQs, AHs) the same as other
> > devices.  Based on patch 11 and responses to my questions, it looks
> > that way to me.  There are a couple areas where the functionality is
> > missing (e.g. QP state transitions), but the submission looks closer
> > than not.
> 
> I think you are stretching into your 'abstract verbs' idea to claim
> the objects are the same as other devices:

No - I’m looking at the enum and structure definitions in ib_user_verbs.h and seeing if they are used in a similar way.  I.e. what changes, if any, were needed to the uverbs driver to support the calls?

> A CQ without notification is not really a CQ.

Hard polling on a CQ is a valid programming model.  Regardless, what does it matter if a device responds to an IB_USER_VERBS_CMD_REQ_NOTIFY_CQ with ENOSYS?  Why should that be considered unacceptable?
 
> A QP without the verbs state machine is not really a QP.

I see no reason to restrict what states a vendor must implement for their QP, or require that the state changes must be done through a kernel call.  I'm pretty sure IB and iWarp define different states anyway.
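
[ For reference, the 'verbs state machine' under discussion: the standard
  RESET -> INIT -> RTR -> RTS sequence a UD QP walks through via
  ibv_modify_qp(), i.e. via kernel calls.  Sketch only: ]

#include <infiniband/verbs.h>

static int ud_qp_to_rts(struct ibv_qp *qp, uint8_t port, uint32_t qkey)
{
	struct ibv_qp_attr attr = {
		.qp_state   = IBV_QPS_INIT,
		.pkey_index = 0,
		.port_num   = port,
		.qkey       = qkey,
	};

	if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
				     IBV_QP_PORT | IBV_QP_QKEY))
		return -1;

	attr.qp_state = IBV_QPS_RTR;
	if (ibv_modify_qp(qp, &attr, IBV_QP_STATE))
		return -1;

	attr.qp_state = IBV_QPS_RTS;
	attr.sq_psn   = 0;
	if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_SQ_PSN))
		return -1;
	return 0;
}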

> A PD without rkeys is not really a PD.

Some sort of hardware protection mechanism to isolate processes is needed when OS bypass is involved.

I'm pretty sure we're never going to agree on any of this.

- Sean
Hefty, Sean Jan. 4, 2019, 7:58 p.m. UTC | #64
> > I would agree if we can figure out a way to avoid the libibverbs API
> > overhead for fast path operations.
> 
> This has been available for years now, the 'dv' interface does exactly
> this sort of bypass. The Mellanox drivers in ucx, dpdk, vma, etc all
> use the dv path for datapath operations.

Something more generic that would avoid per-device interfaces would be ideal.

> I think that is a consensus that EFA should build its libfabric
> provider on libibverbs & 'efadv'..

The user space implementation should be an orthogonal discussion from the kernel.

- Sean
Jason Gunthorpe Jan. 4, 2019, 8:15 p.m. UTC | #65
On Fri, Jan 04, 2019 at 07:58:09PM +0000, Hefty, Sean wrote:
> > > I would agree if we can figure out a way to avoid the libibverbs API
> > > overhead for fast path operations.
> > 
> > This has been available for years now, the 'dv' interface does exactly
> > this sort of bypass. The Mellanox drivers in ucx, dpdk, vma, etc all
> > use the dv path for datapath operations.
> 
> Something more generic that would avoid per device interfaces would be ideal.

We have that, post_send/post_recv. If that is too slow then you have
to use device specific DV, which has no limitations on dataplane
performance.

> > I think that is a consensus that EFA should build its libfabric
> > provider on libibverbs & 'efadv'..
> 
> The user space implementation should be an orthogonal discussion
> from the kernel.

Why? We have to know what to review and kernel code is not going to be
accepted without a user space implementation.

Jason
Jason Gunthorpe Jan. 4, 2019, 8:44 p.m. UTC | #66
On Fri, Jan 04, 2019 at 07:45:59PM +0000, Hefty, Sean wrote:
> > > Let's start by agreeing on what verbs is: an abstract definition of
> > > the functionality that a NIC supports.
> > 
> > No, I don't think that at all. Linux common verbs is a concrete
> > realization of the the IBA and iWarp specifications - with well
> > defined behaviors from the application layer right down to exactly
> > which packets are formed on the wire.
> 
> I was using definition provided by the IB spec, only replacing HCA
> with NIC.

The IB spec is not Linux common verbs, and the use of 'abstract' in the
IBA is not referring to some wide-ranging mandate but to an abstract
interface description to support the very specific behaviors required
in the spec. I.e. it is not defining an actual concrete C programming
interface.

> > The entire point of the verbs ecosystem was to enable heterogenous
> > interoperability - by being extremely well specified. We now have
> > many, many vendors implementing interoperable verbs to that spec.
> 
> Nearly every single command carried across the uverbs ABI carries
> device specific data.  It's not generic at all.  

I was not speaking of uverbs, I was talking about kverbs and
libibverbs. The ABI is not intended for any external consumption, nor
is it imagined to be device agnostic.

This is why uverbs has been so attractive to stuff like efa/usnic,
they can stuff all the non-standard things into the device specific
data and just ignore the common structures.

However, from a subsystem perspective we are only allowing 'rdma'
devices to use uverbs.

I have been asking for some kind of consensus on what qualifies
something to be an 'rdma' device.  We've agreed on a high-water mark
(NVMEof) but what is the low-water mark?

> > > If we want to be more specific, does EFA support some subset of
> > > the commands and data structures defined ib_user_verbs.h in a
> > > manner that's similar to other devices?  Can the uverbs driver
> > > manage allocated resources (PDs, MRs, QPs, CQs, AHs) the same as
> > > other devices.  Based on patch 11 and responses to my questions,
> > > it looks that way to me.  There are a couple areas where the
> > > functionality is missing (e.g. QP state transitions), but the
> > > submission looks closer than not.
> > 
> > I think you are stretching into your 'abstract verbs' idea to claim
> > the objects are the same as other devices:
> 
> No - I’m looking at the enum and structure definitions in
> ib_user_verbs.h and seeing if they are used in a similar way.
> I.e. what changes, if any, were needed to the uverbs driver to
> support the calls?

You are trying to say a bunch of apple seeds is the same as an apple -
and I really disagree. 

IMHO, EFA clearly does not reach the minimum threshold to be called a
RDMA common verbs device.

The question is if we should set the subsystem threshold for 'uverbs
access' to be below that (and how far below?).

> > A CQ without notification is not really a CQ.
> 
> Hard polling on a CQ is a valid programming model.  Regardless, what
> does it matter if a device responds to a
> IB_USER_VERBS_CMD_REQ_NOTIFY_CQ with ENOSYS?  Why should that be
> considered unacceptable?

In this case the only reason this ugly hack was put in the driver is
because the common code would *REFUSE* to register the driver unless
there is a req_notify_cq implementation (and others), since these have
been declared a mandatory part of common verbs for decades now.

The IBA has things marked MANDATORY for a reason - IMHO something that
doesn't implement 'enough' of the mandatory statements no longer
qualifies to be called verbs.

Jason
Sagi Grimberg Jan. 4, 2019, 11:37 p.m. UTC | #67
> In this case the only reason this ugly hack was put in the driver is
> because the common code would *REFUSE* to register the driver unless
> there is a req_notify_cq implementation (and others) because it is
> declared as a mandatory part of common verbs for decades now.
> 
> The IBA has things marked MANDATORY for a reason - IMHO something that
> doesn't implement 'enough' of the mandatory statements no longer
> qualifies to be called verbs.

Are there any mandatory callbacks that EFA refuses other than this one
(which Gal said that it will support)?
Jason Gunthorpe Jan. 4, 2019, 11:49 p.m. UTC | #68
On Fri, Jan 04, 2019 at 03:37:06PM -0800, Sagi Grimberg wrote:
> 
> > In this case the only reason this ugly hack was put in the driver is
> > because the common code would *REFUSE* to register the driver unless
> > there is a req_notify_cq implementation (and others) because it is
> > declared as a mandatory part of common verbs for decades now.
> > 
> > The IBA has things marked MANDATORY for a reason - IMHO something that
> > doesn't implement 'enough' of the mandatory statements no longer
> > qualifies to be called verbs.
> 
> Are there any mandatory callbacks that EFA refuses other than this one
> (which Gal said that it will support)?

All the EOPNOTSUPP ones. It *really* doesn't support kverbs at
all. Which is why my first thought was to just not present it via the
in-kernel client API at all.

Only req_notify really extends into userspace verbs though.

+/* In ib callbacks section -  Start of stub funcs */
+int efa_post_send(struct ib_qp *ibqp,
+		  const struct ib_send_wr *wr,
+		  const struct ib_send_wr **bad_wr)
+{
+	pr_warn("Function not supported\n");
+	return -EOPNOTSUPP;
+}
+
+int efa_post_recv(struct ib_qp *ibqp,
+		  const struct ib_recv_wr *wr,
+		  const struct ib_recv_wr **bad_wr)
+{
+	pr_warn("Function not supported\n");
+	return -EOPNOTSUPP;
+}
+
+int efa_poll_cq(struct ib_cq *ibcq, int num_entries,
+		struct ib_wc *wc)
+{
+	pr_warn("Function not supported\n");
+	return -EOPNOTSUPP;
+}
+
+int efa_req_notify_cq(struct ib_cq *ibcq,
+		      enum ib_cq_notify_flags flags)
+{
+	pr_warn("Function not supported\n");
+	return -EOPNOTSUPP;
+}
+
+struct ib_mr *efa_get_dma_mr(struct ib_pd *ibpd, int acc)
+{
+	pr_warn("Function not supported\n");
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+int efa_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+		  int attr_mask, struct ib_udata *udata)
+{
+	pr_warn("Function not supported\n");
+	return -EOPNOTSUPP;
+}
+
Sagi Grimberg Jan. 4, 2019, 11:57 p.m. UTC | #69
>> Are there any mandatory callbacks that EFA refuses other than this one
>> (which Gal said that it will support)?
> 
> All the EOPNOTSUPP ones. It *really* doesn't support kverbs at
> all.

Oh wow, didn't realize that :)

> Which is why my first thought was to just not present it via the
> in-kernel client API at all.

Definitely, it should not appear to the client API at all..
This patch is needed most definitely.

I'm sure that all of those (other than modify_qp) are trivial
to add but it doesn't really make sense to have them if userspace
is not using uverbs for them.

> Only req_notify really extends into userspace verbs though.

Is it? ibv_req_notify_cq doesn't go to the kernel afair..
Jason Gunthorpe Jan. 5, 2019, 12:22 a.m. UTC | #70
On Fri, Jan 04, 2019 at 03:57:38PM -0800, Sagi Grimberg wrote:
> 
> > > Are there any mandatory callbacks that EFA refuses other than this one
> > > (which Gal said that it will support)?
> > 
> > All the EOPNOTSUPP ones. It *really* doesn't support kverbs at
> > all.
> 
> Oh wow, didn't realize that :)

Makes the discussion take on a different light, doesn't it :)

> > Which is why my first thought was to just not present it via the
> > in-kernel client API at all.
> 
> Definitely, it should not appear to the client API at all..
> This patch is needed most definitely.

Okay, Gal should keep going on that then, and do something about the
mandatory verbs. Probably failing mandatory verbs tests should hide
the driver from the client API (except toward uverbs).
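
Something along these lines in the core is what I have in mind (a
sketch only; the helper is invented and the ops layout is just
illustrative, not necessarily what the core has at any given time):

#include <rdma/ib_verbs.h>

/* Sketch: a device that leaves any mandatory kverb unimplemented is
 * userspace-only. It may still register uverbs, but kernel ULPs
 * (IPoIB, NVMe-oF, SRP, ...) must never be attached to it.
 */
static bool ib_device_supports_kverbs(const struct ib_device *dev)
{
	return dev->ops.post_send && dev->ops.post_recv &&
	       dev->ops.poll_cq && dev->ops.req_notify_cq &&
	       dev->ops.get_dma_mr;
}

The client attach path would then simply skip devices where this
returns false, instead of forcing drivers to carry -EOPNOTSUPP stubs.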

> I'm sure that all of those (other than modify_qp) are trivial
> to add but it doesn't really make sense to have them if userspace
> is not using uverbs for them.

If there is no modify_qp then the driver doesn't support the verbs QP
FSM, and a lot of other stuff. Ie it is not really a verbs QP. :\

> > Only req_notify really extends into userspace verbs though.
> 
> Is it? ibv_req_notify_cq doesn't go to the kernel afair..

There is a IB_USER_VERBS_CMD_REQ_NOTIFY_CQ ..

Hmm.. Many drivers do this via a BAR page write, not via the syscall,
so I guess we don't know if EFA needs this or not.
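
For the BAR-write style the provider's req_notify is typically just a
doorbell store, schematically (the register layout below is invented,
every device has its own):

#include <stdint.h>

/* Purely schematic: arming a CQ can be a single MMIO store to a
 * mmap'ed doorbell page, with no kernel involvement at all.
 */
struct prov_cq {
	volatile uint32_t *arm_db;	/* points into the mmap'ed BAR */
	uint32_t ci;			/* current consumer index */
};

static int prov_req_notify_cq(struct prov_cq *cq, int solicited_only)
{
	uint32_t cmd = (solicited_only ? 1u << 31 : 0) | (cq->ci & 0xffffff);

	/* order the doorbell write after all prior CQE reads */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
	*cq->arm_db = cmd;
	return 0;
}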

Jason
Sagi Grimberg Jan. 5, 2019, 1:04 a.m. UTC | #71
>>>> Are there any mandatory callbacks that EFA refuses other than this one
>>>> (which Gal said that it will support)?
>>>
>>> All the EOPNOTSUPP ones. It *really* doesn't support kverbs at
>>> all.
>>
>> Oh wow, didn't realize that :)
> 
> Makes the discussion take on a different light, doesn't it :)

It sorta makes sense for efa I guess..

>>> Which is why my first thought was to just not present it via the
>>> in-kernel client API at all.
>>
>> Definitely, it should not appear to the client API at all..
>> This patch is needed most definitely.
> 
> Okay, Gal should keep going on that then, and do something about the
> mandatory verbs. Probably failing mandatory verbs tests should hide
> the driver from the client API (except toward uverbs).
> 
>> I'm sure that all of those (other than modify_qp) are trivial
>> to add but it doesn't really make sense to have them if userspace
>> is not using uverbs for them.
> 
> If there is no modify_qp then the driver doesn't support the verbs QP
> FSM, and a lot of other stuff. Ie it is not really a verbs QP. :\

Can you explain why the FSM matters for UD? I always thought
it's really just a shim because it's sort of irrelevant...

If that is the case, EFA can implement a trivial modify_qp for
UD QPs (i.e. do nothing, but at least fit the interface)?

>>> Only req_notify really extends into userspace verbs though.
>>
>> Is it? ibv_req_notify_cq doesn't go to the kernel afair..
> 
> There is a IB_USER_VERBS_CMD_REQ_NOTIFY_CQ ..
> 
> Hmm.. Many drivers do this via a BAR page write, not via the syscall,
> so I guess we don't know if EFA needs this or not.

I'm betting that it doesn't...
Jason Gunthorpe Jan. 5, 2019, 3:36 a.m. UTC | #72
On Fri, Jan 04, 2019 at 05:04:01PM -0800, Sagi Grimberg wrote:

> > > I'm sure that all of those (other than modify_qp) are trivial
> > > to add but it doesn't really make sense to have them if userspace
> > > is not using uverbs for them.
> > 
> > If there is no modify_qp then the driver doesn't support the verbs QP
> > FSM, and a lot of other stuff. Ie it is not really a verbs QP. :\
> 
> Can you explain why the FSM matters for UD? I always thought
> it's really just a shim because it's sort of irrelevant...

No, it has INIT/RTR/RTS like everything else. The purpose of the FSM
is to synchronize with setup: you shouldn't get a recv completion
until everything is ready (ie recvs are posted, etc) and RTR is
entered.
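
For reference, even a UD QP is normally driven through all three
transitions from userspace; roughly:

#include <infiniband/verbs.h>

/* Standard UD QP bring-up: RESET -> INIT -> RTR -> RTS */
static int ud_qp_to_rts(struct ibv_qp *qp, uint8_t port, uint32_t qkey)
{
	struct ibv_qp_attr attr = {
		.qp_state   = IBV_QPS_INIT,
		.pkey_index = 0,
		.port_num   = port,
		.qkey       = qkey,
	};
	int ret;

	ret = ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
				       IBV_QP_PORT | IBV_QP_QKEY);
	if (ret)
		return ret;

	/* Post receive WRs here, before RTR, so nothing is dropped. */

	attr.qp_state = IBV_QPS_RTR;
	ret = ibv_modify_qp(qp, &attr, IBV_QP_STATE);
	if (ret)
		return ret;

	attr.qp_state = IBV_QPS_RTS;
	attr.sq_psn = 0;
	return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_SQ_PSN);
}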

I think it also has some relevance if linking UD with CM, but that is
so obscure I'm not sure.

IIRC INIT has some relevance for error recovery. But that is pretty
obscure too.

Jason
Gal Pressman Jan. 5, 2019, 5:23 p.m. UTC | #73
On 04-Jan-19 18:35, Jason Gunthorpe wrote:
> On Fri, Jan 04, 2019 at 12:31:21PM +0200, Gal Pressman wrote:
>> On 04-Jan-19 06:11, Jason Gunthorpe wrote:
>>> On Thu, Jan 03, 2019 at 06:32:56PM -0500, Doug Ledford wrote:
>>>
>>>>> The original point is still un-addressed. We, as a community, have
>>>>> identified usnic as a bad idea that does not belong in
>>>>> drivers/infiniband. This has been a pretty much universal position of
>>>>> everyone who has worked on the core RDMA stack.
>>>>
>>>> I'll generally agree with this.  It turns out usnic is not so much about
>>>> access to a shared device with PD, QP, CQ, MR and AH support, but an
>>>> SRIOV network device attached to a process with minimal (but sufficient
>>>> I think) security measures to at least make sure that process can't
>>>> listen in on other traffic.  From there, it's just UDP on a private
>>>> network adapter.
>>>
>>> EFA looks the same to me, except instead of UDP it is doing a bit
>>> more, but it still just datagram processing. There are no verbs here
>>> other than SEND and RECV. There is no RMA..
>>>
>>> The functions usnic and EFA stub to EOPNOTSUPP/null look super similar
>>> to me.
>>
>> The stubs are there because of IB_MANDATORY_FUNC.
> 
> Yes, because those functions are *MANDATORY* to be an actual verbs
> device :) 
> 
> That is another thing that should be revised for these no-kverbs
> things, I don't want to see stubs..
> 
> In fact seeing null in any mandatory function pointers is a good way
> to trigger the 'clients cannot be attached' behavior in your earlier
> RFC patch.
> 
>>> I've never used usnic, is this true? It sure looks like it can
>>> allocate ucontexts dynamically?? I think that is what the vnic stuff
>>> is for?? Maybe the ucontext limit is pretty small, but there sure is
>>> a lot of code here if the limit is just 1.. We also don't know the EFA
>>> ucontext limit ...
>>
>> There is no ucontext limit.
> 
> Oh? EFA is sharing BAR pages between user processes? You have a
> security proof that is OK?

I guess we're talking about PDs?
There's a PD limit (currently 128, depends on the device) which limits the
number of processes. There is no sharing of BAR pages between user processes.
Jason Gunthorpe Jan. 5, 2019, 5:41 p.m. UTC | #74
On Sat, Jan 05, 2019 at 07:23:32PM +0200, Gal Pressman wrote:

> > Oh? EFA is sharing BAR pages between user processes? You have a
> > security proof that is OK?
> 
> I guess we're talking about PDs?

In most devices, available BAR address space is the limit on the
number of ucontexts..

> There's a PD limit (currently 128, depends on the device) which
> limits the number of processes. There is no sharing of BAR pages
> between user processes.

EFA has some design problems here.. Generally mapping of a BAR page into
user space must be done under the ucontext, not for individual
objects. ie if I allocate BAR page X to ucontext Y then X must remain
allocated until ucontext Y is destroyed. 

Otherwise there can be use-after free style security bugs.

Since the efa_dealloc_ucontext does nothing, and BAR pages are being
mapped, it must be wrong.

It kind of looks like it is trying to tie BAR allocation lifetime to
individual objects?

.. and all of this is why one generally focuses on the ucontext as the
limit, as generally, allocating a ucontext implies allocating a BAR
page, and thus the number of ucontexts is strictly limited by the BAR
size.

A driver can do lazy allocation, but it is kind of pointless.

Jason
Gal Pressman Jan. 6, 2019, 7:21 a.m. UTC | #75
On 05-Jan-19 03:04, Sagi Grimberg wrote:
> 
>>>>> Are there any mandatory callbacks that EFA refuses other than this one
>>>>> (which Gal said that it will support)?
>>>>
>>>> All the EOPNOTSUPP ones. It *really* doesn't support kverbs at
>>>> all.
>>>
>>> Oh wow, didn't realize that :)
>>
>> Makes the discussion take on a different light, doesn't it :)
> 
> It sorta makes sense for efa I guess..
> 
>>>> Which is why my first thought was to just not present it via the
>>>> in-kernel client API at all.
>>>
>>> Definitely, it should not appear to the client API at all..
>>> This patch is needed most definitely.
>>
>> Okay, Gal should keep going on that then, and do something about the
>> mandatory verbs. Probably failing mandatory verbs tests should hide
>> the driver from the client API (except toward uverbs).

Will do.

>>
>>> I'm sure that all of those (other than modify_qp) are trivial
>>> to add but it doesn't really make sense to have them if userspace
>>> is not using uverbs for them.
>>
>> If there is no modify_qp then the driver doesn't support the verbs QP
>> FSM, and a lot of other stuff. Ie it is not really a verbs QP. :\
> 
> Can you explain why the FSM matters for UD? I always thought
> it's really just a shim because it's sort of irrelevant...

That's what I thought as well, but as I said, we'll implement the UD state
transitions.
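
A minimal sketch of what that could look like (the struct and field
names here are placeholders, not the final patch):

#include <rdma/ib_verbs.h>

/* Placeholder QP struct just to make the sketch self-contained; the
 * real efa_qp lives in efa.h with different fields.
 */
struct efa_qp_sketch {
	struct ib_qp ibqp;
	enum ib_qp_state state;
	u32 qkey;
};

int efa_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
		  int attr_mask, struct ib_udata *udata)
{
	struct efa_qp_sketch *qp =
		container_of(ibqp, struct efa_qp_sketch, ibqp);

	/* Accept only what a UD transition actually carries; the real
	 * version would also validate the transition against the core
	 * QP state machine before applying it.
	 */
	if (attr_mask & ~(IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT |
			  IB_QP_QKEY | IB_QP_SQ_PSN))
		return -EOPNOTSUPP;

	if (attr_mask & IB_QP_STATE)
		qp->state = attr->qp_state;
	if (attr_mask & IB_QP_QKEY)
		qp->qkey = attr->qkey;

	return 0;
}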

> 
> If that is the case, EFA can implement a trivial modify_qp for
> UD QPs (i.e. do nothing, but at least fit the interface)?
> 
>>>> Only req_notify really extends into userspace verbs though.
>>>
>>> Is it? ibv_req_notify_cq doesn't go to the kernel afair..
>>
>> There is a IB_USER_VERBS_CMD_REQ_NOTIFY_CQ ..
>>
>> Hmm.. Many drivers do this via a BAR page write, not via the syscall,
>> so I guess we don't know if EFA needs this or not.
> 
> I'm betting that it doesn't...

Probably not, but I'll have to take a closer look.
Gal Pressman Jan. 6, 2019, 1:33 p.m. UTC | #76
On 05-Jan-19 19:41, Jason Gunthorpe wrote:
> On Sat, Jan 05, 2019 at 07:23:32PM +0200, Gal Pressman wrote:
> 
>>> Oh? EFA is sharing BAR pages between user processes? You have a
>>> security proof that is OK?
>>
>> I guess we're talking about PDs?
> 
> In most devices, available BAR address space is the limit on the
> number of ucontexts..

In our case the BAR space limits the number of PDs. Each PD gets its own
reserved BAR address space; it is guaranteed that all QP doorbells of a specific
PD will reside in this address space. The pages are not being shared.

> 
>> There's a PD limit (currently 128, depends on the device) which
>> limits the number of processes. There is no sharing of BAR pages
>> between user processes.
> 
> EFA has some design problems here.. Generally mapping of a BAR page into
> user space must be done under the ucontext, not for individual
> objects. ie if I allocate BAR page X to ucontext Y then X must remain
> allocated until ucontext Y is destroyed. 

When you say alloc a BAR page, do you mean mmap a BAR page?
We map the pages for individual objects, but the whole BAR is already
"allocated" for the associated PD.
The mapping stays valid until userspace calls munmap.

> 
> Otherwise there can be use-after free style security bugs.
> 
> Since the efa_dealloc_ucontext does nothing, and BAR pages are being
> mapped, it must be wrong.

Can you please elaborate? what would you like to see in dealloc_ucontext?

> 
> It kind of looks like it is trying to tie BAR allocation lifetime to
> individual objects?
> 
> .. and all of this is why one generally focuses on the ucontext as the
> limit, as generally, allocating a ucontext implies allocating a BAR
> page, and thus the number of ucontexts is strictly limited by the BAR
> size.

s/ucontext/PD/g is the case for EFA, our device is not aware of ucontext but
PDs. The BAR "reservation" is there for the lifetime of the PD.

> 
> A driver can do lazy allocation, but it is kind of pointless.
> 
> Jason
>
Jason Gunthorpe Jan. 6, 2019, 9:30 p.m. UTC | #77
On Sun, Jan 06, 2019 at 03:33:15PM +0200, Gal Pressman wrote:

> > Otherwise there can be use-after free style security bugs.
> > 
> > Since the efa_dealloc_ucontext does nothing, and BAR pages are being
> > mapped, it must be wrong.
> 
> Can you please elaborate? what would you like to see in dealloc_ucontext?

Freeing BAR page allocations.
 
> > It kind of looks like it is trying to tie BAR allocation lifetime to
> > individual objects?
> > 
> > .. and all of this is why one generally focuses on the ucontext as the
> > limit, as generally, allocating a ucontext implies allocating a BAR
> > page, and thus the number of ucontexts is strictly limited by the BAR
> > size.
> 
> s/ucontext/PD/g is the case for EFA, our device is not aware of
> ucontext but PDs. The BAR "reservation" is there for the lifetime of
> the PD.

Which is what I just said was wrong.

Jason
Gal Pressman Jan. 7, 2019, 2:43 p.m. UTC | #78
On 06-Jan-19 23:30, Jason Gunthorpe wrote:
> On Sun, Jan 06, 2019 at 03:33:15PM +0200, Gal Pressman wrote:
> 
>>> Otherwise there can be use-after free style security bugs.
>>>
>>> Since the efa_dealloc_ucontext does nothing, and BAR pages are being
>>> mapped, it must be wrong.
>>
>> Can you please elaborate? what would you like to see in dealloc_ucontext?
> 
> Freeing BAR page allocations.
>  
>>> It kind of looks like it is trying to tie BAR allocation lifetime to
>>> individual objects?
>>>
>>> .. and all of this is why one generally focuses on the ucontext as the
>>> limit, as generally, allocating a ucontext implies allocating a BAR
>>> page, and thus the number of ucontexts is strictly limited by the BAR
>>> size.
>>
>> s/ucontext/PD/g is the case for EFA, our device is not aware of
>> ucontext but PDs. The BAR "reservation" is there for the lifetime of
>> the PD.
> 
> Which is what I just said was wrong.

It's different, I still can't see how it's wrong.
Our device is responsible for the BAR reservations, not the driver.
The device ties it to the lifetime of the object, as long as the object lives
the mapping is valid. When the object is created, the device responds with its
BAR offset, which should be mmapped to the user. If the user tries to ring a
doorbell of a destroyed QP, that's not going to work.
Jason Gunthorpe Jan. 7, 2019, 5:06 p.m. UTC | #79
On Mon, Jan 07, 2019 at 04:43:20PM +0200, Gal Pressman wrote:
> On 06-Jan-19 23:30, Jason Gunthorpe wrote:
> > On Sun, Jan 06, 2019 at 03:33:15PM +0200, Gal Pressman wrote:
> > 
> >>> Otherwise there can be use-after free style security bugs.
> >>>
> >>> Since the efa_dealloc_ucontext does nothing, and BAR pages are being
> >>> mapped, it must be wrong.
> >>
> >> Can you please elaborate? what would you like to see in dealloc_ucontext?
> > 
> > Freeing BAR page allocations.
> >  
> >>> It kind of looks like it is trying to tie BAR allocation lifetime to
> >>> individual objects?
> >>>
> >>> .. and all of this is why one generally focuses on the ucontext as the
> >>> limit, as generally, allocating a ucontext implies allocating a BAR
> >>> page, and thus the number of ucontexts is strictly limited by the BAR
> >>> size.
> >>
> >> s/ucontext/PD/g is the case for EFA, our device is not aware of
> >> ucontext but PDs. The BAR "reservation" is there for the lifetime of
> >> the PD.
> > 
> > Which is what I just said was wrong.
> 
> It's different, I still can't see how it's wrong.
> Our device is responsible for the BAR reservations, not the driver.
> The device ties it to the lifetime of the object, as long as the object lives
> the mapping is valid.

No, so long as the *mapping* is accessible to userspace the BAR pages
must be reserved and not re-used for any other purpose. Otherwise you
get in a situation where two processes have mmaps to the same BAR
pages when they really shouldn't, and you are back to proving that
your HW is secure to operate in that mode.

The lifetime of the mapping has nothing to do with the lifetime of the
object, you cannot allow object destruction to recycle the BAR page to
another object allocation.

The device *cannot* control the BAR page assignment because it has no
information about the lifetime of the mapping of pages to userspace.

The general pattern is that the BAR pages are linked to the ucontext,
because only destruction of the ucontext confirms that all userspace
visibility of the pages has been completely removed.
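
In code the usual pattern is roughly the following (all names invented,
this is not any particular driver):

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <rdma/ib_verbs.h>

/* Every BAR page handed out via mmap is recorded on the ucontext, and
 * only dealloc_ucontext returns it to the device's BAR allocator.
 */
struct drv_ucontext {
	struct ib_ucontext ibucontext;
	struct list_head bar_pages;
	struct mutex lock;
};

struct drv_bar_page {
	struct list_head node;
	unsigned long pfn;
};

/* mmap path: remember which pages this context can see */
static int drv_track_bar_page(struct drv_ucontext *uctx, unsigned long pfn)
{
	struct drv_bar_page *page = kzalloc(sizeof(*page), GFP_KERNEL);

	if (!page)
		return -ENOMEM;
	page->pfn = pfn;
	mutex_lock(&uctx->lock);
	list_add(&page->node, &uctx->bar_pages);
	mutex_unlock(&uctx->lock);
	return 0;
}

/* dealloc_ucontext path: no vma can reference the pages any more, so
 * only now may they be recycled for another context.
 */
static void drv_release_bar_pages(struct drv_ucontext *uctx)
{
	struct drv_bar_page *page, *tmp;

	list_for_each_entry_safe(page, tmp, &uctx->bar_pages, node) {
		list_del(&page->node);
		/* return page->pfn to the device's BAR allocator here */
		kfree(page);
	}
}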

Jason
Gal Pressman Jan. 8, 2019, 9:19 a.m. UTC | #80
On 07-Jan-19 19:06, Jason Gunthorpe wrote:
> On Mon, Jan 07, 2019 at 04:43:20PM +0200, Gal Pressman wrote:
>> On 06-Jan-19 23:30, Jason Gunthorpe wrote:
>>> On Sun, Jan 06, 2019 at 03:33:15PM +0200, Gal Pressman wrote:
>>>
>>>>> Otherwise there can be use-after free style security bugs.
>>>>>
>>>>> Since the efa_dealloc_ucontext does nothing, and BAR pages are being
>>>>> mapped, it must be wrong.
>>>>
>>>> Can you please elaborate? what would you like to see in dealloc_ucontext?
>>>
>>> Freeing BAR page allocations.
>>>  
>>>>> It kind of looks like it is trying to tie BAR allocation lifetime to
>>>>> individual objects?
>>>>>
>>>>> .. and all of this is why one generally focuses on the ucontext as the
>>>>> limit, as generally, allocating a ucontext implies allocating a BAR
>>>>> page, and thus the number of ucontexts is strictly limited by the BAR
>>>>> size.
>>>>
>>>> s/ucontext/PD/g is the case for EFA, our device is not aware of
>>>> ucontext but PDs. The BAR "reservation" is there for the lifetime of
>>>> the PD.
>>>
>>> Which is what I just said was wrong.
>>
>> It's different, I still can't see how it's wrong.
>> Our device is responsible for the BAR reservations, not the driver.
>> The device ties it to the lifetime of the object, as long as the object lives
>> the mapping is valid.
> 
> No, so long as the *mapping* is accessible to userspace the BAR pages
> must be reserved and not re-used for any other purpose. Otherwise you
> get in a situation where two processes have mmaps to the same BAR
> pages when they really shouldn't, and you are back to proving that
> your HW is secure to operate in that mode.
> 
> The lifetime of the mapping has nothing to do with the lifetime of the
> object, you cannot allow object destruction to recycle the BAR page to
> another object allocation.
> 
> The device *cannot* control the BAR page assignment because it has no
> information about the lifetime of the mapping of pages to userspace.
> 
> The general pattern is that the BAR pages are linked to the ucontext,
> because only destruction of the ucontext confirms that all userspace
> visibility of the pages has been completely removed.

Thanks for the explanation.
So we're talking about a scenario where the user destroyed the QP and
deallocated the PD but still holds the BAR mapping (bug?), right?
You're right, in that case (which shouldn't happen unless we have some kind of
userspace bug) the pages could be wrongly reused by another process that reuses
the previous PD.

This can be solved by freeing the PDs on dealloc_ucontext time which will
guarantee that no other process will get the same BAR pages.
Jason Gunthorpe Jan. 9, 2019, 12:05 a.m. UTC | #81
On Tue, Jan 08, 2019 at 11:19:19AM +0200, Gal Pressman wrote:

> You're right, in that case (which shouldn't happen unless we have
> some kind of userspace bug)

It is a security issue: if I can force the BAR pages to be re-used
then I break the security model of the device from userspace.

> This can be solved by freeing the PDs on dealloc_ucontext time which
> will guarantee that no other process will get the same BAR pages.

Depends on the implementation. 

What is coming next is 'shared PD', which will allow the PD HW object
to be pointed at by multiple different ucontexts. (this is another
reason why you can't really remove the mmap cookie on first use, as to
make a correct driver for shared-PD the sharing flow will have to
re-mmap the BAR pages, presumably.)

The implementation of deferred destroy would need to very carefully
handle this.

Honestly, the HW implementation here sounds more like SRIOV where BAR
based regions are kept apart from each other.

Jason
Gal Pressman Jan. 9, 2019, 2:42 p.m. UTC | #82
On 09-Jan-19 02:05, Jason Gunthorpe wrote:
> 
>> This can be solved by freeing the PDs on dealloc_ucontext time which
>> will guarantee that no other process will get the same BAR pages.
> 
> Depends on the implementation. 
> 
> What is coming next is 'shared PD', which will allow the PD HW object
> to be pointed at by multiple different ucontexts. (this is another
> reason why you can't really remove the mmap cookie on first use, as to
> make a correct driver for shared-PD the sharing flow will have to
> re-mmap the BAR pages, presumably.)
> 
> The implementation of deferred destroy would need to very carefully
> handle this.

In case of a shared PD, are the different ucontexts expected to share the same
BAR pages?
Sounds like a PD reference count will be needed in order to free it when the
last ucontext using it is deallocated.

We can also just fail mmaps for pages that are still used by a different
ucontext instead of deferring the PD deallocation.
Is there a way to notify the driver of munmaps? instead of waiting for
dealloc_ucontext?
Jason Gunthorpe Jan. 9, 2019, 3:36 p.m. UTC | #83
On Wed, Jan 09, 2019 at 04:42:37PM +0200, Gal Pressman wrote:
> On 09-Jan-19 02:05, Jason Gunthorpe wrote:
> > 
> >> This can be solved by freeing the PDs on dealloc_ucontext time which
> >> will guarantee that no other process will get the same BAR pages.
> > 
> > Depends on the implementation. 
> > 
> > What is coming next is 'shared PD', which will allow the PD HW object
> > to be pointed at by multiple different ucontexts. (this is another
> > reason why you can't really remove the mmap cookie on first use, as to
> > make a correct driver for shared-PD the sharing flow will have to
> > re-mmap the BAR pages, presumably.)
> > 
> > The implementation of deferred destroy would need to very carefully
> > handle this.
> 
> In case of a shared PD, are the different ucontexts expected to share the same
> BAR pages?

I'm guessing EFA would need that since the BAR page is linked to the
PD.

> Sounds like a PD reference count will be needed in order to free it
> when the last ucontext using it is deallocated.

Something like that is in Shamir's patches

> We can also just fail mmaps for pages that are still used by a different
> ucontext instead of deferring the PD deallocation.

How do you know if they are still used?

> Is there a way to notify the driver of munmaps? instead of waiting for
> dealloc_ucontext?

Yes, but I prefer if drivers don't go back to implementing vma_ops;
it is complicated.

Jason
Gal Pressman Jan. 9, 2019, 4:03 p.m. UTC | #84
On 09-Jan-19 17:36, Jason Gunthorpe wrote:
>> We can also just fail mmaps for pages that are still used by a different
>> ucontext instead of deferring the PD deallocation.
> 
> How do you know if they are still used?

Store all mapped BAR pages per device; every mmap of a page that still exists in
our database will fail (unless it's from the same ucontext).
Dealloc ucontext will remove all pages associated with the ucontext so other
users can mmap them.
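
Roughly what I'm picturing (a sketch with invented names, using the
xarray helpers; the real bookkeeping may end up looking different):

#include <linux/xarray.h>
#include <rdma/ib_verbs.h>

/* Per-device ownership table: key = BAR page offset, value = the
 * ucontext that currently has it mmapped.
 */
struct drv_dev {
	struct xarray bar_pages;	/* xa_init() at device probe */
};

/* mmap path: claim pgoff for uctx, fail if someone else owns it */
static int drv_claim_bar_page(struct drv_dev *dev, unsigned long pgoff,
			      struct ib_ucontext *uctx)
{
	void *old = xa_cmpxchg(&dev->bar_pages, pgoff, NULL, uctx,
			       GFP_KERNEL);

	if (xa_is_err(old))
		return xa_err(old);
	return (old && old != uctx) ? -EBUSY : 0;
}

/* dealloc_ucontext path: release every page this context owned */
static void drv_release_bar_pages(struct drv_dev *dev,
				  struct ib_ucontext *uctx)
{
	unsigned long pgoff;
	void *owner;

	xa_for_each(&dev->bar_pages, pgoff, owner)
		if (owner == uctx)
			xa_erase(&dev->bar_pages, pgoff);
}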
Jason Gunthorpe Jan. 9, 2019, 4:17 p.m. UTC | #85
On Wed, Jan 09, 2019 at 06:03:04PM +0200, Gal Pressman wrote:
> On 09-Jan-19 17:36, Jason Gunthorpe wrote:
> >> We can also just fail mmaps for pages that are still used by a different
> >> ucontext instead of deferring the PD deallocation.
> > 
> > How do you know if they are still used?
> 
> Store all mapped BAR pages per device; every mmap of a page that still exists
> in our database will fail (unless it's from the same ucontext).
> Dealloc ucontext will remove all pages associated with the ucontext so other
> users can mmap them.

Seems reasonable, but the tracking seems a bit complicated as you need
a refcount for every BAR page.

Jason