
[RFC,00/13] Ultra Ethernet driver introduction

Message ID 20250306230203.1550314-1-nikolay@enfabrica.net

Message

Nikolay Aleksandrov March 6, 2025, 11:01 p.m. UTC
Hi all,
This patch-set introduces minimal Ultra Ethernet driver infrastructure and
the lowest Ultra Ethernet sublayer - the Packet Delivery Sublayer (PDS),
which underpins the entire communication model of the Ultra Ethernet
Transport[1] (UET). Ultra Ethernet is a new RDMA transport designed for
efficient AI and HPC communication. The specifications are still being
ironed out, and the first public versions should be available soon. As there
isn't any UET hardware available yet, we introduce a software device model
which implements the lowest sublayer of the spec - PDS. The code is still
in early stages and experimental, aiming to start a discussion on the
kernel implementation and to show how we plan to organize it.

The PDS is responsible for establishing dynamic connections between Fabric
Endpoints (FEPs) called Packet Delivery Contexts (PDCs), packet
reliability, ordering, duplicate elimination and congestion management.
The PDS packet ordering is defined by a mode which can be one of (a rough
enum sketch follows the list):
 - Reliable, Ordered Delivery (ROD)
 - Reliable, Unordered Delivery (RUD)
 - Reliable, Unordered Delivery for Idempotent Operations (RUDI)
 - Unreliable, Unordered Delivery (UUD)
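
For illustration only, the four modes could be represented as a simple enum;
the names below are made up for this cover letter rather than taken from the
spec headers:

/* Hypothetical mode encoding -- the real values live in the UET spec. */
enum uet_pds_mode {
	UET_PDS_MODE_ROD,	/* Reliable, Ordered Delivery */
	UET_PDS_MODE_RUD,	/* Reliable, Unordered Delivery */
	UET_PDS_MODE_RUDI,	/* Reliable, Unordered, Idempotent operations */
	UET_PDS_MODE_UUD,	/* Unreliable, Unordered Delivery */
};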

This set implements the RUD mode of communication with Packet Sequence
Number (PSN) tracking, retransmits, idle timeouts, coalescing and selective
ACKs. It adds support for generating and processing Request, ACK, NACK and
Control packet types. Communication is done over UDP, so all Ultra Ethernet
headers are carried on top of UDP packets. Packets are tracked by PSNs
uniquely assigned within a PDC; the PSN window sizes are currently static.
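
To make the above concrete, here is a minimal receive-side sketch of the PSN
tracking, assuming a static window and a bitmap driving duplicate elimination
and SACK generation; the names are illustrative and do not necessarily match
uet_pdc.c:

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/types.h>

#define UET_PSN_WINDOW	256			/* static window, as in this RFC */

struct pdc_rx_psn_state {
	u32		base_psn;		/* left edge of the receive window */
	unsigned long	sack_map[BITS_TO_LONGS(UET_PSN_WINDOW)];
};

/* Track an incoming PSN; false means drop (duplicate or out of window). */
static bool pdc_rx_track_psn(struct pdc_rx_psn_state *ps, u32 psn)
{
	u32 off = psn - ps->base_psn;	/* u32 arithmetic handles wraparound */

	if (off >= UET_PSN_WINDOW)
		return false;		/* outside the static window */
	if (test_and_set_bit(off, ps->sack_map))
		return false;		/* duplicate elimination */

	/* Slide the window over the contiguous run at the left edge. */
	while (test_bit(0, ps->sack_map)) {
		bitmap_shift_right(ps->sack_map, ps->sack_map, 1,
				   UET_PSN_WINDOW);
		ps->base_psn++;
	}
	return true;
}

The set bits beyond the left edge are exactly what a selective ACK would
report back to the initiator.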

In this RFC all of the code is under a single kernel module in
drivers/ultraeth/ and guarded by a new kconfig option CONFIG_ULTRAETH. The
plan is to have that split into core Ultra Ethernet module (ultraeth.ko)
which is responsible for managing the UET contexts, jobs and all other
common/generic UET configuration, and the software UET device model
(uecon.ko) which implements the UET protocols for communication in software
(e.g. the PDS will be a part of uecon) and is represented by a UDP tunnel
network device; a rough sketch of that hookup follows the list below. Note
that there are critical missing pieces which will be in place by the time
we send the first version, such as:
 - Ultra Ethernet specs will be publicly available
 - missing UET sublayers critical for communication
 - more complete user API
 - kernel UET device API
 - memory management
 - IPv6
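
Regarding the UDP tunnel device mentioned above: a minimal sketch, assuming
the kernel's generic UDP tunnel helpers, of how uecon's encapsulation socket
might be created (the port number and callback are placeholders, not the
actual uecon.c code):

#include <linux/skbuff.h>
#include <net/udp_tunnel.h>

/* Placeholder encap receive hook; real code would parse the PDS headers. */
static int uecon_udp_recv(struct sock *sk, struct sk_buff *skb)
{
	kfree_skb(skb);		/* sketch only: consume the packet */
	return 0;		/* 0 tells UDP the skb was handled */
}

static int uecon_create_sock(struct net *net, struct socket **sockp)
{
	struct udp_port_cfg udp_conf = {
		.family		= AF_INET,	/* IPv6 is still missing, see above */
		.local_udp_port	= htons(4793),	/* placeholder port number */
	};
	struct udp_tunnel_sock_cfg tunnel_cfg = {
		.encap_type	= 1,
		.encap_rcv	= uecon_udp_recv,
	};
	struct socket *sock;
	int err;

	err = udp_sock_create(net, &udp_conf, &sock);
	if (err < 0)
		return err;

	setup_udp_tunnel_sock(net, sock, &tunnel_cfg);
	*sockp = sock;
	return 0;
}

The encap_rcv hook is where the PDS receive path from the later patches
would attach.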

The last patch is a hack which adds a custom character device used to test
communication and basic PDS functionality; for the first version of this set
we would rather extend and re-use some of the InfiniBand infrastructure.

This set will also be used to better illustrate the UET code and concepts
for the "Networking For AI BoF"[2] at the upcoming Netdev 0x19 conference
in Zagreb, Croatia.

Thank you,
 Nik

[1] https://ultraethernet.org/
[2] https://netdevconf.info/0x19/sessions/bof/networking-for-ai-bof.html


Alex Badea (1):
  HACK: drivers: ultraeth: add char device

Nikolay Aleksandrov (12):
  drivers: ultraeth: add initial skeleton and kconfig option
  drivers: ultraeth: add context support
  drivers: ultraeth: add new genl family
  drivers: ultraeth: add job support
  drivers: ultraeth: add tunnel udp device support
  drivers: ultraeth: add initial PDS infrastructure
  drivers: ultraeth: add request and ack receive support
  drivers: ultraeth: add request transmit support
  drivers: ultraeth: add support for coalescing ack
  drivers: ultraeth: add sack support
  drivers: ultraeth: add nack support
  drivers: ultraeth: add initiator and target idle timeout support

 Documentation/netlink/specs/rt_link.yaml  |   14 +
 Documentation/netlink/specs/ultraeth.yaml |  218 ++++
 drivers/Kconfig                           |    2 +
 drivers/Makefile                          |    1 +
 drivers/ultraeth/Kconfig                  |   11 +
 drivers/ultraeth/Makefile                 |    4 +
 drivers/ultraeth/uecon.c                  |  324 ++++++
 drivers/ultraeth/uet_chardev.c            |  264 +++++
 drivers/ultraeth/uet_context.c            |  274 +++++
 drivers/ultraeth/uet_job.c                |  456 +++++++++
 drivers/ultraeth/uet_main.c               |   41 +
 drivers/ultraeth/uet_netlink.c            |  113 +++
 drivers/ultraeth/uet_netlink.h            |   29 +
 drivers/ultraeth/uet_pdc.c                | 1122 +++++++++++++++++++++
 drivers/ultraeth/uet_pds.c                |  481 +++++++++
 include/net/ultraeth/uecon.h              |   28 +
 include/net/ultraeth/uet_chardev.h        |   11 +
 include/net/ultraeth/uet_context.h        |   47 +
 include/net/ultraeth/uet_job.h            |   80 ++
 include/net/ultraeth/uet_pdc.h            |  170 ++++
 include/net/ultraeth/uet_pds.h            |  110 ++
 include/uapi/linux/if_link.h              |    8 +
 include/uapi/linux/ultraeth.h             |  536 ++++++++++
 include/uapi/linux/ultraeth_nl.h          |  116 +++
 24 files changed, 4460 insertions(+)
 create mode 100644 Documentation/netlink/specs/ultraeth.yaml
 create mode 100644 drivers/ultraeth/Kconfig
 create mode 100644 drivers/ultraeth/Makefile
 create mode 100644 drivers/ultraeth/uecon.c
 create mode 100644 drivers/ultraeth/uet_chardev.c
 create mode 100644 drivers/ultraeth/uet_context.c
 create mode 100644 drivers/ultraeth/uet_job.c
 create mode 100644 drivers/ultraeth/uet_main.c
 create mode 100644 drivers/ultraeth/uet_netlink.c
 create mode 100644 drivers/ultraeth/uet_netlink.h
 create mode 100644 drivers/ultraeth/uet_pdc.c
 create mode 100644 drivers/ultraeth/uet_pds.c
 create mode 100644 include/net/ultraeth/uecon.h
 create mode 100644 include/net/ultraeth/uet_chardev.h
 create mode 100644 include/net/ultraeth/uet_context.h
 create mode 100644 include/net/ultraeth/uet_job.h
 create mode 100644 include/net/ultraeth/uet_pdc.h
 create mode 100644 include/net/ultraeth/uet_pds.h
 create mode 100644 include/uapi/linux/ultraeth.h
 create mode 100644 include/uapi/linux/ultraeth_nl.h

Comments

Leon Romanovsky March 8, 2025, 6:46 p.m. UTC | #1
On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> Hi all,

<...>

> Ultra Ethernet is a new RDMA transport.

Awesome, and now please explain why a new subsystem is needed when
drivers/infiniband already supports at least 5 different RDMA
transports (OmniPath, iWARP, InfiniBand, RoCE v1 and RoCE v2).

Maybe after this discussion it will be very clear that a new subsystem
is needed, but at least it needs to be stated clearly.

And please CC RDMA maintainers on any Ultra Ethernet related discussions
as it is more RDMA than Ethernet.

Thanks
Parav Pandit March 9, 2025, 3:21 a.m. UTC | #2
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Sunday, March 9, 2025 12:17 AM
> 
> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> > Hi all,
> 
> <...>
> 
> > Ultra Ethernet is a new RDMA transport.
> 
> Awesome, and now please explain why a new subsystem is needed when
> drivers/infiniband already supports at least 5 different RDMA transports
> (OmniPath, iWARP, InfiniBand, RoCE v1 and RoCE v2).
> 
A 6th transport is drivers/infiniband/hw/efa (SRD).

> Maybe after this discussion it will be very clear that a new subsystem is needed,
> but at least it needs to be stated clearly.
> 
> And please CC RDMA maintainers on any Ultra Ethernet related discussions as it
> is more RDMA than Ethernet.
> 
> Thanks
Bernard Metzler March 11, 2025, 2:20 p.m. UTC | #3
> -----Original Message-----
> From: Parav Pandit <parav@nvidia.com>
> Sent: Sunday, March 9, 2025 4:22 AM
> To: Leon Romanovsky <leon@kernel.org>; Nikolay Aleksandrov
> <nikolay@enfabrica.net>
> Cc: netdev@vger.kernel.org; shrijeet@enfabrica.net;
> alex.badea@keysight.com; eric.davis@broadcom.com; rip.sohan@amd.com;
> dsahern@kernel.org; Bernard Metzler <BMT@zurich.ibm.com>;
> roland@enfabrica.net; winston.liu@keysight.com;
> dan.mihailescu@keysight.com; Kamal Heib <kheib@redhat.com>;
> parth.v.parikh@keysight.com; Dave Miller <davem@redhat.com>;
> ian.ziemba@hpe.com; andrew.tauferner@cornelisnetworks.com;
> welch@hpe.com; rakhahari.bhunia@keysight.com;
> kingshuk.mandal@keysight.com; linux-rdma@vger.kernel.org;
> kuba@kernel.org; Paolo Abeni <pabeni@redhat.com>; Jason Gunthorpe
> <jgg@nvidia.com>
> Subject: [EXTERNAL] RE: [RFC PATCH 00/13] Ultra Ethernet driver
> introduction
> 
> 
> 
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Sunday, March 9, 2025 12:17 AM
> >
> > On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> > > Hi all,
> >
> > <...>
> >
> > > Ultra Ethernet is a new RDMA transport.
> >
> > Awesome, and now please explain why a new subsystem is needed when
> > drivers/infiniband already supports at least 5 different RDMA
> > transports (OmniPath, iWARP, InfiniBand, RoCE v1 and RoCE v2).
> >
> A 6th transport is drivers/infiniband/hw/efa (SRD).
> 
> > Maybe after this discussion it will be very clear that a new subsystem
> > is needed, but at least it needs to be stated clearly.

I am not sure if a new subsystem is what this RFC calls
for, but rather a discussion about the proper integration of
a new RDMA transport into the Linux kernel.

Ultra Ethernet Transport is probably not just another transport
up for easy integration into the current RDMA subsystem.
First of all, its design does not follow the well-known RDMA
verbs model inherited from InfiniBand, which has largely shaped
the current structure of the RDMA subsystem. While having send,
receive and completion queues (and completion counters) to steer
message exchange, there is no concept of a queue pair. Endpoints
can span multiple queues and can have multiple peer addresses.
Communication resources sharing is controlled in a different way
than within protection domains. Connections are ephemeral,
created and released by the provider as needed. There are more
differences. In a nutshell, the UET communication model is
trimmed for extreme scalability. Its API semantics follow
libfabrics, not RDMA verbs.

I think Nik gave us a first still incomplete look at the UET
protocol engine to help us understand some of the specifics.
It's just the lower part (packet delivery). The implementation
of the upper part (resource management, communication semantics,
job management) may largely depend on the environment we all
choose.

IMO, integrating UET with the current RDMA subsystem would ask
for its extension to allow exposing all of UET's intended
functionality, probably starting with a more generic RDMA
device model than current ib_device.

The different API semantics of UET may further call
for either extending verbs to cover it as well, or exposing a
new non-verbs API (libfabrics), or both.

Thanks,
Bernard.


> >
> > And please CC RDMA maintainers on any Ultra Ethernet related
> > discussions, as it is more RDMA than Ethernet.
> >
> > Thanks
Leon Romanovsky March 11, 2025, 2:55 p.m. UTC | #4
On Tue, Mar 11, 2025 at 02:20:07PM +0000, Bernard Metzler wrote:
> 
> 
> > -----Original Message-----
> > From: Parav Pandit <parav@nvidia.com>
> > Sent: Sunday, March 9, 2025 4:22 AM
> > To: Leon Romanovsky <leon@kernel.org>; Nikolay Aleksandrov
> > <nikolay@enfabrica.net>
> > Cc: netdev@vger.kernel.org; shrijeet@enfabrica.net;
> > alex.badea@keysight.com; eric.davis@broadcom.com; rip.sohan@amd.com;
> > dsahern@kernel.org; Bernard Metzler <BMT@zurich.ibm.com>;
> > roland@enfabrica.net; winston.liu@keysight.com;
> > dan.mihailescu@keysight.com; Kamal Heib <kheib@redhat.com>;
> > parth.v.parikh@keysight.com; Dave Miller <davem@redhat.com>;
> > ian.ziemba@hpe.com; andrew.tauferner@cornelisnetworks.com;
> > welch@hpe.com; rakhahari.bhunia@keysight.com;
> > kingshuk.mandal@keysight.com; linux-rdma@vger.kernel.org;
> > kuba@kernel.org; Paolo Abeni <pabeni@redhat.com>; Jason Gunthorpe
> > <jgg@nvidia.com>
> > Subject: [EXTERNAL] RE: [RFC PATCH 00/13] Ultra Ethernet driver
> > introduction
> > 
> > 
> > 
> > > From: Leon Romanovsky <leon@kernel.org>
> > > Sent: Sunday, March 9, 2025 12:17 AM
> > >
> > > On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> > > > Hi all,
> > >
> > > <...>
> > >
> > > > Ultra Ethernet is a new RDMA transport.
> > >
> > > Awesome, and now please explain why a new subsystem is needed when
> > > drivers/infiniband already supports at least 5 different RDMA
> > > transports (OmniPath, iWARP, InfiniBand, RoCE v1 and RoCE v2).
> > >
> > A 6th transport is drivers/infiniband/hw/efa (SRD).
> > 
> > > Maybe after this discussion it will be very clear that a new subsystem
> > > is needed, but at least it needs to be stated clearly.
> 
> I am not sure if a new subsystem is what this RFC calls
> for, but rather a discussion about the proper integration of
> a new RDMA transport into the Linux kernel.

<...>

> The different API semantics of UET may further call
> for either extending verbs to cover it as well, or exposing a
> new non-verbs API (libfabrics), or both.

So you should start from there (UAPI) by presenting the device model and
how the verbs API needs to be extended, so it will be possible to evaluate
how to fit that model into the existing Linux kernel codebase.

The RDMA subsystem provides multiple types of QPs and operational models; some of
them indeed follow IB style, but not all of them (SRD, DC, etc.).

Thanks

> 
> Thanks,
> Bernard.
> 
> 
> > >
> > > And please CC RDMA maintainers on any Ultra Ethernet related
> > > discussions, as it is more RDMA than Ethernet.
> > >
> > > Thanks
>
Sean Hefty March 11, 2025, 5:11 p.m. UTC | #5
> I am not sure if a new subsystem is what this RFC calls for, but rather a
> discussion about the proper integration of a new RDMA transport into the
> Linux kernel.
> 
> Ultra Ethernet Transport is probably not just another transport up for easy
> integration into the current RDMA subsystem.
> First of all, its design does not follow the well-known RDMA verbs model
> inherited from InfiniBand, which has largely shaped the current structure of
> the RDMA subsystem. While having send, receive and completion queues (and
> completion counters) to steer message exchange, there is no concept of a
> queue pair. Endpoints can span multiple queues and can have multiple peer
> addresses.
> Communication resources sharing is controlled in a different way than within
> protection domains. Connections are ephemeral, created and released by the
> provider as needed. There are more differences. In a nutshell, the UET
> communication model is trimmed for extreme scalability. Its API semantics
> follow libfabrics, not RDMA verbs.
> 
> I think Nik gave us a first still incomplete look at the UET protocol engine to
> help us understand some of the specifics.
> It's just the lower part (packet delivery). The implementation of the upper part
> (resource management, communication semantics, job management) may
> largely depend on the environment we all choose.
> 
> IMO, integrating UET with the current RDMA subsystem would ask for its
> extension to allow exposing all of UET's intended functionality, probably
> starting with a more generic RDMA device model than current ib_device.
> 
> The different API semantics of UET may further call for either extending verbs
> to cover it as well, or exposing a new non-verbs API (libfabrics), or both.

Reading through the submissions, what I found lacking is a description of some higher-level plan.  I don't easily see how to relate this series to NICs that may implement UET in HW.

Should the PDS be viewed as a partial implementation of a SW UET 'device', similar to soft RoCE or iWarp?  If so, having a description of a proposed device model seems like a necessary first step.

If, instead, the PDS should be viewed more along the lines of a partial RDS-like path, then that changes the uapi.

Or, am I not viewing this series as intended at all?

It is almost guaranteed that there will be NICs which will support both RoCE and UET, and it's not farfetched to think that an app may use both simultaneously.   IMO, a common device model is ideal, assuming exposing a device model is the intent.

I agree that different transport models should not be forced together unnaturally, but I think that's solvable.  In the end, the application developer is exposed to libfabric naming anyway.  Besides, even a repurposed RDMA name is still better than the naming used within OpenMPI.  :)

- Sean
Nikolay Aleksandrov March 12, 2025, 9:20 a.m. UTC | #6
On 3/11/25 7:11 PM, Sean Hefty wrote:
>> I am not sure if a new subsystem is what this RFC calls for, but rather a
>> discussion about the proper integration of a new RDMA transport into the
>> Linux kernel.
>>
>> Ultra Ethernet Transport is probably not just another transport up for easy
>> integration into the current RDMA subsystem.
>> First of all, its design does not follow the well-known RDMA verbs model
>> inherited from InfiniBand, which has largely shaped the current structure of
>> the RDMA subsystem. While having send, receive and completion queues (and
>> completion counters) to steer message exchange, there is no concept of a
>> queue pair. Endpoints can span multiple queues and can have multiple peer
>> addresses.
>> Communication resources sharing is controlled in a different way than within
>> protection domains. Connections are ephemeral, created and released by the
>> provider as needed. There are more differences. In a nutshell, the UET
>> communication model is trimmed for extreme scalability. Its API semantics
>> follow libfabrics, not RDMA verbs.
>>
>> I think Nik gave us a first still incomplete look at the UET protocol engine to
>> help us understand some of the specifics.
>> It's just the lower part (packet delivery). The implementation of the upper part
>> (resource management, communication semantics, job management) may
>> largely depend on the environment we all choose.
>>
>> IMO, integrating UET with the current RDMA subsystem would ask for its
>> extension to allow exposing all of UET's intended functionality, probably
>> starting with a more generic RDMA device model than current ib_device.
>>
>> The different API semantics of UET may further call for either extending verbs
>> to cover it as well, or exposing a new non-verbs API (libfabrics), or both.
> 
> Reading through the submissions, what I found lacking is a description of some higher-level plan.  I don't easily see how to relate this series to NICs that may implement UET in HW.
> 
> Should the PDS be viewed as a partial implementation of a SW UET 'device', similar to soft RoCE or iWarp?  If so, having a description of a proposed device model seems like a necessary first step.
> 

Hi Sean,
To quote the cover letter:
"...As there
isn't any UET hardware available yet, we introduce a software
device model which implements the lowest sublayer of the spec - PDS..."

and

"The plan is to have that split into core Ultra Ethernet module
(ultraeth.ko) which is responsible for managing the UET contexts, jobs
and all other common/generic UET configuration, and the software UET
device model (uecon.ko) which implements the UET protocols for
communication in software (e.g. the PDS will be a part of uecon) and is
represented by a UDP tunnel network device."

So as I said, it is at a very early stage, but we plan to split this into
the core UET code and the uecon software device model that implements the
UEC specs.

> If, instead, the PDS should be viewed more along the lines of a partial RDS-like path, then that changes the uapi.
> 
> Or, am I not viewing this series as intended at all?
> 
> It is almost guaranteed that there will be NICs which will support both RoCE and UET, and it's not farfetched to think that an app may use both simultaneously.   IMO, a common device model is ideal, assuming exposing a device model is the intent.
> 

That is the goal, and we're working on the UET kernel device API, as I've
noted in the cover letter.

> I agree that different transport models should not be forced together unnaturally, but I think that's solvable.  In the end, the application developer is exposed to libfabric naming anyway.  Besides, even a repurposed RDMA name is still better than the naming used within OpenMPI.  :)
> 
> - Sean

Cheers,
 Nik
Nikolay Aleksandrov March 12, 2025, 9:40 a.m. UTC | #7
On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
>> Hi all,
> 
> <...>
> 
>> Ultra Ethernet is a new RDMA transport.
> 
> Awesome, and now please explain why a new subsystem is needed when
> drivers/infiniband already supports at least 5 different RDMA
> transports (OmniPath, iWARP, InfiniBand, RoCE v1 and RoCE v2).
> 

As Bernard commented, we're not trying to add a new subsystem, but
start a discussion on where UEC should live because it has multiple
objects and semantics that don't map well to the current
infrastructure. For example from this set - managing contexts, jobs and
fabric endpoints. Also we have the ephemeral PDC connections
that come and go as needed. There are more such objects coming, with more
state, configuration and lifecycle management. That is why we added a
separate netlink family to cleanly manage them without trying to fit
a square peg in a round hole, so to speak. In the next version I'll make
sure to expand much more on this topic. By the way, I believe Sean is
working on the verbs mapping for parts of UEC; he can probably also
share more details.
We definitely want to re-use as much as possible from the current
infrastructure; no one is trying to reinvent the wheel.

> Maybe after this discussion it will be very clear that new subsystem
> is needed, but at least it needs to be stated clearly.
> 
> And please CC RDMA maintainers on any Ultra Ethernet related discussions
> as it is more RDMA than Ethernet.
> 

Of course it's RDMA; that's stated in the first few sentences. I made a
mistake with the "To", but I did add linux-rdma@ to the recipient list.
I'll make sure to also add the RDMA maintainers personally for the next
version and change the "To".

> Thanks

Cheers,
 Nik
Leon Romanovsky March 12, 2025, 11:29 a.m. UTC | #8
On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> > On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> >> Hi all,
> > 
> > <...>
> > 
> >> Ultra Ethernet is a new RDMA transport.
> > 
> > Awesome, and now please explain why a new subsystem is needed when
> > drivers/infiniband already supports at least 5 different RDMA
> > transports (OmniPath, iWARP, InfiniBand, RoCE v1 and RoCE v2).
> > 
> 
> As Bernard commented, we're not trying to add a new subsystem, 

So why did you create a new drivers/ultraeth/ folder?

> but start a discussion on where UEC should live because it has multiple
> objects and semantics that don't map well to the current
> infrastructure. For example from this set - managing contexts, jobs and
> fabric endpoints. 

These are just different names, which libfabric uses to avoid the
traditional verbs naming. There is nothing in the stack which prevents a
QP from having the same properties as "fabric endpoints" have.

> Also we have the ephemeral PDC connections
> that come and go as needed. There are more such objects coming, with more
> state, configuration and lifecycle management. That is why we added a
> separate netlink family to cleanly manage them without trying to fit
> a square peg in a round hole, so to speak.

Yeah, I saw that you are planning to use netlink to manage objects,
which is very questionable. It is slow, unreliable, requires sockets,
needs more parsing logic, etc.

To avoid all this overhead, RDMA uses netlink-like ioctl calls, which
fits better for object configurations.

Thanks
Nikolay Aleksandrov March 12, 2025, 2:20 p.m. UTC | #9
On 3/12/25 1:29 PM, Leon Romanovsky wrote:
> On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
>> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
>>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
[snip]
>> Also we have the ephemeral PDC connections
>> that come and go as needed. There are more such objects coming, with more
>> state, configuration and lifecycle management. That is why we added a
>> separate netlink family to cleanly manage them without trying to fit
>> a square peg in a round hole, so to speak.
> 
> Yeah, I saw that you are planning to use netlink to manage objects,
> which is very questionable. It is slow, unreliable, requires sockets,
> needs more parsing logic, etc.
> 
> To avoid all this overhead, RDMA uses netlink-like ioctl calls, which
> fits better for object configurations.
> 
> Thanks

We'd definitely like to keep using netlink for control path object
management. Also please note we're talking about a genetlink family. It is
fast and reliable enough for us, very easily extensible,
has a nice precise object definition with policies to enforce various
limitations, has extensive tooling (e.g. ynl), communication can be
monitored in real time for debugging (e.g. nlmon), has nice human-readable
error reporting, gives the ability to easily dump large object
groups with filters applied, has YAML family definitions and so on.
Having sockets or parsing is not an issue.
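
Just for illustration, registering such a family is a handful of lines; a
sketch with made-up command and attribute numbers follows (this is not our
actual uet_netlink.c, which follows the ultraeth.yaml spec):

#include <linux/kernel.h>
#include <linux/module.h>
#include <net/genetlink.h>

/* Hypothetical attribute/command numbers -- the real ones may differ. */
enum {
	UET_NL_A_UNSPEC,
	UET_NL_A_CTX_ID,
	__UET_NL_A_MAX,
};
#define UET_NL_A_MAX (__UET_NL_A_MAX - 1)

static const struct nla_policy uet_nl_policy[UET_NL_A_MAX + 1] = {
	[UET_NL_A_CTX_ID] = { .type = NLA_U32 },
};

static int uet_nl_ctx_new(struct sk_buff *skb, struct genl_info *info)
{
	/* Attributes arrive in info->attrs[], already validated
	 * against uet_nl_policy by the genetlink core. */
	if (!info->attrs[UET_NL_A_CTX_ID])
		return -EINVAL;
	return 0;
}

static const struct genl_ops uet_nl_ops[] = {
	{
		.cmd	= 1,			/* e.g. UET_NL_CMD_CTX_NEW */
		.doit	= uet_nl_ctx_new,
		.flags	= GENL_ADMIN_PERM,	/* require CAP_NET_ADMIN */
	},
};

static struct genl_family uet_nl_family = {
	.name		= "ultraeth",
	.version	= 1,
	.maxattr	= UET_NL_A_MAX,
	.policy		= uet_nl_policy,
	.netnsok	= true,
	.module		= THIS_MODULE,
	.ops		= uet_nl_ops,
	.n_ops		= ARRAY_SIZE(uet_nl_ops),
	.resv_start_op	= 2,
};

/* genl_register_family(&uet_nl_family) is then called at module init. */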

Cheers,
 Nik
Leon Romanovsky March 12, 2025, 3:10 p.m. UTC | #10
On Wed, Mar 12, 2025 at 04:20:08PM +0200, Nikolay Aleksandrov wrote:
> On 3/12/25 1:29 PM, Leon Romanovsky wrote:
> > On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
> >> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> >>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> [snip]
> >> Also we have the ephemeral PDC connections
> >> that come and go as needed. There are more such objects coming, with more
> >> state, configuration and lifecycle management. That is why we added a
> >> separate netlink family to cleanly manage them without trying to fit
> >> a square peg in a round hole, so to speak.
> > 
> > Yeah, I saw that you are planning to use netlink to manage objects,
> > which is very questionable. It is slow, unreliable, requires sockets,
> > needs more parsing logic, etc.
> > 
> > To avoid all this overhead, RDMA uses netlink-like ioctl calls, which
> > fits better for object configurations.
> > 
> > Thanks
> 
> We'd definitely like to keep using netlink for control path object
> management. Also please note we're talking about a genetlink family. It is
> fast and reliable enough for us, very easily extensible,
> has a nice precise object definition with policies to enforce various
> limitations, has extensive tooling (e.g. ynl), communication can be
> monitored in real time for debugging (e.g. nlmon), has nice human-readable
> error reporting, gives the ability to easily dump large object
> groups with filters applied, has YAML family definitions and so on.
> Having sockets or parsing is not an issue.

Of course it is an issue, as netlink relies on Netlink sockets, which means
that you constantly move your configuration data around instead of following
the pattern, standard across the whole Linux kernel, of allocating
configuration structs in user-space and just providing a pointer to them
through an ioctl call.

However, this discussion is premature, and as an intro it is worth reading
this cover letter on how object management is done in the RDMA
subsystem.

https://lore.kernel.org/linux-rdma/1501765627-104860-1-git-send-email-matanb@mellanox.com/

Thanks

> 
> Cheers,
>  Nik
> 
>
Nikolay Aleksandrov March 12, 2025, 4 p.m. UTC | #11
On 3/12/25 5:10 PM, Leon Romanovsky wrote:
> On Wed, Mar 12, 2025 at 04:20:08PM +0200, Nikolay Aleksandrov wrote:
>> On 3/12/25 1:29 PM, Leon Romanovsky wrote:
>>> On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
>>>> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
>>>>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
>> [snip]
>>>> Also we have the ephemeral PDC connections
>>>> that come and go as needed. There are more such objects coming, with more
>>>> state, configuration and lifecycle management. That is why we added a
>>>> separate netlink family to cleanly manage them without trying to fit
>>>> a square peg in a round hole, so to speak.
>>>
>>> Yeah, I saw that you are planning to use netlink to manage objects,
>>> which is very questionable. It is slow, unreliable, requires sockets,
>>> needs more parsing logic, etc.
>>>
>>> To avoid all this overhead, RDMA uses netlink-like ioctl calls, which
>>> fits better for object configurations.
>>>
>>> Thanks
>>
>> We'd definitely like to keep using netlink for control path object
>> management. Also please note we're talking about a genetlink family. It is
>> fast and reliable enough for us, very easily extensible,
>> has a nice precise object definition with policies to enforce various
>> limitations, has extensive tooling (e.g. ynl), communication can be
>> monitored in real time for debugging (e.g. nlmon), has nice human-readable
>> error reporting, gives the ability to easily dump large object
>> groups with filters applied, has YAML family definitions and so on.
>> Having sockets or parsing is not an issue.
> 
> Of course it is an issue, as netlink relies on Netlink sockets, which means
> that you constantly move your configuration data around instead of following
> the pattern, standard across the whole Linux kernel, of allocating
> configuration structs in user-space and just providing a pointer to them
> through an ioctl call.
> 

I should've been more specific - it is not an issue for UEC and the way
our driver's netlink API is designed. We fully understand the pros and
cons of our approach.

> However, this discussion is premature, and as an intro it is worth reading
> this cover letter on how object management is done in the RDMA
> subsystem.
>
> https://lore.kernel.org/linux-rdma/1501765627-104860-1-git-send-email-matanb@mellanox.com/
> 

Sure, I know how uverbs work, but thanks for the pointer!

> Thanks

Cheers,
 Nik

>>
>> Cheers,
>>  Nik
>>
>>