diff mbox

[v2,02/37] add opcodes to ib_pack.h

Message ID 20110724201228.204265835@systemfabricworks.com (mailing list archive)
State New, archived
Headers show

Commit Message

Robert Pearson July 24, 2011, 7:43 p.m. UTC
Bring up to date with the current version of the IBTA spec.
	- add new opcodes for RC and RD
	- add new groups of opcodes for CN and XRC

Signed-off-by: Bob Pearson <rpearson@systemfabricworks.com>

---
 include/rdma/ib_pack.h |   39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Bart Van Assche Aug. 15, 2011, 9:13 a.m. UTC | #1
On Sun, Jul 24, 2011 at 9:43 PM, <rpearson@systemfabricworks.com> wrote:
>
> Bring up to date with the current version of the IBTA spec.
>        - add new opcodes for RC and RD
>        - add new groups of opcodes for CN and XRC
>
> Signed-off-by: Bob Pearson <rpearson@systemfabricworks.com>
>
> ---
>  include/rdma/ib_pack.h |   39 ++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 38 insertions(+), 1 deletion(-)
>
> Index: infiniband/include/rdma/ib_pack.h
> ===================================================================
> --- infiniband.orig/include/rdma/ib_pack.h
> +++ infiniband/include/rdma/ib_pack.h
> @@ -75,6 +75,8 @@ enum {
>        IB_OPCODE_UC                                = 0x20,
>        IB_OPCODE_RD                                = 0x40,
>        IB_OPCODE_UD                                = 0x60,
> +       IB_OPCODE_CN                                = 0x80,
> +       IB_OPCODE_XRC                               = 0xA0,

A convention when submitting Linux kernel patches is to add only those
constants that are actually used in the code submitted in the same
patch series. So I'm not sure it's fine to add the XRC constants in
ib_pack.h through the rxe patch series.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert Pearson Aug. 15, 2011, 4:15 p.m. UTC | #2
Originally I had xrc support in rxe but there was nothing in Roland's tree
to support it so I stripped it out.
I was experimenting with congestion notification to see if there was a way
to slow down spewing by fast Ethernet devices but that wasn't working, also
all the MAD support was stripped out. In the end the only thing left was the
constants and entries in the opcode table in rxe_opcode.c which as you say
are not used elsewhere. I think it is reasonable to have a complete set of
opcodes somewhere.

I want to eventually complete the IB transport parts of the driver although
they are not used by RoCE. They would be useful for a soft IB implementation
with say an FPGA based MAC/PHY. The long term goal is to have a complete
reference implementation of the IB Architecture.

> -----Original Message-----
> From: bart.vanassche@gmail.com [mailto:bart.vanassche@gmail.com] On
> Behalf Of Bart Van Assche
> Sent: Monday, August 15, 2011 4:13 AM
> To: rpearson@systemfabricworks.com
> Cc: linux-rdma@vger.kernel.org
> Subject: Re: [patch v2 02/37] add opcodes to ib_pack.h
> 
> On Sun, Jul 24, 2011 at 9:43 PM, <rpearson@systemfabricworks.com> wrote:
> >
> > Bring up to date with the current version of the IBTA spec.
> >        - add new opcodes for RC and RD
> >        - add new groups of opcodes for CN and XRC
> >
> > Signed-off-by: Bob Pearson <rpearson@systemfabricworks.com>
> >
> > ---
> >  include/rdma/ib_pack.h |   39
> ++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 38 insertions(+), 1 deletion(-)
> >
> > Index: infiniband/include/rdma/ib_pack.h
> >
> ==========================================================
> =========
> > --- infiniband.orig/include/rdma/ib_pack.h
> > +++ infiniband/include/rdma/ib_pack.h
> > @@ -75,6 +75,8 @@ enum {
> >        IB_OPCODE_UC                                = 0x20,
> >        IB_OPCODE_RD                                = 0x40,
> >        IB_OPCODE_UD                                = 0x60,
> > +       IB_OPCODE_CN                                = 0x80,
> > +       IB_OPCODE_XRC                               = 0xA0,
> 
> A convention when submitting Linux kernel patches is to add only those
> constants that are actually used in the code submitted in the same
> patch series. So I'm not sure it's fine to add the XRC constants in
> ib_pack.h through the rxe patch series.
> 
> Bart.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche Aug. 22, 2011, 6:08 p.m. UTC | #3
On Sat, Aug 20, 2011 at 4:47 PM, Bart Van Assche <bvanassche@acm.org> wrote:
> On Mon, Aug 15, 2011 at 6:15 PM, Bob Pearson
> <rpearson@systemfabricworks.com> wrote:
> > I want to eventually complete the IB transport parts of the driver although
> > they are not used by RoCE. They would be useful for a soft IB implementation
> > with say an FPGA based MAC/PHY. The long term goal is to have a complete
> > reference implementation of the IB Architecture.
>
> Having a complete reference implementation of the IB architecture
> available would be great. IMHO at least the following should be added
> to ib_rxe (not necessarily in the first version) before it can called
> a reference implementation:
> - Support for secondary GIDs. These make it easy to implement
> fail-over in a H.A. cluster.
> - Support for multicast.

More in detail, my comments with regard to multicast support in ib_rxe are:
1. Using ipv6_eth_mc_map() for mapping multicast GIDs seems an
unfortunate choice to me. That choice will cause multicast GIDs to be
mapped to the 33-33-xx-xx-xx-xx Ethernet address range that has been
reserved by RFC 2464 for IPv6 multicast addresses. If a collision with
an IPv6 multicast address occurs and IPv6 MLD snooping has enabled on
the switches in the involved network then RoCE multicast won't work
properly. IMHO we need a separate Ethernet address range for RoCE
multicast purposes, next to the existing ranges for IPv4 and IPv6.
2. We need a way to limit multicast traffic to those switch ports on
which receivers are listening. The InfiniBand architecture has such a
protocol but it relies on the subnet manager, a component that does
not exist in RoCE networks. And since IGMP and MLD are IPv4/IPv6
specific we need a new (IBTA-approved) protocol.
3. RFC 4541, in which IGMP and MLD snooping has been defined, will
have to be extended with a paragraph about snooping the protocol
defined in (2).
4. Switch vendors will have to implement the protocol item (3) refers to.

Shouldn't a RoCE multicast implementation be postponed until the IEEE
has assigned an Ethernet address range for mapping RoCE multicast GIDs
?

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert Pearson Sept. 1, 2011, 3:15 a.m. UTC | #4
> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-
> owner@vger.kernel.org] On Behalf Of Bart Van Assche
> Sent: Monday, August 22, 2011 1:08 PM
> To: Bob Pearson
> Cc: linux-rdma@vger.kernel.org
> Subject: Re: [patch v2 02/37] add opcodes to ib_pack.h
> 
> On Sat, Aug 20, 2011 at 4:47 PM, Bart Van Assche <bvanassche@acm.org>
> wrote:
> > On Mon, Aug 15, 2011 at 6:15 PM, Bob Pearson
> > <rpearson@systemfabricworks.com> wrote:
> > > I want to eventually complete the IB transport parts of the driver
> although
> > > they are not used by RoCE. They would be useful for a soft IB
> implementation
> > > with say an FPGA based MAC/PHY. The long term goal is to have a
> complete
> > > reference implementation of the IB Architecture.
> >
> > Having a complete reference implementation of the IB architecture
> > available would be great. IMHO at least the following should be added
> > to ib_rxe (not necessarily in the first version) before it can called
> > a reference implementation:
> > - Support for secondary GIDs. These make it easy to implement
> > fail-over in a H.A. cluster.
> > - Support for multicast.
> 
> More in detail, my comments with regard to multicast support in ib_rxe
are:
> 1. Using ipv6_eth_mc_map() for mapping multicast GIDs seems an
> unfortunate choice to me. That choice will cause multicast GIDs to be
> mapped to the 33-33-xx-xx-xx-xx Ethernet address range that has been
> reserved by RFC 2464 for IPv6 multicast addresses. If a collision with
> an IPv6 multicast address occurs and IPv6 MLD snooping has enabled on
> the switches in the involved network then RoCE multicast won't work
> properly. IMHO we need a separate Ethernet address range for RoCE
> multicast purposes, next to the existing ranges for IPv4 and IPv6.

I thought this was intentional! I.e. in order to get multicast over RoCE to
work we were trying to piggyback on the behavior of intelligent switches.
Otherwise the only way to get packets forwarded without adding a new group
of multicast addresses would be to broadcast. The implementation of IGMP and
MLD at the end node is oblivious to the packet's ethertype. And switches (at
least the ones we tried) also happily multicast RoCE packets.

> 2. We need a way to limit multicast traffic to those switch ports on
> which receivers are listening. The InfiniBand architecture has such a
> protocol but it relies on the subnet manager, a component that does
> not exist in RoCE networks. And since IGMP and MLD are IPv4/IPv6
> specific we need a new (IBTA-approved) protocol.
> 3. RFC 4541, in which IGMP and MLD snooping has been defined, will
> have to be extended with a paragraph about snooping the protocol
> defined in (2).
> 4. Switch vendors will have to implement the protocol item (3) refers to.
> 
> Shouldn't a RoCE multicast implementation be postponed until the IEEE
> has assigned an Ethernet address range for mapping RoCE multicast GIDs
> ?
> 
> Bart.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert Pearson Sept. 1, 2011, 3:38 a.m. UTC | #5
OK. Since for now I don't need it I'll remove this patch but it might be
worth considering adding them as a standalone patch.

> -----Original Message-----
> From: bart.vanassche@gmail.com [mailto:bart.vanassche@gmail.com] On
> Behalf Of Bart Van Assche
> Sent: Monday, August 15, 2011 4:13 AM
> To: rpearson@systemfabricworks.com
> Cc: linux-rdma@vger.kernel.org
> Subject: Re: [patch v2 02/37] add opcodes to ib_pack.h
> 
> On Sun, Jul 24, 2011 at 9:43 PM, <rpearson@systemfabricworks.com> wrote:
> >
> > Bring up to date with the current version of the IBTA spec.
> >        - add new opcodes for RC and RD
> >        - add new groups of opcodes for CN and XRC
> >
> > Signed-off-by: Bob Pearson <rpearson@systemfabricworks.com>
> >
> > ---
> >  include/rdma/ib_pack.h |   39
> ++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 38 insertions(+), 1 deletion(-)
> >
> > Index: infiniband/include/rdma/ib_pack.h
> >
> ==========================================================
> =========
> > --- infiniband.orig/include/rdma/ib_pack.h
> > +++ infiniband/include/rdma/ib_pack.h
> > @@ -75,6 +75,8 @@ enum {
> >        IB_OPCODE_UC                                = 0x20,
> >        IB_OPCODE_RD                                = 0x40,
> >        IB_OPCODE_UD                                = 0x60,
> > +       IB_OPCODE_CN                                = 0x80,
> > +       IB_OPCODE_XRC                               = 0xA0,
> 
> A convention when submitting Linux kernel patches is to add only those
> constants that are actually used in the code submitted in the same
> patch series. So I'm not sure it's fine to add the XRC constants in
> ib_pack.h through the rxe patch series.
> 
> Bart.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Sept. 1, 2011, 5:29 p.m. UTC | #6
On Wed, Aug 31, 2011 at 10:15:02PM -0500, Bob Pearson wrote:

> > 1. Using ipv6_eth_mc_map() for mapping multicast GIDs seems an
> > unfortunate choice to me. That choice will cause multicast GIDs to be
> > mapped to the 33-33-xx-xx-xx-xx Ethernet address range that has been
> > reserved by RFC 2464 for IPv6 multicast addresses. If a collision with
> > an IPv6 multicast address occurs and IPv6 MLD snooping has enabled on
> > the switches in the involved network then RoCE multicast won't work
> > properly. IMHO we need a separate Ethernet address range for RoCE
> > multicast purposes, next to the existing ranges for IPv4 and IPv6.
> 
> I thought this was intentional! I.e. in order to get multicast over RoCE to
> work we were trying to piggyback on the behavior of intelligent switches.
> Otherwise the only way to get packets forwarded without adding a new group
> of multicast addresses would be to broadcast. The implementation of IGMP and
> MLD at the end node is oblivious to the packet's ethertype. And switches (at
> least the ones we tried) also happily multicast RoCE packets.

Nothing about RoCE multicast is standardized or even proposed to be
standardized.

Hijacking the IPv6 MAC space - and also abusing the net stack to
generate MLD stuff is what Mellanox did in their implementation. I
don't know if that was included in the main line submission or not?

Frankly I'd be inclined to map to the broadcast MAC address to pressure
the relevant standards bodies to finish the work on RoCE.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche Sept. 1, 2011, 5:51 p.m. UTC | #7
On Thu, Sep 1, 2011 at 5:15 AM, Bob Pearson
<rpearson@systemfabricworks.com> wrote:
> > On Sat, Aug 20, 2011 at 4:47 PM, Bart Van Assche <bvanassche@acm.org>
> > More in detail, my comments with regard to multicast support in ib_rxe
> are:
> > 1. Using ipv6_eth_mc_map() for mapping multicast GIDs seems an
> > unfortunate choice to me. That choice will cause multicast GIDs to be
> > mapped to the 33-33-xx-xx-xx-xx Ethernet address range that has been
> > reserved by RFC 2464 for IPv6 multicast addresses. If a collision with
> > an IPv6 multicast address occurs and IPv6 MLD snooping has enabled on
> > the switches in the involved network then RoCE multicast won't work
> > properly. IMHO we need a separate Ethernet address range for RoCE
> > multicast purposes, next to the existing ranges for IPv4 and IPv6.
>
> I thought this was intentional! I.e. in order to get multicast over RoCE to
> work we were trying to piggyback on the behavior of intelligent switches.
> Otherwise the only way to get packets forwarded without adding a new group
> of multicast addresses would be to broadcast. The implementation of IGMP and
> MLD at the end node is oblivious to the packet's ethertype. And switches (at
> least the ones we tried) also happily multicast RoCE packets.

Shouldn't both the RoCE specification and RFC 4541 be updated
according to the above ?

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: infiniband/include/rdma/ib_pack.h
===================================================================
--- infiniband.orig/include/rdma/ib_pack.h
+++ infiniband/include/rdma/ib_pack.h
@@ -75,6 +75,8 @@  enum {
 	IB_OPCODE_UC                                = 0x20,
 	IB_OPCODE_RD                                = 0x40,
 	IB_OPCODE_UD                                = 0x60,
+	IB_OPCODE_CN                                = 0x80,
+	IB_OPCODE_XRC                               = 0xA0,
 
 	/* operations -- just used to define real constants */
 	IB_OPCODE_SEND_FIRST                        = 0x00,
@@ -98,6 +100,10 @@  enum {
 	IB_OPCODE_ATOMIC_ACKNOWLEDGE                = 0x12,
 	IB_OPCODE_COMPARE_SWAP                      = 0x13,
 	IB_OPCODE_FETCH_ADD                         = 0x14,
+	IB_OPCODE_CNP                               = 0x00,
+	IB_OPCODE_RESYNC                            = 0x15,
+	IB_OPCODE_SEND_LAST_INV                     = 0x16,
+	IB_OPCODE_SEND_ONLY_INV                     = 0x17,
 
 	/* real constants follow -- see comment about above IB_OPCODE()
 	   macro for more details */
@@ -124,6 +130,8 @@  enum {
 	IB_OPCODE(RC, ATOMIC_ACKNOWLEDGE),
 	IB_OPCODE(RC, COMPARE_SWAP),
 	IB_OPCODE(RC, FETCH_ADD),
+	IB_OPCODE(RC, SEND_LAST_INV),
+	IB_OPCODE(RC, SEND_ONLY_INV),
 
 	/* UC */
 	IB_OPCODE(UC, SEND_FIRST),
@@ -161,10 +169,39 @@  enum {
 	IB_OPCODE(RD, ATOMIC_ACKNOWLEDGE),
 	IB_OPCODE(RD, COMPARE_SWAP),
 	IB_OPCODE(RD, FETCH_ADD),
+	IB_OPCODE(RD, RESYNC),
 
 	/* UD */
 	IB_OPCODE(UD, SEND_ONLY),
-	IB_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE)
+	IB_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE),
+
+	/* CN */
+	IB_OPCODE(CN, CNP),
+
+	/* XRC */
+	IB_OPCODE(XRC, SEND_FIRST),
+	IB_OPCODE(XRC, SEND_MIDDLE),
+	IB_OPCODE(XRC, SEND_LAST),
+	IB_OPCODE(XRC, SEND_LAST_WITH_IMMEDIATE),
+	IB_OPCODE(XRC, SEND_ONLY),
+	IB_OPCODE(XRC, SEND_ONLY_WITH_IMMEDIATE),
+	IB_OPCODE(XRC, RDMA_WRITE_FIRST),
+	IB_OPCODE(XRC, RDMA_WRITE_MIDDLE),
+	IB_OPCODE(XRC, RDMA_WRITE_LAST),
+	IB_OPCODE(XRC, RDMA_WRITE_LAST_WITH_IMMEDIATE),
+	IB_OPCODE(XRC, RDMA_WRITE_ONLY),
+	IB_OPCODE(XRC, RDMA_WRITE_ONLY_WITH_IMMEDIATE),
+	IB_OPCODE(XRC, RDMA_READ_REQUEST),
+	IB_OPCODE(XRC, RDMA_READ_RESPONSE_FIRST),
+	IB_OPCODE(XRC, RDMA_READ_RESPONSE_MIDDLE),
+	IB_OPCODE(XRC, RDMA_READ_RESPONSE_LAST),
+	IB_OPCODE(XRC, RDMA_READ_RESPONSE_ONLY),
+	IB_OPCODE(XRC, ACKNOWLEDGE),
+	IB_OPCODE(XRC, ATOMIC_ACKNOWLEDGE),
+	IB_OPCODE(XRC, COMPARE_SWAP),
+	IB_OPCODE(XRC, FETCH_ADD),
+	IB_OPCODE(XRC, SEND_LAST_INV),
+	IB_OPCODE(XRC, SEND_ONLY_INV),
 };
 
 enum {