[EXPERIMENTAL,v1,0/4] RDMA loopback device

Message ID 1551248837-64041-1-git-send-email-parav@mellanox.com (mailing list archive)

Message

Parav Pandit Feb. 27, 2019, 6:27 a.m. UTC
This patchset adds an RDMA loopback driver, initially for RoCE, which works on the lo netdevice.

It has been tested with NVMe fabrics over ext4, perftests, and rping.
It supports only RC and GSI QPs.
It supports only RoCEv2 GIDs that belong to the loopback (lo) netdevice.

It is posted only for discussion [1].
It is not yet ready for an RFC posting or merge.

Its counterpart rdma-core series will be posted shortly.

[1] https://www.spinics.net/lists/linux-rdma/msg76285.html

---
Changelog:
v0->v1:
 - Added the rdmacm patch which I missed in the v0 patchset.

Parav Pandit (4):
  RDMA/cma: Add support for loopback netdevice
  RDMA/loopback: Add helper lib for resources and cqe fifo
  RDMA/loopback: Loopback rdma (RoCE) driver
  RDMA/loopback: Support Fast memory registration

 drivers/infiniband/Kconfig                       |    1 +
 drivers/infiniband/core/cma.c                    |  134 +-
 drivers/infiniband/sw/Makefile                   |    1 +
 drivers/infiniband/sw/loopback/Kconfig           |   14 +
 drivers/infiniband/sw/loopback/Makefile          |    4 +
 drivers/infiniband/sw/loopback/helper.c          |  139 ++
 drivers/infiniband/sw/loopback/loopback.c        | 1690 ++++++++++++++++++++++
 drivers/infiniband/sw/loopback/loopback_helper.h |   68 +
 include/uapi/rdma/rdma_user_ioctl_cmds.h         |    1 +
 9 files changed, 1930 insertions(+), 122 deletions(-)
 create mode 100644 drivers/infiniband/sw/loopback/Kconfig
 create mode 100644 drivers/infiniband/sw/loopback/Makefile
 create mode 100644 drivers/infiniband/sw/loopback/helper.c
 create mode 100644 drivers/infiniband/sw/loopback/loopback.c
 create mode 100644 drivers/infiniband/sw/loopback/loopback_helper.h

Comments

Leon Romanovsky Feb. 27, 2019, 7:56 a.m. UTC | #1
On Wed, Feb 27, 2019 at 12:27:13AM -0600, Parav Pandit wrote:
> This patchset adds RDMA loopback driver.
> Initially for RoCE which works on lo netdevice.
>
> It is tested with with nvme fabrics over ext4, perftests, and rping.
> It only supports RC and GSI QPs.
> It supports only RoCEv2 GIDs which belongs to loopback lo netdevice.
>
> It is only posted for discussion [1].
> It is not yet ready for RFC posting or merge.

What type of discussion do you expect?
And can you give a brief explanation of why it wasn't enough to extend rxe/siw?

Thanks
Lijun Ou Feb. 27, 2019, 8:08 a.m. UTC | #2
Hi,
  What is the difference between the loopback driver and a loopback packet in the IB protocol?
 
From the IB protocol description:
10.2.2.3 LOOPBACK
Packet loopback is supported through self addressed packets as described
in 17.2.2 and C17-18:. Self-addressed packet is a packet whose
DLID and SLID address the same port of the same CA. Additionally, an
HCA can optionally support a Loopback Indicator. When Loopback Indicator
is supported, both Address Handles and QP/EE Address Vectors
can use Loopback Indicator instead of Destination LID. When reaping
completions for Datagrams based QPs, the Loopback Indicator is also reported
if the origin of the message was Loopback.

I always have a question about the above: why not expose the Loopback Indicator to the user to support loopback packets?

thanks
On 2019/2/27 14:27, Parav Pandit wrote:
> This patchset adds RDMA loopback driver.
> Initially for RoCE which works on lo netdevice.
>
> It is tested with with nvme fabrics over ext4, perftests, and rping.
> It only supports RC and GSI QPs.
> It supports only RoCEv2 GIDs which belongs to loopback lo netdevice.
>
> It is only posted for discussion [1].
> It is not yet ready for RFC posting or merge.
>
> Its counter part rdma-core will be posted shortly.
>
> [1] https://www.spinics.net/lists/linux-rdma/msg76285.html
>
> ---
> Changelog:
> v0->v1:
>  - Added rdmacm patcch which I missed in first v0 patchset.
>
> Parav Pandit (4):
>   RDMA/cma: Add support for loopback netdevice
>   RDMA/loopback: Add helper lib for resources and cqe fifo
>   RDMA/loopback: Loopback rdma (RoCE) driver
>   RDMA/loopback: Support Fast memory registration
>
>  drivers/infiniband/Kconfig                       |    1 +
>  drivers/infiniband/core/cma.c                    |  134 +-
>  drivers/infiniband/sw/Makefile                   |    1 +
>  drivers/infiniband/sw/loopback/Kconfig           |   14 +
>  drivers/infiniband/sw/loopback/Makefile          |    4 +
>  drivers/infiniband/sw/loopback/helper.c          |  139 ++
>  drivers/infiniband/sw/loopback/loopback.c        | 1690 ++++++++++++++++++++++
>  drivers/infiniband/sw/loopback/loopback_helper.h |   68 +
>  include/uapi/rdma/rdma_user_ioctl_cmds.h         |    1 +
>  9 files changed, 1930 insertions(+), 122 deletions(-)
>  create mode 100644 drivers/infiniband/sw/loopback/Kconfig
>  create mode 100644 drivers/infiniband/sw/loopback/Makefile
>  create mode 100644 drivers/infiniband/sw/loopback/helper.c
>  create mode 100644 drivers/infiniband/sw/loopback/loopback.c
>  create mode 100644 drivers/infiniband/sw/loopback/loopback_helper.h
>
Parav Pandit Feb. 27, 2019, 7:49 p.m. UTC | #3
> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Wednesday, February 27, 2019 1:56 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: bvanassche@acm.org; linux-rdma@vger.kernel.org
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> On Wed, Feb 27, 2019 at 12:27:13AM -0600, Parav Pandit wrote:
> > This patchset adds RDMA loopback driver.
> > Initially for RoCE which works on lo netdevice.
> >
> > It is tested with with nvme fabrics over ext4, perftests, and rping.
> > It only supports RC and GSI QPs.
> > It supports only RoCEv2 GIDs which belongs to loopback lo netdevice.
> >
> > It is only posted for discussion [1].
> > It is not yet ready for RFC posting or merge.
> 
> Which type of discussion do you expect?
Continuation of [1].
> And can you give brief explanation why wasn't enough to extend rxe/siw?
> 
Adding the lo netdev to rxe is certainly an option, along with the cma patch in this series.

The rxe QP state machine is built around spin locks.
Its pools don't use the xarray that loopback uses and siw intends to use.

Incidentally, rxe in 5.0.0-rc5 crashes on memory registration. I didn't have the inspiration to supply a patch.

However, rxe as it stands today, even after several fixes from many contributors, is still not there.
It leaks the consumer index to user space, and I am not sure of the effect of that; Jason mentioned some security concerns that I don't recall.
A while back when I reviewed the code, I saw things that might crash the kernel.

Users complain of memory leaks and RNR retries dropping connections.

Setting most of those aside, I think the reasons for wanting a loopback rdma device are:
1. rxe is not ready for adding IB link types, and avoiding skb processing in it requires a large code restructuring; skipping skbs is a pretty large rewrite.
2. stability and reasonable performance
3. maintainability

But if you think rxe is solid, siw should refactor the rxe code and start using most pieces from there, split into a library for RoCE and iWARP.
Once that layering is done, maybe loopback can fit in as a different L4, so that rxe uses skbs, siw uses sockets, and loopback uses memcpy.

Loopback's helper.c is intended to share code with siw for table resources backed by an xarray.
It also offers complete kernel-level handling of data- and control-path commands, with published perf numbers.
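
As a rough illustration of the shared table-resource idea (names are invented and this is not the actual helper.c; xa_alloc() is used here in its current struct xa_limit form, which is an assumption about the target kernel), such an xarray-backed id table could look like:

#include <linux/xarray.h>

/* Hypothetical shared resource table: maps a qpn/mrkey-style id to its
 * driver object, so rxe/siw/loopback could reuse one allocator. */
struct lb_res_table {
	struct xarray xa;
	u32 max_id;
};

static void lb_res_table_init(struct lb_res_table *tbl, u32 max_id)
{
	/* ids are allocated by the xarray itself, starting at 1 */
	xa_init_flags(&tbl->xa, XA_FLAGS_ALLOC1);
	tbl->max_id = max_id;
}

/* Insert @res and return its newly allocated id, or a negative errno. */
static int lb_res_add(struct lb_res_table *tbl, void *res)
{
	u32 id;
	int err;

	err = xa_alloc(&tbl->xa, &id, res, XA_LIMIT(1, tbl->max_id),
		       GFP_KERNEL);
	return err ? err : id;
}

static void *lb_res_get(struct lb_res_table *tbl, u32 id)
{
	return xa_load(&tbl->xa, id);
}

static void lb_res_del(struct lb_res_table *tbl, u32 id)
{
	xa_erase(&tbl->xa, id);
}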
Parav Pandit Feb. 27, 2019, 9:32 p.m. UTC | #4
> -----Original Message-----
> From: oulijun <oulijun@huawei.com>
> Sent: Wednesday, February 27, 2019 2:09 AM
> To: Parav Pandit <parav@mellanox.com>; bvanassche@acm.org; linux-
> rdma@vger.kernel.org
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> Hi,
>   What is the difference between loopback  driver and loopback packet in the
> IB protocol?
> 
Packet loopback in the IB protocol is as you described.
However, users almost always need a physical hardware HCA (IB/RoCE, iWARP) for that.
iWARP patches are in progress.

There hasn't yet been a reliable software stack for users who just want to run their applications on a laptop/VM on a single system.
This is often done for unit tests, sanity checks, and stack regression tests.

The loopback driver mimics netdev's lo-style loopback as an rdma device in a very simple manner.
It is ABI-less, with hardly any user space driver at all, and still achieves 7 Gbps to 80 Gbps performance.

Again, rxe and cma can be enhanced to add the lo netdev interface via the newly added netlink command.
But I have serious doubts that we can make rxe really work perfectly without an ABI change and without major rework.
I am not sure that is really worth it. If siw reworks the rxe driver, then there is some light of hope.


> for the IB protocol description:
> 10.2.2.3 LOOPBACK
> Packet loopback is supported through self addressed packets as described in
> 17.2.2 and C17-18:. Self-addressed packet is a packet whose DLID and SLID
> address the same port of the same CA. Additionally, an HCA can optionally
> support a Loopback Indicator. When Loopback Indicator is supported, both
> Address Handles and QP/EE Address Vectors can use Loopback Indicator
> instead of Destination LID. When reaping completions for Datagrams based
> QPs, the Loopback Indicator is also reported if the origin of the message was
> Loopback.
> 
> I always have a question for the above. Why not expsure the loopback
> indicator for user to support loopback packet?
> 
It is not useful to expose this and modify all user applications for hw and sw devices.
The core code already does this detection and performs the loopback on hw and sw interfaces.
Hence it is not desirable to expose this optional part of the spec.
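
As a rough sketch of what such detection can look like (an illustration only, not the actual core code; rdma_find_gid_by_port() is an existing core helper, but whether loopback detection is implemented exactly this way is an assumption), a RoCE destination is self-addressed when its GID resolves to an entry on the local port's GID table:

#include <rdma/ib_verbs.h>
#include <rdma/ib_cache.h>

/* Sketch only: treat a destination as loopback when its GID is
 * already programmed on the local port's GID table. */
static bool dgid_is_local(struct ib_device *dev, u8 port_num,
			  const union ib_gid *dgid)
{
	const struct ib_gid_attr *attr;

	attr = rdma_find_gid_by_port(dev, dgid,
				     IB_GID_TYPE_ROCE_UDP_ENCAP,
				     port_num, NULL);
	if (IS_ERR(attr))
		return false;

	rdma_put_gid_attr(attr);
	return true;
}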
Lijun Ou Feb. 28, 2019, 3:27 a.m. UTC | #5
On 2019/2/28 5:32, Parav Pandit wrote:
>
>> -----Original Message-----
>> From: oulijun <oulijun@huawei.com>
>> Sent: Wednesday, February 27, 2019 2:09 AM
>> To: Parav Pandit <parav@mellanox.com>; bvanassche@acm.org; linux-
>> rdma@vger.kernel.org
>> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
>>
>> Hi,
>>   What is the difference between loopback  driver and loopback packet in the
>> IB protocol?
>>
> Loopback packet in IB protocol is as you described.
> However users almost always needs a physical hardware HCA (IB/RoCE, iwarp).
> Iwarp patches are in progress.
>
> There hasn't been reliable software stack yet for users who want to just run their applications in laptop/vm on single system.
> This is often done for unit tests, sanity, stack regression tests. 
>
> Loopback driver mimics the netdev's lo style loopback rdma device in very simple manner.
> Its ABI less, where there almost any user space driver at all and still achieves 7Gbps to 80Gbps performance.
>
> Again rxe and cma can be enhanced, to add lo netdev interface via new netlink command added.
> But I have serious doubts that we can make rxe really work perfectly without ABI change, and without major rework.
> I am not sure if that is really worth it. If siw reworks the rxe driver, than there is some light of hope.
>
Thanks for your detailed reply. I have seen the goal of the loopback driver.
But I still want to use this series to ask about the loopback packet question.

 The protocol specifies:
  o10-6.2-1.1: If the CI supports the Loopback Indicator, the CI shall report
it through the Query HCA Verb. The CI shall accept the Loopback Indicator
as an input modifier to the Create Address Handle, Modify Address
Handle, Modify QP and Modify EEC verbs

  I have checked the Create Address Handle verb's input modifiers.
  struct rdma_ah_attr {
	struct ib_global_route	grh;
	u8			sl;
	u8			static_rate;
	u8			port_num;
	u8			ah_flags;
	enum rdma_ah_attr_type type;
	union {
		struct ib_ah_attr ib;
		struct roce_ah_attr roce;
		struct opa_ah_attr opa;
	};
};

And the input modifiers for the Create Address Handle verb in the protocol specification:
   ...
   ...
   • Loopback Indicator (if the HCA supports Loopback Indicator
on Address Handles). DLID and Loopback Indicator are mutually
exclusive.
   ...

   I don't see why the Loopback Indicator is defined but not exposed to the application.
   My understanding is that the Loopback Indicator should be provided by OFED so that the user can configure it. Is there another understanding?
   And what if the user does not use the lo netdev for loopback but uses another netdev for the rdma device?

thanks

>> for the IB protocol description:
>> 10.2.2.3 LOOPBACK
>> Packet loopback is supported through self addressed packets as described in
>> 17.2.2 and C17-18:. Self-addressed packet is a packet whose DLID and SLID
>> address the same port of the same CA. Additionally, an HCA can optionally
>> support a Loopback Indicator. When Loopback Indicator is supported, both
>> Address Handles and QP/EE Address Vectors can use Loopback Indicator
>> instead of Destination LID. When reaping completions for Datagrams based
>> QPs, the Loopback Indicator is also reported if the origin of the message was
>> Loopback.
>>
>> I always have a question for the above. Why not expsure the loopback
>> indicator for user to support loopback packet?
>>
> It is not useful to expose this and modify all user applications for hw and sw devices.
> Core code already does this detection and does the loopback on hw and sw interfaces.
> Hence it is not desired to expose this optional part of the spec.
Parav Pandit Feb. 28, 2019, 4:18 a.m. UTC | #6
> -----Original Message-----
> From: oulijun <oulijun@huawei.com>
> Sent: Wednesday, February 27, 2019 9:28 PM
> To: Parav Pandit <parav@mellanox.com>; bvanassche@acm.org; linux-
> rdma@vger.kernel.org
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> On 2019/2/28 5:32, Parav Pandit wrote:
> >
> >> -----Original Message-----
> >> From: oulijun <oulijun@huawei.com>
> >> Sent: Wednesday, February 27, 2019 2:09 AM
> >> To: Parav Pandit <parav@mellanox.com>; bvanassche@acm.org; linux-
> >> rdma@vger.kernel.org
> >> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> >>
> >> Hi,
> >>   What is the difference between loopback  driver and loopback packet
> >> in the IB protocol?
> >>
> > Loopback packet in IB protocol is as you described.
> > However users almost always needs a physical hardware HCA (IB/RoCE,
> iwarp).
> > Iwarp patches are in progress.
> >
> > There hasn't been reliable software stack yet for users who want to just
> run their applications in laptop/vm on single system.
> > This is often done for unit tests, sanity, stack regression tests.
> >
> > Loopback driver mimics the netdev's lo style loopback rdma device in very
> simple manner.
> > Its ABI less, where there almost any user space driver at all and still
> achieves 7Gbps to 80Gbps performance.
> >
> > Again rxe and cma can be enhanced, to add lo netdev interface via new
> netlink command added.
> > But I have serious doubts that we can make rxe really work perfectly
> without ABI change, and without major rework.
> > I am not sure if that is really worth it. If siw reworks the rxe driver, than
> there is some light of hope.
> >
> Thank your detail reply。 I have seen the goal for loopback driver。
> But I still want to use this series to ask about the loopback packet problem.
> 
>  The protocol specifies:
>   o10-6.2-1.1: If the CI supports the Loopback Indicator, the CI shall report it
> through the Query HCA Verb. The CI shall accept the Loopback Indicator as
> an input modifier to the Create Address Handle, Modify Address Handle,
> Modify QP and Modify EEC verbs
> 
>   I have checked the create address handle verbs' input modifier.
>   struct rdma_ah_attr {
> 	struct ib_global_route	grh;
> 	u8			sl;
> 	u8			static_rate;
> 	u8			port_num;
> 	u8			ah_flags;
> 	enum rdma_ah_attr_type type;
> 	union {
> 		struct ib_ah_attr ib;
> 		struct roce_ah_attr roce;
> 		struct opa_ah_attr opa;
> 	};
> };
> 
> and the verb for create ah in protocol specification.
> input modifiers:
>    ...
>    ...
>    • Loopback Indicator (if the HCA supports Loopback Indicator on Address
> Handles). DLID and Loopback Indicator are mutually exclusive.
>    ...
> 
>    I don't see why defined the loopback indicator and expose for app?
>    I understand that the loopback indicator should be gived by ofed, the user
> can configue it. Any other understand?
>    if the user not use lo nedev for loopback and use other nedev for rdma
> device?
> 
Loopback still works, with the HCA looping these packets for RoCE and IB link layers, without this user configuration.
An explicit user flag would be required when the user wants to force loopback while SLID != DLID.
I do not know of a use case, or of supported HCAs, that do that.

> thanks
> 
> >> for the IB protocol description:
> >> 10.2.2.3 LOOPBACK
> >> Packet loopback is supported through self addressed packets as
> >> described in
> >> 17.2.2 and C17-18:. Self-addressed packet is a packet whose DLID and
> >> SLID address the same port of the same CA. Additionally, an HCA can
> >> optionally support a Loopback Indicator. When Loopback Indicator is
> >> supported, both Address Handles and QP/EE Address Vectors can use
> >> Loopback Indicator instead of Destination LID. When reaping
> >> completions for Datagrams based QPs, the Loopback Indicator is also
> >> reported if the origin of the message was Loopback.
> >>
> >> I always have a question for the above. Why not expsure the loopback
> >> indicator for user to support loopback packet?
> >>
> > It is not useful to expose this and modify all user applications for hw and
> sw devices.
> > Core code already does this detection and does the loopback on hw and
> sw interfaces.
> > Hence it is not desired to expose this optional part of the spec.
> 
>
Dennis Dalessandro Feb. 28, 2019, 12:39 p.m. UTC | #7
On 2/27/2019 2:49 PM, Parav Pandit wrote:
> 
> 
>> -----Original Message-----
>> From: Leon Romanovsky <leon@kernel.org>
>> Sent: Wednesday, February 27, 2019 1:56 AM
>> To: Parav Pandit <parav@mellanox.com>
>> Cc: bvanassche@acm.org; linux-rdma@vger.kernel.org
>> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
>>
>> On Wed, Feb 27, 2019 at 12:27:13AM -0600, Parav Pandit wrote:
>>> This patchset adds RDMA loopback driver.
>>> Initially for RoCE which works on lo netdevice.
>>>
>>> It is tested with with nvme fabrics over ext4, perftests, and rping.
>>> It only supports RC and GSI QPs.
>>> It supports only RoCEv2 GIDs which belongs to loopback lo netdevice.
>>>
>>> It is only posted for discussion [1].
>>> It is not yet ready for RFC posting or merge.
>>
>> Which type of discussion do you expect?
> Continuation of [1].
>> And can you give brief explanation why wasn't enough to extend rxe/siw?
>>
> Adding lo netdev to rxe is certainly an option along with cma patch in this series.
> 
> qp state machine is around spin locks..
> pools doesn't use xarray that loopback uses and siw intends to use.
> 
> Incidentally, 5.0.0.rc5 rxe crashes on registering memory. Didn't have inspiration to supply a patch.

If rxe crashes we may want to fix it rather than creating a whole new 
driver.

> However rxe as it stands today after several fixes from many is still not there.
> It leaks consumer index to user space and not sure its effect of it. Jason did talk some of the security concern I don't recall.
> A while back when I reviewed the code, saw things that might crash kernel.

> Users complain of memory leaks, rnr retries dropping connections..

If rxe is so broken, and there is no interest in fixing it, why do we 
still have it? Should we just excise it from the tree?

> Giving low priority to most of them, I think desire to have loopback rdma device are below.
> 1. rxe is not ready for adding IB link types and large code restructure to avoid skb processing in it. Pretty large rewrite to skip skbs.
> 2. stability and reasonable performance
> 3. maintainability

I don't see how this is more maintainable. We are adding a new driver, a 
new user space provider. So I don't see that as being a reason for 
adding this.

> But if you think rxe is solid, siw should refactor the rxe code and start using most pieces from there, split into library for roce and iw.
> Once that layering is done, may be loopback can fit that as different L4 so that rxe uses skb, siw uses sockets, loopback uses memcpy.

This is why rxe should have used rdmavt from the beginning and we would 
pretty much have such a library.

> Loopback's helper.c is intended to share code with siw for table resources as xarray.
> It also offers complete kernel level handling of data and control path commands and published perf numbers.

We can debate back and forth whether this needed to be included in siw 
and rxe, or if it and the others should have used rdmavt. However, I 
think this is different enough of an approach that it does stand on its 
own and could in fact be a new driver.

The fact that rxe is broken and no one seems to want to fix it shouldn't 
be our reason though.

-Denny
Leon Romanovsky Feb. 28, 2019, 1:22 p.m. UTC | #8
On Thu, Feb 28, 2019 at 07:39:25AM -0500, Dennis Dalessandro wrote:
> On 2/27/2019 2:49 PM, Parav Pandit wrote:
> >
> >
> > > -----Original Message-----
> > > From: Leon Romanovsky <leon@kernel.org>
> > > Sent: Wednesday, February 27, 2019 1:56 AM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: bvanassche@acm.org; linux-rdma@vger.kernel.org
> > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > >
> > > On Wed, Feb 27, 2019 at 12:27:13AM -0600, Parav Pandit wrote:
> > > > This patchset adds RDMA loopback driver.
> > > > Initially for RoCE which works on lo netdevice.
> > > >
> > > > It is tested with with nvme fabrics over ext4, perftests, and rping.
> > > > It only supports RC and GSI QPs.
> > > > It supports only RoCEv2 GIDs which belongs to loopback lo netdevice.
> > > >
> > > > It is only posted for discussion [1].
> > > > It is not yet ready for RFC posting or merge.
> > >
> > > Which type of discussion do you expect?
> > Continuation of [1].
> > > And can you give brief explanation why wasn't enough to extend rxe/siw?
> > >
> > Adding lo netdev to rxe is certainly an option along with cma patch in this series.
> >
> > qp state machine is around spin locks..
> > pools doesn't use xarray that loopback uses and siw intends to use.
> >
> > Incidentally, 5.0.0.rc5 rxe crashes on registering memory. Didn't have inspiration to supply a patch.
>
> If rxe crashes we may want to fix it rather than creating a whole new
> driver.

Agree

>
> > However rxe as it stands today after several fixes from many is still not there.
> > It leaks consumer index to user space and not sure its effect of it. Jason did talk some of the security concern I don't recall.
> > A while back when I reviewed the code, saw things that might crash kernel.
>
> > Users complain of memory leaks, rnr retries dropping connections..
>
> If rxe is so broken, and there is no interest in fixing it, why do we still
> have it? Should we just excise it from the tree?

Because reality is not as bad as Parav sees it.

Parav is speaking from his experience: I forwarded to him results
of our regression runs over RXE, and those runs accumulate years
of experience and corner-case checks. Most people who are using RXE
will never hit them.

>
> > Giving low priority to most of them, I think desire to have loopback rdma device are below.
> > 1. rxe is not ready for adding IB link types and large code restructure to avoid skb processing in it. Pretty large rewrite to skip skbs.
> > 2. stability and reasonable performance
> > 3. maintainability
>
> I don't see how this is more maintainable. We are adding a new driver, a new
> user space provider. So I don't see that as being a reason for adding this.

Agree too, it is so tempting to write something new instead of fixing.

>
> > But if you think rxe is solid, siw should refactor the rxe code and start using most pieces from there, split into library for roce and iw.
> > Once that layering is done, may be loopback can fit that as different L4 so that rxe uses skb, siw uses sockets, loopback uses memcpy.
>
> This is why rxe should have used rdmavt from the beginning and we would
> pretty much have such a library.
>
> > Loopback's helper.c is intended to share code with siw for table resources as xarray.
> > It also offers complete kernel level handling of data and control path commands and published perf numbers.
>
> We can debate back and forth whether this needed to be included in siw and
> rxe, or if it and the others should have used rdmavt. However, I think this
> is different enough of an approach that it does stand on its own and could
> in fact be a new driver.
>
> The fact that rxe is broken and no one seems to want to fix it shouldn't be
> our reason though.

The thing is that many people have heard Jason complain about security
issues with RXE, but the problem is that not many have heard the full
explanation. I haven't heard it either.

Thanks

>
> -Denny
Parav Pandit Feb. 28, 2019, 2:06 p.m. UTC | #9
> -----Original Message-----
> From: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Sent: Thursday, February 28, 2019 6:39 AM
> To: Parav Pandit <parav@mellanox.com>; Leon Romanovsky
> <leon@kernel.org>
> Cc: bvanassche@acm.org; linux-rdma@vger.kernel.org
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> On 2/27/2019 2:49 PM, Parav Pandit wrote:
> >
> >
> >> -----Original Message-----
> >> From: Leon Romanovsky <leon@kernel.org>
> >> Sent: Wednesday, February 27, 2019 1:56 AM
> >> To: Parav Pandit <parav@mellanox.com>
> >> Cc: bvanassche@acm.org; linux-rdma@vger.kernel.org
> >> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> >>
> >> On Wed, Feb 27, 2019 at 12:27:13AM -0600, Parav Pandit wrote:
> >>> This patchset adds RDMA loopback driver.
> >>> Initially for RoCE which works on lo netdevice.
> >>>
> >>> It is tested with with nvme fabrics over ext4, perftests, and rping.
> >>> It only supports RC and GSI QPs.
> >>> It supports only RoCEv2 GIDs which belongs to loopback lo netdevice.
> >>>
> >>> It is only posted for discussion [1].
> >>> It is not yet ready for RFC posting or merge.
> >>
> >> Which type of discussion do you expect?
> > Continuation of [1].
> >> And can you give brief explanation why wasn't enough to extend rxe/siw?
> >>
> > Adding lo netdev to rxe is certainly an option along with cma patch in this
> series.
> >
> > qp state machine is around spin locks..
> > pools doesn't use xarray that loopback uses and siw intends to use.
> >
> > Incidentally, 5.0.0.rc5 rxe crashes on registering memory. Didn't have
> inspiration to supply a patch.
> 
> If rxe crashes we may want to fix it rather than creating a whole new driver.
> 
> > However rxe as it stands today after several fixes from many is still not
> there.
> > It leaks consumer index to user space and not sure its effect of it. Jason did
> talk some of the security concern I don't recall.
> > A while back when I reviewed the code, saw things that might crash kernel.
> 
> > Users complain of memory leaks, rnr retries dropping connections..
> 
> If rxe is so broken, and there is no interest in fixing it, why do we still have
> it? Should we just excise it from the tree?
> 
> > Giving low priority to most of them, I think desire to have loopback rdma
> device are below.
> > 1. rxe is not ready for adding IB link types and large code restructure to
> avoid skb processing in it. Pretty large rewrite to skip skbs.
> > 2. stability and reasonable performance 3. maintainability
> 
> I don't see how this is more maintainable. We are adding a new driver, a
> new user space provider. So I don't see that as being a reason for adding
> this.
A new user space provider is less complex, at the cost of system calls.
However, it reuses most of the kernel pieces present today; the user space driver is just a wrapper around ibv_cmd().
I see this approach as starting on the right foot by not writing new code but using existing infrastructure.
All three drivers (rxe, siw, loopback) would reuse a common user space driver, reuse the resource allocator, and plug in their transport callbacks.
Otherwise, siw should modify rxe instead of creating those pieces now.
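
As an illustration of how thin such a provider can be (a sketch only; the ibv_cmd_* prototype below is taken from rdma-core's driver API as I recall it and should be treated as an assumption, as should the lb_alloc_pd name), an alloc_pd verb that just forwards everything to the kernel looks roughly like:

#include <stdlib.h>
#include <infiniband/driver.h>

/* Hypothetical provider verb: no provider-private state, the
 * command/response buffers are passed through unchanged and the
 * kernel driver does all the real work. */
static struct ibv_pd *lb_alloc_pd(struct ibv_context *ctx)
{
	struct ibv_alloc_pd cmd;
	struct ib_uverbs_alloc_pd_resp resp;
	struct ibv_pd *pd;

	pd = calloc(1, sizeof(*pd));
	if (!pd)
		return NULL;

	if (ibv_cmd_alloc_pd(ctx, pd, &cmd, sizeof(cmd),
			     &resp, sizeof(resp))) {
		free(pd);
		return NULL;
	}
	return pd;
}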

> 
> > But if you think rxe is solid, siw should refactor the rxe code and start
> using most pieces from there, split into library for roce and iw.
> > Once that layering is done, may be loopback can fit that as different L4 so
> that rxe uses skb, siw uses sockets, loopback uses memcpy.
> 
> This is why rxe should have used rdmavt from the beginning and we would
> pretty much have such a library.
> 
> > Loopback's helper.c is intended to share code with siw for table resources
> as xarray.
> > It also offers complete kernel level handling of data and control path
> commands and published perf numbers.
> 
> We can debate back and forth whether this needed to be included in siw
> and rxe, or if it and the others should have used rdmavt. However, I think
> this is different enough of an approach that it does stand on its own and
> could in fact be a new driver.
> 

> The fact that rxe is broken and no one seems to want to fix it shouldn't be
> our reason though.
The same reasoning applies to siw: it should refactor the code such that a new L4 piece can fit in there.
But we are not taking that direction, and the same reasoning applies to other similar drivers too.

The main reason not to refactor rxe is that supporting an IB link without the skb layer is a major rewrite. The same refactoring is needed for siw to reuse rxe.
Because of that I forked a new driver, whose user space can be reusable across siw and loopback, along with the resource table code.

>
Parav Pandit Feb. 28, 2019, 2:24 p.m. UTC | #10
> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Thursday, February 28, 2019 7:22 AM
> To: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Parav Pandit <parav@mellanox.com>; bvanassche@acm.org; linux-
> rdma@vger.kernel.org
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> On Thu, Feb 28, 2019 at 07:39:25AM -0500, Dennis Dalessandro wrote:
> > On 2/27/2019 2:49 PM, Parav Pandit wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Leon Romanovsky <leon@kernel.org>
> > > > Sent: Wednesday, February 27, 2019 1:56 AM
> > > > To: Parav Pandit <parav@mellanox.com>
> > > > Cc: bvanassche@acm.org; linux-rdma@vger.kernel.org
> > > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > > >
> > > > On Wed, Feb 27, 2019 at 12:27:13AM -0600, Parav Pandit wrote:
> > > > > This patchset adds RDMA loopback driver.
> > > > > Initially for RoCE which works on lo netdevice.
> > > > >
> > > > > It is tested with with nvme fabrics over ext4, perftests, and rping.
> > > > > It only supports RC and GSI QPs.
> > > > > It supports only RoCEv2 GIDs which belongs to loopback lo netdevice.
> > > > >
> > > > > It is only posted for discussion [1].
> > > > > It is not yet ready for RFC posting or merge.
> > > >
> > > > Which type of discussion do you expect?
> > > Continuation of [1].
> > > > And can you give brief explanation why wasn't enough to extend
> rxe/siw?
> > > >
> > > Adding lo netdev to rxe is certainly an option along with cma patch in
> this series.
> > >
> > > qp state machine is around spin locks..
> > > pools doesn't use xarray that loopback uses and siw intends to use.
> > >
> > > Incidentally, 5.0.0.rc5 rxe crashes on registering memory. Didn't have
> inspiration to supply a patch.
> >
> > If rxe crashes we may want to fix it rather than creating a whole new
> > driver.
> 
> Agree
> 
> >
> > > However rxe as it stands today after several fixes from many is still not
> there.
> > > It leaks consumer index to user space and not sure its effect of it. Jason
> did talk some of the security concern I don't recall.
> > > A while back when I reviewed the code, saw things that might crash
> kernel.
> >
> > > Users complain of memory leaks, rnr retries dropping connections..
> >
> > If rxe is so broken, and there is no interest in fixing it, why do we
> > still have it? Should we just excise it from the tree?
> 
> Because reality is not so bad as Parav sees it.
> 
> Parav is speaking from his experience where I forwarded to him results of
> our regression runs over RXE, while those runs accumulated years of
> experience and checks of corner cases. Most people who are using RXE will
> never hit them.
>
More than that, I have user complaints about memory leaks and connection drops on a single system.
 
> >
> > > Giving low priority to most of them, I think desire to have loopback rdma
> device are below.

> > > 1. rxe is not ready for adding IB link types and large code restructure to
> avoid skb processing in it. Pretty large rewrite to skip skbs.
> > > 2. stability and reasonable performance 3. maintainability
> >
> > I don't see how this is more maintainable. We are adding a new driver,
> > a new user space provider. So I don't see that as being a reason for adding
> this.
> 
> Agree too, it is so tempting to write something new instead of fixing.
>
So let's make siw reuse or refactor rxe to fit siw's needs.

I see the loopback driver as similar to the netdev lo device, or to the block layer's null_blk driver, supporting IB and RoCE links.
In the future the loopback driver would also create these loopback devices in containers when loaded, without special user involvement, similar to netdev lo,
so that rdma gets the same level of default support as the net stack.

> >
> > > But if you think rxe is solid, siw should refactor the rxe code and start
> using most pieces from there, split into library for roce and iw.
> > > Once that layering is done, may be loopback can fit that as different L4 so
> that rxe uses skb, siw uses sockets, loopback uses memcpy.
> >
> > This is why rxe should have used rdmavt from the beginning and we
> > would pretty much have such a library.
> >
> > > Loopback's helper.c is intended to share code with siw for table
> resources as xarray.
> > > It also offers complete kernel level handling of data and control path
> commands and published perf numbers.
> >
> > We can debate back and forth whether this needed to be included in siw
> > and rxe, or if it and the others should have used rdmavt. However, I
> > think this is different enough of an approach that it does stand on
> > its own and could in fact be a new driver.
> >

> > The fact that rxe is broken and no one seems to want to fix it
> > shouldn't be our reason though.
> 
> The thing is that many people heard Jason complains about security issues
> with RXE, but the problem that not many heard full explanation about it. I
> didn't hear about it too.
> 
> Thanks
> 
> >
> > -Denny
Leon Romanovsky Feb. 28, 2019, 7:38 p.m. UTC | #11
On Thu, Feb 28, 2019 at 02:06:53PM +0000, Parav Pandit wrote:
>
>
> > -----Original Message-----
> > From: Dennis Dalessandro <dennis.dalessandro@intel.com>
> > Sent: Thursday, February 28, 2019 6:39 AM
> > To: Parav Pandit <parav@mellanox.com>; Leon Romanovsky
> > <leon@kernel.org>
> > Cc: bvanassche@acm.org; linux-rdma@vger.kernel.org
> > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> >
> > On 2/27/2019 2:49 PM, Parav Pandit wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Leon Romanovsky <leon@kernel.org>
> > >> Sent: Wednesday, February 27, 2019 1:56 AM
> > >> To: Parav Pandit <parav@mellanox.com>
> > >> Cc: bvanassche@acm.org; linux-rdma@vger.kernel.org
> > >> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > >>
> > >> On Wed, Feb 27, 2019 at 12:27:13AM -0600, Parav Pandit wrote:
> > >>> This patchset adds RDMA loopback driver.
> > >>> Initially for RoCE which works on lo netdevice.
> > >>>
> > >>> It is tested with with nvme fabrics over ext4, perftests, and rping.
> > >>> It only supports RC and GSI QPs.
> > >>> It supports only RoCEv2 GIDs which belongs to loopback lo netdevice.
> > >>>
> > >>> It is only posted for discussion [1].
> > >>> It is not yet ready for RFC posting or merge.
> > >>
> > >> Which type of discussion do you expect?
> > > Continuation of [1].
> > >> And can you give brief explanation why wasn't enough to extend rxe/siw?
> > >>
> > > Adding lo netdev to rxe is certainly an option along with cma patch in this
> > series.
> > >
> > > qp state machine is around spin locks..
> > > pools doesn't use xarray that loopback uses and siw intends to use.
> > >
> > > Incidentally, 5.0.0.rc5 rxe crashes on registering memory. Didn't have
> > inspiration to supply a patch.
> >
> > If rxe crashes we may want to fix it rather than creating a whole new driver.
> >
> > > However rxe as it stands today after several fixes from many is still not
> > there.
> > > It leaks consumer index to user space and not sure its effect of it. Jason did
> > talk some of the security concern I don't recall.
> > > A while back when I reviewed the code, saw things that might crash kernel.
> >
> > > Users complain of memory leaks, rnr retries dropping connections..
> >
> > If rxe is so broken, and there is no interest in fixing it, why do we still have
> > it? Should we just excise it from the tree?
> >
> > > Giving low priority to most of them, I think desire to have loopback rdma
> > device are below.
> > > 1. rxe is not ready for adding IB link types and large code restructure to
> > avoid skb processing in it. Pretty large rewrite to skip skbs.
> > > 2. stability and reasonable performance 3. maintainability
> >
> > I don't see how this is more maintainable. We are adding a new driver, a
> > new user space provider. So I don't see that as being a reason for adding
> > this.
> A new user space provider is less complex at cost of system calls.
> However it reuses most kernel pieces present today. User space driver is just a wrapper to ibv_cmd().
> I see this approach as start on right foot with this approach by not writing new code but use existing infra.
> And all 3 drivers (rxe, siw, loopback) reuse common user space driver, reuse resource allocator, and plugin their transport callbacks.
> Or siw should modify the rxe instead of creating those pieces now.
>
> >
> > > But if you think rxe is solid, siw should refactor the rxe code and start
> > using most pieces from there, split into library for roce and iw.
> > > Once that layering is done, may be loopback can fit that as different L4 so
> > that rxe uses skb, siw uses sockets, loopback uses memcpy.
> >
> > This is why rxe should have used rdmavt from the beginning and we would
> > pretty much have such a library.
> >
> > > Loopback's helper.c is intended to share code with siw for table resources
> > as xarray.
> > > It also offers complete kernel level handling of data and control path
> > commands and published perf numbers.
> >
> > We can debate back and forth whether this needed to be included in siw
> > and rxe, or if it and the others should have used rdmavt. However, I think
> > this is different enough of an approach that it does stand on its own and
> > could in fact be a new driver.
> >
>
> > The fact that rxe is broken and no one seems to want to fix it shouldn't be
> > our reason though.
> Same reasoning applies to siw. It should refactor out the code such that new L4 piece can be fit in there.
> But we are not taking that direction, same reasoning applies to similar other driver too.

We haven't deeply reviewed SIW yet; everything before was mostly coding-style
bikeshedding. If you think that SIW and RXE need to be changed, feel free
to share your opinion more loudly.

>
> The main reason to not refactor rxe is, its major rewrite to support IB link without skb layer. Same refactor is needed for siw to reuse rxe.
> And due to that I forked a new driver, but whose user space can be useable across siw and loopback, and also resource table code.
>
> >
Ira Weiny Feb. 28, 2019, 10:16 p.m. UTC | #12
On Thu, Feb 28, 2019 at 09:38:53PM +0200, Leon Romanovsky wrote:
> On Thu, Feb 28, 2019 at 02:06:53PM +0000, Parav Pandit wrote:
> >
> >
> > > -----Original Message-----
> > > From: Dennis Dalessandro <dennis.dalessandro@intel.com>
> > > Sent: Thursday, February 28, 2019 6:39 AM
> > > To: Parav Pandit <parav@mellanox.com>; Leon Romanovsky
> > > <leon@kernel.org>
> > > Cc: bvanassche@acm.org; linux-rdma@vger.kernel.org
> > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > >
> > > On 2/27/2019 2:49 PM, Parav Pandit wrote:
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: Leon Romanovsky <leon@kernel.org>
> > > >> Sent: Wednesday, February 27, 2019 1:56 AM
> > > >> To: Parav Pandit <parav@mellanox.com>
> > > >> Cc: bvanassche@acm.org; linux-rdma@vger.kernel.org
> > > >> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > > >>
> > > >> On Wed, Feb 27, 2019 at 12:27:13AM -0600, Parav Pandit wrote:
> > > >>> This patchset adds RDMA loopback driver.
> > > >>> Initially for RoCE which works on lo netdevice.
> > > >>>
> > > >>> It is tested with with nvme fabrics over ext4, perftests, and rping.
> > > >>> It only supports RC and GSI QPs.
> > > >>> It supports only RoCEv2 GIDs which belongs to loopback lo netdevice.
> > > >>>
> > > >>> It is only posted for discussion [1].
> > > >>> It is not yet ready for RFC posting or merge.
> > > >>
> > > >> Which type of discussion do you expect?
> > > > Continuation of [1].
> > > >> And can you give brief explanation why wasn't enough to extend rxe/siw?
> > > >>
> > > > Adding lo netdev to rxe is certainly an option along with cma patch in this
> > > series.
> > > >
> > > > qp state machine is around spin locks..
> > > > pools doesn't use xarray that loopback uses and siw intends to use.
> > > >
> > > > Incidentally, 5.0.0.rc5 rxe crashes on registering memory. Didn't have
> > > inspiration to supply a patch.
> > >
> > > If rxe crashes we may want to fix it rather than creating a whole new driver.
> > >
> > > > However rxe as it stands today after several fixes from many is still not
> > > there.
> > > > It leaks consumer index to user space and not sure its effect of it. Jason did
> > > talk some of the security concern I don't recall.
> > > > A while back when I reviewed the code, saw things that might crash kernel.
> > >
> > > > Users complain of memory leaks, rnr retries dropping connections..
> > >
> > > If rxe is so broken, and there is no interest in fixing it, why do we still have
> > > it? Should we just excise it from the tree?
> > >
> > > > Giving low priority to most of them, I think desire to have loopback rdma
> > > device are below.
> > > > 1. rxe is not ready for adding IB link types and large code restructure to
> > > avoid skb processing in it. Pretty large rewrite to skip skbs.
> > > > 2. stability and reasonable performance 3. maintainability
> > >
> > > I don't see how this is more maintainable. We are adding a new driver, a
> > > new user space provider. So I don't see that as being a reason for adding
> > > this.
> > A new user space provider is less complex at cost of system calls.
> > However it reuses most kernel pieces present today. User space driver is just a wrapper to ibv_cmd().
> > I see this approach as start on right foot with this approach by not writing new code but use existing infra.
> > And all 3 drivers (rxe, siw, loopback) reuse common user space driver, reuse resource allocator, and plugin their transport callbacks.
> > Or siw should modify the rxe instead of creating those pieces now.
> >
> > >
> > > > But if you think rxe is solid, siw should refactor the rxe code and start
> > > using most pieces from there, split into library for roce and iw.
> > > > Once that layering is done, may be loopback can fit that as different L4 so
> > > that rxe uses skb, siw uses sockets, loopback uses memcpy.
> > >
> > > This is why rxe should have used rdmavt from the beginning and we would
> > > pretty much have such a library.
> > >
> > > > Loopback's helper.c is intended to share code with siw for table resources
> > > as xarray.
> > > > It also offers complete kernel level handling of data and control path
> > > commands and published perf numbers.
> > >
> > > We can debate back and forth whether this needed to be included in siw
> > > and rxe, or if it and the others should have used rdmavt. However, I think
> > > this is different enough of an approach that it does stand on its own and
> > > could in fact be a new driver.
> > >
> >
> > > The fact that rxe is broken and no one seems to want to fix it shouldn't be
> > > our reason though.
> > Same reasoning applies to siw. It should refactor out the code such that new L4 piece can be fit in there.
> > But we are not taking that direction, same reasoning applies to similar other driver too.
> 
> We didn't deeply review SIW yet, everything before was more coding style
> bikeshedding. If you think that SIW and RXE need to be changed, feel free
> to share your opinion more loudly.

I have not really looked at SIW yet either, but it seems like there would be a
lot of similarities to rxe which would be nice to consolidate, especially at the
higher layers.  To be fair, rdmavt had a lot of special handling because of the
way the hfi1/qib hardware puts packets on the wire, neither using nor wanting
something like an skb, for example.

My gut says that SIW and rxe are going to be similar, and different from
hfi1/qib (rdmavt) so I'm not sure trying to combine them will be worth the
effort.

As to this "loopback" device I'm skeptical.  SIW and rxe have use cases to
allow for interoperability/testing.

What is the real use case for this?

Ira

> 
> >
> > The main reason to not refactor rxe is, its major rewrite to support IB link without skb layer. Same refactor is needed for siw to reuse rxe.
> > And due to that I forked a new driver, but whose user space can be useable across siw and loopback, and also resource table code.
> >
> > >
Parav Pandit March 1, 2019, 6:27 a.m. UTC | #13
Hi Ira,

> -----Original Message-----
> From: linux-rdma-owner@vger.kernel.org <linux-rdma-
> owner@vger.kernel.org> On Behalf Of Ira Weiny
> Sent: Thursday, February 28, 2019 4:16 PM
> To: Leon Romanovsky <leon@kernel.org>
> Cc: Parav Pandit <parav@mellanox.com>; Dennis Dalessandro
> <dennis.dalessandro@intel.com>; bvanassche@acm.org; linux-
> rdma@vger.kernel.org
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> On Thu, Feb 28, 2019 at 09:38:53PM +0200, Leon Romanovsky wrote:
> > On Thu, Feb 28, 2019 at 02:06:53PM +0000, Parav Pandit wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Dennis Dalessandro <dennis.dalessandro@intel.com>
> > > > Sent: Thursday, February 28, 2019 6:39 AM
> > > > To: Parav Pandit <parav@mellanox.com>; Leon Romanovsky
> > > > <leon@kernel.org>
> > > > Cc: bvanassche@acm.org; linux-rdma@vger.kernel.org
> > > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > > >
> > > > On 2/27/2019 2:49 PM, Parav Pandit wrote:
> > > > >
> > > > >
> > > > >> -----Original Message-----
> > > > >> From: Leon Romanovsky <leon@kernel.org>
> > > > >> Sent: Wednesday, February 27, 2019 1:56 AM
> > > > >> To: Parav Pandit <parav@mellanox.com>
> > > > >> Cc: bvanassche@acm.org; linux-rdma@vger.kernel.org
> > > > >> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > > > >>
> > > > >> On Wed, Feb 27, 2019 at 12:27:13AM -0600, Parav Pandit wrote:
> > > > >>> This patchset adds RDMA loopback driver.
> > > > >>> Initially for RoCE which works on lo netdevice.
> > > > >>>
> > > > >>> It is tested with with nvme fabrics over ext4, perftests, and rping.
> > > > >>> It only supports RC and GSI QPs.
> > > > >>> It supports only RoCEv2 GIDs which belongs to loopback lo
> netdevice.
> > > > >>>
> > > > >>> It is only posted for discussion [1].
> > > > >>> It is not yet ready for RFC posting or merge.
> > > > >>
> > > > >> Which type of discussion do you expect?
> > > > > Continuation of [1].
> > > > >> And can you give brief explanation why wasn't enough to extend
> rxe/siw?
> > > > >>
> > > > > Adding lo netdev to rxe is certainly an option along with cma
> > > > > patch in this
> > > > series.
> > > > >
> > > > > qp state machine is around spin locks..
> > > > > pools doesn't use xarray that loopback uses and siw intends to use.
> > > > >
> > > > > Incidentally, 5.0.0.rc5 rxe crashes on registering memory.
> > > > > Didn't have
> > > > inspiration to supply a patch.
> > > >
> > > > If rxe crashes we may want to fix it rather than creating a whole new
> driver.
> > > >
> > > > > However rxe as it stands today after several fixes from many is
> > > > > still not
> > > > there.
> > > > > It leaks consumer index to user space and not sure its effect of
> > > > > it. Jason did
> > > > talk some of the security concern I don't recall.
> > > > > A while back when I reviewed the code, saw things that might crash
> kernel.
> > > >
> > > > > Users complain of memory leaks, rnr retries dropping connections..
> > > >
> > > > If rxe is so broken, and there is no interest in fixing it, why do
> > > > we still have it? Should we just excise it from the tree?
> > > >
> > > > > Giving low priority to most of them, I think desire to have
> > > > > loopback rdma
> > > > device are below.
> > > > > 1. rxe is not ready for adding IB link types and large code
> > > > > restructure to
> > > > avoid skb processing in it. Pretty large rewrite to skip skbs.
> > > > > 2. stability and reasonable performance 3. maintainability
> > > >
> > > > I don't see how this is more maintainable. We are adding a new
> > > > driver, a new user space provider. So I don't see that as being a
> > > > reason for adding this.
> > > A new user space provider is less complex at cost of system calls.
> > > However it reuses most kernel pieces present today. User space driver is
> just a wrapper to ibv_cmd().
> > > I see this approach as start on right foot with this approach by not
> writing new code but use existing infra.
> > > And all 3 drivers (rxe, siw, loopback) reuse common user space driver,
> reuse resource allocator, and plugin their transport callbacks.
> > > Or siw should modify the rxe instead of creating those pieces now.
> > >
> > > >
> > > > > But if you think rxe is solid, siw should refactor the rxe code
> > > > > and start
> > > > using most pieces from there, split into library for roce and iw.
> > > > > Once that layering is done, may be loopback can fit that as
> > > > > different L4 so
> > > > that rxe uses skb, siw uses sockets, loopback uses memcpy.
> > > >
> > > > This is why rxe should have used rdmavt from the beginning and we
> > > > would pretty much have such a library.
> > > >
> > > > > Loopback's helper.c is intended to share code with siw for table
> > > > > resources
> > > > as xarray.
> > > > > It also offers complete kernel level handling of data and
> > > > > control path
> > > > commands and published perf numbers.
> > > >
> > > > We can debate back and forth whether this needed to be included in
> > > > siw and rxe, or if it and the others should have used rdmavt.
> > > > However, I think this is different enough of an approach that it
> > > > does stand on its own and could in fact be a new driver.
> > > >
> > >
> > > > The fact that rxe is broken and no one seems to want to fix it
> > > > shouldn't be our reason though.
> > > Same reasoning applies to siw. It should refactor out the code such that
> new L4 piece can be fit in there.
> > > But we are not taking that direction, same reasoning applies to similar
> other driver too.
> >
> > We didn't deeply review SIW yet, everything before was more coding
> > style bikeshedding. If you think that SIW and RXE need to be changed,
> > feel free to share your opinion more loudly.
> 
> I have not really looked at SIW yet either but it seems like there would be a
> lot of similarities to rxe which would be nice to consolidate especially at the
> higher layers.
Yes. The QP state machine, MR state, the uapi with rdma-core, resource id management (qp, mr, srq), etc. are all a large piece of common code.
rxe's skb handling and siw's socket handling need to plug in as transport-level callbacks in a common driver.
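
For illustration only (every name below is hypothetical, not proposed code), the layering being discussed would have roughly this shape:

#include <linux/types.h>

struct soft_qp;
struct soft_cq;
struct soft_cqe;
struct soft_send_wqe;

/* A common soft-RDMA core would own the QP/MR/CQ state machines,
 * the resource id tables and the rdma-core uapi; each driver would
 * supply only its L4 data movement through callbacks like these. */
struct soft_rdma_transport_ops {
	/* rxe: build an skb and hand it to the net stack;
	 * siw: kernel_sendmsg() on a TCP socket;
	 * loopback: memcpy() straight into the peer QP's buffers. */
	int (*xmit_wqe)(struct soft_qp *qp, struct soft_send_wqe *wqe);

	/* deliver a completion to the destination CQ */
	void (*post_cqe)(struct soft_cq *cq, const struct soft_cqe *cqe);

	/* transport-specific path MTU the core must honour */
	u32 (*get_mtu)(struct soft_qp *qp);
};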

>  To be fair rdmavt had a lot of special things because of the
> way the hfi1/qib hardware put packets on the wire _not_ using nor wanting
> something like an skb for example.
> 
> My gut says that SIW and rxe are going to be similar, and different from
> hfi1/qib (rdmavt) so I'm not sure trying to combine them will be worth the
> effort.
> 
> As to this "loopback" device I'm skeptical.  SIW and rxe have use cases to
> allow for interoperability/testing.
>
I agree that it would be nice to see rxe interoperate with a ConnectX-5 at 100 Gbps and match the line rate without dropping a connection for an hour.
I have my doubts, given my experience with rxe and user complaints about dropped connections on a single system (never mind with a ConnectX-5 on the other side).
Last week I tried to do this with 5.0.0-rc5, and memory registration in the rxe driver crashed the system.

Does anyone know if UNH has done any interoperability tests with it?

I will check with our internal QA team whether they do more than touch tests between cx5 and rxe. :-)
 
> What is the real use case for this?
The use case is fairly simple. A group of developers and users run VMs on laptops and in the cloud, without hw HCAs, for development purposes and for automated unit and Jenkins tests of their applications.
Such tests run a few hundred QPs and intend to exchange millions of messages.
These are mainly RoCE users.
rxe does not fit these needs in its current state.

Bart was looking to run on loopback (similar to another user request I received) here [1].

So I am looking at something insanely simple but extensible, which can find more uses as we go forward.
The null_blk driver started simple, but stability led to more features, which grew from 3 or 4 module parameters to configfs-based options now.

I am considering whether, if the rdma subsystem can offer such a stable 'lo netdev' or 'null block device' style rdma device (starting with RoCE, followed by an IB link), it would be more useful.
Apart from user development and application test needs, a loopback driver is useful for developing regression tests for the rdma subsystem.

Chuck's or Bart's ULP experience of how they like it would be interesting to hear.
I used rxe for developing an NVMe fabrics dissector a year back, and it was good enough for a few thousand packets.
There have been several fixes by well-wishers since.

Bart's problem may be partly solvable using patch [2] plus adding the lo device using the new netlink rxe command from Steve; I haven't tried it yet.

[1] https://marc.info/?l=linux-rdma&m=155122131404449&w=2
[2] https://patchwork.kernel.org/patch/10831261/
Dennis Dalessandro March 1, 2019, 12:28 p.m. UTC | #14
On 2/28/2019 5:16 PM, Ira Weiny wrote:
> I have not really looked at SIW yet either but it seems like there would be a
> lot of similarities to rxe which would be nice to consolidate especially at the
> higher layers.  To be fair rdmavt had a lot of special things because of the
> way the hfi1/qib hardware put packets on the wire _not_ using nor wanting
> something like an skb for example.
> 
> My gut says that SIW and rxe are going to be similar, and different from
> hfi1/qib (rdmavt) so I'm not sure trying to combine them will be worth the
> effort.

At this point, no. rdmavt is probably going to stay what it is. What
could have or should have been can be debated, but at this point I see
no reason to try to shoehorn them all together.

-Denny
Bart Van Assche March 1, 2019, 5:50 p.m. UTC | #15
On Thu, 2019-02-28 at 15:22 +0200, Leon Romanovsky wrote:
> On Thu, Feb 28, 2019 at 07:39:25AM -0500, Dennis Dalessandro wrote:
> > On 2/27/2019 2:49 PM, Parav Pandit wrote:
> > > Giving low priority to most of them, I think desire to have loopback rdma device are below.
> > > 1. rxe is not ready for adding IB link types and large code restructure to avoid skb processing in it. Pretty large rewrite to skip skbs.
> > > 2. stability and reasonable performance
> > > 3. maintainability
> > 
> > I don't see how this is more maintainable. We are adding a new driver, a new
> > user space provider. So I don't see that as being a reason for adding this.
> 
> Agree too, it is so tempting to write something new instead of fixing.

Hi Leon,

Early 2018 there was a discussion at LSF/MM about how to implement a high-
performance block driver in user space. A highly optimized RDMA loopback
driver in combination with e.g. the NVMeOF kernel initiator driver and a
NVMeOF target driver in user space could be used for this purpose. Do you
think it is possible to make loopback in the rdma_rxe driver as fast as
in Parav Pandit's driver? See also Matthew Wilcox, [LSF/MM TOPIC] A high-
performance userspace block driver, January 2018
(https://www.spinics.net/lists/linux-fsdevel/msg120674.html).

Thanks,

Bart.
Yuval Shaia March 4, 2019, 7:56 a.m. UTC | #16
On Fri, Mar 01, 2019 at 06:27:34AM +0000, Parav Pandit wrote:
>  
> > What is the real use case for this?
> Use case is fairly simple. Group of developers and users are running VMs on laptop and in cloud without hw HCAs for devel purposes and automated unit and Jenkins tests of their application.
> Such tests run for few hundreds of QPs and intent to exchange million messages.
> This is mainly RoCE users.
> rxe is not fitting these needs with its current state.

To run an RDMA device in a VM even when no real HCA is installed on the host, I can
suggest QEMU's pvrdma device, which is on its way to becoming a real product.

> 
> Bart was looking to run on loopback (similar to other user request I received) here [1].
> 
> So I am looking at something insanely simple but extendible who can find more uses as we go forward.

Suggestion: to enhance 'loopback' performance, can you consider using
shared memory or any other IPC instead of going through the network stack?

> null_blk driver started simple, but stability lead to more features, which grew for 3/4 module parameters to a configfs based options now.
> 
> I am considering if rdma subsystem can offer such stable 'lo netdev' or 'null block device' style rdma device (starting with RoCE, followed by IB link), would it be more useful.
> Apart from user devel and appl test needs, loopback driver is useful to develop regression tests for rdma-subsystem.
> 
> Chuck or Bart's ULP experience on how they like it would be interesting to listen to.
> I used rxe for developing nvme fabrics dissector and it was good enough for few thousands of packets a year back.
> There has been several fixes by well-wishers.
> 
> Bart's problem can be partly solvable, using patch [2] + adding lo device using new netlink rxe command from Steve, didn't try it yet.
> 
> [1] https://marc.info/?l=linux-rdma&m=155122131404449&w=2
> [2] https://patchwork.kernel.org/patch/10831261/
Leon Romanovsky March 4, 2019, 11 a.m. UTC | #17
On Fri, Mar 01, 2019 at 09:50:46AM -0800, Bart Van Assche wrote:
> On Thu, 2019-02-28 at 15:22 +0200, Leon Romanovsky wrote:
> > On Thu, Feb 28, 2019 at 07:39:25AM -0500, Dennis Dalessandro wrote:
> > > On 2/27/2019 2:49 PM, Parav Pandit wrote:
> > > > Giving low priority to most of them, I think desire to have loopback rdma device are below.
> > > > 1. rxe is not ready for adding IB link types and large code restructure to avoid skb processing in it. Pretty large rewrite to skip skbs.
> > > > 2. stability and reasonable performance
> > > > 3. maintainability
> > >
> > > I don't see how this is more maintainable. We are adding a new driver, a new
> > > user space provider. So I don't see that as being a reason for adding this.
> >
> > Agree too, it is so tempting to write something new instead of fixing.
>
> Hi Leon,
>
> Early 2018 there was a discussion at LSF/MM about how to implement a high-
> performance block driver in user space. A highly optimized RDMA loopback
> driver in combination with e.g. the NVMeOF kernel initiator driver and a
> NVMeOF target driver in user space could be used for this purpose. Do you
> think it is possible to make loopback in the rdma_rxe driver as fast as
> in Parav Pandit's driver? See also Matthew Wilcox, [LSF/MM TOPIC] A high-
> performance userspace block driver, January 2018
> (https://www.spinics.net/lists/linux-fsdevel/msg120674.html).

I don't see any reason to say no; once RXE and SIW detect that they
were requested to operate on loopback, they can "perform" extra
optimizations to do it extremely fast. In the case of RXE, that driver will
skip the SW ICRC calculations.
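
Roughly the kind of check I mean, as a conceptual user-space sketch only (the packet struct and the two stand-in paths below are illustrative, not actual rxe or siw code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    struct sw_pkt {
        unsigned char sgid[16];   /* source GID of the flow */
        unsigned char dgid[16];   /* destination GID of the flow */
        const void *payload;
        size_t len;
    };

    static bool pkt_is_loopback(const struct sw_pkt *pkt)
    {
        /* source GID == destination GID means the peer is the same port */
        return memcmp(pkt->sgid, pkt->dgid, sizeof(pkt->sgid)) == 0;
    }

    static int deliver_locally(const struct sw_pkt *pkt)
    {
        /* stand-in for handing the payload straight to the local recv queue */
        printf("loopback fast path: %zu bytes, no ICRC, no skb\n", pkt->len);
        return 0;
    }

    static int build_and_send(const struct sw_pkt *pkt)
    {
        /* stand-in for the normal ICRC + network-stack transmit path */
        printf("normal path: %zu bytes via the network stack\n", pkt->len);
        return 0;
    }

    static int xmit(const struct sw_pkt *pkt)
    {
        return pkt_is_loopback(pkt) ? deliver_locally(pkt) : build_and_send(pkt);
    }

    int main(void)
    {
        struct sw_pkt pkt = { .payload = "hi", .len = 2 };

        memset(pkt.sgid, 1, sizeof(pkt.sgid));
        memcpy(pkt.dgid, pkt.sgid, sizeof(pkt.dgid));  /* same GID -> loopback */
        return xmit(&pkt);
    }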

Thanks

>
> Thanks,
>
> Bart.
>
Parav Pandit March 4, 2019, 2:47 p.m. UTC | #18
> -----Original Message-----
> From: Yuval Shaia <yuval.shaia@oracle.com>
> Sent: Monday, March 4, 2019 1:56 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>;
> Dennis Dalessandro <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> linux-rdma@vger.kernel.org
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> On Fri, Mar 01, 2019 at 06:27:34AM +0000, Parav Pandit wrote:
> >
> > > What is the real use case for this?
> > Use case is fairly simple. Group of developers and users are running VMs
> on laptop and in cloud without hw HCAs for devel purposes and automated
> unit and Jenkins tests of their application.
> > Such tests run for few hundreds of QPs and intent to exchange million
> messages.
> > This is mainly RoCE users.
> > rxe is not fitting these needs with its current state.
> 
> To run RDMA device in a VM even when no real HCA is installed on host i
> can suggest QEMU's pvrdma device which is on its way to become a real
> product.
>
Can you please share the github repo or another open source link which provides the host side of the emulation code?
I am interested to know how it works on a Linux host without modifying the kernel, if it is close to the production line.
The Linux host-side RoCE code doesn't support pvrdma's needs. It requires ABI changes... different discussion.

And after doing all of that, such a VM still requires an enhanced host. This approach doesn't have any of those limitations.
 
> >
> > Bart was looking to run on loopback (similar to other user request I
> received) here [1].
> >
> > So I am looking at something insanely simple but extendible who can find
> more uses as we go forward.
> 
> Suggestion: To enhance 'loopback' performances, can you consider using
> shared memory or any other IPC instead of going thought the network stack?
> 
The loopback driver in this patchset doesn't use the network stack.
It is just 2000 lines of wrapper around memcpy() that enables applications to use rdma.
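
To make the 'wrapper around memcpy()' point concrete, here is a minimal user-space sketch of the core idea (my illustration, not an excerpt from the patches): completing a send against the peer QP's posted receive buffer is essentially a bounded copy plus a completion entry.

    #include <stdio.h>
    #include <string.h>

    struct sge {
        void *addr;
        size_t length;
    };

    /* Sender SGE in, receiver's posted buffer out; returns bytes for the CQE. */
    static size_t loopback_send(const struct sge *src, const struct sge *dst)
    {
        size_t len = src->length < dst->length ? src->length : dst->length;

        memcpy(dst->addr, src->addr, len);   /* the whole "wire" */
        return len;
    }

    int main(void)
    {
        char tx[] = "hello over rdma loopback";
        char rx[64] = { 0 };
        struct sge s = { tx, sizeof(tx) };
        struct sge d = { rx, sizeof(rx) };

        printf("copied %zu bytes: %s\n", loopback_send(&s, &d), rx);
        return 0;
    }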

> > null_blk driver started simple, but stability lead to more features, which
> grew for 3/4 module parameters to a configfs based options now.
> >
> > I am considering if rdma subsystem can offer such stable 'lo netdev' or 'null
> block device' style rdma device (starting with RoCE, followed by IB link),
> would it be more useful.
> > Apart from user devel and appl test needs, loopback driver is useful to
> develop regression tests for rdma-subsystem.
> >
> > Chuck or Bart's ULP experience on how they like it would be interesting to
> listen to.
> > I used rxe for developing nvme fabrics dissector and it was good enough
> for few thousands of packets a year back.
> > There has been several fixes by well-wishers.
> >
> > Bart's problem can be partly solvable, using patch [2] + adding lo device
> using new netlink rxe command from Steve, didn't try it yet.
> >
> > [1] https://marc.info/?l=linux-rdma&m=155122131404449&w=2
> > [2] https://patchwork.kernel.org/patch/10831261/
Bart Van Assche March 4, 2019, 4:10 p.m. UTC | #19
On Mon, 2019-03-04 at 09:56 +0200, Yuval Shaia wrote:
> Suggestion: To enhance 'loopback' performances, can you consider using
> shared memory or any other IPC instead of going thought the network stack?

I'd like to avoid having to implement yet another initiator block driver. Using
IPC implies writing a new block driver and also coming up with a new block-over-
IPC protocol. Using RDMA has the advantage that the existing NVMeOF initiator
block driver and protocol can be used.

Bart.
Yuval Shaia March 4, 2019, 4:47 p.m. UTC | #20
On Mon, Mar 04, 2019 at 08:10:05AM -0800, Bart Van Assche wrote:
> On Mon, 2019-03-04 at 09:56 +0200, Yuval Shaia wrote:
> > Suggestion: To enhance 'loopback' performances, can you consider using
> > shared memory or any other IPC instead of going thought the network stack?
> 
> I'd like to avoid having to implement yet another initiator block driver. Using
> IPC implies writing a new block driver and also coming up with a new block-over-
> IPC protocol. Using RDMA has the advantage that the existing NVMeOF initator
> block driver and protocol can be used.
> 
> Bart.

No, no, I didn't mean to implement a new driver, just that the xmit of the
packet would be done with a memcpy instead of going through the network stack. This
would make the data exchange extremely fast when the traffic is between two
entities on the same host.

Yuval
Parav Pandit March 4, 2019, 4:52 p.m. UTC | #21
> -----Original Message-----
> From: Yuval Shaia <yuval.shaia@oracle.com>
> Sent: Monday, March 4, 2019 10:48 AM
> To: Bart Van Assche <bvanassche@acm.org>
> Cc: Parav Pandit <parav@mellanox.com>; Ira Weiny <ira.weiny@intel.com>;
> Leon Romanovsky <leon@kernel.org>; Dennis Dalessandro
> <dennis.dalessandro@intel.com>; linux-rdma@vger.kernel.org
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> On Mon, Mar 04, 2019 at 08:10:05AM -0800, Bart Van Assche wrote:
> > On Mon, 2019-03-04 at 09:56 +0200, Yuval Shaia wrote:
> > > Suggestion: To enhance 'loopback' performances, can you consider
> > > using shared memory or any other IPC instead of going thought the
> network stack?
> >
> > I'd like to avoid having to implement yet another initiator block
> > driver. Using IPC implies writing a new block driver and also coming
> > up with a new block-over- IPC protocol. Using RDMA has the advantage
> > that the existing NVMeOF initator block driver and protocol can be used.
> >
> > Bart.
> 
> No, no, i didn't mean to implement new driver, just that the xmit of the
> packet would be by use of memcpy instead of going through TP stack. This
> would make the data exchange extremely fast when the traffic is between
> two entities on the same host.
> 
Can you please review the other patches in this patchset and not just the cover letter?
It does what you are describing, without the network stack.
Yuval Shaia March 4, 2019, 4:57 p.m. UTC | #22
On Mon, Mar 04, 2019 at 02:47:43PM +0000, Parav Pandit wrote:
> 
> 
> > -----Original Message-----
> > From: Yuval Shaia <yuval.shaia@oracle.com>
> > Sent: Monday, March 4, 2019 1:56 AM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>;
> > Dennis Dalessandro <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> > linux-rdma@vger.kernel.org
> > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > 
> > On Fri, Mar 01, 2019 at 06:27:34AM +0000, Parav Pandit wrote:
> > >
> > > > What is the real use case for this?
> > > Use case is fairly simple. Group of developers and users are running VMs
> > on laptop and in cloud without hw HCAs for devel purposes and automated
> > unit and Jenkins tests of their application.
> > > Such tests run for few hundreds of QPs and intent to exchange million
> > messages.
> > > This is mainly RoCE users.
> > > rxe is not fitting these needs with its current state.
> > 
> > To run RDMA device in a VM even when no real HCA is installed on host i
> > can suggest QEMU's pvrdma device which is on its way to become a real
> > product.
> >
> Can you please share the github repo or other open source link which provides the host side of emulation code?

Official QEMU repo is here:
https://github.com/qemu/qemu

> I am interested to know how does it work on Linux host without modifying kernel, if it is close to production line.

On a host with no real HCA the backend device is RXE.

> Linux host side code RoCE code doesn't support pvrdma's need. It requires ABI changes... different discussion.

No, no ABI changes are needed; pvrdma is an ibverbs client.

> 
> And after doing all of it, such VM still requires enhanced host. This approach doesn't have any of those limitations.

Can you elaborate on that? Why an enhanced host?

>  
> > >
> > > Bart was looking to run on loopback (similar to other user request I
> > received) here [1].
> > >
> > > So I am looking at something insanely simple but extendible who can find
> > more uses as we go forward.
> > 
> > Suggestion: To enhance 'loopback' performances, can you consider using
> > shared memory or any other IPC instead of going thought the network stack?
> > 
> Loopback driver in this patchset doesn't use network stack.
> It is just 2000 lines of wrapper to memcpy() to enables applications to use rdma.

gr8!!
I had plans to patch RXE to support it but didn't find the time.

So actually why not do it in RXE?

> 
> > > null_blk driver started simple, but stability lead to more features, which
> > grew for 3/4 module parameters to a configfs based options now.
> > >
> > > I am considering if rdma subsystem can offer such stable 'lo netdev' or 'null
> > block device' style rdma device (starting with RoCE, followed by IB link),
> > would it be more useful.
> > > Apart from user devel and appl test needs, loopback driver is useful to
> > develop regression tests for rdma-subsystem.
> > >
> > > Chuck or Bart's ULP experience on how they like it would be interesting to
> > listen to.
> > > I used rxe for developing nvme fabrics dissector and it was good enough
> > for few thousands of packets a year back.
> > > There has been several fixes by well-wishers.
> > >
> > > Bart's problem can be partly solvable, using patch [2] + adding lo device
> > using new netlink rxe command from Steve, didn't try it yet.
> > >
> > > [1] https://marc.info/?l=linux-rdma&m=155122131404449&w=2
> > > [2] https://patchwork.kernel.org/patch/10831261/
Parav Pandit March 4, 2019, 5:10 p.m. UTC | #23
> -----Original Message-----
> From: Yuval Shaia <yuval.shaia@oracle.com>
> Sent: Monday, March 4, 2019 10:57 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>;
> Dennis Dalessandro <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> linux-rdma@vger.kernel.org; Marcel Apfelbaum
> <marcel.apfelbaum@gmail.com>; Kamal Heib <kheib@redhat.com>
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> On Mon, Mar 04, 2019 at 02:47:43PM +0000, Parav Pandit wrote:
> >
> >
> > > -----Original Message-----
> > > From: Yuval Shaia <yuval.shaia@oracle.com>
> > > Sent: Monday, March 4, 2019 1:56 AM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky
> > > <leon@kernel.org>; Dennis Dalessandro
> > > <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> > > linux-rdma@vger.kernel.org
> > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > >
> > > On Fri, Mar 01, 2019 at 06:27:34AM +0000, Parav Pandit wrote:
> > > >
> > > > > What is the real use case for this?
> > > > Use case is fairly simple. Group of developers and users are
> > > > running VMs
> > > on laptop and in cloud without hw HCAs for devel purposes and
> > > automated unit and Jenkins tests of their application.
> > > > Such tests run for few hundreds of QPs and intent to exchange
> > > > million
> > > messages.
> > > > This is mainly RoCE users.
> > > > rxe is not fitting these needs with its current state.
> > >
> > > To run RDMA device in a VM even when no real HCA is installed on
> > > host i can suggest QEMU's pvrdma device which is on its way to
> > > become a real product.
> > >
> > Can you please share the github repo or other open source link which
> provides the host side of emulation code?
> 
> Official QEMU repo is here:
> https://github.com/qemu/qemu
> 
> > I am interested to know how does it work on Linux host without modifying
> kernel, if it is close to production line.
> 
> On a host with no real HCA the backend device is RXE.
> 
The host and ibverbs are not aware of the IP stack of the VM.

> > Linux host side code RoCE code doesn't support pvrdma's need. It requires
> ABI changes... different discussion.
> 
> No, no ABI changes is needed, pvrdma is an ibverb client.
> 
> >
> > And after doing all of it, such VM still requires enhanced host. This
> approach doesn't have any of those limitations.
> 
> Can you elaborate on that? why enhanced host?
> 
The host is not aware of the IP stack of the VM.
How do you resolve the mac address for the destination IP used in the guest VM on the host, and how do you program the right source mac and destination mac of the QP on the host using an ibverbs client?
I was asking Aviad in Mellanox to use the devx interface to have passthrough programming.
So I want to understand how you are doing this in QEMU, as it is close to production now.

> >
> > > >
> > > > Bart was looking to run on loopback (similar to other user request
> > > > I
> > > received) here [1].
> > > >
> > > > So I am looking at something insanely simple but extendible who
> > > > can find
> > > more uses as we go forward.
> > >
> > > Suggestion: To enhance 'loopback' performances, can you consider
> > > using shared memory or any other IPC instead of going thought the
> network stack?
> > >
> > Loopback driver in this patchset doesn't use network stack.
> > It is just 2000 lines of wrapper to memcpy() to enables applications to use
> rdma.
> 
> gr8!!
> I had plans to patch RXE to support it but didn't found the time.
> 
> So actually why not do it in RXE?
> 
Can you please publish fio results with nfs-rdma, nvme-fabrics, rds and perftest using the --infinite option, running it for an hour or so with rxe?
It's been a while since I did that. Last time, when I tried with 5.0.0-rc5, perftest crashed the kernel on MR registration.
Yuval Shaia March 4, 2019, 5:17 p.m. UTC | #24
On Mon, Mar 04, 2019 at 04:52:06PM +0000, Parav Pandit wrote:
> 
> 
> > -----Original Message-----
> > From: Yuval Shaia <yuval.shaia@oracle.com>
> > Sent: Monday, March 4, 2019 10:48 AM
> > To: Bart Van Assche <bvanassche@acm.org>
> > Cc: Parav Pandit <parav@mellanox.com>; Ira Weiny <ira.weiny@intel.com>;
> > Leon Romanovsky <leon@kernel.org>; Dennis Dalessandro
> > <dennis.dalessandro@intel.com>; linux-rdma@vger.kernel.org
> > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > 
> > On Mon, Mar 04, 2019 at 08:10:05AM -0800, Bart Van Assche wrote:
> > > On Mon, 2019-03-04 at 09:56 +0200, Yuval Shaia wrote:
> > > > Suggestion: To enhance 'loopback' performances, can you consider
> > > > using shared memory or any other IPC instead of going thought the
> > network stack?
> > >
> > > I'd like to avoid having to implement yet another initiator block
> > > driver. Using IPC implies writing a new block driver and also coming
> > > up with a new block-over- IPC protocol. Using RDMA has the advantage
> > > that the existing NVMeOF initator block driver and protocol can be used.
> > >
> > > Bart.
> > 
> > No, no, i didn't mean to implement new driver, just that the xmit of the
> > packet would be by use of memcpy instead of going through TP stack. This
> > would make the data exchange extremely fast when the traffic is between
> > two entities on the same host.
> > 
> Can you please review the other patches in this patchset and not just cover-letter?
> It does what you are describing without the network stack.

You are right, I should have done it; I was just waiting for an answer to
Leon's question on why rxe was not used as a base.
Yuval Shaia March 5, 2019, 10:55 a.m. UTC | #25
On Mon, Mar 04, 2019 at 05:10:35PM +0000, Parav Pandit wrote:
> 
> 
> > -----Original Message-----
> > From: Yuval Shaia <yuval.shaia@oracle.com>
> > Sent: Monday, March 4, 2019 10:57 AM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>;
> > Dennis Dalessandro <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> > linux-rdma@vger.kernel.org; Marcel Apfelbaum
> > <marcel.apfelbaum@gmail.com>; Kamal Heib <kheib@redhat.com>
> > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > 
> > On Mon, Mar 04, 2019 at 02:47:43PM +0000, Parav Pandit wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Yuval Shaia <yuval.shaia@oracle.com>
> > > > Sent: Monday, March 4, 2019 1:56 AM
> > > > To: Parav Pandit <parav@mellanox.com>
> > > > Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky
> > > > <leon@kernel.org>; Dennis Dalessandro
> > > > <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> > > > linux-rdma@vger.kernel.org
> > > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > > >
> > > > On Fri, Mar 01, 2019 at 06:27:34AM +0000, Parav Pandit wrote:
> > > > >
> > > > > > What is the real use case for this?
> > > > > Use case is fairly simple. Group of developers and users are
> > > > > running VMs
> > > > on laptop and in cloud without hw HCAs for devel purposes and
> > > > automated unit and Jenkins tests of their application.
> > > > > Such tests run for few hundreds of QPs and intent to exchange
> > > > > million
> > > > messages.
> > > > > This is mainly RoCE users.
> > > > > rxe is not fitting these needs with its current state.
> > > >
> > > > To run RDMA device in a VM even when no real HCA is installed on
> > > > host i can suggest QEMU's pvrdma device which is on its way to
> > > > become a real product.
> > > >
> > > Can you please share the github repo or other open source link which
> > provides the host side of emulation code?
> > 
> > Official QEMU repo is here:
> > https://github.com/qemu/qemu
> > 
> > > I am interested to know how does it work on Linux host without modifying
> > kernel, if it is close to production line.
> > 
> > On a host with no real HCA the backend device is RXE.
> > 
> Host and ibverb is not aware of the IP stack of the VM.
> 
> > > Linux host side code RoCE code doesn't support pvrdma's need. It requires
> > ABI changes... different discussion.
> > 
> > No, no ABI changes is needed, pvrdma is an ibverb client.
> > 
> > >
> > > And after doing all of it, such VM still requires enhanced host. This
> > approach doesn't have any of those limitations.
> > 
> > Can you elaborate on that? why enhanced host?
> > 
> Host is not aware of the IP stack of the VM.
> How do you resolve mac address for the destination IP used in guest VM in host, and how do you program right source mac and destination mac of the QP in host using ibverbs client?
> I was asking Aviad in Mellanox to use devx interface do have passthrough programming.
> So want to understand how are you doing this QEMU as its close to production now?

Not sure I fully understand your question, e.g. why the host needs to know the
guest IP.
Anyway, the flow is like this:
- In the guest, ib_core calls the driver's add_gid hook when a gid entry is
  created
- The driver in the guest passes the binding info to the qemu device (sgid and
  gid)
- The qemu device adds this gid to the host gid table (via netlink or QMP); note
  that the sgid in the host is probably different. Please also note that at this
  stage we have this gid defined twice in the fabric, but since one is in the
  guest, which is hidden, we are ok. (1098)
- When the guest creates a QP it passes the guest sgid to the qemu device, which
  replaces it with the host sgid. (#436)

Since the gid is defined in the host, all the routing is done as usual.
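
As a toy model of that sgid substitution (purely illustrative; the table and helper names below are assumptions of mine, not QEMU's actual data structures):

    #include <stdio.h>
    #include <string.h>

    struct gid_map_entry {
        unsigned char guest_gid[16];
        int host_gid_index;          /* index the host HCA actually knows */
    };

    static struct gid_map_entry gid_map[16];
    static int gid_map_used;

    /* "add_gid" step: record the binding passed down by the guest driver */
    static void map_guest_gid(const unsigned char guest_gid[16], int host_index)
    {
        memcpy(gid_map[gid_map_used].guest_gid, guest_gid, 16);
        gid_map[gid_map_used].host_gid_index = host_index;
        gid_map_used++;
    }

    /* "create QP" step: swap the guest sgid for the host sgid index */
    static int host_gid_index_for(const unsigned char guest_gid[16])
    {
        for (int i = 0; i < gid_map_used; i++)
            if (!memcmp(gid_map[i].guest_gid, guest_gid, 16))
                return gid_map[i].host_gid_index;
        return -1;
    }

    int main(void)
    {
        unsigned char gid[16] = { 0xfe, 0x80 };  /* some guest GID */

        map_guest_gid(gid, 3);                   /* host assigned index 3 */
        printf("host sgid index = %d\n", host_gid_index_for(gid));
        return 0;
    }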

Full details of the above is available here
https://github.com/qemu/qemu/commit/2b05705dc8ad80c09a3aa9cc70c14fb8323b0fd3

Hope this answers your question.

> 
> > >
> > > > >
> > > > > Bart was looking to run on loopback (similar to other user request
> > > > > I
> > > > received) here [1].
> > > > >
> > > > > So I am looking at something insanely simple but extendible who
> > > > > can find
> > > > more uses as we go forward.
> > > >
> > > > Suggestion: To enhance 'loopback' performances, can you consider
> > > > using shared memory or any other IPC instead of going thought the
> > network stack?
> > > >
> > > Loopback driver in this patchset doesn't use network stack.
> > > It is just 2000 lines of wrapper to memcpy() to enables applications to use
> > rdma.
> > 
> > gr8!!
> > I had plans to patch RXE to support it but didn't found the time.
> > 
> > So actually why not do it in RXE?
> > 
> Can you please publish fio results with nfs-rdma, nvme-fabrics, rds and perftest using --infinite option by running it for one hour or so with rxe?
> It's been a while I did that. Last time when I tried with 5.0.0-rc5, perftest crashed the kernel on MR registration.

For now the device is limited to IBV_WR_SEND and IBV_WR_RECV opcodes so
anything with IBV_WR_RDMA_* is not yet supported.

But I ran ibv_rc_pingpong (which also does reg_mr) for a lot more than an
hour with rxe as the backend and had no issues. Can you share more details?
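
For reference, such a run looks roughly like this (an illustrative invocation; the device name and gid index depend on the setup):

    # server side
    ibv_rc_pingpong -d rxe0 -g 1
    # client side
    ibv_rc_pingpong -d rxe0 -g 1 <server-address>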
Yuval Shaia March 5, 2019, 11:09 a.m. UTC | #26
> > 
> > Suggestion: To enhance 'loopback' performances, can you consider using
> > shared memory or any other IPC instead of going thought the network stack?
> > 
> Loopback driver in this patchset doesn't use network stack.
> It is just 2000 lines of wrapper to memcpy() to enables applications to use rdma.

Having a dedicated driver just for loopback will force the user to do
a smart selection, i.e. to use the lo device for local traffic and rxe for
non-local traffic.
This is a hard decision to make, as the user will not always know where the peer
is running (just consider migration).

I vote for enhancing rxe with this memcpy when it detects that the peer is on
the same host. If it is on the same device (two different gids), then we do
not even need shared memory.

I once played with the idea of modifying the virt-to-phys table, so that instead
of a memcpy we just replace the mapping of the destination address with the
physical address of the source.
But this might be going too far :)

> 
> > > null_blk driver started simple, but stability lead to more features, which
> > grew for 3/4 module parameters to a configfs based options now.
> > >
Parav Pandit March 5, 2019, 9:53 p.m. UTC | #27
> -----Original Message-----
> From: Yuval Shaia <yuval.shaia@oracle.com>
> Sent: Tuesday, March 5, 2019 5:09 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>;
> Dennis Dalessandro <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> linux-rdma@vger.kernel.org
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> > >
> > > Suggestion: To enhance 'loopback' performances, can you consider
> > > using shared memory or any other IPC instead of going thought the
> network stack?
> > >
> > Loopback driver in this patchset doesn't use network stack.
> > It is just 2000 lines of wrapper to memcpy() to enables applications to use
> rdma.
> 
> To have a dedicated driver just for the loopback will force the user to do a
> smart select, i.e. to use lo device for local traffic and rxe for non-local.
No. When an application is written using rdmacm, everything works based on the IP address.
It will pick the right rdma device that matches this IP.
It would be the 'lo' device when connections are on 127.0.0.1 (see the sketch below).
An application such as MPI will have to specify which rdma device it wants to use in the system anyway.
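
A minimal rdmacm sketch of that address-based selection (error handling trimmed; this relies only on standard librdmacm calls, with nothing loopback-specific in the application):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <rdma/rdma_cma.h>
    #include <stdio.h>

    int main(void)
    {
        struct rdma_event_channel *ch = rdma_create_event_channel();
        struct rdma_cm_id *id;
        struct sockaddr_in dst = { .sin_family = AF_INET };
        struct rdma_cm_event *ev;

        inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

        rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);
        rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000);

        rdma_get_cm_event(ch, &ev);    /* expect RDMA_CM_EVENT_ADDR_RESOLVED */
        rdma_ack_cm_event(ev);

        /* id->verbs points at the device rdmacm resolved the address to */
        printf("resolved to device: %s\n",
               id->verbs ? ibv_get_device_name(id->verbs->device) : "(none)");

        rdma_destroy_id(id);
        rdma_destroy_event_channel(ch);
        return 0;
    }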
Parav Pandit March 5, 2019, 10:10 p.m. UTC | #28
> -----Original Message-----
> From: Yuval Shaia <yuval.shaia@oracle.com>
> Sent: Tuesday, March 5, 2019 4:55 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>;
> Dennis Dalessandro <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> linux-rdma@vger.kernel.org; Marcel Apfelbaum
> <marcel.apfelbaum@gmail.com>; Kamal Heib <kheib@redhat.com>
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> On Mon, Mar 04, 2019 at 05:10:35PM +0000, Parav Pandit wrote:
> >
> >
> > > -----Original Message-----
> > > From: Yuval Shaia <yuval.shaia@oracle.com>
> > > Sent: Monday, March 4, 2019 10:57 AM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky
> > > <leon@kernel.org>; Dennis Dalessandro
> > > <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> > > linux-rdma@vger.kernel.org; Marcel Apfelbaum
> > > <marcel.apfelbaum@gmail.com>; Kamal Heib <kheib@redhat.com>
> > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > >
> > > On Mon, Mar 04, 2019 at 02:47:43PM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Yuval Shaia <yuval.shaia@oracle.com>
> > > > > Sent: Monday, March 4, 2019 1:56 AM
> > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky
> > > > > <leon@kernel.org>; Dennis Dalessandro
> > > > > <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> > > > > linux-rdma@vger.kernel.org
> > > > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > > > >
> > > > > On Fri, Mar 01, 2019 at 06:27:34AM +0000, Parav Pandit wrote:
> > > > > >
> > > > > > > What is the real use case for this?
> > > > > > Use case is fairly simple. Group of developers and users are
> > > > > > running VMs
> > > > > on laptop and in cloud without hw HCAs for devel purposes and
> > > > > automated unit and Jenkins tests of their application.
> > > > > > Such tests run for few hundreds of QPs and intent to exchange
> > > > > > million
> > > > > messages.
> > > > > > This is mainly RoCE users.
> > > > > > rxe is not fitting these needs with its current state.
> > > > >
> > > > > To run RDMA device in a VM even when no real HCA is installed on
> > > > > host i can suggest QEMU's pvrdma device which is on its way to
> > > > > become a real product.
> > > > >
> > > > Can you please share the github repo or other open source link
> > > > which
> > > provides the host side of emulation code?
> > >
> > > Official QEMU repo is here:
> > > https://github.com/qemu/qemu
> > >
> > > > I am interested to know how does it work on Linux host without
> > > > modifying
> > > kernel, if it is close to production line.
> > >
> > > On a host with no real HCA the backend device is RXE.
> > >
> > Host and ibverb is not aware of the IP stack of the VM.
> >
> > > > Linux host side code RoCE code doesn't support pvrdma's need. It
> > > > requires
> > > ABI changes... different discussion.
> > >
> > > No, no ABI changes is needed, pvrdma is an ibverb client.
> > >
> > > >
> > > > And after doing all of it, such VM still requires enhanced host.
> > > > This
> > > approach doesn't have any of those limitations.
> > >
> > > Can you elaborate on that? why enhanced host?
> > >
> > Host is not aware of the IP stack of the VM.
> > How do you resolve mac address for the destination IP used in guest VM in
> host, and how do you program right source mac and destination mac of the
> QP in host using ibverbs client?
> > I was asking Aviad in Mellanox to use devx interface do have passthrough
> programming.
> > So want to understand how are you doing this QEMU as its close to
> production now?
> 
> Not sure i fully understand your question, ex why host needs to know the
> guest IP.
> Anyway, the flow is like this:
> - In guest, ib_core calls the driver's add_gid hook when gid entry is
>   created
> - Driver in guest passes binding info to qemu device (sgid and gid)
> - qemu device adds this gid to host gid table (via netlink or QMP), note
>   that sgid in host probably is different. Please also note that at this
>   stage we have this gid defined twice in the fabric but since one is in
>   guest, which is hidden, we are ok. (1098)
> - When guest creates QP it passes the guest sgid to qemu device which
>   replace it with the host sgid. (#436)
> 
> Since gid is defined in the host the all the routing is done as usual.
> 
> Full details of the above is available here
> https://github.com/qemu/qemu/commit/2b05705dc8ad80c09a3aa9cc70c14f
> b8323b0fd3
> 
> Hope this answers your question.
> 
I took a cursory look. I am almost sure that if the guest VM's IP address is not added to the host, modify_qp() in the kernel is going to fail, unless you use a different GID in the host and build some smart way to figure out which one to use.

> >
> > > >
> > > > > >
> > > > > > Bart was looking to run on loopback (similar to other user
> > > > > > request I
> > > > > received) here [1].
> > > > > >
> > > > > > So I am looking at something insanely simple but extendible
> > > > > > who can find
> > > > > more uses as we go forward.
> > > > >
> > > > > Suggestion: To enhance 'loopback' performances, can you consider
> > > > > using shared memory or any other IPC instead of going thought
> > > > > the
> > > network stack?
> > > > >
> > > > Loopback driver in this patchset doesn't use network stack.
> > > > It is just 2000 lines of wrapper to memcpy() to enables
> > > > applications to use
> > > rdma.
> > >
> > > gr8!!
> > > I had plans to patch RXE to support it but didn't found the time.
> > >
> > > So actually why not do it in RXE?
> > >
> > Can you please publish fio results with nfs-rdma, nvme-fabrics, rds and
> perftest using --infinite option by running it for one hour or so with rxe?
> > It's been a while I did that. Last time when I tried with 5.0.0-rc5, perftest
> crashed the kernel on MR registration.
> 
> For now the device is limited to IBV_WR_SEND and IBV_WR_RECV opcodes
> so anything with IBV_WR_RDMA_* is not yet supported.
> 
The user is running Oracle VirtualBox on a Windows laptop, where a pvrdma backend is not available.
The same goes for running a VM in the cloud, where a backend pvrdma is not available.
Rdma is already hard to do, and now we need to ask users to run a VM inside a VM, with both having an up-to-date kernel...

Additionally, it doesn't even meet the users' basic criteria of running nvme fabrics, perftests and QP1...

So pvrdma + qemu is not a good starting point for this particular use case...

And loopback is a perfect driver for VM suspend/resume or migration cases, with no dependency on the host.
But I don't think anyone would care about this anyway.
Yuval Shaia March 6, 2019, 7:31 a.m. UTC | #29
On Tue, Mar 05, 2019 at 09:53:01PM +0000, Parav Pandit wrote:
> 
> 
> > -----Original Message-----
> > From: Yuval Shaia <yuval.shaia@oracle.com>
> > Sent: Tuesday, March 5, 2019 5:09 AM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>;
> > Dennis Dalessandro <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> > linux-rdma@vger.kernel.org
> > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > 
> > > >
> > > > Suggestion: To enhance 'loopback' performances, can you consider
> > > > using shared memory or any other IPC instead of going thought the
> > network stack?
> > > >
> > > Loopback driver in this patchset doesn't use network stack.
> > > It is just 2000 lines of wrapper to memcpy() to enables applications to use
> > rdma.
> > 
> > To have a dedicated driver just for the loopback will force the user to do a
> > smart select, i.e. to use lo device for local traffic and rxe for non-local.
> No. when application is written using rdmacm, everything works based on the ip address.
> It will pick the right rdma device that matches this ip.
> It would be 'lo' when connections are on 127.0.0.1.
> When application such as MPI, will have to anyway specify the which rdma device they want to use in system.

But what if one wants to stay at the verbs level and not use the rdmacm API?

See, QEMU's pvrdma device can't use rdmacm; I can explain why, but it is for
sure out of the scope of this thread. You can see it as a process, such as
ibv_rc_pingpong, where one must specify the IB device he wants to use.

Yuval
Yuval Shaia March 6, 2019, 7:42 a.m. UTC | #30
> > > >
> > > Host is not aware of the IP stack of the VM.
> > > How do you resolve mac address for the destination IP used in guest VM in
> > host, and how do you program right source mac and destination mac of the
> > QP in host using ibverbs client?
> > > I was asking Aviad in Mellanox to use devx interface do have passthrough
> > programming.
> > > So want to understand how are you doing this QEMU as its close to
> > production now?
> > 
> > Not sure i fully understand your question, ex why host needs to know the
> > guest IP.
> > Anyway, the flow is like this:
> > - In guest, ib_core calls the driver's add_gid hook when gid entry is
> >   created
> > - Driver in guest passes binding info to qemu device (sgid and gid)
> > - qemu device adds this gid to host gid table (via netlink or QMP), note
> >   that sgid in host probably is different. Please also note that at this
> >   stage we have this gid defined twice in the fabric but since one is in
> >   guest, which is hidden, we are ok. (1098)
> > - When guest creates QP it passes the guest sgid to qemu device which
> >   replace it with the host sgid. (#436)
> > 
> > Since gid is defined in the host the all the routing is done as usual.
> > 
> > Full details of the above is available here
> > https://github.com/qemu/qemu/commit/2b05705dc8ad80c09a3aa9cc70c14f
> > b8323b0fd3
> > 
> > Hope this answers your question.
> > 
> I took cursory look. I am almost sure that if guest VM's IP address is not added to host, modify_qp() in kernel is going to fail. Unless you use a different GID in host. And build some smart way to figure out which one to use.

The option to use a different GID is too complex and even not applicable, as
you have to make the fabric know how to route traffic to it.
Check the above step #3: the guest GID (and so also the IP) pops up to the
host level.

> 
> > >
Yuval Shaia March 6, 2019, 8:04 a.m. UTC | #31
> > > >
> > > Can you please publish fio results with nfs-rdma, nvme-fabrics, rds and
> > perftest using --infinite option by running it for one hour or so with rxe?
> > > It's been a while I did that. Last time when I tried with 5.0.0-rc5, perftest
> > crashed the kernel on MR registration.
> > 
> > For now the device is limited to IBV_WR_SEND and IBV_WR_RECV opcodes
> > so anything with IBV_WR_RDMA_* is not yet supported.
> > 
> User is running on Oracle virtual box on Windows laptop, where pvrdma backend is not available.
> Same goes to running VM in cloud where backend pvrdma is not available.

I'm not trying to convince you to use qemu and not Oracle VM, see my @ -:)
I'm just wondering why choose, in the first place, a hypervisor that does not
have pvrdma support; VMware and QEMU have it.

> Rdma is already hard to do and now we need to ask users to run a VM inside a VM and both have to have up to date kernel...

I agree that a nested VM is a complex solution, although I'm happily using it
as my dev and testing environment.

> 
> Additionally it doesn't even reach basic criteria of running nvme fabrics perftests, qp1 of the users...

QP1 is (partly) supported.
(partly means rdmacm MADs only)

> 
> So pvrdma + qemu is not a good starting point for this particular use case...
> 
> And loopback is perfect driver for vm suspend/resume or migration cases with no dependency on host.
> But I don't think anyone would care for this anyway.

So loopback is used inside the VM?
Parav Pandit March 6, 2019, 4:38 p.m. UTC | #32
> -----Original Message-----
> From: Yuval Shaia <yuval.shaia@oracle.com>
> Sent: Wednesday, March 6, 2019 1:31 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>;
> Dennis Dalessandro <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> linux-rdma@vger.kernel.org; Marcel Apfelbaum
> <marcel.apfelbaum@gmail.com>; Kamal Heib <kheib@redhat.com>
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> On Tue, Mar 05, 2019 at 09:53:01PM +0000, Parav Pandit wrote:
> >
> >
> > > -----Original Message-----
> > > From: Yuval Shaia <yuval.shaia@oracle.com>
> > > Sent: Tuesday, March 5, 2019 5:09 AM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky
> > > <leon@kernel.org>; Dennis Dalessandro
> > > <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> > > linux-rdma@vger.kernel.org
> > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > >
> > > > >
> > > > > Suggestion: To enhance 'loopback' performances, can you consider
> > > > > using shared memory or any other IPC instead of going thought
> > > > > the
> > > network stack?
> > > > >
> > > > Loopback driver in this patchset doesn't use network stack.
> > > > It is just 2000 lines of wrapper to memcpy() to enables
> > > > applications to use
> > > rdma.
> > >
> > > To have a dedicated driver just for the loopback will force the user
> > > to do a smart select, i.e. to use lo device for local traffic and rxe for non-
> local.
> > No. when application is written using rdmacm, everything works based on
> the ip address.
> > It will pick the right rdma device that matches this ip.
> > It would be 'lo' when connections are on 127.0.0.1.
> > When application such as MPI, will have to anyway specify the which rdma
> device they want to use in system.
> 
> But what if one wants to stay at the verb level and not use rdmacm API?
> 
Sure. He can stay at the verbs level, where he anyway has to explicitly give the device name.

> See, QEMU's pvrdma device can't use rdmacm, i can explain why but it is for
> sure out of the scope of this thread. You can see it as a process such as
> ibv_rc_pingpong that one must specify the IB device he wants to use.
> 
Yes, it is out of scope here.
Parav Pandit March 6, 2019, 4:40 p.m. UTC | #33
> -----Original Message-----
> From: Yuval Shaia <yuval.shaia@oracle.com>
> Sent: Wednesday, March 6, 2019 1:43 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>;
> Dennis Dalessandro <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> linux-rdma@vger.kernel.org; Marcel Apfelbaum
> <marcel.apfelbaum@gmail.com>; Kamal Heib <kheib@redhat.com>
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> > > > >
> > > > Host is not aware of the IP stack of the VM.
> > > > How do you resolve mac address for the destination IP used in
> > > > guest VM in
> > > host, and how do you program right source mac and destination mac of
> > > the QP in host using ibverbs client?
> > > > I was asking Aviad in Mellanox to use devx interface do have
> > > > passthrough
> > > programming.
> > > > So want to understand how are you doing this QEMU as its close to
> > > production now?
> > >
> > > Not sure i fully understand your question, ex why host needs to know
> > > the guest IP.
> > > Anyway, the flow is like this:
> > > - In guest, ib_core calls the driver's add_gid hook when gid entry is
> > >   created
> > > - Driver in guest passes binding info to qemu device (sgid and gid)
> > > - qemu device adds this gid to host gid table (via netlink or QMP), note
> > >   that sgid in host probably is different. Please also note that at this
> > >   stage we have this gid defined twice in the fabric but since one is in
> > >   guest, which is hidden, we are ok. (1098)
> > > - When guest creates QP it passes the guest sgid to qemu device which
> > >   replace it with the host sgid. (#436)
> > >
> > > Since gid is defined in the host the all the routing is done as usual.
> > >
> > > Full details of the above is available here
> > >
> https://github.com/qemu/qemu/commit/2b05705dc8ad80c09a3aa9cc70c14f
> > > b8323b0fd3
> > >
> > > Hope this answers your question.
> > >
> > I took cursory look. I am almost sure that if guest VM's IP address is not
> added to host, modify_qp() in kernel is going to fail. Unless you use a
> different GID in host. And build some smart way to figure out which one to
> use.
> 
> The option to use different GID is too complex and even not applicable as
> you have to make the fabric know to route traffic to it.
> Check the above step #3, the guest GID (and so also IP) is pops up to the
> host level.
> 
Interesting. How do you know what subnet mask to use when you add the IP address on the host?
Parav Pandit March 6, 2019, 4:45 p.m. UTC | #34
> -----Original Message-----
> From: Yuval Shaia <yuval.shaia@oracle.com>
> Sent: Wednesday, March 6, 2019 2:04 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>;
> Dennis Dalessandro <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> linux-rdma@vger.kernel.org; Marcel Apfelbaum
> <marcel.apfelbaum@gmail.com>; Kamal Heib <kheib@redhat.com>
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> > > > >
> > > > Can you please publish fio results with nfs-rdma, nvme-fabrics,
> > > > rds and
> > > perftest using --infinite option by running it for one hour or so with rxe?
> > > > It's been a while I did that. Last time when I tried with
> > > > 5.0.0-rc5, perftest
> > > crashed the kernel on MR registration.
> > >
> > > For now the device is limited to IBV_WR_SEND and IBV_WR_RECV
> opcodes
> > > so anything with IBV_WR_RDMA_* is not yet supported.
> > >
> > User is running on Oracle virtual box on Windows laptop, where pvrdma
> backend is not available.
> > Same goes to running VM in cloud where backend pvrdma is not available.
> 
> I'm not trying to convince you to use qemu and not Oracle VM, see my @ -:)
> just wondering why in first place to choose hypervisor that does not have
> pvrdma supported, VMWare and QEMU have it.
> 
I just described what the user is using in their setup.

It's rarely the case that developers on the linux-rdma mailing list define the users' IT policy/infrastructure or hypervisor tools.
The world would have been very different otherwise. :-)
Anyway, it's not our scope to define that.
Users define their requirements and setup, cloud VM or local, etc.

> > Rdma is already hard to do and now we need to ask users to run a VM
> inside a VM and both have to have up to date kernel...
> 
> I agree that nested VM is a complex solution although i'm happily using it as
> my dev and testing environment.
> 
> >
> > Additionally it doesn't even reach basic criteria of running nvme fabrics
> perftests, qp1 of the users...
> 
> QP1 is (partly) supported.
> (partly means rdmacm MADs only)
> 
> >
> > So pvrdma + qemu is not a good starting point for this particular use case...
> >
> > And loopback is perfect driver for vm suspend/resume or migration cases
> with no dependency on host.
> > But I don't think anyone would care for this anyway.
> 
> So loopback is used inside the VM?
Yes.
Yuval Shaia March 6, 2019, 8:14 p.m. UTC | #35
> > > >
> > > > > >
> > > > > > Suggestion: To enhance 'loopback' performances, can you consider
> > > > > > using shared memory or any other IPC instead of going thought
> > > > > > the
> > > > network stack?
> > > > > >
> > > > > Loopback driver in this patchset doesn't use network stack.
> > > > > It is just 2000 lines of wrapper to memcpy() to enables
> > > > > applications to use
> > > > rdma.
> > > >
> > > > To have a dedicated driver just for the loopback will force the user
> > > > to do a smart select, i.e. to use lo device for local traffic and rxe for non-
> > local.
> > > No. when application is written using rdmacm, everything works based on
> > the ip address.
> > > It will pick the right rdma device that matches this ip.
> > > It would be 'lo' when connections are on 127.0.0.1.
> > > When application such as MPI, will have to anyway specify the which rdma
> > device they want to use in system.
> > 
> > But what if one wants to stay at the verb level and not use rdmacm API?
> > 
> Sure. He can stay at verb level where he anyway have to explicitly give the device name.

And that is exactly the problem!

With qemu, the ibdev is given on the command line of the virtual machine, so
if two guests start on the same host it is ok to give them the lo device
as the backend, but what will happen when one of the VMs migrates to another
host? The traffic will break, since the lo device cannot go outside.

>
Yuval Shaia March 6, 2019, 8:17 p.m. UTC | #36
On Wed, Mar 06, 2019 at 04:40:34PM +0000, Parav Pandit wrote:
> 
> 
> > -----Original Message-----
> > From: Yuval Shaia <yuval.shaia@oracle.com>
> > Sent: Wednesday, March 6, 2019 1:43 AM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>;
> > Dennis Dalessandro <dennis.dalessandro@intel.com>; bvanassche@acm.org;
> > linux-rdma@vger.kernel.org; Marcel Apfelbaum
> > <marcel.apfelbaum@gmail.com>; Kamal Heib <kheib@redhat.com>
> > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > 
> > > > > >
> > > > > Host is not aware of the IP stack of the VM.
> > > > > How do you resolve mac address for the destination IP used in
> > > > > guest VM in
> > > > host, and how do you program right source mac and destination mac of
> > > > the QP in host using ibverbs client?
> > > > > I was asking Aviad in Mellanox to use devx interface do have
> > > > > passthrough
> > > > programming.
> > > > > So want to understand how are you doing this QEMU as its close to
> > > > production now?
> > > >
> > > > Not sure i fully understand your question, ex why host needs to know
> > > > the guest IP.
> > > > Anyway, the flow is like this:
> > > > - In guest, ib_core calls the driver's add_gid hook when gid entry is
> > > >   created
> > > > - Driver in guest passes binding info to qemu device (sgid and gid)
> > > > - qemu device adds this gid to host gid table (via netlink or QMP), note
> > > >   that sgid in host probably is different. Please also note that at this
> > > >   stage we have this gid defined twice in the fabric but since one is in
> > > >   guest, which is hidden, we are ok. (1098)
> > > > - When guest creates QP it passes the guest sgid to qemu device which
> > > >   replace it with the host sgid. (#436)
> > > >
> > > > Since gid is defined in the host the all the routing is done as usual.
> > > >
> > > > Full details of the above is available here
> > > >
> > https://github.com/qemu/qemu/commit/2b05705dc8ad80c09a3aa9cc70c14f
> > > > b8323b0fd3
> > > >
> > > > Hope this answers your question.
> > > >
> > > I took cursory look. I am almost sure that if guest VM's IP address is not
> > added to host, modify_qp() in kernel is going to fail. Unless you use a
> > different GID in host. And build some smart way to figure out which one to
> > use.
> > 
> > The option to use different GID is too complex and even not applicable as
> > you have to make the fabric know to route traffic to it.
> > Check the above step #3, the guest GID (and so also IP) is pops up to the
> > host level.
> > 
> Interesting. How do you know what subnet mask to use when you add IP address in host?

Sorry, my bad, I forgot to mention it - the subnet_prefix is also provided
by the guest driver.
Bart Van Assche March 6, 2019, 8:39 p.m. UTC | #37
On Wed, 2019-03-06 at 22:14 +0200, Yuval Shaia wrote:
> > > > > 
> > > > > > > 
> > > > > > > Suggestion: To enhance 'loopback' performances, can you consider
> > > > > > > using shared memory or any other IPC instead of going thought
> > > > > > > the
> > > > > 
> > > > > network stack?
> > > > > > > 
> > > > > > 
> > > > > > Loopback driver in this patchset doesn't use network stack.
> > > > > > It is just 2000 lines of wrapper to memcpy() to enables
> > > > > > applications to use
> > > > > 
> > > > > rdma.
> > > > > 
> > > > > To have a dedicated driver just for the loopback will force the user
> > > > > to do a smart select, i.e. to use lo device for local traffic and rxe for non-
> > > 
> > > local.
> > > > No. when application is written using rdmacm, everything works based on
> > > 
> > > the ip address.
> > > > It will pick the right rdma device that matches this ip.
> > > > It would be 'lo' when connections are on 127.0.0.1.
> > > > When application such as MPI, will have to anyway specify the which rdma
> > > 
> > > device they want to use in system.
> > > 
> > > But what if one wants to stay at the verb level and not use rdmacm API?
> > > 
> > 
> > Sure. He can stay at verb level where he anyway have to explicitly give the device name.
> 
> And that's is exactly the problem!
> 
> With qemu, the ibdev is given at the command-line of the virtual machine so
> if two guests starts on the same host it is ok to give them the lo device
> as backend but what will happen when one of the VMs will migrate to another
> host? The traffic will break since the lo device cannot go outside.

Hi Yuval,

I think what you are describing falls outside the use cases Parav has in mind. I
think that optimizing RDMA over loopback, even if that loopback only works inside
a single VM, is useful.

Bart.
Parav Pandit March 7, 2019, 1:41 a.m. UTC | #38
Hi Yuval,

> -----Original Message-----
> From: Bart Van Assche
> Sent: Wednesday, March 6, 2019 2:39 PM
> To: Yuval Shaia ; Parav Pandit
> Cc: Ira Weiny ; Leon Romanovsky ; Dennis Dalessandro ; linux-
> rdma@vger.kernel.org; Marcel Apfelbaum ; Kamal Heib
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> On Wed, 2019-03-06 at 22:14 +0200, Yuval Shaia wrote:
> > > > > >
> > > > > > > >
> > > > > > > > Suggestion: To enhance 'loopback' performances, can you
> > > > > > > > consider using shared memory or any other IPC instead of
> > > > > > > > going thought the
> > > > > >
> > > > > > network stack?
> > > > > > > >
> > > > > > >
> > > > > > > Loopback driver in this patchset doesn't use network stack.
> > > > > > > It is just 2000 lines of wrapper to memcpy() to enables
> > > > > > > applications to use
> > > > > >
> > > > > > rdma.
> > > > > >
> > > > > > To have a dedicated driver just for the loopback will force
> > > > > > the user to do a smart select, i.e. to use lo device for local
> > > > > > traffic and rxe for non-
> > > >
> > > > local.
> > > > > No. when application is written using rdmacm, everything works
> > > > > based on
> > > >
> > > > the ip address.
> > > > > It will pick the right rdma device that matches this ip.
> > > > > It would be 'lo' when connections are on 127.0.0.1.
> > > > > When application such as MPI, will have to anyway specify the
> > > > > which rdma
> > > >
> > > > device they want to use in system.
> > > >
> > > > But what if one wants to stay at the verb level and not use rdmacm
> API?
> > > >
> > >
> > > Sure. He can stay at verb level where he anyway have to explicitly give
> the device name.
> >
> > And that's is exactly the problem!
> >
> > With qemu, the ibdev is given at the command-line of the virtual
> > machine so if two guests starts on the same host it is ok to give them
> > the lo device as backend but what will happen when one of the VMs will
> > migrate to another host? The traffic will break since the lo device cannot
> go outside.
> 
> Hi Yuval,
> 
> I think what you are describing falls outside the use cases Parav has in mind.
> I think that optimizing RDMA over loopback, even if that loopback only
> works inside a single VM, is useful.
> 

The lo rdma device takes birth inside the VM, migrates as pure memory to the other host, just like any lo netdev, and dies in the VM.
There is no need to give the lo device to the guest VM from outside.
Yuval Shaia March 10, 2019, 8:52 a.m. UTC | #39
On Wed, Mar 06, 2019 at 12:39:17PM -0800, Bart Van Assche wrote:
> On Wed, 2019-03-06 at 22:14 +0200, Yuval Shaia wrote:
> > > > > > 
> > > > > > > > 
> > > > > > > > Suggestion: To enhance 'loopback' performances, can you consider
> > > > > > > > using shared memory or any other IPC instead of going thought
> > > > > > > > the
> > > > > > 
> > > > > > network stack?
> > > > > > > > 
> > > > > > > 
> > > > > > > Loopback driver in this patchset doesn't use network stack.
> > > > > > > It is just 2000 lines of wrapper to memcpy() to enables
> > > > > > > applications to use
> > > > > > 
> > > > > > rdma.
> > > > > > 
> > > > > > To have a dedicated driver just for the loopback will force the user
> > > > > > to do a smart select, i.e. to use lo device for local traffic and rxe for non-
> > > > 
> > > > local.
> > > > > No. when application is written using rdmacm, everything works based on
> > > > 
> > > > the ip address.
> > > > > It will pick the right rdma device that matches this ip.
> > > > > It would be 'lo' when connections are on 127.0.0.1.
> > > > > When application such as MPI, will have to anyway specify the which rdma
> > > > 
> > > > device they want to use in system.
> > > > 
> > > > But what if one wants to stay at the verb level and not use rdmacm API?
> > > > 
> > > 
> > > Sure. He can stay at verb level where he anyway have to explicitly give the device name.
> > 
> > And that's is exactly the problem!
> > 
> > With qemu, the ibdev is given at the command-line of the virtual machine so
> > if two guests starts on the same host it is ok to give them the lo device
> > as backend but what will happen when one of the VMs will migrate to another
> > host? The traffic will break since the lo device cannot go outside.
> 
> Hi Yuval,
> 
> I think what you are describing falls outside the use cases Parav has in mind. I
> think that optimizing RDMA over loopback, even if that loopback only works inside
> a single VM, is useful.
> 
> Bart.

Sure, no argument here; I just do not yet understand why not optimize rxe and
enjoy both worlds, i.e. why have a new sw device just for loopback.

We had this discussion in the past, recall Leon? What is the difference now?

Yuval
Yuval Shaia March 10, 2019, 8:58 a.m. UTC | #40
On Thu, Mar 07, 2019 at 01:41:10AM +0000, Parav Pandit wrote:
> Hi Yuval,
> 
> > -----Original Message-----
> > From: Bart Van Assche
> > Sent: Wednesday, March 6, 2019 2:39 PM
> > To: Yuval Shaia ; Parav Pandit
> > Cc: Ira Weiny ; Leon Romanovsky ; Dennis Dalessandro ; linux-
> > rdma@vger.kernel.org; Marcel Apfelbaum ; Kamal Heib
> > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > 
> > On Wed, 2019-03-06 at 22:14 +0200, Yuval Shaia wrote:
> > > > > > >
> > > > > > > > >
> > > > > > > > > Suggestion: To enhance 'loopback' performances, can you
> > > > > > > > > consider using shared memory or any other IPC instead of
> > > > > > > > > going thought the
> > > > > > >
> > > > > > > network stack?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Loopback driver in this patchset doesn't use network stack.
> > > > > > > > It is just 2000 lines of wrapper to memcpy() to enables
> > > > > > > > applications to use
> > > > > > >
> > > > > > > rdma.
> > > > > > >
> > > > > > > To have a dedicated driver just for the loopback will force
> > > > > > > the user to do a smart select, i.e. to use lo device for local
> > > > > > > traffic and rxe for non-
> > > > >
> > > > > local.
> > > > > > No. when application is written using rdmacm, everything works
> > > > > > based on
> > > > >
> > > > > the ip address.
> > > > > > It will pick the right rdma device that matches this ip.
> > > > > > It would be 'lo' when connections are on 127.0.0.1.
> > > > > > When application such as MPI, will have to anyway specify the
> > > > > > which rdma
> > > > >
> > > > > device they want to use in system.
> > > > >
> > > > > But what if one wants to stay at the verb level and not use rdmacm
> > API?
> > > > >
> > > >
> > > > Sure. He can stay at verb level where he anyway have to explicitly give
> > the device name.
> > >
> > > And that's is exactly the problem!
> > >
> > > With qemu, the ibdev is given at the command-line of the virtual
> > > machine so if two guests starts on the same host it is ok to give them
> > > the lo device as backend but what will happen when one of the VMs will
> > > migrate to another host? The traffic will break since the lo device cannot
> > go outside.
> > 
> > Hi Yuval,
> > 
> > I think what you are describing falls outside the use cases Parav has in mind.
> > I think that optimizing RDMA over loopback, even if that loopback only
> > works inside a single VM, is useful.
> > 
> 
> lo rdma device takes birth inside the VM, migrates as pure memory to other host, just like any lo netdev, and dies in VM.
> There is no need to give lo device from outside to the guest VM.

I already answered Bart's email, so I do not want to repeat my reply here.

I see your point; I just do not want to turn one use case into a generic one,
i.e. the same 'pure memcpy enhancement' requirement applies to a broader
scope than your use case: enhancing rxe will address both, while having yet
another sw device will cover only your use case.

Yuval
Parav Pandit March 10, 2019, 4:15 p.m. UTC | #41
> -----Original Message-----
> From: Yuval Shaia <yuval.shaia@oracle.com>
> Sent: Sunday, March 10, 2019 3:58 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Bart Van Assche <bvanassche@acm.org>; Ira Weiny
> <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>; Dennis
> Dalessandro <dennis.dalessandro@intel.com>; linux-rdma@vger.kernel.org;
> Marcel Apfelbaum <marcel.apfelbaum@gmail.com>; Kamal Heib
> <kheib@redhat.com>
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> On Thu, Mar 07, 2019 at 01:41:10AM +0000, Parav Pandit wrote:
> > Hi Yuval,
> >
> > > -----Original Message-----
> > > From: Bart Van Assche
> > > Sent: Wednesday, March 6, 2019 2:39 PM
> > > To: Yuval Shaia ; Parav Pandit
> > > Cc: Ira Weiny ; Leon Romanovsky ; Dennis Dalessandro ; linux-
> > > rdma@vger.kernel.org; Marcel Apfelbaum ; Kamal Heib
> > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > >
> > > On Wed, 2019-03-06 at 22:14 +0200, Yuval Shaia wrote:
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Suggestion: To enhance 'loopback' performances, can
> > > > > > > > > > you consider using shared memory or any other IPC
> > > > > > > > > > instead of going thought the
> > > > > > > >
> > > > > > > > network stack?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Loopback driver in this patchset doesn't use network stack.
> > > > > > > > > It is just 2000 lines of wrapper to memcpy() to enables
> > > > > > > > > applications to use
> > > > > > > >
> > > > > > > > rdma.
> > > > > > > >
> > > > > > > > To have a dedicated driver just for the loopback will
> > > > > > > > force the user to do a smart select, i.e. to use lo device
> > > > > > > > for local traffic and rxe for non-
> > > > > >
> > > > > > local.
> > > > > > > No. when application is written using rdmacm, everything
> > > > > > > works based on
> > > > > >
> > > > > > the ip address.
> > > > > > > It will pick the right rdma device that matches this ip.
> > > > > > > It would be 'lo' when connections are on 127.0.0.1.
> > > > > > > When application such as MPI, will have to anyway specify
> > > > > > > the which rdma
> > > > > >
> > > > > > device they want to use in system.
> > > > > >
> > > > > > But what if one wants to stay at the verb level and not use
> > > > > > rdmacm
> > > API?
> > > > > >
> > > > >
> > > > > Sure. He can stay at verb level where he anyway have to
> > > > > explicitly give
> > > the device name.
> > > >
> > > > And that's is exactly the problem!
> > > >
> > > > With qemu, the ibdev is given at the command-line of the virtual
> > > > machine so if two guests starts on the same host it is ok to give
> > > > them the lo device as backend but what will happen when one of the
> > > > VMs will migrate to another host? The traffic will break since the
> > > > lo device cannot
> > > go outside.
> > >
> > > Hi Yuval,
> > >
> > > I think what you are describing falls outside the use cases Parav has in
> mind.
> > > I think that optimizing RDMA over loopback, even if that loopback
> > > only works inside a single VM, is useful.
> > >
> >
> > lo rdma device takes birth inside the VM, migrates as pure memory to
> other host, just like any lo netdev, and dies in VM.
> > There is no need to give lo device from outside to the guest VM.
> 
> Already answered to Bart's email so do not want to repeat my reply here.
> 
> I see your point just do not want to turn one use case to a generic use, i.e.
> the same 'pure memcpy enhancement' requirements applies to a broader
> scope than your use case, enhancing rxe will hit them both while having yet
> another sw device will cover only your use case.
> 
I do not know how to enhance rxe to the same level of correctness and efficiency as the proposed loopback driver.
This driver uses core infrastructure for all the ibv commands, which keeps the user space and kernel handling very light.
I did some perf analysis and it shows overhead in kmalloc()/kfree(), but I want to improve that without changing the user interface, or do it through incremental enhancements.

I haven't figured out the QP0 details for IB, but I intend to support the IB link layer too.
It also requires a slightly different locking scheme, as it takes references on both QPs at the same time.

One good way to support the rxe enhancement you have in mind is to publish RFC patches showing that it is doable, efficient, performant and stable.
That would strengthen the case that rxe is the right choice.
(Hint: as a starting point, please provide a fix for the crash in memory registration in rxe. :-) )

And once you succeed in that, please also propose to Bernard that the siw driver reuse rxe's resource allocation, resource (cq, qp, mr, srq) state machines, user ABI for sq, rq and cq handling, netlink, etc., instead of creating a new driver.
This would support the claim that rxe is the right choice for any software driver.
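
As an aside, here is a minimal sketch of the two-QP locking idea mentioned above, using hypothetical names (loopback_qp, loopback_lock_qp_pair) rather than the patchset's actual code: take the two locks in a fixed order, lowest QP number first, so the A-to-B and B-to-A paths can never deadlock while both QP references are held.

/*
 * Sketch only: loopback_qp, loopback_lock_qp_pair and friends are
 * hypothetical names, not the patchset's code.  Taking the two per-QP
 * locks in a fixed order (lowest QP number first) means the A->B and
 * B->A paths can never deadlock while both QP references are held.
 */
#include <linux/spinlock.h>
#include <linux/types.h>

struct loopback_qp {
	u32 qpn;
	spinlock_t lock;
};

static void loopback_lock_qp_pair(struct loopback_qp *sqp,
				  struct loopback_qp *dqp)
{
	if (sqp == dqp) {
		spin_lock(&sqp->lock);		/* self-addressed QP */
	} else if (sqp->qpn < dqp->qpn) {
		spin_lock(&sqp->lock);
		spin_lock_nested(&dqp->lock, SINGLE_DEPTH_NESTING);
	} else {
		spin_lock(&dqp->lock);
		spin_lock_nested(&sqp->lock, SINGLE_DEPTH_NESTING);
	}
}

static void loopback_unlock_qp_pair(struct loopback_qp *sqp,
				    struct loopback_qp *dqp)
{
	if (sqp != dqp)
		spin_unlock(&dqp->lock);
	spin_unlock(&sqp->lock);
}

The SINGLE_DEPTH_NESTING annotation simply tells lockdep that acquiring a second lock of the same class here is intentional.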
Leon Romanovsky March 10, 2019, 4:27 p.m. UTC | #42
On Sun, Mar 10, 2019 at 10:52:04AM +0200, Yuval Shaia wrote:
> On Wed, Mar 06, 2019 at 12:39:17PM -0800, Bart Van Assche wrote:
> > On Wed, 2019-03-06 at 22:14 +0200, Yuval Shaia wrote:
> > > > > > >
> > > > > > > > >
> > > > > > > > > Suggestion: To enhance 'loopback' performances, can you consider
> > > > > > > > > using shared memory or any other IPC instead of going thought
> > > > > > > > > the
> > > > > > >
> > > > > > > network stack?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Loopback driver in this patchset doesn't use network stack.
> > > > > > > > It is just 2000 lines of wrapper to memcpy() to enables
> > > > > > > > applications to use
> > > > > > >
> > > > > > > rdma.
> > > > > > >
> > > > > > > To have a dedicated driver just for the loopback will force the user
> > > > > > > to do a smart select, i.e. to use lo device for local traffic and rxe for non-
> > > > >
> > > > > local.
> > > > > > No. when application is written using rdmacm, everything works based on
> > > > >
> > > > > the ip address.
> > > > > > It will pick the right rdma device that matches this ip.
> > > > > > It would be 'lo' when connections are on 127.0.0.1.
> > > > > > When application such as MPI, will have to anyway specify the which rdma
> > > > >
> > > > > device they want to use in system.
> > > > >
> > > > > But what if one wants to stay at the verb level and not use rdmacm API?
> > > > >
> > > >
> > > > Sure. He can stay at verb level where he anyway have to explicitly give the device name.
> > >
> > > And that's is exactly the problem!
> > >
> > > With qemu, the ibdev is given at the command-line of the virtual machine so
> > > if two guests starts on the same host it is ok to give them the lo device
> > > as backend but what will happen when one of the VMs will migrate to another
> > > host? The traffic will break since the lo device cannot go outside.
> >
> > Hi Yuval,
> >
> > I think what you are describing falls outside the use cases Parav has in mind. I
> > think that optimizing RDMA over loopback, even if that loopback only works inside
> > a single VM, is useful.
> >
> > Bart.
>
> Sure, no argue here, just do not yet understood why not to optimize rxe and
> enjoy both worlds? i.e. why to have new sw device just for loopback.
>
> We had this discussion in the past, recall Leon? What is the difference now?

There is no difference now. RXE should and can be fixed/enhanced.

Thanks

>
> Yuval
Leon Romanovsky March 10, 2019, 4:38 p.m. UTC | #43
On Mon, Mar 04, 2019 at 07:17:27PM +0200, Yuval Shaia wrote:
> On Mon, Mar 04, 2019 at 04:52:06PM +0000, Parav Pandit wrote:
> >
> >
> > > -----Original Message-----
> > > From: Yuval Shaia <yuval.shaia@oracle.com>
> > > Sent: Monday, March 4, 2019 10:48 AM
> > > To: Bart Van Assche <bvanassche@acm.org>
> > > Cc: Parav Pandit <parav@mellanox.com>; Ira Weiny <ira.weiny@intel.com>;
> > > Leon Romanovsky <leon@kernel.org>; Dennis Dalessandro
> > > <dennis.dalessandro@intel.com>; linux-rdma@vger.kernel.org
> > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > >
> > > On Mon, Mar 04, 2019 at 08:10:05AM -0800, Bart Van Assche wrote:
> > > > On Mon, 2019-03-04 at 09:56 +0200, Yuval Shaia wrote:
> > > > > Suggestion: To enhance 'loopback' performances, can you consider
> > > > > using shared memory or any other IPC instead of going thought the
> > > network stack?
> > > >
> > > > I'd like to avoid having to implement yet another initiator block
> > > > driver. Using IPC implies writing a new block driver and also coming
> > > > up with a new block-over- IPC protocol. Using RDMA has the advantage
> > > > that the existing NVMeOF initator block driver and protocol can be used.
> > > >
> > > > Bart.
> > >
> > > No, no, i didn't mean to implement new driver, just that the xmit of the
> > > packet would be by use of memcpy instead of going through TP stack. This
> > > would make the data exchange extremely fast when the traffic is between
> > > two entities on the same host.
> > >
> > Can you please review the other patches in this patchset and not just cover-letter?
> > It does what you are describing without the network stack.
>
> You are right, i should have do it, just was waiting for an answer to
> Leon's question on why not using rxe as a base.

Yuval,

You won't get an answer to my question, because it is much
easier and more exciting to write something new than to fix an
already existing piece of code. Luckily, the kernel community
doesn't allow new code without proof that the old code cannot
be fixed.

Thanks
Parav Pandit March 10, 2019, 4:40 p.m. UTC | #44
> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Sunday, March 10, 2019 11:38 AM
> To: Yuval Shaia <yuval.shaia@oracle.com>
> Cc: Parav Pandit <parav@mellanox.com>; Bart Van Assche
> <bvanassche@acm.org>; Ira Weiny <ira.weiny@intel.com>; Dennis
> Dalessandro <dennis.dalessandro@intel.com>; linux-rdma@vger.kernel.org
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> On Mon, Mar 04, 2019 at 07:17:27PM +0200, Yuval Shaia wrote:
> > On Mon, Mar 04, 2019 at 04:52:06PM +0000, Parav Pandit wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Yuval Shaia <yuval.shaia@oracle.com>
> > > > Sent: Monday, March 4, 2019 10:48 AM
> > > > To: Bart Van Assche <bvanassche@acm.org>
> > > > Cc: Parav Pandit <parav@mellanox.com>; Ira Weiny
> > > > <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>; Dennis
> > > > Dalessandro <dennis.dalessandro@intel.com>;
> > > > linux-rdma@vger.kernel.org
> > > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > > >
> > > > On Mon, Mar 04, 2019 at 08:10:05AM -0800, Bart Van Assche wrote:
> > > > > On Mon, 2019-03-04 at 09:56 +0200, Yuval Shaia wrote:
> > > > > > Suggestion: To enhance 'loopback' performances, can you
> > > > > > consider using shared memory or any other IPC instead of going
> > > > > > thought the
> > > > network stack?
> > > > >
> > > > > I'd like to avoid having to implement yet another initiator
> > > > > block driver. Using IPC implies writing a new block driver and
> > > > > also coming up with a new block-over- IPC protocol. Using RDMA
> > > > > has the advantage that the existing NVMeOF initator block driver and
> protocol can be used.
> > > > >
> > > > > Bart.
> > > >
> > > > No, no, i didn't mean to implement new driver, just that the xmit
> > > > of the packet would be by use of memcpy instead of going through
> > > > TP stack. This would make the data exchange extremely fast when
> > > > the traffic is between two entities on the same host.
> > > >
> > > Can you please review the other patches in this patchset and not just
> cover-letter?
> > > It does what you are describing without the network stack.
> >
> > You are right, i should have do it, just was waiting for an answer to
> > Leon's question on why not using rxe as a base.
> 
> Yuval,
> 
> You won't get an answer on my question, because it is much more easier
> and exciting to write something new instead of fixing already existing piece
> of code. Luckily enough, kernel community doesn't allow new code without
> proving that old code is not possible to fix.
>
Please convey this to Bernard too, for the siw driver.
Leon Romanovsky March 10, 2019, 4:44 p.m. UTC | #45
On Sun, Mar 10, 2019 at 04:40:50PM +0000, Parav Pandit wrote:
>
>
> > -----Original Message-----
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Sunday, March 10, 2019 11:38 AM
> > To: Yuval Shaia <yuval.shaia@oracle.com>
> > Cc: Parav Pandit <parav@mellanox.com>; Bart Van Assche
> > <bvanassche@acm.org>; Ira Weiny <ira.weiny@intel.com>; Dennis
> > Dalessandro <dennis.dalessandro@intel.com>; linux-rdma@vger.kernel.org
> > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> >
> > On Mon, Mar 04, 2019 at 07:17:27PM +0200, Yuval Shaia wrote:
> > > On Mon, Mar 04, 2019 at 04:52:06PM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Yuval Shaia <yuval.shaia@oracle.com>
> > > > > Sent: Monday, March 4, 2019 10:48 AM
> > > > > To: Bart Van Assche <bvanassche@acm.org>
> > > > > Cc: Parav Pandit <parav@mellanox.com>; Ira Weiny
> > > > > <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>; Dennis
> > > > > Dalessandro <dennis.dalessandro@intel.com>;
> > > > > linux-rdma@vger.kernel.org
> > > > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > > > >
> > > > > On Mon, Mar 04, 2019 at 08:10:05AM -0800, Bart Van Assche wrote:
> > > > > > On Mon, 2019-03-04 at 09:56 +0200, Yuval Shaia wrote:
> > > > > > > Suggestion: To enhance 'loopback' performances, can you
> > > > > > > consider using shared memory or any other IPC instead of going
> > > > > > > thought the
> > > > > network stack?
> > > > > >
> > > > > > I'd like to avoid having to implement yet another initiator
> > > > > > block driver. Using IPC implies writing a new block driver and
> > > > > > also coming up with a new block-over- IPC protocol. Using RDMA
> > > > > > has the advantage that the existing NVMeOF initator block driver and
> > protocol can be used.
> > > > > >
> > > > > > Bart.
> > > > >
> > > > > No, no, i didn't mean to implement new driver, just that the xmit
> > > > > of the packet would be by use of memcpy instead of going through
> > > > > TP stack. This would make the data exchange extremely fast when
> > > > > the traffic is between two entities on the same host.
> > > > >
> > > > Can you please review the other patches in this patchset and not just
> > cover-letter?
> > > > It does what you are describing without the network stack.
> > >
> > > You are right, i should have do it, just was waiting for an answer to
> > > Leon's question on why not using rxe as a base.
> >
> > Yuval,
> >
> > You won't get an answer on my question, because it is much more easier
> > and exciting to write something new instead of fixing already existing piece
> > of code. Luckily enough, kernel community doesn't allow new code without
> > proving that old code is not possible to fix.
> >
> Please convey this to Bernard too for siw driver.

Don't be so shy, feel free to say it to him.

Thanks
Yuval Shaia March 10, 2019, 7:23 p.m. UTC | #46
> (hint, as a starting point please provide a fix to avoid crash in memory registration in rxe:-) ).

I'm not aware of a crash in memory registration; can you describe the steps
to reproduce it?

>
Parav Pandit March 10, 2019, 7:24 p.m. UTC | #47
> -----Original Message-----
> From: Yuval Shaia <yuval.shaia@oracle.com>
> Sent: Sunday, March 10, 2019 2:23 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Bart Van Assche <bvanassche@acm.org>; Ira Weiny
> <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>; Dennis
> Dalessandro <dennis.dalessandro@intel.com>; linux-rdma@vger.kernel.org;
> Marcel Apfelbaum <marcel.apfelbaum@gmail.com>; Kamal Heib
> <kheib@redhat.com>
> Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> > (hint, as a starting point please provide a fix to avoid crash in memory
> registration in rxe:-) ).
> 
> I'm not aware of a crash in memory registration, can you describe the steps
> to reproduce?
> 
ib_send_bw -x 1 -d rxe0 -a
ib_send_bw -x 1 -d rxe0 -a <ip_address>
Parav Pandit March 10, 2019, 7:37 p.m. UTC | #48
> -----Original Message-----
> From: Parav Pandit
> Sent: Sunday, March 10, 2019 2:24 PM
> To: 'Yuval Shaia' <yuval.shaia@oracle.com>
> Cc: Bart Van Assche <bvanassche@acm.org>; Ira Weiny
> <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>; Dennis
> Dalessandro <dennis.dalessandro@intel.com>; linux-rdma@vger.kernel.org;
> Marcel Apfelbaum <marcel.apfelbaum@gmail.com>; Kamal Heib
> <kheib@redhat.com>
> Subject: RE: [EXPERIMENTAL v1 0/4] RDMA loopback device
> 
> 
> 
> > -----Original Message-----
> > From: Yuval Shaia <yuval.shaia@oracle.com>
> > Sent: Sunday, March 10, 2019 2:23 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Bart Van Assche <bvanassche@acm.org>; Ira Weiny
> > <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>; Dennis
> > Dalessandro <dennis.dalessandro@intel.com>;
> > linux-rdma@vger.kernel.org; Marcel Apfelbaum
> > <marcel.apfelbaum@gmail.com>; Kamal Heib <kheib@redhat.com>
> > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> >
> > > (hint, as a starting point please provide a fix to avoid crash in
> > > memory
> > registration in rxe:-) ).
> >
> > I'm not aware of a crash in memory registration, can you describe the
> > steps to reproduce?
> >
> ib_send_bw -x 1 -d rxe0 -a
> ib_send_bw -x 1 -d rxe0 -a <ip_address>
I did a quick run just now on 5.0.0-rc7 and it is not crashing; it used to crash for me on 5.0.0-rc5.
Seems better now.

It is running at 1.6 Gbps compared to loopback at 50 Gbps, but hey, we can ignore the ~30x performance gap. :-)

With write bw I hit a soft lockup:
kernel:watchdog: BUG: soft lockup - CPU#63 stuck for 22s! [ksoftirqd/63:328]

kernel: irq event stamp: 354570533
kernel: hardirqs last  enabled at (354570532): [<ffffffff92c23f12>] _raw_read_unlock_irqrestore+0x32/0x60
kernel: hardirqs last disabled at (354570533): [<ffffffff92403717>] trace_hardirqs_off_thunk+0x1a/0x1c
kernel: softirqs last  enabled at (20353810): [<ffffffff93000325>] __do_softirq+0x325/0x3cf
kernel: softirqs last disabled at (20353815): [<ffffffff9249eea5>] run_ksoftirqd+0x35/0x50
kernel: CPU: 32 PID: 173 Comm: ksoftirqd/32 Kdump: loaded Tainted: G             L    5.0.0-rc7-vdevbus+ #2
kernel: Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
kernel: rxe_responder+0x941/0x1ff0 [rdma_rxe]
kernel: ? __lock_acquire+0x240/0xf60
kernel: ? find_held_lock+0x31/0xa0
kernel: ? find_held_lock+0x31/0xa0
kernel: ? rxe_do_task+0x7e/0xf0 [rdma_rxe]
kernel: ? _raw_spin_unlock_irqrestore+0x32/0x51
kernel: rxe_do_task+0x85/0xf0 [rdma_rxe]
kernel: rxe_rcv+0x346/0x840 [rdma_rxe]
kernel: ? copy_data+0x113/0x240 [rdma_rxe]
kernel: rxe_requester+0x7c8/0x1060 [rdma_rxe]
kernel: rxe_do_task+0x85/0xf0 [rdma_rxe]
kernel: tasklet_action_common.isra.19+0x187/0x1a0
kernel: __do_softirq+0xd0/0x3cf
kernel: run_ksoftirqd+0x35/0x50
kernel: smpboot_thread_fn+0xfe/0x150
kernel: kthread+0xf5/0x130
kernel: ? sort_range+0x20/0x20
kernel: ? kthread_bind+0x10/0x10
kernel: ret_from_fork+0x24/0x30
kernel: rcu: INFO: rcu_sched self-detected stall on CPU
kernel: rcu:     32-....: (64452 ticks this GP) idle=586/1/0x4000000000000002 softirq=184257/184259 fqs=16251
kernel: rcu:     (t=65008 jiffies g=8870789 q=3260)
kernel: NMI backtrace for cpu 32
kernel: CPU: 32 PID: 173 Comm: ksoftirqd/32 Kdump: loaded Tainted: G             L    5.0.0-rc7-vdevbus+ #2
Bart Van Assche March 10, 2019, 10:48 p.m. UTC | #49
On 3/10/19 12:37 PM, Parav Pandit wrote:
> With write bw I hit a hit soft lockup,
> kernel:watchdog: BUG: soft lockup - CPU#63 stuck for 22s! [ksoftirqd/63:328]
> 
> kernel: irq event stamp: 354570533
> kernel: hardirqs last  enabled at (354570532): [<ffffffff92c23f12>] _raw_read_unlock_irqrestore+0x32/0x60
> kernel: hardirqs last disabled at (354570533): [<ffffffff92403717>] trace_hardirqs_off_thunk+0x1a/0x1c
> kernel: softirqs last  enabled at (20353810): [<ffffffff93000325>] __do_softirq+0x325/0x3cf
> kernel: softirqs last disabled at (20353815): [<ffffffff9249eea5>] run_ksoftirqd+0x35/0x50
> kernel: CPU: 32 PID: 173 Comm: ksoftirqd/32 Kdump: loaded Tainted: G             L    5.0.0-rc7-vdevbus+ #2
> kernel: Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> kernel: rxe_responder+0x941/0x1ff0 [rdma_rxe]
> kernel: ? __lock_acquire+0x240/0xf60
> kernel: ? find_held_lock+0x31/0xa0
> kernel: ? find_held_lock+0x31/0xa0
> kernel: ? rxe_do_task+0x7e/0xf0 [rdma_rxe]
> kernel: ? _raw_spin_unlock_irqrestore+0x32/0x51
> kernel: rxe_do_task+0x85/0xf0 [rdma_rxe]

I've also seen this lockup with v5.0 while running blktests. I have not 
yet had the time to analyze this lockup further.

Bart.
Yuval Shaia March 11, 2019, 1:45 p.m. UTC | #50
On Sun, Mar 10, 2019 at 07:37:56PM +0000, Parav Pandit wrote:
> 
> 
> > -----Original Message-----
> > From: Parav Pandit
> > Sent: Sunday, March 10, 2019 2:24 PM
> > To: 'Yuval Shaia' <yuval.shaia@oracle.com>
> > Cc: Bart Van Assche <bvanassche@acm.org>; Ira Weiny
> > <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>; Dennis
> > Dalessandro <dennis.dalessandro@intel.com>; linux-rdma@vger.kernel.org;
> > Marcel Apfelbaum <marcel.apfelbaum@gmail.com>; Kamal Heib
> > <kheib@redhat.com>
> > Subject: RE: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > 
> > 
> > 
> > > -----Original Message-----
> > > From: Yuval Shaia <yuval.shaia@oracle.com>
> > > Sent: Sunday, March 10, 2019 2:23 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: Bart Van Assche <bvanassche@acm.org>; Ira Weiny
> > > <ira.weiny@intel.com>; Leon Romanovsky <leon@kernel.org>; Dennis
> > > Dalessandro <dennis.dalessandro@intel.com>;
> > > linux-rdma@vger.kernel.org; Marcel Apfelbaum
> > > <marcel.apfelbaum@gmail.com>; Kamal Heib <kheib@redhat.com>
> > > Subject: Re: [EXPERIMENTAL v1 0/4] RDMA loopback device
> > >
> > > > (hint, as a starting point please provide a fix to avoid crash in
> > > > memory
> > > registration in rxe:-) ).
> > >
> > > I'm not aware of a crash in memory registration, can you describe the
> > > steps to reproduce?
> > >
> > ib_send_bw -x 1 -d rxe0 -a
> > ib_send_bw -x 1 -d rxe0 -a <ip_address>
> I did a quick run now on 5.0.0.-rc7, it is not crashing, which used to crash for me on 5.0.0.-rc5.
> Seems better now.
> 
> Its running at 1.6Gbps compare to loopback at 50Gbps, but hey we can ignore the 50x performance. :-)

No, we can't ignore it - this is a huge motivation to enhance RXE with memcpy!!
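
To make that concrete, here is a rough sketch, with purely hypothetical names (sw_dev, sw_qp, sw_lookup_local_dest_qp, sw_copy_to_peer) and not rxe code, of the kind of memcpy fast path this would mean: when the destination QP resolves to the same device, copy the SGEs directly and complete both sides instead of building a packet and sending it through the network stack.

/*
 * Sketch only: sw_dev, sw_qp, sw_lookup_local_dest_qp, sw_copy_to_peer
 * and sw_xmit_via_net are hypothetical; this is not rxe code.  The point
 * is the shape of the optimization: short-circuit to a direct copy when
 * the peer QP turns out to be local.
 */
#include <rdma/ib_verbs.h>

struct sw_dev;
struct sw_qp;

struct sw_qp *sw_lookup_local_dest_qp(struct sw_dev *dev, struct sw_qp *sqp);
int sw_xmit_via_net(struct sw_dev *dev, struct sw_qp *sqp,
		    const struct ib_send_wr *wr);
int sw_copy_to_peer(struct sw_qp *sqp, struct sw_qp *dqp,
		    const struct ib_send_wr *wr);

static int sw_post_one_send(struct sw_dev *dev, struct sw_qp *sqp,
			    const struct ib_send_wr *wr)
{
	struct sw_qp *dqp;

	/* For RC the destination is fixed at connect time; is it local? */
	dqp = sw_lookup_local_dest_qp(dev, sqp);
	if (!dqp)
		return sw_xmit_via_net(dev, sqp, wr);	/* normal packet path */

	/*
	 * Local fast path: copy the send SGEs straight into the peer's
	 * posted receive buffer (or the target MR for RDMA WRITE) and
	 * complete both sides, skipping header build, UDP/IP encap and
	 * the loopback netdev entirely.
	 */
	return sw_copy_to_peer(sqp, dqp, wr);
}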

> 
> With write bw I hit a hit soft lockup,
> kernel:watchdog: BUG: soft lockup - CPU#63 stuck for 22s! [ksoftirqd/63:328]
> 
> kernel: irq event stamp: 354570533
> kernel: hardirqs last  enabled at (354570532): [<ffffffff92c23f12>] _raw_read_unlock_irqrestore+0x32/0x60
> kernel: hardirqs last disabled at (354570533): [<ffffffff92403717>] trace_hardirqs_off_thunk+0x1a/0x1c
> kernel: softirqs last  enabled at (20353810): [<ffffffff93000325>] __do_softirq+0x325/0x3cf
> kernel: softirqs last disabled at (20353815): [<ffffffff9249eea5>] run_ksoftirqd+0x35/0x50
> kernel: CPU: 32 PID: 173 Comm: ksoftirqd/32 Kdump: loaded Tainted: G             L    5.0.0-rc7-vdevbus+ #2
> kernel: Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> kernel: rxe_responder+0x941/0x1ff0 [rdma_rxe]
> kernel: ? __lock_acquire+0x240/0xf60
> kernel: ? find_held_lock+0x31/0xa0
> kernel: ? find_held_lock+0x31/0xa0
> kernel: ? rxe_do_task+0x7e/0xf0 [rdma_rxe]
> kernel: ? _raw_spin_unlock_irqrestore+0x32/0x51
> kernel: rxe_do_task+0x85/0xf0 [rdma_rxe]
> kernel: rxe_rcv+0x346/0x840 [rdma_rxe]
> kernel: ? copy_data+0x113/0x240 [rdma_rxe]
> kernel: rxe_requester+0x7c8/0x1060 [rdma_rxe]
> kernel: rxe_do_task+0x85/0xf0 [rdma_rxe]
> kernel: tasklet_action_common.isra.19+0x187/0x1a0
> kernel: __do_softirq+0xd0/0x3cf
> kernel: run_ksoftirqd+0x35/0x50
> kernel: smpboot_thread_fn+0xfe/0x150
> kernel: kthread+0xf5/0x130
> kernel: ? sort_range+0x20/0x20
> kernel: ? kthread_bind+0x10/0x10
> kernel: ret_from_fork+0x24/0x30
> kernel: rcu: INFO: rcu_sched self-detected stall on CPU
> kernel: rcu: #01132-....: (64452 ticks this GP) idle=586/1/0x4000000000000002 softirq=184257/184259 fqs=16251
> kernel: rcu: #011 (t=65008 jiffies g=8870789 q=3260)
> kernel: NMI backtrace for cpu 32
> kernel: CPU: 32 PID: 173 Comm: ksoftirqd/32 Kdump: loaded Tainted: G             L    5.0.0-rc7-vdevbus+ #2

Is this the dump from rc5, or is it still happening with rc7?

>