[v5,24/24] docs: Update pvrdma device documentation

Message ID: 20181122121402.13764-25-yuval.shaia@oracle.com
State: New, archived
Series: Add support for RDMA MAD

Commit Message

Yuval Shaia Nov. 22, 2018, 12:14 p.m. UTC
The interface with the device changes with the addition of support for
MAD packets.
Adjust the documentation accordingly.

While at it, fix a minor mistake that may lead the reader to think there
is a relation between using RXE on the host and compatibility with
bare-metal peers.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 docs/pvrdma.txt | 103 +++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 84 insertions(+), 19 deletions(-)

Comments

Marcel Apfelbaum Nov. 26, 2018, 10:34 a.m. UTC | #1
Re-sending the comments; some of the recipients didn't get them.

Thanks,
Marcel

On 11/25/18 9:51 AM, Marcel Apfelbaum wrote:
>
>
> On 11/22/18 2:14 PM, Yuval Shaia wrote:
>> Interface with the device is changed with the addition of support for
>> MAD packets.
>> Adjust documentation accordingly.
>>
>> While there fix a minor mistake which may lead to think that there is a
>> relation between using RXE on host and the compatibility with bare-metal
>> peers.
>>
>> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
>> ---
>>   docs/pvrdma.txt | 103 +++++++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 84 insertions(+), 19 deletions(-)
>>
>> diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
>> index 5599318159..f82b2a69d2 100644
>> --- a/docs/pvrdma.txt
>> +++ b/docs/pvrdma.txt
>> @@ -9,8 +9,9 @@ It works with its Linux Kernel driver AS IS, no need 
>> for any special guest
>>   modifications.
>>     While it complies with the VMware device, it can also communicate 
>> with bare
>> -metal RDMA-enabled machines and does not require an RDMA HCA in the 
>> host, it
>> -can work with Soft-RoCE (rxe).
>> +metal RDMA-enabled machines as peers.
>> +
>> +It does not require an RDMA HCA in the host, it can work with 
>> Soft-RoCE (rxe).
>>     It does not require the whole guest RAM to be pinned allowing memory
>>   over-commit and, even if not implemented yet, migration support 
>> will be
>> @@ -78,29 +79,93 @@ the required RDMA libraries.
>>     3. Usage
>>   ========
>> +
>> +
>> +3.1 VM Memory settings
>> +======+++=============
>>   Currently the device is working only with memory backed RAM
>>   and it must be mark as "shared":
>>      -m 1G \
>>      -object memory-backend-ram,id=mb1,size=1G,share \
>>      -numa node,memdev=mb1 \
>>   -The pvrdma device is composed of two functions:
>> - - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
>> -   but is required to pass the ibdevice GID using its MAC.
>> -   Examples:
>> -     For an rxe backend using eth0 interface it will use its mac:
>> -       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
>> -     For an SRIOV VF, we take the Ethernet Interface exposed by it:
>> -       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
>> - - Function 1 is the actual device:
>> -       -device 
>> pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
>> -   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
>> - Note: Pay special attention that the GID at backend-gid-idx matches 
>> vmxnet's MAC.
>> - The rules of conversion are part of the RoCE spec, but since manual 
>> conversion
>> - is not required, spotting problems is not hard:
>> -    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
>> -             MAC: 7c:fe:90:cb:74:3a
>> -    Note the difference between the first byte of the MAC and the GID.
>> +
>> +3.2 MAD Multiplexer
>> +===================
>> +MAD Multiplexer is a service that exposes MAD-like interface for VMs in
>> +order to overcome the limitation where only single entity can 
>> register with
>> +MAD layer to send and receive RDMA-CM MAD packets.
>> +
>> +To build rdmacm-mux run
>> +# make rdmacm-mux
>> +
>> +The application accepts 3 command line arguments and exposes a UNIX 
>> socket
>> +to pass control and data to it.
>> +-s unix-socket-path   Path to unix socket to listen on
>> +                      (default /var/run/rdmacm-mux)
>> +-d rdma-device-name   Name of RDMA device to register with
>> +                      (default rxe0)
>
> I would not default it to rxe0, but rather require an RDMA interface to be
> specified. One might think the multiplexer selects the best available device
> and end up with an rxe instance instead of a bare-metal one...
>
>> +-p rdma-device-port   Port number of RDMA device to register with
>> +                      (default 1)
>> +The final UNIX socket file name is a concatenation of the 3 
>> arguments so
>> +for example for device mlx5_0 on port 2 this 
>> /var/run/rdmacm-mux-mlx5_0-2
>> +will be created.
>> +
>> +Please refer to contrib/rdmacm-mux for more details.
>> +
>> +
>> +3.3 PCI devices settings
>> +========================
>> +RoCE device exposes two functions - an Ethernet and RDMA.
>> +To support it, pvrdma device is composed of two PCI functions, an 
>> Ethernet
>> +device of type vmxnet3 on PCI slot 0 and a PVRDMA device on PCI slot 
>> 1. The
>> +Ethernet function can be used for other Ethernet purposes such as IP.
>
> Nice !
>
>> +
>> +
>> +3.4 Device parameters
>> +=====================
>> +- netdev: Specifies the Ethernet device on host. For Soft-RoCE (rxe) 
>> this
>> +  would be the Ethernet device used to create it. For any other 
>> physical
>> +  RoCE device this would be the netdev name of the device.
>
> I don't fully understand the above explanation. Can you elaborate
> or give an example?
>
>> +- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
>> +- mad-chardev: The name of the MAD multiplexer char device.
>> +- ibport: In case of multi-port device (such as Mellanox's HCA) this
>> +  specify the port to use. If not set 1 will be used.
>> +- dev-caps-max-mr-size: The maximum size of MR.
>> +- dev-caps-max-qp: Maximum number of QPs.
>> +- dev-caps-max-sge: Maximum number of SGE elements in WR.
>> +- dev-caps-max-cq: Maximum number of CQs.
>> +- dev-caps-max-mr: Maximum number of MRs.
>> +- dev-caps-max-pd: Maximum number of PDs.
>> +- dev-caps-max-ah: Maximum number of AHs.
>> +
>> +Notes:
>> +- The first 3 parameters are mandatory settings, the rest have their
>> +  defaults.
>> +- The last 8 parameters (the ones that prefixed by dev-caps) defines 
>> the top
>> +  limits but the final values is adjusted by the backend device 
>> limitations.
>> +
>> +3.5 Example
>> +===========
>> +Define bridge device with vmxnet3 network backend:
>> +<interface type='bridge'>
>> +  <mac address='56:b4:44:e9:62:dc'/>
>> +  <source bridge='bridge1'/>
>> +  <model type='vmxnet3'/>
>> +  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' 
>> function='0x0' multifunction='on'/>
>> +</interface>
>> +
>> +Define pvrdma device:
>> +<qemu:commandline>
>> +  <qemu:arg value='-object'/>
>> +  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
>> +  <qemu:arg value='-numa'/>
>> +  <qemu:arg value='node,memdev=mb1'/>
>> +  <qemu:arg value='-chardev'/>
>> +  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
>> +  <qemu:arg value='-device'/>
>> +  <qemu:arg 
>> value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
>> +</qemu:commandline>
>
> Please be sure to emphasize that pvrdma works only
> if QEMU is operated by libvirt. The same goes for the multiplexer.
>
> Thanks,
> Marcel
>
>
Yuval Shaia Nov. 26, 2018, 1:05 p.m. UTC | #2
On Mon, Nov 26, 2018 at 12:34:41PM +0200, Marcel Apfelbaum wrote:
> Re-sending the comments, some of the recipients didn't get it,
> 
> Thanks,
> Marcel
> 
> On 11/25/18 9:51 AM, Marcel Apfelbaum wrote:
> > 
> > 
> > On 11/22/18 2:14 PM, Yuval Shaia wrote:
> > > Interface with the device is changed with the addition of support for
> > > MAD packets.
> > > Adjust documentation accordingly.
> > > 
> > > While there fix a minor mistake which may lead to think that there is a
> > > relation between using RXE on host and the compatibility with bare-metal
> > > peers.
> > > 
> > > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > > ---
> > >   docs/pvrdma.txt | 103 +++++++++++++++++++++++++++++++++++++++---------
> > >   1 file changed, 84 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
> > > index 5599318159..f82b2a69d2 100644
> > > --- a/docs/pvrdma.txt
> > > +++ b/docs/pvrdma.txt
> > > @@ -9,8 +9,9 @@ It works with its Linux Kernel driver AS IS, no need
> > > for any special guest
> > >   modifications.
> > >     While it complies with the VMware device, it can also
> > > communicate with bare
> > > -metal RDMA-enabled machines and does not require an RDMA HCA in the
> > > host, it
> > > -can work with Soft-RoCE (rxe).
> > > +metal RDMA-enabled machines as peers.
> > > +
> > > +It does not require an RDMA HCA in the host, it can work with
> > > Soft-RoCE (rxe).
> > >     It does not require the whole guest RAM to be pinned allowing memory
> > >   over-commit and, even if not implemented yet, migration support
> > > will be
> > > @@ -78,29 +79,93 @@ the required RDMA libraries.
> > >     3. Usage
> > >   ========
> > > +
> > > +
> > > +3.1 VM Memory settings
> > > +======+++=============
> > >   Currently the device is working only with memory backed RAM
> > >   and it must be mark as "shared":
> > >      -m 1G \
> > >      -object memory-backend-ram,id=mb1,size=1G,share \
> > >      -numa node,memdev=mb1 \
> > >   -The pvrdma device is composed of two functions:
> > > - - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
> > > -   but is required to pass the ibdevice GID using its MAC.
> > > -   Examples:
> > > -     For an rxe backend using eth0 interface it will use its mac:
> > > -       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
> > > -     For an SRIOV VF, we take the Ethernet Interface exposed by it:
> > > -       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
> > > - - Function 1 is the actual device:
> > > -       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
> > > -   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
> > > - Note: Pay special attention that the GID at backend-gid-idx
> > > matches vmxnet's MAC.
> > > - The rules of conversion are part of the RoCE spec, but since
> > > manual conversion
> > > - is not required, spotting problems is not hard:
> > > -    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
> > > -             MAC: 7c:fe:90:cb:74:3a
> > > -    Note the difference between the first byte of the MAC and the GID.
> > > +
> > > +3.2 MAD Multiplexer
> > > +===================
> > > +MAD Multiplexer is a service that exposes MAD-like interface for VMs in
> > > +order to overcome the limitation where only single entity can
> > > register with
> > > +MAD layer to send and receive RDMA-CM MAD packets.
> > > +
> > > +To build rdmacm-mux run
> > > +# make rdmacm-mux
> > > +
> > > +The application accepts 3 command line arguments and exposes a UNIX
> > > socket
> > > +to pass control and data to it.
> > > +-s unix-socket-path   Path to unix socket to listen on
> > > +                      (default /var/run/rdmacm-mux)
> > > +-d rdma-device-name   Name of RDMA device to register with
> > > +                      (default rxe0)
> > 
> > I would not default it to rxe0, but rather require an RDMA interface to be
> > specified. One might think the multiplexer selects the best available device
> > and end up with an rxe instance instead of a bare-metal one...

Done.

> > 
> > > +-p rdma-device-port   Port number of RDMA device to register with
> > > +                      (default 1)
> > > +The final UNIX socket file name is a concatenation of the 3
> > > arguments so
> > > +for example for device mlx5_0 on port 2 this
> > > /var/run/rdmacm-mux-mlx5_0-2
> > > +will be created.
> > > +
> > > +Please refer to contrib/rdmacm-mux for more details.
> > > +
> > > +
> > > +3.3 PCI devices settings
> > > +========================
> > > +RoCE device exposes two functions - an Ethernet and RDMA.
> > > +To support it, pvrdma device is composed of two PCI functions, an
> > > Ethernet
> > > +device of type vmxnet3 on PCI slot 0 and a PVRDMA device on PCI
> > > slot 1. The
> > > +Ethernet function can be used for other Ethernet purposes such as IP.
> > 
> > Nice !
> > 
> > > +
> > > +
> > > +3.4 Device parameters
> > > +=====================
> > > +- netdev: Specifies the Ethernet device on host. For Soft-RoCE
> > > (rxe) this
> > > +  would be the Ethernet device used to create it. For any other
> > > physical
> > > +  RoCE device this would be the netdev name of the device.
> > 
> > I don't fully understand the above explanation. Can you elaborate
> > or give an example?

How about this:
- netdev: Specifies the Ethernet device function name on the host, for
  example enp175s0f0. For a Soft-RoCE device (rxe) this would be the
  Ethernet device used to create it.

> > 
> > > +- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
> > > +- mad-chardev: The name of the MAD multiplexer char device.
> > > +- ibport: In case of multi-port device (such as Mellanox's HCA) this
> > > +  specify the port to use. If not set 1 will be used.
> > > +- dev-caps-max-mr-size: The maximum size of MR.
> > > +- dev-caps-max-qp: Maximum number of QPs.
> > > +- dev-caps-max-sge: Maximum number of SGE elements in WR.
> > > +- dev-caps-max-cq: Maximum number of CQs.
> > > +- dev-caps-max-mr: Maximum number of MRs.
> > > +- dev-caps-max-pd: Maximum number of PDs.
> > > +- dev-caps-max-ah: Maximum number of AHs.
> > > +
> > > +Notes:
> > > +- The first 3 parameters are mandatory settings, the rest have their
> > > +  defaults.
> > > +- The last 8 parameters (the ones that prefixed by dev-caps)
> > > defines the top
> > > +  limits but the final values is adjusted by the backend device
> > > limitations.
> > > +
> > > +3.5 Example
> > > +===========
> > > +Define bridge device with vmxnet3 network backend:
> > > +<interface type='bridge'>
> > > +  <mac address='56:b4:44:e9:62:dc'/>
> > > +  <source bridge='bridge1'/>
> > > +  <model type='vmxnet3'/>
> > > +  <address type='pci' domain='0x0000' bus='0x00' slot='0x10'
> > > function='0x0' multifunction='on'/>
> > > +</interface>
> > > +
> > > +Define pvrdma device:
> > > +<qemu:commandline>
> > > +  <qemu:arg value='-object'/>
> > > +  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
> > > +  <qemu:arg value='-numa'/>
> > > +  <qemu:arg value='node,memdev=mb1'/>
> > > +  <qemu:arg value='-chardev'/>
> > > +  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
> > > +  <qemu:arg value='-device'/>
> > > +  <qemu:arg
> > > value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
> > > +</qemu:commandline>
> > 
> > Please be sure to emphasize that pvrdma works only
> > if QEMU is operated by libvirt. The same goes for the multiplexer.

Added this:

3.3 Service exposed by libvirt daemon
=====================================
The RDMA device's GID table is controlled by updating the addresses of
the device's Ethernet function.
Usually the first GID entry is determined by the MAC address, the second by
the first IPv6 address and the third by the IPv4 address. Other entries can
be added by adding more IP addresses. The reverse also holds, i.e.
whenever an address is removed, the corresponding GID entry is removed.
The process is done by the network and RDMA stacks. Whenever an address is
added, the ib_core driver is notified and calls the device driver's add_gid
function, which in turn updates the device.
To support this, the pvrdma device hooks into the create_bind and
destroy_bind HW commands triggered by the pvrdma driver in the guest.

Whenever a change is made to the pvrdma port's GID table, a special QMP
message is sent to be processed by libvirt, which updates the address of
the backend Ethernet device.

pvrdma requires the libvirt service to be up.
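
To illustrate the "first GID entry is determined by the MAC address" rule,
here is a minimal Python sketch. It assumes the MAC-derived GID is the usual
link-local address built with the modified EUI-64 rule; the function name is
made up for illustration and is not part of QEMU or libvirt. The sample
values are the MAC/GID pair from the text this patch removes.

```python
import ipaddress


def mac_to_link_local_gid(mac: str) -> str:
    """Return the fe80::/64 GID matching a MAC (modified EUI-64, assumed rule)."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 0xFF for o in octets):
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    # Flip the universal/local bit of the first octet and insert ff:fe
    # between the two halves of the MAC to form the 64-bit interface ID.
    interface_id = [octets[0] ^ 0x02, octets[1], octets[2], 0xFF, 0xFE,
                    octets[3], octets[4], octets[5]]
    # Link-local prefix fe80::/64 followed by the interface ID.
    raw = bytes([0xFE, 0x80] + [0] * 6 + interface_id)
    return ipaddress.IPv6Address(raw).exploded


print(mac_to_link_local_gid("7c:fe:90:cb:74:3a"))
# prints fe80:0000:0000:0000:7efe:90ff:fecb:743a
```

Note that the first byte differs between the MAC (7c) and the GID (7e); that
is the flipped bit, not a mismatch.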

> > 
> > Thanks,
> > Marcel
> > 
> > 
>

Patch

diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
index 5599318159..f82b2a69d2 100644
--- a/docs/pvrdma.txt
+++ b/docs/pvrdma.txt
@@ -9,8 +9,9 @@  It works with its Linux Kernel driver AS IS, no need for any special guest
 modifications.
 
 While it complies with the VMware device, it can also communicate with bare
-metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
-can work with Soft-RoCE (rxe).
+metal RDMA-enabled machines as peers.
+
+It does not require an RDMA HCA in the host, it can work with Soft-RoCE (rxe).
 
 It does not require the whole guest RAM to be pinned allowing memory
 over-commit and, even if not implemented yet, migration support will be
@@ -78,29 +79,93 @@  the required RDMA libraries.
 
 3. Usage
 ========
+
+
+3.1 VM Memory settings
+======================
 Currently the device is working only with memory backed RAM
 and it must be mark as "shared":
    -m 1G \
    -object memory-backend-ram,id=mb1,size=1G,share \
    -numa node,memdev=mb1 \
 
-The pvrdma device is composed of two functions:
- - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
-   but is required to pass the ibdevice GID using its MAC.
-   Examples:
-     For an rxe backend using eth0 interface it will use its mac:
-       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
-     For an SRIOV VF, we take the Ethernet Interface exposed by it:
-       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
- - Function 1 is the actual device:
-       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
-   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
- Note: Pay special attention that the GID at backend-gid-idx matches vmxnet's MAC.
- The rules of conversion are part of the RoCE spec, but since manual conversion
- is not required, spotting problems is not hard:
-    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
-             MAC: 7c:fe:90:cb:74:3a
-    Note the difference between the first byte of the MAC and the GID.
+
+3.2 MAD Multiplexer
+===================
+MAD Multiplexer is a service that exposes a MAD-like interface for VMs in
+order to overcome the limitation where only a single entity can register with
+the MAD layer to send and receive RDMA-CM MAD packets.
+
+To build rdmacm-mux run
+# make rdmacm-mux
+
+The application accepts 3 command line arguments and exposes a UNIX socket
+to pass control and data to it.
+-s unix-socket-path   Path to unix socket to listen on
+                      (default /var/run/rdmacm-mux)
+-d rdma-device-name   Name of RDMA device to register with
+                      (default rxe0)
+-p rdma-device-port   Port number of RDMA device to register with
+                      (default 1)
+The final UNIX socket file name is a concatenation of the 3 arguments, so
+for example for device mlx5_0 on port 2 the socket /var/run/rdmacm-mux-mlx5_0-2
+will be created.
+
+Please refer to contrib/rdmacm-mux for more details.
+
+
+3.3 PCI devices settings
+========================
+A RoCE device exposes two functions - an Ethernet function and an RDMA function.
+To support it, the pvrdma device is composed of two PCI functions, an Ethernet
+device of type vmxnet3 on PCI function 0 and a PVRDMA device on PCI function 1. The
+Ethernet function can be used for other Ethernet purposes such as IP.
+
+
+3.4 Device parameters
+=====================
+- netdev: Specifies the Ethernet device on host. For Soft-RoCE (rxe) this
+  would be the Ethernet device used to create it. For any other physical
+  RoCE device this would be the netdev name of the device.
+- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
+- mad-chardev: The name of the MAD multiplexer char device.
+- ibport: In case of a multi-port device (such as Mellanox's HCA) this
+  specifies the port to use. If not set, 1 will be used.
+- dev-caps-max-mr-size: The maximum size of MR.
+- dev-caps-max-qp: Maximum number of QPs.
+- dev-caps-max-sge: Maximum number of SGE elements in WR.
+- dev-caps-max-cq: Maximum number of CQs.
+- dev-caps-max-mr: Maximum number of MRs.
+- dev-caps-max-pd: Maximum number of PDs.
+- dev-caps-max-ah: Maximum number of AHs.
+
+Notes:
+- The first 3 parameters are mandatory settings, the rest have their
+  defaults.
+- The parameters prefixed by dev-caps define the upper limits, but the
+  final values are adjusted by the backend device limitations.
+
+3.5 Example
+===========
+Define bridge device with vmxnet3 network backend:
+<interface type='bridge'>
+  <mac address='56:b4:44:e9:62:dc'/>
+  <source bridge='bridge1'/>
+  <model type='vmxnet3'/>
+  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
+</interface>
+
+Define pvrdma device:
+<qemu:commandline>
+  <qemu:arg value='-object'/>
+  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
+  <qemu:arg value='-numa'/>
+  <qemu:arg value='node,memdev=mb1'/>
+  <qemu:arg value='-chardev'/>
+  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
+  <qemu:arg value='-device'/>
+  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
+</qemu:commandline>
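
The socket naming rule from section 3.2 and the -device string from section
3.5 can be tied together in a short Python sketch. It is only an illustration
under stated assumptions: the helper names are hypothetical, the "mandatory"
set is taken from the note in section 3.4, and nothing here is the QEMU or
rdmacm-mux implementation.

```python
# Parameters that the notes in section 3.4 describe as mandatory.
MANDATORY = ("ibdev", "netdev", "mad-chardev")


def mux_socket_path(base: str, ibdev: str, port: int) -> str:
    """Concatenate socket path, device name and port, as section 3.2 describes."""
    return f"{base}-{ibdev}-{port}"


def pvrdma_device_arg(addr: str, params: dict) -> str:
    """Build the value passed to -device from a dict of pvrdma parameters."""
    missing = [name for name in MANDATORY if not params.get(name)]
    if missing:
        raise ValueError("missing mandatory parameter(s): " + ", ".join(missing))
    # Keep the caller's ordering so the result is easy to compare by eye.
    fields = [f"addr={addr}"] + [f"{key}={value}" for key, value in params.items()]
    return "pvrdma," + ",".join(fields)


print(mux_socket_path("/var/run/rdmacm-mux", "mlx5_0", 2))
# prints /var/run/rdmacm-mux-mlx5_0-2
print(pvrdma_device_arg("10.1", {"ibdev": "rxe0", "netdev": "bridge0",
                                 "mad-chardev": "mads"}))
# prints pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads
```

Optional parameters such as ibport or the dev-caps-* limits would simply be
further entries in the dict.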