Message ID | 20220511095900.343-1-xieyongji@bytedance.com (mailing list archive) |
---|---|
State | RFC |
Series | [RFC,v2] virtio-net: Add RoCE (RDMA over Converged Ethernet) support |
On Wed, May 11, 2022 at 5:59 PM Xie Yongji <xieyongji@bytedance.com> wrote:
>
> Hi all,
>

Not very familiar with ROCE, try to give some comments from general virtio level.

> This RFC aims to introduce our recent work on enabling RoCE support
> for virtio-net device.

We need to clarify the version of ROCE, is it ROCEv2 or not?

>
> To support RoCE, three types of virtqueues including RDMA send virtqueue,
> RDMA receive virtqueue and RDMA completion virtqueue are introduced.
> And control virtqueue is reused to support the RDMA control messages.
>
> Now we support some basic RDMA semantics such as send/receive
> and read/write operation.

It would be better to explain the advantages of this over the existing pvrdma approach. I guess one advantage is that using virtio makes it easier to connect to a userspace dataplane through vDPA/vhost-user?

>
> To test with our demo:
>
> 1. Build Guest kernel [1] with config INFINIBAND_VIRTIO_RDMA
>
> 2. Build QEMU [2] with config VHOST_USER_RDMA
>
> 3. Build rdma-core [3]
>
> 4. Build and install DPDK (NOTE that we only tested on DPDK 20.11.3)
>
> 5. Build vhost-user-rdma [4]
>
> 6. Run vhost-user-rdma with command:
> $ ./vhost-user-rdma --vdev 'net_tap0' --lcore '1-3' -- -s '/tmp/vhost-rdma0'
>
> 7. Run qemu with command:
> $ qemu-system-x86_64 -chardev socket,path=/tmp/vhost-rdma0,id=vrdma \
> -device vhost-user-rdma-pci,page-per-vq,chardev=vrdma ...

It would be better to give some performance numbers (or even compare it with pvrdma).

>
> [1] https://github.com/bytedance/linux/tree/virtio-net-roce
> [2] https://github.com/bytedance/qemu/tree/vhost-user-rdma
> [3] https://github.com/YongjiXie/rdma-core/tree/virtio-rdma
> [4] https://github.com/YongjiXie/vhost-user-rdma
>
> We have already tested it with ibv_rc_pingpong, ibv_ud_pingpong and some
> others in rdma-core.
>
> TODO:
>

And we'd better consider the live migration support. Having a quick glance, it looks to me trapping the cvq is sufficient?

> 1. Add support for Base Memory Management Extensions
>
> 2. Add support for atomic operation
>
> 3. Add support for SRQ
>
> 4. Add support for virtqueue resize

Note that this is already supported by the spec via virtqueue reset.

>
> 5. Add support for enabling/disabling virtqueue at runtime

I guess virtqueue reset could help in this case.

>
> Please review, thanks!
>
> V1 to V2:
> - Rework the implementation via extending virtio-net instead of
>   introducing a new device type [Jason]
> - Add address handle support
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> Co-developed-by: Wei Junji <weijunji@bytedance.com>
> Signed-off-by: Wei Junji <weijunji@bytedance.com>
> ---
>  content.tex | 858 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 854 insertions(+), 4 deletions(-)

I wonder if there's some open-source ROCE transport device API that we can re-use then we can just behave like a transport layer instead of inventing new commands.

>
> diff --git a/content.tex b/content.tex
> index 7508dd1..646d82a 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -3008,7 +3008,10 @@ \section{Network Device}\label{sec:Device Types / Network Device}
>  placed in one virtqueue for receiving packets, and outgoing
>  packets are enqueued into another for transmission in that order.
>  A third command queue is used to control advanced filtering
> -features.
> +features. And if RoCE (RDMA over Converged Ethernet) capability
> +is enabled, the virtio network device can also support transmitting
> +and receiving RDMA message through RDMA send virtqueue, RDMA receive
> +virtqueue and RDMA completion virtqueue.
> > \subsection{Device ID}\label{sec:Device Types / Network Device / Device ID} > > @@ -3023,13 +3026,24 @@ \subsection{Virtqueues}\label{sec:Device Types / Network Device / Virtqueues} > \item[2(N-1)] receiveqN > \item[2(N-1)+1] transmitqN > \item[2N] controlq > +\item[2N+1] rdma_completeq1 > +\item[\ldots] > +\item[2N+M] rdma_completeqM > +\item[2N+M+1] rdma_transmitq1 > +\item[2N+M+2] rdma_receiveq1 > +\item[\ldots] > +\item[2N+M+2L-1] rdma_transmitqL > +\item[2N+M+2L] rdma_receiveqL > \end{description} > > N=1 if neither VIRTIO_NET_F_MQ nor VIRTIO_NET_F_RSS are negotiated, otherwise N is set by > - \field{max_virtqueue_pairs}. > + \field{max_virtqueue_pairs}. M is set by \field{max_rdma_cqs} and L is set by > + \field{max_rdma_qps}. > > controlq only exists if VIRTIO_NET_F_CTRL_VQ set. > > + rdma_completeq, rdma_transmitq and rdma_receiveq only exist if VIRTIO_NET_F_ROCE set > + > \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits} > > \begin{description} > @@ -3084,6 +3098,9 @@ \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits > \item[VIRTIO_NET_F_CTRL_MAC_ADDR(23)] Set MAC address through control > channel. > > +\item[VIRTIO_NET_F_ROCE(55)] Device supports RoCE (RDMA over Converged Ethernet) > + capability. > + > \item[VIRTIO_NET_F_HOST_USO (56)] Device can receive USO packets. Unlike UFO > (fragmenting the packet) the USO splits large UDP packet > to several segments when each of these smaller packets has UDP header. > @@ -3129,6 +3146,7 @@ \subsubsection{Feature bit requirements}\label{sec:Device Types / Network Device > \item[VIRTIO_NET_F_GUEST_ANNOUNCE] Requires VIRTIO_NET_F_CTRL_VQ. > \item[VIRTIO_NET_F_MQ] Requires VIRTIO_NET_F_CTRL_VQ. > \item[VIRTIO_NET_F_CTRL_MAC_ADDR] Requires VIRTIO_NET_F_CTRL_VQ. > +\item[VIRTIO_NET_F_ROCE] Requires VIRTIO_NET_F_CTRL_VQ. > \item[VIRTIO_NET_F_RSC_EXT] Requires VIRTIO_NET_F_HOST_TSO4 or VIRTIO_NET_F_HOST_TSO6. 
>  \item[VIRTIO_NET_F_RSS] Requires VIRTIO_NET_F_CTRL_VQ.
>  \end{description}
> @@ -3190,6 +3208,8 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device
>          u8 rss_max_key_size;
>          le16 rss_max_indirection_table_length;
>          le32 supported_hash_types;
> +        le32 max_rdma_qps;
> +        le32 max_rdma_cps;
>  };
>  \end{lstlisting}
>  The following field, \field{rss_max_key_size} only exists if VIRTIO_NET_F_RSS or VIRTIO_NET_F_HASH_REPORT is set.
> @@ -3204,11 +3224,23 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device
>  Field \field{supported_hash_types} contains the bitmask of supported hash types.
>  See \ref{sec:Device Types / Network Device / Device Operation / Processing of Incoming Packets / Hash calculation for incoming packets / Supported/enabled hash types} for details of supported hash types.
>
> +Field \field{max_rdma_qps} only exists if VIRTIO_NET_F_ROCE is set.
> +It specifies the maximum number of queue pairs (send virtqueue and receive virtqueue) for RoCE usage.
> +
> +Field \field{max_rdma_cqs} only exists if VIRTIO_NET_F_ROCE is set.
> +It specifies the maximum number of completion virtqueue for RoCE usage.
> +
>  \devicenormative{\subsubsection}{Device configuration layout}{Device Types / Network Device / Device configuration layout}
>
>  The device MUST set \field{max_virtqueue_pairs} to between 1 and 0x8000 inclusive,
>  if it offers VIRTIO_NET_F_MQ.
>
> +The device MUST set \field{max_rdma_qps} to between 1 an 16384 inclusive,
> +if it offers VIRTIO_NET_F_ROCE.

I wonder why 16384 is chosen here?

> +
> +The device MUST set \field{max_rdma_cqs} to between 1 an 16384 inclusive,
> +if it offers VIRTIO_NET_F_ROCE.
> +
>  The device MUST set \field{mtu} to between 68 and 65535 inclusive,
>  if it offers VIRTIO_NET_F_MTU.
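To make the proposed virtqueue numbering concrete, here is a rough sketch of how a driver might derive the virtqueue indices from N (\field{max_virtqueue_pairs}), M (\field{max_rdma_cqs}) and L (\field{max_rdma_qps}), following the layout in the Virtqueues hunk of the patch. The struct and helper names are made up for illustration and are not part of the patch or of any existing driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the queue counts used by the layout:
 * n = virtqueue pairs (N), m = RDMA completion virtqueues (M),
 * l = RDMA queue pairs (L). */
struct vnet_roce_layout {
    uint32_t n;
    uint32_t m;
    uint32_t l;
};

/* controlq sits at index 2N, as in the existing layout. */
static uint32_t ctrl_vq_index(const struct vnet_roce_layout *lo)
{
    return 2 * lo->n;
}

/* rdma_completeq<i>, with i in 1..M, sits at index 2N + i. */
static uint32_t rdma_cq_vq_index(const struct vnet_roce_layout *lo, uint32_t i)
{
    assert(i >= 1 && i <= lo->m);
    return 2 * lo->n + i;
}

/* rdma_transmitq<j>, with j in 1..L, sits at index 2N + M + 2j - 1. */
static uint32_t rdma_sq_vq_index(const struct vnet_roce_layout *lo, uint32_t j)
{
    assert(j >= 1 && j <= lo->l);
    return 2 * lo->n + lo->m + 2 * j - 1;
}

/* rdma_receiveq<j>, with j in 1..L, sits at index 2N + M + 2j. */
static uint32_t rdma_rq_vq_index(const struct vnet_roce_layout *lo, uint32_t j)
{
    assert(j >= 1 && j <= lo->l);
    return 2 * lo->n + lo->m + 2 * j;
}
```

With N=2, M=3 and L=4 this gives controlq at index 4, the completion virtqueues at 5 to 7, and the send/receive virtqueue pairs at 8 to 15. Note that the indices of all RDMA virtqueues shift whenever N or M changes, which may matter for the migration question raised earlier.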
> > @@ -3306,6 +3338,12 @@ \subsection{Device Initialization}\label{sec:Device Types / Network Device / Dev > \item If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, > identify the control virtqueue. > > +\item If the VIRTIO_NET_F_ROCE feature bit is negotiated, > + identify the the RDMA completion virtqueues, up to max_rdma_cqs. > + > +\item If the VIRTIO_NET_F_ROCE feature bit is negotiated, > + identify the the RDMA send and receive virtqueues, up to max_rdma_qps. > + > \item Fill the receive queues with buffers: see \ref{sec:Device Types / Network Device / Device Operation / Setting Up Receive Buffers}. > > \item Even with VIRTIO_NET_F_MQ, only receiveq1, transmitq1 and > @@ -4007,6 +4045,7 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi > u8 command; > u8 command-specific-data[]; > u8 ack; > + u8 ack-specific-data[]; > }; > > /* ack values */ > @@ -4015,8 +4054,8 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi > \end{lstlisting} > > The \field{class}, \field{command} and command-specific-data are set by the > -driver, and the device sets the \field{ack} byte. There is little it can > -do except issue a diagnostic if \field{ack} is not > +driver, and the device sets the \field{ack} byte and ack-specific-data. There > +is little it can do except issue a diagnostic if \field{ack} is not > VIRTIO_NET_OK. > > \paragraph{Packet Receive Filtering}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Packet Receive Filtering} > @@ -4463,6 +4502,534 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi > according to the native endian of the guest rather than > (necessarily when not using the legacy interface) little-endian. 
> > +\paragraph{RoCE Configuration}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration} > + > +If the driver negotiates the VIRTIO_NET_F_ROCE feature bit (depends on VIRTIO_NET_F_CTRL_VQ), > +it can send control commands for RoCE usage. The following commands are defined now: > + > +\begin{lstlisting} > +#define VIRTIO_NET_CTRL_ROCE 6 > + #define VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE 0 > + #define VIRTIO_NET_CTRL_ROCE_QUERY_PORT 1 > + #define VIRTIO_NET_CTRL_ROCE_CREATE_CQ 2 > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_CQ 3 > + #define VIRTIO_NET_CTRL_ROCE_CREATE_PD 4 > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_PD 5 > + #define VIRTIO_NET_CTRL_ROCE_GET_DMA_MR 6 > + #define VIRTIO_NET_CTRL_ROCE_REG_USER_MR 7 > + #define VIRTIO_NET_CTRL_ROCE_DEREG_MR 8 > + #define VIRTIO_NET_CTRL_ROCE_CREATE_QP 9 > + #define VIRTIO_NET_CTRL_ROCE_MODIFY_QP 10 > + #define VIRTIO_NET_CTRL_ROCE_QUERY_QP 11 > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_QP 12 > + #define VIRTIO_NET_CTRL_ROCE_CREATE_AH 13 > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_AH 14 > + #define VIRTIO_NET_CTRL_ROCE_ADD_GID 15 > + #define VIRTIO_NET_CTRL_ROCE_DEL_GID 16 > + #define VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ 17 > +\end{lstlisting} > + > +\begin{description} > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE] Query the attributes of device. > + No command-specific-data; > + the ack-specific-data is \field{struct virtio_rdma_ack_query_device}. > + > +\begin{lstlisting} > +struct virtio_rdma_ack_query_device { > +#define VIRTIO_IB_DEVICE_RC_RNR_NAK_GEN (1 << 0) What's the meaning of this capability? > + /* Capabilities mask */ > + le64 device_cap_flags; Will this introduce a migration compatibility issue? E.g src and dst have the same features but different capabilities. > + /* Largest contiguous block that can be registered */ > + le64 max_mr_size; > + /* Supported memory shift sizes */ > + le64 page_size_cap; > + /* Hardware version */ > + le32 hw_ver; What did "hardware version" mean? 
Is this something that is defined in the IB spec? > + /* Maximum number of outstanding Work Requests (WR) on Send Queue (SQ) and Receive Queue (RQ) */ > + le32 max_qp_wr; Is this implied in the virtqueue size? If not, why? > + /* Maximum number of scatter/gather (s/g) elements per WR for SQ for non RDMA Read operations */ > + le32 max_send_sge; > + /* Maximum number of s/g elements per WR for RQ for non RDMA Read operations */ > + le32 max_recv_sge; > + /* Maximum number of s/g per WR for RDMA Read operations */ > + le32 max_sge_rd; > + /* Maximum size of Completion Queue (CQ) */ > + le32 max_cqe; Need to specify the reason why we can't use the virtqueue size for the completion queue. > + /* Maximum number of Memory Regions (MR) */ > + le32 max_mr; > + /* Maximum number of Protection Domains (PD) */ > + le32 max_pd; > + /* Maximum number of RDMA Read perations that can be outstanding per Queue Pair (QP) */ I guess you mean "operations" here. > + le32 max_qp_rd_atom; > + /* Maximum depth per QP for initiation of RDMA Read operations */ The member has an "atom" suffix, does it mean "atomic read" or other? > + le32 max_qp_init_rd_atom; > + /* Maximum number of Address Handles (AH) */ > + le32 max_ah; > + /* Local CA ack delay */ > + u8 local_ca_ack_delay; > + /* Padding */ > + u8 padding[3]; > + /* Reserved for future */ > + le32 reserved[14]; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_PORT] Query the attributes of port. > + No command-specific-data; > + the ack-specific-data is \field{struct virtio_rdma_ack_query_port}. > + > +\begin{lstlisting} > +struct virtio_rdma_ack_query_port { > + /* Length of source Global Identifier (GID) table */ > + le32 gid_tbl_len; > + /* Maximum message size */ > + le32 max_msg_sz; I guess this is for both read/write/send/receive? And is 4GB sufficient for the future? 
> + /* Reserved for future */ > + le32 reserved[6]; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_CQ] Create a Completion Queue (CQ). > + The command-specific-data is \field{struct virtio_rdma_cmd_create_cq}; > + the ack-specific-data is \field{struct virtio_rdma_ack_create_cq}. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_create_cq { > + /* Size of CQ */ > + le32 cqe; > +}; > + > +struct virtio_rdma_ack_create_cq { > + /* The index of CQ */ > + le32 cqn; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_CQ] Destroy a Completion Queue. > + The command-specific-data is \field{struct virtio_rdma_cmd_destroy_cq}; > + no ack-specific-data. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_destroy_cq { > + /* The index of CQ */ > + le32 cqn; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_PD] Create a Protection Domain (PD). > + No command-specific-data; > + the ack-specific-data is \field{struct virtio_rdma_ack_create_pd}. > + > +\begin{lstlisting} > +struct virtio_rdma_ack_create_pd { > + /* The handle of PD */ > + le32 pdn; > +}; > +\end{lstlisting} Can this command always succeed? I meant is there a limit of the total number of PDs that a single ROCE device can support? > + > +\item[VIRTIO_NET_CTRL_ROCE_DESTORY_PD] Destroy a Protection Domain. > + The command-specific-data is \field{virtio_rdma_cmd_destroy_pd}; > + no ack-specific-data. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_destroy_pd { > + /* The handle of PD */ > + le32 pdn; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_GET_DMA_MR] Get the DMA Memory Region (MR). > + associated with one protection domain. I wonder what's the difference between VIRTIO_NET_CTRL_ROCE_GET_DMA_MR and USR_MR. Can we unify them? > + The command-specific-data is \field{virtio_rdma_cmd_get_dma_mr}; > + the ack-specific-data is \field{virtio_rdma_ack_get_dma_mr}. 
> + > +\begin{lstlisting} > +enum virtio_ib_access_flags { > + VIRTIO_IB_ACCESS_LOCAL_WRITE = (1 << 0), Is LOCAL_READ implied to work always? > + VIRTIO_IB_ACCESS_REMOTE_WRITE = (1 << 1), > + VIRTIO_IB_ACCESS_REMOTE_READ = (1 << 2), > +}; > + > +struct virtio_rdma_cmd_get_dma_mr { > + /* The handle of PD which the MR associated with */ > + le32 pdn; > + /* MR's protection attributes, enum virtio_ib_access_flags */ > + le32 access_flags; > +}; > + > +struct virtio_rdma_ack_get_dma_mr { > + /* The handle of MR */ > + le32 mrn; > + /* MR's local access key */ > + le32 lkey; > + /* MR's remote access key */ > + le32 rkey; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_REG_USER_MR] Register a user Memory Region > + associated with one Protection Domain. > + The command-specific-data is \field{virtio_rdma_cmd_reg_user_mr}; > + the ack-specific-data is \field{virtio_rdma_ack_reg_user_mr}. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_reg_user_mr { > + /* The handle of PD which the MR associated with */ > + le32 pdn; > + /* MR's protection attributes, enum virtio_ib_access_flags */ > + le32 access_flags; > + /* Starting virtual address of MR */ > + le64 virt_addr; I guess this is actually the I/O virtual address and the device is in charge of translate it to the page arrays below? > + /* Length of MR */ > + le64 length; > + /* Size of the below page array */ > + le32 npages; > + /* Padding */ > + le32 padding; > + /* Array to store physical address of each page in MR */ > + le64 pages[]; How do device know the size of a page? > +}; I believe this command can fail, we need to describe the error conditions. > + > +struct virtio_rdma_ack_reg_user_mr { > + /* The handle of MR */ > + le32 mrn; > + /* MR's local access key */ > + le32 lkey; > + /* MR's remote access key */ > + le32 rkey; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_DEREG_MR] De-register a Memory Region. 
> + The command-specific-data is \field{virtio_rdma_cmd_dereg_mr}; > + no ack-specific-data. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_dereg_mr { > + /* The handle of MR */ > + le32 mrn; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_QP] Create a Queue Pair (Send Queue and Receive Queue). > + The command-specific-data is \field{virtio_rdma_cmd_create_qp}; > + the ack-specific-data is \field{virtio_rdma_ack_create_qp}. > + > +\begin{lstlisting} > +struct virtio_rdma_qp_cap { > + /* Maximum number of outstanding WRs in SQ */ > + le32 max_send_wr; > + /* Maximum number of outstanding WRs in RQ */ > + le32 max_recv_wr; > + /* Maximum number of s/g elements per WR in SQ */ > + le32 max_send_sge; > + /* Maximum number of s/g elements per WR in RQ */ > + le32 max_recv_sge; > + /* Maximum number of data (bytes) that can be posted inline to SQ */ > + le32 max_inline_data; > + /* Padding */ > + le32 padding; > +}; > + > +struct virtio_rdma_cmd_create_qp { > + /* The handle of PD which the QP associated with */ > + le32 pdn; > +#define VIRTIO_IB_QPT_SMI 0 > +#define VIRTIO_IB_QPT_GSI 1 > +#define VIRTIO_IB_QPT_RC 2 > +#define VIRTIO_IB_QPT_UC 3 > +#define VIRTIO_IB_QPT_UD 4 > + /* QP's type */ > + u8 qp_type; > + /* If set, each WR submitted to the SQ generates a completion entry */ > + u8 sq_sig_all; > + /* Padding */ > + u8 padding[2]; > + /* The index of CQ which the SQ associated with */ > + le32 send_cqn; > + /* The index of CQ which the RQ associated with */ > + le32 recv_cqn; > + /* QP's capabilities */ > + struct virtio_rdma_qp_cap cap; > + /* Reserved for future */ > + le32 reserved[4]; > +}; > + > +struct virtio_rdma_ack_create_qp { > + /* The index of QP */ > + le32 qpn; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_MODIFY_QP] Modify the attributes of a Queue Pair. > + The command-specific-data is \field{virtio_rdma_cmd_modify_qp}; > + no ack-specific-data. 
> + > +\begin{lstlisting} > +struct virtio_rdma_global_route { > + /* Destination GID or MGID */ > + u8 dgid[16]; > + /* Flow label */ > + le32 flow_label; > + /* Source GID index */ > + u8 sgid_index; > + /* Hop limit */ > + u8 hop_limit; > + /* Traffic class */ > + u8 traffic_class; > + /* Padding */ > + u8 padding; > +}; > + > +struct virtio_rdma_ah_attr { > + /* Global Routing Header (GRH) attributes */ > + virtio_rdma_global_route grh; > + /* Destination MAC address */ > + u8 dmac[6]; > + /* Reserved for future */ > + u8 reserved[10]; > +}; > + > +enum virtio_ib_qp_attr_mask { > + VIRTIO_IB_QP_STATE = (1 << 0), > + VIRTIO_IB_QP_CUR_STATE = (1 << 1), > + VIRTIO_IB_QP_ACCESS_FLAGS = (1 << 2), > + VIRTIO_IB_QP_QKEY = (1 << 3), > + VIRTIO_IB_QP_AV = (1 << 4), > + VIRTIO_IB_QP_PATH_MTU = (1 << 5), > + VIRTIO_IB_QP_TIMEOUT = (1 << 6), > + VIRTIO_IB_QP_RETRY_CNT = (1 << 7), > + VIRTIO_IB_QP_RNR_RETRY = (1 << 8), > + VIRTIO_IB_QP_RQ_PSN = (1 << 9), > + VIRTIO_IB_QP_MAX_QP_RD_ATOMIC = (1 << 10), > + VIRTIO_IB_QP_MIN_RNR_TIMER = (1 << 11), > + VIRTIO_IB_QP_SQ_PSN = (1 << 12), > + VIRTIO_IB_QP_MAX_DEST_RD_ATOMIC = (1 << 13), > + VIRTIO_IB_QP_CAP = (1 << 14), > + VIRTIO_IB_QP_DEST_QPN = (1 << 15), > + VIRTIO_IB_QP_RATE_LIMIT = (1 << 16), > +}; Do we need to explain the above error codes? Or it's simply a map from IB spec? 
> + > +enum virtio_ib_qp_state { > + VIRTIO_IB_QPS_RESET, > + VIRTIO_IB_QPS_INIT, > + VIRTIO_IB_QPS_RTR, > + VIRTIO_IB_QPS_RTS, > + VIRTIO_IB_QPS_SQD, > + VIRTIO_IB_QPS_SQE, > + VIRTIO_IB_QPS_ERR > +}; > + > +enum virtio_ib_mtu { > + VIRTIO_IB_MTU_256 = 1, > + VIRTIO_IB_MTU_512 = 2, > + VIRTIO_IB_MTU_1024 = 3, > + VIRTIO_IB_MTU_2048 = 4, > + VIRTIO_IB_MTU_4096 = 5 > +}; > + > +struct virtio_rdma_cmd_modify_qp { > + /* The index of QP */ > + le32 qpn; > + /* The mask of attributes needs to be modified, enum virtio_ib_qp_attr_mask */ > + le32 attr_mask; > + /* Move the QP to this state, enum virtio_ib_qp_state */ > + u8 qp_state; > + /* Current QP state, enum virtio_ib_qp_state */ > + u8 cur_qp_state; > + /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */ > + u8 path_mtu; > + /* Number of outstanding RDMA Read operations on destination QP (valid only for RC QPs) */ > + u8 max_rd_atomic; > + /* Number of responder resources for handling incoming RDMA Read operations (valid only for RC QPs) */ > + u8 max_dest_rd_atomic; > + /* Minimum RNR (Receiver Not Ready) NAK timer (valid only for RC QPs) */ > + u8 min_rnr_timer; > + /* Local ack timeout (valid only for RC QPs) */ > + u8 timeout; > + /* Retry count (valid only for RC QPs) */ > + u8 retry_cnt; > + /* RNR retry (valid only for RC QPs) */ > + u8 rnr_retry; > + /* Padding */ > + u8 padding[7]; > + /* Q_Key for the QP (valid only for UD QPs) */ > + le32 qkey; > + /* PSN for RQ (valid only for RC/UC QPs) */ > + le32 rq_psn; > + /* PSN for SQ */ > + le32 sq_psn; > + /* Destination QP number (valid only for RC/UC QPs) */ > + le32 dest_qp_num; > + /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */ > + le32 qp_access_flags; > + /* Rate limit in kbps for packet pacing */ > + le32 rate_limit; > + /* QP capabilities */ > + struct virtio_rdma_qp_cap cap; > + /* Address Vector (valid only for RC/UC QPs) */ > + struct virtio_rdma_ah_attr ah_attr; > + /* Reserved for 
future */ > + le32 reserved[4]; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_QP] Query the attributes of a Queue Pair. > + The command-specific-data is \field{virtio_rdma_cmd_query_qp}; > + the ack-specific-data is \field{virtio_rdma_ack_query_qp}. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_query_qp { > + /* The index of QP */ > + le32 qpn; > + /* The mask of attributes need to be queried, enum virtio_ib_qp_attr_mask */ > + le32 attr_mask; > +}; > + > +struct virtio_rdma_ack_query_qp { Any chance to unify this with virtio_rdma_cmd_modify_qp? > + /* Move the QP to this state, enum virtio_ib_qp_state */ > + u8 qp_state; > + /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */ > + u8 path_mtu; > + /* Is the SQ draining */ > + u8 sq_draining; > + /* Number of outstanding RDMA read operations on destination QP (valid only for RC QPs) */ > + u8 max_rd_atomic; > + /* Number of responder resources for handling incoming RDMA read operations (valid only for RC QPs) */ > + u8 max_dest_rd_atomic; > + /* Minimum RNR NAK timer (valid only for RC QPs) */ > + u8 min_rnr_timer; > + /* Local ack timeout (valid only for RC QPs) */ > + u8 timeout; > + /* Retry count (valid only for RC QPs) */ > + u8 retry_cnt; > + /* RNR retry (valid only for RC QPs) */ > + u8 rnr_retry; > + /* Padding */ > + u8 padding[7]; > + /* Q_Key for the QP (valid only for UD QPs) */ > + le32 qkey; > + /* PSN for RQ (valid only for RC/UC QPs) */ > + le32 rq_psn; > + /* PSN for SQ */ > + le32 sq_psn; > + /* Destination QP number (valid only for RC/UC QPs) */ > + le32 dest_qp_num; > + /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */ > + le32 qp_access_flags; > + /* Rate limit in kbps for packet pacing */ > + le32 rate_limit; > + /* QP capabilities */ > + struct virtio_rdma_qp_cap cap; > + /* Address Vector (valid only for RC/UC QPs) */ > + struct virtio_rdma_ah_attr ah_attr; > + /* Reserved for future */ > + le32 
reserved[4]; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_QP] Destroy a Queue Pair. > + The command-specific-data is \field{virtio_rdma_cmd_destroy_qp}; > + no ack-specific-data. What happen to the pending requests? Will the device wait for the completion or not? > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_destroy_qp { > + /* The index of QP */ > + le32 qpn; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_AH] Create a Address Handle (AH). > + The command-specific-data is \field{virtio_rdma_cmd_create_ah}; > + the ack-specific-data is \field{virtio_rdma_ack_create_ah}. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_create_ah { > + /* The handle of PD which the AH associated with */ > + le32 pdn; > + /* Padding */ > + le32 padding; > + /* Address Vector */ > + struct virtio_rdma_ah_attr ah_attr; > +}; > + > +struct virtio_rdma_ack_create_ah { > + /* The address handle */ > + le32 ah; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_AH] Destroy a Address Handle. > + The command-specific-data is \field{virtio_rdma_cmd_destroy_ah}; > + no ack-specific-data. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_destroy_ah { > + /* The handle of PD which the AH associated with */ > + le32 pdn; > + /* The address handle */ > + le32 ah; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_ADD_GID] Add a Global Identifier (GID). > + The command-specific-data is \field{virtio_rdma_cmd_add_gid}; > + no ack-specific-data. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_add_gid { > + /* Index of GID */ > + le16 index; > + /* Padding */ > + le16 padding[3]; > + /* GID to be added */ > + u8 gid[16]; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_DEL_GID] Delete a Global Identifier. > + The command-specific-data is \field{virtio_rdma_cmd_del_gid}; > + no ack-specific-data. 
> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_del_gid {
> +        /* Index of GID */
> +        le16 index;
> +};
> +\end{lstlisting}
> +
> +\item[VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ] Request a completion notification
> +        on a Completion Queue.
> +        The command-specific-data is \field{virtio_rdma_cmd_req_notify};
> +        no ack-specific-data.
> +
> +\begin{lstlisting}
> +struct virtio_rdma_cmd_req_notify {
> +        /* The index of CQ */
> +        le32 cqn;
> +#define VIRTIO_IB_NOTIFY_SOLICITED (1 << 0)
> +#define VIRTIO_IB_NOTIFY_NEXT_COMPLETION (1 << 1)

Need to describe the differences on those two flags.

> +        /* Notify flags */
> +        le32 flags;
> +};
> +\end{lstlisting}
> +
> +\end{description}
> +
> +\drivernormative{\subparagraph}{RoCE Configuration}{Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration}
> +
> +A driver MUST initialize the completion virtqueue and fill it with
> +enough entries after command VIRTIO_NET_CTRL_ROCE_CREATE_CQ is
> +successfully executed.
> +
> +A driver MUST reset the completion virtqueue after

How to do the reset? Do you mean driver need to reset the indices?

> +command VIRTIO_NET_CTRL_ROCE_DESTROY_CQ is successfully executed.
> +
> +A driver MUST initialize the send virtqueue and receive virtqueue after
> +command VIRTIO_NET_CTRL_ROCE_CREATE_QP is successfully executed.
> +
> +A driver MUST reset the send virtqueue and receive virtqueue after
> +command VIRTIO_NET_CTRL_ROCE_DESTROY_QP is successfully executed.
>
>  \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
>  Types / Network Device / Legacy Interface: Framing Requirements}
> @@ -4496,6 +5063,289 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
>  See \ref{sec:Basic
>  Facilities of a Virtio Device / Virtqueues / Message Framing}.
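As an illustration of the control-virtqueue framing with the new ack-specific-data, the following is a minimal sketch of how a VIRTIO_NET_CTRL_ROCE_CREATE_CQ exchange could be laid out in memory. It assumes the fields are byte-packed exactly in the order given by the patch (class, command, command-specific-data in the device-readable part; ack, ack-specific-data in the device-writable part) and that VIRTIO_NET_OK is 0 as in the existing spec. The helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_OK                   0
#define VIRTIO_NET_CTRL_ROCE            6
#define VIRTIO_NET_CTRL_ROCE_CREATE_CQ  2

/* Store a 32-bit value in little-endian byte order on any host. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Load a 32-bit little-endian value on any host. */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: build the device-readable part, i.e. class,
 * command and struct virtio_rdma_cmd_create_cq { le32 cqe; }.
 * Returns the number of bytes written, or 0 if buf is too small. */
static size_t build_create_cq_cmd(uint8_t *buf, size_t len, uint32_t cqe)
{
    if (len < 6)
        return 0;
    buf[0] = VIRTIO_NET_CTRL_ROCE;
    buf[1] = VIRTIO_NET_CTRL_ROCE_CREATE_CQ;
    put_le32(buf + 2, cqe);
    return 6;
}

/* Hypothetical helper: parse the device-writable part, i.e. the ack
 * byte followed by struct virtio_rdma_ack_create_cq { le32 cqn; }.
 * Returns 0 and stores the CQ index on success, -1 otherwise. */
static int parse_create_cq_ack(const uint8_t *buf, size_t len, uint32_t *cqn)
{
    if (len < 5 || buf[0] != VIRTIO_NET_OK)
        return -1;
    *cqn = get_le32(buf + 1);
    return 0;
}
```

The sketch also shows one consequence of the framing: the ack-specific-data is only meaningful when \field{ack} is VIRTIO_NET_OK, so the spec text probably needs to say what the device writes there on failure.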
> > +\subsubsection{RoCE Support}\label{sec:Device Types / Network Device / Device Operation / RoCE Support} > + > +RDMA over Converged Ethernet (RoCE) is a network protocol that allows > +remote direct memory access (RDMA) over an Ethernet network. To support > +RoCE (if VIRTIO_NET_F_ROCE is negotiated), in addtion to the control > +virtqueue support mentioned in \ref{sec:Device Types / Network Device / > +Device Operation / Control Virtqueue / RoCE Configuration}, multiple > +types of virtqueues including send virtqueue, receive virtqueue and > +completion virtqueue are introduced. > + > +The send virtqueue contains elements that describe the data to be > +transmitted. > + > +Requests (device-readable) have the following format: > + > +\begin{lstlisting} > +enum virtio_ib_wr_opcode { > + VIRTIO_IB_WR_RDMA_WRITE, > + VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM, > + VIRTIO_IB_WR_SEND, > + VIRTIO_IB_WR_SEND_WITH_IMM, > + VIRTIO_IB_WR_RDMA_READ, > +}; > + > +struct virtio_rdma_sge { > + le64 addr; > + le32 length; > + le32 lkey; > +}; > + > +struct virtio_rdma_sq_req { > + /* User defined WR ID */ > + le64 wr_id; > + /* WR opcode, enum virtio_ib_wr_opcode */ > + u8 opcode; > +#define VIRTIO_IB_SEND_FENCE (1 << 0) > +#define VIRTIO_IB_SEND_SIGNALED (1 << 1) > +#define VIRTIO_IB_SEND_SOLICITED (1 << 2) > +#define VIRTIO_IB_SEND_INLINE (1 << 3) > + /* Flags of the WR properties */ > + u8 send_flags; > + /* Padding */ > + le16 padding; > + /* Immediate data (in network byte order) to send */ > + le32 imm_data; > + union { > + struct { > + /* Start address of remote memory buffer */ > + le64 remote_addr; > + /* Key of the remote MR */ > + le32 rkey; > + } rdma; > + struct { > + /* Index of the destination QP */ > + le32 remote_qpn; > + /* Q_Key of the destination QP */ > + le32 remote_qkey; > + /* Address Handle */ > + le32 ah; > + } ud; > + /* Reserved for future */ > + le64 reserved[4]; > + }; > + /* Inline data */ > + u8 inline_data[512]; > + union { > + /* Length of sg_list */ > 
+ le32 num_sge; > + /* Length of inline data */ > + le16 inline_len; > + }; > + /* Reserved for future */ > + le32 reserved2[3]; > + /* Scatter/gather list */ > + struct virtio_rdma_sge sg_list[]; > +}; > +\end{lstlisting} > + > +The receive virtqueue contains elements that describe where to place incoming data. > + > +Requests (device-readable) have the following format: > + > +\begin{lstlisting} > +struct virtio_rdma_rq_req { > + /* User defined WR ID */ > + le64 wr_id; > + /* Length of sg_list */ > + le32 num_sge; > + /* Reserved for future */ > + le32 reserved[3]; > + /* Scatter/gather list */ > + struct virtio_rdma_sge sg_list[]; > +}; > +\end{lstlisting} > + > +The completion virtqueue is used to notify the completion of requests in > +send virtqueue or receive virtqueue. > + > +Requests (device-writable) have the following format: > + > +\begin{lstlisting} > +enum virtio_ib_wc_opcode { > + VIRTIO_IB_WC_SEND, > + VIRTIO_IB_WC_RDMA_WRITE, > + VIRTIO_IB_WC_RDMA_READ, > + VIRTIO_IB_WC_RECV, > + VIRTIO_IB_WC_RECV_RDMA_WITH_IMM, > +}; > + > +enum virtio_ib_wc_status { > + /* Operation completed successfully */ > + VIRTIO_IB_WC_SUCCESS, > + /* Local Length Error */ > + VIRTIO_IB_WC_LOC_LEN_ERR, > + /* Local QP Operation Error */ > + VIRTIO_IB_WC_LOC_QP_OP_ERR, > + /* Local Protection Error */ > + VIRTIO_IB_WC_LOC_PROT_ERR, > + /* Work Request Flushed Error */ > + VIRTIO_IB_WC_WR_FLUSH_ERR, > + /* Bad Response Error */ > + VIRTIO_IB_WC_BAD_RESP_ERR, > + /* Local Access Error */ > + VIRTIO_IB_WC_LOC_ACCESS_ERR, > + /* Remote Invalid Request Error */ > + VIRTIO_IB_WC_REM_INV_REQ_ERR, > + /* Remote Access Error */ > + VIRTIO_IB_WC_REM_ACCESS_ERR, > + /* Remote Operation Error */ > + VIRTIO_IB_WC_REM_OP_ERR, > + /* Transport Retry Counter Exceeded */ > + VIRTIO_IB_WC_RETRY_EXC_ERR, > + /* RNR Retry Counter Exceeded */ > + VIRTIO_IB_WC_RNR_RETRY_EXC_ERR, > + /* Remote Aborted Error */ > + VIRTIO_IB_WC_REM_ABORT_ERR, > + /* Fatal Error */ > + VIRTIO_IB_WC_FATAL_ERR, > + 
/* Response Timeout Error */ > + VIRTIO_IB_WC_RESP_TIMEOUT_ERR, > + /* General Error */ > + VIRTIO_IB_WC_GENERAL_ERR > +}; > + > +struct virtio_rdma_cq_req { > + /* User defined WR ID */ > + le64 wr_id; > + /* Work completion status, enum virtio_ib_wc_status */ > + u8 status; > + /* WR opcode, enum virtio_ib_wc_opcode */ > + u8 opcode; > + /* Padding */ > + le16 padding; > + /* Vendor error */ > + le32 vendor_err; > + /* Number of bytes transferred */ > + le32 byte_len; > + /* Immediate data (in network byte order) to send */ > + le32 imm_data; > + /* Local QP number of completed WR */ > + le32 qp_num; > + /* Source QP number (remote QP number) of completed WR (valid only for UD QPs) */ > + le32 src_qp; > +#define VIRTIO_IB_WC_GRH (1 << 0) > +#define VIRTIO_IB_WC_WITH_IMM (1 << 1) > + /* Work completion flag */ > + le32 wc_flags; > + /* Reserved for future */ > + le32 reserved[3]; > +}; > +\end{lstlisting} > + > +\paragraph{Send Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Send Operation} > + > +The send operation allows us to send data to a remote QP’s Receive Queue. > +The receiver MUST have previously posted a receive buffer to receive the data. "MUST" keyword must belong to the normative section. > + > +To do a send operation, a request with \field{opcode} set to > +VIRTIO_IB_WR_SEND or VIRTIO_IB_WR_SEND_WITH_IMM MUST be posted to the Send > +Queue as one output descriptor and the device is notified of the new entry. > + > +\drivernormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation} > + > +If VIRTIO_IB_SEND_INLINE is set in \field{send_flags}, the driver MUST fill > +send buffer into \field{inline_data} field and set \field{inline_len} to the > +length of the buffer. Otherwise, the driver MUST fill \field{sg_list} to > +describe the buffer. 
> + > +\devicenormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation} > + > +If \field{opcode} is not set to VIRTIO_IB_WR_SEND_WITH_IMM, the device MUST > +ignore \field{imm_data}. > + > +If the QP type is UD, the device MUST validate \field{ud.ah}. > + > +If VIRTIO_IB_SEND_INLINE is not set in \field{send_flags}, the device MUST > +validate the \field{addr}, \field{length} and \field{lkey} in \field{sg_list}. > + > +\paragraph{Receive Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Receive Operation} > + > +The receive operation allows us to receive data from remote QP. > +It's the corresponding operation to a send operation. > + > +To do a receive operation, a request MUST be posted to the Receive > +Queue as one output descriptor and the device is notified of the new entry. > + I think we probably need to be more verbose as what has been done for virtio-net. That is, describe what need to be filled in virtio_rdma_rq_req in details. (And do this for other operation as well) > +\drivernormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation} > + > +The driver MUST fill \field{sg_list} to describe the receive buffer. > + > +\devicenormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation} > + > +The device MUST validate the \field{addr}, \field{length} and \field{lkey} > +in \field{sg_list}. > + > +\paragraph{Write Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Write Operation} > + > +The write operation allows us to write data to the local memory buffer > +in remote side with no notification. The remote side wouldn't be aware > +that this operation being done. 
> + > +To do a write operation, a request with \field{opcode} set to > +VIRTIO_IB_WR_RDMA_WRITE or VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM MUST be > +posted to the Send Queue as one output descriptor and the device is > +notified of the new entry. > + > +\drivernormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation} > + > +The driver MUST fill \field{sg_list} to describe the write buffer. So sg is a must even if the driver want to use imm? > + > +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to > +identify the remote buffer. > + > +\devicenormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation} > + > +If \field{opcode} is not set to VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM, the device > +MUST ignore \field{imm_data}. > + > +The device MUST validate the \field{addr}, \field{length} and \field{lkey} > +in \field{sg_list}. > + > +\paragraph{Read Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Read Operation} > + > +The read operation allows us to read data from the local memory buffer > +in remote side with no notification. The remote side wouldn't be aware > +that this operation being done. > + > +To do a read operation, a request with \field{opcode} set to > +VIRTIO_IB_WR_RDMA_READ MUST be posted to the Send Queue as one output > +descriptor and the device is notified of the new entry. > + > +\drivernormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation} > + > +The driver MUST fill \field{sg_list} to describe the read buffer. > + > +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to > +identify the remote buffer. 
> +
> +\devicenormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
> +
> +The device MUST validate the \field{addr}, \field{length} and \field{lkey}
> +in \field{sg_list}.
> +
> +\paragraph{Completion Notification}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
> +
> +After above operation is completed, a completion notification MUST
> +be triggered by the device.

For "completion notification", do you mean the virtqueue notification of
the cq, or making the buffer that contains the cqe used?

> To achieve that, the device MUST consume
> +an entry of the Completion Queue associated with the Send Queue/Receive
> +Queue which the operation belongs to.
> +
> +\drivernormative{\subparagraph}{Completion Notification}{Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
> +
> +The driver MUST fill the Completion Queue with enough entries previously.

What do you mean by "previously"? What happens if there's no sufficient cqe?

Thanks

> +
> +\devicenormative{\subparagraph}{Completion Notification}{Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
> +
> +If \field{imm_data} is valid, the device MUST set VIRTIO_IB_WC_WITH_IMM to
> +\field{wc_flags}.
> +
> +The device MUST set \field{wr_id} to the value of \field{wr_id} of
> +corresponding \field{struct virtio_rdma_sq_req} or
> +\field{struct virtio_rdma_rq_req}.
> +
> \section{Block Device}\label{sec:Device Types / Block Device}
>
> The virtio block device is a simple virtual block device (ie.
> --
> 2.11.0
>
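
As a reading aid for the send-operation rules in the patch quoted above, here is a minimal C sketch of how a driver might fill struct virtio_rdma_sq_req for VIRTIO_IB_WR_SEND, either inline or through sg_list. The struct layout is copied from the RFC. The helper name build_send_req, the plain-integer typedefs for le16/le32/le64 (which assume a little-endian host) and the choice to always set VIRTIO_IB_SEND_SIGNALED are assumptions made for illustration; they are not part of the patch or of any existing driver.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: little-endian host, so the le* types map to plain integers. */
typedef uint8_t  u8;
typedef uint16_t le16;
typedef uint32_t le32;
typedef uint64_t le64;

enum virtio_ib_wr_opcode {
    VIRTIO_IB_WR_RDMA_WRITE,
    VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM,
    VIRTIO_IB_WR_SEND,
    VIRTIO_IB_WR_SEND_WITH_IMM,
    VIRTIO_IB_WR_RDMA_READ,
};

#define VIRTIO_IB_SEND_SIGNALED (1 << 1)
#define VIRTIO_IB_SEND_INLINE   (1 << 3)

struct virtio_rdma_sge {
    le64 addr;
    le32 length;
    le32 lkey;
};

/* Layout as proposed in the RFC (comments shortened). */
struct virtio_rdma_sq_req {
    le64 wr_id;
    u8 opcode;
    u8 send_flags;
    le16 padding;
    le32 imm_data;
    union {
        struct { le64 remote_addr; le32 rkey; } rdma;
        struct { le32 remote_qpn; le32 remote_qkey; le32 ah; } ud;
        le64 reserved[4];
    };
    u8 inline_data[512];
    union {
        le32 num_sge;    /* used when VIRTIO_IB_SEND_INLINE is clear */
        le16 inline_len; /* used when VIRTIO_IB_SEND_INLINE is set */
    };
    le32 reserved2[3];
    struct virtio_rdma_sge sg_list[];
};

/*
 * Hypothetical helper: build a SEND request.
 * If inline_buf is non-NULL the payload is copied into inline_data,
 * otherwise the sges array is copied into sg_list.
 * Returns NULL on invalid input or allocation failure; caller frees.
 */
static struct virtio_rdma_sq_req *
build_send_req(uint64_t wr_id, const void *inline_buf, size_t inline_len,
               const struct virtio_rdma_sge *sges, uint32_t num_sge)
{
    struct virtio_rdma_sq_req *req;
    size_t size = sizeof(*req);
    int use_inline = (inline_buf != NULL);

    /* Sanity check on the element layout: le64 + le32 + le32. */
    assert(sizeof(struct virtio_rdma_sge) == 16);

    if (use_inline && inline_len > sizeof(req->inline_data))
        return NULL;
    if (!use_inline && (sges == NULL || num_sge == 0))
        return NULL;
    if (!use_inline)
        size += (size_t)num_sge * sizeof(struct virtio_rdma_sge);

    req = calloc(1, size); /* zeroes padding and reserved fields */
    if (req == NULL)
        return NULL;

    req->wr_id = wr_id;
    req->opcode = VIRTIO_IB_WR_SEND;
    req->send_flags = VIRTIO_IB_SEND_SIGNALED;

    if (use_inline) {
        /* Inline case: payload goes into inline_data, length into inline_len. */
        req->send_flags |= VIRTIO_IB_SEND_INLINE;
        memcpy(req->inline_data, inline_buf, inline_len);
        req->inline_len = (le16)inline_len;
    } else {
        /* Non-inline case: sg_list describes the send buffer. */
        req->num_sge = num_sge;
        memcpy(req->sg_list, sges, (size_t)num_sge * sizeof(*sges));
    }
    return req;
}
```

A caller would pass either an inline buffer of at most 512 bytes or a non-empty array of virtio_rdma_sge entries. The returned request still has to be placed on the RDMA send virtqueue as one device-readable descriptor, which is not shown here.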
On Thu, Aug 4, 2022 at 4:30 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, May 11, 2022 at 5:59 PM Xie Yongji <xieyongji@bytedance.com> wrote:
> >
> > Hi all,
> >
>
> Not very familiar with ROCE, try to give some comments from general
> virtio level.
>

Thank you!

> > This RFC aims to introduce our recent work on enabling RoCE support
> > for virtio-net device.
>
> We need to clarify the version of ROCE, is it ROCEv2 or not?
>

Yes, it's RoCE v2.

> >
> > To support RoCE, three types of virtqueues including RDMA send virtqueue,
> > RDMA receive virtqueue and RDMA completion virtqueue are introduced.
> > And control virtqueue is reused to support the RDMA control messages.
> >
> > Now we support some basic RDMA semantics such as send/receive
> > and read/write operation.
>
> It would be better to explain the advantages of this over the existing
> pvrdma approach. I guess one advantage is that using virtio makes it
> easier to connect to a userspace dataplane through vDPA/vhost-user?
>

Yes, this is one advantage. Another one is that we don't need a
physical RDMA-capable NIC.

> >
> > To test with our demo:
> >
> > 1. Build Guest kernel [1] with config INFINIBAND_VIRTIO_RDMA
> >
> > 2. Build QEMU [2] with config VHOST_USER_RDMA
> >
> > 3. Build rdma-core [3]
> >
> > 4. Build and install DPDK (NOTE that we only tested on DPDK 20.11.3)
> >
> > 5. Build vhost-user-rdma [4]
> >
> > 6. Run vhost-user-rdma with command:
> > $ ./vhost-user-rdma --vdev 'net_tap0' --lcore '1-3' -- -s '/tmp/vhost-rdma0'
> >
> > 7. Run qemu with command:
> > $ qemu-system-x86_64 -chardev socket,path=/tmp/vhost-rdma0,id=vrdma \
> > -device vhost-user-rdma-pci,page-per-vq,chardev=vrdma ...
>
> It would be better to give some performance numbers (or even compare
> it with pvrdma).
>

OK, will do it in v3.
> >
> > [1] https://github.com/bytedance/linux/tree/virtio-net-roce
> > [2] https://github.com/bytedance/qemu/tree/vhost-user-rdma
> > [3] https://github.com/YongjiXie/rdma-core/tree/virtio-rdma
> > [4] https://github.com/YongjiXie/vhost-user-rdma
> >
> > We have already tested it with ibv_rc_pingpong, ibv_ud_pingpong and some
> > others in rdma-core.
> >
> > TODO:
> >
>
> And we'd better consider the live migration support. Having a quick
> glance, it looks to me trapping the cvq is sufficient?
>

I'm not sure. Each QP has its own state machine, which may also
require save & restore.

> > 1. Add support for Base Memory Management Extensions
> >
> > 2. Add support for atomic operation
> >
> > 3. Add support for SRQ
> >
> > 4. Add support for virtqueue resize
>
> Note that this is already supported by the spec via virtqueue reset.
>

OK.

> >
> > 5. Add support for enabling/disabling virtqueue at runtime
>
> I guess virtqueue reset could help in this case.
>

We might need to do some extension since we want to free the resources
when disabling the queue.

> >
> > Please review, thanks!
> >
> > V1 to V2:
> > - Rework the implementation via extending virtio-net instead of
> > introducing a new device type [Jason]
> > - Add address handle support
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > Co-developed-by: Wei Junji <weijunji@bytedance.com>
> > Signed-off-by: Wei Junji <weijunji@bytedance.com>
> > ---
> > content.tex | 858 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 854 insertions(+), 4 deletions(-)
>
> I wonder if there's some open-source ROCE transport device API that we
> can re-use then we can just behave like a transport layer instead of
> inventing new commands.
>

That would be better. But I didn't find one.
> > > > diff --git a/content.tex b/content.tex > > index 7508dd1..646d82a 100644 > > --- a/content.tex > > +++ b/content.tex > > @@ -3008,7 +3008,10 @@ \section{Network Device}\label{sec:Device Types / Network Device} > > placed in one virtqueue for receiving packets, and outgoing > > packets are enqueued into another for transmission in that order. > > A third command queue is used to control advanced filtering > > -features. > > +features. And if RoCE (RDMA over Converged Ethernet) capability > > +is enabled, the virtio network device can also support transmitting > > +and receiving RDMA message through RDMA send virtqueue, RDMA receive > > +virtqueue and RDMA completion virtqueue. > > > > \subsection{Device ID}\label{sec:Device Types / Network Device / Device ID} > > > > @@ -3023,13 +3026,24 @@ \subsection{Virtqueues}\label{sec:Device Types / Network Device / Virtqueues} > > \item[2(N-1)] receiveqN > > \item[2(N-1)+1] transmitqN > > \item[2N] controlq > > +\item[2N+1] rdma_completeq1 > > +\item[\ldots] > > +\item[2N+M] rdma_completeqM > > +\item[2N+M+1] rdma_transmitq1 > > +\item[2N+M+2] rdma_receiveq1 > > +\item[\ldots] > > +\item[2N+M+2L-1] rdma_transmitqL > > +\item[2N+M+2L] rdma_receiveqL > > \end{description} > > > > N=1 if neither VIRTIO_NET_F_MQ nor VIRTIO_NET_F_RSS are negotiated, otherwise N is set by > > - \field{max_virtqueue_pairs}. > > + \field{max_virtqueue_pairs}. M is set by \field{max_rdma_cqs} and L is set by > > + \field{max_rdma_qps}. > > > > controlq only exists if VIRTIO_NET_F_CTRL_VQ set. > > > > + rdma_completeq, rdma_transmitq and rdma_receiveq only exist if VIRTIO_NET_F_ROCE set > > + > > \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits} > > > > \begin{description} > > @@ -3084,6 +3098,9 @@ \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits > > \item[VIRTIO_NET_F_CTRL_MAC_ADDR(23)] Set MAC address through control > > channel. 
> >
> > +\item[VIRTIO_NET_F_ROCE(55)] Device supports RoCE (RDMA over Converged Ethernet)
> > + capability.
> > +
> > \item[VIRTIO_NET_F_HOST_USO (56)] Device can receive USO packets. Unlike UFO
> > (fragmenting the packet) the USO splits large UDP packet
> > to several segments when each of these smaller packets has UDP header.
> > @@ -3129,6 +3146,7 @@ \subsubsection{Feature bit requirements}\label{sec:Device Types / Network Device
> > \item[VIRTIO_NET_F_GUEST_ANNOUNCE] Requires VIRTIO_NET_F_CTRL_VQ.
> > \item[VIRTIO_NET_F_MQ] Requires VIRTIO_NET_F_CTRL_VQ.
> > \item[VIRTIO_NET_F_CTRL_MAC_ADDR] Requires VIRTIO_NET_F_CTRL_VQ.
> > +\item[VIRTIO_NET_F_ROCE] Requires VIRTIO_NET_F_CTRL_VQ.
> > \item[VIRTIO_NET_F_RSC_EXT] Requires VIRTIO_NET_F_HOST_TSO4 or VIRTIO_NET_F_HOST_TSO6.
> > \item[VIRTIO_NET_F_RSS] Requires VIRTIO_NET_F_CTRL_VQ.
> > \end{description}
> > @@ -3190,6 +3208,8 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device
> > u8 rss_max_key_size;
> > le16 rss_max_indirection_table_length;
> > le32 supported_hash_types;
> > + le32 max_rdma_qps;
> > + le32 max_rdma_cqs;
> > };
> > \end{lstlisting}
> > The following field, \field{rss_max_key_size} only exists if VIRTIO_NET_F_RSS or VIRTIO_NET_F_HASH_REPORT is set.
> > @@ -3204,11 +3224,23 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device
> > Field \field{supported_hash_types} contains the bitmask of supported hash types.
> > See \ref{sec:Device Types / Network Device / Device Operation / Processing of Incoming Packets / Hash calculation for incoming packets / Supported/enabled hash types} for details of supported hash types.
> >
> > +Field \field{max_rdma_qps} only exists if VIRTIO_NET_F_ROCE is set.
> > +It specifies the maximum number of queue pairs (send virtqueue and receive virtqueue) for RoCE usage.
> > +
> > +Field \field{max_rdma_cqs} only exists if VIRTIO_NET_F_ROCE is set.
> > +It specifies the maximum number of completion virtqueues for RoCE usage.
> > +
> > \devicenormative{\subsubsection}{Device configuration layout}{Device Types / Network Device / Device configuration layout}
> >
> > The device MUST set \field{max_virtqueue_pairs} to between 1 and 0x8000 inclusive,
> > if it offers VIRTIO_NET_F_MQ.
> >
> > +The device MUST set \field{max_rdma_qps} to between 1 and 16384 inclusive,
> > +if it offers VIRTIO_NET_F_ROCE.
>
> I wonder why 16384 is chosen here?
>

Since the max queue number is limited to 65536 and we have three types
of queue, the number of queues of each type should be less than
65536 / 3. We choose 65536 / 4 = 16384 here.

> > +
> > +The device MUST set \field{max_rdma_cqs} to between 1 and 16384 inclusive,
> > +if it offers VIRTIO_NET_F_ROCE.
> > +
> > The device MUST set \field{mtu} to between 68 and 65535 inclusive,
> > if it offers VIRTIO_NET_F_MTU.
> >
> > @@ -3306,6 +3338,12 @@ \subsection{Device Initialization}\label{sec:Device Types / Network Device / Dev
> > \item If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated,
> > identify the control virtqueue.
> >
> > +\item If the VIRTIO_NET_F_ROCE feature bit is negotiated,
> > + identify the RDMA completion virtqueues, up to max_rdma_cqs.
> > +
> > +\item If the VIRTIO_NET_F_ROCE feature bit is negotiated,
> > + identify the RDMA send and receive virtqueues, up to max_rdma_qps.
> > +
> > \item Fill the receive queues with buffers: see \ref{sec:Device Types / Network Device / Device Operation / Setting Up Receive Buffers}.
> > > > \item Even with VIRTIO_NET_F_MQ, only receiveq1, transmitq1 and > > @@ -4007,6 +4045,7 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi > > u8 command; > > u8 command-specific-data[]; > > u8 ack; > > + u8 ack-specific-data[]; > > }; > > > > /* ack values */ > > @@ -4015,8 +4054,8 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi > > \end{lstlisting} > > > > The \field{class}, \field{command} and command-specific-data are set by the > > -driver, and the device sets the \field{ack} byte. There is little it can > > -do except issue a diagnostic if \field{ack} is not > > +driver, and the device sets the \field{ack} byte and ack-specific-data. There > > +is little it can do except issue a diagnostic if \field{ack} is not > > VIRTIO_NET_OK. > > > > \paragraph{Packet Receive Filtering}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Packet Receive Filtering} > > @@ -4463,6 +4502,534 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi > > according to the native endian of the guest rather than > > (necessarily when not using the legacy interface) little-endian. > > > > +\paragraph{RoCE Configuration}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration} > > + > > +If the driver negotiates the VIRTIO_NET_F_ROCE feature bit (depends on VIRTIO_NET_F_CTRL_VQ), > > +it can send control commands for RoCE usage. 
The following commands are defined now: > > + > > +\begin{lstlisting} > > +#define VIRTIO_NET_CTRL_ROCE 6 > > + #define VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE 0 > > + #define VIRTIO_NET_CTRL_ROCE_QUERY_PORT 1 > > + #define VIRTIO_NET_CTRL_ROCE_CREATE_CQ 2 > > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_CQ 3 > > + #define VIRTIO_NET_CTRL_ROCE_CREATE_PD 4 > > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_PD 5 > > + #define VIRTIO_NET_CTRL_ROCE_GET_DMA_MR 6 > > + #define VIRTIO_NET_CTRL_ROCE_REG_USER_MR 7 > > + #define VIRTIO_NET_CTRL_ROCE_DEREG_MR 8 > > + #define VIRTIO_NET_CTRL_ROCE_CREATE_QP 9 > > + #define VIRTIO_NET_CTRL_ROCE_MODIFY_QP 10 > > + #define VIRTIO_NET_CTRL_ROCE_QUERY_QP 11 > > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_QP 12 > > + #define VIRTIO_NET_CTRL_ROCE_CREATE_AH 13 > > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_AH 14 > > + #define VIRTIO_NET_CTRL_ROCE_ADD_GID 15 > > + #define VIRTIO_NET_CTRL_ROCE_DEL_GID 16 > > + #define VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ 17 > > +\end{lstlisting} > > + > > +\begin{description} > > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE] Query the attributes of device. > > + No command-specific-data; > > + the ack-specific-data is \field{struct virtio_rdma_ack_query_device}. > > + > > +\begin{lstlisting} > > +struct virtio_rdma_ack_query_device { > > +#define VIRTIO_IB_DEVICE_RC_RNR_NAK_GEN (1 << 0) > > What's the meaning of this capability? > It indicates whether the device supports RNR-NAK generation for RC QPs. I will add some comments. > > + /* Capabilities mask */ > > + le64 device_cap_flags; > > Will this introduce a migration compatibility issue? E.g src and dst > have the same features but different capabilities. > Should this be controlled by hypervisor since all capabilities is emulated by software. > > + /* Largest contiguous block that can be registered */ > > + le64 max_mr_size; > > + /* Supported memory shift sizes */ > > + le64 page_size_cap; > > + /* Hardware version */ > > + le32 hw_ver; > > What did "hardware version" mean? 
Is this something that is defined in > the IB spec? > Yes, it's defined in IB spec. > > + /* Maximum number of outstanding Work Requests (WR) on Send Queue (SQ) and Receive Queue (RQ) */ > > + le32 max_qp_wr; > > Is this implied in the virtqueue size? If not, why? > Yes. Will remove it. > > + /* Maximum number of scatter/gather (s/g) elements per WR for SQ for non RDMA Read operations */ > > + le32 max_send_sge; > > + /* Maximum number of s/g elements per WR for RQ for non RDMA Read operations */ > > + le32 max_recv_sge; > > + /* Maximum number of s/g per WR for RDMA Read operations */ > > + le32 max_sge_rd; > > + /* Maximum size of Completion Queue (CQ) */ > > + le32 max_cqe; > > Need to specify the reason why we can't use the virtqueue size for the > completion queue. > I think we can. Will remove it > > + /* Maximum number of Memory Regions (MR) */ > > + le32 max_mr; > > + /* Maximum number of Protection Domains (PD) */ > > + le32 max_pd; > > + /* Maximum number of RDMA Read perations that can be outstanding per Queue Pair (QP) */ > > I guess you mean "operations" here. > Yes. > > + le32 max_qp_rd_atom; > > + /* Maximum depth per QP for initiation of RDMA Read operations */ > > The member has an "atom" suffix, does it mean "atomic read" or other? > It means the atomic operation which is unsupported now. I think we need to remove it. > > + le32 max_qp_init_rd_atom; > > + /* Maximum number of Address Handles (AH) */ > > + le32 max_ah; > > + /* Local CA ack delay */ > > + u8 local_ca_ack_delay; > > + /* Padding */ > > + u8 padding[3]; > > + /* Reserved for future */ > > + le32 reserved[14]; > > +}; > > +\end{lstlisting} > > + > > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_PORT] Query the attributes of port. > > + No command-specific-data; > > + the ack-specific-data is \field{struct virtio_rdma_ack_query_port}. 
> > + > > +\begin{lstlisting} > > +struct virtio_rdma_ack_query_port { > > + /* Length of source Global Identifier (GID) table */ > > + le32 gid_tbl_len; > > + /* Maximum message size */ > > + le32 max_msg_sz; > > I guess this is for both read/write/send/receive? And is 4GB > sufficient for the future? > Now this follows the definition in linux kernel and IB Spec. If we need to extend it in future, we can add a new field max_msg_sz64? > > + /* Reserved for future */ > > + le32 reserved[6]; > > +}; > > +\end{lstlisting} > > + > > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_CQ] Create a Completion Queue (CQ). > > + The command-specific-data is \field{struct virtio_rdma_cmd_create_cq}; > > + the ack-specific-data is \field{struct virtio_rdma_ack_create_cq}. > > + > > +\begin{lstlisting} > > +struct virtio_rdma_cmd_create_cq { > > + /* Size of CQ */ > > + le32 cqe; > > +}; > > + > > +struct virtio_rdma_ack_create_cq { > > + /* The index of CQ */ > > + le32 cqn; > > +}; > > +\end{lstlisting} > > + > > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_CQ] Destroy a Completion Queue. > > + The command-specific-data is \field{struct virtio_rdma_cmd_destroy_cq}; > > + no ack-specific-data. > > + > > +\begin{lstlisting} > > +struct virtio_rdma_cmd_destroy_cq { > > + /* The index of CQ */ > > + le32 cqn; > > +}; > > +\end{lstlisting} > > + > > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_PD] Create a Protection Domain (PD). > > + No command-specific-data; > > + the ack-specific-data is \field{struct virtio_rdma_ack_create_pd}. > > + > > +\begin{lstlisting} > > +struct virtio_rdma_ack_create_pd { > > + /* The handle of PD */ > > + le32 pdn; > > +}; > > +\end{lstlisting} > > Can this command always succeed? I meant is there a limit of the total > number of PDs that a single ROCE device can support? > Yes, we have max_pd field in structure virtio_rdma_ack_query_device. > > + > > +\item[VIRTIO_NET_CTRL_ROCE_DESTORY_PD] Destroy a Protection Domain. 
> > + The command-specific-data is \field{virtio_rdma_cmd_destroy_pd}; > > + no ack-specific-data. > > + > > +\begin{lstlisting} > > +struct virtio_rdma_cmd_destroy_pd { > > + /* The handle of PD */ > > + le32 pdn; > > +}; > > +\end{lstlisting} > > + > > +\item[VIRTIO_NET_CTRL_ROCE_GET_DMA_MR] Get the DMA Memory Region (MR). > > + associated with one protection domain. > > I wonder what's the difference between VIRTIO_NET_CTRL_ROCE_GET_DMA_MR > and USR_MR. Can we unify them? > We should pass some address for USER_MR. I think we can unify them if we want. > > + The command-specific-data is \field{virtio_rdma_cmd_get_dma_mr}; > > + the ack-specific-data is \field{virtio_rdma_ack_get_dma_mr}. > > + > > +\begin{lstlisting} > > +enum virtio_ib_access_flags { > > + VIRTIO_IB_ACCESS_LOCAL_WRITE = (1 << 0), > > Is LOCAL_READ implied to work always? > Yes, the LOCAL_READ is always supported. > > + VIRTIO_IB_ACCESS_REMOTE_WRITE = (1 << 1), > > + VIRTIO_IB_ACCESS_REMOTE_READ = (1 << 2), > > +}; > > + > > +struct virtio_rdma_cmd_get_dma_mr { > > + /* The handle of PD which the MR associated with */ > > + le32 pdn; > > + /* MR's protection attributes, enum virtio_ib_access_flags */ > > + le32 access_flags; > > +}; > > + > > +struct virtio_rdma_ack_get_dma_mr { > > + /* The handle of MR */ > > + le32 mrn; > > + /* MR's local access key */ > > + le32 lkey; > > + /* MR's remote access key */ > > + le32 rkey; > > +}; > > +\end{lstlisting} > > + > > +\item[VIRTIO_NET_CTRL_ROCE_REG_USER_MR] Register a user Memory Region > > + associated with one Protection Domain. > > + The command-specific-data is \field{virtio_rdma_cmd_reg_user_mr}; > > + the ack-specific-data is \field{virtio_rdma_ack_reg_user_mr}. 
> > + > > +\begin{lstlisting} > > +struct virtio_rdma_cmd_reg_user_mr { > > + /* The handle of PD which the MR associated with */ > > + le32 pdn; > > + /* MR's protection attributes, enum virtio_ib_access_flags */ > > + le32 access_flags; > > + /* Starting virtual address of MR */ > > + le64 virt_addr; > > I guess this is actually the I/O virtual address and the device is in > charge of translate it to the page arrays below? > Yes, this address is specified by userspace, which can be a virtual address or not. > > + /* Length of MR */ > > + le64 length; > > + /* Size of the below page array */ > > + le32 npages; > > + /* Padding */ > > + le32 padding; > > + /* Array to store physical address of each page in MR */ > > + le64 pages[]; > > How do device know the size of a page? > We have npages field in this struture. > > +}; > > I believe this command can fail, we need to describe the error conditions. > OK. > > + > > +struct virtio_rdma_ack_reg_user_mr { > > + /* The handle of MR */ > > + le32 mrn; > > + /* MR's local access key */ > > + le32 lkey; > > + /* MR's remote access key */ > > + le32 rkey; > > +}; > > +\end{lstlisting} > > + > > +\item[VIRTIO_NET_CTRL_ROCE_DEREG_MR] De-register a Memory Region. > > + The command-specific-data is \field{virtio_rdma_cmd_dereg_mr}; > > + no ack-specific-data. > > + > > +\begin{lstlisting} > > +struct virtio_rdma_cmd_dereg_mr { > > + /* The handle of MR */ > > + le32 mrn; > > +}; > > +\end{lstlisting} > > + > > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_QP] Create a Queue Pair (Send Queue and Receive Queue). > > + The command-specific-data is \field{virtio_rdma_cmd_create_qp}; > > + the ack-specific-data is \field{virtio_rdma_ack_create_qp}. 
> > + > > +\begin{lstlisting} > > +struct virtio_rdma_qp_cap { > > + /* Maximum number of outstanding WRs in SQ */ > > + le32 max_send_wr; > > + /* Maximum number of outstanding WRs in RQ */ > > + le32 max_recv_wr; > > + /* Maximum number of s/g elements per WR in SQ */ > > + le32 max_send_sge; > > + /* Maximum number of s/g elements per WR in RQ */ > > + le32 max_recv_sge; > > + /* Maximum number of data (bytes) that can be posted inline to SQ */ > > + le32 max_inline_data; > > + /* Padding */ > > + le32 padding; > > +}; > > + > > +struct virtio_rdma_cmd_create_qp { > > + /* The handle of PD which the QP associated with */ > > + le32 pdn; > > +#define VIRTIO_IB_QPT_SMI 0 > > +#define VIRTIO_IB_QPT_GSI 1 > > +#define VIRTIO_IB_QPT_RC 2 > > +#define VIRTIO_IB_QPT_UC 3 > > +#define VIRTIO_IB_QPT_UD 4 > > + /* QP's type */ > > + u8 qp_type; > > + /* If set, each WR submitted to the SQ generates a completion entry */ > > + u8 sq_sig_all; > > + /* Padding */ > > + u8 padding[2]; > > + /* The index of CQ which the SQ associated with */ > > + le32 send_cqn; > > + /* The index of CQ which the RQ associated with */ > > + le32 recv_cqn; > > + /* QP's capabilities */ > > + struct virtio_rdma_qp_cap cap; > > + /* Reserved for future */ > > + le32 reserved[4]; > > +}; > > + > > +struct virtio_rdma_ack_create_qp { > > + /* The index of QP */ > > + le32 qpn; > > +}; > > +\end{lstlisting} > > + > > +\item[VIRTIO_NET_CTRL_ROCE_MODIFY_QP] Modify the attributes of a Queue Pair. > > + The command-specific-data is \field{virtio_rdma_cmd_modify_qp}; > > + no ack-specific-data. 
> > + > > +\begin{lstlisting} > > +struct virtio_rdma_global_route { > > + /* Destination GID or MGID */ > > + u8 dgid[16]; > > + /* Flow label */ > > + le32 flow_label; > > + /* Source GID index */ > > + u8 sgid_index; > > + /* Hop limit */ > > + u8 hop_limit; > > + /* Traffic class */ > > + u8 traffic_class; > > + /* Padding */ > > + u8 padding; > > +}; > > + > > +struct virtio_rdma_ah_attr { > > + /* Global Routing Header (GRH) attributes */ > > + virtio_rdma_global_route grh; > > + /* Destination MAC address */ > > + u8 dmac[6]; > > + /* Reserved for future */ > > + u8 reserved[10]; > > +}; > > + > > +enum virtio_ib_qp_attr_mask { > > + VIRTIO_IB_QP_STATE = (1 << 0), > > + VIRTIO_IB_QP_CUR_STATE = (1 << 1), > > + VIRTIO_IB_QP_ACCESS_FLAGS = (1 << 2), > > + VIRTIO_IB_QP_QKEY = (1 << 3), > > + VIRTIO_IB_QP_AV = (1 << 4), > > + VIRTIO_IB_QP_PATH_MTU = (1 << 5), > > + VIRTIO_IB_QP_TIMEOUT = (1 << 6), > > + VIRTIO_IB_QP_RETRY_CNT = (1 << 7), > > + VIRTIO_IB_QP_RNR_RETRY = (1 << 8), > > + VIRTIO_IB_QP_RQ_PSN = (1 << 9), > > + VIRTIO_IB_QP_MAX_QP_RD_ATOMIC = (1 << 10), > > + VIRTIO_IB_QP_MIN_RNR_TIMER = (1 << 11), > > + VIRTIO_IB_QP_SQ_PSN = (1 << 12), > > + VIRTIO_IB_QP_MAX_DEST_RD_ATOMIC = (1 << 13), > > + VIRTIO_IB_QP_CAP = (1 << 14), > > + VIRTIO_IB_QP_DEST_QPN = (1 << 15), > > + VIRTIO_IB_QP_RATE_LIMIT = (1 << 16), > > +}; > > Do we need to explain the above error codes? Or it's simply a map from IB spec? > Yes, it's defined in IB spec. But we can add some comments for them too. 
> > + > > +enum virtio_ib_qp_state { > > + VIRTIO_IB_QPS_RESET, > > + VIRTIO_IB_QPS_INIT, > > + VIRTIO_IB_QPS_RTR, > > + VIRTIO_IB_QPS_RTS, > > + VIRTIO_IB_QPS_SQD, > > + VIRTIO_IB_QPS_SQE, > > + VIRTIO_IB_QPS_ERR > > +}; > > + > > +enum virtio_ib_mtu { > > + VIRTIO_IB_MTU_256 = 1, > > + VIRTIO_IB_MTU_512 = 2, > > + VIRTIO_IB_MTU_1024 = 3, > > + VIRTIO_IB_MTU_2048 = 4, > > + VIRTIO_IB_MTU_4096 = 5 > > +}; > > + > > +struct virtio_rdma_cmd_modify_qp { > > + /* The index of QP */ > > + le32 qpn; > > + /* The mask of attributes needs to be modified, enum virtio_ib_qp_attr_mask */ > > + le32 attr_mask; > > + /* Move the QP to this state, enum virtio_ib_qp_state */ > > + u8 qp_state; > > + /* Current QP state, enum virtio_ib_qp_state */ > > + u8 cur_qp_state; > > + /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */ > > + u8 path_mtu; > > + /* Number of outstanding RDMA Read operations on destination QP (valid only for RC QPs) */ > > + u8 max_rd_atomic; > > + /* Number of responder resources for handling incoming RDMA Read operations (valid only for RC QPs) */ > > + u8 max_dest_rd_atomic; > > + /* Minimum RNR (Receiver Not Ready) NAK timer (valid only for RC QPs) */ > > + u8 min_rnr_timer; > > + /* Local ack timeout (valid only for RC QPs) */ > > + u8 timeout; > > + /* Retry count (valid only for RC QPs) */ > > + u8 retry_cnt; > > + /* RNR retry (valid only for RC QPs) */ > > + u8 rnr_retry; > > + /* Padding */ > > + u8 padding[7]; > > + /* Q_Key for the QP (valid only for UD QPs) */ > > + le32 qkey; > > + /* PSN for RQ (valid only for RC/UC QPs) */ > > + le32 rq_psn; > > + /* PSN for SQ */ > > + le32 sq_psn; > > + /* Destination QP number (valid only for RC/UC QPs) */ > > + le32 dest_qp_num; > > + /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */ > > + le32 qp_access_flags; > > + /* Rate limit in kbps for packet pacing */ > > + le32 rate_limit; > > + /* QP capabilities */ > > + struct virtio_rdma_qp_cap 
cap; > > + /* Address Vector (valid only for RC/UC QPs) */ > > + struct virtio_rdma_ah_attr ah_attr; > > + /* Reserved for future */ > > + le32 reserved[4]; > > +}; > > +\end{lstlisting} > > + > > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_QP] Query the attributes of a Queue Pair. > > + The command-specific-data is \field{virtio_rdma_cmd_query_qp}; > > + the ack-specific-data is \field{virtio_rdma_ack_query_qp}. > > + > > +\begin{lstlisting} > > +struct virtio_rdma_cmd_query_qp { > > + /* The index of QP */ > > + le32 qpn; > > + /* The mask of attributes need to be queried, enum virtio_ib_qp_attr_mask */ > > + le32 attr_mask; > > +}; > > + > > +struct virtio_rdma_ack_query_qp { > > Any chance to unify this with virtio_rdma_cmd_modify_qp? > It would be a little confusing since some states is only used by modify_qp. > > + /* Move the QP to this state, enum virtio_ib_qp_state */ > > + u8 qp_state; > > + /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */ > > + u8 path_mtu; > > + /* Is the SQ draining */ > > + u8 sq_draining; > > + /* Number of outstanding RDMA read operations on destination QP (valid only for RC QPs) */ > > + u8 max_rd_atomic; > > + /* Number of responder resources for handling incoming RDMA read operations (valid only for RC QPs) */ > > + u8 max_dest_rd_atomic; > > + /* Minimum RNR NAK timer (valid only for RC QPs) */ > > + u8 min_rnr_timer; > > + /* Local ack timeout (valid only for RC QPs) */ > > + u8 timeout; > > + /* Retry count (valid only for RC QPs) */ > > + u8 retry_cnt; > > + /* RNR retry (valid only for RC QPs) */ > > + u8 rnr_retry; > > + /* Padding */ > > + u8 padding[7]; > > + /* Q_Key for the QP (valid only for UD QPs) */ > > + le32 qkey; > > + /* PSN for RQ (valid only for RC/UC QPs) */ > > + le32 rq_psn; > > + /* PSN for SQ */ > > + le32 sq_psn; > > + /* Destination QP number (valid only for RC/UC QPs) */ > > + le32 dest_qp_num; > > + /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags 
*/
> > +        le32 qp_access_flags;
> > +        /* Rate limit in kbps for packet pacing */
> > +        le32 rate_limit;
> > +        /* QP capabilities */
> > +        struct virtio_rdma_qp_cap cap;
> > +        /* Address Vector (valid only for RC/UC QPs) */
> > +        struct virtio_rdma_ah_attr ah_attr;
> > +        /* Reserved for future */
> > +        le32 reserved[4];
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_QP] Destroy a Queue Pair.
> > +        The command-specific-data is \field{virtio_rdma_cmd_destroy_qp};
> > +        no ack-specific-data.
>
> What happens to the pending requests? Will the device wait for their
> completion or not?
>

They should be discarded according to the IB spec.

> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_destroy_qp {
> > +        /* The index of QP */
> > +        le32 qpn;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_AH] Create an Address Handle (AH).
> > +        The command-specific-data is \field{virtio_rdma_cmd_create_ah};
> > +        the ack-specific-data is \field{virtio_rdma_ack_create_ah}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_create_ah {
> > +        /* The handle of the PD with which the AH is associated */
> > +        le32 pdn;
> > +        /* Padding */
> > +        le32 padding;
> > +        /* Address Vector */
> > +        struct virtio_rdma_ah_attr ah_attr;
> > +};
> > +
> > +struct virtio_rdma_ack_create_ah {
> > +        /* The address handle */
> > +        le32 ah;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_AH] Destroy an Address Handle.
> > +        The command-specific-data is \field{virtio_rdma_cmd_destroy_ah};
> > +        no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_destroy_ah {
> > +        /* The handle of the PD with which the AH is associated */
> > +        le32 pdn;
> > +        /* The address handle */
> > +        le32 ah;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_ADD_GID] Add a Global Identifier (GID).
> > +        The command-specific-data is \field{virtio_rdma_cmd_add_gid};
> > +        no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_add_gid {
> > +        /* Index of GID */
> > +        le16 index;
> > +        /* Padding */
> > +        le16 padding[3];
> > +        /* GID to be added */
> > +        u8 gid[16];
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DEL_GID] Delete a Global Identifier.
> > +        The command-specific-data is \field{virtio_rdma_cmd_del_gid};
> > +        no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_del_gid {
> > +        /* Index of GID */
> > +        le16 index;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ] Request a completion notification
> > +        on a Completion Queue.
> > +        The command-specific-data is \field{virtio_rdma_cmd_req_notify};
> > +        no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_req_notify {
> > +        /* The index of CQ */
> > +        le32 cqn;
> > +#define VIRTIO_IB_NOTIFY_SOLICITED (1 << 0)
> > +#define VIRTIO_IB_NOTIFY_NEXT_COMPLETION (1 << 1)
>
> Need to describe the differences between those two flags.
>

OK.

> > +        /* Notify flags */
> > +        le32 flags;
> > +};
> > +\end{lstlisting}
> > +
> > +\end{description}
> > +
> > +\drivernormative{\subparagraph}{RoCE Configuration}{Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration}
> > +
> > +A driver MUST initialize the completion virtqueue and fill it with
> > +enough entries after command VIRTIO_NET_CTRL_ROCE_CREATE_CQ is
> > +successfully executed.
> > +
> > +A driver MUST reset the completion virtqueue after
>
> How is the reset done? Do you mean the driver needs to reset the indices?
>

Yes, something like avail_idx, used_idx.

> > +command VIRTIO_NET_CTRL_ROCE_DESTROY_CQ is successfully executed.
> > +
> > +A driver MUST initialize the send virtqueue and receive virtqueue after
> > +command VIRTIO_NET_CTRL_ROCE_CREATE_QP is successfully executed.
> > +
> > +A driver MUST reset the send virtqueue and receive virtqueue after
> > +command VIRTIO_NET_CTRL_ROCE_DESTROY_QP is successfully executed.
> >
> >  \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
> >  Types / Network Device / Legacy Interface: Framing Requirements}
> > @@ -4496,6 +5063,289 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
> >  See \ref{sec:Basic
> >  Facilities of a Virtio Device / Virtqueues / Message Framing}.
> >
> > +\subsubsection{RoCE Support}\label{sec:Device Types / Network Device / Device Operation / RoCE Support}
> > +
> > +RDMA over Converged Ethernet (RoCE) is a network protocol that allows
> > +remote direct memory access (RDMA) over an Ethernet network. To support
> > +RoCE (if VIRTIO_NET_F_ROCE is negotiated), in addition to the control
> > +virtqueue support mentioned in \ref{sec:Device Types / Network Device /
> > +Device Operation / Control Virtqueue / RoCE Configuration}, multiple
> > +types of virtqueues including send virtqueue, receive virtqueue and
> > +completion virtqueue are introduced.
> > +
> > +The send virtqueue contains elements that describe the data to be
> > +transmitted.
> > + > > +Requests (device-readable) have the following format: > > + > > +\begin{lstlisting} > > +enum virtio_ib_wr_opcode { > > + VIRTIO_IB_WR_RDMA_WRITE, > > + VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM, > > + VIRTIO_IB_WR_SEND, > > + VIRTIO_IB_WR_SEND_WITH_IMM, > > + VIRTIO_IB_WR_RDMA_READ, > > +}; > > + > > +struct virtio_rdma_sge { > > + le64 addr; > > + le32 length; > > + le32 lkey; > > +}; > > + > > +struct virtio_rdma_sq_req { > > + /* User defined WR ID */ > > + le64 wr_id; > > + /* WR opcode, enum virtio_ib_wr_opcode */ > > + u8 opcode; > > +#define VIRTIO_IB_SEND_FENCE (1 << 0) > > +#define VIRTIO_IB_SEND_SIGNALED (1 << 1) > > +#define VIRTIO_IB_SEND_SOLICITED (1 << 2) > > +#define VIRTIO_IB_SEND_INLINE (1 << 3) > > + /* Flags of the WR properties */ > > + u8 send_flags; > > + /* Padding */ > > + le16 padding; > > + /* Immediate data (in network byte order) to send */ > > + le32 imm_data; > > + union { > > + struct { > > + /* Start address of remote memory buffer */ > > + le64 remote_addr; > > + /* Key of the remote MR */ > > + le32 rkey; > > + } rdma; > > + struct { > > + /* Index of the destination QP */ > > + le32 remote_qpn; > > + /* Q_Key of the destination QP */ > > + le32 remote_qkey; > > + /* Address Handle */ > > + le32 ah; > > + } ud; > > + /* Reserved for future */ > > + le64 reserved[4]; > > + }; > > + /* Inline data */ > > + u8 inline_data[512]; > > + union { > > + /* Length of sg_list */ > > + le32 num_sge; > > + /* Length of inline data */ > > + le16 inline_len; > > + }; > > + /* Reserved for future */ > > + le32 reserved2[3]; > > + /* Scatter/gather list */ > > + struct virtio_rdma_sge sg_list[]; > > +}; > > +\end{lstlisting} > > + > > +The receive virtqueue contains elements that describe where to place incoming data. 
> > + > > +Requests (device-readable) have the following format: > > + > > +\begin{lstlisting} > > +struct virtio_rdma_rq_req { > > + /* User defined WR ID */ > > + le64 wr_id; > > + /* Length of sg_list */ > > + le32 num_sge; > > + /* Reserved for future */ > > + le32 reserved[3]; > > + /* Scatter/gather list */ > > + struct virtio_rdma_sge sg_list[]; > > +}; > > +\end{lstlisting} > > + > > +The completion virtqueue is used to notify the completion of requests in > > +send virtqueue or receive virtqueue. > > + > > +Requests (device-writable) have the following format: > > + > > +\begin{lstlisting} > > +enum virtio_ib_wc_opcode { > > + VIRTIO_IB_WC_SEND, > > + VIRTIO_IB_WC_RDMA_WRITE, > > + VIRTIO_IB_WC_RDMA_READ, > > + VIRTIO_IB_WC_RECV, > > + VIRTIO_IB_WC_RECV_RDMA_WITH_IMM, > > +}; > > + > > +enum virtio_ib_wc_status { > > + /* Operation completed successfully */ > > + VIRTIO_IB_WC_SUCCESS, > > + /* Local Length Error */ > > + VIRTIO_IB_WC_LOC_LEN_ERR, > > + /* Local QP Operation Error */ > > + VIRTIO_IB_WC_LOC_QP_OP_ERR, > > + /* Local Protection Error */ > > + VIRTIO_IB_WC_LOC_PROT_ERR, > > + /* Work Request Flushed Error */ > > + VIRTIO_IB_WC_WR_FLUSH_ERR, > > + /* Bad Response Error */ > > + VIRTIO_IB_WC_BAD_RESP_ERR, > > + /* Local Access Error */ > > + VIRTIO_IB_WC_LOC_ACCESS_ERR, > > + /* Remote Invalid Request Error */ > > + VIRTIO_IB_WC_REM_INV_REQ_ERR, > > + /* Remote Access Error */ > > + VIRTIO_IB_WC_REM_ACCESS_ERR, > > + /* Remote Operation Error */ > > + VIRTIO_IB_WC_REM_OP_ERR, > > + /* Transport Retry Counter Exceeded */ > > + VIRTIO_IB_WC_RETRY_EXC_ERR, > > + /* RNR Retry Counter Exceeded */ > > + VIRTIO_IB_WC_RNR_RETRY_EXC_ERR, > > + /* Remote Aborted Error */ > > + VIRTIO_IB_WC_REM_ABORT_ERR, > > + /* Fatal Error */ > > + VIRTIO_IB_WC_FATAL_ERR, > > + /* Response Timeout Error */ > > + VIRTIO_IB_WC_RESP_TIMEOUT_ERR, > > + /* General Error */ > > + VIRTIO_IB_WC_GENERAL_ERR > > +}; > > + > > +struct virtio_rdma_cq_req { > > + /* User defined WR 
ID */
> > +        le64 wr_id;
> > +        /* Work completion status, enum virtio_ib_wc_status */
> > +        u8 status;
> > +        /* WR opcode, enum virtio_ib_wc_opcode */
> > +        u8 opcode;
> > +        /* Padding */
> > +        le16 padding;
> > +        /* Vendor error */
> > +        le32 vendor_err;
> > +        /* Number of bytes transferred */
> > +        le32 byte_len;
> > +        /* Immediate data (in network byte order) to send */
> > +        le32 imm_data;
> > +        /* Local QP number of completed WR */
> > +        le32 qp_num;
> > +        /* Source QP number (remote QP number) of completed WR (valid only for UD QPs) */
> > +        le32 src_qp;
> > +#define VIRTIO_IB_WC_GRH (1 << 0)
> > +#define VIRTIO_IB_WC_WITH_IMM (1 << 1)
> > +        /* Work completion flag */
> > +        le32 wc_flags;
> > +        /* Reserved for future */
> > +        le32 reserved[3];
> > +};
> > +\end{lstlisting}
> > +
> > +\paragraph{Send Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
> > +
> > +The send operation allows us to send data to a remote QP’s Receive Queue.
> > +The receiver MUST have previously posted a receive buffer to receive the data.
>
> The "MUST" keyword belongs in a normative section.
>

OK.

> > +
> > +To do a send operation, a request with \field{opcode} set to
> > +VIRTIO_IB_WR_SEND or VIRTIO_IB_WR_SEND_WITH_IMM MUST be posted to the Send
> > +Queue as one output descriptor and the device is notified of the new entry.
> > +
> > +\drivernormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
> > +
> > +If VIRTIO_IB_SEND_INLINE is set in \field{send_flags}, the driver MUST copy
> > +the send buffer into the \field{inline_data} field and set \field{inline_len} to the
> > +length of the buffer. Otherwise, the driver MUST fill \field{sg_list} to
> > +describe the buffer.
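
The driver requirement quoted above (inline data versus scatter/gather list) can be sketched in C. This is only an illustration and not part of the patch: the structs are simplified stand-ins for virtio_rdma_sge and virtio_rdma_sq_req with host-endian types, and the helper name sq_req_fill_send is hypothetical. In a real driver the sg address would be an address covered by a registered MR, not a raw pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flag value and inline buffer size as given in the quoted patch. */
#define VIRTIO_IB_SEND_INLINE (1 << 3)
#define SQ_INLINE_MAX 512

/* Mirrors struct virtio_rdma_sge, with host-endian types for brevity. */
struct sge_sketch {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/* Simplified stand-in for struct virtio_rdma_sq_req (assumption: only the
 * fields needed to show the inline vs. sg_list choice are kept). */
struct sq_req_sketch {
    uint64_t wr_id;
    uint8_t send_flags;
    uint8_t inline_data[SQ_INLINE_MAX];
    uint32_t num_sge;    /* shares a union with inline_len in the patch */
    uint16_t inline_len;
};

/* Hypothetical helper: returns 0 on success, -1 if inline data is too large. */
static int sq_req_fill_send(struct sq_req_sketch *req, uint64_t wr_id,
                            const void *buf, uint32_t len, int use_inline,
                            struct sge_sketch *sg, uint32_t lkey)
{
    assert(req != NULL && buf != NULL);
    memset(req, 0, sizeof(*req));
    req->wr_id = wr_id;

    if (use_inline) {
        /* Inline: the payload travels inside the request itself. */
        if (len > SQ_INLINE_MAX)
            return -1;
        req->send_flags |= VIRTIO_IB_SEND_INLINE;
        memcpy(req->inline_data, buf, len);
        req->inline_len = (uint16_t)len;
        return 0;
    }

    /* Non-inline: one scatter/gather element describes the buffer. */
    assert(sg != NULL);
    sg->addr = (uint64_t)(uintptr_t)buf;
    sg->length = len;
    sg->lkey = lkey;
    req->num_sge = 1;
    return 0;
}
```

The two paths are kept mutually exclusive because num_sge and inline_len share a union in the proposed layout, so only one of them is meaningful for a given request.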
> > +
> > +\devicenormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
> > +
> > +If \field{opcode} is not set to VIRTIO_IB_WR_SEND_WITH_IMM, the device MUST
> > +ignore \field{imm_data}.
> > +
> > +If the QP type is UD, the device MUST validate \field{ud.ah}.
> > +
> > +If VIRTIO_IB_SEND_INLINE is not set in \field{send_flags}, the device MUST
> > +validate the \field{addr}, \field{length} and \field{lkey} in \field{sg_list}.
> > +
> > +\paragraph{Receive Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
> > +
> > +The receive operation allows us to receive data from a remote QP.
> > +It's the corresponding operation to a send operation.
> > +
> > +To do a receive operation, a request MUST be posted to the Receive
> > +Queue as one output descriptor and the device is notified of the new entry.
> > +
>
> I think we probably need to be more verbose, as has been done for
> virtio-net.
>
> That is, describe what needs to be filled in virtio_rdma_rq_req in
> detail. (And do this for the other operations as well.)
>

OK.

>
> > +\drivernormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
> > +
> > +The driver MUST fill \field{sg_list} to describe the receive buffer.
> > +
> > +\devicenormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
> > +
> > +The device MUST validate the \field{addr}, \field{length} and \field{lkey}
> > +in \field{sg_list}.
> > +
> > +\paragraph{Write Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
> > +
> > +The write operation allows us to write data to the local memory buffer
> > +on the remote side with no notification. The remote side wouldn't be aware
> > +that this operation is being done.
> > +
> > +To do a write operation, a request with \field{opcode} set to
> > +VIRTIO_IB_WR_RDMA_WRITE or VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM MUST be
> > +posted to the Send Queue as one output descriptor and the device is
> > +notified of the new entry.
> > +
> > +\drivernormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
> > +
> > +The driver MUST fill \field{sg_list} to describe the write buffer.
>
> So sg is a must even if the driver wants to use imm?
>

It looks like it is not. I will fix it.

> > +
> > +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to
> > +identify the remote buffer.
> > +
> > +\devicenormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
> > +
> > +If \field{opcode} is not set to VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM, the device
> > +MUST ignore \field{imm_data}.
> > +
> > +The device MUST validate the \field{addr}, \field{length} and \field{lkey}
> > +in \field{sg_list}.
> > +
> > +\paragraph{Read Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
> > +
> > +The read operation allows us to read data from the local memory buffer
> > +on the remote side with no notification. The remote side wouldn't be aware
> > +that this operation is being done.
> > +
> > +To do a read operation, a request with \field{opcode} set to
> > +VIRTIO_IB_WR_RDMA_READ MUST be posted to the Send Queue as one output
> > +descriptor and the device is notified of the new entry.
> > +
> > +\drivernormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
> > +
> > +The driver MUST fill \field{sg_list} to describe the read buffer.
> > +
> > +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to
> > +identify the remote buffer.
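
The per-opcode rules quoted above (which requests carry immediate data, and which ones name a remote buffer through rdma.remote_addr and rdma.rkey) could be expressed on the device side roughly as follows. This is a sketch under assumptions: the enum repeats the opcode order from the patch under shortened names, and the three helper functions are hypothetical, not taken from any existing implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Opcode order as listed in enum virtio_ib_wr_opcode in the patch. */
enum wr_opcode_sketch {
    WR_RDMA_WRITE,
    WR_RDMA_WRITE_WITH_IMM,
    WR_SEND,
    WR_SEND_WITH_IMM,
    WR_RDMA_READ,
};

/* True for the two opcodes whose imm_data field is meaningful. */
static bool wr_carries_imm(uint8_t opcode)
{
    return opcode == WR_SEND_WITH_IMM || opcode == WR_RDMA_WRITE_WITH_IMM;
}

/* True for opcodes that use rdma.remote_addr and rdma.rkey. */
static bool wr_uses_remote_buffer(uint8_t opcode)
{
    return opcode == WR_RDMA_WRITE ||
           opcode == WR_RDMA_WRITE_WITH_IMM ||
           opcode == WR_RDMA_READ;
}

/* The device ignores imm_data unless the opcode carries it; here
 * "ignore" is modelled by substituting zero. */
static uint32_t wr_effective_imm(uint8_t opcode, uint32_t imm_data)
{
    assert(opcode <= WR_RDMA_READ);
    return wr_carries_imm(opcode) ? imm_data : 0;
}
```

Keeping these predicates in one place makes it harder for the send, write and read paths to disagree about which fields of the request are valid.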
> > +
> > +\devicenormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
> > +
> > +The device MUST validate the \field{addr}, \field{length} and \field{lkey}
> > +in \field{sg_list}.
> > +
> > +\paragraph{Completion Notification}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
> > +
> > +After the above operation is completed, a completion notification MUST
> > +be triggered by the device.
>
> For "completion notification", do you mean the virtqueue notification
> of the cq or making the buffer that contains the cqe used?
>

Both? Making the buffer that contains the cqe used and notifying the virtqueue.

> > To achieve that, the device MUST consume
> > +an entry of the Completion Queue associated with the Send Queue/Receive
> > +Queue which the operation belongs to.
> > +
> > +\drivernormative{\subparagraph}{Completion Notification}{Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
> > +
> > +The driver MUST fill the Completion Queue with enough entries previously.
>
> What do you mean by "previously"? What happens if there are not enough cqes?
>

We need to fill the Completion Queue in advance. Otherwise, the driver
would not get a completion notification after an operation is completed.

Thanks,
Yongji
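
One way a driver might honor the "fill the Completion Queue in advance" requirement discussed above is to track how many completion buffers the device still holds and refuse to post more work than that. The sketch below is an assumption for illustration only: the struct and function names are hypothetical, and it assumes every work request produces exactly one completion (as when sq_sig_all is set).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical driver-side bookkeeping for one completion virtqueue. */
struct cq_budget {
    uint32_t posted_cqes;      /* CQ buffers currently available to the device */
    uint32_t outstanding_wrs;  /* posted WRs that have not completed yet */
};

/* Post a WR only if a completion buffer is guaranteed to be available. */
static bool cq_budget_post_wr(struct cq_budget *b)
{
    if (b->outstanding_wrs >= b->posted_cqes)
        return false;
    b->outstanding_wrs++;
    return true;
}

/* The device consumed one CQ buffer to report one finished WR. */
static void cq_budget_complete(struct cq_budget *b)
{
    assert(b->outstanding_wrs > 0 && b->posted_cqes > 0);
    b->outstanding_wrs--;
    b->posted_cqes--;
}

/* The driver handed n fresh buffers back to the completion virtqueue. */
static void cq_budget_refill(struct cq_budget *b, uint32_t n)
{
    b->posted_cqes += n;
}
```

With this accounting the device never runs out of completion buffers for work it has already accepted, which is the situation the reviewer asked about.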
diff --git a/content.tex b/content.tex index 7508dd1..646d82a 100644 --- a/content.tex +++ b/content.tex @@ -3008,7 +3008,10 @@ \section{Network Device}\label{sec:Device Types / Network Device} placed in one virtqueue for receiving packets, and outgoing packets are enqueued into another for transmission in that order. A third command queue is used to control advanced filtering -features. +features. And if RoCE (RDMA over Converged Ethernet) capability +is enabled, the virtio network device can also support transmitting +and receiving RDMA message through RDMA send virtqueue, RDMA receive +virtqueue and RDMA completion virtqueue. \subsection{Device ID}\label{sec:Device Types / Network Device / Device ID} @@ -3023,13 +3026,24 @@ \subsection{Virtqueues}\label{sec:Device Types / Network Device / Virtqueues} \item[2(N-1)] receiveqN \item[2(N-1)+1] transmitqN \item[2N] controlq +\item[2N+1] rdma_completeq1 +\item[\ldots] +\item[2N+M] rdma_completeqM +\item[2N+M+1] rdma_transmitq1 +\item[2N+M+2] rdma_receiveq1 +\item[\ldots] +\item[2N+M+2L-1] rdma_transmitqL +\item[2N+M+2L] rdma_receiveqL \end{description} N=1 if neither VIRTIO_NET_F_MQ nor VIRTIO_NET_F_RSS are negotiated, otherwise N is set by - \field{max_virtqueue_pairs}. + \field{max_virtqueue_pairs}. M is set by \field{max_rdma_cqs} and L is set by + \field{max_rdma_qps}. controlq only exists if VIRTIO_NET_F_CTRL_VQ set. + rdma_completeq, rdma_transmitq and rdma_receiveq only exist if VIRTIO_NET_F_ROCE set + \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits} \begin{description} @@ -3084,6 +3098,9 @@ \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits \item[VIRTIO_NET_F_CTRL_MAC_ADDR(23)] Set MAC address through control channel. +\item[VIRTIO_NET_F_ROCE(55)] Device supports RoCE (RDMA over Converged Ethernet) + capability. + \item[VIRTIO_NET_F_HOST_USO (56)] Device can receive USO packets. 
 Unlike UFO (fragmenting the packet) the USO splits large UDP packet to several segments when each of these smaller packets has UDP header.
@@ -3129,6 +3146,7 @@ \subsubsection{Feature bit requirements}\label{sec:Device Types / Network Device
 \item[VIRTIO_NET_F_GUEST_ANNOUNCE] Requires VIRTIO_NET_F_CTRL_VQ.
 \item[VIRTIO_NET_F_MQ] Requires VIRTIO_NET_F_CTRL_VQ.
 \item[VIRTIO_NET_F_CTRL_MAC_ADDR] Requires VIRTIO_NET_F_CTRL_VQ.
+\item[VIRTIO_NET_F_ROCE] Requires VIRTIO_NET_F_CTRL_VQ.
 \item[VIRTIO_NET_F_RSC_EXT] Requires VIRTIO_NET_F_HOST_TSO4 or VIRTIO_NET_F_HOST_TSO6.
 \item[VIRTIO_NET_F_RSS] Requires VIRTIO_NET_F_CTRL_VQ.
 \end{description}
@@ -3190,6 +3208,8 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device
         u8 rss_max_key_size;
         le16 rss_max_indirection_table_length;
         le32 supported_hash_types;
+        le32 max_rdma_qps;
+        le32 max_rdma_cqs;
 };
 \end{lstlisting}
 The following field, \field{rss_max_key_size} only exists if VIRTIO_NET_F_RSS or VIRTIO_NET_F_HASH_REPORT is set.
@@ -3204,11 +3224,23 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device
 Field \field{supported_hash_types} contains the bitmask of supported hash types.
 See \ref{sec:Device Types / Network Device / Device Operation / Processing of Incoming Packets / Hash calculation for incoming packets / Supported/enabled hash types} for details of supported hash types.

+Field \field{max_rdma_qps} only exists if VIRTIO_NET_F_ROCE is set.
+It specifies the maximum number of queue pairs (send virtqueue and receive virtqueue) for RoCE usage.
+
+Field \field{max_rdma_cqs} only exists if VIRTIO_NET_F_ROCE is set.
+It specifies the maximum number of completion virtqueues for RoCE usage.
+
 \devicenormative{\subsubsection}{Device configuration layout}{Device Types / Network Device / Device configuration layout}

 The device MUST set \field{max_virtqueue_pairs} to between 1 and 0x8000 inclusive,
 if it offers VIRTIO_NET_F_MQ.
+The device MUST set \field{max_rdma_qps} to between 1 and 16384 inclusive,
+if it offers VIRTIO_NET_F_ROCE.
+
+The device MUST set \field{max_rdma_cqs} to between 1 and 16384 inclusive,
+if it offers VIRTIO_NET_F_ROCE.
+
 The device MUST set \field{mtu} to between 68 and 65535 inclusive,
 if it offers VIRTIO_NET_F_MTU.

@@ -3306,6 +3338,12 @@ \subsection{Device Initialization}\label{sec:Device Types / Network Device / Dev
 \item If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated,
   identify the control virtqueue.

+\item If the VIRTIO_NET_F_ROCE feature bit is negotiated,
+  identify the RDMA completion virtqueues, up to max_rdma_cqs.
+
+\item If the VIRTIO_NET_F_ROCE feature bit is negotiated,
+  identify the RDMA send and receive virtqueues, up to max_rdma_qps.
+
 \item Fill the receive queues with buffers: see \ref{sec:Device Types / Network Device / Device Operation / Setting Up Receive Buffers}.

 \item Even with VIRTIO_NET_F_MQ, only receiveq1, transmitq1 and
@@ -4007,6 +4045,7 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
         u8 command;
         u8 command-specific-data[];
         u8 ack;
+        u8 ack-specific-data[];
 };

 /* ack values */
@@ -4015,8 +4054,8 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
 \end{lstlisting}

 The \field{class}, \field{command} and command-specific-data are set by the
-driver, and the device sets the \field{ack} byte. There is little it can
-do except issue a diagnostic if \field{ack} is not
+driver, and the device sets the \field{ack} byte and ack-specific-data. There
+is little it can do except issue a diagnostic if \field{ack} is not
 VIRTIO_NET_OK.
\paragraph{Packet Receive Filtering}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Packet Receive Filtering} @@ -4463,6 +4502,534 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi according to the native endian of the guest rather than (necessarily when not using the legacy interface) little-endian. +\paragraph{RoCE Configuration}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration} + +If the driver negotiates the VIRTIO_NET_F_ROCE feature bit (depends on VIRTIO_NET_F_CTRL_VQ), +it can send control commands for RoCE usage. The following commands are defined now: + +\begin{lstlisting} +#define VIRTIO_NET_CTRL_ROCE 6 + #define VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE 0 + #define VIRTIO_NET_CTRL_ROCE_QUERY_PORT 1 + #define VIRTIO_NET_CTRL_ROCE_CREATE_CQ 2 + #define VIRTIO_NET_CTRL_ROCE_DESTROY_CQ 3 + #define VIRTIO_NET_CTRL_ROCE_CREATE_PD 4 + #define VIRTIO_NET_CTRL_ROCE_DESTROY_PD 5 + #define VIRTIO_NET_CTRL_ROCE_GET_DMA_MR 6 + #define VIRTIO_NET_CTRL_ROCE_REG_USER_MR 7 + #define VIRTIO_NET_CTRL_ROCE_DEREG_MR 8 + #define VIRTIO_NET_CTRL_ROCE_CREATE_QP 9 + #define VIRTIO_NET_CTRL_ROCE_MODIFY_QP 10 + #define VIRTIO_NET_CTRL_ROCE_QUERY_QP 11 + #define VIRTIO_NET_CTRL_ROCE_DESTROY_QP 12 + #define VIRTIO_NET_CTRL_ROCE_CREATE_AH 13 + #define VIRTIO_NET_CTRL_ROCE_DESTROY_AH 14 + #define VIRTIO_NET_CTRL_ROCE_ADD_GID 15 + #define VIRTIO_NET_CTRL_ROCE_DEL_GID 16 + #define VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ 17 +\end{lstlisting} + +\begin{description} +\item[VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE] Query the attributes of device. + No command-specific-data; + the ack-specific-data is \field{struct virtio_rdma_ack_query_device}. 
+
+\begin{lstlisting}
+struct virtio_rdma_ack_query_device {
+#define VIRTIO_IB_DEVICE_RC_RNR_NAK_GEN (1 << 0)
+        /* Capabilities mask */
+        le64 device_cap_flags;
+        /* Largest contiguous block that can be registered */
+        le64 max_mr_size;
+        /* Supported memory shift sizes */
+        le64 page_size_cap;
+        /* Hardware version */
+        le32 hw_ver;
+        /* Maximum number of outstanding Work Requests (WR) on Send Queue (SQ) and Receive Queue (RQ) */
+        le32 max_qp_wr;
+        /* Maximum number of scatter/gather (s/g) elements per WR for SQ for non RDMA Read operations */
+        le32 max_send_sge;
+        /* Maximum number of s/g elements per WR for RQ for non RDMA Read operations */
+        le32 max_recv_sge;
+        /* Maximum number of s/g per WR for RDMA Read operations */
+        le32 max_sge_rd;
+        /* Maximum size of Completion Queue (CQ) */
+        le32 max_cqe;
+        /* Maximum number of Memory Regions (MR) */
+        le32 max_mr;
+        /* Maximum number of Protection Domains (PD) */
+        le32 max_pd;
+        /* Maximum number of RDMA Read operations that can be outstanding per Queue Pair (QP) */
+        le32 max_qp_rd_atom;
+        /* Maximum depth per QP for initiation of RDMA Read operations */
+        le32 max_qp_init_rd_atom;
+        /* Maximum number of Address Handles (AH) */
+        le32 max_ah;
+        /* Local CA ack delay */
+        u8 local_ca_ack_delay;
+        /* Padding */
+        u8 padding[3];
+        /* Reserved for future */
+        le32 reserved[14];
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_QUERY_PORT] Query the attributes of port.
+        No command-specific-data;
+        the ack-specific-data is \field{struct virtio_rdma_ack_query_port}.
+
+\begin{lstlisting}
+struct virtio_rdma_ack_query_port {
+        /* Length of source Global Identifier (GID) table */
+        le32 gid_tbl_len;
+        /* Maximum message size */
+        le32 max_msg_sz;
+        /* Reserved for future */
+        le32 reserved[6];
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_CREATE_CQ] Create a Completion Queue (CQ).
+        The command-specific-data is \field{struct virtio_rdma_cmd_create_cq};
+        the ack-specific-data is \field{struct virtio_rdma_ack_create_cq}.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_create_cq {
+        /* Size of CQ */
+        le32 cqe;
+};
+
+struct virtio_rdma_ack_create_cq {
+        /* The index of CQ */
+        le32 cqn;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_DESTROY_CQ] Destroy a Completion Queue.
+        The command-specific-data is \field{struct virtio_rdma_cmd_destroy_cq};
+        no ack-specific-data.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_destroy_cq {
+        /* The index of CQ */
+        le32 cqn;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_CREATE_PD] Create a Protection Domain (PD).
+        No command-specific-data;
+        the ack-specific-data is \field{struct virtio_rdma_ack_create_pd}.
+
+\begin{lstlisting}
+struct virtio_rdma_ack_create_pd {
+        /* The handle of PD */
+        le32 pdn;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_DESTROY_PD] Destroy a Protection Domain.
+        The command-specific-data is \field{virtio_rdma_cmd_destroy_pd};
+        no ack-specific-data.
+
+\begin{lstlisting}
+struct virtio_rdma_cmd_destroy_pd {
+        /* The handle of PD */
+        le32 pdn;
+};
+\end{lstlisting}
+
+\item[VIRTIO_NET_CTRL_ROCE_GET_DMA_MR] Get the DMA Memory Region (MR)
+        associated with one Protection Domain.
+        The command-specific-data is \field{virtio_rdma_cmd_get_dma_mr};
+        the ack-specific-data is \field{virtio_rdma_ack_get_dma_mr}.
+ +\begin{lstlisting} +enum virtio_ib_access_flags { + VIRTIO_IB_ACCESS_LOCAL_WRITE = (1 << 0), + VIRTIO_IB_ACCESS_REMOTE_WRITE = (1 << 1), + VIRTIO_IB_ACCESS_REMOTE_READ = (1 << 2), +}; + +struct virtio_rdma_cmd_get_dma_mr { + /* The handle of PD which the MR associated with */ + le32 pdn; + /* MR's protection attributes, enum virtio_ib_access_flags */ + le32 access_flags; +}; + +struct virtio_rdma_ack_get_dma_mr { + /* The handle of MR */ + le32 mrn; + /* MR's local access key */ + le32 lkey; + /* MR's remote access key */ + le32 rkey; +}; +\end{lstlisting} + +\item[VIRTIO_NET_CTRL_ROCE_REG_USER_MR] Register a user Memory Region + associated with one Protection Domain. + The command-specific-data is \field{virtio_rdma_cmd_reg_user_mr}; + the ack-specific-data is \field{virtio_rdma_ack_reg_user_mr}. + +\begin{lstlisting} +struct virtio_rdma_cmd_reg_user_mr { + /* The handle of PD which the MR associated with */ + le32 pdn; + /* MR's protection attributes, enum virtio_ib_access_flags */ + le32 access_flags; + /* Starting virtual address of MR */ + le64 virt_addr; + /* Length of MR */ + le64 length; + /* Size of the below page array */ + le32 npages; + /* Padding */ + le32 padding; + /* Array to store physical address of each page in MR */ + le64 pages[]; +}; + +struct virtio_rdma_ack_reg_user_mr { + /* The handle of MR */ + le32 mrn; + /* MR's local access key */ + le32 lkey; + /* MR's remote access key */ + le32 rkey; +}; +\end{lstlisting} + +\item[VIRTIO_NET_CTRL_ROCE_DEREG_MR] De-register a Memory Region. + The command-specific-data is \field{virtio_rdma_cmd_dereg_mr}; + no ack-specific-data. + +\begin{lstlisting} +struct virtio_rdma_cmd_dereg_mr { + /* The handle of MR */ + le32 mrn; +}; +\end{lstlisting} + +\item[VIRTIO_NET_CTRL_ROCE_CREATE_QP] Create a Queue Pair (Send Queue and Receive Queue). + The command-specific-data is \field{virtio_rdma_cmd_create_qp}; + the ack-specific-data is \field{virtio_rdma_ack_create_qp}. 
+ +\begin{lstlisting} +struct virtio_rdma_qp_cap { + /* Maximum number of outstanding WRs in SQ */ + le32 max_send_wr; + /* Maximum number of outstanding WRs in RQ */ + le32 max_recv_wr; + /* Maximum number of s/g elements per WR in SQ */ + le32 max_send_sge; + /* Maximum number of s/g elements per WR in RQ */ + le32 max_recv_sge; + /* Maximum number of data (bytes) that can be posted inline to SQ */ + le32 max_inline_data; + /* Padding */ + le32 padding; +}; + +struct virtio_rdma_cmd_create_qp { + /* The handle of PD which the QP associated with */ + le32 pdn; +#define VIRTIO_IB_QPT_SMI 0 +#define VIRTIO_IB_QPT_GSI 1 +#define VIRTIO_IB_QPT_RC 2 +#define VIRTIO_IB_QPT_UC 3 +#define VIRTIO_IB_QPT_UD 4 + /* QP's type */ + u8 qp_type; + /* If set, each WR submitted to the SQ generates a completion entry */ + u8 sq_sig_all; + /* Padding */ + u8 padding[2]; + /* The index of CQ which the SQ associated with */ + le32 send_cqn; + /* The index of CQ which the RQ associated with */ + le32 recv_cqn; + /* QP's capabilities */ + struct virtio_rdma_qp_cap cap; + /* Reserved for future */ + le32 reserved[4]; +}; + +struct virtio_rdma_ack_create_qp { + /* The index of QP */ + le32 qpn; +}; +\end{lstlisting} + +\item[VIRTIO_NET_CTRL_ROCE_MODIFY_QP] Modify the attributes of a Queue Pair. + The command-specific-data is \field{virtio_rdma_cmd_modify_qp}; + no ack-specific-data. 
+ +\begin{lstlisting} +struct virtio_rdma_global_route { + /* Destination GID or MGID */ + u8 dgid[16]; + /* Flow label */ + le32 flow_label; + /* Source GID index */ + u8 sgid_index; + /* Hop limit */ + u8 hop_limit; + /* Traffic class */ + u8 traffic_class; + /* Padding */ + u8 padding; +}; + +struct virtio_rdma_ah_attr { + /* Global Routing Header (GRH) attributes */ + virtio_rdma_global_route grh; + /* Destination MAC address */ + u8 dmac[6]; + /* Reserved for future */ + u8 reserved[10]; +}; + +enum virtio_ib_qp_attr_mask { + VIRTIO_IB_QP_STATE = (1 << 0), + VIRTIO_IB_QP_CUR_STATE = (1 << 1), + VIRTIO_IB_QP_ACCESS_FLAGS = (1 << 2), + VIRTIO_IB_QP_QKEY = (1 << 3), + VIRTIO_IB_QP_AV = (1 << 4), + VIRTIO_IB_QP_PATH_MTU = (1 << 5), + VIRTIO_IB_QP_TIMEOUT = (1 << 6), + VIRTIO_IB_QP_RETRY_CNT = (1 << 7), + VIRTIO_IB_QP_RNR_RETRY = (1 << 8), + VIRTIO_IB_QP_RQ_PSN = (1 << 9), + VIRTIO_IB_QP_MAX_QP_RD_ATOMIC = (1 << 10), + VIRTIO_IB_QP_MIN_RNR_TIMER = (1 << 11), + VIRTIO_IB_QP_SQ_PSN = (1 << 12), + VIRTIO_IB_QP_MAX_DEST_RD_ATOMIC = (1 << 13), + VIRTIO_IB_QP_CAP = (1 << 14), + VIRTIO_IB_QP_DEST_QPN = (1 << 15), + VIRTIO_IB_QP_RATE_LIMIT = (1 << 16), +}; + +enum virtio_ib_qp_state { + VIRTIO_IB_QPS_RESET, + VIRTIO_IB_QPS_INIT, + VIRTIO_IB_QPS_RTR, + VIRTIO_IB_QPS_RTS, + VIRTIO_IB_QPS_SQD, + VIRTIO_IB_QPS_SQE, + VIRTIO_IB_QPS_ERR +}; + +enum virtio_ib_mtu { + VIRTIO_IB_MTU_256 = 1, + VIRTIO_IB_MTU_512 = 2, + VIRTIO_IB_MTU_1024 = 3, + VIRTIO_IB_MTU_2048 = 4, + VIRTIO_IB_MTU_4096 = 5 +}; + +struct virtio_rdma_cmd_modify_qp { + /* The index of QP */ + le32 qpn; + /* The mask of attributes needs to be modified, enum virtio_ib_qp_attr_mask */ + le32 attr_mask; + /* Move the QP to this state, enum virtio_ib_qp_state */ + u8 qp_state; + /* Current QP state, enum virtio_ib_qp_state */ + u8 cur_qp_state; + /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */ + u8 path_mtu; + /* Number of outstanding RDMA Read operations on destination QP (valid only for RC QPs) */ + u8 
max_rd_atomic; + /* Number of responder resources for handling incoming RDMA Read operations (valid only for RC QPs) */ + u8 max_dest_rd_atomic; + /* Minimum RNR (Receiver Not Ready) NAK timer (valid only for RC QPs) */ + u8 min_rnr_timer; + /* Local ack timeout (valid only for RC QPs) */ + u8 timeout; + /* Retry count (valid only for RC QPs) */ + u8 retry_cnt; + /* RNR retry (valid only for RC QPs) */ + u8 rnr_retry; + /* Padding */ + u8 padding[7]; + /* Q_Key for the QP (valid only for UD QPs) */ + le32 qkey; + /* PSN for RQ (valid only for RC/UC QPs) */ + le32 rq_psn; + /* PSN for SQ */ + le32 sq_psn; + /* Destination QP number (valid only for RC/UC QPs) */ + le32 dest_qp_num; + /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */ + le32 qp_access_flags; + /* Rate limit in kbps for packet pacing */ + le32 rate_limit; + /* QP capabilities */ + struct virtio_rdma_qp_cap cap; + /* Address Vector (valid only for RC/UC QPs) */ + struct virtio_rdma_ah_attr ah_attr; + /* Reserved for future */ + le32 reserved[4]; +}; +\end{lstlisting} + +\item[VIRTIO_NET_CTRL_ROCE_QUERY_QP] Query the attributes of a Queue Pair. + The command-specific-data is \field{virtio_rdma_cmd_query_qp}; + the ack-specific-data is \field{virtio_rdma_ack_query_qp}. 
+ +\begin{lstlisting} +struct virtio_rdma_cmd_query_qp { + /* The index of QP */ + le32 qpn; + /* Mask of the attributes to be queried, enum virtio_ib_qp_attr_mask */ + le32 attr_mask; +}; + +struct virtio_rdma_ack_query_qp { + /* Current QP state, enum virtio_ib_qp_state */ + u8 qp_state; + /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */ + u8 path_mtu; + /* Is the SQ draining */ + u8 sq_draining; + /* Number of outstanding RDMA read operations on destination QP (valid only for RC QPs) */ + u8 max_rd_atomic; + /* Number of responder resources for handling incoming RDMA read operations (valid only for RC QPs) */ + u8 max_dest_rd_atomic; + /* Minimum RNR NAK timer (valid only for RC QPs) */ + u8 min_rnr_timer; + /* Local ack timeout (valid only for RC QPs) */ + u8 timeout; + /* Retry count (valid only for RC QPs) */ + u8 retry_cnt; + /* RNR retry (valid only for RC QPs) */ + u8 rnr_retry; + /* Padding */ + u8 padding[7]; + /* Q_Key for the QP (valid only for UD QPs) */ + le32 qkey; + /* PSN for RQ (valid only for RC/UC QPs) */ + le32 rq_psn; + /* PSN for SQ */ + le32 sq_psn; + /* Destination QP number (valid only for RC/UC QPs) */ + le32 dest_qp_num; + /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */ + le32 qp_access_flags; + /* Rate limit in kbps for packet pacing */ + le32 rate_limit; + /* QP capabilities */ + struct virtio_rdma_qp_cap cap; + /* Address Vector (valid only for RC/UC QPs) */ + struct virtio_rdma_ah_attr ah_attr; + /* Reserved for future */ + le32 reserved[4]; +}; +\end{lstlisting} + +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_QP] Destroy a Queue Pair. + The command-specific-data is \field{virtio_rdma_cmd_destroy_qp}; + no ack-specific-data. + +\begin{lstlisting} +struct virtio_rdma_cmd_destroy_qp { + /* The index of QP */ + le32 qpn; +}; +\end{lstlisting} + +\item[VIRTIO_NET_CTRL_ROCE_CREATE_AH] Create an Address Handle (AH).
+ The command-specific-data is \field{virtio_rdma_cmd_create_ah}; + the ack-specific-data is \field{virtio_rdma_ack_create_ah}. + +\begin{lstlisting} +struct virtio_rdma_cmd_create_ah { + /* The handle of the PD with which the AH is associated */ + le32 pdn; + /* Padding */ + le32 padding; + /* Address Vector */ + struct virtio_rdma_ah_attr ah_attr; +}; + +struct virtio_rdma_ack_create_ah { + /* The address handle */ + le32 ah; +}; +\end{lstlisting} + +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_AH] Destroy an Address Handle. + The command-specific-data is \field{virtio_rdma_cmd_destroy_ah}; + no ack-specific-data. + +\begin{lstlisting} +struct virtio_rdma_cmd_destroy_ah { + /* The handle of the PD with which the AH is associated */ + le32 pdn; + /* The address handle */ + le32 ah; +}; +\end{lstlisting} + +\item[VIRTIO_NET_CTRL_ROCE_ADD_GID] Add a Global Identifier (GID). + The command-specific-data is \field{virtio_rdma_cmd_add_gid}; + no ack-specific-data. + +\begin{lstlisting} +struct virtio_rdma_cmd_add_gid { + /* Index of GID */ + le16 index; + /* Padding */ + le16 padding[3]; + /* GID to be added */ + u8 gid[16]; +}; +\end{lstlisting} + +\item[VIRTIO_NET_CTRL_ROCE_DEL_GID] Delete a Global Identifier. + The command-specific-data is \field{virtio_rdma_cmd_del_gid}; + no ack-specific-data. + +\begin{lstlisting} +struct virtio_rdma_cmd_del_gid { + /* Index of GID */ + le16 index; +}; +\end{lstlisting} + +\item[VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ] Request a completion notification + on a Completion Queue. + The command-specific-data is \field{virtio_rdma_cmd_req_notify}; + no ack-specific-data.
+ +\begin{lstlisting} +struct virtio_rdma_cmd_req_notify { + /* The index of CQ */ + le32 cqn; +#define VIRTIO_IB_NOTIFY_SOLICITED (1 << 0) +#define VIRTIO_IB_NOTIFY_NEXT_COMPLETION (1 << 1) + /* Notify flags */ + le32 flags; +}; +\end{lstlisting} + +\end{description} + +\drivernormative{\subparagraph}{RoCE Configuration}{Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration} + +A driver MUST initialize the completion virtqueue and fill it with +enough entries after command VIRTIO_NET_CTRL_ROCE_CREATE_CQ is +successfully executed. + +A driver MUST reset the completion virtqueue after +command VIRTIO_NET_CTRL_ROCE_DESTROY_CQ is successfully executed. + +A driver MUST initialize the send virtqueue and receive virtqueue after +command VIRTIO_NET_CTRL_ROCE_CREATE_QP is successfully executed. + +A driver MUST reset the send virtqueue and receive virtqueue after +command VIRTIO_NET_CTRL_ROCE_DESTROY_QP is successfully executed. \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device Types / Network Device / Legacy Interface: Framing Requirements} @@ -4496,6 +5063,289 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device See \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Message Framing}. +\subsubsection{RoCE Support}\label{sec:Device Types / Network Device / Device Operation / RoCE Support} + +RDMA over Converged Ethernet (RoCE) is a network protocol that allows +remote direct memory access (RDMA) over an Ethernet network. To support +RoCE (if VIRTIO_NET_F_ROCE is negotiated), in addition to the control +virtqueue support mentioned in \ref{sec:Device Types / Network Device / +Device Operation / Control Virtqueue / RoCE Configuration}, multiple +types of virtqueues including send virtqueue, receive virtqueue and +completion virtqueue are introduced. + +The send virtqueue contains elements that describe the data to be +transmitted.
+ +Requests (device-readable) have the following format: + +\begin{lstlisting} +enum virtio_ib_wr_opcode { + VIRTIO_IB_WR_RDMA_WRITE, + VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM, + VIRTIO_IB_WR_SEND, + VIRTIO_IB_WR_SEND_WITH_IMM, + VIRTIO_IB_WR_RDMA_READ, +}; + +struct virtio_rdma_sge { + le64 addr; + le32 length; + le32 lkey; +}; + +struct virtio_rdma_sq_req { + /* User defined WR ID */ + le64 wr_id; + /* WR opcode, enum virtio_ib_wr_opcode */ + u8 opcode; +#define VIRTIO_IB_SEND_FENCE (1 << 0) +#define VIRTIO_IB_SEND_SIGNALED (1 << 1) +#define VIRTIO_IB_SEND_SOLICITED (1 << 2) +#define VIRTIO_IB_SEND_INLINE (1 << 3) + /* Flags of the WR properties */ + u8 send_flags; + /* Padding */ + le16 padding; + /* Immediate data (in network byte order) to send */ + le32 imm_data; + union { + struct { + /* Start address of remote memory buffer */ + le64 remote_addr; + /* Key of the remote MR */ + le32 rkey; + } rdma; + struct { + /* Index of the destination QP */ + le32 remote_qpn; + /* Q_Key of the destination QP */ + le32 remote_qkey; + /* Address Handle */ + le32 ah; + } ud; + /* Reserved for future */ + le64 reserved[4]; + }; + /* Inline data */ + u8 inline_data[512]; + union { + /* Length of sg_list */ + le32 num_sge; + /* Length of inline data */ + le16 inline_len; + }; + /* Reserved for future */ + le32 reserved2[3]; + /* Scatter/gather list */ + struct virtio_rdma_sge sg_list[]; +}; +\end{lstlisting} + +The receive virtqueue contains elements that describe where to place incoming data. + +Requests (device-readable) have the following format: + +\begin{lstlisting} +struct virtio_rdma_rq_req { + /* User defined WR ID */ + le64 wr_id; + /* Length of sg_list */ + le32 num_sge; + /* Reserved for future */ + le32 reserved[3]; + /* Scatter/gather list */ + struct virtio_rdma_sge sg_list[]; +}; +\end{lstlisting} + +The completion virtqueue is used to notify the completion of requests in +send virtqueue or receive virtqueue. 
+ +Requests (device-writable) have the following format: + +\begin{lstlisting} +enum virtio_ib_wc_opcode { + VIRTIO_IB_WC_SEND, + VIRTIO_IB_WC_RDMA_WRITE, + VIRTIO_IB_WC_RDMA_READ, + VIRTIO_IB_WC_RECV, + VIRTIO_IB_WC_RECV_RDMA_WITH_IMM, +}; + +enum virtio_ib_wc_status { + /* Operation completed successfully */ + VIRTIO_IB_WC_SUCCESS, + /* Local Length Error */ + VIRTIO_IB_WC_LOC_LEN_ERR, + /* Local QP Operation Error */ + VIRTIO_IB_WC_LOC_QP_OP_ERR, + /* Local Protection Error */ + VIRTIO_IB_WC_LOC_PROT_ERR, + /* Work Request Flushed Error */ + VIRTIO_IB_WC_WR_FLUSH_ERR, + /* Bad Response Error */ + VIRTIO_IB_WC_BAD_RESP_ERR, + /* Local Access Error */ + VIRTIO_IB_WC_LOC_ACCESS_ERR, + /* Remote Invalid Request Error */ + VIRTIO_IB_WC_REM_INV_REQ_ERR, + /* Remote Access Error */ + VIRTIO_IB_WC_REM_ACCESS_ERR, + /* Remote Operation Error */ + VIRTIO_IB_WC_REM_OP_ERR, + /* Transport Retry Counter Exceeded */ + VIRTIO_IB_WC_RETRY_EXC_ERR, + /* RNR Retry Counter Exceeded */ + VIRTIO_IB_WC_RNR_RETRY_EXC_ERR, + /* Remote Aborted Error */ + VIRTIO_IB_WC_REM_ABORT_ERR, + /* Fatal Error */ + VIRTIO_IB_WC_FATAL_ERR, + /* Response Timeout Error */ + VIRTIO_IB_WC_RESP_TIMEOUT_ERR, + /* General Error */ + VIRTIO_IB_WC_GENERAL_ERR +}; + +struct virtio_rdma_cq_req { + /* User defined WR ID */ + le64 wr_id; + /* Work completion status, enum virtio_ib_wc_status */ + u8 status; + /* Work completion opcode, enum virtio_ib_wc_opcode */ + u8 opcode; + /* Padding */ + le16 padding; + /* Vendor error */ + le32 vendor_err; + /* Number of bytes transferred */ + le32 byte_len; + /* Immediate data (in network byte order) received */ + le32 imm_data; + /* Local QP number of completed WR */ + le32 qp_num; + /* Source QP number (remote QP number) of completed WR (valid only for UD QPs) */ + le32 src_qp; +#define VIRTIO_IB_WC_GRH (1 << 0) +#define VIRTIO_IB_WC_WITH_IMM (1 << 1) + /* Work completion flag */ + le32 wc_flags; + /* Reserved for future */ + le32 reserved[3]; +}; +\end{lstlisting} + +\paragraph{Send
Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Send Operation} + +The send operation sends data to a remote QP's Receive Queue. +The receiver MUST have previously posted a receive buffer to receive the data. + +To perform a send operation, a request with \field{opcode} set to +VIRTIO_IB_WR_SEND or VIRTIO_IB_WR_SEND_WITH_IMM MUST be posted to the Send +Queue as one output descriptor and the device is notified of the new entry. + +\drivernormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation} + +If VIRTIO_IB_SEND_INLINE is set in \field{send_flags}, the driver MUST copy +the send buffer into the \field{inline_data} field and set \field{inline_len} to the +length of the buffer. Otherwise, the driver MUST fill \field{sg_list} to +describe the buffer. + +\devicenormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation} + +If \field{opcode} is not set to VIRTIO_IB_WR_SEND_WITH_IMM, the device MUST +ignore \field{imm_data}. + +If the QP type is UD, the device MUST validate \field{ud.ah}. + +If VIRTIO_IB_SEND_INLINE is not set in \field{send_flags}, the device MUST +validate the \field{addr}, \field{length} and \field{lkey} in \field{sg_list}. + +\paragraph{Receive Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Receive Operation} + +The receive operation receives data from a remote QP. +It is the counterpart of the send operation. + +To perform a receive operation, a request MUST be posted to the Receive +Queue as one output descriptor and the device is notified of the new entry. + +\drivernormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation} + +The driver MUST fill \field{sg_list} to describe the receive buffer.
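As a rough illustration of the receive path described above, the sketch below assembles a struct virtio_rdma_rq_req with its scatter/gather list in one contiguous buffer, ready to be posted as a single output descriptor. It is only a sketch under stated assumptions: a little-endian host (so le32/le64 map directly onto fixed-width integers), a hypothetical helper name build_recv_req(), and a choice made for this sketch alone to reject an empty sg_list. None of this comes from the proposed driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: little-endian host, so the wire types need no byte swapping. */
typedef uint64_t le64;
typedef uint32_t le32;

struct virtio_rdma_sge {
    le64 addr;
    le32 length;
    le32 lkey;
};

struct virtio_rdma_rq_req {
    le64 wr_id;
    le32 num_sge;
    le32 reserved[3];
    struct virtio_rdma_sge sg_list[];
};

/* Layout implied by the field order: 24-byte header, 16-byte SGEs. */
static_assert(sizeof(struct virtio_rdma_sge) == 16, "unexpected SGE size");
static_assert(offsetof(struct virtio_rdma_rq_req, sg_list) == 24,
              "unexpected receive request header size");

/*
 * Hypothetical helper: build a receive request describing n buffers.
 * Returns a heap-allocated request (caller frees) and stores its total
 * length in *out_len, or returns NULL on bad input or allocation failure.
 */
static struct virtio_rdma_rq_req *
build_recv_req(uint64_t wr_id, const struct virtio_rdma_sge *sges,
               uint32_t n, size_t *out_len)
{
    struct virtio_rdma_rq_req *req;
    size_t len;

    /* Sketch-only policy: the driver must describe at least one buffer. */
    if (sges == NULL || n == 0 || out_len == NULL)
        return NULL;

    len = sizeof(*req) + (size_t)n * sizeof(*sges);
    req = calloc(1, len);          /* zeroes the reserved fields */
    if (req == NULL)
        return NULL;

    req->wr_id = wr_id;
    req->num_sge = n;
    memcpy(req->sg_list, sges, (size_t)n * sizeof(*sges));
    *out_len = len;
    return req;
}
```

The returned buffer of *out_len bytes would then be added to the receive virtqueue as one device-readable descriptor. The device side is still expected to validate addr, length and lkey of every entry.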
+ +\devicenormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation} + +The device MUST validate the \field{addr}, \field{length} and \field{lkey} +in \field{sg_list}. + +\paragraph{Write Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Write Operation} + +The write operation writes data to a memory buffer on the +remote side with no notification. The remote side is not aware +that this operation is being performed. + +To perform a write operation, a request with \field{opcode} set to +VIRTIO_IB_WR_RDMA_WRITE or VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM MUST be +posted to the Send Queue as one output descriptor and the device is +notified of the new entry. + +\drivernormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation} + +The driver MUST fill \field{sg_list} to describe the write buffer. + +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to +identify the remote buffer. + +\devicenormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation} + +If \field{opcode} is not set to VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM, the device +MUST ignore \field{imm_data}. + +The device MUST validate the \field{addr}, \field{length} and \field{lkey} +in \field{sg_list}. + +\paragraph{Read Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Read Operation} + +The read operation reads data from a memory buffer on the +remote side with no notification. The remote side is not aware +that this operation is being performed. + +To perform a read operation, a request with \field{opcode} set to +VIRTIO_IB_WR_RDMA_READ MUST be posted to the Send Queue as one output +descriptor and the device is notified of the new entry.
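To make the write and read request format above more concrete, here is a minimal sketch of how a driver might fill a struct virtio_rdma_sq_req for VIRTIO_IB_WR_RDMA_WRITE or VIRTIO_IB_WR_RDMA_READ. The helper name build_rdma_req() is hypothetical, a little-endian host is assumed, and the 576-byte header offset is simply what the field order in the proposal yields on a typical ABI. It is not taken from the proposed implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: little-endian host, so the wire types need no byte swapping. */
typedef uint64_t le64;
typedef uint32_t le32;
typedef uint16_t le16;
typedef uint8_t u8;

enum virtio_ib_wr_opcode {
    VIRTIO_IB_WR_RDMA_WRITE,
    VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM,
    VIRTIO_IB_WR_SEND,
    VIRTIO_IB_WR_SEND_WITH_IMM,
    VIRTIO_IB_WR_RDMA_READ,
};

#define VIRTIO_IB_SEND_SIGNALED (1 << 1)

struct virtio_rdma_sge {
    le64 addr;
    le32 length;
    le32 lkey;
};

struct virtio_rdma_sq_req {
    le64 wr_id;
    u8 opcode;
    u8 send_flags;
    le16 padding;
    le32 imm_data;
    union {
        struct { le64 remote_addr; le32 rkey; } rdma;
        struct { le32 remote_qpn; le32 remote_qkey; le32 ah; } ud;
        le64 reserved[4];
    };
    u8 inline_data[512];
    union {
        le32 num_sge;
        le16 inline_len;
    };
    le32 reserved2[3];
    struct virtio_rdma_sge sg_list[];
};

static_assert(offsetof(struct virtio_rdma_sq_req, sg_list) == 576,
              "unexpected send request header size");

/*
 * Hypothetical helper: build an RDMA write or read request (no inline data).
 * Returns a heap-allocated request (caller frees), or NULL on bad input.
 */
static struct virtio_rdma_sq_req *
build_rdma_req(enum virtio_ib_wr_opcode op, uint64_t wr_id,
               uint64_t remote_addr, uint32_t rkey, uint8_t send_flags,
               const struct virtio_rdma_sge *sges, uint32_t n, size_t *out_len)
{
    struct virtio_rdma_sq_req *req;
    size_t len;

    if (op != VIRTIO_IB_WR_RDMA_WRITE && op != VIRTIO_IB_WR_RDMA_READ)
        return NULL;
    if (sges == NULL || n == 0 || out_len == NULL)
        return NULL;

    len = sizeof(*req) + (size_t)n * sizeof(*sges);
    req = calloc(1, len);          /* imm_data and reserved fields stay zero */
    if (req == NULL)
        return NULL;

    req->wr_id = wr_id;
    req->opcode = (u8)op;
    req->send_flags = send_flags;
    req->rdma.remote_addr = remote_addr;   /* identifies the remote buffer */
    req->rdma.rkey = rkey;
    req->num_sge = n;                      /* local buffer description */
    memcpy(req->sg_list, sges, (size_t)n * sizeof(*sges));
    *out_len = len;
    return req;
}
```

One thing the sketch makes visible is that the fixed 512-byte inline_data array is carried in every request, so even a one-SGE read occupies 592 bytes in the descriptor.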
+ +\drivernormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation} + +The driver MUST fill \field{sg_list} to describe the read buffer. + +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to +identify the remote buffer. + +\devicenormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation} + +The device MUST validate the \field{addr}, \field{length} and \field{lkey} +in \field{sg_list}. + +\paragraph{Completion Notification}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Completion Notification} + +After any of the above operations completes, a completion notification MUST +be triggered by the device. To achieve that, the device MUST consume +an entry of the Completion Queue associated with the Send Queue/Receive +Queue to which the operation belongs. + +\drivernormative{\subparagraph}{Completion Notification}{Device Types / Network Device / Device Operation / RoCE Support / Completion Notification} + +The driver MUST fill the Completion Queue with enough entries beforehand. + +\devicenormative{\subparagraph}{Completion Notification}{Device Types / Network Device / Device Operation / RoCE Support / Completion Notification} + +If \field{imm_data} is valid, the device MUST set VIRTIO_IB_WC_WITH_IMM in +\field{wc_flags}. + +The device MUST set \field{wr_id} to the value of \field{wr_id} of the +corresponding \field{struct virtio_rdma_sq_req} or +\field{struct virtio_rdma_rq_req}. + \section{Block Device}\label{sec:Device Types / Block Device} The virtio block device is a simple virtual block device (ie.