From patchwork Mon Oct 24 07:09:26 2016
X-Patchwork-Submitter: "Wang, Wei W"
X-Patchwork-Id: 9391529
From: Wei Wang
To: virtio-comment@lists.oasis-open.org, qemu-devel@nongnu.org, mst@redhat.com, marcandre.lureau@gmail.com, stefanha@redhat.com, pbonzini@redhat.com
Cc: Wei Wang
Date: Mon, 24 Oct 2016 15:09:26 +0800
Message-Id: <1477292966-144175-1-git-send-email-wei.w.wang@intel.com>
Subject: [Qemu-devel] [PATCH v1] vhost-pci-net: design vhost-pci-net for the transmission of network packets between VMs

Listed here are the issues that we discussed before; solutions are
provided below.

1. Interrupt support

1) vhost-pci-net
According to the virtio spec, "The device MUST present at least one
notification capability", which implies that a virtio device can have
multiple notification capabilities. Therefore, a second notification
capability (VIRTIO_PCI_CAP_PEER_NOTIFY_CFG) will be implemented for the
vhost-pci device to notify the virtq of its peer device (virtio-net).
The ioeventfd of a notification address uses the same fd as the irqfd
of the corresponding peer virtq. So, when the vhost-pci driver writes
to that notification address (i.e. kicks the peer virtq), the related
virtq interrupt is injected into the virtio-net driver.

2) virtio-net
The virtio-net device also needs to notify the peer on the other end
(vhost-pci-net). The ioeventfd corresponding to the TX virtq uses the
same fd as the peer RX virtq's irqfd. This is all done with only one
notification capability.

2. How to handle a guest crash?

1) vhost-pci-net guest crash
Typically, the guest will be killed and re-created by the admin. The
admin then uses the qemu monitor or qmp to operate on the virtio-net
guest: "device_del" the old virtio-net device, and "device_add" a new
virtio-net device which connects to the newly booted vhost-pci-net
guest.

2) virtio-net guest crash
When the virtio-net guest is killed by the admin, the server will be
notified due to "close(fd)", which gives it a chance to destroy the
related vhost-pci-net device. When the guest is re-created and boots,
a new vhost-pci device will be created and initialized for it.

Signed-off-by: Wei Wang
---
 content.tex | 240 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 238 insertions(+), 2 deletions(-)

diff --git a/content.tex b/content.tex
index 222b78e..4f67e94 100644
--- a/content.tex
+++ b/content.tex
@@ -3115,6 +3115,10 @@ features.
 \item[VIRTIO_NET_F_CTRL_MAC_ADDR(23)] Set MAC address through control channel.
+
+\item[VIRTIO_NET_F_PEER_CONNECTION(24)] Device supports connection to
+ a peer device.
+
 \end{description}
 \subsubsection{Feature bit requirements}\label{sec:Device Types / Network Device / Feature bits / Feature bit requirements}
@@ -3138,6 +3142,7 @@ Some networking feature bits require other networking feature bits
 \item[VIRTIO_NET_F_GUEST_ANNOUNCE] Requires VIRTIO_NET_F_CTRL_VQ.
 \item[VIRTIO_NET_F_MQ] Requires VIRTIO_NET_F_CTRL_VQ.
 \item[VIRTIO_NET_F_CTRL_MAC_ADDR] Requires VIRTIO_NET_F_CTRL_VQ.
+\item[VIRTIO_NET_F_PEER_CONNECTION] Requires VIRTIO_NET_F_CTRL_VQ and VIRTIO_NET_F_STATUS.
 \end{description}
 \subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / Network Device / Feature bits / Legacy Interface: Feature bits}
@@ -3154,13 +3159,14 @@ were needed.
 Three driver-read-only configuration fields are currently defined. The \field{mac} address field
 always exists (though is only valid if VIRTIO_NET_F_MAC is set), and
-\field{status} only exists if VIRTIO_NET_F_STATUS is set. Two
+\field{status} only exists if VIRTIO_NET_F_STATUS is set. Three
 read-only bits (for the driver) are currently defined for the status field:
-VIRTIO_NET_S_LINK_UP and VIRTIO_NET_S_ANNOUNCE.
+VIRTIO_NET_S_LINK_UP, VIRTIO_NET_S_ANNOUNCE and VIRTIO_NET_S_PEER_CONNECTION.
 \begin{lstlisting}
 #define VIRTIO_NET_S_LINK_UP 1
 #define VIRTIO_NET_S_ANNOUNCE 2
+#define VIRTIO_NET_S_PEER_CONNECTION 4
 \end{lstlisting}
 The following driver-read-only field, \field{max_virtqueue_pairs} only exists if
@@ -3204,6 +3210,10 @@ level ethernet header length) size with \field{gso_type} NONE or ECN, and do so
 without fragmentation, after VIRTIO_NET_F_MTU has been successfully negotiated.
+Before the device turns on or off the VIRTIO_NET_S_PEER_CONNECTION
+bit in \field{status}, it MUST sync up with the connected peer
+device and get an acknowledgement.
+
 \drivernormative{\subsubsection}{Device configuration layout}{Device Types / Network Device / Device configuration layout}
 A driver SHOULD negotiate VIRTIO_NET_F_MAC if the device offers it.
@@ -3226,6 +3236,11 @@ If the driver negotiates VIRTIO_NET_F_MTU, it MUST NOT transmit packets of
 size exceeding the value of \field{mtu} (plus low level ethernet header length)
 with \field{gso_type} NONE or ECN.
+If the driver negotiates VIRTIO_NET_F_PEER_CONNECTION, it SHOULD NOT
+send packets to any transmitq when the VIRTIO_NET_S_PEER_CONNECTION
+bit is off in \field{status}. The driver MUST NOT unload until the
+VIRTIO_NET_S_PEER_CONNECTION bit is turned off.
+
 \subsubsection{Legacy Interface: Device configuration layout}\label{sec:Device Types / Network Device / Device configuration layout / Legacy Interface: Device configuration layout}
 \label{sec:Device Types / Block Device / Feature bits / Device configuration layout / Legacy Interface: Device configuration layout}
 When using the legacy interface, transitional devices and drivers
@@ -4046,6 +4061,23 @@ MUST format \field{offloads} according to the native endian of
 the guest rather than (necessarily when not using the legacy
 interface) little-endian.
+\paragraph{Peer Connection}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Peer Connection}
+If the VIRTIO_NET_F_PEER_CONNECTION feature bit is negotiated, the
+driver can send a control command to the device to turn on or off
+the VIRTIO_NET_S_PEER_CONNECTION bit in \field{status}.
+
+\begin{lstlisting}
+ #define VIRTIO_NET_PEER_CONNECTION 6
+ #define VIRTIO_NET_PEER_CONNECTION_OFF 0
+ #define VIRTIO_NET_PEER_CONNECTION_ON 1
+\end{lstlisting}
+
+The VIRTIO_NET_PEER_CONNECTION_OFF command is used by the driver to
+request the device to turn off the VIRTIO_NET_S_PEER_CONNECTION bit
+in \field{status}.
+The VIRTIO_NET_PEER_CONNECTION_ON command is used by the driver to
+request the device to turn on the VIRTIO_NET_S_PEER_CONNECTION bit
+in \field{status}.
 \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device Types / Network Device / Legacy Interface: Framing Requirements}
@@ -5811,6 +5843,210 @@ descriptor for the \field{sense_len}, \field{residual},
 \field{status_qualifier}, \field{status}, \field{response} and
 \field{sense} fields.
+\section{Vhost-pci Net Device}\label{sec:Device Types / Vhost-pci Net Device}
+
+The vhost-pci net device enables point-to-point transmission of
+network packets between two isolated address spaces (e.g. virtual
+machines).
+An instance of the vhost-pci net device transmits packets to and
+grabs packets from its peer device, which is usually a virtio net
+device from another address space.
+
+\subsection{Device ID}\label{sec:Device Types / Vhost-pci Net Device / Device ID}
+ TBD
+
+\subsection{Virtqueues}\label{sec:Device Types / Vhost-pci Net Device / Virtqueues}
+
+\begin{description}
+\item[0] control receiveq
+\item[1] control transmitq
+\item[2] receiveq1
+\item[\ldots]
+\item[N+1] receiveqN
+\end{description}
+
+N equals the number of peer device transmitqs.
+
+\subsection{Feature bits}\label{sec:Device Types / Vhost-pci Net Device / Feature bits}
+
+The device is created with the feature bits that have been negotiated
+with the peer device. If the driver only accepts a subset of the
+feature bits, the device needs to re-negotiate the subset of feature
+bits with the peer device, which may trigger a reset of the peer
+device.
+
+\begin{description}
+\item[VIRTIO_NET_F_GUEST_TSO4 (7)] Virtio-net can receive TSOv4.
+
+\item[VIRTIO_NET_F_GUEST_TSO6 (8)] Virtio-net can receive TSOv6.
+
+\item[VIRTIO_NET_F_GUEST_ECN (9)] Virtio-net can receive TSO with ECN.
+
+\item[VIRTIO_NET_F_GUEST_UFO (10)] Virtio-net can receive UFO.
+
+\item[VIRTIO_NET_F_HOST_TSO4 (11)] Vhost-pci-net supports TSOv4.
+
+\item[VIRTIO_NET_F_HOST_TSO6 (12)] Vhost-pci-net supports TSOv6.
+
+\item[VIRTIO_NET_F_HOST_ECN (13)] Vhost-pci-net supports TSO with ECN.
+
+\item[VIRTIO_NET_F_HOST_UFO (14)] Vhost-pci-net supports UFO.
+
+\item[VIRTIO_NET_F_MRG_RXBUF (15)] Virtio-net can merge receive buffers.
+
+\item[VIRTIO_NET_F_PEER_CONNECTION (24)] Virtio-net supports connection
+ to a peer device.
+
+\item[VHOST_F_LOG_ALL (26)] Vhost-pci-net supports dirty page logging.
+\end{description}
+
+\subsection{Device configuration layout}\label{sec:Device Types / Vhost-pci Device / Device configuration layout}
+ None currently defined.
+
+\subsection{Device Initialization}\label{sec:Device Types / Vhost-pci Device / Device Initialization}
+
+The driver would perform a typical initialization routine like so:
+
+\begin{enumerate}
+\item Identify and initialize the control receiveq and control transmitq.
+
+\item Fill the control receiveq with buffers.
+
+\item Generate a random local MAC address.
+\end{enumerate}
+
+\subsection{Device Operation}\label{sec:Device Types / Vhost-pci Net Device / Device Operation}
+
+\subsubsection{Control Virtqueue}\label{sec:Device Types / Vhost-pci Net Device / Device Operation / Control Virtqueue}
+
+The pair of control virtqueues is used to exchange configuration
+messages between the device and driver. All the configuration
+messages are constructed using the following structure:
+
+\begin{lstlisting}
+struct vhost_pci_ctrl {
+        u8 class;
+        u8 command;
+        u8 command-specific-data[];
+};
+\end{lstlisting}
+
+\paragraph{Peer Connection}\label{sec:Device Types / Vhost-pci Net Device / Device Operation / Control Virtqueue / Peer Connection}
+\begin{lstlisting}
+#define VHOST_PCI_CTRL_PEER_CONNECTION 0
+#define VHOST_PCI_CTRL_PEER_CONNECTION_OFF 0
+#define VHOST_PCI_CTRL_PEER_CONNECTION_ON 1
+\end{lstlisting}
+
+The device maintains the status of the connection to the peer
+device. The VHOST_PCI_CTRL_PEER_CONNECTION_OFF and
+VHOST_PCI_CTRL_PEER_CONNECTION_ON commands are used by the
+device to send the status update to the driver through the
+control receiveq.
+
+The VHOST_PCI_CTRL_PEER_CONNECTION_OFF command is also used by the
+driver, through the control transmitq, to request that the device
+disconnect from the peer device.
+
+\devicenormative{\subparagraph}{Peer Connection}{Device Types / Vhost-pci Net Device / Device Operation / Control Virtqueue / Peer Connection}
+
+When receiving a VHOST_PCI_CTRL_PEER_CONNECTION_OFF command from
+the driver, the device SHOULD sync up with the peer device for
+disconnection.
+Upon receiving an acknowledgement of disconnection
+from the peer device, it SHOULD update the driver with a
+VHOST_PCI_CTRL_PEER_CONNECTION_OFF command.
+
+\drivernormative{\subparagraph}{Peer Connection}{Device Types / Vhost-pci Net Device / Device Operation / Control Virtqueue / Peer Connection}
+
+The driver SHOULD NOT access any peer device memory before it is
+updated with a VHOST_PCI_CTRL_PEER_CONNECTION_ON command.
+
+The driver SHOULD NOT access any peer device memory after it is
+updated with a VHOST_PCI_CTRL_PEER_CONNECTION_OFF command.
+
+The driver MUST NOT unload until it is updated by the device with
+a VHOST_PCI_CTRL_PEER_CONNECTION_OFF command.
+
+\paragraph{Peer Memory Info}\label{sec:Device Types / Vhost-pci Net Device / Device Operation / Control Virtqueue / Peer Memory Info}
+
+\begin{lstlisting}
+struct vhost_pci_ctrl_peer_mem_info {
+        u64 peer_mem;
+        u8 other-mem-info[];
+};
+#define VHOST_PCI_CTRL_PEER_MEM_INFO 1
+#define VHOST_PCI_CTRL_PEER_MEM_INFO_BAR 0
+#define VHOST_PCI_CTRL_PEER_MEM_INFO_GVA 1
+\end{lstlisting}
+
+The device obtains the memory info from the peer device, and sends it
+to the driver.
+
+For command VHOST_PCI_CTRL_PEER_MEM_INFO_BAR, \field{peer_mem} stores
+the id of the BAR that holds the peer memory.
+
+For command VHOST_PCI_CTRL_PEER_MEM_INFO_GVA, \field{peer_mem} stores
+the virtual address that maps to the start of the peer memory. The
+driver can directly use this address to access the peer memory.
+
+The \field{other-mem-info} stores other peer memory info for the
+driver to reference, and it is defined according to the
+implementation's need.
+
+\drivernormative{\subparagraph}{Peer Memory Info}{Device Types / Vhost-pci Net Device / Device Operation / Control Virtqueue / Peer Memory Info}
+If the driver receives a VHOST_PCI_CTRL_PEER_MEM_INFO_BAR command,
+it SHOULD map the peer memory through the BAR specified in
+\field{peer_mem}.
+The address that maps to the start of the peer
+memory SHOULD be sent to the device using a
+VHOST_PCI_CTRL_PEER_MEM_INFO_GVA command through the
+control transmitq.
+
+\devicenormative{\subparagraph}{Peer Memory Info}{Device Types / Vhost-pci Net Device / Device Operation / Control Virtqueue / Peer Memory Info}
+The device SHOULD send a VHOST_PCI_CTRL_PEER_MEM_INFO_GVA command to
+the driver if it internally records a virtual address of the peer
+memory. Otherwise, it SHOULD send a VHOST_PCI_CTRL_PEER_MEM_INFO_BAR
+command to the driver.
+
+\paragraph{Peer Virtq Info}\label{sec:Device Types / Vhost-pci Net Device / Device Operation / Control Virtqueue / Peer Virtq Info}
+
+\begin{lstlisting}
+struct vhost_pci_ctrl_peer_virtq_info {
+        u32 virtq_num;
+        struct virtq peer_vq[];
+};
+#define VHOST_PCI_CTRL_PEER_VIRTQ_INFO 2
+\end{lstlisting}
+
+The device obtains the virtq info from the peer device, and sends
+it to the driver. The \field{virtq_num} stores the total number
+of virtqs.
+
+\drivernormative{\subparagraph}{Peer Virtq Info}{Device Types / Vhost-pci Net Device / Device Operation / Control Virtqueue / Peer Virtq Info}
+The driver SHOULD allocate (\field{virtq_num} / 2) receiveqs.
+The driver SHOULD share the \field{virtq_num} virtqs in \field{peer_vq[]},
+using each peer transmitq as its shared receiveq and each peer
+receiveq as its shared transmitq.
+
+\paragraph{Dirty Page Logging}\label{sec:Device Types / Vhost-pci Net Device / Device Operation / Control Virtqueue / Dirty Page Logging}
+
+\begin{lstlisting}
+struct vhost_pci_ctrl_dirty_page_log_base {
+        u64 gpa;
+        u64 size;
+};
+#define VHOST_PCI_CTRL_DIRTY_PAGE_LOGGING 3
+#define VHOST_PCI_CTRL_DIRTY_PAGE_LOGGING_OFF 0
+#define VHOST_PCI_CTRL_DIRTY_PAGE_LOGGING_ON 1
+#define VHOST_PCI_CTRL_DIRTY_PAGE_LOGGING_SET_LOG_BASE 2
+\end{lstlisting}
+
+The device sends the VHOST_PCI_CTRL_DIRTY_PAGE_LOGGING_OFF or
+VHOST_PCI_CTRL_DIRTY_PAGE_LOGGING_ON command to the driver to turn
+off or on the dirty logging mode.
+
+The device sends the VHOST_PCI_CTRL_DIRTY_PAGE_LOGGING_SET_LOG_BASE
+command to the driver to set the dirty logging bitmap.
+The command-specific-data for
+VHOST_PCI_CTRL_DIRTY_PAGE_LOGGING_SET_LOG_BASE includes a 64-bit
+guest physical address and a 64-bit size of the bitmap.
+
 \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
 Currently there are three device-independent feature bits defined:
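
Illustration (not part of the patch): the vhost-pci-net peer connection
rules in the proposed text can be sketched in C roughly as follows. This
is a minimal sketch under stated assumptions, not a reference
implementation. The vpnet_* names and struct vpnet_state are
hypothetical. It also assumes that a control message is laid out as one
class byte, one command byte, then the payload with no padding, and it
reads "MUST NOT unload until it is updated ... with a
VHOST_PCI_CTRL_PEER_CONNECTION_OFF command" as "the most recent update
from the device was OFF".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Class and command values copied from the listings in the patch. */
#define VHOST_PCI_CTRL_PEER_CONNECTION     0
#define VHOST_PCI_CTRL_PEER_CONNECTION_OFF 0
#define VHOST_PCI_CTRL_PEER_CONNECTION_ON  1

/* Hypothetical driver-side bookkeeping; not defined by the patch. */
struct vpnet_state {
    bool peer_connected; /* last update from the device was ..._ON  */
    bool off_seen;       /* last update from the device was ..._OFF */
};

/*
 * Serialize a control message as class byte, command byte, payload
 * (assumed layout of struct vhost_pci_ctrl). Returns the number of
 * bytes written, or 0 if buf is too small.
 */
size_t vpnet_encode_ctrl(uint8_t *buf, size_t buf_len, uint8_t cls,
                         uint8_t command, const void *data, size_t data_len)
{
    assert(buf != NULL);
    assert(data != NULL || data_len == 0);
    if (buf_len < 2 || data_len > buf_len - 2)
        return 0;
    buf[0] = cls;
    buf[1] = command;
    if (data_len > 0)
        memcpy(buf + 2, data, data_len);
    return 2 + data_len;
}

/*
 * Apply a message taken from the control receiveq. Returns false if
 * the message is too short or is not a peer connection update.
 */
bool vpnet_handle_ctrl(struct vpnet_state *s, const uint8_t *msg, size_t len)
{
    if (len < 2 || msg[0] != VHOST_PCI_CTRL_PEER_CONNECTION)
        return false;
    switch (msg[1]) {
    case VHOST_PCI_CTRL_PEER_CONNECTION_ON:
        s->peer_connected = true;
        s->off_seen = false;
        return true;
    case VHOST_PCI_CTRL_PEER_CONNECTION_OFF:
        s->peer_connected = false;
        s->off_seen = true;
        return true;
    default:
        return false;
    }
}

/* Peer memory is off limits before an ON update and after an OFF update. */
bool vpnet_may_access_peer_memory(const struct vpnet_state *s)
{
    return s->peer_connected;
}

/* Unloading is allowed only once the device has reported OFF. */
bool vpnet_may_unload(const struct vpnet_state *s)
{
    return s->off_seen;
}
```

A driver built along these lines would send an encoded
VHOST_PCI_CTRL_PEER_CONNECTION_OFF message on the control transmitq when
it wants to disconnect, and would consult vpnet_may_unload() only after
the device has echoed OFF on the control receiveq.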