
[PATCHv5,3/3] vhost_net: a kernel-level virtio server

Message ID 20090827160750.GD23722@redhat.com (mailing list archive)
State New, archived

Commit Message

Michael S. Tsirkin Aug. 27, 2009, 4:07 p.m. UTC
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope:
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for migration)
- supports a memory table and not just an offset (needed for kvm)

Common virtio-related code has been put in a separate file, vhost.c, and
can be made into a separate module if/when more backends appear.  I used
Rusty's lguest.c as the source for developing this part: it supplied me
with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made about how the guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect a virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a
tun-like device. In this version I only support a raw socket as a
backend, which can be bound to e.g. an SR-IOV virtual function or a
macvlan device.  The backend is also configured by userspace, including
vlan/mac etc.
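
For reference, here is a rough sketch of the userspace setup sequence
(untested, error handling omitted; the /dev/vhost-net node name, the
single flat memory region and the helper below are just for
illustration; index 0 is the RX vq, index 1 the TX vq):

#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>
#include <linux/virtio_ring.h>

static void setup_vhost_net(void *guest_mem, size_t guest_size,
			    struct vring *rx, int kickfd, int callfd,
			    int rawsockfd)
{
	struct vhost_vring_state num = { .index = 0, .num = rx->num };
	struct vhost_vring_addr addr = { .index = 0 };
	struct vhost_vring_file file = { .index = 0 };
	struct vhost_memory *mem;
	int vhost = open("/dev/vhost-net", O_RDWR);

	/* Must come before any other vhost ioctl: take ownership. */
	ioctl(vhost, VHOST_SET_OWNER);

	/* Describe how guest physical addresses map into this process. */
	mem = calloc(1, sizeof *mem + sizeof mem->regions[0]);
	mem->nregions = 1;
	mem->regions[0].guest_phys_addr = 0;
	mem->regions[0].memory_size = guest_size;
	mem->regions[0].userspace_addr = (__u64)(unsigned long)guest_mem;
	ioctl(vhost, VHOST_SET_MEM_TABLE, mem);
	free(mem);

	/* Ring size and the userspace addresses of desc/avail/used. */
	ioctl(vhost, VHOST_SET_VRING_NUM, &num);
	addr.user_addr = (__u64)(unsigned long)rx->desc;
	ioctl(vhost, VHOST_SET_VRING_DESC, &addr);
	addr.user_addr = (__u64)(unsigned long)rx->avail;
	ioctl(vhost, VHOST_SET_VRING_AVAIL, &addr);
	addr.user_addr = (__u64)(unsigned long)rx->used;
	ioctl(vhost, VHOST_SET_VRING_USED, &addr);

	/* Both are eventfds: the guest kicks us through one,
	 * we signal interrupts back through the other. */
	file.fd = kickfd;
	ioctl(vhost, VHOST_SET_VRING_KICK, &file);
	file.fd = callfd;
	ioctl(vhost, VHOST_SET_VRING_CALL, &file);

	/* Attach the backend: a raw packet socket bound to a device. */
	ioctl(vhost, VHOST_NET_SET_SOCKET, &rawsockfd);
}

The same VRING ioctls are repeated for the TX vq, and VHOST_NET_SET_SOCKET
with fd -1 detaches the backend again (e.g. for migration).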

Status:
This works for me, and I haven't seen any crashes.
I have done some light benchmarking (with v4): compared to userspace, I
see improved latency (as I save up to 4 system calls per packet) but not
bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
a ping benchmark (where there's no TSO) throughput is also improved.

Features that I plan to look at in the future:
- tap support
- TSO
- interrupt mitigation
- zero copy

Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---
 MAINTAINERS                |   10 +
 arch/x86/kvm/Kconfig       |    1 +
 drivers/Makefile           |    1 +
 drivers/vhost/Kconfig      |   11 +
 drivers/vhost/Makefile     |    2 +
 drivers/vhost/net.c        |  475 ++++++++++++++++++++++++++++++
 drivers/vhost/vhost.c      |  688 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h      |  122 ++++++++
 include/linux/Kbuild       |    1 +
 include/linux/miscdevice.h |    1 +
 include/linux/vhost.h      |  101 +++++++
 11 files changed, 1413 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h

Comments

Ira Snyder Sept. 3, 2009, 6:39 p.m. UTC | #1
On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
> What it is: vhost net is a character device that can be used to reduce
> the number of system calls involved in virtio networking.
> Existing virtio net code is used in the guest without modification.
> 
> There's similarity with vringfd, with some differences and reduced scope
> - uses eventfd for signalling
> - structures can be moved around in memory at any time (good for migration)
> - support memory table and not just an offset (needed for kvm)
> 
> common virtio related code has been put in a separate file vhost.c and
> can be made into a separate module if/when more backends appear.  I used
> Rusty's lguest.c as the source for developing this part : this supplied
> me with witty comments I wouldn't be able to write myself.
> 
> What it is not: vhost net is not a bus, and not a generic new system
> call. No assumptions are made on how guest performs hypercalls.
> Userspace hypervisors are supported as well as kvm.
> 
> How it works: Basically, we connect virtio frontend (configured by
> userspace) to a backend. The backend could be a network device, or a
> tun-like device. In this version I only support raw socket as a backend,
> which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
> also configured by userspace, including vlan/mac etc.
> 
> Status:
> This works for me, and I haven't see any crashes.
> I have done some light benchmarking (with v4), compared to userspace, I
> see improved latency (as I save up to 4 system calls per packet) but not
> bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
> ping benchmark (where there's no TSO) troughput is also improved.
> 
> Features that I plan to look at in the future:
> - tap support
> - TSO
> - interrupt mitigation
> - zero copy
> 

Hello Michael,

I've started looking at vhost with the intention of using it over PCI to
connect physical machines together.

The part that I am struggling with the most is figuring out which parts
of the rings are in the host's memory, and which parts are in the
guest's memory.

If I understand everything correctly, the rings are all userspace
addresses, which means that they can be moved around in physical memory,
and get pushed out to swap. AFAIK, this is impossible to handle when
connecting two physical systems; you'd need the rings available in IO
memory (PCI memory), so you can ioreadXX() them instead. To the best of
my knowledge, I shouldn't be using copy_to_user() on an __iomem address.
Also, having them migrate around in memory would be a bad thing.

Also, I'm having trouble figuring out how the packet contents are
actually copied from one system to the other. Could you point this out
for me?

Is there somewhere I can find the userspace code (kvm, qemu, lguest,
etc.) needed for interacting with the vhost misc device so I can
get a better idea of how userspace is supposed to work? (Features
negotiation, etc.)

Thanks,
Ira

> Acked-by: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ---
>  MAINTAINERS                |   10 +
>  arch/x86/kvm/Kconfig       |    1 +
>  drivers/Makefile           |    1 +
>  drivers/vhost/Kconfig      |   11 +
>  drivers/vhost/Makefile     |    2 +
>  drivers/vhost/net.c        |  475 ++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.c      |  688 ++++++++++++++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h      |  122 ++++++++
>  include/linux/Kbuild       |    1 +
>  include/linux/miscdevice.h |    1 +
>  include/linux/vhost.h      |  101 +++++++
>  11 files changed, 1413 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vhost/Kconfig
>  create mode 100644 drivers/vhost/Makefile
>  create mode 100644 drivers/vhost/net.c
>  create mode 100644 drivers/vhost/vhost.c
>  create mode 100644 drivers/vhost/vhost.h
>  create mode 100644 include/linux/vhost.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index b1114cf..de4587f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5431,6 +5431,16 @@ S:	Maintained
>  F:	Documentation/filesystems/vfat.txt
>  F:	fs/fat/
>  
> +VIRTIO HOST (VHOST)
> +P:	Michael S. Tsirkin
> +M:	mst@redhat.com
> +L:	kvm@vger.kernel.org
> +L:	virtualization@lists.osdl.org
> +L:	netdev@vger.kernel.org
> +S:	Maintained
> +F:	drivers/vhost/
> +F:	include/linux/vhost.h
> +
>  VIA RHINE NETWORK DRIVER
>  M:	Roger Luethi <rl@hellgate.ch>
>  S:	Maintained
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index b84e571..94f44d9 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -64,6 +64,7 @@ config KVM_AMD
>  
>  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
>  # the virtualization menu.
> +source drivers/vhost/Kconfig
>  source drivers/lguest/Kconfig
>  source drivers/virtio/Kconfig
>  
> diff --git a/drivers/Makefile b/drivers/Makefile
> index bc4205d..1551ae1 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -105,6 +105,7 @@ obj-$(CONFIG_HID)		+= hid/
>  obj-$(CONFIG_PPC_PS3)		+= ps3/
>  obj-$(CONFIG_OF)		+= of/
>  obj-$(CONFIG_SSB)		+= ssb/
> +obj-$(CONFIG_VHOST_NET)		+= vhost/
>  obj-$(CONFIG_VIRTIO)		+= virtio/
>  obj-$(CONFIG_VLYNQ)		+= vlynq/
>  obj-$(CONFIG_STAGING)		+= staging/
> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> new file mode 100644
> index 0000000..d955406
> --- /dev/null
> +++ b/drivers/vhost/Kconfig
> @@ -0,0 +1,11 @@
> +config VHOST_NET
> +	tristate "Host kernel accelerator for virtio net"
> +	depends on NET && EVENTFD
> +	---help---
> +	  This kernel module can be loaded in host kernel to accelerate
> +	  guest networking with virtio_net. Not to be confused with virtio_net
> +	  module itself which needs to be loaded in guest kernel.
> +
> +	  To compile this driver as a module, choose M here: the module will
> +	  be called vhost_net.
> +
> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
> new file mode 100644
> index 0000000..72dd020
> --- /dev/null
> +++ b/drivers/vhost/Makefile
> @@ -0,0 +1,2 @@
> +obj-$(CONFIG_VHOST_NET) += vhost_net.o
> +vhost_net-y := vhost.o net.o
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> new file mode 100644
> index 0000000..2210eaa
> --- /dev/null
> +++ b/drivers/vhost/net.c
> @@ -0,0 +1,475 @@
> +/* Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + *
> + * virtio-net server in host kernel.
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/eventfd.h>
> +#include <linux/vhost.h>
> +#include <linux/virtio_net.h>
> +#include <linux/mmu_context.h>
> +#include <linux/miscdevice.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>
> +#include <linux/rcupdate.h>
> +#include <linux/file.h>
> +
> +#include <linux/net.h>
> +#include <linux/if_packet.h>
> +#include <linux/if_arp.h>
> +
> +#include <net/sock.h>
> +
> +#include "vhost.h"
> +
> +enum {
> +	VHOST_NET_VQ_RX = 0,
> +	VHOST_NET_VQ_TX = 1,
> +	VHOST_NET_VQ_MAX = 2,
> +};
> +
> +struct vhost_net {
> +	struct vhost_dev dev;
> +	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> +	/* We use a kind of RCU to access sock pointer.
> +	 * All readers access it from workqueue, which makes it possible to
> +	 * flush the workqueue instead of synchronize_rcu. Therefore readers do
> +	 * not need to call rcu_read_lock/rcu_read_unlock: the beginning of
> +	 * work item execution acts instead of rcu_read_lock() and the end of
> +	 * work item execution acts instead of rcu_read_unlock().
> +	 * Writers use device mutex. */
> +	struct socket *sock;
> +	struct vhost_poll poll[VHOST_NET_VQ_MAX];
> +};
> +
> +/* Pop first len bytes from iovec. Return number of segments used. */
> +static int move_iovec_hdr(struct iovec *from, struct iovec *to,
> +			  size_t len, int iov_count)
> +{
> +	int seg = 0;
> +	size_t size;
> +	while (len && seg < iov_count) {
> +		size = min(from->iov_len, len);
> +		to->iov_base = from->iov_base;
> +		to->iov_len = size;
> +		from->iov_len -= size;
> +		from->iov_base += size;
> +		len -= size;
> +		++from;
> +		++to;
> +		++seg;
> +	}
> +	return seg;
> +}
> +
> +/* Expects to be always run from workqueue - which acts as
> + * read-side critical section for our kind of RCU. */
> +static void handle_tx(struct vhost_net *net)
> +{
> +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> +	unsigned head, out, in, s;
> +	struct msghdr msg = {
> +		.msg_name = NULL,
> +		.msg_namelen = 0,
> +		.msg_control = NULL,
> +		.msg_controllen = 0,
> +		.msg_iov = vq->iov,
> +		.msg_flags = MSG_DONTWAIT,
> +	};
> +	size_t len;
> +	int err;
> +	struct socket *sock = rcu_dereference(net->sock);
> +	if (!sock || !sock_writeable(sock->sk))
> +		return;
> +
> +	use_mm(net->dev.mm);
> +	mutex_lock(&vq->mutex);
> +	for (;;) {
> +		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
> +		/* Nothing new?  Wait for eventfd to tell us they refilled. */
> +		if (head == vq->num)
> +			break;
> +		if (in) {
> +			vq_err(vq, "Unexpected descriptor format for TX: "
> +			       "out %d, in %d\n", out, in);
> +			break;
> +		}
> +		/* Skip header. TODO: support TSO. */
> +		s = move_iovec_hdr(vq->iov, vq->hdr,
> +				   sizeof(struct virtio_net_hdr), out);
> +		msg.msg_iovlen = out;
> +		len = iov_length(vq->iov, out);
> +		/* Sanity check */
> +		if (!len) {
> +			vq_err(vq, "Unexpected header len for TX: "
> +			       "%ld expected %zd\n",
> +			       iov_length(vq->hdr, s),
> +			       sizeof(struct virtio_net_hdr));
> +			break;
> +		}
> +		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> +		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> +		if (err < 0) {
> +			vhost_discard_vq_desc(vq);
> +			break;
> +		}
> +		if (err != len)
> +			pr_err("Truncated TX packet: "
> +			       " len %d != %zd\n", err, len);
> +		vhost_add_used_and_trigger(&net->dev, vq, head, 0);
> +	}
> +
> +	mutex_unlock(&vq->mutex);
> +	unuse_mm(net->dev.mm);
> +}
> +
> +/* Expects to be always run from workqueue - which acts as
> + * read-side critical section for our kind of RCU. */
> +static void handle_rx(struct vhost_net *net)
> +{
> +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> +	unsigned head, out, in, s;
> +	struct msghdr msg = {
> +		.msg_name = NULL,
> +		.msg_namelen = 0,
> +		.msg_control = NULL, /* FIXME: get and handle RX aux data. */
> +		.msg_controllen = 0,
> +		.msg_iov = vq->iov,
> +		.msg_flags = MSG_DONTWAIT,
> +	};
> +
> +	struct virtio_net_hdr hdr = {
> +		.flags = 0,
> +		.gso_type = VIRTIO_NET_HDR_GSO_NONE
> +	};
> +
> +	size_t len;
> +	int err;
> +	struct socket *sock = rcu_dereference(net->sock);
> +	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> +		return;
> +
> +	use_mm(net->dev.mm);
> +	mutex_lock(&vq->mutex);
> +	vhost_no_notify(vq);
> +
> +	for (;;) {
> +		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
> +		/* OK, now we need to know about added descriptors. */
> +		if (head == vq->num && vhost_notify(vq))
> +			/* They could have slipped one in as we were doing that:
> +			 * check again. */
> +			continue;
> +		/* Nothing new?  Wait for eventfd to tell us they refilled. */
> +		if (head == vq->num)
> +			break;
> +		/* We don't need to be notified again. */
> +		vhost_no_notify(vq);
> +		if (out) {
> +			vq_err(vq, "Unexpected descriptor format for RX: "
> +			       "out %d, in %d\n",
> +			       out, in);
> +			break;
> +		}
> +		/* Skip header. TODO: support TSO/mergeable rx buffers. */
> +		s = move_iovec_hdr(vq->iov, vq->hdr, sizeof hdr, in);
> +		msg.msg_iovlen = in;
> +		len = iov_length(vq->iov, in);
> +		/* Sanity check */
> +		if (!len) {
> +			vq_err(vq, "Unexpected header len for RX: "
> +			       "%zd expected %zd\n",
> +			       iov_length(vq->hdr, s), sizeof hdr);
> +			break;
> +		}
> +		err = sock->ops->recvmsg(NULL, sock, &msg,
> +					 len, MSG_DONTWAIT | MSG_TRUNC);
> +		/* TODO: Check specific error and bomb out unless EAGAIN? */
> +		if (err < 0) {
> +			vhost_discard_vq_desc(vq);
> +			break;
> +		}
> +		/* TODO: Should check and handle checksum. */
> +		if (err > len) {
> +			pr_err("Discarded truncated rx packet: "
> +			       " len %d > %zd\n", err, len);
> +			vhost_discard_vq_desc(vq);
> +			continue;
> +		}
> +		len = err;
> +		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, sizeof hdr);
> +		if (err) {
> +			vq_err(vq, "Unable to write vnet_hdr at addr %p: %d\n",
> +			       vq->iov->iov_base, err);
> +			break;
> +		}
> +		vhost_add_used_and_trigger(&net->dev, vq, head,
> +					   len + sizeof hdr);
> +	}
> +
> +	mutex_unlock(&vq->mutex);
> +	unuse_mm(net->dev.mm);
> +}
> +
> +static void handle_tx_kick(struct work_struct *work)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct vhost_net *net;
> +	vq = container_of(work, struct vhost_virtqueue, poll.work);
> +	net = container_of(vq->dev, struct vhost_net, dev);
> +	handle_tx(net);
> +}
> +
> +static void handle_rx_kick(struct work_struct *work)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct vhost_net *net;
> +	vq = container_of(work, struct vhost_virtqueue, poll.work);
> +	net = container_of(vq->dev, struct vhost_net, dev);
> +	handle_rx(net);
> +}
> +
> +static void handle_tx_net(struct work_struct *work)
> +{
> +	struct vhost_net *net;
> +	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_TX].work);
> +	handle_tx(net);
> +}
> +
> +static void handle_rx_net(struct work_struct *work)
> +{
> +	struct vhost_net *net;
> +	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_RX].work);
> +	handle_rx(net);
> +}
> +
> +static int vhost_net_open(struct inode *inode, struct file *f)
> +{
> +	struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
> +	int r;
> +	if (!n)
> +		return -ENOMEM;
> +	f->private_data = n;
> +	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
> +	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
> +	r = vhost_dev_init(&n->dev, n->vqs, VHOST_NET_VQ_MAX);
> +	if (r < 0) {
> +		kfree(n);
> +		return r;
> +	}
> +
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> +	return 0;
> +}
> +
> +static struct socket *vhost_net_stop(struct vhost_net *n)
> +{
> +	struct socket *sock = n->sock;
> +	rcu_assign_pointer(n->sock, NULL);
> +	if (sock) {
> +		vhost_poll_flush(n->poll + VHOST_NET_VQ_TX);
> +		vhost_poll_flush(n->poll + VHOST_NET_VQ_RX);
> +	}
> +	return sock;
> +}
> +
> +static int vhost_net_release(struct inode *inode, struct file *f)
> +{
> +	struct vhost_net *n = f->private_data;
> +	struct socket *sock;
> +
> +	sock = vhost_net_stop(n);
> +	vhost_dev_cleanup(&n->dev);
> +	if (sock)
> +		fput(sock->file);
> +	kfree(n);
> +	return 0;
> +}
> +
> +static void vhost_net_flush(struct vhost_net *n)
> +{
> +	vhost_poll_flush(n->poll + VHOST_NET_VQ_TX);
> +	vhost_poll_flush(n->poll + VHOST_NET_VQ_RX);
> +	vhost_poll_flush(&n->dev.vqs[VHOST_NET_VQ_TX].poll);
> +	vhost_poll_flush(&n->dev.vqs[VHOST_NET_VQ_RX].poll);
> +}
> +
> +static long vhost_net_set_socket(struct vhost_net *n, int fd)
> +{
> +	struct {
> +		struct sockaddr_ll sa;
> +		char  buf[MAX_ADDR_LEN];
> +	} uaddr;
> +	struct socket *sock, *oldsock = NULL;
> +	int uaddr_len = sizeof uaddr, r;
> +
> +	mutex_lock(&n->dev.mutex);
> +	r = vhost_dev_check_owner(&n->dev);
> +	if (r)
> +		goto done;
> +
> +	if (fd == -1) {
> +		/* Disconnect from socket and device. */
> +		oldsock = vhost_net_stop(n);
> +		goto done;
> +	}
> +
> +	sock = sockfd_lookup(fd, &r);
> +	if (!sock) {
> +		r = -ENOTSOCK;
> +		goto done;
> +	}
> +
> +	/* Parameter checking */
> +	if (sock->sk->sk_type != SOCK_RAW) {
> +		r = -ESOCKTNOSUPPORT;
> +		goto done;
> +	}
> +
> +	r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
> +			       &uaddr_len, 0);
> +	if (r)
> +		goto done;
> +
> +	if (uaddr.sa.sll_family != AF_PACKET) {
> +		r = -EPFNOSUPPORT;
> +		goto done;
> +	}
> +
> +	/* start polling new socket */
> +	oldsock = n->sock;
> +	if (sock == oldsock)
> +		goto done;
> +
> +	if (oldsock) {
> +		vhost_poll_stop(n->poll + VHOST_NET_VQ_TX);
> +		vhost_poll_stop(n->poll + VHOST_NET_VQ_RX);
> +	}
> +	rcu_assign_pointer(n->sock, sock);
> +	vhost_poll_start(n->poll + VHOST_NET_VQ_TX, sock->file);
> +	vhost_poll_start(n->poll + VHOST_NET_VQ_RX, sock->file);
> +done:
> +	mutex_unlock(&n->dev.mutex);
> +	if (oldsock) {
> +		vhost_net_flush(n);
> +		fput(oldsock->file);
> +	}
> +	return r;
> +}
> +
> +static long vhost_net_reset_owner(struct vhost_net *n)
> +{
> +	struct socket *sock = NULL;
> +	long r;
> +	mutex_lock(&n->dev.mutex);
> +	r = vhost_dev_check_owner(&n->dev);
> +	if (r)
> +		goto done;
> +	sock = vhost_net_stop(n);
> +	r = vhost_dev_reset_owner(&n->dev);
> +done:
> +	mutex_unlock(&n->dev.mutex);
> +	if (sock)
> +		fput(sock->file);
> +	return r;
> +}
> +
> +static void vhost_net_set_features(struct vhost_net *n, u64 features)
> +{
> +	mutex_lock(&n->dev.mutex);
> +	n->dev.acked_features = features;
> +	mutex_unlock(&n->dev.mutex);
> +	vhost_net_flush(n);
> +}
> +
> +static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
> +			    unsigned long arg)
> +{
> +	struct vhost_net *n = f->private_data;
> +	void __user *argp = (void __user *)arg;
> +	u32 __user *featurep = argp;
> +	int __user *fdp = argp;
> +	u64 features;
> +	int fd, r;
> +	switch (ioctl) {
> +	case VHOST_NET_SET_SOCKET:
> +		r = get_user(fd, fdp);
> +		if (r < 0)
> +			return r;
> +		return vhost_net_set_socket(n, fd);
> +	case VHOST_GET_FEATURES:
> +		features = VHOST_FEATURES;
> +		return put_user(features, featurep);
> +	case VHOST_ACK_FEATURES:
> +		r = get_user(features, featurep);
> +		/* No features for now */
> +		if (r < 0)
> +			return r;
> +		if (features & ~VHOST_FEATURES)
> +			return -EOPNOTSUPP;
> +		vhost_net_set_features(n, features);
> +		return 0;
> +	case VHOST_RESET_OWNER:
> +		return vhost_net_reset_owner(n);
> +	default:
> +		return vhost_dev_ioctl(&n->dev, ioctl, arg);
> +	}
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl,
> +				   unsigned long arg)
> +{
> +	return vhost_net_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
> +}
> +#endif
> +
> +static const struct file_operations vhost_net_fops = {
> +	.owner          = THIS_MODULE,
> +	.release        = vhost_net_release,
> +	.unlocked_ioctl = vhost_net_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl   = vhost_net_compat_ioctl,
> +#endif
> +	.open           = vhost_net_open,
> +};
> +
> +static struct miscdevice vhost_net_misc = {
> +	VHOST_NET_MINOR,
> +	"vhost-net",
> +	&vhost_net_fops,
> +};
> +
> +int vhost_net_init(void)
> +{
> +	int r = vhost_init();
> +	if (r)
> +		goto err_init;
> +	r = misc_register(&vhost_net_misc);
> +	if (r)
> +		goto err_reg;
> +	return 0;
> +err_reg:
> +	vhost_cleanup();
> +err_init:
> +	return r;
> +
> +}
> +module_init(vhost_net_init);
> +
> +void vhost_net_exit(void)
> +{
> +	misc_deregister(&vhost_net_misc);
> +	vhost_cleanup();
> +}
> +module_exit(vhost_net_exit);
> +
> +MODULE_VERSION("0.0.1");
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Michael S. Tsirkin");
> +MODULE_DESCRIPTION("Host kernel accelerator for virtio net");
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> new file mode 100644
> index 0000000..6925cc1
> --- /dev/null
> +++ b/drivers/vhost/vhost.c
> @@ -0,0 +1,688 @@
> +/* Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2006 Rusty Russell IBM Corporation
> + *
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *
> + * Inspiration, some code, and most witty comments come from
> + * Documentation/lguest/lguest.c, by Rusty Russell
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + *
> + * Generic code for virtio server in host kernel.
> + */
> +
> +#include <linux/eventfd.h>
> +#include <linux/vhost.h>
> +#include <linux/virtio_net.h>
> +#include <linux/mm.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>
> +#include <linux/rcupdate.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +
> +#include <linux/net.h>
> +#include <linux/if_packet.h>
> +#include <linux/if_arp.h>
> +
> +#include <net/sock.h>
> +
> +#include "vhost.h"
> +
> +enum {
> +	VHOST_MEMORY_MAX_NREGIONS = 64,
> +};
> +
> +static struct workqueue_struct *vhost_workqueue;
> +
> +static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
> +			    poll_table *pt)
> +{
> +	struct vhost_poll *poll;
> +	poll = container_of(pt, struct vhost_poll, table);
> +
> +	poll->wqh = wqh;
> +	add_wait_queue(wqh, &poll->wait);
> +}
> +
> +static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
> +			     void *key)
> +{
> +	struct vhost_poll *poll;
> +	poll = container_of(wait, struct vhost_poll, wait);
> +	if (!((unsigned long)key & poll->mask))
> +		return 0;
> +
> +	queue_work(vhost_workqueue, &poll->work);
> +	return 0;
> +}
> +
> +/* Init poll structure */
> +void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
> +		     unsigned long mask)
> +{
> +	INIT_WORK(&poll->work, func);
> +	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
> +	init_poll_funcptr(&poll->table, vhost_poll_func);
> +	poll->mask = mask;
> +}
> +
> +/* Start polling a file. We add ourselves to file's wait queue. The caller must
> + * keep a reference to a file until after vhost_poll_stop is called. */
> +void vhost_poll_start(struct vhost_poll *poll, struct file *file)
> +{
> +	unsigned long mask;
> +	mask = file->f_op->poll(file, &poll->table);
> +	if (mask)
> +		vhost_poll_wakeup(&poll->wait, 0, 0, (void *)mask);
> +}
> +
> +/* Stop polling a file. After this function returns, it becomes safe to drop the
> + * file reference. You must also flush afterwards. */
> +void vhost_poll_stop(struct vhost_poll *poll)
> +{
> +	remove_wait_queue(poll->wqh, &poll->wait);
> +}
> +
> +/* Flush any work that has been scheduled. When calling this, don't hold any
> + * locks that are also used by the callback. */
> +void vhost_poll_flush(struct vhost_poll *poll)
> +{
> +	flush_work(&poll->work);
> +}
> +
> +long vhost_dev_init(struct vhost_dev *dev,
> +		    struct vhost_virtqueue *vqs, int nvqs)
> +{
> +	int i;
> +	dev->vqs = vqs;
> +	dev->nvqs = nvqs;
> +	mutex_init(&dev->mutex);
> +
> +	for (i = 0; i < dev->nvqs; ++i) {
> +		dev->vqs[i].dev = dev;
> +		mutex_init(&dev->vqs[i].mutex);
> +		if (dev->vqs[i].handle_kick)
> +			vhost_poll_init(&dev->vqs[i].poll,
> +					dev->vqs[i].handle_kick,
> +					POLLIN);
> +	}
> +	return 0;
> +}
> +
> +/* Caller should have device mutex */
> +long vhost_dev_check_owner(struct vhost_dev *dev)
> +{
> +	/* Are you the owner? If not, I don't think you mean to do that */
> +	return dev->mm == current->mm ? 0 : -EPERM;
> +}
> +
> +/* Caller should have device mutex */
> +static long vhost_dev_set_owner(struct vhost_dev *dev)
> +{
> +	/* Is there an owner already? */
> +	if (dev->mm)
> +		return -EBUSY;
> +	/* No owner, become one */
> +	dev->mm = get_task_mm(current);
> +	return 0;
> +}
> +
> +/* Caller should have device mutex */
> +long vhost_dev_reset_owner(struct vhost_dev *dev)
> +{
> +	struct vhost_memory *memory;
> +
> +	/* Restore memory to default 1:1 mapping. */
> +	memory = kmalloc(offsetof(struct vhost_memory, regions) +
> +			 2 * sizeof *memory->regions, GFP_KERNEL);
> +	if (!memory)
> +		return -ENOMEM;
> +
> +	vhost_dev_cleanup(dev);
> +
> +	memory->nregions = 2;
> +	memory->regions[0].guest_phys_addr = 1;
> +	memory->regions[0].userspace_addr = 1;
> +	memory->regions[0].memory_size = ~0ULL;
> +	memory->regions[1].guest_phys_addr = 0;
> +	memory->regions[1].userspace_addr = 0;
> +	memory->regions[1].memory_size = 1;
> +	dev->memory = memory;
> +	return 0;
> +}
> +
> +/* Caller should have device mutex */
> +void vhost_dev_cleanup(struct vhost_dev *dev)
> +{
> +	int i;
> +	for (i = 0; i < dev->nvqs; ++i) {
> +		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
> +			vhost_poll_stop(&dev->vqs[i].poll);
> +			vhost_poll_flush(&dev->vqs[i].poll);
> +		}
> +		if (dev->vqs[i].error_ctx)
> +			eventfd_ctx_put(dev->vqs[i].error_ctx);
> +		if (dev->vqs[i].error)
> +			fput(dev->vqs[i].error);
> +		if (dev->vqs[i].kick)
> +			fput(dev->vqs[i].kick);
> +		if (dev->vqs[i].call_ctx)
> +			eventfd_ctx_put(dev->vqs[i].call_ctx);
> +		if (dev->vqs[i].call)
> +			fput(dev->vqs[i].call);
> +		dev->vqs[i].error_ctx = NULL;
> +		dev->vqs[i].error = NULL;
> +		dev->vqs[i].kick = NULL;
> +		dev->vqs[i].call_ctx = NULL;
> +		dev->vqs[i].call = NULL;
> +	}
> +	/* No one will access memory at this point */
> +	kfree(dev->memory);
> +	dev->memory = NULL;
> +	if (dev->mm)
> +		mmput(dev->mm);
> +	dev->mm = NULL;
> +}
> +
> +static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
> +{
> +	struct vhost_memory mem, *newmem, *oldmem;
> +	unsigned long size = offsetof(struct vhost_memory, regions);
> +	long r;
> +	r = copy_from_user(&mem, m, size);
> +	if (r)
> +		return r;
> +	if (mem.padding)
> +		return -EOPNOTSUPP;
> +	if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
> +		return -E2BIG;
> +	newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
> +	if (!newmem)
> +		return -ENOMEM;
> +
> +	memcpy(newmem, &mem, size);
> +	r = copy_from_user(newmem->regions, m->regions,
> +			   mem.nregions * sizeof *m->regions);
> +	if (r) {
> +		kfree(newmem);
> +		return r;
> +	}
> +	oldmem = d->memory;
> +	rcu_assign_pointer(d->memory, newmem);
> +	synchronize_rcu();
> +	kfree(oldmem);
> +	return 0;
> +}
> +
> +static int init_used(struct vhost_virtqueue *vq)
> +{
> +	int r = put_user(vq->used_flags, &vq->used->flags);
> +	if (r)
> +		return r;
> +	return get_user(vq->last_used_idx, &vq->used->idx);
> +}
> +
> +static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
> +{
> +	struct file *eventfp, *filep = NULL,
> +		    *pollstart = NULL, *pollstop = NULL;
> +	struct eventfd_ctx *ctx = NULL;
> +	u32 __user *idxp = argp;
> +	struct vhost_virtqueue *vq;
> +	struct vhost_vring_state s;
> +	struct vhost_vring_file f;
> +	struct vhost_vring_addr a;
> +	u32 idx;
> +	long r;
> +
> +	r = get_user(idx, idxp);
> +	if (r < 0)
> +		return r;
> +	if (idx >= d->nvqs)
> +		return -ENOBUFS;
> +
> +	vq = d->vqs + idx;
> +
> +	mutex_lock(&vq->mutex);
> +
> +	switch (ioctl) {
> +	case VHOST_SET_VRING_NUM:
> +		r = copy_from_user(&s, argp, sizeof s);
> +		if (r < 0)
> +			break;
> +		if (s.num > 0xffff) {
> +			r = -EINVAL;
> +			break;
> +		}
> +		vq->num = s.num;
> +		break;
> +	case VHOST_SET_VRING_BASE:
> +		r = copy_from_user(&s, argp, sizeof s);
> +		if (r < 0)
> +			break;
> +		if (s.num > 0xffff) {
> +			r = -EINVAL;
> +			break;
> +		}
> +		vq->avail_idx = vq->last_avail_idx = s.num;
> +		break;
> +	case VHOST_GET_VRING_BASE:
> +		s.index = idx;
> +		s.num = vq->last_avail_idx;
> +		r = copy_to_user(argp, &s, sizeof s);
> +		break;
> +	case VHOST_SET_VRING_DESC:
> +		r = copy_from_user(&a, argp, sizeof a);
> +		if (r < 0)
> +			break;
> +		if (a.padding) {
> +			r = -EOPNOTSUPP;
> +			break;
> +		}
> +		if ((u64)(long)a.user_addr != a.user_addr) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		vq->desc = (void __user *)(long)a.user_addr;
> +		break;
> +	case VHOST_SET_VRING_AVAIL:
> +		r = copy_from_user(&a, argp, sizeof a);
> +		if (r < 0)
> +			break;
> +		if (a.padding) {
> +			r = -EOPNOTSUPP;
> +			break;
> +		}
> +		if ((u64)(long)a.user_addr != a.user_addr) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		vq->avail = (void __user *)(long)a.user_addr;
> +		/* Forget the cached index value. */
> +		vq->avail_idx = vq->last_avail_idx;
> +		break;
> +	case VHOST_SET_VRING_USED:
> +		r = copy_from_user(&a, argp, sizeof a);
> +		if (r < 0)
> +			break;
> +		if (a.padding) {
> +			r = -EOPNOTSUPP;
> +			break;
> +		}
> +		if ((u64)(long)a.user_addr != a.user_addr) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		vq->used = (void __user *)(long)a.user_addr;
> +		r = init_used(vq);
> +		if (r)
> +			break;
> +		break;
> +	case VHOST_SET_VRING_KICK:
> +		r = copy_from_user(&f, argp, sizeof f);
> +		if (r < 0)
> +			break;
> +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> +		if (IS_ERR(eventfp)) {
> +			r = PTR_ERR(eventfp);
> +			break;
> +		}
> +		if (eventfp != vq->kick) {
> +			pollstop = filep = vq->kick;
> +			pollstart = vq->kick = eventfp;
> +		} else
> +			filep = eventfp;
> +		break;
> +	case VHOST_SET_VRING_CALL:
> +		r = copy_from_user(&f, argp, sizeof f);
> +		if (r < 0)
> +			break;
> +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> +		if (IS_ERR(eventfp)) {
> +			r = PTR_ERR(eventfp);
> +			break;
> +		}
> +		if (eventfp != vq->call) {
> +			filep = vq->call;
> +			ctx = vq->call_ctx;
> +			vq->call = eventfp;
> +			vq->call_ctx = eventfp ?
> +				eventfd_ctx_fileget(eventfp) : NULL;
> +		} else
> +			filep = eventfp;
> +		break;
> +	case VHOST_SET_VRING_ERR:
> +		r = copy_from_user(&f, argp, sizeof f);
> +		if (r < 0)
> +			break;
> +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> +		if (IS_ERR(eventfp)) {
> +			r = PTR_ERR(eventfp);
> +			break;
> +		}
> +		if (eventfp != vq->error) {
> +			filep = vq->error;
> +			vq->error = eventfp;
> +			ctx = vq->error_ctx;
> +			vq->error_ctx = eventfp ?
> +				eventfd_ctx_fileget(eventfp) : NULL;
> +		} else
> +			filep = eventfp;
> +		break;
> +	default:
> +		r = -ENOIOCTLCMD;
> +	}
> +
> +	if (pollstop && vq->handle_kick)
> +		vhost_poll_stop(&vq->poll);
> +
> +	if (ctx)
> +		eventfd_ctx_put(ctx);
> +	if (filep)
> +		fput(filep);
> +
> +	if (pollstart && vq->handle_kick)
> +		vhost_poll_start(&vq->poll, vq->kick);
> +
> +	mutex_unlock(&vq->mutex);
> +
> +	if (pollstop && vq->handle_kick)
> +		vhost_poll_flush(&vq->poll);
> +	return r;
> +}
> +
> +long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
> +{
> +	void __user *argp = (void __user *)arg;
> +	long r;
> +
> +	mutex_lock(&d->mutex);
> +	/* If you are not the owner, you can become one */
> +	if (ioctl == VHOST_SET_OWNER) {
> +		r = vhost_dev_set_owner(d);
> +		goto done;
> +	}
> +
> +	/* You must be the owner to do anything else */
> +	r = vhost_dev_check_owner(d);
> +	if (r)
> +		goto done;
> +
> +	switch (ioctl) {
> +	case VHOST_SET_MEM_TABLE:
> +		r = vhost_set_memory(d, argp);
> +		break;
> +	default:
> +		r = vhost_set_vring(d, ioctl, argp);
> +		break;
> +	}
> +done:
> +	mutex_unlock(&d->mutex);
> +	return r;
> +}
> +
> +static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
> +						     __u64 addr, __u32 len)
> +{
> +	struct vhost_memory_region *reg;
> +	int i;
> +	/* linear search is not brilliant, but we really have on the order of 6
> +	 * regions in practice */
> +	for (i = 0; i < mem->nregions; ++i) {
> +		reg = mem->regions + i;
> +		if (reg->guest_phys_addr <= addr &&
> +		    reg->guest_phys_addr + reg->memory_size - 1 >= addr)
> +			return reg;
> +	}
> +	return NULL;
> +}
> +
> +int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
> +		   struct iovec iov[], int iov_size)
> +{
> +	const struct vhost_memory_region *reg;
> +	struct vhost_memory *mem;
> +	struct iovec *_iov;
> +	u64 s = 0;
> +	int ret = 0;
> +
> +	rcu_read_lock();
> +
> +	mem = rcu_dereference(dev->memory);
> +	while ((u64)len > s) {
> +		u64 size;
> +		if (ret >= iov_size) {
> +			ret = -ENOBUFS;
> +			break;
> +		}
> +		reg = find_region(mem, addr, len);
> +		if (!reg) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +		_iov = iov + ret;
> +		size = reg->memory_size - addr + reg->guest_phys_addr;
> +		_iov->iov_len = min((u64)len, size);
> +		_iov->iov_base = (void *)
> +			(reg->userspace_addr + addr - reg->guest_phys_addr);
> +		s += size;
> +		addr += size;
> +		++ret;
> +	}
> +
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +/* Each buffer in the virtqueues is actually a chain of descriptors.  This
> + * function returns the next descriptor in the chain, or vq->vring.num if we're
> + * at the end. */
> +static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
> +{
> +	unsigned int next;
> +
> +	/* If this descriptor says it doesn't chain, we're done. */
> +	if (!(desc->flags & VRING_DESC_F_NEXT))
> +		return vq->num;
> +
> +	/* Check they're not leading us off end of descriptors. */
> +	next = desc->next;
> +	/* Make sure compiler knows to grab that: we don't want it changing! */
> +	/* We will use the result as an index in an array, so most
> +	 * architectures only need a compiler barrier here. */
> +	read_barrier_depends();
> +
> +	if (next >= vq->num) {
> +		vq_err(vq, "Desc next is %u > %u", next, vq->num);
> +		return vq->num;
> +	}
> +
> +	return next;
> +}
> +
> +/* This looks in the virtqueue and for the first available buffer, and converts
> + * it to an iovec for convenient access.  Since descriptors consist of some
> + * number of output then some number of input descriptors, it's actually two
> + * iovecs, but we pack them into one and note how many of each there were.
> + *
> + * This function returns the descriptor number found, or vq->num (which
> + * is never a valid descriptor number) if none was found. */
> +unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
> +			   struct iovec iov[],
> +			   unsigned int *out_num, unsigned int *in_num)
> +{
> +	struct vring_desc desc;
> +	unsigned int i, head;
> +	u16 last_avail_idx;
> +	int ret;
> +
> +	/* Check it isn't doing very strange things with descriptor numbers. */
> +	last_avail_idx = vq->last_avail_idx;
> +	if (get_user(vq->avail_idx, &vq->avail->idx)) {
> +		vq_err(vq, "Failed to access avail idx at %p\n",
> +		       &vq->avail->idx);
> +		return vq->num;
> +	}
> +
> +	if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
> +		vq_err(vq, "Guest moved used index from %u to %u",
> +		       last_avail_idx, vq->avail_idx);
> +		return vq->num;
> +	}
> +
> +	/* If there's nothing new since last we looked, return invalid. */
> +	if (vq->avail_idx == last_avail_idx)
> +		return vq->num;
> +
> +	/* Grab the next descriptor number they're advertising, and increment
> +	 * the index we've seen. */
> +	if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
> +		vq_err(vq, "Failed to read head: idx %d address %p\n",
> +		       last_avail_idx,
> +		       &vq->avail->ring[last_avail_idx % vq->num]);
> +		return vq->num;
> +	}
> +
> +	/* If their number is silly, that's an error. */
> +	if (head >= vq->num) {
> +		vq_err(vq, "Guest says index %u > %u is available",
> +		       head, vq->num);
> +		return vq->num;
> +	}
> +
> +	vq->last_avail_idx++;
> +
> +	/* When we start there are none of either input nor output. */
> +	*out_num = *in_num = 0;
> +
> +	i = head;
> +	do {
> +		unsigned iov_count = *in_num + *out_num;
> +		if (copy_from_user(&desc, vq->desc + i, sizeof desc)) {
> +			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
> +			       i, vq->desc + i);
> +			return vq->num;
> +		}
> +		ret = translate_desc(dev, desc.addr, desc.len, iov + iov_count,
> +				     VHOST_NET_MAX_SG - iov_count);
> +		if (ret < 0) {
> +			vq_err(vq, "Translation failure %d descriptor idx %d\n",
> +			       ret, i);
> +			return vq->num;
> +		}
> +		/* If this is an input descriptor, increment that count. */
> +		if (desc.flags & VRING_DESC_F_WRITE)
> +			*in_num += ret;
> +		else {
> +			/* If it's an output descriptor, they're all supposed
> +			 * to come before any input descriptors. */
> +			if (*in_num) {
> +				vq_err(vq, "Descriptor has out after in: "
> +				       "idx %d\n", i);
> +				return vq->num;
> +			}
> +			*out_num += ret;
> +		}
> +	} while ((i = next_desc(vq, &desc)) != vq->num);
> +	return head;
> +}
> +
> +/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
> +void vhost_discard_vq_desc(struct vhost_virtqueue *vq)
> +{
> +	vq->last_avail_idx--;
> +}
> +
> +/* After we've used one of their buffers, we tell them about it.  We'll then
> + * want to send them an interrupt, using vq->call. */
> +int vhost_add_used(struct vhost_virtqueue *vq,
> +			  unsigned int head, int len)
> +{
> +	struct vring_used_elem *used;
> +
> +	/* The virtqueue contains a ring of used buffers.  Get a pointer to the
> +	 * next entry in that used ring. */
> +	used = &vq->used->ring[vq->last_used_idx % vq->num];
> +	if (put_user(head, &used->id)) {
> +		vq_err(vq, "Failed to write used id");
> +		return -EFAULT;
> +	}
> +	if (put_user(len, &used->len)) {
> +		vq_err(vq, "Failed to write used len");
> +		return -EFAULT;
> +	}
> +	/* Make sure buffer is written before we update index. */
> +	wmb();
> +	if (put_user(vq->last_used_idx + 1, &vq->used->idx)) {
> +		vq_err(vq, "Failed to increment used idx");
> +		return -EFAULT;
> +	}
> +	vq->last_used_idx++;
> +	return 0;
> +}
> +
> +/* This actually sends the interrupt for this virtqueue */
> +void vhost_trigger_irq(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> +{
> +	__u16 flags = 0;
> +	if (get_user(flags, &vq->avail->flags)) {
> +		vq_err(vq, "Failed to get flags");
> +		return;
> +	}
> +
> +	/* If they don't want an interrupt, don't send one, unless empty. */
> +	if ((flags & VRING_AVAIL_F_NO_INTERRUPT) &&
> +	    (!vhost_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY) ||
> +	     vq->avail_idx != vq->last_avail_idx))
> +		return;
> +
> +	/* Send the Guest an interrupt tell them we used something up. */
> +	if (vq->call_ctx)
> +		eventfd_signal(vq->call_ctx, 1);
> +}
> +
> +/* And here's the combo meal deal.  Supersize me! */
> +void vhost_add_used_and_trigger(struct vhost_dev *dev,
> +				struct vhost_virtqueue *vq,
> +				unsigned int head, int len)
> +{
> +	vhost_add_used(vq, head, len);
> +	vhost_trigger_irq(dev, vq);
> +}
> +
> +/* OK, now we need to know about added descriptors. */
> +bool vhost_notify(struct vhost_virtqueue *vq)
> +{
> +	int r;
> +	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
> +		return false;
> +	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> +	r = put_user(vq->used_flags, &vq->used->flags);
> +	if (r)
> +		vq_err(vq, "Failed to disable notification: %d\n", r);
> +	/* They could have slipped one in as we were doing that: make
> +	 * sure it's written, tell caller it needs to check again. */
> +	mb();
> +	return true;
> +}
> +
> +/* We don't need to be notified again. */
> +void vhost_no_notify(struct vhost_virtqueue *vq)
> +{
> +	int r;
> +	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
> +		return;
> +	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> +	r = put_user(vq->used_flags, &vq->used->flags);
> +	if (r)
> +		vq_err(vq, "Failed to enable notification: %d\n", r);
> +}
> +
> +int vhost_init(void)
> +{
> +	vhost_workqueue = create_workqueue("vhost");
> +	if (!vhost_workqueue)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
> +void vhost_cleanup(void)
> +{
> +	destroy_workqueue(vhost_workqueue);
> +}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> new file mode 100644
> index 0000000..8e13d06
> --- /dev/null
> +++ b/drivers/vhost/vhost.h
> @@ -0,0 +1,122 @@
> +#ifndef _VHOST_H
> +#define _VHOST_H
> +
> +#include <linux/eventfd.h>
> +#include <linux/vhost.h>
> +#include <linux/mm.h>
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/skbuff.h>
> +#include <linux/uio.h>
> +#include <linux/virtio_config.h>
> +
> +struct vhost_device;
> +
> +enum {
> +	VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
> +};
> +
> +/* Poll a file (eventfd or socket) */
> +/* Note: there's nothing vhost specific about this structure. */
> +struct vhost_poll {
> +	poll_table                table;
> +	wait_queue_head_t        *wqh;
> +	wait_queue_t              wait;
> +	/* struct which will handle all actual work. */
> +	struct work_struct        work;
> +	unsigned long		  mask;
> +};
> +
> +void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
> +		     unsigned long mask);
> +void vhost_poll_start(struct vhost_poll *poll, struct file *file);
> +void vhost_poll_stop(struct vhost_poll *poll);
> +void vhost_poll_flush(struct vhost_poll *poll);
> +
> +/* The virtqueue structure describes a queue attached to a device. */
> +struct vhost_virtqueue {
> +	struct vhost_dev *dev;
> +
> +	/* The actual ring of buffers. */
> +	struct mutex mutex;
> +	unsigned int num;
> +	struct vring_desc __user *desc;
> +	struct vring_avail __user *avail;
> +	struct vring_used __user *used;
> +	struct file *kick;
> +	struct file *call;
> +	struct file *error;
> +	struct eventfd_ctx *call_ctx;
> +	struct eventfd_ctx *error_ctx;
> +
> +	struct vhost_poll poll;
> +
> +	/* The routine to call when the Guest pings us, or timeout. */
> +	work_func_t handle_kick;
> +
> +	/* Last available index we saw. */
> +	u16 last_avail_idx;
> +
> +	/* Caches available index value from user. */
> +	u16 avail_idx;
> +
> +	/* Last index we used. */
> +	u16 last_used_idx;
> +
> +	/* Used flags */
> +	u16 used_flags;
> +
> +	struct iovec iov[VHOST_NET_MAX_SG];
> +	struct iovec hdr[VHOST_NET_MAX_SG];
> +};
> +
> +struct vhost_dev {
> +	/* Readers use RCU to access memory table pointer.
> +	 * Writers use mutex below.*/
> +	struct vhost_memory *memory;
> +	struct mm_struct *mm;
> +	struct vhost_virtqueue *vqs;
> +	int nvqs;
> +	struct mutex mutex;
> +	unsigned acked_features;
> +};
> +
> +long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
> +long vhost_dev_check_owner(struct vhost_dev *);
> +long vhost_dev_reset_owner(struct vhost_dev *);
> +void vhost_dev_cleanup(struct vhost_dev *);
> +long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, unsigned long arg);
> +
> +unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
> +			   struct iovec iov[],
> +			   unsigned int *out_num, unsigned int *in_num);
> +void vhost_discard_vq_desc(struct vhost_virtqueue *);
> +
> +int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
> +void vhost_trigger_irq(struct vhost_dev *, struct vhost_virtqueue *);
> +void vhost_add_used_and_trigger(struct vhost_dev *, struct vhost_virtqueue *,
> +				unsigned int head, int len);
> +void vhost_no_notify(struct vhost_virtqueue *);
> +bool vhost_notify(struct vhost_virtqueue *);
> +
> +int vhost_init(void);
> +void vhost_cleanup(void);
> +
> +#define vq_err(vq, fmt, ...) do {                                  \
> +		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> +		if ((vq)->error_ctx)                               \
> +				eventfd_signal((vq)->error_ctx, 1);\
> +	} while (0)
> +
> +enum {
> +	VHOST_FEATURES = 1 << VIRTIO_F_NOTIFY_ON_EMPTY,
> +};
> +
> +static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
> +{
> +	return dev->acked_features & (1 << bit);
> +}
> +
> +#endif
> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
> index dec2f18..975df9a 100644
> --- a/include/linux/Kbuild
> +++ b/include/linux/Kbuild
> @@ -360,6 +360,7 @@ unifdef-y += uio.h
>  unifdef-y += unistd.h
>  unifdef-y += usbdevice_fs.h
>  unifdef-y += utsname.h
> +unifdef-y += vhost.h
>  unifdef-y += videodev2.h
>  unifdef-y += videodev.h
>  unifdef-y += virtio_config.h
> diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
> index 0521177..781a8bb 100644
> --- a/include/linux/miscdevice.h
> +++ b/include/linux/miscdevice.h
> @@ -30,6 +30,7 @@
>  #define HPET_MINOR		228
>  #define FUSE_MINOR		229
>  #define KVM_MINOR		232
> +#define VHOST_NET_MINOR		233
>  #define MISC_DYNAMIC_MINOR	255
>  
>  struct device;
> diff --git a/include/linux/vhost.h b/include/linux/vhost.h
> new file mode 100644
> index 0000000..3f441a9
> --- /dev/null
> +++ b/include/linux/vhost.h
> @@ -0,0 +1,101 @@
> +#ifndef _LINUX_VHOST_H
> +#define _LINUX_VHOST_H
> +/* Userspace interface for in-kernel virtio accelerators. */
> +
> +/* vhost is used to reduce the number of system calls involved in virtio.
> + *
> + * Existing virtio net code is used in the guest without modification.
> + *
> + * This header includes interface used by userspace hypervisor for
> + * device configuration.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/compiler.h>
> +#include <linux/ioctl.h>
> +#include <linux/virtio_config.h>
> +#include <linux/virtio_ring.h>
> +
> +struct vhost_vring_state {
> +	unsigned int index;
> +	unsigned int num;
> +};
> +
> +struct vhost_vring_file {
> +	unsigned int index;
> +	int fd;
> +};
> +
> +struct vhost_vring_addr {
> +	unsigned int index;
> +	unsigned int padding;
> +	__u64 user_addr;
> +};
> +
> +struct vhost_memory_region {
> +	__u64 guest_phys_addr;
> +	__u64 memory_size; /* bytes */
> +	__u64 userspace_addr;
> +	__u64 padding; /* read/write protection? */
> +};
> +
> +struct vhost_memory {
> +	__u32 nregions;
> +	__u32 padding;
> +	struct vhost_memory_region regions[0];
> +};
> +
> +/* ioctls */
> +
> +#define VHOST_VIRTIO 0xAF
> +
> +/* Features bitmask for forward compatibility.  Transport bits are used for
> + * vhost specific features. */
> +#define VHOST_GET_FEATURES	_IOR(VHOST_VIRTIO, 0x00, __u64)
> +#define VHOST_ACK_FEATURES	_IOW(VHOST_VIRTIO, 0x00, __u64)
> +
> +/* Set current process as the (exclusive) owner of this file descriptor.  This
> + * must be called before any other vhost command.  Further calls to
> + * VHOST_SET_OWNER fail until VHOST_RESET_OWNER is called. */
> +#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
> +/* Give up ownership, and reset the device to default values.
> + * Allows subsequent call to VHOST_SET_OWNER to succeed. */
> +#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
> +
> +/* Set up/modify memory layout */
> +#define VHOST_SET_MEM_TABLE	_IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
> +
> +/* Ring setup. These parameters can not be modified while ring is running
> + * (bound to a device). */
> +/* Set number of descriptors in ring */
> +#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
> +/* Start of array of descriptors (virtually contiguous) */
> +#define VHOST_SET_VRING_DESC _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
> +/* Used structure address */
> +#define VHOST_SET_VRING_USED _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_addr)
> +/* Available structure address */
> +#define VHOST_SET_VRING_AVAIL _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_addr)
> +/* Base value where queue looks for available descriptors */
> +#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> +/* Get accessor: reads index, writes value in num */
> +#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> +
> +/* The following ioctls use eventfd file descriptors to signal and poll
> + * for events. */
> +
> +/* Set eventfd to poll for added buffers */
> +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
> +/* Set eventfd to signal when buffers have been used */
> +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
> +/* Set eventfd to signal an error */
> +#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
> +
> +/* VHOST_NET specific defines */
> +
> +/* Attach virtio net device to a raw socket. The socket must be already
> + * bound to an ethernet device, this device will be used for transmit.
> + * Pass -1 to unbind from the socket and the transmit device.
> + * This can be used to stop the device (e.g. for migration). */
> +#define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, int)
> +
> +#endif
> -- 
> 1.6.2.5
Michael S. Tsirkin Sept. 7, 2009, 10:15 a.m. UTC | #2
On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote:
> On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
> > What it is: vhost net is a character device that can be used to reduce
> > the number of system calls involved in virtio networking.
> > Existing virtio net code is used in the guest without modification.
> > 
> > There's similarity with vringfd, with some differences and reduced scope
> > - uses eventfd for signalling
> > - structures can be moved around in memory at any time (good for migration)
> > - support memory table and not just an offset (needed for kvm)
> > 
> > common virtio related code has been put in a separate file vhost.c and
> > can be made into a separate module if/when more backends appear.  I used
> > Rusty's lguest.c as the source for developing this part : this supplied
> > me with witty comments I wouldn't be able to write myself.
> > 
> > What it is not: vhost net is not a bus, and not a generic new system
> > call. No assumptions are made on how guest performs hypercalls.
> > Userspace hypervisors are supported as well as kvm.
> > 
> > How it works: Basically, we connect virtio frontend (configured by
> > userspace) to a backend. The backend could be a network device, or a
> > tun-like device. In this version I only support raw socket as a backend,
> > which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
> > also configured by userspace, including vlan/mac etc.
> > 
> > Status:
> > This works for me, and I haven't see any crashes.
> > I have done some light benchmarking (with v4), compared to userspace, I
> > see improved latency (as I save up to 4 system calls per packet) but not
> > bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
> > ping benchmark (where there's no TSO) troughput is also improved.
> > 
> > Features that I plan to look at in the future:
> > - tap support
> > - TSO
> > - interrupt mitigation
> > - zero copy
> > 
> 
> Hello Michael,
> 
> I've started looking at vhost with the intention of using it over PCI to
> connect physical machines together.
> 
> The part that I am struggling with the most is figuring out which parts
> of the rings are in the host's memory, and which parts are in the
> guest's memory.

All rings are in guest's memory, to match existing virtio code.  vhost
assumes that the memory space of the hypervisor userspace process covers
the whole of guest memory. And there's a translation table.
Ring addresses are userspace addresses; they do not undergo translation.

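In code terms, a buffer address found in a descriptor is resolved
through that table roughly as (sketch only, see translate_desc() in
vhost.c):

	/* reg is the table entry whose range covers the descriptor address */
	user_addr = reg->userspace_addr + (addr - reg->guest_phys_addr);

while the pointers passed in with VHOST_SET_VRING_DESC/AVAIL/USED are
used as userspace addresses as-is.
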
> If I understand everything correctly, the rings are all userspace
> addresses, which means that they can be moved around in physical memory,
> and get pushed out to swap.

Unless they are locked, yes.

> AFAIK, this is impossible to handle when
> connecting two physical systems, you'd need the rings available in IO
> memory (PCI memory), so you can ioreadXX() them instead. To the best of
> my knowledge, I shouldn't be using copy_to_user() on an __iomem address.
> Also, having them migrate around in memory would be a bad thing.
> 
> Also, I'm having trouble figuring out how the packet contents are
> actually copied from one system to the other. Could you point this out
> for me?

The code in net/packet/af_packet.c does it when vhost calls sendmsg.

> Is there somewhere I can find the userspace code (kvm, qemu, lguest,
> etc.) code needed for interacting with the vhost misc device so I can
> get a better idea of how userspace is supposed to work?

Look in the archives for kvm@vger.kernel.org; the subject is "qemu-kvm: vhost net".

> (Features
> negotiation, etc.)
> 
> Thanks,
> Ira

That's not yet implemented as there are no features yet.  I'm working on
tap support, which will add a feature bit.  Overall, qemu does an ioctl
to query supported features, and then acks them with another ioctl.  I'm
also trying to avoid duplicating functionality available elsewhere.  So,
to check e.g. TSO support, you'd just look at the underlying hardware
device you are binding to.
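
In other words, roughly (sketch, error handling omitted; vhost_fd is the
open /dev/vhost-net descriptor and "wanted" is whatever mask the device
model computes):

	__u64 features;

	ioctl(vhost_fd, VHOST_GET_FEATURES, &features);
	features &= wanted;
	ioctl(vhost_fd, VHOST_ACK_FEATURES, &features);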
Ira Snyder Sept. 8, 2009, 5:20 p.m. UTC | #3
On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote:
> On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote:
> > On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
> > > What it is: vhost net is a character device that can be used to reduce
> > > the number of system calls involved in virtio networking.
> > > Existing virtio net code is used in the guest without modification.
> > > 
> > > There's similarity with vringfd, with some differences and reduced scope
> > > - uses eventfd for signalling
> > > - structures can be moved around in memory at any time (good for migration)
> > > - support memory table and not just an offset (needed for kvm)
> > > 
> > > common virtio related code has been put in a separate file vhost.c and
> > > can be made into a separate module if/when more backends appear.  I used
> > > Rusty's lguest.c as the source for developing this part : this supplied
> > > me with witty comments I wouldn't be able to write myself.
> > > 
> > > What it is not: vhost net is not a bus, and not a generic new system
> > > call. No assumptions are made on how guest performs hypercalls.
> > > Userspace hypervisors are supported as well as kvm.
> > > 
> > > How it works: Basically, we connect virtio frontend (configured by
> > > userspace) to a backend. The backend could be a network device, or a
> > > tun-like device. In this version I only support raw socket as a backend,
> > > which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
> > > also configured by userspace, including vlan/mac etc.
> > > 
> > > Status:
> > > This works for me, and I haven't see any crashes.
> > > I have done some light benchmarking (with v4), compared to userspace, I
> > > see improved latency (as I save up to 4 system calls per packet) but not
> > > bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
> > > ping benchmark (where there's no TSO) troughput is also improved.
> > > 
> > > Features that I plan to look at in the future:
> > > - tap support
> > > - TSO
> > > - interrupt mitigation
> > > - zero copy
> > > 
> > 
> > Hello Michael,
> > 
> > I've started looking at vhost with the intention of using it over PCI to
> > connect physical machines together.
> > 
> > The part that I am struggling with the most is figuring out which parts
> > of the rings are in the host's memory, and which parts are in the
> > guest's memory.
> 
> All rings are in guest's memory, to match existing virtio code.

Ok, this makes sense.

> vhost
> assumes that the memory space of the hypervisor userspace process covers
> the whole of guest memory.

Is this necessary? Why? The assumption seems very wrong when you're
doing data transport between two physical systems via PCI.

I know vhost has not been designed for this specific situation, but it
is good to be looking toward other possible uses.

> And there's a translation table.
> Ring addresses are userspace addresses, they do not undergo translation.
> 
> > If I understand everything correctly, the rings are all userspace
> > addresses, which means that they can be moved around in physical memory,
> > and get pushed out to swap.
> 
> Unless they are locked, yes.
> 
> > AFAIK, this is impossible to handle when
> > connecting two physical systems, you'd need the rings available in IO
> > memory (PCI memory), so you can ioreadXX() them instead. To the best of
> > my knowledge, I shouldn't be using copy_to_user() on an __iomem address.
> > Also, having them migrate around in memory would be a bad thing.
> > 
> > Also, I'm having trouble figuring out how the packet contents are
> > actually copied from one system to the other. Could you point this out
> > for me?
> 
> The code in net/packet/af_packet.c does it when vhost calls sendmsg.
> 

Ok. The sendmsg() implementation uses memcpy_fromiovec(). Is it possible
to make this use a DMA engine instead? I know this was suggested in an
earlier thread.

> > Is there somewhere I can find the userspace code (kvm, qemu, lguest,
> > etc.) code needed for interacting with the vhost misc device so I can
> > get a better idea of how userspace is supposed to work?
> 
> Look in archives for kvm@vger.kernel.org. the subject is qemu-kvm: vhost net.
> 
> > (Features
> > negotiation, etc.)
> > 
> 
> That's not yet implemented as there are no features yet.  I'm working on
> tap support, which will add a feature bit.  Overall, qemu does an ioctl
> to query supported features, and then acks them with another ioctl.  I'm
> also trying to avoid duplicating functionality available elsewhere.  So
> that to check e.g. TSO support, you'd just look at the underlying
> hardware device you are binding to.
> 

Ok. Do you have plans to support the VIRTIO_NET_F_MRG_RXBUF feature in
the future? I found that this made an enormous improvement in throughput
on my virtio-net <-> virtio-net system. Perhaps it isn't needed with
vhost-net.

Thanks for replying,
Ira
Michael S. Tsirkin Sept. 8, 2009, 8:14 p.m. UTC | #4
On Tue, Sep 08, 2009 at 10:20:35AM -0700, Ira W. Snyder wrote:
> On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote:
> > > On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
> > > > What it is: vhost net is a character device that can be used to reduce
> > > > the number of system calls involved in virtio networking.
> > > > Existing virtio net code is used in the guest without modification.
> > > > 
> > > > There's similarity with vringfd, with some differences and reduced scope
> > > > - uses eventfd for signalling
> > > > - structures can be moved around in memory at any time (good for migration)
> > > > - support memory table and not just an offset (needed for kvm)
> > > > 
> > > > common virtio related code has been put in a separate file vhost.c and
> > > > can be made into a separate module if/when more backends appear.  I used
> > > > Rusty's lguest.c as the source for developing this part : this supplied
> > > > me with witty comments I wouldn't be able to write myself.
> > > > 
> > > > What it is not: vhost net is not a bus, and not a generic new system
> > > > call. No assumptions are made on how guest performs hypercalls.
> > > > Userspace hypervisors are supported as well as kvm.
> > > > 
> > > > How it works: Basically, we connect virtio frontend (configured by
> > > > userspace) to a backend. The backend could be a network device, or a
> > > > tun-like device. In this version I only support raw socket as a backend,
> > > > which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
> > > > also configured by userspace, including vlan/mac etc.
> > > > 
> > > > Status:
> > > > This works for me, and I haven't see any crashes.
> > > > I have done some light benchmarking (with v4), compared to userspace, I
> > > > see improved latency (as I save up to 4 system calls per packet) but not
> > > > bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
> > > > ping benchmark (where there's no TSO) troughput is also improved.
> > > > 
> > > > Features that I plan to look at in the future:
> > > > - tap support
> > > > - TSO
> > > > - interrupt mitigation
> > > > - zero copy
> > > > 
> > > 
> > > Hello Michael,
> > > 
> > > I've started looking at vhost with the intention of using it over PCI to
> > > connect physical machines together.
> > > 
> > > The part that I am struggling with the most is figuring out which parts
> > > of the rings are in the host's memory, and which parts are in the
> > > guest's memory.
> > 
> > All rings are in guest's memory, to match existing virtio code.
> 
> Ok, this makes sense.
> 
> > vhost
> > assumes that the memory space of the hypervisor userspace process covers
> > the whole of guest memory.
> 
> Is this necessary? Why?

Because with virtio, the ring can give us arbitrary guest addresses.  If
the guest were limited to using a subset of addresses, the hypervisor
would only have to map those.
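
For illustration, that covering mapping is described to vhost as a table
of guest-physical ranges and the userspace addresses they correspond to.
A minimal sketch of setting it up, with struct and ioctl names as in the
later merged vhost UAPI (assumptions as far as this patch version goes):

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/*
 * Tell vhost that guest physical [0, size) lives at 'guest_ram' in this
 * process.  One region is enough for a flat guest memory layout; kvm
 * setups would list one region per memory slot.
 */
int set_mem_table(int vhost_fd, void *guest_ram, uint64_t size)
{
	struct vhost_memory *mem;
	int ret;

	mem = calloc(1, sizeof(*mem) + sizeof(mem->regions[0]));
	if (!mem)
		return -1;

	mem->nregions = 1;
	mem->regions[0].guest_phys_addr = 0;
	mem->regions[0].memory_size = size;
	mem->regions[0].userspace_addr = (uintptr_t)guest_ram;

	ret = ioctl(vhost_fd, VHOST_SET_MEM_TABLE, mem);
	free(mem);
	return ret;
}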

> The assumption seems very wrong when you're
> doing data transport between two physical systems via PCI.
> I know vhost has not been designed for this specific situation, but it
> is good to be looking toward other possible uses.
> 
> > And there's a translation table.
> > Ring addresses are userspace addresses, they do not undergo translation.
> > 
> > > If I understand everything correctly, the rings are all userspace
> > > addresses, which means that they can be moved around in physical memory,
> > > and get pushed out to swap.
> > 
> > Unless they are locked, yes.
> > 
> > > AFAIK, this is impossible to handle when
> > > connecting two physical systems, you'd need the rings available in IO
> > > memory (PCI memory), so you can ioreadXX() them instead. To the best of
> > > my knowledge, I shouldn't be using copy_to_user() on an __iomem address.
> > > Also, having them migrate around in memory would be a bad thing.
> > > 
> > > Also, I'm having trouble figuring out how the packet contents are
> > > actually copied from one system to the other. Could you point this out
> > > for me?
> > 
> > The code in net/packet/af_packet.c does it when vhost calls sendmsg.
> > 
> 
> Ok. The sendmsg() implementation uses memcpy_fromiovec(). Is it possible
> to make this use a DMA engine instead?

Maybe.

> I know this was suggested in an earlier thread.

Yes, it might even give some performance benefit with e.g. I/O AT.

> > > Is there somewhere I can find the userspace code (kvm, qemu, lguest,
> > > etc.) code needed for interacting with the vhost misc device so I can
> > > get a better idea of how userspace is supposed to work?
> > 
> > Look in archives for kvm@vger.kernel.org. the subject is qemu-kvm: vhost net.
> > 
> > > (Features
> > > negotiation, etc.)
> > > 
> > 
> > That's not yet implemented as there are no features yet.  I'm working on
> > tap support, which will add a feature bit.  Overall, qemu does an ioctl
> > to query supported features, and then acks them with another ioctl.  I'm
> > also trying to avoid duplicating functionality available elsewhere.  So
> > that to check e.g. TSO support, you'd just look at the underlying
> > hardware device you are binding to.
> > 
> 
> Ok. Do you have plans to support the VIRTIO_NET_F_MRG_RXBUF feature in
> the future? I found that this made an enormous improvement in throughput
> on my virtio-net <-> virtio-net system. Perhaps it isn't needed with
> vhost-net.

Yes, I'm working on it.

> Thanks for replying,
> Ira
Xin, Xiaohui Sept. 11, 2009, 3:17 p.m. UTC | #5
Michael,
We are very interested in your patch and want to try it out.
I have collected your 3 kernel-side patches and 4 qemu-side patches.
The patches are listed here:

PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch
PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch
PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch

PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch
PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch
PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch
PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch

I applied the kernel patches on v2.6.31-rc4 and the qemu patches on the latest kvm qemu.
But it seems some more patches are needed, at least the irqfd and ioeventfd patches for
current qemu. I cannot create a kvm guest with "-net nic,model=virtio,vhost=vethX".

Could you kindly advise us of the exact list of patches needed to make it work?
Thanks a lot. :-)

Thanks
Xiaohui
Gregory Haskins Sept. 11, 2009, 4 p.m. UTC | #6
Ira W. Snyder wrote:
> On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote:
>> On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote:
>>> On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
>>>> What it is: vhost net is a character device that can be used to reduce
>>>> the number of system calls involved in virtio networking.
>>>> Existing virtio net code is used in the guest without modification.
>>>>
>>>> There's similarity with vringfd, with some differences and reduced scope
>>>> - uses eventfd for signalling
>>>> - structures can be moved around in memory at any time (good for migration)
>>>> - support memory table and not just an offset (needed for kvm)
>>>>
>>>> common virtio related code has been put in a separate file vhost.c and
>>>> can be made into a separate module if/when more backends appear.  I used
>>>> Rusty's lguest.c as the source for developing this part : this supplied
>>>> me with witty comments I wouldn't be able to write myself.
>>>>
>>>> What it is not: vhost net is not a bus, and not a generic new system
>>>> call. No assumptions are made on how guest performs hypercalls.
>>>> Userspace hypervisors are supported as well as kvm.
>>>>
>>>> How it works: Basically, we connect virtio frontend (configured by
>>>> userspace) to a backend. The backend could be a network device, or a
>>>> tun-like device. In this version I only support raw socket as a backend,
>>>> which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
>>>> also configured by userspace, including vlan/mac etc.
>>>>
>>>> Status:
>>>> This works for me, and I haven't see any crashes.
>>>> I have done some light benchmarking (with v4), compared to userspace, I
>>>> see improved latency (as I save up to 4 system calls per packet) but not
>>>> bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
>>>> ping benchmark (where there's no TSO) troughput is also improved.
>>>>
>>>> Features that I plan to look at in the future:
>>>> - tap support
>>>> - TSO
>>>> - interrupt mitigation
>>>> - zero copy
>>>>
>>> Hello Michael,
>>>
>>> I've started looking at vhost with the intention of using it over PCI to
>>> connect physical machines together.
>>>
>>> The part that I am struggling with the most is figuring out which parts
>>> of the rings are in the host's memory, and which parts are in the
>>> guest's memory.
>> All rings are in guest's memory, to match existing virtio code.
> 
> Ok, this makes sense.
> 
>> vhost
>> assumes that the memory space of the hypervisor userspace process covers
>> the whole of guest memory.
> 
> Is this necessary? Why? The assumption seems very wrong when you're
> doing data transport between two physical systems via PCI.

FWIW: VBUS handles this situation via the "memctx" abstraction.  IOW,
the memory is not assumed to be a userspace address.  Rather, it is a
memctx-specific address, which can be userspace, or any other type
(including hardware, dma-engine, etc).  As long as the memctx knows how
to translate it, it will work.
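
A hypothetical sketch of what such an abstraction boils down to (the
names below are invented for illustration and are not the actual vbus
code):

#include <linux/types.h>

struct memctx;

struct memctx_ops {
	/*
	 * Copy between a memctx-specific address 'ptr' and a kernel
	 * buffer.  How 'ptr' is interpreted (userspace va, guest
	 * physical, PCI bus address, ...) is entirely up to the
	 * implementation behind these ops.
	 */
	int (*copy_from)(struct memctx *ctx, void *dst, u64 ptr, size_t len);
	int (*copy_to)(struct memctx *ctx, u64 ptr, const void *src, size_t len);
	void (*release)(struct memctx *ctx);
};

struct memctx {
	const struct memctx_ops *ops;
	void *priv;	/* e.g. an mm_struct, a DMA channel, a BAR mapping */
};

/* A device model only ever says "copy from guest pointer X": */
static inline int memctx_copy_from(struct memctx *ctx, void *dst,
				   u64 ptr, size_t len)
{
	return ctx->ops->copy_from(ctx, dst, ptr, len);
}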

Kind Regards,
-Greg
Gregory Haskins Sept. 11, 2009, 4:14 p.m. UTC | #7
Gregory Haskins wrote:

[snip]

> 
> FWIW: VBUS handles this situation via the "memctx" abstraction.  IOW,
> the memory is not assumed to be a userspace address.  Rather, it is a
> memctx-specific address, which can be userspace, or any other type
> (including hardware, dma-engine, etc).  As long as the memctx knows how
> to translate it, it will work.
> 

citations:

Here is a packet import (from the perspective of the host-side "venet"
device model, similar to Michael's "vhost"):

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/devices/venet-tap.c;h=ee091c47f06e9bb8487a45e72d493273fe08329f;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l535

Here is the KVM specific memctx:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/kvm.c;h=56e2c5682a7ca8432c159377b0f7389cf34cbc1b;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l188

and

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=virt/kvm/xinterface.c;h=0cccb6095ca2a51bad01f7ba2137fdd9111b63d3;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l289

You could alternatively define a memctx for your environment which knows
how to deal with your PPC boards' PCI-based memory, and the devices would
all "just work".

Kind Regards,
-Greg
Michael S. Tsirkin Sept. 13, 2009, 5:46 a.m. UTC | #8
On Fri, Sep 11, 2009 at 11:17:33PM +0800, Xin, Xiaohui wrote:
> Michael,
> We are very interested in your patch and want to have a try with it.
> I have collected your 3 patches in kernel side and 4 patches in queue side.
> The patches are listed here:
> 
> PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch
> PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch
> PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch
> 
> PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch
> PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch
> PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch
> PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch
> 
> I applied the kernel patches on v2.6.31-rc4 and the qemu patches on latest kvm qemu.
> But seems there are some patches are needed at least irqfd and ioeventfd patches on
> current qemu. I cannot create a kvm guest with "-net nic,model=virtio,vhost=vethX".
> 
> May you kindly advice us the patch lists all exactly to make it work?
> Thanks a lot. :-)
> 
> Thanks
> Xiaohui


The irqfd/ioeventfd patches are part of Avi's kvm.git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git

I expect them to be merged by 2.6.32-rc1 - right, Avi?
Michael S. Tsirkin Sept. 13, 2009, 12:01 p.m. UTC | #9
On Fri, Sep 11, 2009 at 12:00:21PM -0400, Gregory Haskins wrote:
> FWIW: VBUS handles this situation via the "memctx" abstraction.  IOW,
> the memory is not assumed to be a userspace address.  Rather, it is a
> memctx-specific address, which can be userspace, or any other type
> (including hardware, dma-engine, etc).  As long as the memctx knows how
> to translate it, it will work.

How would permissions be handled?  It's easy to allow an app to pass in
virtual addresses in its own address space.  But we can't let the guest
specify physical addresses.
Xin, Xiaohui Sept. 14, 2009, 5:57 a.m. UTC | #10
>The irqfd/ioeventfd patches are part of Avi's kvm.git tree:
>git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
>
>I expect them to be merged by 2.6.32-rc1 - right, Avi?

Michael,

I think I have the kernel patches for kvm_irqfd and kvm_ioeventfd, but am missing the qemu-side patches for irqfd and ioeventfd.

I hit the following compile errors when building the virtio-pci.c file in qemu-kvm:

/root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:384: error: `KVM_IRQFD` undeclared (first use in this function)
/root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:400: error: `KVM_IOEVENTFD` undeclared (first use in this function)

Which qemu tree or patch do you use for kvm_irqfd and kvm_ioeventfd?

Thanks
Xiaohui

Michael S. Tsirkin Sept. 14, 2009, 7:05 a.m. UTC | #11
On Mon, Sep 14, 2009 at 01:57:06PM +0800, Xin, Xiaohui wrote:
> >The irqfd/ioeventfd patches are part of Avi's kvm.git tree:
> >git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
> >
> >I expect them to be merged by 2.6.32-rc1 - right, Avi?
> 
> Michael,
> 
> I think I have the kernel patch for kvm_irqfd and kvm_ioeventfd, but missed the qemu side patch for irqfd and ioeventfd.
> 
> I met the compile error when I compiled virtio-pci.c file in qemu-kvm like this:
> 
> /root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:384: error: `KVM_IRQFD` undeclared (first use in this function)
> /root/work/vmdq/vhost/qemu-kvm/hw/virtio-pci.c:400: error: `KVM_IOEVENTFD` undeclared (first use in this function)
> 
> Which qemu tree or patch do you use for kvm_irqfd and kvm_ioeventfd?

I'm using the headers from the upstream kernel.
I'll send a patch for that.

> Thanks
> Xiaohui
Gregory Haskins Sept. 14, 2009, 4:08 p.m. UTC | #12
Michael S. Tsirkin wrote:
> On Fri, Sep 11, 2009 at 12:00:21PM -0400, Gregory Haskins wrote:
>> FWIW: VBUS handles this situation via the "memctx" abstraction.  IOW,
>> the memory is not assumed to be a userspace address.  Rather, it is a
>> memctx-specific address, which can be userspace, or any other type
>> (including hardware, dma-engine, etc).  As long as the memctx knows how
>> to translate it, it will work.
> 
> How would permissions be handled?

Same as anything else, really.  Read on for details.

> it's easy to allow an app to pass in virtual addresses in its own address space.

Agreed, and this is what I do.

The guest always passes its own physical addresses (using things like
__pa() in Linux).  The address passed is memctx-specific, but generally
would fall into the category of "virtual addresses" from the host's
perspective.

For a KVM/AlacrityVM guest example, the addresses are GPAs, accessed
internally to the context via a gfn_to_hva conversion (you can see this
occurring in the citation links I sent).

For Ira's example, the addresses would represent a physical address on
the PCI boards, and would follow any kind of relevant rules for
converting a "GPA" to a host accessible address (even if indirectly, via
a dma controller).
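
In the KVM-style case that conversion is just region-table arithmetic;
roughly (a sketch, not code lifted from either project):

#include <linux/types.h>

struct mem_region {
	__u64 guest_phys_addr;	/* start of the range in guest physical space */
	__u64 memory_size;	/* length of the range */
	__u64 userspace_addr;	/* where it is mapped in the hypervisor process */
};

/*
 * Translate a guest physical address to a host virtual address by
 * scanning the region table; returns 0 if the address is not covered.
 */
static __u64 gpa_to_hva(const struct mem_region *regions, int n, __u64 gpa)
{
	int i;

	for (i = 0; i < n; i++) {
		const struct mem_region *r = &regions[i];

		if (gpa >= r->guest_phys_addr &&
		    gpa - r->guest_phys_addr < r->memory_size)
			return r->userspace_addr + (gpa - r->guest_phys_addr);
	}

	return 0;
}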


>  But we can't let the guest specify physical addresses.

Agreed.  Neither your proposal nor mine operate this way afaict.

HTH

Kind Regards,
-Greg
Michael S. Tsirkin Sept. 14, 2009, 4:47 p.m. UTC | #13
On Mon, Sep 14, 2009 at 12:08:55PM -0400, Gregory Haskins wrote:
> For Ira's example, the addresses would represent a physical address on
> the PCI boards, and would follow any kind of relevant rules for
> converting a "GPA" to a host accessible address (even if indirectly, via
> a dma controller).

I don't think limiting addresses to PCI physical addresses will work
well.  From what I remember, Ira's x86 cannot initiate burst
transactions on PCI, and it's the ppc that initiates all DMA.

> 
> >  But we can't let the guest specify physical addresses.
> 
> Agreed.  Neither your proposal nor mine operate this way afaict.

But this seems to be what Ira needs.

> HTH
> 
> Kind Regards,
> -Greg
> 


Michael S. Tsirkin Sept. 14, 2009, 4:53 p.m. UTC | #14
On Mon, Sep 14, 2009 at 12:08:55PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Fri, Sep 11, 2009 at 12:00:21PM -0400, Gregory Haskins wrote:
> >> FWIW: VBUS handles this situation via the "memctx" abstraction.  IOW,
> >> the memory is not assumed to be a userspace address.  Rather, it is a
> >> memctx-specific address, which can be userspace, or any other type
> >> (including hardware, dma-engine, etc).  As long as the memctx knows how
> >> to translate it, it will work.
> > 
> > How would permissions be handled?
> 
> Same as anything else, really.  Read on for details.
> 
> > it's easy to allow an app to pass in virtual addresses in its own address space.
> 
> Agreed, and this is what I do.
> 
> The guest always passes its own physical addresses (using things like
> __pa() in linux).  This address passed is memctx specific, but generally
> would fall into the category of "virtual-addresses" from the hosts
> perspective.
> 
> For a KVM/AlacrityVM guest example, the addresses are GPAs, accessed
> internally to the context via a gfn_to_hva conversion (you can see this
> occuring in the citation links I sent)
> 
> For Ira's example, the addresses would represent a physical address on
> the PCI boards, and would follow any kind of relevant rules for
> converting a "GPA" to a host accessible address (even if indirectly, via
> a dma controller).

So vbus can let an application access either its own virtual memory or
physical memory on a PCI device.  My question is, is any application
that's allowed to do the former also granted rights to do the latter?

> >  But we can't let the guest specify physical addresses.
> 
> Agreed.  Neither your proposal nor mine operate this way afaict.
> 
> HTH
> 
> Kind Regards,
> -Greg
> 


Gregory Haskins Sept. 14, 2009, 7:14 p.m. UTC | #15
Michael S. Tsirkin wrote:
> On Mon, Sep 14, 2009 at 12:08:55PM -0400, Gregory Haskins wrote:
>> For Ira's example, the addresses would represent a physical address on
>> the PCI boards, and would follow any kind of relevant rules for
>> converting a "GPA" to a host accessible address (even if indirectly, via
>> a dma controller).
> 
> I don't think limiting addresses to PCI physical addresses will work
> well.

The only "limit" is imposed by the memctx.  If a given context needs to
meet certain requirements beyond PCI physical addresses, it would
presumably be designed that way.


>  From what I rememeber, Ira's x86 can not initiate burst
> transactions on PCI, and it's the ppc that initiates all DMA.

The only requirement is that the "guest" "owns" the memory.  IOW: As
with virtio/vhost, the guest can access the pointers in the ring
directly but the host must pass through a translation function.

Your translation is direct: you use a slots/hva scheme.  My translation
is abstracted, which means it can support slots/hva (such as in
alacrityvm) or some other scheme as long as the general model of "guest
owned" holds true.

> 
>>>  But we can't let the guest specify physical addresses.
>> Agreed.  Neither your proposal nor mine operate this way afaict.
> 
> But this seems to be what Ira needs.

So what he could do then is implement the memctx to integrate with the
ppc side dma controller.  E.g. "translation" in his box means a protocol
from the x86 to the ppc to initiate the dma cycle.  This could be
exposed as a dma facility in the register file of the ppc boards, for
instance.
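
Purely as an illustration of that idea, the board-side copy hook might
look something like this; every name below is invented, and the real
interface would depend on the board's DMA controller:

#include <linux/types.h>

struct board_dma_chan;

/* Hypothetical: program the board's DMA engine to move 'len' bytes from
 * an address owned by the x86 host into a local buffer, blocking until
 * the completion interrupt fires. */
int board_dma_memcpy(struct board_dma_chan *chan, void *dst,
		     u64 remote_addr, size_t len);

struct ppc_dma_ctx {
	struct board_dma_chan *chan;	/* handle to the DMA facility */
};

/* A copy_from hook whose backing store is reached over the PCI link via
 * DMA rather than copy_from_user(). */
static int ppc_dma_copy_from(struct ppc_dma_ctx *ctx, void *dst,
			     u64 remote_addr, size_t len)
{
	return board_dma_memcpy(ctx->chan, dst, remote_addr, len);
}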

To reiterate, as long as the model is such that the ppc boards are
considered the "owner" (direct access, no translation needed) I believe
it will work.  If the pointers are expected to be owned by the host,
then my model doesn't work well either.

Kind Regards,
-Greg
Gregory Haskins Sept. 14, 2009, 7:28 p.m. UTC | #16
Michael S. Tsirkin wrote:
> On Mon, Sep 14, 2009 at 12:08:55PM -0400, Gregory Haskins wrote:
>> Michael S. Tsirkin wrote:
>>> On Fri, Sep 11, 2009 at 12:00:21PM -0400, Gregory Haskins wrote:
>>>> FWIW: VBUS handles this situation via the "memctx" abstraction.  IOW,
>>>> the memory is not assumed to be a userspace address.  Rather, it is a
>>>> memctx-specific address, which can be userspace, or any other type
>>>> (including hardware, dma-engine, etc).  As long as the memctx knows how
>>>> to translate it, it will work.
>>> How would permissions be handled?
>> Same as anything else, really.  Read on for details.
>>
>>> it's easy to allow an app to pass in virtual addresses in its own address space.
>> Agreed, and this is what I do.
>>
>> The guest always passes its own physical addresses (using things like
>> __pa() in linux).  This address passed is memctx specific, but generally
>> would fall into the category of "virtual-addresses" from the hosts
>> perspective.
>>
>> For a KVM/AlacrityVM guest example, the addresses are GPAs, accessed
>> internally to the context via a gfn_to_hva conversion (you can see this
>> occuring in the citation links I sent)
>>
>> For Ira's example, the addresses would represent a physical address on
>> the PCI boards, and would follow any kind of relevant rules for
>> converting a "GPA" to a host accessible address (even if indirectly, via
>> a dma controller).
> 
> So vbus can let an application

"application" means KVM guest, or ppc board, right?

> access either its own virtual memory or a physical memory on a PCI device.

To reiterate from the last reply: the model is the "guest" owns the
memory.  The host is granted access to that memory by means of a memctx
object, which must be admitted to the host kernel and accessed according
to standard access-policy mechanisms.  Generally the "application" or
guest would never be accessing anything other than its own memory.

> My question is, is any application
> that's allowed to do the former also granted rights to do the later?

If I understand your question, no.  Can you elaborate?

Kind Regards,
-Greg
Avi Kivity Sept. 15, 2009, 12:32 p.m. UTC | #17
On 09/14/2009 07:47 PM, Michael S. Tsirkin wrote:
> On Mon, Sep 14, 2009 at 12:08:55PM -0400, Gregory Haskins wrote:
>    
>> For Ira's example, the addresses would represent a physical address on
>> the PCI boards, and would follow any kind of relevant rules for
>> converting a "GPA" to a host accessible address (even if indirectly, via
>> a dma controller).
>>      
> I don't think limiting addresses to PCI physical addresses will work
> well.  From what I rememeber, Ira's x86 can not initiate burst
> transactions on PCI, and it's the ppc that initiates all DMA.
>    

vhost-net would run on the PPC then.

>>>   But we can't let the guest specify physical addresses.
>>>        
>> Agreed.  Neither your proposal nor mine operate this way afaict.
>>      
> But this seems to be what Ira needs.
>    

In Ira's scenario, the "guest" (x86 host) specifies x86 physical
addresses, and the ppc DMAs to them.  It's the virtio model without any
change.  A normal guest also specifies physical addresses.
Avi Kivity Sept. 15, 2009, 12:35 p.m. UTC | #18
On 09/14/2009 10:14 PM, Gregory Haskins wrote:
> To reiterate, as long as the model is such that the ppc boards are
> considered the "owner" (direct access, no translation needed) I believe
> it will work.  If the pointers are expected to be owned by the host,
> then my model doesn't work well either.
>    

In this case the x86 is the owner and the ppc boards use translated 
access.  Just switch drivers and device and it falls into place.
Gregory Haskins Sept. 15, 2009, 1:03 p.m. UTC | #19
Avi Kivity wrote:
> On 09/14/2009 10:14 PM, Gregory Haskins wrote:
>> To reiterate, as long as the model is such that the ppc boards are
>> considered the "owner" (direct access, no translation needed) I believe
>> it will work.  If the pointers are expected to be owned by the host,
>> then my model doesn't work well either.
>>    
> 
> In this case the x86 is the owner and the ppc boards use translated
> access.  Just switch drivers and device and it falls into place.
> 

You could switch vbus roles as well, I suppose.  Another potential
option is that he can stop mapping host memory on the guest so that it
follows the more traditional model.  As a bus-master device, the ppc
boards should have access to any host memory at least in the GFP_DMA
range, which would include all relevant pointers here.

I digress:  I was primarily addressing the concern that Ira would need
to manage the "host" side of the link using hvas mapped from userspace
(even if host side is the ppc boards).  vbus abstracts that access so as
to allow something other than userspace/hva mappings.  OTOH, having each
ppc board run a userspace app to do the mapping on its behalf and feed
it to vhost is probably not a huge deal either.  Where vhost might
really fall apart is when any assumptions about pageable memory occur,
if any.

As an aside: a bigger issue is that, iiuc, Ira wants more than a single
ethernet channel in his design (multiple ethernets, consoles, etc).  A
vhost solution in this environment is incomplete.

Note that Ira's architecture highlights that vbus's explicit management
interface is more valuable here than it is in KVM, since KVM already has
its own management interface via QEMU.

Kind Regards,
-Greg
Avi Kivity Sept. 15, 2009, 1:25 p.m. UTC | #20
On 09/15/2009 04:03 PM, Gregory Haskins wrote:
>
>> In this case the x86 is the owner and the ppc boards use translated
>> access.  Just switch drivers and device and it falls into place.
>>
>>      
> You could switch vbus roles as well, I suppose.

Right, there's no real difference in this regard.

> Another potential
> option is that he can stop mapping host memory on the guest so that it
> follows the more traditional model.  As a bus-master device, the ppc
> boards should have access to any host memory at least in the GFP_DMA
> range, which would include all relevant pointers here.
>
> I digress:  I was primarily addressing the concern that Ira would need
> to manage the "host" side of the link using hvas mapped from userspace
> (even if host side is the ppc boards).  vbus abstracts that access so as
> to allow something other than userspace/hva mappings.  OTOH, having each
> ppc board run a userspace app to do the mapping on its behalf and feed
> it to vhost is probably not a huge deal either.  Where vhost might
> really fall apart is when any assumptions about pageable memory occur,
> if any.
>    

Why?  vhost will call get_user_pages() or copy_*_user() which ought to 
do the right thing.
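
Roughly (a sketch, not the literal vhost code): the ring lives at a
userspace address, so fetching a descriptor is an ordinary user-memory
access, and a page that was swapped out simply gets faulted back in:

#include <linux/errno.h>
#include <linux/uaccess.h>
#include <linux/virtio_ring.h>

/* Fetch descriptor 'idx' from a descriptor table that lives at the
 * userspace address 'desc_user'. */
static int fetch_desc(struct vring_desc __user *desc_user, unsigned int idx,
		      struct vring_desc *desc)
{
	if (copy_from_user(desc, desc_user + idx, sizeof(*desc)))
		return -EFAULT;

	return 0;
}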

> As an aside: a bigger issue is that, iiuc, Ira wants more than a single
> ethernet channel in his design (multiple ethernets, consoles, etc).  A
> vhost solution in this environment is incomplete.
>    

Why?  Instantiate as many vhost-nets as needed.

> Note that Ira's architecture highlights that vbus's explicit management
> interface is more valuable here than it is in KVM, since KVM already has
> its own management interface via QEMU.
>    

vhost-net and vbus both need management, vhost-net via ioctls and vbus 
via configfs.  The only difference is the implementation.  vhost-net 
leaves much more to userspace, that's the main difference.
Gregory Haskins Sept. 15, 2009, 1:50 p.m. UTC | #21
Avi Kivity wrote:
> On 09/15/2009 04:03 PM, Gregory Haskins wrote:
>>
>>> In this case the x86 is the owner and the ppc boards use translated
>>> access.  Just switch drivers and device and it falls into place.
>>>
>>>      
>> You could switch vbus roles as well, I suppose.
> 
> Right, there's not real difference in this regard.
> 
>> Another potential
>> option is that he can stop mapping host memory on the guest so that it
>> follows the more traditional model.  As a bus-master device, the ppc
>> boards should have access to any host memory at least in the GFP_DMA
>> range, which would include all relevant pointers here.
>>
>> I digress:  I was primarily addressing the concern that Ira would need
>> to manage the "host" side of the link using hvas mapped from userspace
>> (even if host side is the ppc boards).  vbus abstracts that access so as
>> to allow something other than userspace/hva mappings.  OTOH, having each
>> ppc board run a userspace app to do the mapping on its behalf and feed
>> it to vhost is probably not a huge deal either.  Where vhost might
>> really fall apart is when any assumptions about pageable memory occur,
>> if any.
>>    
> 
> Why?  vhost will call get_user_pages() or copy_*_user() which ought to
> do the right thing.

I was speaking generally, not specifically to Ira's architecture.  What
I mean is that vbus was designed to work without assuming that the
memory is pageable.  There are environments in which the host is not
capable of mapping hvas/*page, but the memctx->copy_to/copy_from
paradigm could still work (think rdma, for instance).

> 
>> As an aside: a bigger issue is that, iiuc, Ira wants more than a single
>> ethernet channel in his design (multiple ethernets, consoles, etc).  A
>> vhost solution in this environment is incomplete.
>>    
> 
> Why?  Instantiate as many vhost-nets as needed.

a) what about non-ethernets?
b) what do you suppose this protocol to aggregate the connections would
look like? (hint: this is what a vbus-connector does).
c) how do you manage the configuration, especially on a per-board basis?

> 
>> Note that Ira's architecture highlights that vbus's explicit management
>> interface is more valuable here than it is in KVM, since KVM already has
>> its own management interface via QEMU.
>>    
> 
> vhost-net and vbus both need management, vhost-net via ioctls and vbus
> via configfs.

Actually I have patches queued to allow vbus to be managed via ioctls as
well, per your feedback (and it solves the permissions/lifetime
criticisms in alacrityvm-v0.1).

>  The only difference is the implementation.  vhost-net
> leaves much more to userspace, that's the main difference.

Also,

*) vhost is virtio-net specific, whereas vbus is a more generic device
model where things like virtio-net or venet ride on top.

*) vhost is only designed to work with environments that look very
similar to a KVM guest (slot/hva translatable).  vbus can bridge various
environments by abstracting the key components (such as memory access).

*) vhost requires an active userspace management daemon, whereas vbus
can be driven by transient components, like scripts (ala udev)

Kind Regards,
-Greg
Michael S. Tsirkin Sept. 15, 2009, 2:28 p.m. UTC | #22
On Tue, Sep 15, 2009 at 09:50:39AM -0400, Gregory Haskins wrote:
> Avi Kivity wrote:
> > On 09/15/2009 04:03 PM, Gregory Haskins wrote:
> >>
> >>> In this case the x86 is the owner and the ppc boards use translated
> >>> access.  Just switch drivers and device and it falls into place.
> >>>
> >>>      
> >> You could switch vbus roles as well, I suppose.
> > 
> > Right, there's not real difference in this regard.
> > 
> >> Another potential
> >> option is that he can stop mapping host memory on the guest so that it
> >> follows the more traditional model.  As a bus-master device, the ppc
> >> boards should have access to any host memory at least in the GFP_DMA
> >> range, which would include all relevant pointers here.
> >>
> >> I digress:  I was primarily addressing the concern that Ira would need
> >> to manage the "host" side of the link using hvas mapped from userspace
> >> (even if host side is the ppc boards).  vbus abstracts that access so as
> >> to allow something other than userspace/hva mappings.  OTOH, having each
> >> ppc board run a userspace app to do the mapping on its behalf and feed
> >> it to vhost is probably not a huge deal either.  Where vhost might
> >> really fall apart is when any assumptions about pageable memory occur,
> >> if any.
> >>    
> > 
> > Why?  vhost will call get_user_pages() or copy_*_user() which ought to
> > do the right thing.
> 
> I was speaking generally, not specifically to Ira's architecture.  What
> I mean is that vbus was designed to work without assuming that the
> memory is pageable.  There are environments in which the host is not
> capable of mapping hvas/*page, but the memctx->copy_to/copy_from
> paradigm could still work (think rdma, for instance).

rdma interfaces are typically asynchronous, so blocking
copy_from/copy_to can be made to work, but likely won't work
that well. DMA might work better if it is asynchronous as well.

Assuming a synchronous copy is what we need - maybe the issue is that
there aren't good APIs for x86/ppc communication? If so, sticking them in
vhost might not be the best place.  Maybe the specific platform can
redefine copy_to/from_user to do the right thing? Or, maybe add another
API for that ...
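
As a sketch of what that other API could look like if it were
asynchronous (a hypothetical interface, only to make the sync/async
distinction concrete; none of these names exist today):

#include <linux/types.h>

struct async_copy_req {
	void *dst;			/* local kernel buffer */
	u64 src;			/* address in the remote/guest space */
	size_t len;
	void (*complete)(struct async_copy_req *req, int err);
	void *private;			/* opaque: DMA channel, RDMA QP, ... */
};

/*
 * Queue the copy and return immediately; req->complete() fires once the
 * DMA engine or RDMA adapter has finished the transfer, so the vhost
 * thread never blocks on the copy itself.
 */
int async_copy_from(void *ctx, struct async_copy_req *req);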

> > 
> >> As an aside: a bigger issue is that, iiuc, Ira wants more than a single
> >> ethernet channel in his design (multiple ethernets, consoles, etc).  A
> >> vhost solution in this environment is incomplete.
> >>    
> > 
> > Why?  Instantiate as many vhost-nets as needed.
> 
> a) what about non-ethernets?

vhost-net actually does not care.
The packet is passed on to a socket, and we are done.

> b) what do you suppose this protocol to aggregate the connections would
> look like? (hint: this is what a vbus-connector does).

You are talking about a management protocol between the ppc and x86, right?
One wonders why it has to be in the kernel at all.

> c) how do you manage the configuration, especially on a per-board basis?

Not sure what a board is, but configuration is done in userspace.
Avi Kivity Sept. 15, 2009, 3:03 p.m. UTC | #23
On 09/15/2009 04:50 PM, Gregory Haskins wrote:
>> Why?  vhost will call get_user_pages() or copy_*_user() which ought to
>> do the right thing.
>>      
> I was speaking generally, not specifically to Ira's architecture.  What
> I mean is that vbus was designed to work without assuming that the
> memory is pageable.  There are environments in which the host is not
> capable of mapping hvas/*page, but the memctx->copy_to/copy_from
> paradigm could still work (think rdma, for instance).
>    

Sure, vbus is more flexible here.

>>> As an aside: a bigger issue is that, iiuc, Ira wants more than a single
>>> ethernet channel in his design (multiple ethernets, consoles, etc).  A
>>> vhost solution in this environment is incomplete.
>>>
>>>        
>> Why?  Instantiate as many vhost-nets as needed.
>>      
> a) what about non-ethernets?
>    

There's virtio-console, virtio-blk etc.  None of these have kernel-mode 
servers, but these could be implemented if/when needed.

> b) what do you suppose this protocol to aggregate the connections would
> look like? (hint: this is what a vbus-connector does).
>    

You mean multilink?  You expose the device as a multiqueue.

> c) how do you manage the configuration, especially on a per-board basis?
>    

pci (for kvm/x86).

> Actually I have patches queued to allow vbus to be managed via ioctls as
> well, per your feedback (and it solves the permissions/lifetime
> critisims in alacrityvm-v0.1).
>    

That will make qemu integration easier.

>>   The only difference is the implementation.  vhost-net
>> leaves much more to userspace, that's the main difference.
>>      
> Also,
>
> *) vhost is virtio-net specific, whereas vbus is a more generic device
> model where thing like virtio-net or venet ride on top.
>    

I think vhost-net is separated into vhost and vhost-net.

> *) vhost is only designed to work with environments that look very
> similar to a KVM guest (slot/hva translatable).  vbus can bridge various
> environments by abstracting the key components (such as memory access).
>    

Yes.  virtio is really virtualization oriented.

> *) vhost requires an active userspace management daemon, whereas vbus
> can be driven by transient components, like scripts (ala udev)
>    

vhost by design leaves configuration and handshaking to userspace.  I 
see it as an advantage.
Gregory Haskins Sept. 15, 2009, 8:08 p.m. UTC | #24
Avi Kivity wrote:
> On 09/15/2009 04:50 PM, Gregory Haskins wrote:
>>> Why?  vhost will call get_user_pages() or copy_*_user() which ought to
>>> do the right thing.
>>>      
>> I was speaking generally, not specifically to Ira's architecture.  What
>> I mean is that vbus was designed to work without assuming that the
>> memory is pageable.  There are environments in which the host is not
>> capable of mapping hvas/*page, but the memctx->copy_to/copy_from
>> paradigm could still work (think rdma, for instance).
>>    
> 
> Sure, vbus is more flexible here.
> 
>>>> As an aside: a bigger issue is that, iiuc, Ira wants more than a single
>>>> ethernet channel in his design (multiple ethernets, consoles, etc).  A
>>>> vhost solution in this environment is incomplete.
>>>>
>>>>        
>>> Why?  Instantiate as many vhost-nets as needed.
>>>      
>> a) what about non-ethernets?
>>    
> 
> There's virtio-console, virtio-blk etc.  None of these have kernel-mode
> servers, but these could be implemented if/when needed.

IIUC, Ira already needs at least ethernet and console capability.

> 
>> b) what do you suppose this protocol to aggregate the connections would
>> look like? (hint: this is what a vbus-connector does).
>>    
> 
> You mean multilink?  You expose the device as a multiqueue.

No, what I mean is how do you surface multiple ethernets and consoles to
the guests?  For Ira's case, I think he needs at minimum one of each,
and he mentioned possibly having two unique ethernets at one point.

His slave boards surface themselves as PCI devices to the x86
host.  So how do you use that to make multiple vhost-based devices (say
two virtio-nets, and a virtio-console) communicate across the transport?

There are multiple ways to do this, but what I am saying is that
whatever is conceived will start to look eerily like a vbus-connector,
since this is one of its primary purposes ;)


> 
>> c) how do you manage the configuration, especially on a per-board basis?
>>    
> 
> pci (for kvm/x86).

Ok, for kvm understood (and I would also add "qemu" to that mix).  But
we are talking about vhost's application in a non-kvm environment here,
right?

So if the vhost-X devices are in the "guest", and the x86 board is just
a slave... how do you tell each ppc board how many devices and what
config (e.g. MACs, etc) to instantiate?  Do you assume that they should
all be symmetric and based on positional (e.g. slot) data?  What if you
want asymmetric configurations (if not here, perhaps in a different
environment)?

> 
>> Actually I have patches queued to allow vbus to be managed via ioctls as
>> well, per your feedback (and it solves the permissions/lifetime
>> critisims in alacrityvm-v0.1).
>>    
> 
> That will make qemu integration easier.
> 
>>>   The only difference is the implementation.  vhost-net
>>> leaves much more to userspace, that's the main difference.
>>>      
>> Also,
>>
>> *) vhost is virtio-net specific, whereas vbus is a more generic device
>> model where thing like virtio-net or venet ride on top.
>>    
> 
> I think vhost-net is separated into vhost and vhost-net.

Thats good.

> 
>> *) vhost is only designed to work with environments that look very
>> similar to a KVM guest (slot/hva translatable).  vbus can bridge various
>> environments by abstracting the key components (such as memory access).
>>    
> 
> Yes.  virtio is really virtualization oriented.

I would say that it's vhost in particular that is virtualization
oriented.  virtio, as a concept, generally should work in physical
systems, if perhaps with some minor modifications.  The biggest "limit"
is having "virt" in its name ;)

> 
>> *) vhost requires an active userspace management daemon, whereas vbus
>> can be driven by transient components, like scripts (ala udev)
>>    
> 
> vhost by design leaves configuration and handshaking to userspace.  I
> see it as an advantage.

The misconception here is that vbus by design _doesn't define_ where
configuration/handshaking happens.  It is primarily implemented by a
modular component called a "vbus-connector", and _I_ see this
flexibility as an advantage.  vhost on the other hand depends on an
active userspace component and a slots/hva memory design, which is more
limiting in where it can be used and forces you to split the logic.
However, I think we both more or less agree on this point already.

For the record, vbus itself is simply a resource container for
virtual-devices, which provides abstractions for the various points of
interest to generalizing PV (memory, signals, etc) and the proper
isolation and protection guarantees.  What you do with it is defined by
the modular virtual-devices (e.g. virtio-net, venet, sched, hrt, scsi,
rdma, etc) and vbus-connectors (vbus-kvm, etc) you plug into it.

As an example, you could emulate the vhost design in vbus by writing a
"vbus-vhost" connector.  This connector would be very thin and terminate
locally in QEMU.  It would provide an ioctl-based verb namespace similar
to the existing vhost verbs we have today.  QEMU would then similarly
reflect the vbus-based virtio device as a PCI device to the guest, so
that virtio-pci works unmodified.

You would then have most of the advantages of the work I have done for
commoditizing/abstracting the key points for in-kernel PV, like the
memctx.  In addition, much of the work could be reused in multiple
environments since any vbus-compliant device model that is plugged into
the framework would work with any connector that is plugged in (e.g.
vbus-kvm (alacrityvm), vbus-vhost (KVM), and "vbus-ira").

The only tradeoff is in features offered by the connector (e.g.
vbus-vhost has the advantage that existing PV guests can continue to
work unmodified, vbus-kvm has the advantage that it supports new
features like generic shared memory, non-virtio based devices,
priortizable interrupts, no dependencies on PCI for non PCI guests, etc).

Kind Regards,
-Greg
Michael S. Tsirkin Sept. 15, 2009, 8:40 p.m. UTC | #25
On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
> No, what I mean is how do you surface multiple ethernet and consoles to
> the guests?  For Ira's case, I think he needs at minimum at least one of
> each, and he mentioned possibly having two unique ethernets at one point.
> 
> His slave boards surface themselves as PCI devices to the x86
> host.  So how do you use that to make multiple vhost-based devices (say
> two virtio-nets, and a virtio-console) communicate across the transport?
> 
> There are multiple ways to do this, but what I am saying is that
> whatever is conceived will start to look eerily like a vbus-connector,
> since this is one of its primary purposes ;)

Can't all this be in userspace?
Gregory Haskins Sept. 15, 2009, 8:43 p.m. UTC | #26
Michael S. Tsirkin wrote:
> On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
>> No, what I mean is how do you surface multiple ethernet and consoles to
>> the guests?  For Ira's case, I think he needs at minimum at least one of
>> each, and he mentioned possibly having two unique ethernets at one point.
>>
>> His slave boards surface themselves as PCI devices to the x86
>> host.  So how do you use that to make multiple vhost-based devices (say
>> two virtio-nets, and a virtio-console) communicate across the transport?
>>
>> There are multiple ways to do this, but what I am saying is that
>> whatever is conceived will start to look eerily like a vbus-connector,
>> since this is one of its primary purposes ;)
> 
> Can't all this be in userspace?

Can you outline your proposal?

-Greg
Michael S. Tsirkin Sept. 15, 2009, 9:25 p.m. UTC | #27
On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
> >> No, what I mean is how do you surface multiple ethernet and consoles to
> >> the guests?  For Ira's case, I think he needs at minimum at least one of
> >> each, and he mentioned possibly having two unique ethernets at one point.
> >>
> >> His slave boards surface themselves as PCI devices to the x86
> >> host.  So how do you use that to make multiple vhost-based devices (say
> >> two virtio-nets, and a virtio-console) communicate across the transport?
> >>
> >> There are multiple ways to do this, but what I am saying is that
> >> whatever is conceived will start to look eerily like a vbus-connector,
> >> since this is one of its primary purposes ;)
> > 
> > Can't all this be in userspace?
> 
> Can you outline your proposal?
> 
> -Greg
> 

Userspace in x86 maps a PCI region, uses it for communication with ppc?
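
Something like this from userspace, roughly (sketch only; BDF and BAR
size made up):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0",
                      O_RDWR | O_SYNC);
        volatile uint32_t *regs;

        if (fd < 0) {
                perror("open");
                return 1;
        }
        regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        regs[0] = 1;            /* poke a doorbell-style register on the board */
        munmap((void *)regs, 4096);
        close(fd);
        return 0;
}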

Michael S. Tsirkin Sept. 15, 2009, 9:38 p.m. UTC | #28
On Tue, Sep 15, 2009 at 05:39:27PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote:
> >> Michael S. Tsirkin wrote:
> >>> On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
> >>>> No, what I mean is how do you surface multiple ethernet and consoles to
> >>>> the guests?  For Ira's case, I think he needs at minimum at least one of
> >>>> each, and he mentioned possibly having two unique ethernets at one point.
> >>>>
> >>>> His slave boards surface themselves as PCI devices to the x86
> >>>> host.  So how do you use that to make multiple vhost-based devices (say
> >>>> two virtio-nets, and a virtio-console) communicate across the transport?
> >>>>
> >>>> There are multiple ways to do this, but what I am saying is that
> >>>> whatever is conceived will start to look eerily like a vbus-connector,
> >>>> since this is one of its primary purposes ;)
> >>> Can't all this be in userspace?
> >> Can you outline your proposal?
> >>
> >> -Greg
> >>
> > 
> > Userspace in x86 maps a PCI region, uses it for communication with ppc?
> > 
> 
> And what do you propose this communication to look like?

Who cares? Implement vbus protocol there if you like.

> -Greg
> 


Gregory Haskins Sept. 15, 2009, 9:39 p.m. UTC | #29
Michael S. Tsirkin wrote:
> On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote:
>> Michael S. Tsirkin wrote:
>>> On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
>>>> No, what I mean is how do you surface multiple ethernet and consoles to
>>>> the guests?  For Ira's case, I think he needs at minimum at least one of
>>>> each, and he mentioned possibly having two unique ethernets at one point.
>>>>
>>>> His slave boards surface themselves as PCI devices to the x86
>>>> host.  So how do you use that to make multiple vhost-based devices (say
>>>> two virtio-nets, and a virtio-console) communicate across the transport?
>>>>
>>>> There are multiple ways to do this, but what I am saying is that
>>>> whatever is conceived will start to look eerily like a vbus-connector,
>>>> since this is one of its primary purposes ;)
>>> Can't all this be in userspace?
>> Can you outline your proposal?
>>
>> -Greg
>>
> 
> Userspace in x86 maps a PCI region, uses it for communication with ppc?
> 

And what do you propose this communication to look like?

-Greg
Gregory Haskins Sept. 15, 2009, 9:55 p.m. UTC | #30
Michael S. Tsirkin wrote:
> On Tue, Sep 15, 2009 at 05:39:27PM -0400, Gregory Haskins wrote:
>> Michael S. Tsirkin wrote:
>>> On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote:
>>>> Michael S. Tsirkin wrote:
>>>>> On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:
>>>>>> No, what I mean is how do you surface multiple ethernet and consoles to
>>>>>> the guests?  For Ira's case, I think he needs at minimum at least one of
>>>>>> each, and he mentioned possibly having two unique ethernets at one point.
>>>>>>
>>>>>> His slave boards surface themselves as PCI devices to the x86
>>>>>> host.  So how do you use that to make multiple vhost-based devices (say
>>>>>> two virtio-nets, and a virtio-console) communicate across the transport?
>>>>>>
>>>>>> There are multiple ways to do this, but what I am saying is that
>>>>>> whatever is conceived will start to look eerily like a vbus-connector,
>>>>>> since this is one of its primary purposes ;)
>>>>> Can't all this be in userspace?
>>>> Can you outline your proposal?
>>>>
>>>> -Greg
>>>>
>>> Userspace in x86 maps a PCI region, uses it for communication with ppc?
>>>
>> And what do you propose this communication to look like?
> 
> Who cares? Implement vbus protocol there if you like.
> 

Exactly.  My point is that you need something like a vbus protocol there. ;)

Here is the protocol I run over PCI in AlacrityVM:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025

And I guess to your point, yes the protocol can technically be in
userspace (outside of whatever you need for the in-kernel portion of the
communication transport, if any).

The vbus-connector design does not specify where the protocol needs to
take place, per se.  Note, however, for performance reasons some parts
of the protocol may want to be in the kernel (such as DEVCALL and
SHMSIGNAL).  It is for this reason that I just run all of it there,
because IMO it's simpler than splitting it up.  The slow path stuff just
rides on infrastructure that I need for fast-path anyway, so it doesn't
really cost me anything additional.

Kind Regards,
-Greg
Avi Kivity Sept. 16, 2009, 8:23 a.m. UTC | #31
On 09/15/2009 11:08 PM, Gregory Haskins wrote:
>
>> There's virtio-console, virtio-blk etc.  None of these have kernel-mode
>> servers, but these could be implemented if/when needed.
>>      
> IIUC, Ira already needs at least ethernet and console capability.
>
>    

He's welcome to pick up the necessary code from qemu.

>>> b) what do you suppose this protocol to aggregate the connections would
>>> look like? (hint: this is what a vbus-connector does).
>>>
>>>        
>> You mean multilink?  You expose the device as a multiqueue.
>>      
> No, what I mean is how do you surface multiple ethernet and consoles to
> the guests?  For Ira's case, I think he needs at minimum at least one of
> each, and he mentioned possibly having two unique ethernets at one point.
>    

You instantiate multiple vhost-nets.  Multiple ethernet NICs is a 
supported configuration for kvm.
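
(i.e. something like the following sketch, assuming the char device
ends up as /dev/vhost-net; each open() is an independent instance:)

#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* one fd per NIC instance */
static int new_vhost_nic(void)
{
        int fd = open("/dev/vhost-net", O_RDWR);

        if (fd >= 0)
                ioctl(fd, VHOST_SET_OWNER, NULL);       /* bind to this process */
        return fd;
}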

> His slave boards surface themselves as PCI devices to the x86
> host.  So how do you use that to make multiple vhost-based devices (say
> two virtio-nets, and a virtio-console) communicate across the transport?
>    

I don't really see the difference between 1 and N here.

> There are multiple ways to do this, but what I am saying is that
> whatever is conceived will start to look eerily like a vbus-connector,
> since this is one of its primary purposes ;)
>    

I'm not sure if you're talking about the configuration interface or data 
path here.

>>> c) how do you manage the configuration, especially on a per-board basis?
>>>
>>>        
>> pci (for kvm/x86).
>>      
> Ok, for kvm understood (and I would also add "qemu" to that mix).  But
> we are talking about vhost's application in a non-kvm environment here,
> right?.
>
> So if the vhost-X devices are in the "guest",

They aren't in the "guest".  The best way to look at it is

- a device side, with a dma engine: vhost-net
- a driver side, only accessing its own memory: virtio-net

Given that Ira's config has the dma engine in the ppc boards, that's 
where vhost-net would live (the ppc boards acting as NICs to the x86 
board, essentially).

> and the x86 board is just
> a slave...How do you tell each ppc board how many devices and what
> config (e.g. MACs, etc) to instantiate?  Do you assume that they should
> all be symmetric and based on positional (e.g. slot) data?  What if you
> want asymmetric configurations (if not here, perhaps in a different
> environment)?
>    

I have no idea, that's for Ira to solve.  If he could fake the PCI 
config space as seen by the x86 board, he would just show the normal pci 
config and use virtio-pci (multiple channels would show up as a 
multifunction device).  Given he can't, he needs to tunnel the virtio 
config space some other way.

>> Yes.  virtio is really virtualization oriented.
>>      
> I would say that its vhost in particular that is virtualization
> oriented.  virtio, as a concept, generally should work in physical
> systems, if perhaps with some minor modifications.  The biggest "limit"
> is having "virt" in its name ;)
>    

Let me rephrase.  The virtio developers are virtualization oriented.  If 
it works for non-virt applications, that's good, but not a design goal.
Gregory Haskins Sept. 16, 2009, 11:44 a.m. UTC | #32
Avi Kivity wrote:
> On 09/15/2009 11:08 PM, Gregory Haskins wrote:
>>
>>> There's virtio-console, virtio-blk etc.  None of these have kernel-mode
>>> servers, but these could be implemented if/when needed.
>>>      
>> IIUC, Ira already needs at least ethernet and console capability.
>>
>>    
> 
> He's welcome to pick up the necessary code from qemu.

The problem isn't where to find the models...the problem is how to
aggregate multiple models to the guest.

> 
>>>> b) what do you suppose this protocol to aggregate the connections would
>>>> look like? (hint: this is what a vbus-connector does).
>>>>
>>>>        
>>> You mean multilink?  You expose the device as a multiqueue.
>>>      
>> No, what I mean is how do you surface multiple ethernet and consoles to
>> the guests?  For Ira's case, I think he needs at minimum at least one of
>> each, and he mentioned possibly having two unique ethernets at one point.
>>    
> 
> You instantiate multiple vhost-nets.  Multiple ethernet NICs is a
> supported configuration for kvm.

But this is not KVM.

> 
>> His slave boards surface themselves as PCI devices to the x86
>> host.  So how do you use that to make multiple vhost-based devices (say
>> two virtio-nets, and a virtio-console) communicate across the transport?
>>    
> 
> I don't really see the difference between 1 and N here.

A KVM surfaces N virtio-devices as N pci-devices to the guest.  What do
we do in Ira's case where the entire guest represents itself as a PCI
device to the host, and nothing the other way around?


> 
>> There are multiple ways to do this, but what I am saying is that
>> whatever is conceived will start to look eerily like a vbus-connector,
>> since this is one of its primary purposes ;)
>>    
> 
> I'm not sure if you're talking about the configuration interface or data
> path here.

I am talking about how we would tunnel the config space for N devices
across his transport.

As an aside, the vbus-kvm connector makes them one and the same, but
they do not have to be.  It's all in the connector design.

> 
>>>> c) how do you manage the configuration, especially on a per-board
>>>> basis?
>>>>
>>>>        
>>> pci (for kvm/x86).
>>>      
>> Ok, for kvm understood (and I would also add "qemu" to that mix).  But
>> we are talking about vhost's application in a non-kvm environment here,
>> right?.
>>
>> So if the vhost-X devices are in the "guest",
> 
> They aren't in the "guest".  The best way to look at it is
> 
> - a device side, with a dma engine: vhost-net
> - a driver side, only accessing its own memory: virtio-net
> 
> Given that Ira's config has the dma engine in the ppc boards, that's
> where vhost-net would live (the ppc boards acting as NICs to the x86
> board, essentially).

That sounds convenient given his hardware, but it has its own set of
problems.  For one, the configuration/inventory of these boards is now
driven by the wrong side and has to be addressed.  Second, the role
reversal will likely not work for many models other than ethernet (e.g.
virtio-console or virtio-blk drivers running on the x86 board would be
naturally consuming services from the slave boards...virtio-net is an
exception because 802.x is generally symmetrical).

IIUC, vbus would support having the device models live properly on the
x86 side, solving both of these problems.  It would be impossible to
reverse vhost given its current design.

> 
>> and the x86 board is just
>> a slave...How do you tell each ppc board how many devices and what
>> config (e.g. MACs, etc) to instantiate?  Do you assume that they should
>> all be symmetric and based on positional (e.g. slot) data?  What if you
>> want asymmetric configurations (if not here, perhaps in a different
>> environment)?
>>    
> 
> I have no idea, that's for Ira to solve.

Bingo.  Thus my statement that the vhost proposal is incomplete.  You
have the virtio-net and vhost-net pieces covering the fast-path
end-points, but nothing in the middle (transport, aggregation,
config-space), and nothing on the management-side.  vbus provides most
of the other pieces, and can even support the same virtio-net protocol
on top.  The remaining part would be something like a udev script to
populate the vbus with devices on board-insert events.

> If he could fake the PCI
> config space as seen by the x86 board, he would just show the normal pci
> config and use virtio-pci (multiple channels would show up as a
> multifunction device).  Given he can't, he needs to tunnel the virtio
> config space some other way.

Right, and note that vbus was designed to solve this.  This tunneling
can, of course, be done without vbus using some other design.  However,
whatever solution is created will look incredibly close to what I've
already done, so my point is "why reinvent it"?

> 
>>> Yes.  virtio is really virtualization oriented.
>>>      
>> I would say that its vhost in particular that is virtualization
>> oriented.  virtio, as a concept, generally should work in physical
>> systems, if perhaps with some minor modifications.  The biggest "limit"
>> is having "virt" in its name ;)
>>    
> 
> Let me rephrase.  The virtio developers are virtualization oriented.  If
> it works for non-virt applications, that's good, but not a design goal.
> 

Fair enough.  Vbus was designed to support both HW and virt (as well as
other models, like containers), including tunneling virtio within those
environments.  That is probably why IMO vbus is a better fit than vhost
here.  (FWIW: I would love to see vhost use the vbus framework, then we
all win.  You can do this and still retain virtio-pci compatibility (at
least theoretically).  I am still open to working with the team on this).

Kind Regards,
-Greg
Avi Kivity Sept. 16, 2009, 1:05 p.m. UTC | #33
On 09/16/2009 02:44 PM, Gregory Haskins wrote:
> The problem isn't where to find the models...the problem is how to
> aggregate multiple models to the guest.
>    

You mean configuration?

>> You instantiate multiple vhost-nets.  Multiple ethernet NICs is a
>> supported configuration for kvm.
>>      
> But this is not KVM.
>
>    

If kvm can do it, others can.

>>> His slave boards surface themselves as PCI devices to the x86
>>> host.  So how do you use that to make multiple vhost-based devices (say
>>> two virtio-nets, and a virtio-console) communicate across the transport?
>>>
>>>        
>> I don't really see the difference between 1 and N here.
>>      
> A KVM surfaces N virtio-devices as N pci-devices to the guest.  What do
> we do in Ira's case where the entire guest represents itself as a PCI
> device to the host, and nothing the other way around?
>    

There is no guest and host in this scenario.  There's a device side 
(ppc) and a driver side (x86).  The driver side can access configuration 
information on the device side.  How to multiplex multiple devices is an 
interesting exercise for whoever writes the virtio binding for that setup.

>>> There are multiple ways to do this, but what I am saying is that
>>> whatever is conceived will start to look eerily like a vbus-connector,
>>> since this is one of its primary purposes ;)
>>>
>>>        
>> I'm not sure if you're talking about the configuration interface or data
>> path here.
>>      
> I am talking about how we would tunnel the config space for N devices
> across his transport.
>    

Sounds trivial.  Write an address containing the device number and 
register number to one location, read or write data from another.  Just
like the PCI cf8/cfc interface.
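
Something like this over a shared BAR (sketch, register layout invented
for the example):

#include <stdint.h>

#define CFG_ADDR        0x00    /* write: (dev_nr << 16) | reg_nr */
#define CFG_DATA        0x04    /* read/write: data for that register */

static uint32_t cfg_read(volatile uint32_t *bar, int dev, int reg)
{
        bar[CFG_ADDR / 4] = ((uint32_t)dev << 16) | (uint32_t)reg;
        return bar[CFG_DATA / 4];
}

static void cfg_write(volatile uint32_t *bar, int dev, int reg, uint32_t val)
{
        bar[CFG_ADDR / 4] = ((uint32_t)dev << 16) | (uint32_t)reg;
        bar[CFG_DATA / 4] = val;
}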

>> They aren't in the "guest".  The best way to look at it is
>>
>> - a device side, with a dma engine: vhost-net
>> - a driver side, only accessing its own memory: virtio-net
>>
>> Given that Ira's config has the dma engine in the ppc boards, that's
>> where vhost-net would live (the ppc boards acting as NICs to the x86
>> board, essentially).
>>      
> That sounds convenient given his hardware, but it has its own set of
> problems.  For one, the configuration/inventory of these boards is now
> driven by the wrong side and has to be addressed.

Why is it the wrong side?

> Second, the role
> reversal will likely not work for many models other than ethernet (e.g.
> virtio-console or virtio-blk drivers running on the x86 board would be
> naturally consuming services from the slave boards...virtio-net is an
> exception because 802.x is generally symmetrical).
>    

There is no role reversal.  The side doing dma is the device, the side 
accessing its own memory is the driver.  Just like the other 1e12
driver/device pairs out there.

>> I have no idea, that's for Ira to solve.
>>      
> Bingo.  Thus my statement that the vhost proposal is incomplete.  You
> have the virtio-net and vhost-net pieces covering the fast-path
> end-points, but nothing in the middle (transport, aggregation,
> config-space), and nothing on the management-side.  vbus provides most
> of the other pieces, and can even support the same virtio-net protocol
> on top.  The remaining part would be something like a udev script to
> populate the vbus with devices on board-insert events.
>    

Of course vhost is incomplete, in the same sense that Linux is 
incomplete.  Both require userspace.

>> If he could fake the PCI
>> config space as seen by the x86 board, he would just show the normal pci
>> config and use virtio-pci (multiple channels would show up as a
>> multifunction device).  Given he can't, he needs to tunnel the virtio
>> config space some other way.
>>      
> Right, and note that vbus was designed to solve this.  This tunneling
> can, of course, be done without vbus using some other design.  However,
> whatever solution is created will look incredibly close to what I've
> already done, so my point is "why reinvent it"?
>    

virtio requires a binding for this tunnelling, so does vbus.  It's the same
problem with the same solution.
Gregory Haskins Sept. 16, 2009, 2:10 p.m. UTC | #34
Avi Kivity wrote:
> On 09/16/2009 02:44 PM, Gregory Haskins wrote:
>> The problem isn't where to find the models...the problem is how to
>> aggregate multiple models to the guest.
>>    
> 
> You mean configuration?
> 
>>> You instantiate multiple vhost-nets.  Multiple ethernet NICs is a
>>> supported configuration for kvm.
>>>      
>> But this is not KVM.
>>
>>    
> 
> If kvm can do it, others can.

The problem is that you seem to either hand-wave over details like this,
or you give details that are pretty much exactly what vbus does already.
 My point is that I've already sat down and thought about these issues
and solved them in a freely available GPL'ed software package.

So the question is: is your position that vbus is all wrong and you wish
to create a new bus-like thing to solve the problem?  If so, how is it
different from what I've already done?  More importantly, what specific
objections do you have to what I've done, as perhaps they can be fixed
instead of starting over?

> 
>>>> His slave boards surface themselves as PCI devices to the x86
>>>> host.  So how do you use that to make multiple vhost-based devices (say
>>>> two virtio-nets, and a virtio-console) communicate across the
>>>> transport?
>>>>
>>>>        
>>> I don't really see the difference between 1 and N here.
>>>      
>> A KVM surfaces N virtio-devices as N pci-devices to the guest.  What do
>> we do in Ira's case where the entire guest represents itself as a PCI
>> device to the host, and nothing the other way around?
>>    
> 
> There is no guest and host in this scenario.  There's a device side
> (ppc) and a driver side (x86).  The driver side can access configuration
> information on the device side.  How to multiplex multiple devices is an
> interesting exercise for whoever writes the virtio binding for that setup.

Bingo.  So now it's a question of whether you want to write this layer
from scratch or re-use my framework.

> 
>>>> There are multiple ways to do this, but what I am saying is that
>>>> whatever is conceived will start to look eerily like a vbus-connector,
>>>> since this is one of its primary purposes ;)
>>>>
>>>>        
>>> I'm not sure if you're talking about the configuration interface or data
>>> path here.
>>>      
>> I am talking about how we would tunnel the config space for N devices
>> across his transport.
>>    
> 
> Sounds trivial.

No one said it was rocket science.  But it does need to be designed and
implemented end-to-end, much of which I've already done in what I hope is
an extensible way.

>  Write an address containing the device number and
> register number to on location, read or write data from another.

You mean like the "u64 devh", and "u32 func" fields I have here for the
vbus-kvm connector?

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l64

> Just
> like the PCI cf8/cfc interface.
> 
>>> They aren't in the "guest".  The best way to look at it is
>>>
>>> - a device side, with a dma engine: vhost-net
>>> - a driver side, only accessing its own memory: virtio-net
>>>
>>> Given that Ira's config has the dma engine in the ppc boards, that's
>>> where vhost-net would live (the ppc boards acting as NICs to the x86
>>> board, essentially).
>>>      
>> That sounds convenient given his hardware, but it has its own set of
>> problems.  For one, the configuration/inventory of these boards is now
>> driven by the wrong side and has to be addressed.
> 
> Why is it the wrong side?

"Wrong" is probably too harsh a word when looking at ethernet.  Its
certainly "odd", and possibly inconvenient.  It would be like having
vhost in a KVM guest, and virtio-net running on the host.  You could do
it, but it's weird and awkward.  Where it really falls apart and enters
the "wrong" category is for non-symmetric devices, like disk-io.

> 
>> Second, the role
>> reversal will likely not work for many models other than ethernet (e.g.
>> virtio-console or virtio-blk drivers running on the x86 board would be
>> naturally consuming services from the slave boards...virtio-net is an
>> exception because 802.x is generally symmetrical).
>>    
> 
> There is no role reversal.

So if I have virtio-blk driver running on the x86 and vhost-blk device
running on the ppc board, I can use the ppc board as a block-device.
What if I really wanted to go the other way?

> The side doing dma is the device, the side
> accessing its own memory is the driver.  Just like that other 1e12
> driver/device pairs out there.

IIUC, his ppc boards really can be seen as "guests" (they are linux
instances that are utilizing services from the x86, not the other way
around).  vhost forces the model to have the ppc boards act as IO-hosts,
whereas vbus would likely work in either direction due to its more
refined abstraction layer.

> 
>>> I have no idea, that's for Ira to solve.
>>>      
>> Bingo.  Thus my statement that the vhost proposal is incomplete.  You
>> have the virtio-net and vhost-net pieces covering the fast-path
>> end-points, but nothing in the middle (transport, aggregation,
>> config-space), and nothing on the management-side.  vbus provides most
>> of the other pieces, and can even support the same virtio-net protocol
>> on top.  The remaining part would be something like a udev script to
>> populate the vbus with devices on board-insert events.
>>    
> 
> Of course vhost is incomplete, in the same sense that Linux is
> incomplete.  Both require userspace.

A vhost based solution to Ira's design is missing more than userspace.
Many of those gaps are addressed by a vbus based solution.

> 
>>> If he could fake the PCI
>>> config space as seen by the x86 board, he would just show the normal pci
>>> config and use virtio-pci (multiple channels would show up as a
>>> multifunction device).  Given he can't, he needs to tunnel the virtio
>>> config space some other way.
>>>      
>> Right, and note that vbus was designed to solve this.  This tunneling
>> can, of course, be done without vbus using some other design.  However,
>> whatever solution is created will look incredibly close to what I've
>> already done, so my point is "why reinvent it"?
>>    
> 
> virtio requires binding for this tunnelling, so does vbus.

We aren't talking about virtio.  Virtio would work with either vbus or
vhost.  This is purely a question of what the layers below virtio and
the device backend looks like.

>  Its the same problem with the same solution.

I disagree.

Kind Regards,
-Greg
Arnd Bergmann Sept. 16, 2009, 2:57 p.m. UTC | #35
On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> Userspace in x86 maps a PCI region, uses it for communication with ppc?

This might have portability issues. On x86 it should work, but if the
host is powerpc or similar, you cannot reliably access PCI I/O memory
through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel
calls, which don't work on user pointers.

Specifically on powerpc, copy_from_user cannot access unaligned buffers
if they are on an I/O mapping.
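
To make the distinction concrete (illustrative fragment only, not from
a real driver):

#include <linux/errno.h>
#include <linux/io.h>
#include <linux/pci.h>
#include <linux/uaccess.h>

/* a user pointer wants copy_from_user() */
static int read_from_userspace(void __user *uptr, void *buf, size_t len)
{
        return copy_from_user(buf, uptr, len) ? -EFAULT : 0;
}

/* PCI memory wants an __iomem mapping and the io accessors instead */
static int read_from_board(struct pci_dev *pdev, void *buf, size_t len)
{
        void __iomem *regs = pci_ioremap_bar(pdev, 0);

        if (!regs)
                return -ENOMEM;
        memcpy_fromio(buf, regs, len);
        pci_iounmap(pdev, regs);
        return 0;
}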

	Arnd <><
Michael S. Tsirkin Sept. 16, 2009, 3:13 p.m. UTC | #36
On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote:
> On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> > Userspace in x86 maps a PCI region, uses it for communication with ppc?
> 
> This might have portability issues. On x86 it should work, but if the
> host is powerpc or similar, you cannot reliably access PCI I/O memory
> through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel
> calls, which don't work on user pointers.
> 
> Specifically on powerpc, copy_from_user cannot access unaligned buffers
> if they are on an I/O mapping.
> 
> 	Arnd <><

We are talking about doing this in userspace, not in kernel.
Arnd Bergmann Sept. 16, 2009, 3:22 p.m. UTC | #37
On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote:
> > On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> > > Userspace in x86 maps a PCI region, uses it for communication with ppc?
> > 
> > This might have portability issues. On x86 it should work, but if the
> > host is powerpc or similar, you cannot reliably access PCI I/O memory
> > through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel
> > calls, which don't work on user pointers.
> > 
> > Specifically on powerpc, copy_from_user cannot access unaligned buffers
> > if they are on an I/O mapping.
> > 
> We are talking about doing this in userspace, not in kernel.

Ok, that's fine then. I thought the idea was to use the vhost_net driver
to access the user memory, which would be a really cute hack otherwise,
as you'd only need to provide the eventfds from a hardware specific
driver and could use the regular virtio_net on the other side.

	Arnd <><
Avi Kivity Sept. 16, 2009, 3:59 p.m. UTC | #38
On 09/16/2009 05:10 PM, Gregory Haskins wrote:
>
>> If kvm can do it, others can.
>>      
> The problem is that you seem to either hand-wave over details like this,
> or you give details that are pretty much exactly what vbus does already.
>   My point is that I've already sat down and thought about these issues
> and solved them in a freely available GPL'ed software package.
>    

In the kernel.  IMO that's the wrong place for it.  Further, if we adopt 
vbus, we drop compatibility with existing guests or have to support both
vbus and virtio-pci.

> So the question is: is your position that vbus is all wrong and you wish
> to create a new bus-like thing to solve the problem?

I don't intend to create anything new, I am satisfied with virtio.  If 
it works for Ira, excellent.  If not, too bad.  I believe it will work 
without too much trouble.

> If so, how is it
> different from what Ive already done?  More importantly, what specific
> objections do you have to what Ive done, as perhaps they can be fixed
> instead of starting over?
>    

The two biggest objections are:
- the host side is in the kernel
- the guest side is a new bus instead of reusing pci (on x86/kvm), 
making Windows support more difficult

I guess these two are exactly what you think are vbus' greatest 
advantages, so we'll probably have to extend our agree-to-disagree on 
this one.

I also had issues with using just one interrupt vector to service all 
events, but that's easily fixed.

>> There is no guest and host in this scenario.  There's a device side
>> (ppc) and a driver side (x86).  The driver side can access configuration
>> information on the device side.  How to multiplex multiple devices is an
>> interesting exercise for whoever writes the virtio binding for that setup.
>>      
> Bingo.  So now its a question of do you want to write this layer from
> scratch, or re-use my framework.
>    

You will have to implement a connector or whatever for vbus as well.  
vbus has more layers so it's probably smaller for vbus.

>>>>
>>>>          
>>> I am talking about how we would tunnel the config space for N devices
>>> across his transport.
>>>
>>>        
>> Sounds trivial.
>>      
> No one said it was rocket science.  But it does need to be designed and
> implemented end-to-end, much of which Ive already done in what I hope is
> an extensible way.
>    

It was already implemented three times for virtio, so apparently that's 
extensible too.

>>   Write an address containing the device number and
>> register number to on location, read or write data from another.
>>      
> You mean like the "u64 devh", and "u32 func" fields I have here for the
> vbus-kvm connector?
>
> http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l64
>
>    

Probably.



>>> That sounds convenient given his hardware, but it has its own set of
>>> problems.  For one, the configuration/inventory of these boards is now
>>> driven by the wrong side and has to be addressed.
>>>        
>> Why is it the wrong side?
>>      
> "Wrong" is probably too harsh a word when looking at ethernet.  Its
> certainly "odd", and possibly inconvenient.  It would be like having
> vhost in a KVM guest, and virtio-net running on the host.  You could do
> it, but its weird and awkward.  Where it really falls apart and enters
> the "wrong" category is for non-symmetric devices, like disk-io.
>
>    


It's not odd or wrong or weird or awkward.  An ethernet NIC is not
symmetric, one side does DMA and issues interrupts, the other uses its 
own memory.  That's exactly the case with Ira's setup.

If the ppc boards were to emulate a disk controller, you'd run 
virtio-blk on x86 and vhost-blk on the ppc boards.

>>> Second, the role
>>> reversal will likely not work for many models other than ethernet (e.g.
>>> virtio-console or virtio-blk drivers running on the x86 board would be
>>> naturally consuming services from the slave boards...virtio-net is an
>>> exception because 802.x is generally symmetrical).
>>>
>>>        
>> There is no role reversal.
>>      
> So if I have virtio-blk driver running on the x86 and vhost-blk device
> running on the ppc board, I can use the ppc board as a block-device.
> What if I really wanted to go the other way?
>    

You mean, if the x86 board was able to access the disks and dma into the 
ppc boards' memory?  You'd run vhost-blk on x86 and virtio-blk on ppc.

As long as you don't use the words "guest" and "host" but keep to 
"driver" and "device", it all works out.

>> The side doing dma is the device, the side
>> accessing its own memory is the driver.  Just like that other 1e12
>> driver/device pairs out there.
>>      
> IIUC, his ppc boards really can be seen as "guests" (they are linux
> instances that are utilizing services from the x86, not the other way
> around).

They aren't guests.  Guests don't dma into their host's memory.

> vhost forces the model to have the ppc boards act as IO-hosts,
> whereas vbus would likely work in either direction due to its more
> refined abstraction layer.
>    

vhost=device=dma, virtio=driver=own-memory.

>> Of course vhost is incomplete, in the same sense that Linux is
>> incomplete.  Both require userspace.
>>      
> A vhost based solution to Iras design is missing more than userspace.
> Many of those gaps are addressed by a vbus based solution.
>    

Maybe.  Ira can fill the gaps or use vbus.
Michael S. Tsirkin Sept. 16, 2009, 4:08 p.m. UTC | #39
On Wed, Sep 16, 2009 at 05:22:37PM +0200, Arnd Bergmann wrote:
> On Wednesday 16 September 2009, Michael S. Tsirkin wrote:
> > On Wed, Sep 16, 2009 at 04:57:42PM +0200, Arnd Bergmann wrote:
> > > On Tuesday 15 September 2009, Michael S. Tsirkin wrote:
> > > > Userspace in x86 maps a PCI region, uses it for communication with ppc?
> > > 
> > > This might have portability issues. On x86 it should work, but if the
> > > host is powerpc or similar, you cannot reliably access PCI I/O memory
> > > through copy_tofrom_user but have to use memcpy_toio/fromio or readl/writel
> > > calls, which don't work on user pointers.
> > > 
> > > Specifically on powerpc, copy_from_user cannot access unaligned buffers
> > > if they are on an I/O mapping.
> > > 
> > We are talking about doing this in userspace, not in kernel.
> 
> Ok, that's fine then. I thought the idea was to use the vhost_net driver

It's a separate issue. We were talking generally about configuration
and setup. Gregory implemented it in kernel, Avi wants it
moved to userspace, with only fastpath in kernel.

> to access the user memory, which would be a really cute hack otherwise,
> as you'd only need to provide the eventfds from a hardware specific
> driver and could use the regular virtio_net on the other side.
> 
> 	Arnd <><

To do that, maybe copy to user on ppc can be fixed, or wrapped
around in an arch-specific macro, so that everyone else
does not have to go through abstraction layers.
Gregory Haskins Sept. 16, 2009, 7:22 p.m. UTC | #40
Avi Kivity wrote:
> On 09/16/2009 05:10 PM, Gregory Haskins wrote:
>>
>>> If kvm can do it, others can.
>>>      
>> The problem is that you seem to either hand-wave over details like this,
>> or you give details that are pretty much exactly what vbus does already.
>>   My point is that I've already sat down and thought about these issues
>> and solved them in a freely available GPL'ed software package.
>>    
> 
> In the kernel.  IMO that's the wrong place for it.

In conversations with Ira, he indicated he needs kernel-to-kernel
ethernet for performance, and needs at least ethernet and console
connectivity.  You could conceivably build a solution for this system in
three basic ways:

1) "completely" in userspace: use things like tuntap on the ppc boards,
and tunnel packets across a custom point-to-point connection formed over
the pci link to a userspace app on the x86 board.  This app then
reinjects the packets into the x86 kernel as a raw socket or tuntap,
etc.  Pretty much vanilla tuntap/vpn kind of stuff.  Advantage: very
little kernel code.  Problem: performance (citation: hopefully obvious).

2) "partially" in userspace: have an in-kernel virtio-net driver talk to
a userspace based virtio-net backend.  This is the (current, non-vhost
oriented) KVM/qemu model.  Advantage, re-uses existing kernel-code.
Problem: performance (citation: see alacrityvm numbers).

3) "in-kernel": You can do something like virtio-net to vhost to
potentially meet some of the requirements, but not all.

In order to fully meet (3), you would need to do some of that stuff you
mentioned in the last reply with muxing device-nr/reg-nr.  In addition,
we need to have a facility for mapping eventfds and establishing a
signaling mechanism (like PIO+qid), etc. KVM does this with
IRQFD/IOEVENTFD, but we don't have KVM in this case so it needs to be
invented.
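
(For reference, the KVM wiring being referred to looks roughly like
this from userspace -- sketch, error handling elided:)

#include <linux/kvm.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>

static void wire_up_queue(int vm_fd, unsigned long long pio_addr, int gsi)
{
        struct kvm_ioeventfd io = { .addr = pio_addr, .len = 2,
                                    .flags = KVM_IOEVENTFD_FLAG_PIO };
        struct kvm_irqfd irq = { .gsi = gsi };

        io.fd  = eventfd(0, 0);                 /* guest->host doorbell */
        irq.fd = eventfd(0, 0);                 /* host->guest interrupt */

        ioctl(vm_fd, KVM_IOEVENTFD, &io);       /* guest PIO write signals io.fd */
        ioctl(vm_fd, KVM_IRQFD, &irq);          /* signalling irq.fd injects gsi */
}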

To meet performance, this stuff has to be in kernel and there has to be
a way to manage it.  Since vbus was designed to do exactly that, this is
what I would advocate.  You could also reinvent these concepts and put
your own mux and mapping code in place, in addition to all the other
stuff that vbus does.  But I am not clear why anyone would want to.

So no, the kernel is not the wrong place for it.  It's the _only_ place
for it.  Otherwise, just use (1) and be done with it.
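
(For completeness, the userspace side of option (1) is mostly stock
tuntap plumbing, e.g. this untested sketch:)

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

static int open_tap(const char *name)
{
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);

        if (fd < 0)
                return fd;
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TAP | IFF_NO_PI;    /* raw ethernet frames */
        strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
                close(fd);
                return -1;
        }
        return fd;      /* read()/write() frames, shuttle them over the pci link */
}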

>  Further, if we adopt
> vbus, if drop compatibility with existing guests or have to support both
> vbus and virtio-pci.

We already need to support both (at least to support Ira).  virtio-pci
doesn't work here.  Something else (vbus, or vbus-like) is needed.

> 
>> So the question is: is your position that vbus is all wrong and you wish
>> to create a new bus-like thing to solve the problem?
> 
> I don't intend to create anything new, I am satisfied with virtio.  If
> it works for Ira, excellent.  If not, too bad.

I think that about sums it up, then.


>  I believe it will work without too much trouble.

Afaict it won't, for the reasons I mentioned.

> 
>> If so, how is it
>> different from what Ive already done?  More importantly, what specific
>> objections do you have to what Ive done, as perhaps they can be fixed
>> instead of starting over?
>>    
> 
> The two biggest objections are:
> - the host side is in the kernel

As it needs to be.

> - the guest side is a new bus instead of reusing pci (on x86/kvm),
> making Windows support more difficult

That's a function of the vbus-connector, which is different from
vbus-core.  If you don't like it (and I know you don't), we can write
one that interfaces to qemu's pci system.  I just don't like the
limitations that imposes, nor do I think we need that complexity of
dealing with a split PCI model, so I chose to not implement vbus-kvm
this way.

With all due respect, based on all of your comments in aggregate I
really do not think you are truly grasping what I am actually building here.

> 
> I guess these two are exactly what you think are vbus' greatest
> advantages, so we'll probably have to extend our agree-to-disagree on
> this one.
> 
> I also had issues with using just one interrupt vector to service all
> events, but that's easily fixed.

Again, function of the connector.

> 
>>> There is no guest and host in this scenario.  There's a device side
>>> (ppc) and a driver side (x86).  The driver side can access configuration
>>> information on the device side.  How to multiplex multiple devices is an
>>> interesting exercise for whoever writes the virtio binding for that
>>> setup.
>>>      
>> Bingo.  So now its a question of do you want to write this layer from
>> scratch, or re-use my framework.
>>    
> 
> You will have to implement a connector or whatever for vbus as well. 
> vbus has more layers so it's probably smaller for vbus.

Bingo! That is precisely the point.

All the stuff for how to map eventfds, handle signal mitigation, demux
device/function pointers, isolation, etc., is built in.  All the
connector has to do is transport the 4-6 verbs and provide a memory
mapping/copy function, and the rest is reusable.  The device models
would then work in all environments unmodified, and likewise the
connectors could use all device-models unmodified.

> 
>>>>>
>>>>>          
>>>> I am talking about how we would tunnel the config space for N devices
>>>> across his transport.
>>>>
>>>>        
>>> Sounds trivial.
>>>      
>> No one said it was rocket science.  But it does need to be designed and
>> implemented end-to-end, much of which Ive already done in what I hope is
>> an extensible way.
>>    
> 
> It was already implemented three times for virtio, so apparently that's
> extensible too.

And to my point, I'm trying to commoditize as much of that process as
possible on both the front and backends (at least for cases where
performance matters) so that you don't need to reinvent the wheel for
each one.

> 
>>>   Write an address containing the device number and
>>> register number to on location, read or write data from another.
>>>      
>> You mean like the "u64 devh", and "u32 func" fields I have here for the
>> vbus-kvm connector?
>>
>> http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l64
>>
>>
>>    
> 
> Probably.
> 
> 
> 
>>>> That sounds convenient given his hardware, but it has its own set of
>>>> problems.  For one, the configuration/inventory of these boards is now
>>>> driven by the wrong side and has to be addressed.
>>>>        
>>> Why is it the wrong side?
>>>      
>> "Wrong" is probably too harsh a word when looking at ethernet.  Its
>> certainly "odd", and possibly inconvenient.  It would be like having
>> vhost in a KVM guest, and virtio-net running on the host.  You could do
>> it, but its weird and awkward.  Where it really falls apart and enters
>> the "wrong" category is for non-symmetric devices, like disk-io.
>>
>>    
> 
> 
> It's not odd or wrong or wierd or awkward.

It's weird IMO because IIUC the ppc boards are not really "NICs".  Yes,
their arrangement as bus-master PCI devices makes them look and smell
like "devices", but that is an implementation detail of its transport
(like hypercalls/PIO in KVM) and not relevant to its broader role in the
system.

They are more or less like "guests" from the KVM world.  The x86 is
providing connectivity resources to these guests, not the other way
around.  It is not a goal to make the x86 look like it has a multihomed
array of ppc based NIC adapters.  The only reason we would treat these
ppc boards like NICs is because (iiuc) that is the only way vhost can be
hacked to work with the system, not because it's the optimal design.

FWIW: There are a ton of chassis-based systems that look similar to
Ira's out there (PCI inter-connected nodes), and I would like to support
them, too.  So it's not like this is a one-off.


> An ethernet NIC is not
> symmetric, one side does DMA and issues interrupts, the other uses its
> own memory.

I never said a NIC was.  I meant the ethernet _protocol_ is symmetric.
I meant it in the sense that you can ingress/egress packets in either
direction and as long as "TX" on one side is "RX" on the other and vice
versa, it all kind of works.  You can even loop it back and it still works.

Contrast this to something like a disk-block protocol where a "read"
message is expected to actually do a read, etc.  In this case, you
cannot arbitrarily assign the location of the "driver" and "device" like
you can with ethernet.  The device should presumably be where the
storage is, and the driver should be where the consumer is.

>  That's exactly the case with Ira's setup.

See "implementation detail" comment above.


> 
> If the ppc boards were to emulate a disk controller, you'd run
> virtio-blk on x86 and vhost-blk on the ppc boards.

Agreed.

> 
>>>> Second, the role
>>>> reversal will likely not work for many models other than ethernet (e.g.
>>>> virtio-console or virtio-blk drivers running on the x86 board would be
>>>> naturally consuming services from the slave boards...virtio-net is an
>>>> exception because 802.x is generally symmetrical).
>>>>
>>>>        
>>> There is no role reversal.
>>>      
>> So if I have virtio-blk driver running on the x86 and vhost-blk device
>> running on the ppc board, I can use the ppc board as a block-device.
>> What if I really wanted to go the other way?
>>    
> 
> You mean, if the x86 board was able to access the disks and dma into the
> ppb boards memory?  You'd run vhost-blk on x86 and virtio-net on ppc.

But as we discussed, vhost doesn't work well if you try to run it on the
x86 side due to its assumptions about pageable "guest" memory, right?  So
is that even an option?  And even still, you would still need to solve
the aggregation problem so that multiple devices can coexist.

> 
> As long as you don't use the words "guest" and "host" but keep to
> "driver" and "device", it all works out.
> 
>>> The side doing dma is the device, the side
>>> accessing its own memory is the driver.  Just like that other 1e12
>>> driver/device pairs out there.
>>>      
>> IIUC, his ppc boards really can be seen as "guests" (they are linux
>> instances that are utilizing services from the x86, not the other way
>> around).
> 
> They aren't guests.  Guests don't dma into their host's memory.

That's not relevant.  They are not guests in the sense of isolated
virtualized guests like KVM.  They are guests in the sense that they are
subordinate linux instances which utilize IO resources on the x86 (host).

The way this would work is that the x86 would be driving the dma
controller on the ppc board, not the other way around.  The fact that
the controller lives on the ppc board is an implementation detail.

The way I envision this to work would be that the ppc board exports two
functions in its device:

1) a vbus-bridge like device
2) a dma-controller that accepts "gpas" as one parameter

So function (1) does the 4-6 verbs I mentioned for device addressing,
etc.  Function (2) is utilized by the x86 memctx whenever a
->copy_from() or ->copy_to() operation is invoked.  The ppc boards
would be doing their normal virtio kind of things, like
->add_buf(__pa(skb->data)).
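
To illustrate the shape of the abstraction I keep referring to
(simplified/invented names, not a verbatim copy of the memctx code):

#include <linux/types.h>

struct memctx;

struct memctx_ops {
        /* pull 'len' bytes from guest/remote physical address 'gpa' */
        int (*copy_from)(struct memctx *ctx, void *dst, u64 gpa, size_t len);
        /* push 'len' bytes to guest/remote physical address 'gpa' */
        int (*copy_to)(struct memctx *ctx, u64 gpa, const void *src, size_t len);
};

struct memctx {
        const struct memctx_ops *ops;
};

A KVM-style backend would implement those ops with slot/hva translation
plus copy_to/from_user(); an x86<->ppc backend would instead program the
board's dma-controller with the gpa handed up by ->add_buf().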

> 
>> vhost forces the model to have the ppc boards act as IO-hosts,
>> whereas vbus would likely work in either direction due to its more
>> refined abstraction layer.
>>    
> 
> vhost=device=dma, virtio=driver=own-memory.

I agree that virtio=driver=own-memory.  The problem is vhost != dma.
vhost = hva*, and it just so happens that Ira's ppc boards support host
mapping/dma so it kind of works.

What I have been trying to say is that the extra abstraction to the
memctx gets the "vhost" side away from hva*, such that it can support
hva if that makes sense, or something else (like a custom dma engine if
it doesn't).

> 
>>> Of course vhost is incomplete, in the same sense that Linux is
>>> incomplete.  Both require userspace.
>>>      
>> A vhost based solution to Iras design is missing more than userspace.
>> Many of those gaps are addressed by a vbus based solution.
>>    
> 
> Maybe.  Ira can fill the gaps or use vbus.
> 
> 

Agreed.

Kind Regards,
-Greg
Avi Kivity Sept. 16, 2009, 9 p.m. UTC | #41
On 09/16/2009 10:22 PM, Gregory Haskins wrote:
> Avi Kivity wrote:
>    
>> On 09/16/2009 05:10 PM, Gregory Haskins wrote:
>>      
>>>> If kvm can do it, others can.
>>>>
>>>>          
>>> The problem is that you seem to either hand-wave over details like this,
>>> or you give details that are pretty much exactly what vbus does already.
>>>    My point is that I've already sat down and thought about these issues
>>> and solved them in a freely available GPL'ed software package.
>>>
>>>        
>> In the kernel.  IMO that's the wrong place for it.
>>      
> 3) "in-kernel": You can do something like virtio-net to vhost to
> potentially meet some of the requirements, but not all.
>
> In order to fully meet (3), you would need to do some of that stuff you
> mentioned in the last reply with muxing device-nr/reg-nr.  In addition,
> we need to have a facility for mapping eventfds and establishing a
> signaling mechanism (like PIO+qid), etc. KVM does this with
> IRQFD/IOEVENTFD, but we dont have KVM in this case so it needs to be
> invented.
>    

irqfd/eventfd is the abstraction layer; it doesn't need to be reabstracted.

> To meet performance, this stuff has to be in kernel and there has to be
> a way to manage it.

and management belongs in userspace.

> Since vbus was designed to do exactly that, this is
> what I would advocate.  You could also reinvent these concepts and put
> your own mux and mapping code in place, in addition to all the other
> stuff that vbus does.  But I am not clear why anyone would want to.
>    

Maybe they like their backward compatibility and Windows support.

> So no, the kernel is not the wrong place for it.  Its the _only_ place
> for it.  Otherwise, just use (1) and be done with it.
>
>    

I'm talking about the config stuff, not the data path.

>>   Further, if we adopt
>> vbus, if drop compatibility with existing guests or have to support both
>> vbus and virtio-pci.
>>      
> We already need to support both (at least to support Ira).  virtio-pci
> doesn't work here.  Something else (vbus, or vbus-like) is needed.
>    

virtio-ira.

>>> So the question is: is your position that vbus is all wrong and you wish
>>> to create a new bus-like thing to solve the problem?
>>>        
>> I don't intend to create anything new, I am satisfied with virtio.  If
>> it works for Ira, excellent.  If not, too bad.
>>      
> I think that about sums it up, then.
>    

Yes.  I'm all for reusing virtio, but I'm not going to switch to vbus or
support both for this esoteric use case.

>>> If so, how is it
>>> different from what Ive already done?  More importantly, what specific
>>> objections do you have to what Ive done, as perhaps they can be fixed
>>> instead of starting over?
>>>
>>>        
>> The two biggest objections are:
>> - the host side is in the kernel
>>      
> As it needs to be.
>    

vhost-net somehow manages to work without the config stuff in the kernel.

> With all due respect, based on all of your comments in aggregate I
> really do not think you are truly grasping what I am actually building here.
>    

Thanks.



>>> Bingo.  So now its a question of do you want to write this layer from
>>> scratch, or re-use my framework.
>>>
>>>        
>> You will have to implement a connector or whatever for vbus as well.
>> vbus has more layers so it's probably smaller for vbus.
>>      
> Bingo!

(addictive, isn't it)

> That is precisely the point.
>
> All the stuff for how to map eventfds, handle signal mitigation, demux
> device/function pointers, isolation, etc, are built in.  All the
> connector has to do is transport the 4-6 verbs and provide a memory
> mapping/copy function, and the rest is reusable.  The device models
> would then work in all environments unmodified, and likewise the
> connectors could use all device-models unmodified.
>    

Well, virtio has a similar abstraction on the guest side.  The host side 
abstraction is limited to signalling since all configuration is in 
userspace.  vhost-net ought to work for lguest and s390 without change.

>> It was already implemented three times for virtio, so apparently that's
>> extensible too.
>>      
> And to my point, I'm trying to commoditize as much of that process as
> possible on both the front and backends (at least for cases where
> performance matters) so that you don't need to reinvent the wheel for
> each one.
>    

Since you're interested in any-to-any connectors it makes sense to you.  
I'm only interested in kvm-host-to-kvm-guest, so reducing the already 
minor effort to implement a new virtio binding has little appeal to me.

>> You mean, if the x86 board was able to access the disks and dma into the
>> ppb boards memory?  You'd run vhost-blk on x86 and virtio-net on ppc.
>>      
> But as we discussed, vhost doesn't work well if you try to run it on the
> x86 side due to its assumptions about pagable "guest" memory, right?  So
> is that even an option?  And even still, you would still need to solve
> the aggregation problem so that multiple devices can coexist.
>    

I don't know.  Maybe it can be made to work and maybe it cannot.  It 
probably can with some determined hacking.
Gregory Haskins Sept. 17, 2009, 3:11 a.m. UTC | #42
Avi Kivity wrote:
> On 09/16/2009 10:22 PM, Gregory Haskins wrote:
>> Avi Kivity wrote:
>>   
>>> On 09/16/2009 05:10 PM, Gregory Haskins wrote:
>>>     
>>>>> If kvm can do it, others can.
>>>>>
>>>>>          
>>>> The problem is that you seem to either hand-wave over details like
>>>> this,
>>>> or you give details that are pretty much exactly what vbus does
>>>> already.
>>>>    My point is that I've already sat down and thought about these
>>>> issues
>>>> and solved them in a freely available GPL'ed software package.
>>>>
>>>>        
>>> In the kernel.  IMO that's the wrong place for it.
>>>      
>> 3) "in-kernel": You can do something like virtio-net to vhost to
>> potentially meet some of the requirements, but not all.
>>
>> In order to fully meet (3), you would need to do some of that stuff you
>> mentioned in the last reply with muxing device-nr/reg-nr.  In addition,
>> we need to have a facility for mapping eventfds and establishing a
>> signaling mechanism (like PIO+qid), etc. KVM does this with
>> IRQFD/IOEVENTFD, but we dont have KVM in this case so it needs to be
>> invented.
>>    
> 
> irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted.

Not per se, but it needs to be interfaced.  How do I register that
eventfd with the fastpath in Ira's rig? How do I signal the eventfd
(x86->ppc, and ppc->x86)?

To take it to the next level, how do I organize that mechanism so that
it works for more than one IO-stream (e.g. address the various queues
within ethernet or a different device like the console)?  KVM has
IOEVENTFD and IRQFD managed with MSI and PIO.  This new rig does not
have the luxury of an established IO paradigm.
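
(The local primitives themselves are simple enough -- e.g. this sketch
of a driver signalling an eventfd handed to it by userspace -- the open
question is how that signal crosses the PCI link in each direction:)

#include <linux/err.h>
#include <linux/eventfd.h>

static struct eventfd_ctx *call_ctx;

static int hook_up_call(int fd)         /* fd arrives via some ioctl */
{
        call_ctx = eventfd_ctx_fdget(fd);
        return IS_ERR(call_ctx) ? PTR_ERR(call_ctx) : 0;
}

static void board_interrupt(void)       /* e.g. from the ISR/tasklet */
{
        eventfd_signal(call_ctx, 1);    /* wake up whoever waits on the fd */
}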

Is vbus the only way to implement a solution?  No.  But it is _a_ way,
and it's one that was specifically designed to solve this very problem
(as well as others).

(As an aside, note that you generally will want an abstraction on top of
irqfd/eventfd like shm-signal or virtqueues to do shared-memory based
event mitigation, but I digress.  That is a separate topic).

> 
>> To meet performance, this stuff has to be in kernel and there has to be
>> a way to manage it.
> 
> and management belongs in userspace.

vbus does not dictate where the management must be.  It's an extensible
framework, governed by what you plug into it (a la connectors and devices).

For instance, the vbus-kvm connector in alacrityvm chooses to put DEVADD
and DEVDROP hotswap events into the interrupt stream, because they are
simple and we already needed the interrupt stream anyway for fast-path.

As another example: venet chose to put ->call(MACQUERY) "config-space"
into its call namespace because it's simple, and we already need
->calls() for fastpath.  It therefore exports an attribute to sysfs that
allows the management app to set it.

I could likewise have designed the connector or device-model differently
so as to keep the mac-address and hotswap-events somewhere else (QEMU/PCI
userspace), but this seems silly to me when they are so trivial, so I didn't.
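
As a rough illustration of how small that kind of export can be, a sysfs
attribute for the mac-address might look like the sketch below.  This is
not the actual venet code; the structure and attribute names are
hypothetical.

#include <linux/device.h>
#include <linux/etherdevice.h>

/* Hypothetical per-device state; not the actual venet structure. */
struct venet_priv {
        struct device dev;
        unsigned char client_mac[ETH_ALEN];
};

static ssize_t client_mac_show(struct device *dev,
                               struct device_attribute *attr, char *buf)
{
        struct venet_priv *priv = container_of(dev, struct venet_priv, dev);

        return sprintf(buf, "%pM\n", priv->client_mac);
}

static ssize_t client_mac_store(struct device *dev,
                                struct device_attribute *attr,
                                const char *buf, size_t count)
{
        struct venet_priv *priv = container_of(dev, struct venet_priv, dev);
        unsigned int m[ETH_ALEN];
        int i;

        /* management app writes "aa:bb:cc:dd:ee:ff"; the guest later reads
         * it back over the fastpath via ->call(MACQUERY) */
        if (sscanf(buf, "%x:%x:%x:%x:%x:%x",
                   &m[0], &m[1], &m[2], &m[3], &m[4], &m[5]) != ETH_ALEN)
                return -EINVAL;
        for (i = 0; i < ETH_ALEN; i++)
                priv->client_mac[i] = m[i];
        return count;
}

static DEVICE_ATTR(client_mac, 0644, client_mac_show, client_mac_store);
/* registered at device-create time with
 * device_create_file(&priv->dev, &dev_attr_client_mac) */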

> 
>> Since vbus was designed to do exactly that, this is
>> what I would advocate.  You could also reinvent these concepts and put
>> your own mux and mapping code in place, in addition to all the other
>> stuff that vbus does.  But I am not clear why anyone would want to.
>>    
> 
> Maybe they like their backward compatibility and Windows support.

This is really not relevant to this thread, since we are talking about
Ira's hardware.  But if you must bring this up, then I will reiterate
that you just design the connector to interface with QEMU+PCI and you
have that too if that was important to you.

But on that topic: Since you could consider KVM a "motherboard
manufacturer" of sorts (it just happens to be virtual hardware), I don't
know why KVM seems to consider itself the only motherboard manufacturer
in the world that has to make everything look legacy.  If a company like
ASUS wants to add some cutting edge IO controller/bus, they simply do
it.  Pretty much every product release may contain a different array of
devices, many of which are not backwards compatible with any prior
silicon.  The guy/gal installing Windows on that system may see a "?" in
device-manager until they load a driver that supports the new chip, and
subsequently it works.  It is certainly not a requirement to make said
chip somehow work with existing drivers/facilities on bare metal, per
se.  Why should virtual systems be different?

So, yeah, the current design of the vbus-kvm connector means I have to
provide a driver.  This is understood, and I have no problem with that.

The only thing that I would agree has to be backwards compatible is the
BIOS/boot function.  If you can't support running an image like the
Windows installer, you are hosed.  If you can't use your ethernet until
you get a chance to install a driver after the install completes, it's
just like most other systems in existence.  IOW: It's not a big deal.

For cases where the IO system is needed as part of the boot/install, you
provide BIOS and/or an install-disk support for it.

> 
>> So no, the kernel is not the wrong place for it.  Its the _only_ place
>> for it.  Otherwise, just use (1) and be done with it.
>>
>>    
> 
> I'm talking about the config stuff, not the data path.

As stated above, where config stuff lives is a function of what you
interface to vbus.  Data-path stuff must be in the kernel for
performance reasons, and this is what I was referring to.  I think we
are generally both in agreement, here.

What I was getting at is that you can't just hand-wave the datapath
stuff.  We do fast path in KVM with IRQFD/IOEVENTFD+PIO, and we do
device discovery/addressing with PCI.  Neither of those are available
here in Ira's case yet the general concepts are needed.  Therefore, we
have to come up with something else.

> 
>>>   Further, if we adopt
>>> vbus, we drop compatibility with existing guests or have to support both
>>> vbus and virtio-pci.
>>>      
>> We already need to support both (at least to support Ira).  virtio-pci
>> doesn't work here.  Something else (vbus, or vbus-like) is needed.
>>    
> 
> virtio-ira.

Sure, virtio-ira and he is on his own to make a bus-model under that, or
virtio-vbus + vbus-ira-connector to use the vbus framework.  Either
model can work, I agree.

> 
>>>> So the question is: is your position that vbus is all wrong and you
>>>> wish
>>>> to create a new bus-like thing to solve the problem?
>>>>        
>>> I don't intend to create anything new, I am satisfied with virtio.  If
>>> it works for Ira, excellent.  If not, too bad.
>>>      
>> I think that about sums it up, then.
>>    
> 
> Yes.  I'm all for reusing virtio, but I'm not going switch to vbus or
> support both for this esoteric use case.

With all due respect, no one asked you to.  This sub-thread was
originally about using vhost in Ira's rig.  When problems surfaced in
that proposed model, I highlighted that I had already addressed that
problem in vbus, and here we are.

> 
>>>> If so, how is it
>>>> different from what Ive already done?  More importantly, what specific
>>>> objections do you have to what Ive done, as perhaps they can be fixed
>>>> instead of starting over?
>>>>
>>>>        
>>> The two biggest objections are:
>>> - the host side is in the kernel
>>>      
>> As it needs to be.
>>    
> 
> vhost-net somehow manages to work without the config stuff in the kernel.

I was referring to data-path stuff, like signal and memory
configuration/routing.

As an aside, it should be noted that vhost under KVM has
IRQFD/IOEVENTFD, PCI-emulation, QEMU, etc to complement it and fill in
some of the pieces one needs for a complete solution.  Not all
environments have all of those pieces (nor should they), and those
pieces need to come from somewhere.

It should also be noted that what remains (config/management) after the
data-path stuff is laid out is actually quite simple.  It consists of
pretty much an enumerated list of device-ids within a container,
DEVADD(id), DEVDROP(id) events, and some sysfs attributes as defined on
a per-device basis (many of which are often needed regardless of whether
the "config-space" operation is handled in-kernel or not)

Therefore, the configuration aspect of the system does not necessitate a
complicated (e.g. full PCI emulation) or external (e.g. userspace)
component per se.  The parts of vbus that could be construed as
"management" are (afaict) built using accepted/best-practices for
managing arbitrary kernel subsystems (sysfs, configfs, ioctls, etc) so
there is nothing new or reasonably controversial there.  It is for this
reason that I think the objection to "in-kernel config" is unfounded.

Disagreements on this point may be settled by the connector design,
while still utilizing vbus, and thus retaining most of the other
benefits of using the vbus framework.  The connector ultimately dictates
how and what is exposed to the "guest".

> 
>> With all due respect, based on all of your comments in aggregate I
>> really do not think you are truly grasping what I am actually building
>> here.
>>    
> 
> Thanks.
> 
> 
> 
>>>> Bingo.  So now its a question of do you want to write this layer from
>>>> scratch, or re-use my framework.
>>>>
>>>>        
>>> You will have to implement a connector or whatever for vbus as well.
>>> vbus has more layers so it's probably smaller for vbus.
>>>      
>> Bingo!
> 
> (addictive, isn't it)

Apparently.

> 
>> That is precisely the point.
>>
>> All the stuff for how to map eventfds, handle signal mitigation, demux
>> device/function pointers, isolation, etc, are built in.  All the
>> connector has to do is transport the 4-6 verbs and provide a memory
>> mapping/copy function, and the rest is reusable.  The device models
>> would then work in all environments unmodified, and likewise the
>> connectors could use all device-models unmodified.
>>    
> 
> Well, virtio has a similar abstraction on the guest side.  The host side
> abstraction is limited to signalling since all configuration is in
> userspace.  vhost-net ought to work for lguest and s390 without change.

But IIUC that is primarily because the revectoring work is already in
QEMU for virtio-u and it rides on that, right?  Not knocking that, that's
nice and a distinct advantage.  It should just be noted that it's based
on sunk-cost, and not truly free.  It's just already paid for, which is
different.  It also means it only works in environments based on QEMU,
which not all are (as evident by this sub-thread).

> 
>>> It was already implemented three times for virtio, so apparently that's
>>> extensible too.
>>>      
>> And to my point, I'm trying to commoditize as much of that process as
>> possible on both the front and backends (at least for cases where
>> performance matters) so that you don't need to reinvent the wheel for
>> each one.
>>    
> 
> Since you're interested in any-to-any connectors it makes sense to you. 
> I'm only interested in kvm-host-to-kvm-guest, so reducing the already
> minor effort to implement a new virtio binding has little appeal to me.
> 

Fair enough.

>>> You mean, if the x86 board was able to access the disks and dma into the
>>> ppb boards memory?  You'd run vhost-blk on x86 and virtio-net on ppc.
>>>      
>> But as we discussed, vhost doesn't work well if you try to run it on the
>> x86 side due to its assumptions about pagable "guest" memory, right?  So
>> is that even an option?  And even still, you would still need to solve
>> the aggregation problem so that multiple devices can coexist.
>>    
> 
> I don't know.  Maybe it can be made to work and maybe it cannot.  It
> probably can with some determined hacking.
> 

I guess you can say the same for any of the solutions.

Kind Regards,
-Greg
Michael S. Tsirkin Sept. 17, 2009, 3:57 a.m. UTC | #43
On Wed, Sep 16, 2009 at 10:10:55AM -0400, Gregory Haskins wrote:
> > There is no role reversal.
> 
> So if I have virtio-blk driver running on the x86 and vhost-blk device
> running on the ppc board, I can use the ppc board as a block-device.
> What if I really wanted to go the other way?

It seems ppc is the only one that can initiate DMA to an arbitrary
address, so you can't do this really, or you can by tunneling each
request back to ppc, or doing an extra data copy, but it's unlikely to
work well.

The limitation comes from hardware, not from the API we use.
Gregory Haskins Sept. 17, 2009, 4:13 a.m. UTC | #44
Michael S. Tsirkin wrote:
> On Wed, Sep 16, 2009 at 10:10:55AM -0400, Gregory Haskins wrote:
>>> There is no role reversal.
>> So if I have virtio-blk driver running on the x86 and vhost-blk device
>> running on the ppc board, I can use the ppc board as a block-device.
>> What if I really wanted to go the other way?
> 
> It seems ppc is the only one that can initiate DMA to an arbitrary
> address, so you can't do this really, or you can by tunneling each
> request back to ppc, or doing an extra data copy, but it's unlikely to
> work well.
> 
> The limitation comes from hardware, not from the API we use.

Understood, but presumably it can be exposed as a sub-function of the
ppc board's register file as a DMA-controller service to the x86.
This would fall into the "tunnel requests back" category you mention
above, though I think "tunnel" implies a heavier protocol than it would
actually require.  This would look more like a PIO cycle to a DMA
controller than some higher layer protocol.

You would then utilize that DMA service inside the memctx, and the
rest of vbus would work transparently with the existing devices/drivers.

I do agree it would require some benchmarking to determine its
feasibility, which is why I was careful to say things like "may work"
;).  I also do not even know if it's possible to expose the service this
way on his system.  If this design is not possible or performs poorly, I
admit vbus is just as hosed as vhost in regard to the "role correction"
benefit.

Kind Regards,
-Greg
Avi Kivity Sept. 17, 2009, 7:49 a.m. UTC | #45
On 09/17/2009 06:11 AM, Gregory Haskins wrote:
>
>> irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted.
>>      
> Not per se, but it needs to be interfaced.  How do I register that
> eventfd with the fastpath in Ira's rig? How do I signal the eventfd
> (x86->ppc, and ppc->x86)?
>    

You write a userspace or kernel module to do it.  It's a few dozen lines 
of code.

> To take it to the next level, how do I organize that mechanism so that
> it works for more than one IO-stream (e.g. address the various queues
> within ethernet or a different device like the console)?  KVM has
> IOEVENTFD and IRQFD managed with MSI and PIO.  This new rig does not
> have the luxury of an established IO paradigm.
>
> Is vbus the only way to implement a solution?  No.  But it is _a_ way,
> and its one that was specifically designed to solve this very problem
> (as well as others).
>    

virtio assumes that the number of transports will be limited and 
interesting growth is in the number of device classes and drivers.  So 
we have support for just three transports, but 6 device classes (9p, 
rng, balloon, console, blk, net) and 8 drivers (the preceding 6 for 
linux, plus blk/net for Windows).  It would have been nice to be able to
write a new binding in Visual Basic, but it's hardly a killer feature.


>>> Since vbus was designed to do exactly that, this is
>>> what I would advocate.  You could also reinvent these concepts and put
>>> your own mux and mapping code in place, in addition to all the other
>>> stuff that vbus does.  But I am not clear why anyone would want to.
>>>
>>>        
>> Maybe they like their backward compatibility and Windows support.
>>      
> This is really not relevant to this thread, since we are talking about
> Ira's hardware.  But if you must bring this up, then I will reiterate
> that you just design the connector to interface with QEMU+PCI and you
> have that too if that was important to you.
>    

Well, for Ira the major issue is probably inclusion in the upstream kernel.

> But on that topic: Since you could consider KVM a "motherboard
> manufacturer" of sorts (it just happens to be virtual hardware), I don't
> know why KVM seems to consider itself the only motherboard manufacturer
> in the world that has to make everything look legacy.  If a company like
> ASUS wants to add some cutting edge IO controller/bus, they simply do
> it.

No, they don't.  New buses are added through industry consortiums these 
days.  No one adds a bus that is only available with their machine, not 
even Apple.

> Pretty much every product release may contain a different array of
> devices, many of which are not backwards compatible with any prior
> silicon.  The guy/gal installing Windows on that system may see a "?" in
> device-manager until they load a driver that supports the new chip, and
> subsequently it works.  It is certainly not a requirement to make said
> chip somehow work with existing drivers/facilities on bare metal, per
> se.  Why should virtual systems be different?
>    

Devices/drivers are a different matter, and if you have a virtio-net 
device you'll get the same "?" until you load the driver.  That's how 
people and the OS vendors expect things to work.

> What I was getting at is that you can't just hand-wave the datapath
> stuff.  We do fast path in KVM with IRQFD/IOEVENTFD+PIO, and we do
> device discovery/addressing with PCI.

That's not datapath stuff.

> Neither of those are available
> here in Ira's case yet the general concepts are needed.  Therefore, we
> have to come up with something else.
>    

Ira has to implement virtio's ->kick() function and come up with 
something for discovery.  It's a lot less lines of code than there are 
messages in this thread.
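
As a sketch of how small that ->kick() can be when the "doorbell" is a
register in one of the ppc board's PCI BARs (the vphys_* names are
hypothetical, and the exact virtqueue notify signature varies between
kernel versions):

#include <linux/virtio.h>
#include <linux/io.h>

/* Hypothetical per-virtqueue state kept by the x86-side transport. */
struct vphys_vq_info {
        void __iomem *doorbell; /* outbound doorbell register, ioremapped BAR */
        u16 queue_index;
};

/* Passed as the notify callback when the virtqueue is created. */
static void vphys_notify(struct virtqueue *vq)
{
        struct vphys_vq_info *info = vq->priv;

        /* one posted write across the PCI bus tells the ppc side which
         * queue has new buffers; its doorbell hardware turns that into
         * a local interrupt */
        iowrite32(info->queue_index, info->doorbell);
}

Discovery is then a separate, slow-path problem (enumerating whatever
per-device blocks the board exposes), and can be as simple as it needs
to be.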

>> Yes.  I'm all for reusing virtio, but I'm not going switch to vbus or
>> support both for this esoteric use case.
>>      
> With all due respect, no one asked you to.  This sub-thread was
> originally about using vhost in Ira's rig.  When problems surfaced in
> that proposed model, I highlighted that I had already addressed that
> problem in vbus, and here we are.
>    

Ah, okay.  I have no interest in Ira choosing either virtio or vbus.



>> vhost-net somehow manages to work without the config stuff in the kernel.
>>      
> I was referring to data-path stuff, like signal and memory
> configuration/routing.
>    

signal and memory configuration/routing are not data-path stuff.

>> Well, virtio has a similar abstraction on the guest side.  The host side
>> abstraction is limited to signalling since all configuration is in
>> userspace.  vhost-net ought to work for lguest and s390 without change.
>>      
> But IIUC that is primarily because the revectoring work is already in
> QEMU for virtio-u and it rides on that, right?  Not knocking that, thats
> nice and a distinct advantage.  It should just be noted that its based
> on sunk-cost, and not truly free.  Its just already paid for, which is
> different.  It also means it only works in environments based on QEMU,
> which not all are (as evident by this sub-thread).
>    

No.  We expose a mix of emulated-in-userspace and emulated-in-the-kernel 
devices on one bus.  Devices emulated in userspace only lose by having 
the bus emulated in the kernel.  Devices in the kernel gain nothing from 
having the bus emulated in the kernel.  It's a complete slow path so it 
belongs in userspace where state is easy to get at, development is 
faster, and bugs are cheaper to fix.
Javier Guerra Sept. 17, 2009, 2:16 p.m. UTC | #46
On Wed, Sep 16, 2009 at 10:11 PM, Gregory Haskins
<gregory.haskins@gmail.com> wrote:
> It is certainly not a requirement to make said
> chip somehow work with existing drivers/facilities on bare metal, per
> se.  Why should virtual systems be different?

I'd guess it's an issue of support resources.  A hardware developer
creates a chip and immediately sells it, getting small but assured
revenue; with it they write (or pay to write) drivers for a couple of
releases, and stop manufacturing it as soon as it's not profitable.

Software has a much longer lifetime, especially at the platform level
(and KVM is a platform for a lot of us).  Also, being GPL, it's cheaper
to produce but has (much!) more limited resources.  Creating a new
support issue is a scary thought.
Ira Snyder Sept. 21, 2009, 9:43 p.m. UTC | #47
On Wed, Sep 16, 2009 at 11:11:57PM -0400, Gregory Haskins wrote:
> Avi Kivity wrote:
> > On 09/16/2009 10:22 PM, Gregory Haskins wrote:
> >> Avi Kivity wrote:
> >>   
> >>> On 09/16/2009 05:10 PM, Gregory Haskins wrote:
> >>>     
> >>>>> If kvm can do it, others can.
> >>>>>
> >>>>>          
> >>>> The problem is that you seem to either hand-wave over details like
> >>>> this,
> >>>> or you give details that are pretty much exactly what vbus does
> >>>> already.
> >>>>    My point is that I've already sat down and thought about these
> >>>> issues
> >>>> and solved them in a freely available GPL'ed software package.
> >>>>
> >>>>        
> >>> In the kernel.  IMO that's the wrong place for it.
> >>>      
> >> 3) "in-kernel": You can do something like virtio-net to vhost to
> >> potentially meet some of the requirements, but not all.
> >>
> >> In order to fully meet (3), you would need to do some of that stuff you
> >> mentioned in the last reply with muxing device-nr/reg-nr.  In addition,
> >> we need to have a facility for mapping eventfds and establishing a
> >> signaling mechanism (like PIO+qid), etc. KVM does this with
> >> IRQFD/IOEVENTFD, but we dont have KVM in this case so it needs to be
> >> invented.
> >>    
> > 
> > irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted.
> 
> Not per se, but it needs to be interfaced.  How do I register that
> eventfd with the fastpath in Ira's rig? How do I signal the eventfd
> (x86->ppc, and ppc->x86)?
> 

Sorry to reply so late to this thread, I've been on vacation for the
past week. If you'd like to continue in another thread, please start it
and CC me.

On the PPC, I've got a hardware "doorbell" register which generates 30
distinguishable interrupts over the PCI bus. I have outbound and inbound
registers, which can be used to signal the "other side".

I assume it isn't too much code to signal an eventfd in an interrupt
handler. I haven't gotten to this point in the code yet.
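
For reference, a minimal sketch of that glue (assuming the eventfd is
handed into the kernel as a file descriptor and resolved with
eventfd_ctx_fdget() at setup time; the doorbell-ack register and the
names are hardware specific and hypothetical):

#include <linux/interrupt.h>
#include <linux/eventfd.h>
#include <linux/io.h>

struct doorbell_ctx {
        struct eventfd_ctx *eventfd;    /* from eventfd_ctx_fdget(fd) */
        void __iomem *doorbell_clear;   /* inbound doorbell ack register */
};

static irqreturn_t doorbell_isr(int irq, void *data)
{
        struct doorbell_ctx *db = data;

        iowrite32(~0, db->doorbell_clear);      /* ack the inbound doorbell */
        eventfd_signal(db->eventfd, 1);         /* wake whoever waits on the fd */
        return IRQ_HANDLED;
}

/* setup, e.g. in the ioctl that receives the fd:
 *      db->eventfd = eventfd_ctx_fdget(fd);
 *      request_irq(irq, doorbell_isr, 0, "vphys-doorbell", db);
 */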

> To take it to the next level, how do I organize that mechanism so that
> it works for more than one IO-stream (e.g. address the various queues
> within ethernet or a different device like the console)?  KVM has
> IOEVENTFD and IRQFD managed with MSI and PIO.  This new rig does not
> have the luxury of an established IO paradigm.
> 
> Is vbus the only way to implement a solution?  No.  But it is _a_ way,
> and its one that was specifically designed to solve this very problem
> (as well as others).
> 
> (As an aside, note that you generally will want an abstraction on top of
> irqfd/eventfd like shm-signal or virtqueues to do shared-memory based
> event mitigation, but I digress.  That is a separate topic).
> 
> > 
> >> To meet performance, this stuff has to be in kernel and there has to be
> >> a way to manage it.
> > 
> > and management belongs in userspace.
> 
> vbus does not dictate where the management must be.  Its an extensible
> framework, governed by what you plug into it (ala connectors and devices).
> 
> For instance, the vbus-kvm connector in alacrityvm chooses to put DEVADD
> and DEVDROP hotswap events into the interrupt stream, because they are
> simple and we already needed the interrupt stream anyway for fast-path.
> 
> As another example: venet chose to put ->call(MACQUERY) "config-space"
> into its call namespace because its simple, and we already need
> ->calls() for fastpath.  It therefore exports an attribute to sysfs that
> allows the management app to set it.
> 
> I could likewise have designed the connector or device-model differently
> as to keep the mac-address and hotswap-events somewhere else (QEMU/PCI
> userspace) but this seems silly to me when they are so trivial, so I didn't.
> 
> > 
> >> Since vbus was designed to do exactly that, this is
> >> what I would advocate.  You could also reinvent these concepts and put
> >> your own mux and mapping code in place, in addition to all the other
> >> stuff that vbus does.  But I am not clear why anyone would want to.
> >>    
> > 
> > Maybe they like their backward compatibility and Windows support.
> 
> This is really not relevant to this thread, since we are talking about
> Ira's hardware.  But if you must bring this up, then I will reiterate
> that you just design the connector to interface with QEMU+PCI and you
> have that too if that was important to you.
> 
> But on that topic: Since you could consider KVM a "motherboard
> manufacturer" of sorts (it just happens to be virtual hardware), I don't
> know why KVM seems to consider itself the only motherboard manufacturer
> in the world that has to make everything look legacy.  If a company like
> ASUS wants to add some cutting edge IO controller/bus, they simply do
> it.  Pretty much every product release may contain a different array of
> devices, many of which are not backwards compatible with any prior
> silicon.  The guy/gal installing Windows on that system may see a "?" in
> device-manager until they load a driver that supports the new chip, and
> subsequently it works.  It is certainly not a requirement to make said
> chip somehow work with existing drivers/facilities on bare metal, per
> se.  Why should virtual systems be different?
> 
> So, yeah, the current design of the vbus-kvm connector means I have to
> provide a driver.  This is understood, and I have no problem with that.
> 
> The only thing that I would agree has to be backwards compatible is the
> BIOS/boot function.  If you can't support running an image like the
> Windows installer, you are hosed.  If you can't use your ethernet until
> you get a chance to install a driver after the install completes, its
> just like most other systems in existence.  IOW: It's not a big deal.
> 
> For cases where the IO system is needed as part of the boot/install, you
> provide BIOS and/or an install-disk support for it.
> 
> > 
> >> So no, the kernel is not the wrong place for it.  Its the _only_ place
> >> for it.  Otherwise, just use (1) and be done with it.
> >>
> >>    
> > 
> > I'm talking about the config stuff, not the data path.
> 
> As stated above, where config stuff lives is a function of what you
> interface to vbus.  Data-path stuff must be in the kernel for
> performance reasons, and this is what I was referring to.  I think we
> are generally both in agreement, here.
> 
> What I was getting at is that you can't just hand-wave the datapath
> stuff.  We do fast path in KVM with IRQFD/IOEVENTFD+PIO, and we do
> device discovery/addressing with PCI.  Neither of those are available
> here in Ira's case yet the general concepts are needed.  Therefore, we
> have to come up with something else.
> 
> > 
> >>>   Further, if we adopt
> >>> vbus, we drop compatibility with existing guests or have to support both
> >>> vbus and virtio-pci.
> >>>      
> >> We already need to support both (at least to support Ira).  virtio-pci
> >> doesn't work here.  Something else (vbus, or vbus-like) is needed.
> >>    
> > 
> > virtio-ira.
> 
> Sure, virtio-ira and he is on his own to make a bus-model under that, or
> virtio-vbus + vbus-ira-connector to use the vbus framework.  Either
> model can work, I agree.
> 

Yes, I'm having to create my own bus model, a la lguest, virtio-pci, and
virtio-s390. It isn't especially easy. I can steal lots of code from the
lguest bus model, but sometimes it is good to generalize, especially
after the fourth implementation or so. I think this is what GHaskins tried
to do.


Here is what I've implemented so far:

* a generic virtio-phys-guest layer (my bus model, like lguest)
	- this runs on the crate server (x86) in my system
* a generic virtio-phys-host layer (my /dev/lguest implementation)
	- this runs on the ppc boards in my system
	- this assumes that the kernel will allocate some memory and
	  expose it over PCI in a device-specific way, so the guest can
	  see it as a PCI BAR
* a virtio-phys-mpc83xx driver
	- this runs on the crate server (x86) in my system
	- this interfaces virtio-phys-guest to my mpc83xx board
	- it is a Linux PCI driver, which detects mpc83xx boards, runs
	  ioremap_pci_bar() on the correct PCI BAR, and then gives that
	  to the virtio-phys-guest layer

I think that the idea of device/driver (instead of host/guest) is a good
one. It makes my problem easier to think about.

I've given it some thought, and I think that running vhost-net (or
similar) on the ppc boards, with virtio-net on the x86 crate server will
work. The virtio-ring abstraction is almost good enough to work for this
situation, but I had to re-invent it to work with my boards.

I've exposed a 16K region of memory as PCI BAR1 from my ppc board.
Remember that this is the "host" system. I used each 4K block as a
"device descriptor" which contains:

1) the type of device, config space, etc. for virtio
2) the "desc" table (virtio memory descriptors, see virtio-ring)
3) the "avail" table (available entries in the desc table)

Parts 2 and 3 are repeated three times, to allow for a maximum of three
virtqueues per device. This is good enough for all current drivers.

The guest side (x86 in my system) allocates some device-accessible
memory, and writes the PCI address to the device descriptor. This memory
contains:

1) the "used" table (consumed entries in the desc/avail tables)

This exists three times as well, once for each virtqueue.
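
For readers following along, the layout being described could be
captured roughly by the structures below.  This is illustrative only;
the queue depth, config-space size, and placement of the guest-address
slot are my assumptions, not Ira's actual layout.

#include <linux/virtio_ring.h>

#define VPHYS_RING_SIZE 64      /* assumed queue depth */
#define VPHYS_MAX_VQS   3

/* Lives in the ppc board's PCI BAR1; written by the x86 "guest" and
 * only read by the ppc "host". */
struct vphys_vq_host {
        struct vring_desc desc[VPHYS_RING_SIZE];        /* part 2: desc table */
        __u16 avail_flags;                              /* part 3: avail ring */
        __u16 avail_idx;
        __u16 avail_ring[VPHYS_RING_SIZE];
};

/* One per 4K block in BAR1; fits in 4K with the sizes above. */
struct vphys_device_desc {
        __u32 device_type;                      /* part 1: virtio device ID */
        __u32 status;
        __u8  config[64];                       /* device-specific config space */
        __u64 guest_used_addr[VPHYS_MAX_VQS];   /* PCI address written by x86 */
        struct vphys_vq_host vq[VPHYS_MAX_VQS];
};

/* Allocated by the x86 side in its own RAM, device-accessible over PCI;
 * the ppc writes the used ring here. */
struct vphys_vq_guest {
        __u16 used_flags;
        __u16 used_idx;
        struct vring_used_elem used_ring[VPHYS_RING_SIZE];
};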

The rest is basically a copy of virtio-ring, with a few changes to allow
for caching, etc. It may not even be worth doing this from a
performance standpoint, I haven't benchmarked it yet.

For now, I'd be happy with a non-DMA memcpy only solution. I can add DMA
once things are working.

I've got the current code (subject to change at any time) available at
the address listed below. If you think another format would be better
for you, please ask, and I'll provide it.
http://www.mmarray.org/~iws/virtio-phys/

I've gotten plenty of email about this from lots of interested
developers. There are people who would like this kind of system to just
work, while having to write just some glue for their device, just like a
network driver. My hunch is that most people have created some proprietary
mess that basically works, and left it at that.

So, here is a desperate cry for help. I'd like to make this work, and
I'd really like to see it in mainline. I'm trying to give back to the
community from which I've taken plenty.

Ira
Avi Kivity Sept. 22, 2009, 9:43 a.m. UTC | #48
On 09/22/2009 12:43 AM, Ira W. Snyder wrote:
>
>> Sure, virtio-ira and he is on his own to make a bus-model under that, or
>> virtio-vbus + vbus-ira-connector to use the vbus framework.  Either
>> model can work, I agree.
>>
>>      
> Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and
> virtio-s390. It isn't especially easy. I can steal lots of code from the
> lguest bus model, but sometimes it is good to generalize, especially
> after the fourth implemention or so. I think this is what GHaskins tried
> to do.
>    

Yes.  vbus is more finely layered so there is less code duplication.

The virtio layering was more or less dictated by Xen which doesn't have 
shared memory (it uses grant references instead).  As a matter of fact 
lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that 
part is duplicated.  It's probably possible to add a virtio-shmem.ko 
library that people who do have shared memory can reuse.
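
No such library exists today, so purely to make the suggestion concrete:
it might amount to little more than a ring constructor over a
caller-provided memory window plus a kick hook.  Everything below is a
hypothetical interface, not an existing API.

#include <linux/virtio.h>

/* What a virtio-shmem.ko could factor out for transports that already
 * have shared memory (hypothetical). */
struct virtio_shmem_ops {
        /* map 'len' bytes of the shared window starting at 'offset' */
        void *(*map)(struct virtio_device *vdev, unsigned long offset,
                     size_t len);
        /* tell the other side that queue 'index' has work */
        void (*kick)(struct virtio_device *vdev, unsigned int index);
};

struct virtqueue *virtio_shmem_find_vq(struct virtio_device *vdev,
                                       unsigned int index,
                                       unsigned int num,
                                       const struct virtio_shmem_ops *ops,
                                       void (*callback)(struct virtqueue *vq));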

> I've given it some thought, and I think that running vhost-net (or
> similar) on the ppc boards, with virtio-net on the x86 crate server will
> work. The virtio-ring abstraction is almost good enough to work for this
> situation, but I had to re-invent it to work with my boards.
>
> I've exposed a 16K region of memory as PCI BAR1 from my ppc board.
> Remember that this is the "host" system. I used each 4K block as a
> "device descriptor" which contains:
>
> 1) the type of device, config space, etc. for virtio
> 2) the "desc" table (virtio memory descriptors, see virtio-ring)
> 3) the "avail" table (available entries in the desc table)
>    

Won't access from x86 to this memory be slow?  (On the other hand, if you
change it to main memory, access from ppc will be slow... it really depends
on how your system is tuned.)

> Parts 2 and 3 are repeated three times, to allow for a maximum of three
> virtqueues per device. This is good enough for all current drivers.
>    

The plan is to switch to multiqueue soon.  Will not affect you if your 
boards are uniprocessor or small smp.

> I've gotten plenty of email about this from lots of interested
> developers. There are people who would like this kind of system to just
> work, while having to write just some glue for their device, just like a
> network driver. I hunch most people have created some proprietary mess
> that basically works, and left it at that.
>    

So long as you keep the system-dependent features hookable or 
configurable, it should work.

> So, here is a desperate cry for help. I'd like to make this work, and
> I'd really like to see it in mainline. I'm trying to give back to the
> community from which I've taken plenty.
>    

Not sure who you're crying for help to.  Once you get this working, post 
patches.  If the patches are reasonably clean and don't impact 
performance for the main use case, and if you can show the need, I 
expect they'll be merged.
Ira Snyder Sept. 22, 2009, 3:25 p.m. UTC | #49
On Tue, Sep 22, 2009 at 12:43:36PM +0300, Avi Kivity wrote:
> On 09/22/2009 12:43 AM, Ira W. Snyder wrote:
> >
> >> Sure, virtio-ira and he is on his own to make a bus-model under that, or
> >> virtio-vbus + vbus-ira-connector to use the vbus framework.  Either
> >> model can work, I agree.
> >>
> >>      
> > Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and
> > virtio-s390. It isn't especially easy. I can steal lots of code from the
> > lguest bus model, but sometimes it is good to generalize, especially
> > after the fourth implemention or so. I think this is what GHaskins tried
> > to do.
> >    
> 
> Yes.  vbus is more finely layered so there is less code duplication.
> 
> The virtio layering was more or less dictated by Xen which doesn't have 
> shared memory (it uses grant references instead).  As a matter of fact 
> lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that 
> part is duplicated.  It's probably possible to add a virtio-shmem.ko 
> library that people who do have shared memory can reuse.
> 

Seems like a nice benefit of vbus.

> > I've given it some thought, and I think that running vhost-net (or
> > similar) on the ppc boards, with virtio-net on the x86 crate server will
> > work. The virtio-ring abstraction is almost good enough to work for this
> > situation, but I had to re-invent it to work with my boards.
> >
> > I've exposed a 16K region of memory as PCI BAR1 from my ppc board.
> > Remember that this is the "host" system. I used each 4K block as a
> > "device descriptor" which contains:
> >
> > 1) the type of device, config space, etc. for virtio
> > 2) the "desc" table (virtio memory descriptors, see virtio-ring)
> > 3) the "avail" table (available entries in the desc table)
> >    
> 
> Won't access from x86 be slow to this memory (on the other hand, if you 
> change it to main memory access from ppc will be slow... really depends 
> on how your system is tuned.
> 

Writes across the bus are fast, reads across the bus are slow. These are
just the descriptor tables for memory buffers, not the physical memory
buffers themselves.

These only need to be written by the guest (x86), and read by the host
(ppc). The host never changes the tables, so we can cache a copy in the
guest, for a fast detach_buf() implementation (see virtio-ring, which
I'm copying the design from).

The only accesses are writes across the PCI bus. There is never a need
to do a read (except for slow-path configuration).
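
Concretely, the guest-side publish path can keep a shadow of each
descriptor in ordinary RAM and mirror it into the BAR with posted writes
only, so detach_buf() never reads across the bus.  A sketch with
hypothetical names:

#include <linux/io.h>
#include <linux/virtio_ring.h>

struct vphys_guest_vq {
        struct vring_desc *shadow_desc;         /* copy in x86 RAM, fast to read */
        struct vring_desc __iomem *remote_desc; /* the table in the ppc BAR */
};

static void vphys_publish_desc(struct vphys_guest_vq *vq, u16 i,
                               u64 addr, u32 len, u16 flags)
{
        struct vring_desc *d = &vq->shadow_desc[i];

        d->addr = addr;
        d->len = len;
        d->flags = flags;
        d->next = 0;

        /* posted write across the bus; later lookups (detach_buf) use
         * only the local shadow */
        memcpy_toio(&vq->remote_desc[i], d, sizeof(*d));
}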

> > Parts 2 and 3 are repeated three times, to allow for a maximum of three
> > virtqueues per device. This is good enough for all current drivers.
> >    
> 
> The plan is to switch to multiqueue soon.  Will not affect you if your 
> boards are uniprocessor or small smp.
> 

Everything I have is UP. I don't need extreme performance, either.
40MB/sec is the minimum I need to reach, though I'd like to have some
headroom.

For reference, using the CPU to handle data transfers, I get ~2MB/sec
transfers. Using the DMA engine, I've hit about 60MB/sec with my
"crossed-wires" virtio-net.

> > I've gotten plenty of email about this from lots of interested
> > developers. There are people who would like this kind of system to just
> > work, while having to write just some glue for their device, just like a
> > network driver. I hunch most people have created some proprietary mess
> > that basically works, and left it at that.
> >    
> 
> So long as you keep the system-dependent features hookable or 
> configurable, it should work.
> 
> > So, here is a desperate cry for help. I'd like to make this work, and
> > I'd really like to see it in mainline. I'm trying to give back to the
> > community from which I've taken plenty.
> >    
> 
> Not sure who you're crying for help to.  Once you get this working, post 
> patches.  If the patches are reasonably clean and don't impact 
> performance for the main use case, and if you can show the need, I 
> expect they'll be merged.
> 

In the spirit of "post early and often", I'm making my code available,
that's all. I'm asking anyone interested for some review, before I have
to re-code this for about the fifth time now. I'm trying to avoid
Haskins' situation, where he's invented and debugged a lot of new code,
and then been told to do it completely differently.

Yes, the code I posted is only compile-tested, because quite a lot of
code (kernel and userspace) must be working before anything works at
all. I hate to design the whole thing, then be told that something
fundamental about it is wrong, and have to completely re-write it.

Thanks for the comments,
Ira

Avi Kivity Sept. 22, 2009, 3:56 p.m. UTC | #50
On 09/22/2009 06:25 PM, Ira W. Snyder wrote:
>
>> Yes.  vbus is more finely layered so there is less code duplication.
>>
>> The virtio layering was more or less dictated by Xen which doesn't have
>> shared memory (it uses grant references instead).  As a matter of fact
>> lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that
>> part is duplicated.  It's probably possible to add a virtio-shmem.ko
>> library that people who do have shared memory can reuse.
>>
>>      
> Seems like a nice benefit of vbus.
>    

Yes, it is.  With some work virtio can gain that too (virtio-shmem.ko).

>>> I've given it some thought, and I think that running vhost-net (or
>>> similar) on the ppc boards, with virtio-net on the x86 crate server will
>>> work. The virtio-ring abstraction is almost good enough to work for this
>>> situation, but I had to re-invent it to work with my boards.
>>>
>>> I've exposed a 16K region of memory as PCI BAR1 from my ppc board.
>>> Remember that this is the "host" system. I used each 4K block as a
>>> "device descriptor" which contains:
>>>
>>> 1) the type of device, config space, etc. for virtio
>>> 2) the "desc" table (virtio memory descriptors, see virtio-ring)
>>> 3) the "avail" table (available entries in the desc table)
>>>
>>>        
>> Won't access from x86 be slow to this memory (on the other hand, if you
>> change it to main memory access from ppc will be slow... really depends
>> on how your system is tuned.
>>
>>      
> Writes across the bus are fast, reads across the bus are slow. These are
> just the descriptor tables for memory buffers, not the physical memory
> buffers themselves.
>
> These only need to be written by the guest (x86), and read by the host
> (ppc). The host never changes the tables, so we can cache a copy in the
> guest, for a fast detach_buf() implementation (see virtio-ring, which
> I'm copying the design from).
>
> The only accesses are writes across the PCI bus. There is never a need
> to do a read (except for slow-path configuration).
>    

Okay, sounds like what you're doing is optimal then.

> In the spirit of "post early and often", I'm making my code available,
> that's all. I'm asking anyone interested for some review, before I have
> to re-code this for about the fifth time now. I'm trying to avoid
> Haskins' situation, where he's invented and debugged a lot of new code,
> and then been told to do it completely differently.
>
> Yes, the code I posted is only compile-tested, because quite a lot of
> code (kernel and userspace) must be working before anything works at
> all. I hate to design the whole thing, then be told that something
> fundamental about it is wrong, and have to completely re-write it.
>    

Understood.  Best to get a review from Rusty then.
Gregory Haskins Sept. 23, 2009, 2:26 p.m. UTC | #51
Avi Kivity wrote:
> On 09/22/2009 12:43 AM, Ira W. Snyder wrote:
>>
>>> Sure, virtio-ira and he is on his own to make a bus-model under that, or
>>> virtio-vbus + vbus-ira-connector to use the vbus framework.  Either
>>> model can work, I agree.
>>>
>>>      
>> Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and
>> virtio-s390. It isn't especially easy. I can steal lots of code from the
>> lguest bus model, but sometimes it is good to generalize, especially
>> after the fourth implemention or so. I think this is what GHaskins tried
>> to do.
>>    
> 
> Yes.  vbus is more finely layered so there is less code duplication.

To clarify, Ira was correct in stating that generalizing some of these
components was one of the goals of the vbus project: IOW, vbus finely
layers and defines what's below virtio rather than replacing it.

You can think of a virtio-stack like this:

--------------------------
| virtio-net
--------------------------
| virtio-ring
--------------------------
| virtio-bus
--------------------------
| ? undefined ?
--------------------------

IOW: The way I see it, virtio is a device interface model only.  The
rest of it is filled in by the virtio-transport and some kind of back-end.

So today, we can complete the "? undefined ?" block like this for KVM:

--------------------------
| virtio-pci
--------------------------
             |
--------------------------
| kvm.ko
--------------------------
| qemu
--------------------------
| tuntap
--------------------------

In this case, kvm.ko and tuntap are providing plumbing, and qemu is
providing a backend device model (pci-based, etc).

You can, of course, plug a different stack in (such as virtio-lguest,
virtio-ira, etc) but you are more or less on your own to recreate many
of the various facilities contained in that stack (such as things
provided by QEMU, like discovery/hotswap/addressing), as Ira is discovering.

Vbus tries to commoditize more components in the stack (like the bus
model and backend-device model) so they don't need to be redesigned each
time we solve this "virtio-transport" problem.  IOW: stop the
proliferation of the need for pci-bus, lguest-bus, foo-bus underneath
virtio.  Instead, we can then focus on the value add on top, like the
models themselves or the simple glue between them.

So now you might have something like

--------------------------
| virtio-vbus
--------------------------
| vbus-proxy
--------------------------
| kvm-guest-connector
--------------------------
             |
--------------------------
| kvm.ko
--------------------------
| kvm-host-connector.ko
--------------------------
| vbus.ko
--------------------------
| virtio-net-backend.ko
--------------------------

so now we don't need to worry about the bus-model or the device-model
framework.  We only need to implement the connector, etc.  This is handy
when you find yourself in an environment that doesn't support PCI (such
as Ira's rig, or userspace containers), or when you want to add features
that PCI doesn't have (such as fluid event channels for things like IPC
services, or prioritizable interrupts, etc).

> 
> The virtio layering was more or less dictated by Xen which doesn't have
> shared memory (it uses grant references instead).  As a matter of fact
> lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that
> part is duplicated.  It's probably possible to add a virtio-shmem.ko
> library that people who do have shared memory can reuse.

Note that I do not believe the Xen folk use virtio, so while I can
appreciate the foresight that went into that particular aspect of the
design of the virtio model, I am not sure if it's a realistic constraint.

The reason why I decided to not worry about that particular model is
twofold:

1) The cost of supporting non-shared-memory designs is prohibitively high for
my performance goals (for instance, requiring an exit on each
->add_buf() in addition to the ->kick()).

2) The Xen guys are unlikely to diverge from something like
xenbus/xennet anyway, so it would be for naught.

Therefore, I just went with a device model optimized for shared-memory
outright.

That said, I believe we can refactor what is called the
"vbus-proxy-device" into this virtio-shmem interface that you and
Anthony have described.  We could make the feature optional and only
support it on architectures where this makes sense.

<snip>

Kind Regards,
-Greg
Avi Kivity Sept. 23, 2009, 2:37 p.m. UTC | #52
On 09/23/2009 05:26 PM, Gregory Haskins wrote:
>
>    
>>> Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and
>>> virtio-s390. It isn't especially easy. I can steal lots of code from the
>>> lguest bus model, but sometimes it is good to generalize, especially
>>> after the fourth implemention or so. I think this is what GHaskins tried
>>> to do.
>>>
>>>        
>> Yes.  vbus is more finely layered so there is less code duplication.
>>      
> To clarify, Ira was correct in stating this generalizing some of these
> components was one of the goals for the vbus project: IOW vbus finely
> layers and defines what's below virtio, not replaces it.
>
> You can think of a virtio-stack like this:
>
> --------------------------
> | virtio-net
> --------------------------
> | virtio-ring
> --------------------------
> | virtio-bus
> --------------------------
> | ? undefined ?
> --------------------------
>
> IOW: The way I see it, virtio is a device interface model only.  The
> rest of it is filled in by the virtio-transport and some kind of back-end.
>
> So today, we can complete the "? undefined ?" block like this for KVM:
>
> --------------------------
> | virtio-pci
> --------------------------
>               |
> --------------------------
> | kvm.ko
> --------------------------
> | qemu
> --------------------------
> | tuntap
> --------------------------
>
> In this case, kvm.ko and tuntap are providing plumbing, and qemu is
> providing a backend device model (pci-based, etc).
>
> You can, of course, plug a different stack in (such as virtio-lguest,
> virtio-ira, etc) but you are more or less on your own to recreate many
> of the various facilities contained in that stack (such as things
> provided by QEMU, like discovery/hotswap/addressing), as Ira is discovering.
>
> Vbus tries to commoditize more components in the stack (like the bus
> model and backend-device model) so they don't need to be redesigned each
> time we solve this "virtio-transport" problem.  IOW: stop the
> proliferation of the need for pci-bus, lguest-bus, foo-bus underneath
> virtio.  Instead, we can then focus on the value add on top, like the
> models themselves or the simple glue between them.
>
> So now you might have something like
>
> --------------------------
> | virtio-vbus
> --------------------------
> | vbus-proxy
> --------------------------
> | kvm-guest-connector
> --------------------------
>               |
> --------------------------
> | kvm.ko
> --------------------------
> | kvm-host-connector.ko
> --------------------------
> | vbus.ko
> --------------------------
> | virtio-net-backend.ko
> --------------------------
>
> so now we don't need to worry about the bus-model or the device-model
> framework.  We only need to implement the connector, etc.  This is handy
> when you find yourself in an environment that doesn't support PCI (such
> as Ira's rig, or userspace containers), or when you want to add features
> that PCI doesn't have (such as fluid event channels for things like IPC
> services, or priortizable interrupts, etc).
>    

Well, vbus does more, for example it tunnels interrupts instead of 
exposing them 1:1 on the native interface if it exists.  It also pulls 
parts of the device model into the host kernel.

>> The virtio layering was more or less dictated by Xen which doesn't have
>> shared memory (it uses grant references instead).  As a matter of fact
>> lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that
>> part is duplicated.  It's probably possible to add a virtio-shmem.ko
>> library that people who do have shared memory can reuse.
>>      
> Note that I do not believe the Xen folk use virtio, so while I can
> appreciate the foresight that went into that particular aspect of the
> design of the virtio model, I am not sure if its a realistic constraint.
>    

Since a virtio goal was to reduce virtual device driver proliferation, 
it was necessary to accommodate Xen.
Gregory Haskins Sept. 23, 2009, 3:10 p.m. UTC | #53
Avi Kivity wrote:
> On 09/23/2009 05:26 PM, Gregory Haskins wrote:
>>
>>   
>>>> Yes, I'm having to create my own bus model, a-la lguest, virtio-pci,
>>>> and
>>>> virtio-s390. It isn't especially easy. I can steal lots of code from
>>>> the
>>>> lguest bus model, but sometimes it is good to generalize, especially
>>>> after the fourth implemention or so. I think this is what GHaskins
>>>> tried
>>>> to do.
>>>>
>>>>        
>>> Yes.  vbus is more finely layered so there is less code duplication.
>>>      
>> To clarify, Ira was correct in stating this generalizing some of these
>> components was one of the goals for the vbus project: IOW vbus finely
>> layers and defines what's below virtio, not replaces it.
>>
>> You can think of a virtio-stack like this:
>>
>> --------------------------
>> | virtio-net
>> --------------------------
>> | virtio-ring
>> --------------------------
>> | virtio-bus
>> --------------------------
>> | ? undefined ?
>> --------------------------
>>
>> IOW: The way I see it, virtio is a device interface model only.  The
>> rest of it is filled in by the virtio-transport and some kind of
>> back-end.
>>
>> So today, we can complete the "? undefined ?" block like this for KVM:
>>
>> --------------------------
>> | virtio-pci
>> --------------------------
>>               |
>> --------------------------
>> | kvm.ko
>> --------------------------
>> | qemu
>> --------------------------
>> | tuntap
>> --------------------------
>>
>> In this case, kvm.ko and tuntap are providing plumbing, and qemu is
>> providing a backend device model (pci-based, etc).
>>
>> You can, of course, plug a different stack in (such as virtio-lguest,
>> virtio-ira, etc) but you are more or less on your own to recreate many
>> of the various facilities contained in that stack (such as things
>> provided by QEMU, like discovery/hotswap/addressing), as Ira is
>> discovering.
>>
>> Vbus tries to commoditize more components in the stack (like the bus
>> model and backend-device model) so they don't need to be redesigned each
>> time we solve this "virtio-transport" problem.  IOW: stop the
>> proliferation of the need for pci-bus, lguest-bus, foo-bus underneath
>> virtio.  Instead, we can then focus on the value add on top, like the
>> models themselves or the simple glue between them.
>>
>> So now you might have something like
>>
>> --------------------------
>> | virtio-vbus
>> --------------------------
>> | vbus-proxy
>> --------------------------
>> | kvm-guest-connector
>> --------------------------
>>               |
>> --------------------------
>> | kvm.ko
>> --------------------------
>> | kvm-host-connector.ko
>> --------------------------
>> | vbus.ko
>> --------------------------
>> | virtio-net-backend.ko
>> --------------------------
>>
>> so now we don't need to worry about the bus-model or the device-model
>> framework.  We only need to implement the connector, etc.  This is handy
>> when you find yourself in an environment that doesn't support PCI (such
>> as Ira's rig, or userspace containers), or when you want to add features
>> that PCI doesn't have (such as fluid event channels for things like IPC
>> services, or priortizable interrupts, etc).
>>    
> 
> Well, vbus does more, for example it tunnels interrupts instead of
> exposing them 1:1 on the native interface if it exists.

As I've previously explained, that trait is a function of the
kvm-connector I've chosen to implement, not of the overall design of vbus.

The reason my kvm-connector is designed that way is that my early
testing/benchmarking shows one of the issues in KVM performance is that
the ratio of exits per IO operation is fairly high, especially as you
scale io-load.  Therefore, the connector achieves a substantial
reduction in that ratio by giving "interrupts" the same kind of
benefits that NAPI brought to general networking: That is, we enqueue
"interrupt" messages into a lockless ring and only hit the IDT for the
first occurrence.  Subsequent interrupts are injected in a
parallel/lockless manner, without hitting the IDT nor incurring an extra
EOI.  This pays dividends as the IO rate increases, which is when the
guest needs the most help.
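
Conceptually, the producer side of that scheme is something like the
sketch below.  This illustrates the idea only and is not the actual
vbus/shm-signal code; event_ring and inject_guest_irq are hypothetical.

#include <linux/types.h>
#include <asm/barrier.h>        /* smp_wmb(); <asm/system.h> on older kernels */

#define EVRING_SIZE 256                 /* assumed; power of two */

struct event_ring {
        u32 msg[EVRING_SIZE];
        u32 head;                       /* advanced by the host side */
        u32 tail;                       /* advanced by the guest as it drains */
};

/* hypothetical: however the connector hits the guest IDT once */
extern void inject_guest_irq(struct event_ring *r);

static void queue_event(struct event_ring *r, u32 msg)
{
        bool was_empty = (r->head == r->tail);

        r->msg[r->head % EVRING_SIZE] = msg;
        smp_wmb();                      /* publish the payload before the index */
        r->head++;

        /* only the empty->non-empty transition costs an injection (and
         * an EOI in the guest); later events are consumed while the
         * guest drains the ring */
        if (was_empty)
                inject_guest_irq(r);
}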

OTOH, it is entirely possible to design the connector such that we
maintain a 1:1 ratio of signals to traditional IDT interrupts.  It is
also possible to design a connector which surfaces as something else,
such as PCI devices (by terminating the connector in QEMU and utilizing
its PCI emulation facilities), which would naturally employ 1:1 mapping.

So if 1:1 mapping is a critical feature (I would argue to the contrary),
vbus can support it.

> It also pulls parts of the device model into the host kernel.

That is the point.  Most of it needs to be there for performance.  And
what doesn't need to be there for performance can either be:

a) skipped at the discretion of the connector/device-model designer

OR

b) included because it's a trivially small subset of the model (e.g. a
mac-addr attribute) and it's nice to have a cohesive solution instead of
requiring a separate binary blob that can get out of sync, etc.

The example I've provided to date (venet on kvm) utilizes (b), but it
certainly doesn't have to.  Therefore, I don't think vbus as a whole can
be judged on this one point.

> 
>>> The virtio layering was more or less dictated by Xen which doesn't have
>>> shared memory (it uses grant references instead).  As a matter of fact
>>> lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that
>>> part is duplicated.  It's probably possible to add a virtio-shmem.ko
>>> library that people who do have shared memory can reuse.
>>>      
>> Note that I do not believe the Xen folk use virtio, so while I can
>> appreciate the foresight that went into that particular aspect of the
>> design of the virtio model, I am not sure if its a realistic constraint.
>>    
> 
> Since a virtio goal was to reduce virtual device driver proliferation,
> it was necessary to accommodate Xen.

Fair enough, but I don't think the Xen community will ever use it.

To your point, a vbus goal was to reduce the bus-model and
backend-device-model proliferation for environments served by Linux as
the host.  This naturally complements virtio's driver non-proliferation
goal, but probably excludes Xen for reasons beyond the lack of shmem
(since it has its own non-linux hypervisor kernel).

In any case, I've already stated that we simply make the virtio-shmem
(vbus-proxy-device) facility optionally defined, and unavailable on
non-shmem based architectures to work around that issue.

The alternative is that we abstract the shmem concept further (a la
->add_buf() from the virtqueue world), but it is probably pointless to
try to accommodate shared memory if you don't really have it, and no one
will likely use it.

Kind Regards,
-Greg
Gregory Haskins Sept. 23, 2009, 5:58 p.m. UTC | #54
Gregory Haskins wrote:
> Avi Kivity wrote:
>> On 09/23/2009 05:26 PM, Gregory Haskins wrote:
>>>   
>>>>> Yes, I'm having to create my own bus model, a-la lguest, virtio-pci,
>>>>> and
>>>>> virtio-s390. It isn't especially easy. I can steal lots of code from
>>>>> the
>>>>> lguest bus model, but sometimes it is good to generalize, especially
>>>>> after the fourth implemention or so. I think this is what GHaskins
>>>>> tried
>>>>> to do.
>>>>>
>>>>>        
>>>> Yes.  vbus is more finely layered so there is less code duplication.
>>>>      
>>> To clarify, Ira was correct in stating this generalizing some of these
>>> components was one of the goals for the vbus project: IOW vbus finely
>>> layers and defines what's below virtio, not replaces it.
>>>
>>> You can think of a virtio-stack like this:
>>>
>>> --------------------------
>>> | virtio-net
>>> --------------------------
>>> | virtio-ring
>>> --------------------------
>>> | virtio-bus
>>> --------------------------
>>> | ? undefined ?
>>> --------------------------
>>>
>>> IOW: The way I see it, virtio is a device interface model only.  The
>>> rest of it is filled in by the virtio-transport and some kind of
>>> back-end.
>>>
>>> So today, we can complete the "? undefined ?" block like this for KVM:
>>>
>>> --------------------------
>>> | virtio-pci
>>> --------------------------
>>>               |
>>> --------------------------
>>> | kvm.ko
>>> --------------------------
>>> | qemu
>>> --------------------------
>>> | tuntap
>>> --------------------------
>>>
>>> In this case, kvm.ko and tuntap are providing plumbing, and qemu is
>>> providing a backend device model (pci-based, etc).
>>>
>>> You can, of course, plug a different stack in (such as virtio-lguest,
>>> virtio-ira, etc) but you are more or less on your own to recreate many
>>> of the various facilities contained in that stack (such as things
>>> provided by QEMU, like discovery/hotswap/addressing), as Ira is
>>> discovering.
>>>
>>> Vbus tries to commoditize more components in the stack (like the bus
>>> model and backend-device model) so they don't need to be redesigned each
>>> time we solve this "virtio-transport" problem.  IOW: stop the
>>> proliferation of the need for pci-bus, lguest-bus, foo-bus underneath
>>> virtio.  Instead, we can then focus on the value add on top, like the
>>> models themselves or the simple glue between them.
>>>
>>> So now you might have something like
>>>
>>> --------------------------
>>> | virtio-vbus
>>> --------------------------
>>> | vbus-proxy
>>> --------------------------
>>> | kvm-guest-connector
>>> --------------------------
>>>               |
>>> --------------------------
>>> | kvm.ko
>>> --------------------------
>>> | kvm-host-connector.ko
>>> --------------------------
>>> | vbus.ko
>>> --------------------------
>>> | virtio-net-backend.ko
>>> --------------------------
>>>
>>> so now we don't need to worry about the bus-model or the device-model
>>> framework.  We only need to implement the connector, etc.  This is handy
>>> when you find yourself in an environment that doesn't support PCI (such
>>> as Ira's rig, or userspace containers), or when you want to add features
>>> that PCI doesn't have (such as fluid event channels for things like IPC
>>> services, or priortizable interrupts, etc).
>>>    
>> Well, vbus does more, for example it tunnels interrupts instead of
>> exposing them 1:1 on the native interface if it exists.
> 
> As I've previously explained, that trait is a function of the
> kvm-connector I've chosen to implement, not of the overall design of vbus.
> 
> The reason why my kvm-connector is designed that way is because my early
> testing/benchmarking shows one of the issues in KVM performance is the
> ratio of exits per IO operation are fairly high, especially as your
> scale io-load.  Therefore, the connector achieves a substantial
> reduction in that ratio by treating "interrupts" to the same kind of
> benefits that NAPI brought to general networking: That is, we enqueue
> "interrupt" messages into a lockless ring and only hit the IDT for the
> first occurrence.  Subsequent interrupts are injected in a
> parallel/lockless manner, without hitting the IDT nor incurring an extra
> EOI.  This pays dividends as the IO rate increases, which is when the
> guest needs the most help.
> 
> OTOH, it is entirely possible to design the connector such that we
> maintain a 1:1 ratio of signals to traditional IDT interrupts.  It is
> also possible to design a connector which surfaces as something else,
> such as PCI devices (by terminating the connector in QEMU and utilizing
> its PCI emulation facilities), which would naturally employ 1:1 mapping.
> 
> So if 1:1 mapping is a critical feature (I would argue to the contrary),
> vbus can support it.
> 
>> It also pulls parts of the device model into the host kernel.
> 
> That is the point.  Most of it needs to be there for performance.

To clarify this point:

There are various aspects about designing high-performance virtual
devices such as providing the shortest paths possible between the
physical resources and the consumers.  Conversely, we also need to
ensure that we meet proper isolation/protection guarantees at the same
time.  What this means is there are various aspects to any
high-performance PV design that must be placed in-kernel to
maximize performance yet properly isolate the guest.

For instance, you are required to have your signal-path (interrupts and
hypercalls), your memory-path (gpa translation), and
addressing/isolation model in-kernel to maximize performance.

Vbus accomplishes its in-kernel isolation model by providing a
"container" concept, where objects are placed into this container by
userspace.  The host kernel enforces isolation/protection by using a
namespace to identify objects that is only relevant within a specific
container's context (namely, a "u32 dev-id").  The guest addresses the
objects by its dev-id, and the kernel ensures that the guest can't
access objects outside of its dev-id namespace.

All that is required is a way to transport a message with a "devid"
attribute as an address (such as DEVCALL(devid)) and the framework
provides the rest of the decode+execute function.
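
To make the decode+execute part concrete, here is a rough sketch (the
names are illustrative, not the actual vbus code, though as noted later
in this thread the real implementation does use an RCU-protected radix
tree keyed by devid):

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
#include <linux/errno.h>
#include <linux/types.h>

struct vbus_device;

struct vbus_device_ops {
        int (*call)(struct vbus_device *dev, u32 func,
                    void *data, size_t len);
};

struct vbus_device {
        u32 devid;
        const struct vbus_device_ops *ops;
};

struct vbus_container {
        /* devid -> device; populated by userspace management code */
        struct radix_tree_root devices;
};

/* Decode a DEVCALL(devid) message and dispatch it, but only if that
 * devid exists in *this* guest's container. */
static int vbus_devcall(struct vbus_container *c, u32 devid,
                        u32 func, void *data, size_t len)
{
        struct vbus_device *dev;
        int ret = -ENOENT;

        rcu_read_lock();
        dev = radix_tree_lookup(&c->devices, devid);
        if (dev)
                ret = dev->ops->call(dev, func, data, len);
        rcu_read_unlock();

        return ret;
}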

Contrast this to vhost+virtio-pci (called simply "vhost" from here).
It is not immune to requiring in-kernel addressing support either, but
rather it just does it differently (and it's not, as you might expect, via
qemu).

Vhost relies on QEMU to render PCI objects to the guest, to which the guest
assigns resources (such as BARs, interrupts, etc).  A PCI-BAR in this
example may represent a PIO address for triggering some operation in the
device-model's fast-path.  For it to have meaning in the fast-path, KVM
has to have in-kernel knowledge of what a PIO-exit is, and what to do
with it (this is where pio-bus and ioeventfd come in).  The programming
of the PIO-exit and the ioeventfd are likewise controlled by some
userspace management entity (i.e. qemu).   The PIO address and value
tuple form the address, and the ioeventfd framework within KVM provide
the decode+execute function.
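
Concretely, the userspace programming of that PIO+ioeventfd tuple amounts
to something like the following sketch (error handling omitted; vmfd is
the kvm VM fd, and efd would typically come from eventfd(0, 0) and then
be handed to the vhost instance as its kick):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Program KVM so that a guest PIO write of 'val' to 'addr' signals
 * 'efd' entirely in-kernel, with no exit to userspace. */
static int add_pio_doorbell(int vmfd, int efd, uint16_t addr, uint16_t val)
{
        struct kvm_ioeventfd io = {
                .addr      = addr,
                .len       = sizeof(val),       /* 16-bit PIO write */
                .datamatch = val,
                .fd        = efd,
                .flags     = KVM_IOEVENTFD_FLAG_PIO |
                             KVM_IOEVENTFD_FLAG_DATAMATCH,
        };

        return ioctl(vmfd, KVM_IOEVENTFD, &io);
}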

This idea seemingly works fine, mind you, but it rides on top of a *lot*
of stuff, including but not limited to: the guest's pci stack, the qemu
pci emulation, kvm pio support, and ioeventfd.  When you get into
situations where you don't have PCI or even KVM underneath you (e.g. a
userspace container, Ira's rig, etc) trying to recreate all of that PCI
infrastructure for the sake of using PCI is, IMO, a lot of overhead for
little gain.

All you really need is a simple decode+execute mechanism, and a way to
program it from userspace control.  vbus tries to do just that:
commoditize it so all you need is the transport of the control messages
(like DEVCALL()), but the decode+execute itself is reusable, even
across various environments (like KVM or Ira's rig).

And we face similar situations with the signal-path and memory-path
components... but let's take a look at the slow-path side.


>  And what doesn't need to be there for performance can either be:
> 
> a) skipped at the discretion of the connector/device-model designer
> 
> OR
> 
> b) included because its trivially small subset of the model (e.g. a
> mac-addr attribute) and its nice to have a cohesive solution instead of
> requiring a separate binary blob that can get out of sync, etc.
> 
> The example Ive provided to date (venet on kvm) utilizes (b), but it
> certainly doesn't have to.  Therefore, I don't think vbus as a whole can
> be judged on this one point.


For a given model, we have a grouping of operations for fast path and
slow path.  Fast path would be things like we just talked about
(signal-path, memory-path, addressing model).  Slow path would be things
like device discovery (and hotswap), config-space, etc.

And your argument, I believe, is that vbus allows both to be implemented
in the kernel (though to reiterate, it's optional) and is therefore a bad
design, so let's discuss that.

I believe the assertion is that things like config-space are best left
to userspace, and we should only relegate fast-path duties to the
kernel.  The problem is that, in my experience, a good deal of
config-space actually influences the fast-path and thus needs to
interact with the fast-path mechanism eventually anyway.  What's left
over that doesn't fall into this category may cheaply ride on existing
plumbing, so it's not like we created something new or unnatural just to
support this subclass of config-space.

For example: take an attribute like the mac-address assigned to a NIC.
This clearly doesn't need to be in-kernel and could go either way (such
as a PCI config-space register).

As another example: consider an option bit that enables a new feature
that affects the fast-path, like RXBUF merging.  If we use the split
model where config space is handled by userspace and fast-path is
in-kernel, the userspace component is only going to act as a proxy.
I.e. it will pass the option down to the kernel eventually.  Therefore,
there is little gain in trying to split this type of slow-path out to
userspace.  In fact, it's more work.

vbus addresses this observation by providing a very simple (yet
hopefully powerful) model of providing two basic verbs to a device:

dev->call()
dev->shm()

It makes no distinction between slow-path and fast-path operations, per se;
it is just a mechanism for synchronous or asynchronous communication.  It is
expected that a given component will build "config-space" primarily from
the synchronous ->call() interface if it requires one.  However, it gets
this for free since we need ->call() for fast-path too (like the
rt-scheduler device, etc).

So I can then use ->call to perform a fast-path scheduler update (has to
go in-kernel for performance), an "enable rxbuf-merge" function (has to
end-up in-kernel eventually), or a "macquery" (doesn't need to be
in-kernel).

My choice was to support that third operation in-kernel as well, because
it's far more complicated to do it another way than it is to simply
export a sysfs attribute to set it.  Userspace is still completely in
control: it sets the value.  It just doesn't have to write plumbing to
make it accessible.  The basic vbus model inherently provides this.
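
To illustrate just how little plumbing that is, such an attribute could
look something like the sketch below (the "client_mac" name and the
venet_priv structure are made up for the example, not the actual venet
code).  Registering it is a single device_create_file() call, and
userspace then reads/writes the sysfs file directly:

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/stat.h>

/* Hypothetical per-device private data, for illustration only. */
struct venet_priv {
        unsigned char mac[6];
};

static ssize_t mac_show(struct device *dev, struct device_attribute *attr,
                        char *buf)
{
        struct venet_priv *priv = dev_get_drvdata(dev);

        return sprintf(buf, "%pM\n", priv->mac);
}

static ssize_t mac_store(struct device *dev, struct device_attribute *attr,
                         const char *buf, size_t count)
{
        struct venet_priv *priv = dev_get_drvdata(dev);
        unsigned int m[6];
        int i;

        if (sscanf(buf, "%x:%x:%x:%x:%x:%x",
                   &m[0], &m[1], &m[2], &m[3], &m[4], &m[5]) != 6)
                return -EINVAL;

        for (i = 0; i < 6; i++)
                priv->mac[i] = m[i];

        return count;
}

/* Userspace sets the value with a plain write; no extra plumbing. */
static DEVICE_ATTR(client_mac, S_IRUGO | S_IWUSR, mac_show, mac_store);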

That's enough for now.  We can talk about discovery/hotswap at a later time.

Kind Regards,
-Greg
Avi Kivity Sept. 23, 2009, 7:37 p.m. UTC | #55
On 09/23/2009 08:58 PM, Gregory Haskins wrote:
>>
>>> It also pulls parts of the device model into the host kernel.
>>>        
>> That is the point.  Most of it needs to be there for performance.
>>      
> To clarify this point:
>
> There are various aspects about designing high-performance virtual
> devices such as providing the shortest paths possible between the
> physical resources and the consumers.  Conversely, we also need to
> ensure that we meet proper isolation/protection guarantees at the same
> time.  What this means is there are various aspects to any
> high-performance PV design that require to be placed in-kernel to
> maximize the performance yet properly isolate the guest.
>
> For instance, you are required to have your signal-path (interrupts and
> hypercalls), your memory-path (gpa translation), and
> addressing/isolation model in-kernel to maximize performance.
>    

Exactly.  That's what vhost puts into the kernel and nothing more.

> Vbus accomplishes its in-kernel isolation model by providing a
> "container" concept, where objects are placed into this container by
> userspace.  The host kernel enforces isolation/protection by using a
> namespace to identify objects that is only relevant within a specific
> container's context (namely, a "u32 dev-id").  The guest addresses the
> objects by its dev-id, and the kernel ensures that the guest can't
> access objects outside of its dev-id namespace.
>    

vhost manages to accomplish this without any kernel support.  The guest 
simply has no access to any vhost resources other than the guest->host 
doorbell, which is handed to the guest outside vhost (so it's somebody 
else's problem, in userspace).

> All that is required is a way to transport a message with a "devid"
> attribute as an address (such as DEVCALL(devid)) and the framework
> provides the rest of the decode+execute function.
>    

vhost avoids that.

> Contrast this to vhost+virtio-pci (called simply "vhost" from here).
>    

It's the wrong name.  vhost implements only the data path.

> It is not immune to requiring in-kernel addressing support either, but
> rather it just does it differently (and its not as you might expect via
> qemu).
>
> Vhost relies on QEMU to render PCI objects to the guest, which the guest
> assigns resources (such as BARs, interrupts, etc).

vhost does not rely on qemu.  It relies on its user to handle 
configuration.  In one important case it's qemu+pci.  It could just as 
well be the lguest launcher.

>    A PCI-BAR in this
> example may represent a PIO address for triggering some operation in the
> device-model's fast-path.  For it to have meaning in the fast-path, KVM
> has to have in-kernel knowledge of what a PIO-exit is, and what to do
> with it (this is where pio-bus and ioeventfd come in).  The programming
> of the PIO-exit and the ioeventfd are likewise controlled by some
> userspace management entity (i.e. qemu).   The PIO address and value
> tuple form the address, and the ioeventfd framework within KVM provide
> the decode+execute function.
>    

Right.

> This idea seemingly works fine, mind you, but it rides on top of a *lot*
> of stuff including but not limited to: the guests pci stack, the qemu
> pci emulation, kvm pio support, and ioeventfd.  When you get into
> situations where you don't have PCI or even KVM underneath you (e.g. a
> userspace container, Ira's rig, etc) trying to recreate all of that PCI
> infrastructure for the sake of using PCI is, IMO, a lot of overhead for
> little gain.
>    

For the N+1th time, no.  vhost is perfectly usable without pci.  Can we 
stop raising and debunking this point?

> All you really need is a simple decode+execute mechanism, and a way to
> program it from userspace control.  vbus tries to do just that:
> commoditize it so all you need is the transport of the control messages
> (like DEVCALL()), but the decode+execute itself is reuseable, even
> across various environments (like KVM or Iras rig).
>    

If you think it should be "commoditized", write libvhostconfig.so.

> And your argument, I believe, is that vbus allows both to be implemented
> in the kernel (though to reiterate, its optional) and is therefore a bad
> design, so lets discuss that.
>
> I believe the assertion is that things like config-space are best left
> to userspace, and we should only relegate fast-path duties to the
> kernel.  The problem is that, in my experience, a good deal of
> config-space actually influences the fast-path and thus needs to
> interact with the fast-path mechanism eventually anyway.
> Whats left
> over that doesn't fall into this category may cheaply ride on existing
> plumbing, so its not like we created something new or unnatural just to
> support this subclass of config-space.
>    

Flexibility is reduced, because changing code in the kernel is more 
expensive than in userspace, and kernel/user interfaces aren't typically 
as wide as pure userspace interfaces.  Security is reduced, since a bug 
in the kernel affects the host, while a bug in userspace affects just one 
guest.

Example: feature negotiation.  If it happens in userspace, it's easy to 
limit what features we expose to the guest.  If it happens in the 
kernel, we need to add an interface to let the kernel know which 
features it should expose to the guest.  We also need to add an 
interface to let userspace know which features were negotiated, if we 
want to implement live migration.  Something fairly trivial bloats rapidly.
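
For instance, the userspace policy can be as simple as the following
sketch (not actual qemu code; the backend reports what it can do, and
userspace decides what the guest gets to see):

#include <stdint.h>

/* Bit number from the virtio-net spec. */
#define VIRTIO_NET_F_MRG_RXBUF  15

/* backend_features: everything the in-kernel backend can do (reported
 * once, e.g. via an ioctl).  policy_mask: what this particular guest is
 * allowed to see.  The kernel never needs to know why a bit was withheld,
 * and whatever the guest finally acks stays visible to userspace for
 * things like live migration. */
static uint32_t features_to_offer(uint32_t backend_features,
                                  uint32_t policy_mask)
{
        return backend_features & policy_mask;
}

/* Example: hide mergeable receive buffers from this one guest:
 *   offered = features_to_offer(feats, ~(1u << VIRTIO_NET_F_MRG_RXBUF)); */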

> For example: take an attribute like the mac-address assigned to a NIC.
> This clearly doesn't need to be in-kernel and could go either way (such
> as a PCI config-space register).
>
> As another example: consider an option bit that enables a new feature
> that affects the fast-path, like RXBUF merging.  If we use the split
> model where config space is handled by userspace and fast-path is
> in-kernel, the userspace component is only going to act as a proxy.
> I.e. it will pass the option down to the kernel eventually.  Therefore,
> there is little gain in trying to split this type of slow-path out to
> userspace.  In fact, its more work.
>    

As you can see above, userspace needs to be involved in this, and the 
number of interfaces required is smaller if it's in userspace: you only 
need to know which features the kernel supports (they can be enabled 
unconditionally, just not exposed).

Further, some devices are perfectly happy to be implemented in 
userspace, so we need userspace configuration support anyway.  Why 
reimplement it in the kernel?
Gregory Haskins Sept. 23, 2009, 9:15 p.m. UTC | #56
Avi Kivity wrote:
> On 09/23/2009 08:58 PM, Gregory Haskins wrote:
>>>
>>>> It also pulls parts of the device model into the host kernel.
>>>>        
>>> That is the point.  Most of it needs to be there for performance.
>>>      
>> To clarify this point:
>>
>> There are various aspects about designing high-performance virtual
>> devices such as providing the shortest paths possible between the
>> physical resources and the consumers.  Conversely, we also need to
>> ensure that we meet proper isolation/protection guarantees at the same
>> time.  What this means is there are various aspects to any
>> high-performance PV design that require to be placed in-kernel to
>> maximize the performance yet properly isolate the guest.
>>
>> For instance, you are required to have your signal-path (interrupts and
>> hypercalls), your memory-path (gpa translation), and
>> addressing/isolation model in-kernel to maximize performance.
>>    
> 
> Exactly.  That's what vhost puts into the kernel and nothing more.

Actually, no.  Generally, _KVM_ puts those things into the kernel, and
vhost consumes them.  Without KVM (or something equivalent), vhost is
incomplete.  One of my goals with vbus is to generalize the "something
equivalent" part here.

I know you may not care about non-kvm use cases, and that's fine.  No one
says you have to.  However, note that some of us do care about these
non-kvm cases, and thus it's a distinction I am making here as a benefit
of the vbus framework.

> 
>> Vbus accomplishes its in-kernel isolation model by providing a
>> "container" concept, where objects are placed into this container by
>> userspace.  The host kernel enforces isolation/protection by using a
>> namespace to identify objects that is only relevant within a specific
>> container's context (namely, a "u32 dev-id").  The guest addresses the
>> objects by its dev-id, and the kernel ensures that the guest can't
>> access objects outside of its dev-id namespace.
>>    
> 
> vhost manages to accomplish this without any kernel support.

No, vhost manages to accomplish this because of KVM's kernel support
(ioeventfd, etc).  Without KVM-like in-kernel support, vhost is
merely a kind of "tuntap"-like clone signalled by eventfds.

vbus on the other hand, generalizes one more piece of the puzzle
(namely, the function of pio+ioeventfd and userspace's programming of
it) by presenting the devid namespace and container concept.

This goes directly to my rebuttal of your claim that vbus places too
much in the kernel.  I state that, one way or the other, address decode
and isolation _must_ be in the kernel for performance.  Vbus does this
with a devid/container scheme.  vhost+virtio-pci+kvm does it with
pci+pio+ioeventfd.


>  The guest
> simply has not access to any vhost resources other than the guest->host
> doorbell, which is handed to the guest outside vhost (so it's somebody
> else's problem, in userspace).

You mean _controlled_ by userspace, right?  Obviously, the other side of
the kernel still needs to be programmed (ioeventfd, etc).  Otherwise,
vhost would be pointless: e.g. just use vanilla tuntap if you don't need
fast in-kernel decoding.

> 
>> All that is required is a way to transport a message with a "devid"
>> attribute as an address (such as DEVCALL(devid)) and the framework
>> provides the rest of the decode+execute function.
>>    
> 
> vhost avoids that.

No, it doesn't avoid it.  It just doesn't specify how it's done, and
relies on something else to do it on its behalf.

Conversely, vbus specifies how it's done, but not how to transport the
verb "across the wire".  That is the role of the vbus-connector abstraction.

> 
>> Contrast this to vhost+virtio-pci (called simply "vhost" from here).
>>    
> 
> It's the wrong name.  vhost implements only the data path.

Understood, but vhost+virtio-pci is what I am contrasting, and I use
"vhost" for short from that point on because I am too lazy to type the
whole name over and over ;)

> 
>> It is not immune to requiring in-kernel addressing support either, but
>> rather it just does it differently (and its not as you might expect via
>> qemu).
>>
>> Vhost relies on QEMU to render PCI objects to the guest, which the guest
>> assigns resources (such as BARs, interrupts, etc).
> 
> vhost does not rely on qemu.  It relies on its user to handle
> configuration.  In one important case it's qemu+pci.  It could just as
> well be the lguest launcher.

I meant vhost=vhost+virtio-pci here.  Sorry for the confusion.

The point I am making specifically is that vhost in general relies on
other in-kernel components to function.  I.e. It cannot function without
having something like the PCI model to build an IO namespace.  That
namespace (in this case, pio addresses+data tuples) are used for the
in-kernel addressing function under KVM + virtio-pci.

The case of the lguest launcher is a good one to highlight.  Yes, you
can presumably also use lguest with vhost, if the requisite facilities
are exposed to lguest-bus, and some eventfd-based thing like ioeventfd
is written for the host (if it doesn't exist already).

And when the next virt design "foo" comes out, it can make a "foo-bus"
model, and implement foo-eventfd on the backend, etc, etc.

Ira can make ira-bus, and ira-eventfd, etc, etc.

Each iteration will invariably introduce duplicated parts of the stack.
 Vbus tries to generalize some of those pieces so we can reuse them.

I chose the very non-specific name "virtual-bus" for the design
intentionally to decouple it from any one particular "hypervisor" (e.g.
xenbus, lguest-bus, etc) and promote it as a general purpose bus for
hopefully any hypervisor (or physical systems too, e.g. Ira's).  I assume
"virtio" was chosen to reflect a similar positioning at the device-model
layer.

Had vbus come out before lguest, I would have proposed that lguest
should use it natively instead of creating lguest-bus.  While it's
probably too late in that specific case, perhaps going forward this is
the direction we can take, just like perhaps virtio is the device model
direction we can take.

Likewise, the backend is generalized so one model can be written that
works in all environments that support vbus.  The connector takes care
of the "wire" details, and the other stuff functions to serve the bus
portion of the stack (signal-routing, memory-routing,
isolation/addressing, etc).

> 
>>    A PCI-BAR in this
>> example may represent a PIO address for triggering some operation in the
>> device-model's fast-path.  For it to have meaning in the fast-path, KVM
>> has to have in-kernel knowledge of what a PIO-exit is, and what to do
>> with it (this is where pio-bus and ioeventfd come in).  The programming
>> of the PIO-exit and the ioeventfd are likewise controlled by some
>> userspace management entity (i.e. qemu).   The PIO address and value
>> tuple form the address, and the ioeventfd framework within KVM provide
>> the decode+execute function.
>>    
> 
> Right.
> 
>> This idea seemingly works fine, mind you, but it rides on top of a *lot*
>> of stuff including but not limited to: the guests pci stack, the qemu
>> pci emulation, kvm pio support, and ioeventfd.  When you get into
>> situations where you don't have PCI or even KVM underneath you (e.g. a
>> userspace container, Ira's rig, etc) trying to recreate all of that PCI
>> infrastructure for the sake of using PCI is, IMO, a lot of overhead for
>> little gain.
>>    
> 
> For the N+1th time, no.  vhost is perfectly usable without pci.  Can we
> stop raising and debunking this point?

Again, I understand vhost is decoupled from PCI, and I don't mean to
imply anything different.  I use PCI as an example here because a) its
the only working example of vhost today (to my knowledge), and b) you
have stated in the past that PCI is the only "right" way here, to
paraphrase.  Perhaps you no longer feel that way, so I apologize if you
feel you already recanted your position on PCI and I missed it.

I digress.  My point here isn't PCI.  The point here is the missing
component for when PCI is not present.  The component that is partially
satisfied by vbus's devid addressing scheme.  If you are going to use
vhost, and you don't have PCI, you've gotta build something to replace it.


> 
>> All you really need is a simple decode+execute mechanism, and a way to
>> program it from userspace control.  vbus tries to do just that:
>> commoditize it so all you need is the transport of the control messages
>> (like DEVCALL()), but the decode+execute itself is reuseable, even
>> across various environments (like KVM or Iras rig).
>>    
> 
> If you think it should be "commodotized", write libvhostconfig.so.

I know you are probably being facetious here, but what do you propose
for the parts that must be in-kernel?

> 
>> And your argument, I believe, is that vbus allows both to be implemented
>> in the kernel (though to reiterate, its optional) and is therefore a bad
>> design, so lets discuss that.
>>
>> I believe the assertion is that things like config-space are best left
>> to userspace, and we should only relegate fast-path duties to the
>> kernel.  The problem is that, in my experience, a good deal of
>> config-space actually influences the fast-path and thus needs to
>> interact with the fast-path mechanism eventually anyway.
>> Whats left
>> over that doesn't fall into this category may cheaply ride on existing
>> plumbing, so its not like we created something new or unnatural just to
>> support this subclass of config-space.
>>    
> 
> Flexibility is reduced, because changing code in the kernel is more
> expensive than in userspace, and kernel/user interfaces aren't typically
> as wide as pure userspace interfaces.  Security is reduced, since a bug
> in the kernel affects the host, while a bug in userspace affects just on
> guest.

For a mac-address attribute?  That's all we are really talking about
here.  These points you raise, while true of any kernel code I suppose,
are a bit of a stretch in this context.

> 
> Example: feature negotiation.  If it happens in userspace, it's easy to
> limit what features we expose to the guest.

It's not any harder in the kernel.  I do this today.

And when you are done negotiating said features, you will generally have
to turn around and program the feature into the backend anyway (e.g.
ioctl() to vhost module).  Now you have to maintain some knowledge of
that particular feature and how to program it in two places.

Conversely, I am eliminating the (unnecessary) middleman by letting the
feature negotiating take place directly between the two entities that
will consume it.


>  If it happens in the
> kernel, we need to add an interface to let the kernel know which
> features it should expose to the guest.

You need this already either way for both models anyway.  As an added
bonus, vbus has generalized that interface using sysfs attributes, so
all models are handled in a similar and community accepted way.

>  We also need to add an
> interface to let userspace know which features were negotiated, if we
> want to implement live migration.  Something fairly trivial bloats rapidly.

Can you elaborate on the requirements for live-migration?  Wouldn't an
opaque save/restore model work here? (e.g. why does userspace need to be
able to interpret the in-kernel state, just pass it along as a blob to
the new instance).

> 
>> For example: take an attribute like the mac-address assigned to a NIC.
>> This clearly doesn't need to be in-kernel and could go either way (such
>> as a PCI config-space register).
>>
>> As another example: consider an option bit that enables a new feature
>> that affects the fast-path, like RXBUF merging.  If we use the split
>> model where config space is handled by userspace and fast-path is
>> in-kernel, the userspace component is only going to act as a proxy.
>> I.e. it will pass the option down to the kernel eventually.  Therefore,
>> there is little gain in trying to split this type of slow-path out to
>> userspace.  In fact, its more work.
>>    
> 
> As you can see above, userspace needs to be involved in this, and the
> number of interfaces required is smaller if it's in userspace:

Actually, no.  My experience has been the opposite.  Anytime I sat down
and tried to satisfy your request to move things to userspace,
things got ugly and duplicative really quickly.  I suspect part of the
reason you think it's easier is that you already have part of
virtio-net in userspace along with its surrounding support, but that is
not the case moving forward for new device types.

> you only
> need to know which features the kernel supports (they can be enabled
> unconditionally, just not exposed).
> 
> Further, some devices are perfectly happy to be implemented in
> userspace, so we need userspace configuration support anyway.  Why
> reimplement it in the kernel?

That's fine.  vbus is targeted at high-performance IO.  So if you have
a robust userspace (like KVM+QEMU) and low-performance constraints (say,
for a console or something), put it in userspace and vbus is not
involved.  I don't care.

However, if you are coming from somewhere else (like Ira's rig) where
you don't necessarily have a robust userspace module, vbus provides a
model that allows you to chose whether you want to do a vhost like
model, or a full resource container with the isolation guarantees, etc,
built in.

Kind Regards,
-Greg
Avi Kivity Sept. 24, 2009, 7:18 a.m. UTC | #57
On 09/24/2009 12:15 AM, Gregory Haskins wrote:
>
>>> There are various aspects about designing high-performance virtual
>>> devices such as providing the shortest paths possible between the
>>> physical resources and the consumers.  Conversely, we also need to
>>> ensure that we meet proper isolation/protection guarantees at the same
>>> time.  What this means is there are various aspects to any
>>> high-performance PV design that require to be placed in-kernel to
>>> maximize the performance yet properly isolate the guest.
>>>
>>> For instance, you are required to have your signal-path (interrupts and
>>> hypercalls), your memory-path (gpa translation), and
>>> addressing/isolation model in-kernel to maximize performance.
>>>
>>>        
>> Exactly.  That's what vhost puts into the kernel and nothing more.
>>      
> Actually, no.  Generally, _KVM_ puts those things into the kernel, and
> vhost consumes them.  Without KVM (or something equivalent), vhost is
> incomplete.  One of my goals with vbus is to generalize the "something
> equivalent" part here.
>    

I don't really see how vhost and vbus are different here.  vhost expects 
signalling to happen through a couple of eventfds and requires someone 
to supply them and implement kernel support (if needed).  vbus requires 
someone to write a connector to provide the signalling implementation.  
Neither will work out-of-the-box when implementing virtio-net over 
falling dominos, for example.

>>> Vbus accomplishes its in-kernel isolation model by providing a
>>> "container" concept, where objects are placed into this container by
>>> userspace.  The host kernel enforces isolation/protection by using a
>>> namespace to identify objects that is only relevant within a specific
>>> container's context (namely, a "u32 dev-id").  The guest addresses the
>>> objects by its dev-id, and the kernel ensures that the guest can't
>>> access objects outside of its dev-id namespace.
>>>
>>>        
>> vhost manages to accomplish this without any kernel support.
>>      
> No, vhost manages to accomplish this because of KVMs kernel support
> (ioeventfd, etc).   Without a KVM-like in-kernel support, vhost is a
> merely a kind of "tuntap"-like clone signalled by eventfds.
>    

Without a vbus-connector-falling-dominos, vbus-venet can't do anything 
either.  Both vhost and vbus need an interface; vhost's is just narrower 
since it doesn't do configuration or enumeration.

> This goes directly to my rebuttal of your claim that vbus places too
> much in the kernel.  I state that, one way or the other, address decode
> and isolation _must_ be in the kernel for performance.  Vbus does this
> with a devid/container scheme.  vhost+virtio-pci+kvm does it with
> pci+pio+ioeventfd.
>    

vbus doesn't do kvm guest address decoding for the fast path.  It's 
still done by ioeventfd.

>>   The guest
>> simply has not access to any vhost resources other than the guest->host
>> doorbell, which is handed to the guest outside vhost (so it's somebody
>> else's problem, in userspace).
>>      
> You mean _controlled_ by userspace, right?  Obviously, the other side of
> the kernel still needs to be programmed (ioeventfd, etc).  Otherwise,
> vhost would be pointless: e.g. just use vanilla tuntap if you don't need
> fast in-kernel decoding.
>    

Yes (though for something like level-triggered interrupts we're probably 
keeping it in userspace, enjoying the benefits of vhost data path while 
paying more for signalling).

>>> All that is required is a way to transport a message with a "devid"
>>> attribute as an address (such as DEVCALL(devid)) and the framework
>>> provides the rest of the decode+execute function.
>>>
>>>        
>> vhost avoids that.
>>      
> No, it doesn't avoid it.  It just doesn't specify how its done, and
> relies on something else to do it on its behalf.
>    

That someone else can be in userspace, apart from the actual fast path.

> Conversely, vbus specifies how its done, but not how to transport the
> verb "across the wire".  That is the role of the vbus-connector abstraction.
>    

So again, vbus does everything in the kernel (since it's so easy and 
cheap) but expects a vbus-connector.  vhost does configuration in 
userspace (since it's so clunky and fragile) but expects a couple of 
eventfds.

>>> Contrast this to vhost+virtio-pci (called simply "vhost" from here).
>>>
>>>        
>> It's the wrong name.  vhost implements only the data path.
>>      
> Understood, but vhost+virtio-pci is what I am contrasting, and I use
> "vhost" for short from that point on because I am too lazy to type the
> whole name over and over ;)
>    

If you #define A A+B+C don't expect intelligent conversation afterwards.

>>> It is not immune to requiring in-kernel addressing support either, but
>>> rather it just does it differently (and its not as you might expect via
>>> qemu).
>>>
>>> Vhost relies on QEMU to render PCI objects to the guest, which the guest
>>> assigns resources (such as BARs, interrupts, etc).
>>>        
>> vhost does not rely on qemu.  It relies on its user to handle
>> configuration.  In one important case it's qemu+pci.  It could just as
>> well be the lguest launcher.
>>      
> I meant vhost=vhost+virtio-pci here.  Sorry for the confusion.
>
> The point I am making specifically is that vhost in general relies on
> other in-kernel components to function.  I.e. It cannot function without
> having something like the PCI model to build an IO namespace.  That
> namespace (in this case, pio addresses+data tuples) are used for the
> in-kernel addressing function under KVM + virtio-pci.
>
> The case of the lguest launcher is a good one to highlight.  Yes, you
> can presumably also use lguest with vhost, if the requisite facilities
> are exposed to lguest-bus, and some eventfd based thing like ioeventfd
> is written for the host (if it doesnt exist already).
>
> And when the next virt design "foo" comes out, it can make a "foo-bus"
> model, and implement foo-eventfd on the backend, etc, etc.
>    

It's exactly the same with vbus needing additional connectors for 
additional transports.

> Ira can make ira-bus, and ira-eventfd, etc, etc.
>
> Each iteration will invariably introduce duplicated parts of the stack.
>    

Invariably?  Use libraries (virtio-shmem.ko, libvhost.so).


>> For the N+1th time, no.  vhost is perfectly usable without pci.  Can we
>> stop raising and debunking this point?
>>      
> Again, I understand vhost is decoupled from PCI, and I don't mean to
> imply anything different.  I use PCI as an example here because a) its
> the only working example of vhost today (to my knowledge), and b) you
> have stated in the past that PCI is the only "right" way here, to
> paraphrase.  Perhaps you no longer feel that way, so I apologize if you
> feel you already recanted your position on PCI and I missed it.
>    

For kvm/x86 pci definitely remains king.  I was talking about the two 
lguest users and Ira.

> I digress.  My point here isn't PCI.  The point here is the missing
> component for when PCI is not present.  The component that is partially
> satisfied by vbus's devid addressing scheme.  If you are going to use
> vhost, and you don't have PCI, you've gotta build something to replace it.
>    

Yes, that's why people have keyboards.  They'll write that glue code if 
they need it.  If it turns out to be a hit and people start having virtio 
transport module writing parties, they'll figure out a way to share code.

>>> All you really need is a simple decode+execute mechanism, and a way to
>>> program it from userspace control.  vbus tries to do just that:
>>> commoditize it so all you need is the transport of the control messages
>>> (like DEVCALL()), but the decode+execute itself is reuseable, even
>>> across various environments (like KVM or Iras rig).
>>>
>>>        
>> If you think it should be "commodotized", write libvhostconfig.so.
>>      
> I know you are probably being facetious here, but what do you propose
> for the parts that must be in-kernel?
>    

On the guest side, virtio-shmem.ko can unify the ring access.  It 
probably makes sense even today.  On the host side, eventfd is the 
kernel interface and libvhostconfig.so can provide the configuration 
when an existing ABI is not imposed.

>>> And your argument, I believe, is that vbus allows both to be implemented
>>> in the kernel (though to reiterate, its optional) and is therefore a bad
>>> design, so lets discuss that.
>>>
>>> I believe the assertion is that things like config-space are best left
>>> to userspace, and we should only relegate fast-path duties to the
>>> kernel.  The problem is that, in my experience, a good deal of
>>> config-space actually influences the fast-path and thus needs to
>>> interact with the fast-path mechanism eventually anyway.
>>> Whats left
>>> over that doesn't fall into this category may cheaply ride on existing
>>> plumbing, so its not like we created something new or unnatural just to
>>> support this subclass of config-space.
>>>
>>>        
>> Flexibility is reduced, because changing code in the kernel is more
>> expensive than in userspace, and kernel/user interfaces aren't typically
>> as wide as pure userspace interfaces.  Security is reduced, since a bug
>> in the kernel affects the host, while a bug in userspace affects just on
>> guest.
>>      
> For a mac-address attribute?  Thats all we are really talking about
> here.  These points you raise, while true of any kernel code I suppose,
> are a bit of a stretch in this context.
>    

Look at the virtio-net feature negotiation.  There's a lot more there 
than the MAC address, and it's going to grow.

>> Example: feature negotiation.  If it happens in userspace, it's easy to
>> limit what features we expose to the guest.
>>      
> Its not any harder in the kernel.  I do this today.
>
> And when you are done negotiating said features, you will generally have
> to turn around and program the feature into the backend anyway (e.g.
> ioctl() to vhost module).  Now you have to maintain some knowledge of
> that particular feature and how to program it in two places.
>    

No, you can leave it enabled unconditionally in vhost (the guest won't 
use what it doesn't know about).

> Conversely, I am eliminating the (unnecessary) middleman by letting the
> feature negotiating take place directly between the two entities that
> will consume it.
>    

The middleman is necessary, if you want to support live migration, or to 
restrict a guest to a subset of your features.

>>   If it happens in the
>> kernel, we need to add an interface to let the kernel know which
>> features it should expose to the guest.
>>      
> You need this already either way for both models anyway.  As an added
> bonus, vbus has generalized that interface using sysfs attributes, so
> all models are handled in a similar and community accepted way.
>    

vhost doesn't need it since userspace takes care of it.

>>   We also need to add an
>> interface to let userspace know which features were negotiated, if we
>> want to implement live migration.  Something fairly trivial bloats rapidly.
>>      
> Can you elaborate on the requirements for live-migration?  Wouldnt an
> opaque save/restore model work here? (e.g. why does userspace need to be
> able to interpret the in-kernel state, just pass it along as a blob to
> the new instance).
>    

A blob would work, if you commit to forward and backward compatibility 
on the kernel side (i.e. an older kernel must be able to accept a blob 
from a newer one).  I don't like blobs though; they tie you to the 
implementation.

>> As you can see above, userspace needs to be involved in this, and the
>> number of interfaces required is smaller if it's in userspace:
>>      
> Actually, no.  My experience has been the opposite.  Anytime I sat down
> and tried to satisfy your request to move things to the userspace,
> things got ugly and duplicative really quick.  I suspect part of the
> reason you may think its easier because you already have part of
> virtio-net in userspace and its surrounding support, but that is not the
> case moving forward for new device types.
>    

I can't comment on your experience, but we'll definitely build on 
existing code for new device types.

>> you only
>> need to know which features the kernel supports (they can be enabled
>> unconditionally, just not exposed).
>>
>> Further, some devices are perfectly happy to be implemented in
>> userspace, so we need userspace configuration support anyway.  Why
>> reimplement it in the kernel?
>>      
> Thats fine.  vbus is targetted for high-performance IO.  So if you have
> a robust userspace (like KVM+QEMU) and low-performance constraints (say,
> for a console or something), put it in userspace and vbus is not
> involved.  I don't care.
>    

So now the hypothetical non-pci hypervisor needs to support two busses.
Avi Kivity Sept. 24, 2009, 8:03 a.m. UTC | #58
On 09/23/2009 10:37 PM, Avi Kivity wrote:
>
> Example: feature negotiation.  If it happens in userspace, it's easy 
> to limit what features we expose to the guest.  If it happens in the 
> kernel, we need to add an interface to let the kernel know which 
> features it should expose to the guest.  We also need to add an 
> interface to let userspace know which features were negotiated, if we 
> want to implement live migration.  Something fairly trivial bloats 
> rapidly.

btw, we have this issue with kvm reporting cpuid bits to the guest.  
Instead of letting kvm talk directly to the hardware and the guest, kvm 
gets the cpuid bits from the hardware, strips away features it doesn't 
support, exposes that to userspace, and expects userspace to program the 
cpuid bits it wants to expose to the guest (which may be different than 
what kvm exposed to userspace, and different from guest to guest).
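
In rough terms, and glossing over error handling and sizing, the flow is
something like this sketch (the bit stripped below is just an arbitrary
example of per-guest policy):

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define MAX_CPUID_ENTRIES 100

/* Ask kvm which cpuid bits it supports (hardware, filtered by kvm),
 * strip whatever policy says this guest shouldn't see, then program
 * the result into the vcpu.  Different guests can get different bits. */
static int program_guest_cpuid(int kvm_fd, int vcpu_fd)
{
        struct kvm_cpuid2 *cpuid;
        unsigned int i;
        int ret = -1;

        cpuid = calloc(1, sizeof(*cpuid) +
                       MAX_CPUID_ENTRIES * sizeof(cpuid->entries[0]));
        if (!cpuid)
                return -1;
        cpuid->nent = MAX_CPUID_ENTRIES;

        if (ioctl(kvm_fd, KVM_GET_SUPPORTED_CPUID, cpuid) >= 0) {
                for (i = 0; i < cpuid->nent; i++)
                        if (cpuid->entries[i].function == 1)
                                cpuid->entries[i].ecx &= ~(1u << 3);
                                /* e.g. hide MONITOR/MWAIT from this guest */

                ret = ioctl(vcpu_fd, KVM_SET_CPUID2, cpuid);
        }

        free(cpuid);
        return ret;
}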
Gregory Haskins Sept. 24, 2009, 6:03 p.m. UTC | #59
Avi Kivity wrote:
> On 09/24/2009 12:15 AM, Gregory Haskins wrote:
>>
>>>> There are various aspects about designing high-performance virtual
>>>> devices such as providing the shortest paths possible between the
>>>> physical resources and the consumers.  Conversely, we also need to
>>>> ensure that we meet proper isolation/protection guarantees at the same
>>>> time.  What this means is there are various aspects to any
>>>> high-performance PV design that require to be placed in-kernel to
>>>> maximize the performance yet properly isolate the guest.
>>>>
>>>> For instance, you are required to have your signal-path (interrupts and
>>>> hypercalls), your memory-path (gpa translation), and
>>>> addressing/isolation model in-kernel to maximize performance.
>>>>
>>>>        
>>> Exactly.  That's what vhost puts into the kernel and nothing more.
>>>      
>> Actually, no.  Generally, _KVM_ puts those things into the kernel, and
>> vhost consumes them.  Without KVM (or something equivalent), vhost is
>> incomplete.  One of my goals with vbus is to generalize the "something
>> equivalent" part here.
>>    
> 
> I don't really see how vhost and vbus are different here.  vhost expects
> signalling to happen through a couple of eventfds and requires someone
> to supply them and implement kernel support (if needed).  vbus requires
> someone to write a connector to provide the signalling implementation. 
> Neither will work out-of-the-box when implementing virtio-net over
> falling dominos, for example.

I realize in retrospect that my choice of words above implies vbus _is_
complete, but this is not what I was saying.  What I was trying to
convey is that vbus is _more_ complete.  Yes, in either case some kind
of glue needs to be written.  The difference is that vbus implements
more of the glue generally, and leaves less required to be customized
for each iteration.

Going back to our stack diagrams, you could think of a vhost solution
like this:

--------------------------
| virtio-net
--------------------------
| virtio-ring
--------------------------
| virtio-bus
--------------------------
| ? undefined-1 ?
--------------------------
| vhost
--------------------------

and you could think of a vbus solution like this

--------------------------
| virtio-net
--------------------------
| virtio-ring
--------------------------
| virtio-bus
--------------------------
| bus-interface
--------------------------
| ? undefined-2 ?
--------------------------
| bus-model
--------------------------
| virtio-net-device (vhost ported to vbus model? :)
--------------------------


So the difference between vhost and vbus in this particular context is
that you need to have "undefined-1" do device discovery/hotswap,
config-space, address-decode/isolation, signal-path routing, memory-path
routing, etc.  Today this function is filled by things like virtio-pci,
pci-bus, KVM/ioeventfd, and QEMU for x86.  I am not as familiar with
lguest, but presumably it is filled there by components like
virtio-lguest, lguest-bus, lguest.ko, and lguest-launcher.  And to use
more contemporary examples, we might have virtio-domino, domino-bus,
domino.ko, and domino-launcher as well as virtio-ira, ira-bus, ira.ko,
and ira-launcher.

Contrast this to the vbus stack:  The bus-X components (when optionally
employed by the connector designer) do device-discovery, hotswap,
config-space, address-decode/isolation, signal-path and memory-path
routing, etc in a general (and pv-centric) way. The "undefined-2"
portion is the "connector", and just needs to convey messages like
"DEVCALL" and "SHMSIGNAL".  The rest is handled in other parts of the stack.

So to answer your question, the difference is that the part that has to
be customized in vbus should be a fraction of what needs to be
customized with vhost because it defines more of the stack.  And, as
alluded to in my diagram, both virtio-net and vhost (with some
modifications to fit into the vbus framework) are potentially
complementary, not competitors.

> 
>>>> Vbus accomplishes its in-kernel isolation model by providing a
>>>> "container" concept, where objects are placed into this container by
>>>> userspace.  The host kernel enforces isolation/protection by using a
>>>> namespace to identify objects that is only relevant within a specific
>>>> container's context (namely, a "u32 dev-id").  The guest addresses the
>>>> objects by its dev-id, and the kernel ensures that the guest can't
>>>> access objects outside of its dev-id namespace.
>>>>
>>>>        
>>> vhost manages to accomplish this without any kernel support.
>>>      
>> No, vhost manages to accomplish this because of KVMs kernel support
>> (ioeventfd, etc).   Without a KVM-like in-kernel support, vhost is a
>> merely a kind of "tuntap"-like clone signalled by eventfds.
>>    
> 
> Without a vbus-connector-falling-dominos, vbus-venet can't do anything
> either.

Mostly covered above...

However, I was addressing your assertion that vhost somehow magically
accomplishes this "container/addressing" function without any specific
kernel support.  This is incorrect.  I contend that this kernel support
is required and present.  The difference is that it's defined elsewhere
(and typically in a transport/arch-specific way).

IOW: You can basically think of the programmed PIO addresses as forming
its "container".  Only addresses explicitly added are visible, and
everything else is inaccessible.  This whole discussion is merely a
question of what's been generalized versus what needs to be
re-implemented each time.


> Both vhost and vbus need an interface,

Agreed

> vhost's is just narrower since it doesn't do configuration or enumeration.

I would say that makes the vhost solution's interface wider, not narrower.
With the vbus kvm-connector, simple vbus device instantiation implicitly
registers it in the address/enumeration namespace, and transmits a
devadd event.  It does all that with no more interface complexity than
instantiating a vhost device.  However, vhost has to then also
separately configure its address/enumeration space with other subsystems
(e.g. pci, ioeventfd, msi, etc), and define its config-space twice.

This means something in userspace has to proxy and/or refactor requests.
 This also means that the userspace component has to have some knowledge
of _how_ to proxy/refactor said requests (i.e. splitting the design),
which is another example of where the current vhost model really falls
apart IMO.

> 
>> This goes directly to my rebuttal of your claim that vbus places too
>> much in the kernel.  I state that, one way or the other, address decode
>> and isolation _must_ be in the kernel for performance.  Vbus does this
>> with a devid/container scheme.  vhost+virtio-pci+kvm does it with
>> pci+pio+ioeventfd.
>>    
> 
> vbus doesn't do kvm guest address decoding for the fast path.  It's
> still done by ioeventfd.

That is not correct.  vbus does its own native address decoding in the
fast path, such as here:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/client.c;h=e85b2d92d629734866496b67455dd307486e394a;hb=e6cbd4d1decca8e829db3b2b9b6ec65330b379e9#l331

The connector delivers a SHMSIGNAL(id) message, and it's decoded
generically by an RCU-protected radix tree.

I think what you are thinking of is that my KVM-connector in AlacrityVM
uses PIO/ioeventfd (*) as part of its transport to deliver that
SHMSIGNAL message.  In this sense, I am doing two address-decodes (one
for the initial pio, one for the subsequent shmsignal), but this is an
implementation detail of the KVM connector.

(Also note that it's an implementation detail that the KVM maintainer
forced me into ;)  The original vbus design utilized a global hypercall
in place of the PIO, and thus the shmsignal was the only real decode
occurring)

(*) actually I dropped ioeventfd in my latest tree, but this is a
separate topic.  I still use KVM's pio-bus, however.

> 
>>>   The guest
>>> simply has not access to any vhost resources other than the guest->host
>>> doorbell, which is handed to the guest outside vhost (so it's somebody
>>> else's problem, in userspace).
>>>      
>> You mean _controlled_ by userspace, right?  Obviously, the other side of
>> the kernel still needs to be programmed (ioeventfd, etc).  Otherwise,
>> vhost would be pointless: e.g. just use vanilla tuntap if you don't need
>> fast in-kernel decoding.
>>    
> 
> Yes (though for something like level-triggered interrupts we're probably
> keeping it in userspace, enjoying the benefits of vhost data path while
> paying more for signalling).

That's fine.  I am primarily interested in high-performance IO, so
low-perf/legacy components can fall back to something else like
userspace if that best serves them.

> 
>>>> All that is required is a way to transport a message with a "devid"
>>>> attribute as an address (such as DEVCALL(devid)) and the framework
>>>> provides the rest of the decode+execute function.
>>>>
>>>>        
>>> vhost avoids that.
>>>      
>> No, it doesn't avoid it.  It just doesn't specify how its done, and
>> relies on something else to do it on its behalf.
>>    
> 
> That someone else can be in userspace, apart from the actual fast path.

No, this "devcall" like decoding _is_ fast path and it can't be in
userspace if you care about performance.  And if you don't care about
performance, you can use existing facilities (like QEMU+tuntap) so vhost
and vbus alike would become unnecessary in that scenario.

> 
>> Conversely, vbus specifies how its done, but not how to transport the
>> verb "across the wire".  That is the role of the vbus-connector
>> abstraction.
>>    
> 
> So again, vbus does everything in the kernel (since it's so easy and
> cheap) but expects a vbus-connector.  vhost does configuration in
> userspace (since it's so clunky and fragile) but expects a couple of
> eventfds.

Well, we are talking about fast-path here, so I am not sure why
config-space is coming up in this context.  I digress. I realize you are
being sarcastic, but your easy+cheap/clunky+fragile assessment is more
accurate than you perhaps realize.

You keep extolling that vhost does most things in userspace and that is
an advantage.  But the simple fact is that they both functionally do
almost the same amount in-kernel, because they _have_ to.  This includes
the obvious stuff like signal and memory routing, but also the less
obvious stuff like most of config-space.  Ultimately, most of
config-space needs to terminate at the device-model (the one exception
is perhaps "read-only attributes", like "MACQUERY").  Therefore, even if
you use a vhost like model, most of your parameters will invariably be a
translation from one config space to another and passed on (e.g. pci
config-cycle to ioctl()).
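
To put it in code, the userspace half of the split model degenerates into
a shim of roughly this shape (purely illustrative; the register offset
and backend ioctl are hypothetical, not an actual qemu/vhost interface):

#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical config-space register and backend ioctl. */
#define CFG_OFF_FEATURES        0x10
#define BACKEND_SET_FEATURES    _IOW(0xbf, 0x01, uint32_t)

struct proxy_dev {
        int backend_fd;         /* the in-kernel device model */
};

/* The guest writes its config-space; userspace adds no value of its own
 * here, it simply relays the bits to where they are actually consumed. */
static int config_write(struct proxy_dev *dev, uint32_t offset,
                        uint32_t val)
{
        if (offset == CFG_OFF_FEATURES)
                return ioctl(dev->backend_fd, BACKEND_SET_FEATURES, &val);

        return 0;       /* anything else is handled purely in userspace */
}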

The disparity of in-kernel vs userspace functionality that remains
between the two implementations is basically the enumeration/hotswap
and read-only attribute functions.  These functions are prohibitively
complex in the vhost+virtio-pci+kvm model (full ICH/pci chipset
emulation, etc), so I understand why we wouldnt want to move those
in-kernel.  However, vbus was designed from scratch specifically for PV
to be flexible and simple.  As a result, the remaining functions in the
kvm-connector take advantage of this simplicity and just ride on the
existing model that we needed for fast-path anyway.  What this means is
that there is no significant consequence to doing these few minor details
in-kernel, other than this long discussion.

In fact, it's actually a simpler design to unify things this way because
you avoid splitting the device model up. Consider how painful the vhost
implementation would be if it didn't already have the userspace
virtio-net to fall-back on.  This is effectively what we face for new
devices going forward if that model is to persist.


> 
>>>> Contrast this to vhost+virtio-pci (called simply "vhost" from here).
>>>>
>>>>        
>>> It's the wrong name.  vhost implements only the data path.
>>>      
>> Understood, but vhost+virtio-pci is what I am contrasting, and I use
>> "vhost" for short from that point on because I am too lazy to type the
>> whole name over and over ;)
>>    
> 
> If you #define A A+B+C don't expect intelligent conversation afterwards.

Fair enough, but I did attempt to declare the definition before using
it.  Sorry again for the confusion.

> 
>>>> It is not immune to requiring in-kernel addressing support either, but
>>>> rather it just does it differently (and its not as you might expect via
>>>> qemu).
>>>>
>>>> Vhost relies on QEMU to render PCI objects to the guest, which the
>>>> guest
>>>> assigns resources (such as BARs, interrupts, etc).
>>>>        
>>> vhost does not rely on qemu.  It relies on its user to handle
>>> configuration.  In one important case it's qemu+pci.  It could just as
>>> well be the lguest launcher.
>>>      
>> I meant vhost=vhost+virtio-pci here.  Sorry for the confusion.
>>
>> The point I am making specifically is that vhost in general relies on
>> other in-kernel components to function.  I.e. It cannot function without
>> having something like the PCI model to build an IO namespace.  That
>> namespace (in this case, pio addresses+data tuples) are used for the
>> in-kernel addressing function under KVM + virtio-pci.
>>
>> The case of the lguest launcher is a good one to highlight.  Yes, you
>> can presumably also use lguest with vhost, if the requisite facilities
>> are exposed to lguest-bus, and some eventfd based thing like ioeventfd
>> is written for the host (if it doesnt exist already).
>>
>> And when the next virt design "foo" comes out, it can make a "foo-bus"
>> model, and implement foo-eventfd on the backend, etc, etc.
>>    
> 
> It's exactly the same with vbus needing additional connectors for
> additional transports.

No, see my reply above.

> 
>> Ira can make ira-bus, and ira-eventfd, etc, etc.
>>
>> Each iteration will invariably introduce duplicated parts of the stack.
>>    
> 
> Invariably?

As in "always"

>  Use libraries (virtio-shmem.ko, libvhost.so).

What do you suppose vbus is?  vbus-proxy.ko = virtio-shmem.ko, and you
don't need libvhost.so per se since you can just use standard kernel
interfaces (like configfs/sysfs).  I could create an .so going forward
for the new ioctl-based interface, I suppose.

> 
> 
>>> For the N+1th time, no.  vhost is perfectly usable without pci.  Can we
>>> stop raising and debunking this point?
>>>      
>> Again, I understand vhost is decoupled from PCI, and I don't mean to
>> imply anything different.  I use PCI as an example here because a) its
>> the only working example of vhost today (to my knowledge), and b) you
>> have stated in the past that PCI is the only "right" way here, to
>> paraphrase.  Perhaps you no longer feel that way, so I apologize if you
>> feel you already recanted your position on PCI and I missed it.
>>    
> 
> For kvm/x86 pci definitely remains king.

For full virtualization, sure.  I agree.  However, we are talking about
PV here.  For PV, PCI is not a requirement and is a technical dead-end IMO.

KVM seems to be the only virt solution that thinks otherwise (*), but I
believe that is primarily a condition of its maturity.  I aim to help
advance things here.

(*) citation: xen has xenbus, lguest has lguest-bus, vmware has some
VMI-esque thing (I forget what it's called) to name a few.  Love 'em or
hate 'em, most other hypervisors do something along these lines.  I'd
like to try to create one for KVM, but to unify them all (at least for
the Linux-based host designs).

>  I was talking about the two
> lguest users and Ira.
> 
>> I digress.  My point here isn't PCI.  The point here is the missing
>> component for when PCI is not present.  The component that is partially
>> satisfied by vbus's devid addressing scheme.  If you are going to use
>> vhost, and you don't have PCI, you've gotta build something to replace
>> it.
>>    
> 
> Yes, that's why people have keyboards.  They'll write that glue code if
> they need it.  If it turns out to be a hit and people start having virtio
> transport module writing parties, they'll figure out a way to share code.

Sigh...  The party has already started.  I tried to invite you months ago...

> 
>>>> All you really need is a simple decode+execute mechanism, and a way to
>>>> program it from userspace control.  vbus tries to do just that:
>>>> commoditize it so all you need is the transport of the control messages
>>>> (like DEVCALL()), but the decode+execute itself is reuseable, even
>>>> across various environments (like KVM or Iras rig).
>>>>
>>>>        
>>> If you think it should be "commodotized", write libvhostconfig.so.
>>>      
>> I know you are probably being facetious here, but what do you propose
>> for the parts that must be in-kernel?
>>    
> 
> On the guest side, virtio-shmem.ko can unify the ring access.  It
> probably makes sense even today.  On the host side, eventfd is the
> kernel interface and libvhostconfig.so can provide the configuration
> when an existing ABI is not imposed.

That won't cut it.  For one, creating an eventfd is only part of the
equation.  I.e. you need the signal to originate/terminate somewhere
interesting (and in-kernel, otherwise use tuntap).

> 
>>>> And your argument, I believe, is that vbus allows both to be
>>>> implemented
>>>> in the kernel (though to reiterate, its optional) and is therefore a
>>>> bad
>>>> design, so lets discuss that.
>>>>
>>>> I believe the assertion is that things like config-space are best left
>>>> to userspace, and we should only relegate fast-path duties to the
>>>> kernel.  The problem is that, in my experience, a good deal of
>>>> config-space actually influences the fast-path and thus needs to
>>>> interact with the fast-path mechanism eventually anyway.
>>>> Whats left
>>>> over that doesn't fall into this category may cheaply ride on existing
>>>> plumbing, so its not like we created something new or unnatural just to
>>>> support this subclass of config-space.
>>>>
>>>>        
>>> Flexibility is reduced, because changing code in the kernel is more
>>> expensive than in userspace, and kernel/user interfaces aren't typically
>>> as wide as pure userspace interfaces.  Security is reduced, since a bug
>>> in the kernel affects the host, while a bug in userspace affects just on
>>> guest.
>>>      
>> For a mac-address attribute?  Thats all we are really talking about
>> here.  These points you raise, while true of any kernel code I suppose,
>> are a bit of a stretch in this context.
>>    
> 
> Look at the virtio-net feature negotiation.  There's a lot more there
> than the MAC address, and it's going to grow.

Agreed, but note that makes my point.  That feature negotiation almost
invariably influences the device-model, not some config-space shim.
IOW: terminating config-space at some userspace shim is pointless.  The
model ultimately needs the result of whatever transpires during that
negotiation anyway.

> 
>>> Example: feature negotiation.  If it happens in userspace, it's easy to
>>> limit what features we expose to the guest.
>>>      
>> Its not any harder in the kernel.  I do this today.
>>
>> And when you are done negotiating said features, you will generally have
>> to turn around and program the feature into the backend anyway (e.g.
>> ioctl() to vhost module).  Now you have to maintain some knowledge of
>> that particular feature and how to program it in two places.
>>    
> 
> No, you can leave it enabled unconditionally in vhost (the guest won't
> use what it doesn't know about).

Perhaps, but IMO sending a "feature-mask"-like object down is far easier
than proxying/refactoring config-space and sending that down.  I'd still
chalk the win here to the vbus model used in AlacrityVM.

FWIW: venet has the ability to enable/disable features on the host side,
so clearly userspace config-space is not required for the basic premise.
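
To be concrete about what "sending a feature-mask down" looks like on the
vhost side, here is a sketch against the VHOST_GET_FEATURES and
VHOST_ACK_FEATURES ioctls in Michael's patch (assuming its
<linux/vhost.h> is installed; treat the exact argument types as
assumptions and error handling is omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

static uint64_t negotiate(int vhost_fd, uint64_t wanted)
{
	uint64_t features;

	/* What the in-kernel backend is able to do... */
	ioctl(vhost_fd, VHOST_GET_FEATURES, &features);

	/* ...masked by what management wants this guest to see... */
	features &= wanted;

	/* ...is all that ever gets offered.  Anything stripped here is
	 * simply never advertised, even though vhost still supports it. */
	ioctl(vhost_fd, VHOST_ACK_FEATURES, &features);
	return features;
}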

> 
>> Conversely, I am eliminating the (unnecessary) middleman by letting the
>> feature negotiating take place directly between the two entities that
>> will consume it.
>>    
> 
> The middleman is necessary, if you want to support live migration

Orchestrating live-migration has nothing to do with whether config-space
is serviced by a middle-man or not.  It shouldn't be required to have
device-specific knowledge at all beyond what was initially needed to
create/config the object at boot time.

IOW, the orchestrator merely needs to know that a device-model object is
present and a method to serialize and reconstitute its state (if
appropriate).

> , or to
> restrict a guest to a subset of your features.

No, that is incorrect.  We are not talking about directly exposing
something like HW cpuid here.  These are all virtual models, and they
can optionally expose as much or as little as we want.  They do this
under administrative control by userspace, and independent of the
location of the config-space handler.

> 
>>>   If it happens in the
>>> kernel, we need to add an interface to let the kernel know which
>>> features it should expose to the guest.
>>>      
>> You need this already either way for both models anyway.  As an added
>> bonus, vbus has generalized that interface using sysfs attributes, so
>> all models are handled in a similar and community accepted way.
>>    
> 
> vhost doesn't need it since userspace takes care of it.

Ok, but see my related reply above.

> 
>>>   We also need to add an
>>> interface to let userspace know which features were negotiated, if we
>>> want to implement live migration.  Something fairly trivial bloats
>>> rapidly.
>>>      
>> Can you elaborate on the requirements for live-migration?  Wouldn't an
>> opaque save/restore model work here? (e.g. why does userspace need to be
>> able to interpret the in-kernel state, just pass it along as a blob to
>> the new instance).
>>    
> 
> A blob would work, if you commit to forward and backward compatibility
> in the kernel side (i.e. an older kernel must be able to accept a blob
> from a newer one).

That's understood and acceptable.
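
(To be concrete about what I mean by an opaque blob: something as simple
as a version- and length-prefixed chunk of device state, along these
lines.  Purely illustrative; nothing in either patchset defines this
format.)

#include <stdint.h>
#include <string.h>

/* Illustrative only; neither vhost nor vbus defines this format. */
struct devstate_blob_hdr {
	uint32_t version;	/* bumped when field semantics change */
	uint32_t len;		/* payload bytes that follow the header */
};

struct devstate_v1 {
	uint64_t acked_features;
	uint16_t last_avail_idx[2];	/* one per virtqueue */
};

/* An old kernel restoring a newer blob ignores trailing bytes it does
 * not understand; a new kernel restoring an older (shorter) blob fills
 * the missing fields with defaults. */
static size_t save_state(void *buf, const struct devstate_v1 *st)
{
	struct devstate_blob_hdr hdr = { .version = 1, .len = sizeof(*st) };

	memcpy(buf, &hdr, sizeof(hdr));
	memcpy((char *)buf + sizeof(hdr), st, sizeof(*st));
	return sizeof(hdr) + sizeof(*st);
}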

> I don't like blobs though; they tie you to the implementation.

What would you suggest otherwise?

> 
>>> As you can see above, userspace needs to be involved in this, and the
>>> number of interfaces required is smaller if it's in userspace:
>>>      
>> Actually, no.  My experience has been the opposite.  Anytime I sat down
>> and tried to satisfy your request to move things to userspace,
>> things got ugly and duplicative really quickly.  I suspect part of the
>> reason you may think it's easier is because you already have part of
>> virtio-net in userspace and its surrounding support, but that is not the
>> case moving forward for new device types.
>>    
> 
> I can't comment on your experience, but we'll definitely build on
> existing code for new device types.

Fair enough.  I'll build on my experience, either reusing existing code
or implementing new designs where appropriate.  If you or anyone else
want to join me in my efforts, the more the merrier.

> 
>>> you only
>>> need to know which features the kernel supports (they can be enabled
>>> unconditionally, just not exposed).
>>>
>>> Further, some devices are perfectly happy to be implemented in
>>> userspace, so we need userspace configuration support anyway.  Why
>>> reimplement it in the kernel?
>>>      
>> That's fine.  vbus is targeted for high-performance IO.  So if you have
>> a robust userspace (like KVM+QEMU) and low-performance constraints (say,
>> for a console or something), put it in userspace and vbus is not
>> involved.  I don't care.
>>    
> 
> So now the hypothetical non-pci hypervisor needs to support two busses.

No.  The hypothetical hypervisor only needs to decide where
low-performance devices should live.  If that is best served by
making/reusing a unique bus for them, I have no specific problem with
that.  Systems are typically composed of multiple buses anyway.

Conversely, there is nothing wrong with putting low-performance devices
on a bus designed for high-performance either, and vbus can accommodate
both types.  The latter is what I would advocate for simplicity's sake,
but it's not a requirement.

Kind Regards,
-Greg
Gregory Haskins Sept. 24, 2009, 6:04 p.m. UTC | #60
Avi Kivity wrote:
> On 09/23/2009 10:37 PM, Avi Kivity wrote:
>>
>> Example: feature negotiation.  If it happens in userspace, it's easy
>> to limit what features we expose to the guest.  If it happens in the
>> kernel, we need to add an interface to let the kernel know which
>> features it should expose to the guest.  We also need to add an
>> interface to let userspace know which features were negotiated, if we
>> want to implement live migration.  Something fairly trivial bloats
>> rapidly.
> 
> btw, we have this issue with kvm reporting cpuid bits to the guest. 
> Instead of letting kvm talk directly to the hardware and the guest, kvm
> gets the cpuid bits from the hardware, strips away features it doesn't
> support, exposes that to userspace, and expects userspace to program the
> cpuid bits it wants to expose to the guest (which may be different than
> what kvm exposed to userspace, and different from guest to guest).
> 

This issue doesn't exist in the model I am referring to, as these are
all virtual devices anyway.  See my last reply.

-Greg
Ira Snyder Sept. 24, 2009, 7:27 p.m. UTC | #61
On Thu, Sep 24, 2009 at 10:18:28AM +0300, Avi Kivity wrote:
> On 09/24/2009 12:15 AM, Gregory Haskins wrote:
> >
> >>> There are various aspects about designing high-performance virtual
> >>> devices such as providing the shortest paths possible between the
> >>> physical resources and the consumers.  Conversely, we also need to
> >>> ensure that we meet proper isolation/protection guarantees at the same
> >>> time.  What this means is there are various aspects to any
> >>> high-performance PV design that require to be placed in-kernel to
> >>> maximize the performance yet properly isolate the guest.
> >>>
> >>> For instance, you are required to have your signal-path (interrupts and
> >>> hypercalls), your memory-path (gpa translation), and
> >>> addressing/isolation model in-kernel to maximize performance.
> >>>
> >>>        
> >> Exactly.  That's what vhost puts into the kernel and nothing more.
> >>      
> > Actually, no.  Generally, _KVM_ puts those things into the kernel, and
> > vhost consumes them.  Without KVM (or something equivalent), vhost is
> > incomplete.  One of my goals with vbus is to generalize the "something
> > equivalent" part here.
> >    
> 
> I don't really see how vhost and vbus are different here.  vhost expects 
> signalling to happen through a couple of eventfds and requires someone 
> to supply them and implement kernel support (if needed).  vbus requires 
> someone to write a connector to provide the signalling implementation.  
> Neither will work out-of-the-box when implementing virtio-net over 
> falling dominos, for example.
> 
> >>> Vbus accomplishes its in-kernel isolation model by providing a
> >>> "container" concept, where objects are placed into this container by
> >>> userspace.  The host kernel enforces isolation/protection by using a
> >>> namespace to identify objects that is only relevant within a specific
> >>> container's context (namely, a "u32 dev-id").  The guest addresses the
> >>> objects by its dev-id, and the kernel ensures that the guest can't
> >>> access objects outside of its dev-id namespace.
> >>>
> >>>        
> >> vhost manages to accomplish this without any kernel support.
> >>      
> > No, vhost manages to accomplish this because of KVMs kernel support
> > (ioeventfd, etc).   Without a KVM-like in-kernel support, vhost is
> > merely a kind of "tuntap"-like clone signalled by eventfds.
> >    
> 
> Without a vbus-connector-falling-dominos, vbus-venet can't do anything 
> either.  Both vhost and vbus need an interface, vhost's is just narrower 
> since it doesn't do configuration or enumeration.
> 
> > This goes directly to my rebuttal of your claim that vbus places too
> > much in the kernel.  I state that, one way or the other, address decode
> > and isolation _must_ be in the kernel for performance.  Vbus does this
> > with a devid/container scheme.  vhost+virtio-pci+kvm does it with
> > pci+pio+ioeventfd.
> >    
> 
> vbus doesn't do kvm guest address decoding for the fast path.  It's 
> still done by ioeventfd.
> 
> >>   The guest
> >> simply has not access to any vhost resources other than the guest->host
> >> doorbell, which is handed to the guest outside vhost (so it's somebody
> >> else's problem, in userspace).
> >>      
> > You mean _controlled_ by userspace, right?  Obviously, the other side of
> > the kernel still needs to be programmed (ioeventfd, etc).  Otherwise,
> > vhost would be pointless: e.g. just use vanilla tuntap if you don't need
> > fast in-kernel decoding.
> >    
> 
> Yes (though for something like level-triggered interrupts we're probably 
> keeping it in userspace, enjoying the benefits of vhost data path while 
> paying more for signalling).
> 
> >>> All that is required is a way to transport a message with a "devid"
> >>> attribute as an address (such as DEVCALL(devid)) and the framework
> >>> provides the rest of the decode+execute function.
> >>>
> >>>        
> >> vhost avoids that.
> >>      
> > No, it doesn't avoid it.  It just doesn't specify how its done, and
> > relies on something else to do it on its behalf.
> >    
> 
> That someone else can be in userspace, apart from the actual fast path.
> 
> > Conversely, vbus specifies how its done, but not how to transport the
> > verb "across the wire".  That is the role of the vbus-connector abstraction.
> >    
> 
> So again, vbus does everything in the kernel (since it's so easy and 
> cheap) but expects a vbus-connector.  vhost does configuration in 
> userspace (since it's so clunky and fragile) but expects a couple of 
> eventfds.
> 
> >>> Contrast this to vhost+virtio-pci (called simply "vhost" from here).
> >>>
> >>>        
> >> It's the wrong name.  vhost implements only the data path.
> >>      
> > Understood, but vhost+virtio-pci is what I am contrasting, and I use
> > "vhost" for short from that point on because I am too lazy to type the
> > whole name over and over ;)
> >    
> 
> If you #define A A+B+C don't expect intelligent conversation afterwards.
> 
> >>> It is not immune to requiring in-kernel addressing support either, but
> >>> rather it just does it differently (and its not as you might expect via
> >>> qemu).
> >>>
> >>> Vhost relies on QEMU to render PCI objects to the guest, which the guest
> >>> assigns resources (such as BARs, interrupts, etc).
> >>>        
> >> vhost does not rely on qemu.  It relies on its user to handle
> >> configuration.  In one important case it's qemu+pci.  It could just as
> >> well be the lguest launcher.
> >>      
> > I meant vhost=vhost+virtio-pci here.  Sorry for the confusion.
> >
> > The point I am making specifically is that vhost in general relies on
> > other in-kernel components to function.  I.e. It cannot function without
> > having something like the PCI model to build an IO namespace.  That
> > namespace (in this case, pio addresses+data tuples) are used for the
> > in-kernel addressing function under KVM + virtio-pci.
> >
> > The case of the lguest launcher is a good one to highlight.  Yes, you
> > can presumably also use lguest with vhost, if the requisite facilities
> > are exposed to lguest-bus, and some eventfd based thing like ioeventfd
> > is written for the host (if it doesnt exist already).
> >
> > And when the next virt design "foo" comes out, it can make a "foo-bus"
> > model, and implement foo-eventfd on the backend, etc, etc.
> >    
> 
> It's exactly the same with vbus needing additional connectors for 
> additional transports.
> 
> > Ira can make ira-bus, and ira-eventfd, etc, etc.
> >
> > Each iteration will invariably introduce duplicated parts of the stack.
> >    
> 
> Invariably?  Use libraries (virtio-shmem.ko, libvhost.so).
> 

Referencing libraries that don't yet exist doesn't seem like a good
argument against vbus from my point of view. I'm not specifically
advocating for vbus; I'm just letting you know how it looks to another
developer in the trenches.

If you'd like to see the amount of duplication present, look at the code
I'm currently working on. It mostly works at this point, though I
haven't finished my userspace, nor figured out how to actually transfer
data.

The current question I have (just to let you know where I am in
development) is:

I have the physical address of the remote data, but how do I get it into
a userspace buffer, so I can pass it to tun?

http://www.mmarray.org/~iws/virtio-phys/

Ira
Avi Kivity Sept. 25, 2009, 7:43 a.m. UTC | #62
On 09/24/2009 10:27 PM, Ira W. Snyder wrote:
>>> Ira can make ira-bus, and ira-eventfd, etc, etc.
>>>
>>> Each iteration will invariably introduce duplicated parts of the stack.
>>>
>>>        
>> Invariably?  Use libraries (virtio-shmem.ko, libvhost.so).
>>
>>      
> Referencing libraries that don't yet exist doesn't seem like a good
> argument against vbus from my point of view. I'm not speficially
> advocating for vbus; I'm just letting you know how it looks to another
> developer in the trenches.
>    

My argument is that we shouldn't write a new framework instead of fixing 
or extending an existing one.

> If you'd like to see the amount of duplication present, look at the code
> I'm currently working on.

Yes, virtio-phys-guest looks pretty much duplicated.  Looks like it 
should be pretty easy to deduplicate.

>   It mostly works at this point, though I
> haven't finished my userspace, nor figured out how to actually transfer
> data.
>
> The current question I have (just to let you know where I am in
> development) is:
>
> I have the physical address of the remote data, but how do I get it into
> a userspace buffer, so I can pass it to tun?
>    

vhost does guest physical address to host userspace address translation (in your 
scenario, remote physical to local virtual) using a table of memory 
slots; there's an ioctl that allows userspace to initialize that table.
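
Roughly, that table is the struct vhost_memory from the patch quoted
later in this thread, and the lookup is a linear scan over its regions.
A sketch (written from memory, assuming the VHOST_SET_MEM_TABLE name
from include/linux/vhost.h, which isn't reproduced here; verify the
details against the header):

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Describe one region of "guest" (or remote) physical memory to vhost. */
static int set_mem_table(int vhost_fd, uint64_t phys_base, uint64_t size,
			 void *local_base)
{
	struct vhost_memory *mem;
	int r;

	mem = calloc(1, sizeof(*mem) + sizeof(mem->regions[0]));
	if (!mem)
		return -1;
	mem->nregions = 1;
	mem->regions[0].guest_phys_addr = phys_base;
	mem->regions[0].memory_size     = size;
	mem->regions[0].userspace_addr  = (uint64_t)(uintptr_t)local_base;

	r = ioctl(vhost_fd, VHOST_SET_MEM_TABLE, mem);
	free(mem);
	return r;
}

/* The same lookup the kernel does: physical address -> local pointer. */
static void *phys_to_local(const struct vhost_memory *mem, uint64_t phys)
{
	uint32_t i;

	for (i = 0; i < mem->nregions; ++i) {
		const struct vhost_memory_region *reg = &mem->regions[i];

		if (phys >= reg->guest_phys_addr &&
		    phys - reg->guest_phys_addr < reg->memory_size)
			return (void *)(uintptr_t)(reg->userspace_addr +
						   (phys - reg->guest_phys_addr));
	}
	return NULL;
}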
Avi Kivity Sept. 25, 2009, 8:22 a.m. UTC | #63
On 09/24/2009 09:03 PM, Gregory Haskins wrote:
>
>> I don't really see how vhost and vbus are different here.  vhost expects
>> signalling to happen through a couple of eventfds and requires someone
>> to supply them and implement kernel support (if needed).  vbus requires
>> someone to write a connector to provide the signalling implementation.
>> Neither will work out-of-the-box when implementing virtio-net over
>> falling dominos, for example.
>>      
> I realize in retrospect that my choice of words above implies vbus _is_
> complete, but this is not what I was saying.  What I was trying to
> convey is that vbus is _more_ complete.  Yes, in either case some kind
> of glue needs to be written.  The difference is that vbus implements
> more of the glue generally, and leaves less required to be customized
> for each iteration.
>    


No argument there.  Since you care about non-virt scenarios and virtio 
doesn't, naturally vbus is a better fit for them as the code stands.  
But that's not a strong argument for vbus; instead of adding vbus you 
could make virtio more friendly to non-virt (there's a limit to how far you 
can take this, not imposed by the code, but by virtio's charter as a 
virtual device driver framework).

> Going back to our stack diagrams, you could think of a vhost solution
> like this:
>
> --------------------------
> | virtio-net
> --------------------------
> | virtio-ring
> --------------------------
> | virtio-bus
> --------------------------
> | ? undefined-1 ?
> --------------------------
> | vhost
> --------------------------
>
> and you could think of a vbus solution like this
>
> --------------------------
> | virtio-net
> --------------------------
> | virtio-ring
> --------------------------
> | virtio-bus
> --------------------------
> | bus-interface
> --------------------------
> | ? undefined-2 ?
> --------------------------
> | bus-model
> --------------------------
> | virtio-net-device (vhost ported to vbus model? :)
> --------------------------
>
>
> So the difference between vhost and vbus in this particular context is
> that you need to have "undefined-1" do device discovery/hotswap,
> config-space, address-decode/isolation, signal-path routing, memory-path
> routing, etc.  Today this function is filled by things like virtio-pci,
> pci-bus, KVM/ioeventfd, and QEMU for x86.  I am not as familiar with
> lguest, but presumably it is filled there by components like
> virtio-lguest, lguest-bus, lguest.ko, and lguest-launcher.  And to use
> more contemporary examples, we might have virtio-domino, domino-bus,
> domino.ko, and domino-launcher as well as virtio-ira, ira-bus, ira.ko,
> and ira-launcher.
>
> Contrast this to the vbus stack:  The bus-X components (when optionally
> employed by the connector designer) do device-discovery, hotswap,
> config-space, address-decode/isolation, signal-path and memory-path
> routing, etc in a general (and pv-centric) way. The "undefined-2"
> portion is the "connector", and just needs to convey messages like
> "DEVCALL" and "SHMSIGNAL".  The rest is handled in other parts of the stack.
>
>    

Right.  virtio assumes that it's in a virt scenario and that the guest 
architecture already has enumeration and hotplug mechanisms which it 
would prefer to use.  That happens to be the case for kvm/x86.

> So to answer your question, the difference is that the part that has to
> be customized in vbus should be a fraction of what needs to be
> customized with vhost because it defines more of the stack.

But if you want to use the native mechanisms, vbus doesn't have any 
added value.

> And, as
> alluded to in my diagram, both virtio-net and vhost (with some
> modifications to fit into the vbus framework) are potentially
> complementary, not competitors.
>    

Only theoretically.  The existing installed base would have to be thrown 
away, or we'd need to support both.

  


>> Without a vbus-connector-falling-dominos, vbus-venet can't do anything
>> either.
>>      
> Mostly covered above...
>
> However, I was addressing your assertion that vhost somehow magically
> accomplishes this "container/addressing" function without any specific
> kernel support.  This is incorrect.  I contend that this kernel support
> is required and present.  The difference is that its defined elsewhere
> (and typically in a transport/arch specific way).
>
> IOW: You can basically think of the programmed PIO addresses as forming
> its "container".  Only addresses explicitly added are visible, and
> everything else is inaccessible.  This whole discussion is merely a
> question of what's been generalized versus what needs to be
> re-implemented each time.
>    

Sorry, this is too abstract for me.



>> vbus doesn't do kvm guest address decoding for the fast path.  It's
>> still done by ioeventfd.
>>      
> That is not correct.  vbus does its own native address decoding in the
> fast path, such as here:
>
> http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/client.c;h=e85b2d92d629734866496b67455dd307486e394a;hb=e6cbd4d1decca8e829db3b2b9b6ec65330b379e9#l331
>
>    

All this is after kvm has decoded that vbus is being addressed.  It can't work 
without someone outside vbus deciding that.
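
For reference, that outside decision is just an ioeventfd registration
with kvm, roughly as below.  (The struct layout and flag names are from
memory of the ioeventfd series, not from this thread, so treat them as
assumptions and check <linux/kvm.h>.)

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/eventfd.h>
#include <linux/kvm.h>

/* Register a pio doorbell address with kvm so that the guest's write is
 * decoded by kvm itself and surfaces only as an eventfd, which is then
 * handed to vhost as the vring "kick". */
static int add_doorbell(int vm_fd, uint16_t pio_addr)
{
	int efd = eventfd(0, 0);
	struct kvm_ioeventfd ioev = {
		.addr  = pio_addr,
		.len   = 2,			/* 16-bit pio write */
		.fd    = efd,
		.flags = KVM_IOEVENTFD_FLAG_PIO,
	};

	if (efd < 0 || ioctl(vm_fd, KVM_IOEVENTFD, &ioev) < 0)
		return -1;
	return efd;
}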

> In fact, it's actually a simpler design to unify things this way because
> you avoid splitting the device model up. Consider how painful the vhost
> implementation would be if it didn't already have the userspace
> virtio-net to fall-back on.  This is effectively what we face for new
> devices going forward if that model is to persist.
>    


It doesn't have just virtio-net, it has userspace-based hotplug and a 
bunch of other devices implemented in userspace.  Currently qemu has 
virtio bindings for pci and syborg (whatever that is), and device models 
for balloon, block, net, and console, so it seems implementing device 
support in userspace is not as disastrous as you make it out to be.

>> Invariably?
>>      
> As in "always"
>    

Refactor instead of duplicating.

>    
>>   Use libraries (virtio-shmem.ko, libvhost.so).
>>      
> What do you suppose vbus is?  vbus-proxy.ko = virtio-shmem.ko, and you
> dont need libvhost.so per se since you can just use standard kernel
> interfaces (like configfs/sysfs).  I could create an .so going forward
> for the new ioctl-based interface, I suppose.
>    

Refactor instead of rewriting.



>> For kvm/x86 pci definitely remains king.
>>      
> For full virtualization, sure.  I agree.  However, we are talking about
> PV here.  For PV, PCI is not a requirement and is a technical dead-end IMO.
>
> KVM seems to be the only virt solution that thinks otherwise (*), but I
> believe that is primarily a condition of its maturity.  I aim to help
> advance things here.
>
> (*) citation: xen has xenbus, lguest has lguest-bus, vmware has some
> vmi-esq thing (I forget what its called) to name a few.  Love 'em or
> hate 'em, most other hypervisors do something along these lines.  I'd
> like to try to create one for KVM, but to unify them all (at least for
> the Linux-based host designs).
>    

VMware are throwing VMI away (won't be supported in their new product, 
and they've sent a patch to rip it out of Linux); Xen has to tunnel 
xenbus in pci for full virtualization (which is where Windows is, and 
where Linux will be too once people realize it's faster).  lguest is 
meant as an example hypervisor, not an attempt to take over the world.

"PCI is a dead end" could not be more wrong, it's what guests support.  
And right now you can have a guest using pci to access a mix of 
userspace-emulated devices, userspace-emulated-but-kernel-accelerated 
virtio devices, and real host devices.  All on one dead-end bus.  Try 
that with vbus.


>>> I digress.  My point here isn't PCI.  The point here is the missing
>>> component for when PCI is not present.  The component that is partially
>>> satisfied by vbus's devid addressing scheme.  If you are going to use
>>> vhost, and you don't have PCI, you've gotta build something to replace
>>> it.
>>>
>>>        
>> Yes, that's why people have keyboards.  They'll write that glue code if
>> they need it.  If it turns out to be a hit and people start having virtio
>> transport module writing parties, they'll figure out a way to share code.
>>      
> Sigh...  The party has already started.  I tried to invite you months ago...
>    

I've been voting virtio since 2007.

>> On the guest side, virtio-shmem.ko can unify the ring access.  It
>> probably makes sense even today.  On the host side, eventfd is the
>> kernel interface and libvhostconfig.so can provide the configuration
>> when an existing ABI is not imposed.
>>      
> That won't cut it.  For one, creating an eventfd is only part of the
> equation.  I.e. you need the signal to originate/terminate somewhere
> interesting (and in-kernel, otherwise use tuntap).
>    

vbus needs the same thing so it cancels out.

>> Look at the virtio-net feature negotiation.  There's a lot more there
>> than the MAC address, and it's going to grow.
>>      
> Agreed, but note that makes my point.  That feature negotiation almost
> invariably influences the device-model, not some config-space shim.
> IOW: terminating config-space at some userspace shim is pointless.  The
> model ultimately needs the result of whatever transpires during that
> negotiation anyway.
>    

Well, let's see.  Can vbus today:

- let userspace know which features are available (so it can decide if 
live migration is possible)
- let userspace limit which features are exposed to the guest (so it can 
make live migration possible among hosts of different capabilities)
- let userspace know which features were negotiated (so it can transfer 
them to the other host during live migration)
- let userspace tell the kernel which features were negotiated (when 
live migration completes, to avoid requiring the guest to re-negotiate)
- do all that from an unprivileged process
- securely wrt other unprivileged processes

?

What are your plans here?
Ira Snyder Sept. 25, 2009, 5:01 p.m. UTC | #64
On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
> What it is: vhost net is a character device that can be used to reduce
> the number of system calls involved in virtio networking.
> Existing virtio net code is used in the guest without modification.
> 
> There's similarity with vringfd, with some differences and reduced scope
> - uses eventfd for signalling
> - structures can be moved around in memory at any time (good for migration)
> - support memory table and not just an offset (needed for kvm)
> 
> common virtio related code has been put in a separate file vhost.c and
> can be made into a separate module if/when more backends appear.  I used
> Rusty's lguest.c as the source for developing this part : this supplied
> me with witty comments I wouldn't be able to write myself.
> 
> What it is not: vhost net is not a bus, and not a generic new system
> call. No assumptions are made on how guest performs hypercalls.
> Userspace hypervisors are supported as well as kvm.
> 
> How it works: Basically, we connect virtio frontend (configured by
> userspace) to a backend. The backend could be a network device, or a
> tun-like device. In this version I only support raw socket as a backend,
> which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
> also configured by userspace, including vlan/mac etc.
> 
> Status:
> This works for me, and I haven't see any crashes.
> I have done some light benchmarking (with v4), compared to userspace, I
> see improved latency (as I save up to 4 system calls per packet) but not
> bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
> ping benchmark (where there's no TSO) troughput is also improved.
> 
> Features that I plan to look at in the future:
> - tap support
> - TSO
> - interrupt mitigation
> - zero copy
> 
> Acked-by: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ---
>  MAINTAINERS                |   10 +
>  arch/x86/kvm/Kconfig       |    1 +
>  drivers/Makefile           |    1 +
>  drivers/vhost/Kconfig      |   11 +
>  drivers/vhost/Makefile     |    2 +
>  drivers/vhost/net.c        |  475 ++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.c      |  688 ++++++++++++++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h      |  122 ++++++++
>  include/linux/Kbuild       |    1 +
>  include/linux/miscdevice.h |    1 +
>  include/linux/vhost.h      |  101 +++++++
>  11 files changed, 1413 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vhost/Kconfig
>  create mode 100644 drivers/vhost/Makefile
>  create mode 100644 drivers/vhost/net.c
>  create mode 100644 drivers/vhost/vhost.c
>  create mode 100644 drivers/vhost/vhost.h
>  create mode 100644 include/linux/vhost.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index b1114cf..de4587f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5431,6 +5431,16 @@ S:	Maintained
>  F:	Documentation/filesystems/vfat.txt
>  F:	fs/fat/
>  
> +VIRTIO HOST (VHOST)
> +P:	Michael S. Tsirkin
> +M:	mst@redhat.com
> +L:	kvm@vger.kernel.org
> +L:	virtualization@lists.osdl.org
> +L:	netdev@vger.kernel.org
> +S:	Maintained
> +F:	drivers/vhost/
> +F:	include/linux/vhost.h
> +
>  VIA RHINE NETWORK DRIVER
>  M:	Roger Luethi <rl@hellgate.ch>
>  S:	Maintained
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index b84e571..94f44d9 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -64,6 +64,7 @@ config KVM_AMD
>  
>  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
>  # the virtualization menu.
> +source drivers/vhost/Kconfig
>  source drivers/lguest/Kconfig
>  source drivers/virtio/Kconfig
>  
> diff --git a/drivers/Makefile b/drivers/Makefile
> index bc4205d..1551ae1 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -105,6 +105,7 @@ obj-$(CONFIG_HID)		+= hid/
>  obj-$(CONFIG_PPC_PS3)		+= ps3/
>  obj-$(CONFIG_OF)		+= of/
>  obj-$(CONFIG_SSB)		+= ssb/
> +obj-$(CONFIG_VHOST_NET)		+= vhost/
>  obj-$(CONFIG_VIRTIO)		+= virtio/
>  obj-$(CONFIG_VLYNQ)		+= vlynq/
>  obj-$(CONFIG_STAGING)		+= staging/
> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> new file mode 100644
> index 0000000..d955406
> --- /dev/null
> +++ b/drivers/vhost/Kconfig
> @@ -0,0 +1,11 @@
> +config VHOST_NET
> +	tristate "Host kernel accelerator for virtio net"
> +	depends on NET && EVENTFD
> +	---help---
> +	  This kernel module can be loaded in host kernel to accelerate
> +	  guest networking with virtio_net. Not to be confused with virtio_net
> +	  module itself which needs to be loaded in guest kernel.
> +
> +	  To compile this driver as a module, choose M here: the module will
> +	  be called vhost_net.
> +
> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
> new file mode 100644
> index 0000000..72dd020
> --- /dev/null
> +++ b/drivers/vhost/Makefile
> @@ -0,0 +1,2 @@
> +obj-$(CONFIG_VHOST_NET) += vhost_net.o
> +vhost_net-y := vhost.o net.o
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> new file mode 100644
> index 0000000..2210eaa
> --- /dev/null
> +++ b/drivers/vhost/net.c
> @@ -0,0 +1,475 @@
> +/* Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + *
> + * virtio-net server in host kernel.
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/eventfd.h>
> +#include <linux/vhost.h>
> +#include <linux/virtio_net.h>
> +#include <linux/mmu_context.h>
> +#include <linux/miscdevice.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>
> +#include <linux/rcupdate.h>
> +#include <linux/file.h>
> +
> +#include <linux/net.h>
> +#include <linux/if_packet.h>
> +#include <linux/if_arp.h>
> +
> +#include <net/sock.h>
> +
> +#include "vhost.h"
> +
> +enum {
> +	VHOST_NET_VQ_RX = 0,
> +	VHOST_NET_VQ_TX = 1,
> +	VHOST_NET_VQ_MAX = 2,
> +};
> +
> +struct vhost_net {
> +	struct vhost_dev dev;
> +	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> +	/* We use a kind of RCU to access sock pointer.
> +	 * All readers access it from workqueue, which makes it possible to
> +	 * flush the workqueue instead of synchronize_rcu. Therefore readers do
> +	 * not need to call rcu_read_lock/rcu_read_unlock: the beginning of
> +	 * work item execution acts instead of rcu_read_lock() and the end of
> +	 * work item execution acts instead of rcu_read_unlock().
> +	 * Writers use device mutex. */
> +	struct socket *sock;
> +	struct vhost_poll poll[VHOST_NET_VQ_MAX];
> +};
> +
> +/* Pop first len bytes from iovec. Return number of segments used. */
> +static int move_iovec_hdr(struct iovec *from, struct iovec *to,
> +			  size_t len, int iov_count)
> +{
> +       int seg = 0;
> +       size_t size;
> +       while (len && seg < iov_count) {
> +               size = min(from->iov_len, len);
> +               to->iov_base = from->iov_base;
> +               to->iov_len = size;
> +               from->iov_len -= size;
> +               from->iov_base += size;
> +               len -= size;
> +               ++from;
> +               ++to;
> +               ++seg;
> +       }
> +       return seg;
> +}
> +
> +/* Expects to be always run from workqueue - which acts as
> + * read-size critical section for our kind of RCU. */
> +static void handle_tx(struct vhost_net *net)
> +{
> +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> +	unsigned head, out, in, s;
> +	struct msghdr msg = {
> +		.msg_name = NULL,
> +		.msg_namelen = 0,
> +		.msg_control = NULL,
> +		.msg_controllen = 0,
> +		.msg_iov = vq->iov,
> +		.msg_flags = MSG_DONTWAIT,
> +	};
> +	size_t len;
> +	int err;
> +	struct socket *sock = rcu_dereference(net->sock);
> +	if (!sock || !sock_writeable(sock->sk))
> +		return;
> +
> +	use_mm(net->dev.mm);
> +	mutex_lock(&vq->mutex);
> +	for (;;) {
> +		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
> +		/* Nothing new?  Wait for eventfd to tell us they refilled. */
> +		if (head == vq->num)
> +			break;
> +		if (in) {
> +			vq_err(vq, "Unexpected descriptor format for TX: "
> +			       "out %d, in %d\n", out, in);
> +			break;
> +		}
> +		/* Skip header. TODO: support TSO. */
> +		s = move_iovec_hdr(vq->iov, vq->hdr,
> +				   sizeof(struct virtio_net_hdr), out);
> +		msg.msg_iovlen = out;
> +		len = iov_length(vq->iov, out);
> +		/* Sanity check */
> +		if (!len) {
> +			vq_err(vq, "Unexpected header len for TX: "
> +			       "%ld expected %zd\n",
> +			       iov_length(vq->hdr, s),
> +			       sizeof(struct virtio_net_hdr));
> +			break;
> +		}
> +		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> +		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> +		if (err < 0) {
> +			vhost_discard_vq_desc(vq);
> +			break;
> +		}
> +		if (err != len)
> +			pr_err("Truncated TX packet: "
> +			       " len %d != %zd\n", err, len);
> +		vhost_add_used_and_trigger(&net->dev, vq, head, 0);
> +	}
> +
> +	mutex_unlock(&vq->mutex);
> +	unuse_mm(net->dev.mm);
> +}
> +
> +/* Expects to be always run from workqueue - which acts as
> + * read-size critical section for our kind of RCU. */
> +static void handle_rx(struct vhost_net *net)
> +{
> +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> +	unsigned head, out, in, s;
> +	struct msghdr msg = {
> +		.msg_name = NULL,
> +		.msg_namelen = 0,
> +		.msg_control = NULL, /* FIXME: get and handle RX aux data. */
> +		.msg_controllen = 0,
> +		.msg_iov = vq->iov,
> +		.msg_flags = MSG_DONTWAIT,
> +	};
> +
> +	struct virtio_net_hdr hdr = {
> +		.flags = 0,
> +		.gso_type = VIRTIO_NET_HDR_GSO_NONE
> +	};
> +
> +	size_t len;
> +	int err;
> +	struct socket *sock = rcu_dereference(net->sock);
> +	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> +		return;
> +
> +	use_mm(net->dev.mm);
> +	mutex_lock(&vq->mutex);
> +	vhost_no_notify(vq);
> +
> +	for (;;) {
> +		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
> +		/* OK, now we need to know about added descriptors. */
> +		if (head == vq->num && vhost_notify(vq))
> +			/* They could have slipped one in as we were doing that:
> +			 * check again. */
> +			continue;
> +		/* Nothing new?  Wait for eventfd to tell us they refilled. */
> +		if (head == vq->num)
> +			break;
> +		/* We don't need to be notified again. */
> +		vhost_no_notify(vq);
> +		if (out) {
> +			vq_err(vq, "Unexpected descriptor format for RX: "
> +			       "out %d, in %d\n",
> +			       out, in);
> +			break;
> +		}
> +		/* Skip header. TODO: support TSO/mergeable rx buffers. */
> +		s = move_iovec_hdr(vq->iov, vq->hdr, sizeof hdr, in);
> +		msg.msg_iovlen = in;
> +		len = iov_length(vq->iov, in);
> +		/* Sanity check */
> +		if (!len) {
> +			vq_err(vq, "Unexpected header len for RX: "
> +			       "%zd expected %zd\n",
> +			       iov_length(vq->hdr, s), sizeof hdr);
> +			break;
> +		}
> +		err = sock->ops->recvmsg(NULL, sock, &msg,
> +					 len, MSG_DONTWAIT | MSG_TRUNC);
> +		/* TODO: Check specific error and bomb out unless EAGAIN? */
> +		if (err < 0) {
> +			vhost_discard_vq_desc(vq);
> +			break;
> +		}
> +		/* TODO: Should check and handle checksum. */
> +		if (err > len) {
> +			pr_err("Discarded truncated rx packet: "
> +			       " len %d > %zd\n", err, len);
> +			vhost_discard_vq_desc(vq);
> +			continue;
> +		}
> +		len = err;
> +		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, sizeof hdr);
> +		if (err) {
> +			vq_err(vq, "Unable to write vnet_hdr at addr %p: %d\n",
> +			       vq->iov->iov_base, err);
> +			break;
> +		}
> +		vhost_add_used_and_trigger(&net->dev, vq, head,
> +					   len + sizeof hdr);
> +	}
> +
> +	mutex_unlock(&vq->mutex);
> +	unuse_mm(net->dev.mm);
> +}
> +
> +static void handle_tx_kick(struct work_struct *work)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct vhost_net *net;
> +	vq = container_of(work, struct vhost_virtqueue, poll.work);
> +	net = container_of(vq->dev, struct vhost_net, dev);
> +	handle_tx(net);
> +}
> +
> +static void handle_rx_kick(struct work_struct *work)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct vhost_net *net;
> +	vq = container_of(work, struct vhost_virtqueue, poll.work);
> +	net = container_of(vq->dev, struct vhost_net, dev);
> +	handle_rx(net);
> +}
> +
> +static void handle_tx_net(struct work_struct *work)
> +{
> +	struct vhost_net *net;
> +	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_TX].work);
> +	handle_tx(net);
> +}
> +
> +static void handle_rx_net(struct work_struct *work)
> +{
> +	struct vhost_net *net;
> +	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_RX].work);
> +	handle_rx(net);
> +}
> +
> +static int vhost_net_open(struct inode *inode, struct file *f)
> +{
> +	struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
> +	int r;
> +	if (!n)
> +		return -ENOMEM;
> +	f->private_data = n;
> +	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
> +	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
> +	r = vhost_dev_init(&n->dev, n->vqs, VHOST_NET_VQ_MAX);
> +	if (r < 0) {
> +		kfree(n);
> +		return r;
> +	}
> +
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> +	return 0;
> +}
> +
> +static struct socket *vhost_net_stop(struct vhost_net *n)
> +{
> +	struct socket *sock = n->sock;
> +	rcu_assign_pointer(n->sock, NULL);
> +	if (sock) {
> +		vhost_poll_flush(n->poll + VHOST_NET_VQ_TX);
> +		vhost_poll_flush(n->poll + VHOST_NET_VQ_RX);
> +	}
> +	return sock;
> +}
> +
> +static int vhost_net_release(struct inode *inode, struct file *f)
> +{
> +	struct vhost_net *n = f->private_data;
> +	struct socket *sock;
> +
> +	sock = vhost_net_stop(n);
> +	vhost_dev_cleanup(&n->dev);
> +	if (sock)
> +		fput(sock->file);
> +	kfree(n);
> +	return 0;
> +}
> +
> +static void vhost_net_flush(struct vhost_net *n)
> +{
> +	vhost_poll_flush(n->poll + VHOST_NET_VQ_TX);
> +	vhost_poll_flush(n->poll + VHOST_NET_VQ_RX);
> +	vhost_poll_flush(&n->dev.vqs[VHOST_NET_VQ_TX].poll);
> +	vhost_poll_flush(&n->dev.vqs[VHOST_NET_VQ_RX].poll);
> +}
> +
> +static long vhost_net_set_socket(struct vhost_net *n, int fd)
> +{
> +	struct {
> +		struct sockaddr_ll sa;
> +		char  buf[MAX_ADDR_LEN];
> +	} uaddr;
> +	struct socket *sock, *oldsock = NULL;
> +	int uaddr_len = sizeof uaddr, r;
> +
> +	mutex_lock(&n->dev.mutex);
> +	r = vhost_dev_check_owner(&n->dev);
> +	if (r)
> +		goto done;
> +
> +	if (fd == -1) {
> +		/* Disconnect from socket and device. */
> +		oldsock = vhost_net_stop(n);
> +		goto done;
> +	}
> +
> +	sock = sockfd_lookup(fd, &r);
> +	if (!sock) {
> +		r = -ENOTSOCK;
> +		goto done;
> +	}
> +
> +	/* Parameter checking */
> +	if (sock->sk->sk_type != SOCK_RAW) {
> +		r = -ESOCKTNOSUPPORT;
> +		goto done;
> +	}
> +
> +	r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
> +			       &uaddr_len, 0);
> +	if (r)
> +		goto done;
> +
> +	if (uaddr.sa.sll_family != AF_PACKET) {
> +		r = -EPFNOSUPPORT;
> +		goto done;
> +	}
> +
> +	/* start polling new socket */
> +	if (sock == oldsock)
> +		goto done;
> +
> +	if (oldsock) {
> +		vhost_poll_stop(n->poll + VHOST_NET_VQ_TX);
> +		vhost_poll_stop(n->poll + VHOST_NET_VQ_RX);
> +	}
> +	oldsock = n->sock;
> +	rcu_assign_pointer(n->sock, sock);
> +	vhost_poll_start(n->poll + VHOST_NET_VQ_TX, sock->file);
> +	vhost_poll_start(n->poll + VHOST_NET_VQ_RX, sock->file);
> +done:
> +	mutex_unlock(&n->dev.mutex);
> +	if (oldsock) {
> +		vhost_net_flush(n);
> +		fput(oldsock->file);
> +	}
> +	return r;
> +}
> +
> +static long vhost_net_reset_owner(struct vhost_net *n)
> +{
> +	struct socket *sock = NULL;
> +	long r;
> +	mutex_lock(&n->dev.mutex);
> +	r = vhost_dev_check_owner(&n->dev);
> +	if (r)
> +		goto done;
> +	sock = vhost_net_stop(n);
> +	r = vhost_dev_reset_owner(&n->dev);
> +done:
> +	mutex_unlock(&n->dev.mutex);
> +	if (sock)
> +		fput(sock->file);
> +	return r;
> +}
> +
> +static void vhost_net_set_features(struct vhost_net *n, u64 features)
> +{
> +	mutex_lock(&n->dev.mutex);
> +	n->dev.acked_features = features;
> +	mutex_unlock(&n->dev.mutex);
> +	vhost_net_flush(n);
> +}
> +
> +static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
> +			    unsigned long arg)
> +{
> +	struct vhost_net *n = f->private_data;
> +	void __user *argp = (void __user *)arg;
> +	u32 __user *featurep = argp;
> +	int __user *fdp = argp;
> +	u64 features;
> +	int fd, r;
> +	switch (ioctl) {
> +	case VHOST_NET_SET_SOCKET:
> +		r = get_user(fd, fdp);
> +		if (r < 0)
> +			return r;
> +		return vhost_net_set_socket(n, fd);
> +	case VHOST_GET_FEATURES:
> +		features = VHOST_FEATURES;
> +		return put_user(features, featurep);
> +	case VHOST_ACK_FEATURES:
> +		r = get_user(features, featurep);
> +		/* No features for now */
> +		if (r < 0)
> +			return r;
> +		if (features & ~VHOST_FEATURES)
> +			return -EOPNOTSUPP;
> +		vhost_net_set_features(n, features);
> +		return 0;
> +	case VHOST_RESET_OWNER:
> +		return vhost_net_reset_owner(n);
> +	default:
> +		return vhost_dev_ioctl(&n->dev, ioctl, arg);
> +	}
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl,
> +				   unsigned long arg)
> +{
> +	return vhost_net_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
> +}
> +#endif
> +
> +static const struct file_operations vhost_net_fops = {
> +	.owner          = THIS_MODULE,
> +	.release        = vhost_net_release,
> +	.unlocked_ioctl = vhost_net_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl   = vhost_net_compat_ioctl,
> +#endif
> +	.open           = vhost_net_open,
> +};
> +
> +static struct miscdevice vhost_net_misc = {
> +	VHOST_NET_MINOR,
> +	"vhost-net",
> +	&vhost_net_fops,
> +};
> +
> +int vhost_net_init(void)
> +{
> +	int r = vhost_init();
> +	if (r)
> +		goto err_init;
> +	r = misc_register(&vhost_net_misc);
> +	if (r)
> +		goto err_reg;
> +	return 0;
> +err_reg:
> +	vhost_cleanup();
> +err_init:
> +	return r;
> +
> +}
> +module_init(vhost_net_init);
> +
> +void vhost_net_exit(void)
> +{
> +	misc_deregister(&vhost_net_misc);
> +	vhost_cleanup();
> +}
> +module_exit(vhost_net_exit);
> +
> +MODULE_VERSION("0.0.1");
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Michael S. Tsirkin");
> +MODULE_DESCRIPTION("Host kernel accelerator for virtio net");
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> new file mode 100644
> index 0000000..6925cc1
> --- /dev/null
> +++ b/drivers/vhost/vhost.c
> @@ -0,0 +1,688 @@
> +/* Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2006 Rusty Russell IBM Corporation
> + *
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + *
> + * Inspiration, some code, and most witty comments come from
> + * Documentation/lguest/lguest.c, by Rusty Russell
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + *
> + * Generic code for virtio server in host kernel.
> + */
> +
> +#include <linux/eventfd.h>
> +#include <linux/vhost.h>
> +#include <linux/virtio_net.h>
> +#include <linux/mm.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>
> +#include <linux/rcupdate.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +
> +#include <linux/net.h>
> +#include <linux/if_packet.h>
> +#include <linux/if_arp.h>
> +
> +#include <net/sock.h>
> +
> +#include "vhost.h"
> +
> +enum {
> +	VHOST_MEMORY_MAX_NREGIONS = 64,
> +};
> +
> +static struct workqueue_struct *vhost_workqueue;
> +
> +static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
> +			    poll_table *pt)
> +{
> +	struct vhost_poll *poll;
> +	poll = container_of(pt, struct vhost_poll, table);
> +
> +	poll->wqh = wqh;
> +	add_wait_queue(wqh, &poll->wait);
> +}
> +
> +static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
> +			     void *key)
> +{
> +	struct vhost_poll *poll;
> +	poll = container_of(wait, struct vhost_poll, wait);
> +	if (!((unsigned long)key & poll->mask))
> +		return 0;
> +
> +	queue_work(vhost_workqueue, &poll->work);
> +	return 0;
> +}
> +
> +/* Init poll structure */
> +void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
> +		     unsigned long mask)
> +{
> +	INIT_WORK(&poll->work, func);
> +	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
> +	init_poll_funcptr(&poll->table, vhost_poll_func);
> +	poll->mask = mask;
> +}
> +
> +/* Start polling a file. We add ourselves to file's wait queue. The caller must
> + * keep a reference to a file until after vhost_poll_stop is called. */
> +void vhost_poll_start(struct vhost_poll *poll, struct file *file)
> +{
> +	unsigned long mask;
> +	mask = file->f_op->poll(file, &poll->table);
> +	if (mask)
> +		vhost_poll_wakeup(&poll->wait, 0, 0, (void *)mask);
> +}
> +
> +/* Stop polling a file. After this function returns, it becomes safe to drop the
> + * file reference. You must also flush afterwards. */
> +void vhost_poll_stop(struct vhost_poll *poll)
> +{
> +	remove_wait_queue(poll->wqh, &poll->wait);
> +}
> +
> +/* Flush any work that has been scheduled. When calling this, don't hold any
> + * locks that are also used by the callback. */
> +void vhost_poll_flush(struct vhost_poll *poll)
> +{
> +	flush_work(&poll->work);
> +}
> +
> +long vhost_dev_init(struct vhost_dev *dev,
> +		    struct vhost_virtqueue *vqs, int nvqs)
> +{
> +	int i;
> +	dev->vqs = vqs;
> +	dev->nvqs = nvqs;
> +	mutex_init(&dev->mutex);
> +
> +	for (i = 0; i < dev->nvqs; ++i) {
> +		dev->vqs[i].dev = dev;
> +		mutex_init(&dev->vqs[i].mutex);
> +		if (dev->vqs[i].handle_kick)
> +			vhost_poll_init(&dev->vqs[i].poll,
> +					dev->vqs[i].handle_kick,
> +					POLLIN);
> +	}
> +	return 0;
> +}
> +
> +/* Caller should have device mutex */
> +long vhost_dev_check_owner(struct vhost_dev *dev)
> +{
> +	/* Are you the owner? If not, I don't think you mean to do that */
> +	return dev->mm == current->mm ? 0 : -EPERM;
> +}
> +
> +/* Caller should have device mutex */
> +static long vhost_dev_set_owner(struct vhost_dev *dev)
> +{
> +	/* Is there an owner already? */
> +	if (dev->mm)
> +		return -EBUSY;
> +	/* No owner, become one */
> +	dev->mm = get_task_mm(current);
> +	return 0;
> +}
> +
> +/* Caller should have device mutex */
> +long vhost_dev_reset_owner(struct vhost_dev *dev)
> +{
> +	struct vhost_memory *memory;
> +
> +	/* Restore memory to default 1:1 mapping. */
> +	memory = kmalloc(offsetof(struct vhost_memory, regions) +
> +			 2 * sizeof *memory->regions, GFP_KERNEL);
> +	if (!memory)
> +		return -ENOMEM;
> +
> +	vhost_dev_cleanup(dev);
> +
> +	memory->nregions = 2;
> +	memory->regions[0].guest_phys_addr = 1;
> +	memory->regions[0].userspace_addr = 1;
> +	memory->regions[0].memory_size = ~0ULL;
> +	memory->regions[1].guest_phys_addr = 0;
> +	memory->regions[1].userspace_addr = 0;
> +	memory->regions[1].memory_size = 1;
> +	dev->memory = memory;
> +	return 0;
> +}
> +
> +/* Caller should have device mutex */
> +void vhost_dev_cleanup(struct vhost_dev *dev)
> +{
> +	int i;
> +	for (i = 0; i < dev->nvqs; ++i) {
> +		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
> +			vhost_poll_stop(&dev->vqs[i].poll);
> +			vhost_poll_flush(&dev->vqs[i].poll);
> +		}
> +		if (dev->vqs[i].error_ctx)
> +			eventfd_ctx_put(dev->vqs[i].error_ctx);
> +		if (dev->vqs[i].error)
> +			fput(dev->vqs[i].error);
> +		if (dev->vqs[i].kick)
> +			fput(dev->vqs[i].kick);
> +		if (dev->vqs[i].call_ctx)
> +			eventfd_ctx_put(dev->vqs[i].call_ctx);
> +		if (dev->vqs[i].call)
> +			fput(dev->vqs[i].call);
> +		dev->vqs[i].error_ctx = NULL;
> +		dev->vqs[i].error = NULL;
> +		dev->vqs[i].kick = NULL;
> +		dev->vqs[i].call_ctx = NULL;
> +		dev->vqs[i].call = NULL;
> +	}
> +	/* No one will access memory at this point */
> +	kfree(dev->memory);
> +	dev->memory = NULL;
> +	if (dev->mm)
> +		mmput(dev->mm);
> +	dev->mm = NULL;
> +}
> +
> +static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
> +{
> +	struct vhost_memory mem, *newmem, *oldmem;
> +	unsigned long size = offsetof(struct vhost_memory, regions);
> +	long r;
> +	r = copy_from_user(&mem, m, size);
> +	if (r)
> +		return r;
> +	if (mem.padding)
> +		return -EOPNOTSUPP;
> +	if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
> +		return -E2BIG;
> +	newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
> +	if (!newmem)
> +		return -ENOMEM;
> +
> +	memcpy(newmem, &mem, size);
> +	r = copy_from_user(newmem->regions, m->regions,
> +			   mem.nregions * sizeof *m->regions);
> +	if (r) {
> +		kfree(newmem);
> +		return r;
> +	}
> +	oldmem = d->memory;
> +	rcu_assign_pointer(d->memory, newmem);
> +	synchronize_rcu();
> +	kfree(oldmem);
> +	return 0;
> +}
> +
> +static int init_used(struct vhost_virtqueue *vq)
> +{
> +	int r = put_user(vq->used_flags, &vq->used->flags);
> +	if (r)
> +		return r;
> +	return get_user(vq->last_used_idx, &vq->used->idx);
> +}
> +
> +static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
> +{
> +	struct file *eventfp, *filep = NULL,
> +		    *pollstart = NULL, *pollstop = NULL;
> +	struct eventfd_ctx *ctx = NULL;
> +	u32 __user *idxp = argp;
> +	struct vhost_virtqueue *vq;
> +	struct vhost_vring_state s;
> +	struct vhost_vring_file f;
> +	struct vhost_vring_addr a;
> +	u32 idx;
> +	long r;
> +
> +	r = get_user(idx, idxp);
> +	if (r < 0)
> +		return r;
> +	if (idx >= d->nvqs)
> +		return -ENOBUFS;
> +
> +	vq = d->vqs + idx;
> +
> +	mutex_lock(&vq->mutex);
> +
> +	switch (ioctl) {
> +	case VHOST_SET_VRING_NUM:
> +		r = copy_from_user(&s, argp, sizeof s);
> +		if (r < 0)
> +			break;
> +		if (s.num > 0xffff) {
> +			r = -EINVAL;
> +			break;
> +		}
> +		vq->num = s.num;
> +		break;
> +	case VHOST_SET_VRING_BASE:
> +		r = copy_from_user(&s, argp, sizeof s);
> +		if (r < 0)
> +			break;
> +		if (s.num > 0xffff) {
> +			r = -EINVAL;
> +			break;
> +		}
> +		vq->avail_idx = vq->last_avail_idx = s.num;
> +		break;
> +	case VHOST_GET_VRING_BASE:
> +		s.index = idx;
> +		s.num = vq->last_avail_idx;
> +		r = copy_to_user(argp, &s, sizeof s);
> +		break;
> +	case VHOST_SET_VRING_DESC:
> +		r = copy_from_user(&a, argp, sizeof a);
> +		if (r < 0)
> +			break;
> +		if (a.padding) {
> +			r = -EOPNOTSUPP;
> +			break;
> +		}
> +		if ((u64)(long)a.user_addr != a.user_addr) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		vq->desc = (void __user *)(long)a.user_addr;
> +		break;
> +	case VHOST_SET_VRING_AVAIL:
> +		r = copy_from_user(&a, argp, sizeof a);
> +		if (r < 0)
> +			break;
> +		if (a.padding) {
> +			r = -EOPNOTSUPP;
> +			break;
> +		}
> +		if ((u64)(long)a.user_addr != a.user_addr) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		vq->avail = (void __user *)(long)a.user_addr;
> +		/* Forget the cached index value. */
> +		vq->avail_idx = vq->last_avail_idx;
> +		break;
> +	case VHOST_SET_VRING_USED:
> +		r = copy_from_user(&a, argp, sizeof a);
> +		if (r < 0)
> +			break;
> +		if (a.padding) {
> +			r = -EOPNOTSUPP;
> +			break;
> +		}
> +		if ((u64)(long)a.user_addr != a.user_addr) {
> +			r = -EFAULT;
> +			break;
> +		}
> +		vq->used = (void __user *)(long)a.user_addr;
> +		r = init_used(vq);
> +		if (r)
> +			break;
> +		break;
> +	case VHOST_SET_VRING_KICK:
> +		r = copy_from_user(&f, argp, sizeof f);
> +		if (r < 0)
> +			break;
> +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> +		if (IS_ERR(eventfp))
> +			return PTR_ERR(eventfp);
> +		if (eventfp != vq->kick) {
> +			pollstop = filep = vq->kick;
> +			pollstart = vq->kick = eventfp;
> +		} else
> +			filep = eventfp;
> +		break;
> +	case VHOST_SET_VRING_CALL:
> +		r = copy_from_user(&f, argp, sizeof f);
> +		if (r < 0)
> +			break;
> +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> +		if (IS_ERR(eventfp))
> +			return PTR_ERR(eventfp);
> +		if (eventfp != vq->call) {
> +			filep = vq->call;
> +			ctx = vq->call_ctx;
> +			vq->call = eventfp;
> +			vq->call_ctx = eventfp ?
> +				eventfd_ctx_fileget(eventfp) : NULL;
> +		} else
> +			filep = eventfp;
> +		break;
> +	case VHOST_SET_VRING_ERR:
> +		r = copy_from_user(&f, argp, sizeof f);
> +		if (r < 0)
> +			break;
> +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> +		if (IS_ERR(eventfp))
> +			return PTR_ERR(eventfp);
> +		if (eventfp != vq->error) {
> +			filep = vq->error;
> +			vq->error = eventfp;
> +			ctx = vq->error_ctx;
> +			vq->error_ctx = eventfp ?
> +				eventfd_ctx_fileget(eventfp) : NULL;
> +		} else
> +			filep = eventfp;
> +		break;

I'm not sure how these eventfd's save a trip to userspace.

AFAICT, eventfd's cannot be used to signal another part of the kernel,
they can only be used to wake up userspace.

In my system, when an IRQ for kick() comes in, I have an eventfd which
gets signalled to notify userspace. When I want to send a call(), I have
to use a special ioctl(), just like lguest does.

Doesn't this mean that for call(), vhost is just going to signal an
eventfd to wake up userspace, which is then going to call ioctl(), and
then we're back in kernelspace. Seems like a wasted userspace
round-trip.

Or am I mis-reading this code?

PS - you can see my current code at:
http://www.mmarray.org/~iws/virtio-phys/

Thanks,
Ira

> +	default:
> +		r = -ENOIOCTLCMD;
> +	}
> +
> +	if (pollstop && vq->handle_kick)
> +		vhost_poll_stop(&vq->poll);
> +
> +	if (ctx)
> +		eventfd_ctx_put(ctx);
> +	if (filep)
> +		fput(filep);
> +
> +	if (pollstart && vq->handle_kick)
> +		vhost_poll_start(&vq->poll, vq->kick);
> +
> +	mutex_unlock(&vq->mutex);
> +
> +	if (pollstop && vq->handle_kick)
> +		vhost_poll_flush(&vq->poll);
> +	return 0;
> +}
> +
> +long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
> +{
> +	void __user *argp = (void __user *)arg;
> +	long r;
> +
> +	mutex_lock(&d->mutex);
> +	/* If you are not the owner, you can become one */
> +	if (ioctl == VHOST_SET_OWNER) {
> +		r = vhost_dev_set_owner(d);
> +		goto done;
> +	}
> +
> +	/* You must be the owner to do anything else */
> +	r = vhost_dev_check_owner(d);
> +	if (r)
> +		goto done;
> +
> +	switch (ioctl) {
> +	case VHOST_SET_MEM_TABLE:
> +		r = vhost_set_memory(d, argp);
> +		break;
> +	default:
> +		r = vhost_set_vring(d, ioctl, argp);
> +		break;
> +	}
> +done:
> +	mutex_unlock(&d->mutex);
> +	return r;
> +}
> +
> +static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
> +						     __u64 addr, __u32 len)
> +{
> +	struct vhost_memory_region *reg;
> +	int i;
> +	/* linear search is not brilliant, but we really have on the order of 6
> +	 * regions in practice */
> +	for (i = 0; i < mem->nregions; ++i) {
> +		reg = mem->regions + i;
> +		if (reg->guest_phys_addr <= addr &&
> +		    reg->guest_phys_addr + reg->memory_size - 1 >= addr)
> +			return reg;
> +	}
> +	return NULL;
> +}
> +
> +int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
> +		   struct iovec iov[], int iov_size)
> +{
> +	const struct vhost_memory_region *reg;
> +	struct vhost_memory *mem;
> +	struct iovec *_iov;
> +	u64 s = 0;
> +	int ret = 0;
> +
> +	rcu_read_lock();
> +
> +	mem = rcu_dereference(dev->memory);
> +	while ((u64)len > s) {
> +		u64 size;
> +		if (ret >= iov_size) {
> +			ret = -ENOBUFS;
> +			break;
> +		}
> +		reg = find_region(mem, addr, len);
> +		if (!reg) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +		_iov = iov + ret;
> +		size = reg->memory_size - addr + reg->guest_phys_addr;
> +		_iov->iov_len = min((u64)len, size);
> +		_iov->iov_base = (void *)
> +			(reg->userspace_addr + addr - reg->guest_phys_addr);
> +		s += size;
> +		addr += size;
> +		++ret;
> +	}
> +
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +/* Each buffer in the virtqueues is actually a chain of descriptors.  This
> + * function returns the next descriptor in the chain, or vq->vring.num if we're
> + * at the end. */
> +static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
> +{
> +	unsigned int next;
> +
> +	/* If this descriptor says it doesn't chain, we're done. */
> +	if (!(desc->flags & VRING_DESC_F_NEXT))
> +		return vq->num;
> +
> +	/* Check they're not leading us off end of descriptors. */
> +	next = desc->next;
> +	/* Make sure compiler knows to grab that: we don't want it changing! */
> +	/* We will use the result as an index in an array, so most
> +	 * architectures only need a compiler barrier here. */
> +	read_barrier_depends();
> +
> +	if (next >= vq->num) {
> +		vq_err(vq, "Desc next is %u > %u", next, vq->num);
> +		return vq->num;
> +	}
> +
> +	return next;
> +}
> +
> +/* This looks in the virtqueue and for the first available buffer, and converts
> + * it to an iovec for convenient access.  Since descriptors consist of some
> + * number of output then some number of input descriptors, it's actually two
> + * iovecs, but we pack them into one and note how many of each there were.
> + *
> + * This function returns the descriptor number found, or vq->num (which
> + * is never a valid descriptor number) if none was found. */
> +unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
> +			   struct iovec iov[],
> +			   unsigned int *out_num, unsigned int *in_num)
> +{
> +	struct vring_desc desc;
> +	unsigned int i, head;
> +	u16 last_avail_idx;
> +	int ret;
> +
> +	/* Check it isn't doing very strange things with descriptor numbers. */
> +	last_avail_idx = vq->last_avail_idx;
> +	if (get_user(vq->avail_idx, &vq->avail->idx)) {
> +		vq_err(vq, "Failed to access avail idx at %p\n",
> +		       &vq->avail->idx);
> +		return vq->num;
> +	}
> +
> +	if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
> +		vq_err(vq, "Guest moved used index from %u to %u",
> +		       last_avail_idx, vq->avail_idx);
> +		return vq->num;
> +	}
> +
> +	/* If there's nothing new since last we looked, return invalid. */
> +	if (vq->avail_idx == last_avail_idx)
> +		return vq->num;
> +
> +	/* Grab the next descriptor number they're advertising, and increment
> +	 * the index we've seen. */
> +	if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
> +		vq_err(vq, "Failed to read head: idx %d address %p\n",
> +		       last_avail_idx,
> +		       &vq->avail->ring[last_avail_idx % vq->num]);
> +		return vq->num;
> +	}
> +
> +	/* If their number is silly, that's an error. */
> +	if (head >= vq->num) {
> +		vq_err(vq, "Guest says index %u > %u is available",
> +		       head, vq->num);
> +		return vq->num;
> +	}
> +
> +	vq->last_avail_idx++;
> +
> +	/* When we start there are none of either input nor output. */
> +	*out_num = *in_num = 0;
> +
> +	i = head;
> +	do {
> +		unsigned iov_count = *in_num + *out_num;
> +		if (copy_from_user(&desc, vq->desc + i, sizeof desc)) {
> +			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
> +			       i, vq->desc + i);
> +			return vq->num;
> +		}
> +		ret = translate_desc(dev, desc.addr, desc.len, iov + iov_count,
> +				     VHOST_NET_MAX_SG - iov_count);
> +		if (ret < 0) {
> +			vq_err(vq, "Translation failure %d descriptor idx %d\n",
> +			       ret, i);
> +			return vq->num;
> +		}
> +		/* If this is an input descriptor, increment that count. */
> +		if (desc.flags & VRING_DESC_F_WRITE)
> +			*in_num += ret;
> +		else {
> +			/* If it's an output descriptor, they're all supposed
> +			 * to come before any input descriptors. */
> +			if (*in_num) {
> +				vq_err(vq, "Descriptor has out after in: "
> +				       "idx %d\n", i);
> +				return vq->num;
> +			}
> +			*out_num += ret;
> +		}
> +	} while ((i = next_desc(vq, &desc)) != vq->num);
> +	return head;
> +}
> +
> +/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
> +void vhost_discard_vq_desc(struct vhost_virtqueue *vq)
> +{
> +	vq->last_avail_idx--;
> +}
> +
> +/* After we've used one of their buffers, we tell them about it.  We'll then
> + * want to send them an interrupt, using vq->call. */
> +int vhost_add_used(struct vhost_virtqueue *vq,
> +			  unsigned int head, int len)
> +{
> +	struct vring_used_elem *used;
> +
> +	/* The virtqueue contains a ring of used buffers.  Get a pointer to the
> +	 * next entry in that used ring. */
> +	used = &vq->used->ring[vq->last_used_idx % vq->num];
> +	if (put_user(head, &used->id)) {
> +		vq_err(vq, "Failed to write used id");
> +		return -EFAULT;
> +	}
> +	if (put_user(len, &used->len)) {
> +		vq_err(vq, "Failed to write used len");
> +		return -EFAULT;
> +	}
> +	/* Make sure buffer is written before we update index. */
> +	wmb();
> +	if (put_user(vq->last_used_idx + 1, &vq->used->idx)) {
> +		vq_err(vq, "Failed to increment used idx");
> +		return -EFAULT;
> +	}
> +	vq->last_used_idx++;
> +	return 0;
> +}
> +
> +/* This actually sends the interrupt for this virtqueue */
> +void vhost_trigger_irq(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> +{
> +	__u16 flags = 0;
> +	if (get_user(flags, &vq->avail->flags)) {
> +		vq_err(vq, "Failed to get flags");
> +		return;
> +	}
> +
> +	/* If they don't want an interrupt, don't send one, unless empty. */
> +	if ((flags & VRING_AVAIL_F_NO_INTERRUPT) &&
> +	    (!vhost_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY) ||
> +	     vq->avail_idx != vq->last_avail_idx))
> +		return;
> +
> +	/* Send the Guest an interrupt tell them we used something up. */
> +	if (vq->call_ctx)
> +		eventfd_signal(vq->call_ctx, 1);
> +}
> +
> +/* And here's the combo meal deal.  Supersize me! */
> +void vhost_add_used_and_trigger(struct vhost_dev *dev,
> +				struct vhost_virtqueue *vq,
> +				unsigned int head, int len)
> +{
> +	vhost_add_used(vq, head, len);
> +	vhost_trigger_irq(dev, vq);
> +}
> +
> +/* OK, now we need to know about added descriptors. */
> +bool vhost_notify(struct vhost_virtqueue *vq)
> +{
> +	int r;
> +	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
> +		return false;
> +	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> +	r = put_user(vq->used_flags, &vq->used->flags);
> +	if (r)
> +		vq_err(vq, "Failed to disable notification: %d\n", r);
> +	/* They could have slipped one in as we were doing that: make
> +	 * sure it's written, tell caller it needs to check again. */
> +	mb();
> +	return true;
> +}
> +
> +/* We don't need to be notified again. */
> +void vhost_no_notify(struct vhost_virtqueue *vq)
> +{
> +	int r;
> +	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
> +		return;
> +	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> +	r = put_user(vq->used_flags, &vq->used->flags);
> +	if (r)
> +		vq_err(vq, "Failed to enable notification: %d\n", r);
> +}
> +
> +int vhost_init(void)
> +{
> +	vhost_workqueue = create_workqueue("vhost");
> +	if (!vhost_workqueue)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
> +void vhost_cleanup(void)
> +{
> +	destroy_workqueue(vhost_workqueue);
> +}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> new file mode 100644
> index 0000000..8e13d06
> --- /dev/null
> +++ b/drivers/vhost/vhost.h
> @@ -0,0 +1,122 @@
> +#ifndef _VHOST_H
> +#define _VHOST_H
> +
> +#include <linux/eventfd.h>
> +#include <linux/vhost.h>
> +#include <linux/mm.h>
> +#include <linux/mutex.h>
> +#include <linux/workqueue.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/skbuff.h>
> +#include <linux/uio.h>
> +#include <linux/virtio_config.h>
> +
> +struct vhost_device;
> +
> +enum {
> +	VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
> +};
> +
> +/* Poll a file (eventfd or socket) */
> +/* Note: there's nothing vhost specific about this structure. */
> +struct vhost_poll {
> +	poll_table                table;
> +	wait_queue_head_t        *wqh;
> +	wait_queue_t              wait;
> +	/* struct which will handle all actual work. */
> +	struct work_struct        work;
> +	unsigned long		  mask;
> +};
> +
> +void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
> +		     unsigned long mask);
> +void vhost_poll_start(struct vhost_poll *poll, struct file *file);
> +void vhost_poll_stop(struct vhost_poll *poll);
> +void vhost_poll_flush(struct vhost_poll *poll);
> +
> +/* The virtqueue structure describes a queue attached to a device. */
> +struct vhost_virtqueue {
> +	struct vhost_dev *dev;
> +
> +	/* The actual ring of buffers. */
> +	struct mutex mutex;
> +	unsigned int num;
> +	struct vring_desc __user *desc;
> +	struct vring_avail __user *avail;
> +	struct vring_used __user *used;
> +	struct file *kick;
> +	struct file *call;
> +	struct file *error;
> +	struct eventfd_ctx *call_ctx;
> +	struct eventfd_ctx *error_ctx;
> +
> +	struct vhost_poll poll;
> +
> +	/* The routine to call when the Guest pings us, or timeout. */
> +	work_func_t handle_kick;
> +
> +	/* Last available index we saw. */
> +	u16 last_avail_idx;
> +
> +	/* Caches available index value from user. */
> +	u16 avail_idx;
> +
> +	/* Last index we used. */
> +	u16 last_used_idx;
> +
> +	/* Used flags */
> +	u16 used_flags;
> +
> +	struct iovec iov[VHOST_NET_MAX_SG];
> +	struct iovec hdr[VHOST_NET_MAX_SG];
> +};
> +
> +struct vhost_dev {
> +	/* Readers use RCU to access memory table pointer.
> +	 * Writers use mutex below.*/
> +	struct vhost_memory *memory;
> +	struct mm_struct *mm;
> +	struct vhost_virtqueue *vqs;
> +	int nvqs;
> +	struct mutex mutex;
> +	unsigned acked_features;
> +};
> +
> +long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
> +long vhost_dev_check_owner(struct vhost_dev *);
> +long vhost_dev_reset_owner(struct vhost_dev *);
> +void vhost_dev_cleanup(struct vhost_dev *);
> +long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, unsigned long arg);
> +
> +unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
> +			   struct iovec iov[],
> +			   unsigned int *out_num, unsigned int *in_num);
> +void vhost_discard_vq_desc(struct vhost_virtqueue *);
> +
> +int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
> +void vhost_trigger_irq(struct vhost_dev *, struct vhost_virtqueue *);
> +void vhost_add_used_and_trigger(struct vhost_dev *, struct vhost_virtqueue *,
> +				unsigned int head, int len);
> +void vhost_no_notify(struct vhost_virtqueue *);
> +bool vhost_notify(struct vhost_virtqueue *);
> +
> +int vhost_init(void);
> +void vhost_cleanup(void);
> +
> +#define vq_err(vq, fmt, ...) do {                                  \
> +		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> +		if ((vq)->error_ctx)                               \
> +				eventfd_signal((vq)->error_ctx, 1);\
> +	} while (0)
> +
> +enum {
> +	VHOST_FEATURES = 1 << VIRTIO_F_NOTIFY_ON_EMPTY,
> +};
> +
> +static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
> +{
> +	return dev->acked_features & (1 << bit);
> +}
> +
> +#endif
> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
> index dec2f18..975df9a 100644
> --- a/include/linux/Kbuild
> +++ b/include/linux/Kbuild
> @@ -360,6 +360,7 @@ unifdef-y += uio.h
>  unifdef-y += unistd.h
>  unifdef-y += usbdevice_fs.h
>  unifdef-y += utsname.h
> +unifdef-y += vhost.h
>  unifdef-y += videodev2.h
>  unifdef-y += videodev.h
>  unifdef-y += virtio_config.h
> diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
> index 0521177..781a8bb 100644
> --- a/include/linux/miscdevice.h
> +++ b/include/linux/miscdevice.h
> @@ -30,6 +30,7 @@
>  #define HPET_MINOR		228
>  #define FUSE_MINOR		229
>  #define KVM_MINOR		232
> +#define VHOST_NET_MINOR		233
>  #define MISC_DYNAMIC_MINOR	255
>  
>  struct device;
> diff --git a/include/linux/vhost.h b/include/linux/vhost.h
> new file mode 100644
> index 0000000..3f441a9
> --- /dev/null
> +++ b/include/linux/vhost.h
> @@ -0,0 +1,101 @@
> +#ifndef _LINUX_VHOST_H
> +#define _LINUX_VHOST_H
> +/* Userspace interface for in-kernel virtio accelerators. */
> +
> +/* vhost is used to reduce the number of system calls involved in virtio.
> + *
> + * Existing virtio net code is used in the guest without modification.
> + *
> + * This header includes interface used by userspace hypervisor for
> + * device configuration.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/compiler.h>
> +#include <linux/ioctl.h>
> +#include <linux/virtio_config.h>
> +#include <linux/virtio_ring.h>
> +
> +struct vhost_vring_state {
> +	unsigned int index;
> +	unsigned int num;
> +};
> +
> +struct vhost_vring_file {
> +	unsigned int index;
> +	int fd;
> +};
> +
> +struct vhost_vring_addr {
> +	unsigned int index;
> +	unsigned int padding;
> +	__u64 user_addr;
> +};
> +
> +struct vhost_memory_region {
> +	__u64 guest_phys_addr;
> +	__u64 memory_size; /* bytes */
> +	__u64 userspace_addr;
> +	__u64 padding; /* read/write protection? */
> +};
> +
> +struct vhost_memory {
> +	__u32 nregions;
> +	__u32 padding;
> +	struct vhost_memory_region regions[0];
> +};
> +
> +/* ioctls */
> +
> +#define VHOST_VIRTIO 0xAF
> +
> +/* Features bitmask for forward compatibility.  Transport bits are used for
> + * vhost specific features. */
> +#define VHOST_GET_FEATURES	_IOR(VHOST_VIRTIO, 0x00, __u64)
> +#define VHOST_ACK_FEATURES	_IOW(VHOST_VIRTIO, 0x00, __u64)
> +
> +/* Set current process as the (exclusive) owner of this file descriptor.  This
> + * must be called before any other vhost command.  Further calls to
> + * VHOST_OWNER_SET fail until VHOST_OWNER_RESET is called. */
> +#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
> +/* Give up ownership, and reset the device to default values.
> + * Allows subsequent call to VHOST_OWNER_SET to succeed. */
> +#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
> +
> +/* Set up/modify memory layout */
> +#define VHOST_SET_MEM_TABLE	_IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
> +
> +/* Ring setup. These parameters can not be modified while ring is running
> + * (bound to a device). */
> +/* Set number of descriptors in ring */
> +#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
> +/* Start of array of descriptors (virtually contiguous) */
> +#define VHOST_SET_VRING_DESC _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
> +/* Used structure address */
> +#define VHOST_SET_VRING_USED _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_addr)
> +/* Available structure address */
> +#define VHOST_SET_VRING_AVAIL _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_addr)
> +/* Base value where queue looks for available descriptors */
> +#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> +/* Get accessor: reads index, writes value in num */
> +#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> +
> +/* The following ioctls use eventfd file descriptors to signal and poll
> + * for events. */
> +
> +/* Set eventfd to poll for added buffers */
> +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
> +/* Set eventfd to signal when buffers have been used */
> +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
> +/* Set eventfd to signal an error */
> +#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
> +
> +/* VHOST_NET specific defines */
> +
> +/* Attach virtio net device to a raw socket. The socket must be already
> + * bound to an ethernet device, this device will be used for transmit.
> + * Pass -1 to unbind from the socket and the transmit device.
> + * This can be used to stop the device (e.g. for migration). */
> +#define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, int)
> +
> +#endif
> -- 
> 1.6.2.5
Gregory Haskins Sept. 25, 2009, 9:32 p.m. UTC | #65
Avi Kivity wrote:
> On 09/24/2009 09:03 PM, Gregory Haskins wrote:
>>
>>> I don't really see how vhost and vbus are different here.  vhost expects
>>> signalling to happen through a couple of eventfds and requires someone
>>> to supply them and implement kernel support (if needed).  vbus requires
>>> someone to write a connector to provide the signalling implementation.
>>> Neither will work out-of-the-box when implementing virtio-net over
>>> falling dominos, for example.
>>>      
>> I realize in retrospect that my choice of words above implies vbus _is_
>> complete, but this is not what I was saying.  What I was trying to
>> convey is that vbus is _more_ complete.  Yes, in either case some kind
>> of glue needs to be written.  The difference is that vbus implements
>> more of the glue generally, and leaves less required to be customized
>> for each iteration.
>>    
> 
> 
> No argument there.  Since you care about non-virt scenarios and virtio
> doesn't, naturally vbus is a better fit for them as the code stands.

Thanks for finally starting to acknowledge there's a benefit, at least.

To be more precise, IMO virtio is designed to be a performance-oriented
ring-based driver interface that supports all types of hypervisors (e.g.
shmem-based kvm, and non-shmem-based Xen).  vbus is designed to be a
high-performance generic shared-memory interconnect (for rings or
otherwise) framework for environments where linux is the underpinning
"host" (physical or virtual).  They are distinctly different, but
complementary (the former addresses part of the front-end, and the
latter addresses the back-end, and a different part of the front-end).

In addition, the kvm-connector used in AlacrityVM's design strives to
add value and improve performance via other mechanisms, such as dynamic
 allocation, interrupt coalescing (thus reducing exit-ratio, which is a
serious issue in KVM) and prioritizable/nestable signals.

Today there is a large performance disparity between what a KVM guest
sees and what a native linux application sees on that same host.  Just
take a look at some of my graphs between "virtio" and "native", for
example:

http://developer.novell.com/wiki/images/b/b7/31-rc4_throughput.png

A dominant vbus design principle is to try to achieve the same IO
performance for all "linux applications" whether they be literally
userspace applications, or things like KVM vcpus or Ira's physical
boards.  It also aims to solve problems not previously expressible with
current technologies (even virtio), like nested real-time.

And even though you repeatedly insist otherwise, the neat thing here is
that the two technologies mesh (at least under certain circumstances,
like when virtio is deployed on a shared-memory friendly linux backend
like KVM).  I hope that my stack diagram below depicts that clearly.


> But that's not a strong argument for vbus; instead of adding vbus you
> could make virtio more friendly to non-virt

Actually, it _is_ a strong argument then because adding vbus is what
helps make virtio friendly to non-virt, at least when performance
matters.

> (there's a limit how far you
> can take this, not imposed by the code, but by virtio's charter as a
> virtual device driver framework).
> 
>> Going back to our stack diagrams, you could think of a vhost solution
>> like this:
>>
>> --------------------------
>> | virtio-net
>> --------------------------
>> | virtio-ring
>> --------------------------
>> | virtio-bus
>> --------------------------
>> | ? undefined-1 ?
>> --------------------------
>> | vhost
>> --------------------------
>>
>> and you could think of a vbus solution like this
>>
>> --------------------------
>> | virtio-net
>> --------------------------
>> | virtio-ring
>> --------------------------
>> | virtio-bus
>> --------------------------
>> | bus-interface
>> --------------------------
>> | ? undefined-2 ?
>> --------------------------
>> | bus-model
>> --------------------------
>> | virtio-net-device (vhost ported to vbus model? :)
>> --------------------------
>>
>>
>> So the difference between vhost and vbus in this particular context is
>> that you need to have "undefined-1" do device discovery/hotswap,
>> config-space, address-decode/isolation, signal-path routing, memory-path
>> routing, etc.  Today this function is filled by things like virtio-pci,
>> pci-bus, KVM/ioeventfd, and QEMU for x86.  I am not as familiar with
>> lguest, but presumably it is filled there by components like
>> virtio-lguest, lguest-bus, lguest.ko, and lguest-launcher.  And to use
>> more contemporary examples, we might have virtio-domino, domino-bus,
>> domino.ko, and domino-launcher as well as virtio-ira, ira-bus, ira.ko,
>> and ira-launcher.
>>
>> Contrast this to the vbus stack:  The bus-X components (when optionally
>> employed by the connector designer) do device-discovery, hotswap,
>> config-space, address-decode/isolation, signal-path and memory-path
>> routing, etc in a general (and pv-centric) way. The "undefined-2"
>> portion is the "connector", and just needs to convey messages like
>> "DEVCALL" and "SHMSIGNAL".  The rest is handled in other parts of the
>> stack.
>>
>>    
> 
> Right.  virtio assumes that it's in a virt scenario and that the guest
> architecture already has enumeration and hotplug mechanisms which it
> would prefer to use.  That happens to be the case for kvm/x86.

No, virtio doesn't assume that.  Its stack provides the "virtio-bus"
abstraction and what it does assume is that it will be wired up to
something underneath. Kvm/x86 conveniently has pci, so the virtio-pci
adapter was created to reuse much of that facility.  For other things
like lguest and s390, something new had to be created underneath to make
up for the lack of pci-like support.

vbus, in conjunction with the kvm-connector, tries to unify that process
a little more by creating a PV-optimized bus.  The idea is that it can
be reused in that situation instead of creating a new hypervisor
specific bus each time.  It's also designed for high-performance, so you
get that important trait for free simply by tying into it.

> 
>> So to answer your question, the difference is that the part that has to
>> be customized in vbus should be a fraction of what needs to be
>> customized with vhost because it defines more of the stack.
> 
> But if you want to use the native mechanisms, vbus doesn't have any
> added value.

First of all, that's incorrect.  If you want to use the "native"
mechanisms (via the way the vbus-connector is implemented, for instance)
you at least still have the benefit that the backend design is more
broadly reusable in more environments (like non-virt, for instance),
because vbus does a proper job of defining the requisite
layers/abstractions compared to vhost.  So it adds value even in that
situation.

Second of all, with PV there is no such thing as "native".  It's
software so it can be whatever we want.  Sure, you could argue that the
guest may have built-in support for something like PCI protocol.
However, PCI protocol itself isn't suitable for high-performance PV out
of the box.  So you will invariably require new software
layers on top anyway, even if part of the support is already included.

And lastly, why would you _need_ to use the so-called "native"
mechanism?  The short answer is, "you don't".  Any given system (guest
or bare-metal) already has a wide range of buses (try running "tree
/sys/bus" in Linux).  More importantly, the concept of adding new buses
is widely supported in both the Windows and Linux driver model (and
probably any other guest-type that matters).  Therefore, despite claims
to the contrary, it's not hard or even unusual to add a new bus to the mix.

In summary, vbus is simply one more bus of many, purpose built to
support high-end IO in a virt-like model, giving controlled access to
the linux-host underneath it.  You can write a high-performance layer
below the OS bus-model (vbus) or above it (virtio-pci), but either way
you are modifying the stack to add these capabilities, so we might as
well try to get this right.

With all due respect, you are making a big deal out of a minor issue.

> 
>> And, as
>> eluded to in my diagram, both virtio-net and vhost (with some
>> modifications to fit into the vbus framework) are potentially
>> complementary, not competitors.
>>    
> 
> Only theoretically.  The existing installed base would have to be thrown
> away

"Thrown away" is pure hyperbole.  The installed base, worse case, needs
to load a new driver for a missing device.  This is pretty much how
every machine works today, anyway.  And if loading a driver was actually
some insurmountable hurdle, as is sometimes implied (but it's not in
reality), you can alternatively make vbus look like a legacy bus if you
are willing to sacrifice some of the features, like exit-ratio reduction and
priority.

FWIW: AlacrityVM isn't willing to sacrifice those features, so we will
provide a Linux and Windows driver for explicit bus support, as well as
open-specs and community development assistance to any other guest that
wants to add support in the future.

> or we'd need to support both.
> 
>

No matter what model we talk about, there's always going to be a "both"
since the userspace virtio models are probably not going to go away (nor
should they).

> 
> 
>>> Without a vbus-connector-falling-dominos, vbus-venet can't do anything
>>> either.
>>>      
>> Mostly covered above...
>>
>> However, I was addressing your assertion that vhost somehow magically
>> accomplishes this "container/addressing" function without any specific
>> kernel support.  This is incorrect.  I contend that this kernel support
>> is required and present.  The difference is that its defined elsewhere
>> (and typically in a transport/arch specific way).
>>
>> IOW: You can basically think of the programmed PIO addresses as forming
>> its "container".  Only addresses explicitly added are visible, and
>> everything else is inaccessible.  This whole discussion is merely a
>> question of what's been generalized verses what needs to be
>> re-implemented each time.
>>    
> 
> Sorry, this is too abstract for me.

With all due respect, understanding my point above is required to have
any kind of meaningful discussion here.

> 
> 
> 
>>> vbus doesn't do kvm guest address decoding for the fast path.  It's
>>> still done by ioeventfd.
>>>      
>> That is not correct.  vbus does its own native address decoding in the
>> fast path, such as here:
>>
>> http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/client.c;h=e85b2d92d629734866496b67455dd307486e394a;hb=e6cbd4d1decca8e829db3b2b9b6ec65330b379e9#l331
>>
>>
>>    
> 
> All this is after kvm has decoded that vbus is addresses.  It can't work
> without someone outside vbus deciding that.

How the connector message is delivered is really not relevant.  Some
architectures will simply deliver the message point-to-point (like the
original hypercall design for KVM, or something like Ira's rig), and
some will need additional demuxing (like pci-bridge/pio based KVM).
It's an implementation detail of the connector.

However, the real point here is that something needs to establish a
scoped namespace mechanism, add items to that namespace, and advertise
the presence of the items to the guest.  vbus has this facility built
into its stack.  vhost doesn't, so it must come from elsewhere.


> 
>> In fact, it's actually a simpler design to unify things this way because
>> you avoid splitting the device model up. Consider how painful the vhost
>> implementation would be if it didn't already have the userspace
>> virtio-net to fall-back on.  This is effectively what we face for new
>> devices going forward if that model is to persist.
>>    
> 
> 
> It doesn't have just virtio-net, it has userspace-based hostplug

vbus has hotplug too: mkdir and rmdir

As an added bonus, its device-model is modular.  A developer can write a
new device model, compile it, insmod it to the host kernel, hotplug it
to the running guest with mkdir/ln, and the come back out again
(hotunplug with rmdir, rmmod, etc).  They may do this all without taking
the guest down, and while eating QEMU based IO solutions for breakfast
performance wise.

Afaict, qemu can't do either of those things.

> and a bunch of other devices impemented in userspace.

That's fine.  I am primarily interested in the high-performance
components, so most of those other items can stay there in userspace if
that is their ideal location.

>  Currently qemu has
> virtio bindings for pci and syborg (whatever that is), and device models
> for baloon, block, net, and console, so it seems implementing device
> support in userspace is not as disasterous as you make it to be.

I intentionally qualified "device" with "new" in my statement.  And in
that context I was talking about ultimately developing/supporting
in-kernel models, not pure legacy userspace ones.  I have no doubt the
implementation of the original userspace devices was not a difficult or
horrific endeavor.

Requiring new models to be implemented (at least) twice is a poor design
IMO, however.  Requiring them to split such a minor portion of their
functionality (like read-only attributes) is a poor design, too.  I have
already demonstrated there are other ways to achieve the same
high-performance goals without requiring two models developed/tested
each time and for each manager.  For the times I went and tried to
satisfy your request in this manner, developing the code and managing
the resources in two places, for lack of a better description, made me
want to retch. So I gave up, resolved that my original design was
better, and hoped that I could convince you and the community of the same.

> 
>>> Invariably?
>>>      
>> As in "always"
>>    
> 
> Refactor instead of duplicating.

There is no duplicating.  vbus has no equivalent today as virtio doesn't
define these layers.

> 
>>   
>>>   Use libraries (virtio-shmem.ko, libvhost.so).
>>>      
>> What do you suppose vbus is?  vbus-proxy.ko = virtio-shmem.ko, and you
>> dont need libvhost.so per se since you can just use standard kernel
>> interfaces (like configfs/sysfs).  I could create an .so going forward
>> for the new ioctl-based interface, I suppose.
>>    
> 
> Refactor instead of rewriting.

There is no rewriting.  vbus has no equivalent today as virtio doesn't
define these layers.

By your own admission, you said if you wanted that capability, use a
library.  What I think you are not understanding is that vbus _is_ that
library.  So what is the problem, exactly?

> 
> 
> 
>>> For kvm/x86 pci definitely remains king.
>>>      
>> For full virtualization, sure.  I agree.  However, we are talking about
>> PV here.  For PV, PCI is not a requirement and is a technical dead-end
>> IMO.
>>
>> KVM seems to be the only virt solution that thinks otherwise (*), but I
>> believe that is primarily a condition of its maturity.  I aim to help
>> advance things here.
>>
>> (*) citation: xen has xenbus, lguest has lguest-bus, vmware has some
>> vmi-esq thing (I forget what its called) to name a few.  Love 'em or
>> hate 'em, most other hypervisors do something along these lines.  I'd
>> like to try to create one for KVM, but to unify them all (at least for
>> the Linux-based host designs).
>>    
> 
> VMware are throwing VMI away (won't be supported in their new product,
> and they've sent a patch to rip it off from Linux);

vmware only cares about x86 iiuc, so probably not a good example.

> Xen has to tunnel
> xenbus in pci for full virtualization (which is where Windows is, and
> where Linux will be too once people realize it's faster).  lguest is
> meant as an example hypervisor, not an attempt to take over the world.

So pick any other hypervisor, and the situation is often similar.

> 
> "PCI is a dead end" could not be more wrong, it's what guests support.

It's what _some_ guests support.  Even for the guests that support it,
it's not well designed for PV.  Therefore, you have to do a bunch of
dancing and waste resources on top to squeeze every last drop of
performance out of your platform.  In addition, it has a bunch of
baggage that goes with it that is not necessary to do the job in a
software environment.  It is therefore burdensome to recreate if you
don't already have something to leverage, like QEMU, just for the sake
of creating the illusion that its there.

Sounds pretty dead to me, sorry.  We don't need it.

Alternatively, you can just try to set a stake in the ground for looking
forward and fixing those PV-specific problems hopefully once and for
all, like vbus and the kvm-connector try to do.  Sure, there will be
some degree of pain first as we roll out the subsystem and deploy
support, but that's true for lots of things.  It's simply a platform
investment.


> An right now you can have a guest using pci to access a mix of
> userspace-emulated devices, userspace-emulated-but-kernel-accelerated
> virtio devices, and real host devices.  All on one dead-end bus.  Try
> that with vbus.

vbus is not interested in userspace devices.  The charter is to provide
facilities for utilizing the host linux kernel's IO capabilities in the
most efficient, yet safe, manner possible.  Those devices that fit
outside that charter can ride on legacy mechanisms if that suits them best.

> 
> 
>>>> I digress.  My point here isn't PCI.  The point here is the missing
>>>> component for when PCI is not present.  The component that is partially
>>>> satisfied by vbus's devid addressing scheme.  If you are going to use
>>>> vhost, and you don't have PCI, you've gotta build something to replace
>>>> it.
>>>>
>>>>        
>>> Yes, that's why people have keyboards.  They'll write that glue code if
>>> they need it.  If it turns out to be a hit an people start having virtio
>>> transport module writing parties, they'll figure out a way to share
>>> code.
>>>      
>> Sigh...  The party has already started.  I tried to invite you months
>> ago...
>>    
> 
> I've been voting virtio since 2007.

That doesn't have much to do with what's underneath it, since it doesn't
define these layers.  See my stack diagrams for details.

> 
>>> On the guest side, virtio-shmem.ko can unify the ring access.  It
>>> probably makes sense even today.  On the host side I eventfd is the
>>> kernel interface and libvhostconfig.so can provide the configuration
>>> when an existing ABI is not imposed.
>>>      
>> That won't cut it.  For one, creating an eventfd is only part of the
>> equation.  I.e. you need to have originate/terminate somewhere
>> interesting (and in-kernel, otherwise use tuntap).
>>    
> 
> vbus needs the same thing so it cancels out.

No, it does not.  vbus just needs a relatively simple single message
pipe between the guest and host (think "hypercall tunnel", if you will).
 Per queue/device addressing is handled by the same conceptual namespace
as the one that would trigger eventfds in the model you mention.  And
that namespace is built into the vbus stack, and objects are registered
automatically as they are created.

Contrast that to vhost, which requires some other kernel interface to
exist, and to be managed manually for each object that is created.  Your
libvhostconfig would need to somehow know how to perform this
registration operation, and there would have to be something in the
kernel to receive it, presumably on a per-platform basis.  Solving this
problem generally would probably end up looking eerily like vbus,
because that's what vbus does.

> 
>>> Look at the virtio-net feature negotiation.  There's a lot more there
>>> than the MAC address, and it's going to grow.
>>>      
>> Agreed, but note that makes my point.  That feature negotiation almost
>> invariably influences the device-model, not some config-space shim.
>> IOW: terminating config-space at some userspace shim is pointless.  The
>> model ultimately needs the result of whatever transpires during that
>> negotiation anyway.
>>    
> 
> Well, let's see.  Can vbus today:
> 
> - let userspace know which features are available (so it can decide if
> live migration is possible)

yes, it's in sysfs.

> - let userspace limit which features are exposed to the guest (so it can
> make live migration possible among hosts of different capabilities)

yes, it's in sysfs.

> - let userspace know which features were negotiated (so it can transfer
> them to the other host during live migration)

no, but we can easily add ->save()/->restore() to the model going
forward, and the negotiated features are just a subcomponent of its
serialized stream (a rough sketch of what such hooks might look like
follows this exchange).

> - let userspace tell the kernel which features were negotiated (when
> live migration completes, to avoid requiring the guest to re-negotiate)

that would be the function of the ->restore() deserializer.

> - do all that from an unprivileged process

yes, in the upcoming alacrityvm v0.3 with the ioctl-based control plane.

> - securely wrt other unprivileged processes

yes, same mechanism plus it has a fork-inheritance model.
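
For illustration only, here is a rough sketch of what such
->save()/->restore() hooks could look like.  Nothing below exists in
vbus or vhost as posted; the names (my_device_state,
my_device_migration_ops) are made up:

#include <linux/types.h>

struct my_device_state;  /* opaque, device-model specific */

struct my_device_migration_ops {
	/* Serialize device state (negotiated features, ring indices, ...)
	 * into a buffer supplied by the management process. */
	ssize_t (*save)(struct my_device_state *s, void *buf, size_t len);

	/* Restore previously saved state on the destination host, so the
	 * guest does not have to renegotiate after migration. */
	int (*restore)(struct my_device_state *s, const void *buf,
		       size_t len);
};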

Bottom line: vbus isn't done, especially w.r.t. live migration, but that
is not a valid argument against the idea if you believe in
release-early/release-often. kvm wasn't (isn't) done either when it was
proposed/merged.

Kind Regards,
-Greg
Michael S. Tsirkin Sept. 27, 2009, 7:43 a.m. UTC | #66
On Fri, Sep 25, 2009 at 10:01:58AM -0700, Ira W. Snyder wrote:
> > +	case VHOST_SET_VRING_KICK:
> > +		r = copy_from_user(&f, argp, sizeof f);
> > +		if (r < 0)
> > +			break;
> > +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> > +		if (IS_ERR(eventfp))
> > +			return PTR_ERR(eventfp);
> > +		if (eventfp != vq->kick) {
> > +			pollstop = filep = vq->kick;
> > +			pollstart = vq->kick = eventfp;
> > +		} else
> > +			filep = eventfp;
> > +		break;
> > +	case VHOST_SET_VRING_CALL:
> > +		r = copy_from_user(&f, argp, sizeof f);
> > +		if (r < 0)
> > +			break;
> > +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> > +		if (IS_ERR(eventfp))
> > +			return PTR_ERR(eventfp);
> > +		if (eventfp != vq->call) {
> > +			filep = vq->call;
> > +			ctx = vq->call_ctx;
> > +			vq->call = eventfp;
> > +			vq->call_ctx = eventfp ?
> > +				eventfd_ctx_fileget(eventfp) : NULL;
> > +		} else
> > +			filep = eventfp;
> > +		break;
> > +	case VHOST_SET_VRING_ERR:
> > +		r = copy_from_user(&f, argp, sizeof f);
> > +		if (r < 0)
> > +			break;
> > +		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
> > +		if (IS_ERR(eventfp))
> > +			return PTR_ERR(eventfp);
> > +		if (eventfp != vq->error) {
> > +			filep = vq->error;
> > +			vq->error = eventfp;
> > +			ctx = vq->error_ctx;
> > +			vq->error_ctx = eventfp ?
> > +				eventfd_ctx_fileget(eventfp) : NULL;
> > +		} else
> > +			filep = eventfp;
> > +		break;
> 
> I'm not sure how these eventfd's save a trip to userspace.
> 
> AFAICT, eventfd's cannot be used to signal another part of the kernel,
> they can only be used to wake up userspace.

Yes, they can.  See irqfd code in virt/kvm/eventfd.c.

> In my system, when an IRQ for kick() comes in, I have an eventfd which
> gets signalled to notify userspace. When I want to send a call(), I have
> to use a special ioctl(), just like lguest does.
> 
> Doesn't this mean that for call(), vhost is just going to signal an
> eventfd to wake up userspace, which is then going to call ioctl(), and
> then we're back in kernelspace. Seems like a wasted userspace
> round-trip.
> 
> Or am I mis-reading this code?

Yes. The kernel can poll an eventfd and deliver an interrupt directly,
without involving userspace.
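
To make this concrete, here is a minimal sketch of the idea, modeled
loosely on the irqfd approach referenced above (not copied from it):
a kernel consumer attaches itself to the eventfd's wait queue, so a
signal on that eventfd is handled entirely in the kernel.  The names
my_irqfd, my_irqfd_*() and my_inject_guest_irq() are hypothetical:

#include <linux/eventfd.h>
#include <linux/err.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/poll.h>
#include <linux/wait.h>

struct my_irqfd {
	struct eventfd_ctx *ctx;  /* reference keeps the eventfd alive */
	wait_queue_t        wait; /* our entry on the eventfd's wait queue */
	poll_table          pt;   /* used only to discover that wait queue */
};

/* Hypothetical: however the backend raises an interrupt in the guest. */
static void my_inject_guest_irq(struct my_irqfd *irqfd);

/* Runs in the context of whoever signals the eventfd, kernel or user. */
static int my_irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync,
			   void *key)
{
	struct my_irqfd *irqfd = container_of(wait, struct my_irqfd, wait);

	if ((unsigned long)key & POLLIN)
		my_inject_guest_irq(irqfd);
	return 0;
}

/* Called back by ->poll() with the eventfd's wait queue head. */
static void my_irqfd_queue_proc(struct file *file, wait_queue_head_t *wqh,
				poll_table *pt)
{
	struct my_irqfd *irqfd = container_of(pt, struct my_irqfd, pt);

	add_wait_queue(wqh, &irqfd->wait);
}

/* Attach to the eventfd identified by fd.  Teardown (remove_wait_queue,
 * eventfd_ctx_put) is omitted for brevity. */
static int my_irqfd_assign(struct my_irqfd *irqfd, int fd)
{
	struct file *file = eventfd_fget(fd);

	if (IS_ERR(file))
		return PTR_ERR(file);

	irqfd->ctx = eventfd_ctx_fileget(file);
	init_waitqueue_func_entry(&irqfd->wait, my_irqfd_wakeup);
	init_poll_funcptr(&irqfd->pt, my_irqfd_queue_proc);
	file->f_op->poll(file, &irqfd->pt);

	fput(file);
	return 0;
}

With something like this on the kvm side, userspace only creates the
eventfd and hands the same fd to both VHOST_SET_VRING_CALL and kvm's
irqfd setup once, at configuration time.  After that, vhost's
eventfd_signal(vq->call_ctx, 1) is consumed in the kernel and turned
into a guest interrupt with no userspace round-trip.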

> PS - you can see my current code at:
> http://www.mmarray.org/~iws/virtio-phys/
> 
> Thanks,
> Ira
> 
> > +	default:
> > +		r = -ENOIOCTLCMD;
> > +	}
> > +
> > +	if (pollstop && vq->handle_kick)
> > +		vhost_poll_stop(&vq->poll);
> > +
> > +	if (ctx)
> > +		eventfd_ctx_put(ctx);
> > +	if (filep)
> > +		fput(filep);
> > +
> > +	if (pollstart && vq->handle_kick)
> > +		vhost_poll_start(&vq->poll, vq->kick);
> > +
> > +	mutex_unlock(&vq->mutex);
> > +
> > +	if (pollstop && vq->handle_kick)
> > +		vhost_poll_flush(&vq->poll);
> > +	return 0;
> > +}
> > +
> > +long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
> > +{
> > +	void __user *argp = (void __user *)arg;
> > +	long r;
> > +
> > +	mutex_lock(&d->mutex);
> > +	/* If you are not the owner, you can become one */
> > +	if (ioctl == VHOST_SET_OWNER) {
> > +		r = vhost_dev_set_owner(d);
> > +		goto done;
> > +	}
> > +
> > +	/* You must be the owner to do anything else */
> > +	r = vhost_dev_check_owner(d);
> > +	if (r)
> > +		goto done;
> > +
> > +	switch (ioctl) {
> > +	case VHOST_SET_MEM_TABLE:
> > +		r = vhost_set_memory(d, argp);
> > +		break;
> > +	default:
> > +		r = vhost_set_vring(d, ioctl, argp);
> > +		break;
> > +	}
> > +done:
> > +	mutex_unlock(&d->mutex);
> > +	return r;
> > +}
> > +
> > +static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
> > +						     __u64 addr, __u32 len)
> > +{
> > +	struct vhost_memory_region *reg;
> > +	int i;
> > +	/* linear search is not brilliant, but we really have on the order of 6
> > +	 * regions in practice */
> > +	for (i = 0; i < mem->nregions; ++i) {
> > +		reg = mem->regions + i;
> > +		if (reg->guest_phys_addr <= addr &&
> > +		    reg->guest_phys_addr + reg->memory_size - 1 >= addr)
> > +			return reg;
> > +	}
> > +	return NULL;
> > +}
> > +
> > +int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
> > +		   struct iovec iov[], int iov_size)
> > +{
> > +	const struct vhost_memory_region *reg;
> > +	struct vhost_memory *mem;
> > +	struct iovec *_iov;
> > +	u64 s = 0;
> > +	int ret = 0;
> > +
> > +	rcu_read_lock();
> > +
> > +	mem = rcu_dereference(dev->memory);
> > +	while ((u64)len > s) {
> > +		u64 size;
> > +		if (ret >= iov_size) {
> > +			ret = -ENOBUFS;
> > +			break;
> > +		}
> > +		reg = find_region(mem, addr, len);
> > +		if (!reg) {
> > +			ret = -EFAULT;
> > +			break;
> > +		}
> > +		_iov = iov + ret;
> > +		size = reg->memory_size - addr + reg->guest_phys_addr;
> > +		_iov->iov_len = min((u64)len, size);
> > +		_iov->iov_base = (void *)
> > +			(reg->userspace_addr + addr - reg->guest_phys_addr);
> > +		s += size;
> > +		addr += size;
> > +		++ret;
> > +	}
> > +
> > +	rcu_read_unlock();
> > +	return ret;
> > +}
> > +
> > +/* Each buffer in the virtqueues is actually a chain of descriptors.  This
> > + * function returns the next descriptor in the chain, or vq->vring.num if we're
> > + * at the end. */
> > +static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
> > +{
> > +	unsigned int next;
> > +
> > +	/* If this descriptor says it doesn't chain, we're done. */
> > +	if (!(desc->flags & VRING_DESC_F_NEXT))
> > +		return vq->num;
> > +
> > +	/* Check they're not leading us off end of descriptors. */
> > +	next = desc->next;
> > +	/* Make sure compiler knows to grab that: we don't want it changing! */
> > +	/* We will use the result as an index in an array, so most
> > +	 * architectures only need a compiler barrier here. */
> > +	read_barrier_depends();
> > +
> > +	if (next >= vq->num) {
> > +		vq_err(vq, "Desc next is %u > %u", next, vq->num);
> > +		return vq->num;
> > +	}
> > +
> > +	return next;
> > +}
> > +
> > +/* This looks in the virtqueue and for the first available buffer, and converts
> > + * it to an iovec for convenient access.  Since descriptors consist of some
> > + * number of output then some number of input descriptors, it's actually two
> > + * iovecs, but we pack them into one and note how many of each there were.
> > + *
> > + * This function returns the descriptor number found, or vq->num (which
> > + * is never a valid descriptor number) if none was found. */
> > +unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
> > +			   struct iovec iov[],
> > +			   unsigned int *out_num, unsigned int *in_num)
> > +{
> > +	struct vring_desc desc;
> > +	unsigned int i, head;
> > +	u16 last_avail_idx;
> > +	int ret;
> > +
> > +	/* Check it isn't doing very strange things with descriptor numbers. */
> > +	last_avail_idx = vq->last_avail_idx;
> > +	if (get_user(vq->avail_idx, &vq->avail->idx)) {
> > +		vq_err(vq, "Failed to access avail idx at %p\n",
> > +		       &vq->avail->idx);
> > +		return vq->num;
> > +	}
> > +
> > +	if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
> > +		vq_err(vq, "Guest moved used index from %u to %u",
> > +		       last_avail_idx, vq->avail_idx);
> > +		return vq->num;
> > +	}
> > +
> > +	/* If there's nothing new since last we looked, return invalid. */
> > +	if (vq->avail_idx == last_avail_idx)
> > +		return vq->num;
> > +
> > +	/* Grab the next descriptor number they're advertising, and increment
> > +	 * the index we've seen. */
> > +	if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
> > +		vq_err(vq, "Failed to read head: idx %d address %p\n",
> > +		       last_avail_idx,
> > +		       &vq->avail->ring[last_avail_idx % vq->num]);
> > +		return vq->num;
> > +	}
> > +
> > +	/* If their number is silly, that's an error. */
> > +	if (head >= vq->num) {
> > +		vq_err(vq, "Guest says index %u > %u is available",
> > +		       head, vq->num);
> > +		return vq->num;
> > +	}
> > +
> > +	vq->last_avail_idx++;
> > +
> > +	/* When we start there are none of either input nor output. */
> > +	*out_num = *in_num = 0;
> > +
> > +	i = head;
> > +	do {
> > +		unsigned iov_count = *in_num + *out_num;
> > +		if (copy_from_user(&desc, vq->desc + i, sizeof desc)) {
> > +			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
> > +			       i, vq->desc + i);
> > +			return vq->num;
> > +		}
> > +		ret = translate_desc(dev, desc.addr, desc.len, iov + iov_count,
> > +				     VHOST_NET_MAX_SG - iov_count);
> > +		if (ret < 0) {
> > +			vq_err(vq, "Translation failure %d descriptor idx %d\n",
> > +			       ret, i);
> > +			return vq->num;
> > +		}
> > +		/* If this is an input descriptor, increment that count. */
> > +		if (desc.flags & VRING_DESC_F_WRITE)
> > +			*in_num += ret;
> > +		else {
> > +			/* If it's an output descriptor, they're all supposed
> > +			 * to come before any input descriptors. */
> > +			if (*in_num) {
> > +				vq_err(vq, "Descriptor has out after in: "
> > +				       "idx %d\n", i);
> > +				return vq->num;
> > +			}
> > +			*out_num += ret;
> > +		}
> > +	} while ((i = next_desc(vq, &desc)) != vq->num);
> > +	return head;
> > +}
> > +
> > +/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
> > +void vhost_discard_vq_desc(struct vhost_virtqueue *vq)
> > +{
> > +	vq->last_avail_idx--;
> > +}
> > +
> > +/* After we've used one of their buffers, we tell them about it.  We'll then
> > + * want to send them an interrupt, using vq->call. */
> > +int vhost_add_used(struct vhost_virtqueue *vq,
> > +			  unsigned int head, int len)
> > +{
> > +	struct vring_used_elem *used;
> > +
> > +	/* The virtqueue contains a ring of used buffers.  Get a pointer to the
> > +	 * next entry in that used ring. */
> > +	used = &vq->used->ring[vq->last_used_idx % vq->num];
> > +	if (put_user(head, &used->id)) {
> > +		vq_err(vq, "Failed to write used id");
> > +		return -EFAULT;
> > +	}
> > +	if (put_user(len, &used->len)) {
> > +		vq_err(vq, "Failed to write used len");
> > +		return -EFAULT;
> > +	}
> > +	/* Make sure buffer is written before we update index. */
> > +	wmb();
> > +	if (put_user(vq->last_used_idx + 1, &vq->used->idx)) {
> > +		vq_err(vq, "Failed to increment used idx");
> > +		return -EFAULT;
> > +	}
> > +	vq->last_used_idx++;
> > +	return 0;
> > +}
> > +
> > +/* This actually sends the interrupt for this virtqueue */
> > +void vhost_trigger_irq(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> > +{
> > +	__u16 flags = 0;
> > +	if (get_user(flags, &vq->avail->flags)) {
> > +		vq_err(vq, "Failed to get flags");
> > +		return;
> > +	}
> > +
> > +	/* If they don't want an interrupt, don't send one, unless empty. */
> > +	if ((flags & VRING_AVAIL_F_NO_INTERRUPT) &&
> > +	    (!vhost_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY) ||
> > +	     vq->avail_idx != vq->last_avail_idx))
> > +		return;
> > +
> > +	/* Send the Guest an interrupt tell them we used something up. */
> > +	if (vq->call_ctx)
> > +		eventfd_signal(vq->call_ctx, 1);
> > +}
> > +
> > +/* And here's the combo meal deal.  Supersize me! */
> > +void vhost_add_used_and_trigger(struct vhost_dev *dev,
> > +				struct vhost_virtqueue *vq,
> > +				unsigned int head, int len)
> > +{
> > +	vhost_add_used(vq, head, len);
> > +	vhost_trigger_irq(dev, vq);
> > +}
> > +
> > +/* OK, now we need to know about added descriptors. */
> > +bool vhost_notify(struct vhost_virtqueue *vq)
> > +{
> > +	int r;
> > +	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
> > +		return false;
> > +	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> > +	r = put_user(vq->used_flags, &vq->used->flags);
> > +	if (r)
> > +		vq_err(vq, "Failed to disable notification: %d\n", r);
> > +	/* They could have slipped one in as we were doing that: make
> > +	 * sure it's written, tell caller it needs to check again. */
> > +	mb();
> > +	return true;
> > +}
> > +
> > +/* We don't need to be notified again. */
> > +void vhost_no_notify(struct vhost_virtqueue *vq)
> > +{
> > +	int r;
> > +	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
> > +		return;
> > +	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> > +	r = put_user(vq->used_flags, &vq->used->flags);
> > +	if (r)
> > +		vq_err(vq, "Failed to enable notification: %d\n", r);
> > +}
> > +
> > +int vhost_init(void)
> > +{
> > +	vhost_workqueue = create_workqueue("vhost");
> > +	if (!vhost_workqueue)
> > +		return -ENOMEM;
> > +	return 0;
> > +}
> > +
> > +void vhost_cleanup(void)
> > +{
> > +	destroy_workqueue(vhost_workqueue);
> > +}
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > new file mode 100644
> > index 0000000..8e13d06
> > --- /dev/null
> > +++ b/drivers/vhost/vhost.h
> > @@ -0,0 +1,122 @@
> > +#ifndef _VHOST_H
> > +#define _VHOST_H
> > +
> > +#include <linux/eventfd.h>
> > +#include <linux/vhost.h>
> > +#include <linux/mm.h>
> > +#include <linux/mutex.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/poll.h>
> > +#include <linux/file.h>
> > +#include <linux/skbuff.h>
> > +#include <linux/uio.h>
> > +#include <linux/virtio_config.h>
> > +
> > +struct vhost_device;
> > +
> > +enum {
> > +	VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
> > +};
> > +
> > +/* Poll a file (eventfd or socket) */
> > +/* Note: there's nothing vhost specific about this structure. */
> > +struct vhost_poll {
> > +	poll_table                table;
> > +	wait_queue_head_t        *wqh;
> > +	wait_queue_t              wait;
> > +	/* struct which will handle all actual work. */
> > +	struct work_struct        work;
> > +	unsigned long		  mask;
> > +};
> > +
> > +void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
> > +		     unsigned long mask);
> > +void vhost_poll_start(struct vhost_poll *poll, struct file *file);
> > +void vhost_poll_stop(struct vhost_poll *poll);
> > +void vhost_poll_flush(struct vhost_poll *poll);
> > +
> > +/* The virtqueue structure describes a queue attached to a device. */
> > +struct vhost_virtqueue {
> > +	struct vhost_dev *dev;
> > +
> > +	/* The actual ring of buffers. */
> > +	struct mutex mutex;
> > +	unsigned int num;
> > +	struct vring_desc __user *desc;
> > +	struct vring_avail __user *avail;
> > +	struct vring_used __user *used;
> > +	struct file *kick;
> > +	struct file *call;
> > +	struct file *error;
> > +	struct eventfd_ctx *call_ctx;
> > +	struct eventfd_ctx *error_ctx;
> > +
> > +	struct vhost_poll poll;
> > +
> > +	/* The routine to call when the Guest pings us, or timeout. */
> > +	work_func_t handle_kick;
> > +
> > +	/* Last available index we saw. */
> > +	u16 last_avail_idx;
> > +
> > +	/* Caches available index value from user. */
> > +	u16 avail_idx;
> > +
> > +	/* Last index we used. */
> > +	u16 last_used_idx;
> > +
> > +	/* Used flags */
> > +	u16 used_flags;
> > +
> > +	struct iovec iov[VHOST_NET_MAX_SG];
> > +	struct iovec hdr[VHOST_NET_MAX_SG];
> > +};
> > +
> > +struct vhost_dev {
> > +	/* Readers use RCU to access memory table pointer.
> > +	 * Writers use mutex below.*/
> > +	struct vhost_memory *memory;
> > +	struct mm_struct *mm;
> > +	struct vhost_virtqueue *vqs;
> > +	int nvqs;
> > +	struct mutex mutex;
> > +	unsigned acked_features;
> > +};
> > +
> > +long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
> > +long vhost_dev_check_owner(struct vhost_dev *);
> > +long vhost_dev_reset_owner(struct vhost_dev *);
> > +void vhost_dev_cleanup(struct vhost_dev *);
> > +long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, unsigned long arg);
> > +
> > +unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
> > +			   struct iovec iov[],
> > +			   unsigned int *out_num, unsigned int *in_num);
> > +void vhost_discard_vq_desc(struct vhost_virtqueue *);
> > +
> > +int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
> > +void vhost_trigger_irq(struct vhost_dev *, struct vhost_virtqueue *);
> > +void vhost_add_used_and_trigger(struct vhost_dev *, struct vhost_virtqueue *,
> > +				unsigned int head, int len);
> > +void vhost_no_notify(struct vhost_virtqueue *);
> > +bool vhost_notify(struct vhost_virtqueue *);
> > +
> > +int vhost_init(void);
> > +void vhost_cleanup(void);
> > +
> > +#define vq_err(vq, fmt, ...) do {                                  \
> > +		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> > +		if ((vq)->error_ctx)                               \
> > +				eventfd_signal((vq)->error_ctx, 1);\
> > +	} while (0)
> > +
> > +enum {
> > +	VHOST_FEATURES = 1 << VIRTIO_F_NOTIFY_ON_EMPTY,
> > +};
> > +
> > +static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
> > +{
> > +	return dev->acked_features & (1 << bit);
> > +}
> > +
> > +#endif
> > diff --git a/include/linux/Kbuild b/include/linux/Kbuild
> > index dec2f18..975df9a 100644
> > --- a/include/linux/Kbuild
> > +++ b/include/linux/Kbuild
> > @@ -360,6 +360,7 @@ unifdef-y += uio.h
> >  unifdef-y += unistd.h
> >  unifdef-y += usbdevice_fs.h
> >  unifdef-y += utsname.h
> > +unifdef-y += vhost.h
> >  unifdef-y += videodev2.h
> >  unifdef-y += videodev.h
> >  unifdef-y += virtio_config.h
> > diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
> > index 0521177..781a8bb 100644
> > --- a/include/linux/miscdevice.h
> > +++ b/include/linux/miscdevice.h
> > @@ -30,6 +30,7 @@
> >  #define HPET_MINOR		228
> >  #define FUSE_MINOR		229
> >  #define KVM_MINOR		232
> > +#define VHOST_NET_MINOR		233
> >  #define MISC_DYNAMIC_MINOR	255
> >  
> >  struct device;
> > diff --git a/include/linux/vhost.h b/include/linux/vhost.h
> > new file mode 100644
> > index 0000000..3f441a9
> > --- /dev/null
> > +++ b/include/linux/vhost.h
> > @@ -0,0 +1,101 @@
> > +#ifndef _LINUX_VHOST_H
> > +#define _LINUX_VHOST_H
> > +/* Userspace interface for in-kernel virtio accelerators. */
> > +
> > +/* vhost is used to reduce the number of system calls involved in virtio.
> > + *
> > + * Existing virtio net code is used in the guest without modification.
> > + *
> > + * This header includes interface used by userspace hypervisor for
> > + * device configuration.
> > + */
> > +
> > +#include <linux/types.h>
> > +#include <linux/compiler.h>
> > +#include <linux/ioctl.h>
> > +#include <linux/virtio_config.h>
> > +#include <linux/virtio_ring.h>
> > +
> > +struct vhost_vring_state {
> > +	unsigned int index;
> > +	unsigned int num;
> > +};
> > +
> > +struct vhost_vring_file {
> > +	unsigned int index;
> > +	int fd;
> > +};
> > +
> > +struct vhost_vring_addr {
> > +	unsigned int index;
> > +	unsigned int padding;
> > +	__u64 user_addr;
> > +};
> > +
> > +struct vhost_memory_region {
> > +	__u64 guest_phys_addr;
> > +	__u64 memory_size; /* bytes */
> > +	__u64 userspace_addr;
> > +	__u64 padding; /* read/write protection? */
> > +};
> > +
> > +struct vhost_memory {
> > +	__u32 nregions;
> > +	__u32 padding;
> > +	struct vhost_memory_region regions[0];
> > +};
> > +
> > +/* ioctls */
> > +
> > +#define VHOST_VIRTIO 0xAF
> > +
> > +/* Features bitmask for forward compatibility.  Transport bits are used for
> > + * vhost specific features. */
> > +#define VHOST_GET_FEATURES	_IOR(VHOST_VIRTIO, 0x00, __u64)
> > +#define VHOST_ACK_FEATURES	_IOW(VHOST_VIRTIO, 0x00, __u64)
> > +
> > +/* Set current process as the (exclusive) owner of this file descriptor.  This
> > + * must be called before any other vhost command.  Further calls to
> > + * VHOST_SET_OWNER fail until VHOST_RESET_OWNER is called. */
> > +#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
> > +/* Give up ownership, and reset the device to default values.
> > + * Allows subsequent call to VHOST_SET_OWNER to succeed. */
> > +#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
> > +
> > +/* Set up/modify memory layout */
> > +#define VHOST_SET_MEM_TABLE	_IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
> > +
> > +/* Ring setup. These parameters can not be modified while ring is running
> > + * (bound to a device). */
> > +/* Set number of descriptors in ring */
> > +#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
> > +/* Start of array of descriptors (virtually contiguous) */
> > +#define VHOST_SET_VRING_DESC _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
> > +/* Used structure address */
> > +#define VHOST_SET_VRING_USED _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_addr)
> > +/* Available structure address */
> > +#define VHOST_SET_VRING_AVAIL _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_addr)
> > +/* Base value where queue looks for available descriptors */
> > +#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> > +/* Get accessor: reads index, writes value in num */
> > +#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
> > +
> > +/* The following ioctls use eventfd file descriptors to signal and poll
> > + * for events. */
> > +
> > +/* Set eventfd to poll for added buffers */
> > +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
> > +/* Set eventfd to signal when buffers have been used */
> > +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
> > +/* Set eventfd to signal an error */
> > +#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
> > +
> > +/* VHOST_NET specific defines */
> > +
> > +/* Attach virtio net device to a raw socket. The socket must already be
> > + * bound to an ethernet device; this device will be used for transmit.
> > + * Pass -1 to unbind from the socket and the transmit device.
> > + * This can be used to stop the device (e.g. for migration). */
> > +#define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, int)
> > +
> > +#endif
> > -- 
> > 1.6.2.5
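The header quoted above is the entire userspace control surface: feature negotiation, exclusive ownership, a guest-physical-to-userspace memory table (a guest physical address gpa falling in region i resolves in the kernel to userspace_addr + (gpa - guest_phys_addr)), per-ring addresses and base indices, and eventfd-based kick/call/error signalling. For readers who want to see the flow end to end, here is a minimal, hypothetical sketch of how a userspace hypervisor might drive that interface. The /dev/vhost-net node name follows from the miscdevice the patch registers, but setup_vhost_net(), its parameters, and the abbreviated error handling are purely illustrative and not part of the submission.

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vhost.h>

/*
 * Hypothetical helper: guest_base/guest_size describe one contiguous guest
 * RAM mapping the hypervisor already has, desc/avail/used are the vring
 * pieces translated to hypervisor virtual addresses, and raw_sock is an
 * AF_PACKET socket already bound to the transmit device.
 */
static int setup_vhost_net(void *guest_base, uint64_t guest_size,
			   void *desc, void *avail, void *used, int raw_sock)
{
	int vhost = open("/dev/vhost-net", O_RDWR);
	if (vhost < 0)
		return -1;

	/* Become the exclusive owner of this vhost instance. */
	if (ioctl(vhost, VHOST_SET_OWNER) < 0)
		goto err;

	/* Describe how guest physical addresses map into this process. */
	struct vhost_memory *mem =
		calloc(1, sizeof(*mem) + sizeof(struct vhost_memory_region));
	mem->nregions = 1;
	mem->regions[0].guest_phys_addr = 0;
	mem->regions[0].memory_size = guest_size;
	mem->regions[0].userspace_addr = (uint64_t)(unsigned long)guest_base;
	if (ioctl(vhost, VHOST_SET_MEM_TABLE, mem) < 0) {
		free(mem);
		goto err;
	}
	free(mem);

	/* Configure ring 0 (RX); ring 1 (TX) is set up the same way. */
	struct vhost_vring_state state = { .index = 0, .num = 256 };
	ioctl(vhost, VHOST_SET_VRING_NUM, &state);
	state.num = 0;
	ioctl(vhost, VHOST_SET_VRING_BASE, &state);

	struct vhost_vring_addr addr = { .index = 0 };
	addr.user_addr = (uint64_t)(unsigned long)desc;
	ioctl(vhost, VHOST_SET_VRING_DESC, &addr);
	addr.user_addr = (uint64_t)(unsigned long)avail;
	ioctl(vhost, VHOST_SET_VRING_AVAIL, &addr);
	addr.user_addr = (uint64_t)(unsigned long)used;
	ioctl(vhost, VHOST_SET_VRING_USED, &addr);

	/* Guest kicks arrive on one eventfd, "buffers used" signals on another. */
	struct vhost_vring_file file = { .index = 0 };
	file.fd = eventfd(0, 0);
	ioctl(vhost, VHOST_SET_VRING_KICK, &file);
	file.fd = eventfd(0, 0);
	ioctl(vhost, VHOST_SET_VRING_CALL, &file);

	/* Attach the raw socket; this starts the in-kernel datapath. */
	if (ioctl(vhost, VHOST_NET_SET_SOCKET, &raw_sock) < 0)
		goto err;
	return vhost;
err:
	close(vhost);
	return -1;
}

The ordering matters only in that VHOST_SET_OWNER must come first and the ring must be fully described before the socket is attached; everything else is just filling in the structures the header defines.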
Avi Kivity Sept. 27, 2009, 9:43 a.m. UTC | #67
On 09/26/2009 12:32 AM, Gregory Haskins wrote:
>>>
>>> I realize in retrospect that my choice of words above implies vbus _is_
>>> complete, but this is not what I was saying.  What I was trying to
>>> convey is that vbus is _more_ complete.  Yes, in either case some kind
>>> of glue needs to be written.  The difference is that vbus implements
>>> more of the glue generally, and leaves less required to be customized
>>> for each iteration.
>>>
>>>        
>>
>> No argument there.  Since you care about non-virt scenarios and virtio
>> doesn't, naturally vbus is a better fit for them as the code stands.
>>      
> Thanks for finally starting to acknowledge there's a benefit, at least.
>    

I think I've mentioned vbus' finer grained layers as helpful here, 
though I doubt the value of this.  Hypervisors are added rarely, while 
devices and drivers are added (and modified) much more often.  I don't 
buy the anything-to-anything promise.

> To be more precise, IMO virtio is designed to be a performance oriented
> ring-based driver interface that supports all types of hypervisors (e.g.
> shmem based kvm, and non-shmem based Xen).  vbus is designed to be a
> high-performance generic shared-memory interconnect (for rings or
> otherwise) framework for environments where linux is the underpinning
> "host" (physical or virtual).  They are distinctly different, but
> complementary (the former addresses part of the front-end, and the
> latter addresses the back-end and a different part of the front-end).
>    

They're not truly complementary since they're incompatible.  A 2.6.27 
guest, or Windows guest with the existing virtio drivers, won't work 
over vbus.  Further, non-shmem virtio can't work over vbus.  Since 
virtio is guest-oriented and host-agnostic, it can't ignore 
non-shared-memory hosts (even though it's unlikely virtio will be 
adopted there).

> In addition, the kvm-connector used in AlacrityVM's design strives to
> add value and improve performance via other mechanisms, such as dynamic
>   allocation, interrupt coalescing (thus reducing exit-ratio, which is a
> serious issue in KVM)

Do you have measurements of inter-interrupt coalescing rates (excluding 
intra-interrupt coalescing)?

> and priortizable/nestable signals.
>    

That doesn't belong in a bus.

> Today there is a large performance disparity between what a KVM guest
> sees and what a native linux application sees on that same host.  Just
> take a look at some of my graphs between "virtio", and "native", for
> example:
>
> http://developer.novell.com/wiki/images/b/b7/31-rc4_throughput.png
>    

That's a red herring.  The problem is not with virtio as an ABI, but 
with its implementation in userspace.  vhost-net should offer equivalent 
performance to vbus.

> A dominant vbus design principle is to try to achieve the same IO
> performance for all "linux applications" whether they be literally
> userspace applications, or things like KVM vcpus or Ira's physical
> boards.  It also aims to solve problems not previously expressible with
> current technologies (even virtio), like nested real-time.
>
> And even though you repeatedly insist otherwise, the neat thing here is
> that the two technologies mesh (at least under certain circumstances,
> like when virtio is deployed on a shared-memory friendly linux backend
> like KVM).  I hope that my stack diagram below depicts that clearly.
>    

Right, when you ignore the points where they don't fit, it's a perfect mesh.

>> But that's not a strong argument for vbus; instead of adding vbus you
>> could make virtio more friendly to non-virt
>>      
> Actually, it _is_ a strong argument then because adding vbus is what
> helps makes virtio friendly to non-virt, at least for when performance
> matters.
>    

As vhost-net shows, you can do that without vbus and without breaking 
compatibility.



>> Right.  virtio assumes that it's in a virt scenario and that the guest
>> architecture already has enumeration and hotplug mechanisms which it
>> would prefer to use.  That happens to be the case for kvm/x86.
>>      
> No, virtio doesn't assume that.  Its stack provides the "virtio-bus"
> abstraction and what it does assume is that it will be wired up to
> something underneath. Kvm/x86 conveniently has pci, so the virtio-pci
> adapter was created to reuse much of that facility.  For other things
> like lguest and s390, something new had to be created underneath to make
> up for the lack of pci-like support.
>    

Right, I was wrong there.  But it does allow you to have a 1:1 mapping 
between native devices and virtio devices.


>>> So to answer your question, the difference is that the part that has to
>>> be customized in vbus should be a fraction of what needs to be
>>> customized with vhost because it defines more of the stack.
>>>        
>> But if you want to use the native mechanisms, vbus doesn't have any
>> added value.
>>      
> First of all, thats incorrect.  If you want to use the "native"
> mechanisms (via the way the vbus-connector is implemented, for instance)
> you at least still have the benefit that the backend design is more
> broadly re-useable in more environments (like non-virt, for instance),
> because vbus does a proper job of defining the requisite
> layers/abstractions compared to vhost.  So it adds value even in that
> situation.
>    

Maybe.  If vhost-net isn't sufficient I'm sure there will be patches sent.

> Second of all, with PV there is no such thing as "native".  It's
> software so it can be whatever we want.  Sure, you could argue that the
> guest may have built-in support for something like PCI protocol.
> However, PCI protocol itself isn't suitable for high-performance PV out
> of the can.  So you will therefore invariably require new software
> layers on top anyway, even if part of the support is already included.
>    

Of course there is such a thing as native, a pci-ready guest has tons of 
support built into it that doesn't need to be retrofitted.  Since 
practically everyone (including Xen) does their paravirt drivers atop 
pci, the claim that pci isn't suitable for high performance is incorrect.


> And lastly, why would you _need_ to use the so called "native"
> mechanism?  The short answer is, "you don't".  Any given system (guest
> or bare-metal) already have a wide-range of buses (try running "tree
> /sys/bus" in Linux).  More importantly, the concept of adding new buses
> is widely supported in both the Windows and Linux driver model (and
> probably any other guest-type that matters).  Therefore, despite claims
> to the contrary, its not hard or even unusual to add a new bus to the mix.
>    

The short answer is "compatibility".


> In summary, vbus is simply one more bus of many, purpose built to
> support high-end IO in a virt-like model, giving controlled access to
> the linux-host underneath it.  You can write a high-performance layer
> below the OS bus-model (vbus), or above it (virtio-pci) but either way
> you are modifying the stack to add these capabilities, so we might as
> well try to get this right.
>
> With all due respect, you are making a big deal out of a minor issue.
>    

It's not minor to me.

>>> And, as
>>> eluded to in my diagram, both virtio-net and vhost (with some
>>> modifications to fit into the vbus framework) are potentially
>>> complementary, not competitors.
>>>
>>>        
>> Only theoretically.  The existing installed base would have to be thrown
>> away
>>      
> "Thrown away" is pure hyperbole.  The installed base, worse case, needs
> to load a new driver for a missing device.

Yes, we all know how fun this is.  Especially if the device being changed is 
your boot disk.  You may not care about the pain caused to users, but I 
do, so I will continue to insist on compatibility.

>> or we'd need to support both.
>>
>>
>>      
> No matter what model we talk about, there's always going to be a "both"
> since the userspace virtio models are probably not going to go away (nor
> should they).
>    

virtio allows you to have userspace-only, kernel-only, or 
start-with-userspace-and-move-to-kernel-later, all transparent to the 
guest.  In many cases we'll stick with userspace-only.

>> All this is after kvm has decoded that vbus is addressed.  It can't work
>> without someone outside vbus deciding that.
>>      
> How the connector message is delivered is really not relevant.  Some
> architectures will simply deliver the message point-to-point (like the
> original hypercall design for KVM, or something like Ira's rig), and
> some will need additional demuxing (like pci-bridge/pio based KVM).
> It's an implementation detail of the connector.
>
> However, the real point here is that something needs to establish a
> scoped namespace mechanism, add items to that namespace, and advertise
> the presence of the items to the guest.  vbus has this facility built in
> to its stack.  vhost doesn't, so it must come from elsewhere.
>    

So we have: vbus needs a connector, vhost needs a connector.  vbus 
doesn't need userspace to program the addresses (but does need userspace 
to instantiate the devices and to program the bus address decode), vhost 
needs userspace to instantiate the devices and program the addresses.

>>> In fact, it's actually a simpler design to unify things this way because
>>> you avoid splitting the device model up. Consider how painful the vhost
>>> implementation would be if it didn't already have the userspace
>>> virtio-net to fall-back on.  This is effectively what we face for new
>>> devices going forward if that model is to persist.
>>>
>>>        
>>
>> It doesn't have just virtio-net, it has userspace-based hotplug
>>      
> vbus has hotplug too: mkdir and rmdir
>    

Does that work from nonprivileged processes?  Does it work on Windows?

> As an added bonus, its device-model is modular.  A developer can write a
> new device model, compile it, insmod it to the host kernel, hotplug it
> to the running guest with mkdir/ln, and the come back out again
> (hotunplug with rmdir, rmmod, etc).  They may do this all without taking
> the guest down, and while eating QEMU based IO solutions for breakfast
> performance wise.
>
> Afaict, qemu can't do either of those things.
>    

We've seen that herring before, and it's redder than ever.



>> Refactor instead of duplicating.
>>      
> There is no duplicating.  vbus has no equivalent today as virtio doesn't
> define these layers.
>    

So define them if they're missing.


>>>
>>>        
>>>>    Use libraries (virtio-shmem.ko, libvhost.so).
>>>>
>>>>          
>>> What do you suppose vbus is?  vbus-proxy.ko = virtio-shmem.ko, and you
>>> dont need libvhost.so per se since you can just use standard kernel
>>> interfaces (like configfs/sysfs).  I could create an .so going forward
>>> for the new ioctl-based interface, I suppose.
>>>
>>>        
>> Refactor instead of rewriting.
>>      
> There is no rewriting.  vbus has no equivalent today as virtio doesn't
> define these layers.
>
> By your own admission, you said if you wanted that capability, use a
> library.  What I think you are not understanding is vbus _is_ that
> library.  So what is the problem, exactly?
>    

It's not compatible.  If you were truly worried about code duplication 
in virtio, you'd refactor it to remove the duplication, without 
affecting existing guests.

>>>> For kvm/x86 pci definitely remains king.
>>>>
>>>>          
>>> For full virtualization, sure.  I agree.  However, we are talking about
>>> PV here.  For PV, PCI is not a requirement and is a technical dead-end
>>> IMO.
>>>
>>> KVM seems to be the only virt solution that thinks otherwise (*), but I
>>> believe that is primarily a condition of its maturity.  I aim to help
>>> advance things here.
>>>
>>> (*) citation: xen has xenbus, lguest has lguest-bus, vmware has some
>>> vmi-esq thing (I forget what its called) to name a few.  Love 'em or
>>> hate 'em, most other hypervisors do something along these lines.  I'd
>>> like to try to create one for KVM, but to unify them all (at least for
>>> the Linux-based host designs).
>>>
>>>        
>> VMware are throwing VMI away (won't be supported in their new product,
>> and they've sent a patch to rip it off from Linux);
>>      
> vmware only cares about x86 iiuc, so probably not a good example.
>    

Well, you brought it up.  Between you and me, I only care about x86 too.

>> Xen has to tunnel
>> xenbus in pci for full virtualization (which is where Windows is, and
>> where Linux will be too once people realize it's faster).  lguest is
>> meant as an example hypervisor, not an attempt to take over the world.
>>      
> So pick any other hypervisor, and the situation is often similar.
>    

The situation is often pci.

>
>> An right now you can have a guest using pci to access a mix of
>> userspace-emulated devices, userspace-emulated-but-kernel-accelerated
>> virtio devices, and real host devices.  All on one dead-end bus.  Try
>> that with vbus.
>>      
> vbus is not interested in userspace devices.  The charter is to provide
> facilities for utilizing the host linux kernel's IO capabilities in the
> most efficient, yet safe, manner possible.  Those devices that fit
> outside that charter can ride on legacy mechanisms if that suits them best.
>    

vbus isn't, but I am.  I would prefer not to have to expose 
implementation decisions (kernel vs userspace) to the guest (vbus vs pci).

>>> That won't cut it.  For one, creating an eventfd is only part of the
>>> equation.  I.e. you need to have originate/terminate somewhere
>>> interesting (and in-kernel, otherwise use tuntap).
>>>
>>>        
>> vbus needs the same thing so it cancels out.
>>      
> No, it does not.  vbus just needs a relatively simple single message
> pipe between the guest and host (think "hypercall tunnel", if you will).
>    

That's ioeventfd.  So far so similar.

>   Per queue/device addressing is handled by the same conceptual namespace
> as the one that would trigger eventfds in the model you mention.  And
> that namespace is built in to the vbus stack, and objects are registered
> automatically as they are created.
>
> Contrast that to vhost, which requires some other kernel interface to
> exist, and to be managed manually for each object that is created.  Your
> libvhostconfig would need to somehow know how to perform this
> registration operation, and there would have to be something in the
> kernel to receive it, presumably on a per platform basis.  Solving this
> problem generally would probably end up looking eerily like vbus,
> because thats what vbus does.
>    

vbus devices aren't magically instantiated.  Userspace needs to 
instantiate them too.  Sure, there's less work on the host side since 
you're using vbus instead of the native interface, but more work on the 
guest side since you're using vbus instead of the native interface.



>> Well, let's see.  Can vbus today:
>>
>> - let userspace know which features are available (so it can decide if
>> live migration is possible)
>>      
> yes, its in sysfs.
>
>    
>> - let userspace limit which features are exposed to the guest (so it can
>> make live migration possible among hosts of different capabilities)
>>      
> yes, its in sysfs.
>    

Per-device?  non-privileged-user capable?

>> - let userspace know which features were negotiated (so it can transfer
>> them to the other host during live migration)
>>      
> no, but we can easily add ->save()/->restore() to the model going
> forward, and the negotiated features are just a subcomponent of its
> serialized stream.
>
>    
>> - let userspace tell the kernel which features were negotiated (when
>> live migration completes, to avoid requiring the guest to re-negotiate)
>>      
> that would be the function of the ->restore() deserializer.
>
>    
>> - do all that from an unprivileged process
>>      
> yes, in the upcoming alacrityvm v0.3 with the ioctl based control plane.
>    

Ah, so you have two control planes.

> Bottom line: vbus isn't done, especially w.r.t. live-migration, but that
> is not a valid argument against the idea if you believe in
> release-early/release-often. kvm wasn't (isn't) done either when it was
> proposed/merged.
>
>    

kvm didn't have an existing counterpart in Linux when it was 
proposed/merged.
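Since the eventfd-based kick/call signalling keeps coming up above, here is a small, hypothetical userspace sketch of what it amounts to once the file descriptors have been handed to vhost with VHOST_SET_VRING_KICK and VHOST_SET_VRING_CALL. With kvm the kick side would normally be wired up in-kernel through ioeventfd rather than an explicit write(); inject_guest_irq() stands in for whatever interrupt-injection path the hypervisor uses and is not part of this patch.

#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

void inject_guest_irq(void);	/* hypothetical hypervisor helper */

/* Forward a guest notification to vhost: making the kick eventfd readable
 * wakes vhost_poll_wakeup(), which queues handle_tx/handle_rx on the vhost
 * workqueue. */
static void kick_vhost(int kick_fd)
{
	uint64_t one = 1;
	(void)write(kick_fd, &one, sizeof one);
}

/* Wait for vhost to report used buffers: eventfd_signal() in
 * vhost_trigger_irq() makes the call eventfd readable; the read also clears
 * the counter, so back-to-back signals coalesce into one interrupt. */
static void wait_for_call(int call_fd)
{
	uint64_t count;
	if (read(call_fd, &count, sizeof count) == sizeof count)
		inject_guest_irq();
}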
diff mbox

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index b1114cf..de4587f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5431,6 +5431,16 @@  S:	Maintained
 F:	Documentation/filesystems/vfat.txt
 F:	fs/fat/
 
+VIRTIO HOST (VHOST)
+P:	Michael S. Tsirkin
+M:	mst@redhat.com
+L:	kvm@vger.kernel.org
+L:	virtualization@lists.osdl.org
+L:	netdev@vger.kernel.org
+S:	Maintained
+F:	drivers/vhost/
+F:	include/linux/vhost.h
+
 VIA RHINE NETWORK DRIVER
 M:	Roger Luethi <rl@hellgate.ch>
 S:	Maintained
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..94f44d9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,7 @@  config KVM_AMD
 
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
+source drivers/vhost/Kconfig
 source drivers/lguest/Kconfig
 source drivers/virtio/Kconfig
 
diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..1551ae1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -105,6 +105,7 @@  obj-$(CONFIG_HID)		+= hid/
 obj-$(CONFIG_PPC_PS3)		+= ps3/
 obj-$(CONFIG_OF)		+= of/
 obj-$(CONFIG_SSB)		+= ssb/
+obj-$(CONFIG_VHOST_NET)		+= vhost/
 obj-$(CONFIG_VIRTIO)		+= virtio/
 obj-$(CONFIG_VLYNQ)		+= vlynq/
 obj-$(CONFIG_STAGING)		+= staging/
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
new file mode 100644
index 0000000..d955406
--- /dev/null
+++ b/drivers/vhost/Kconfig
@@ -0,0 +1,11 @@ 
+config VHOST_NET
+	tristate "Host kernel accelerator for virtio net"
+	depends on NET && EVENTFD
+	---help---
+	  This kernel module can be loaded in the host kernel to accelerate
+	  guest networking with virtio_net. Not to be confused with the virtio_net
+	  module itself, which needs to be loaded in the guest kernel.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called vhost_net.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
new file mode 100644
index 0000000..72dd020
--- /dev/null
+++ b/drivers/vhost/Makefile
@@ -0,0 +1,2 @@ 
+obj-$(CONFIG_VHOST_NET) += vhost_net.o
+vhost_net-y := vhost.o net.o
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
new file mode 100644
index 0000000..2210eaa
--- /dev/null
+++ b/drivers/vhost/net.c
@@ -0,0 +1,475 @@ 
+/* Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * virtio-net server in host kernel.
+ */
+
+#include <linux/compat.h>
+#include <linux/eventfd.h>
+#include <linux/vhost.h>
+#include <linux/virtio_net.h>
+#include <linux/mmu_context.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/rcupdate.h>
+#include <linux/file.h>
+
+#include <linux/net.h>
+#include <linux/if_packet.h>
+#include <linux/if_arp.h>
+
+#include <net/sock.h>
+
+#include "vhost.h"
+
+enum {
+	VHOST_NET_VQ_RX = 0,
+	VHOST_NET_VQ_TX = 1,
+	VHOST_NET_VQ_MAX = 2,
+};
+
+struct vhost_net {
+	struct vhost_dev dev;
+	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
+	/* We use a kind of RCU to access sock pointer.
+	 * All readers access it from workqueue, which makes it possible to
+	 * flush the workqueue instead of synchronize_rcu. Therefore readers do
+	 * not need to call rcu_read_lock/rcu_read_unlock: the beginning of
+	 * work item execution acts instead of rcu_read_lock() and the end of
+	 * work item execution acts instead of rcu_read_unlock().
+	 * Writers use device mutex. */
+	struct socket *sock;
+	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+};
+
+/* Pop first len bytes from iovec. Return number of segments used. */
+static int move_iovec_hdr(struct iovec *from, struct iovec *to,
+			  size_t len, int iov_count)
+{
+       int seg = 0;
+       size_t size;
+       while (len && seg < iov_count) {
+               size = min(from->iov_len, len);
+               to->iov_base = from->iov_base;
+               to->iov_len = size;
+               from->iov_len -= size;
+               from->iov_base += size;
+               len -= size;
+               ++from;
+               ++to;
+               ++seg;
+       }
+       return seg;
+}
+
+/* Expects to be always run from workqueue - which acts as
+ * read-side critical section for our kind of RCU. */
+static void handle_tx(struct vhost_net *net)
+{
+	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	unsigned head, out, in, s;
+	struct msghdr msg = {
+		.msg_name = NULL,
+		.msg_namelen = 0,
+		.msg_control = NULL,
+		.msg_controllen = 0,
+		.msg_iov = vq->iov,
+		.msg_flags = MSG_DONTWAIT,
+	};
+	size_t len;
+	int err;
+	struct socket *sock = rcu_dereference(net->sock);
+	if (!sock || !sock_writeable(sock->sk))
+		return;
+
+	use_mm(net->dev.mm);
+	mutex_lock(&vq->mutex);
+	for (;;) {
+		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
+		/* Nothing new?  Wait for eventfd to tell us they refilled. */
+		if (head == vq->num)
+			break;
+		if (in) {
+			vq_err(vq, "Unexpected descriptor format for TX: "
+			       "out %d, in %d\n", out, in);
+			break;
+		}
+		/* Skip header. TODO: support TSO. */
+		s = move_iovec_hdr(vq->iov, vq->hdr,
+				   sizeof(struct virtio_net_hdr), out);
+		msg.msg_iovlen = out;
+		len = iov_length(vq->iov, out);
+		/* Sanity check */
+		if (!len) {
+			vq_err(vq, "Unexpected header len for TX: "
+			       "%ld expected %zd\n",
+			       iov_length(vq->hdr, s),
+			       sizeof(struct virtio_net_hdr));
+			break;
+		}
+		/* TODO: Check specific error and bomb out unless ENOBUFS? */
+		err = sock->ops->sendmsg(NULL, sock, &msg, len);
+		if (err < 0) {
+			vhost_discard_vq_desc(vq);
+			break;
+		}
+		if (err != len)
+			pr_err("Truncated TX packet: "
+			       " len %d != %zd\n", err, len);
+		vhost_add_used_and_trigger(&net->dev, vq, head, 0);
+	}
+
+	mutex_unlock(&vq->mutex);
+	unuse_mm(net->dev.mm);
+}
+
+/* Expects to be always run from workqueue - which acts as
+ * read-side critical section for our kind of RCU. */
+static void handle_rx(struct vhost_net *net)
+{
+	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
+	unsigned head, out, in, s;
+	struct msghdr msg = {
+		.msg_name = NULL,
+		.msg_namelen = 0,
+		.msg_control = NULL, /* FIXME: get and handle RX aux data. */
+		.msg_controllen = 0,
+		.msg_iov = vq->iov,
+		.msg_flags = MSG_DONTWAIT,
+	};
+
+	struct virtio_net_hdr hdr = {
+		.flags = 0,
+		.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	};
+
+	size_t len;
+	int err;
+	struct socket *sock = rcu_dereference(net->sock);
+	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+		return;
+
+	use_mm(net->dev.mm);
+	mutex_lock(&vq->mutex);
+	vhost_no_notify(vq);
+
+	for (;;) {
+		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
+		/* OK, now we need to know about added descriptors. */
+		if (head == vq->num && vhost_notify(vq))
+			/* They could have slipped one in as we were doing that:
+			 * check again. */
+			continue;
+		/* Nothing new?  Wait for eventfd to tell us they refilled. */
+		if (head == vq->num)
+			break;
+		/* We don't need to be notified again. */
+		vhost_no_notify(vq);
+		if (out) {
+			vq_err(vq, "Unexpected descriptor format for RX: "
+			       "out %d, in %d\n",
+			       out, in);
+			break;
+		}
+		/* Skip header. TODO: support TSO/mergeable rx buffers. */
+		s = move_iovec_hdr(vq->iov, vq->hdr, sizeof hdr, in);
+		msg.msg_iovlen = in;
+		len = iov_length(vq->iov, in);
+		/* Sanity check */
+		if (!len) {
+			vq_err(vq, "Unexpected header len for RX: "
+			       "%zd expected %zd\n",
+			       iov_length(vq->hdr, s), sizeof hdr);
+			break;
+		}
+		err = sock->ops->recvmsg(NULL, sock, &msg,
+					 len, MSG_DONTWAIT | MSG_TRUNC);
+		/* TODO: Check specific error and bomb out unless EAGAIN? */
+		if (err < 0) {
+			vhost_discard_vq_desc(vq);
+			break;
+		}
+		/* TODO: Should check and handle checksum. */
+		if (err > len) {
+			pr_err("Discarded truncated rx packet: "
+			       " len %d > %zd\n", err, len);
+			vhost_discard_vq_desc(vq);
+			continue;
+		}
+		len = err;
+		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, sizeof hdr);
+		if (err) {
+			vq_err(vq, "Unable to write vnet_hdr at addr %p: %d\n",
+			       vq->iov->iov_base, err);
+			break;
+		}
+		vhost_add_used_and_trigger(&net->dev, vq, head,
+					   len + sizeof hdr);
+	}
+
+	mutex_unlock(&vq->mutex);
+	unuse_mm(net->dev.mm);
+}
+
+static void handle_tx_kick(struct work_struct *work)
+{
+	struct vhost_virtqueue *vq;
+	struct vhost_net *net;
+	vq = container_of(work, struct vhost_virtqueue, poll.work);
+	net = container_of(vq->dev, struct vhost_net, dev);
+	handle_tx(net);
+}
+
+static void handle_rx_kick(struct work_struct *work)
+{
+	struct vhost_virtqueue *vq;
+	struct vhost_net *net;
+	vq = container_of(work, struct vhost_virtqueue, poll.work);
+	net = container_of(vq->dev, struct vhost_net, dev);
+	handle_rx(net);
+}
+
+static void handle_tx_net(struct work_struct *work)
+{
+	struct vhost_net *net;
+	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_TX].work);
+	handle_tx(net);
+}
+
+static void handle_rx_net(struct work_struct *work)
+{
+	struct vhost_net *net;
+	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_RX].work);
+	handle_rx(net);
+}
+
+static int vhost_net_open(struct inode *inode, struct file *f)
+{
+	struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
+	int r;
+	if (!n)
+		return -ENOMEM;
+	f->private_data = n;
+	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
+	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
+	r = vhost_dev_init(&n->dev, n->vqs, VHOST_NET_VQ_MAX);
+	if (r < 0) {
+		kfree(n);
+		return r;
+	}
+
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+	return 0;
+}
+
+static struct socket *vhost_net_stop(struct vhost_net *n)
+{
+	struct socket *sock = n->sock;
+	rcu_assign_pointer(n->sock, NULL);
+	if (sock) {
+		vhost_poll_flush(n->poll + VHOST_NET_VQ_TX);
+		vhost_poll_flush(n->poll + VHOST_NET_VQ_RX);
+	}
+	return sock;
+}
+
+static int vhost_net_release(struct inode *inode, struct file *f)
+{
+	struct vhost_net *n = f->private_data;
+	struct socket *sock;
+
+	sock = vhost_net_stop(n);
+	vhost_dev_cleanup(&n->dev);
+	if (sock)
+		fput(sock->file);
+	kfree(n);
+	return 0;
+}
+
+static void vhost_net_flush(struct vhost_net *n)
+{
+	vhost_poll_flush(n->poll + VHOST_NET_VQ_TX);
+	vhost_poll_flush(n->poll + VHOST_NET_VQ_RX);
+	vhost_poll_flush(&n->dev.vqs[VHOST_NET_VQ_TX].poll);
+	vhost_poll_flush(&n->dev.vqs[VHOST_NET_VQ_RX].poll);
+}
+
+static long vhost_net_set_socket(struct vhost_net *n, int fd)
+{
+	struct {
+		struct sockaddr_ll sa;
+		char  buf[MAX_ADDR_LEN];
+	} uaddr;
+	struct socket *sock, *oldsock = NULL;
+	int uaddr_len = sizeof uaddr, r;
+
+	mutex_lock(&n->dev.mutex);
+	r = vhost_dev_check_owner(&n->dev);
+	if (r)
+		goto done;
+
+	if (fd == -1) {
+		/* Disconnect from socket and device. */
+		oldsock = vhost_net_stop(n);
+		goto done;
+	}
+
+	sock = sockfd_lookup(fd, &r);
+	if (!sock) {
+		r = -ENOTSOCK;
+		goto done;
+	}
+
+	/* Parameter checking */
+	if (sock->sk->sk_type != SOCK_RAW) {
+		r = -ESOCKTNOSUPPORT;
+		goto done;
+	}
+
+	r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
+			       &uaddr_len, 0);
+	if (r)
+		goto done;
+
+	if (uaddr.sa.sll_family != AF_PACKET) {
+		r = -EPFNOSUPPORT;
+		goto done;
+	}
+
+	/* start polling new socket */
+	if (sock == oldsock)
+		goto done;
+
+	if (oldsock) {
+		vhost_poll_stop(n->poll + VHOST_NET_VQ_TX);
+		vhost_poll_stop(n->poll + VHOST_NET_VQ_RX);
+	}
+	oldsock = n->sock;
+	rcu_assign_pointer(n->sock, sock);
+	vhost_poll_start(n->poll + VHOST_NET_VQ_TX, sock->file);
+	vhost_poll_start(n->poll + VHOST_NET_VQ_RX, sock->file);
+done:
+	mutex_unlock(&n->dev.mutex);
+	if (oldsock) {
+		vhost_net_flush(n);
+		fput(oldsock->file);
+	}
+	return r;
+}
+
+static long vhost_net_reset_owner(struct vhost_net *n)
+{
+	struct socket *sock = NULL;
+	long r;
+	mutex_lock(&n->dev.mutex);
+	r = vhost_dev_check_owner(&n->dev);
+	if (r)
+		goto done;
+	sock = vhost_net_stop(n);
+	r = vhost_dev_reset_owner(&n->dev);
+done:
+	mutex_unlock(&n->dev.mutex);
+	if (sock)
+		fput(sock->file);
+	return r;
+}
+
+static void vhost_net_set_features(struct vhost_net *n, u64 features)
+{
+	mutex_lock(&n->dev.mutex);
+	n->dev.acked_features = features;
+	mutex_unlock(&n->dev.mutex);
+	vhost_net_flush(n);
+}
+
+static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
+			    unsigned long arg)
+{
+	struct vhost_net *n = f->private_data;
+	void __user *argp = (void __user *)arg;
+	u32 __user *featurep = argp;
+	int __user *fdp = argp;
+	u64 features;
+	int fd, r;
+	switch (ioctl) {
+	case VHOST_NET_SET_SOCKET:
+		r = get_user(fd, fdp);
+		if (r < 0)
+			return r;
+		return vhost_net_set_socket(n, fd);
+	case VHOST_GET_FEATURES:
+		features = VHOST_FEATURES;
+		return put_user(features, featurep);
+	case VHOST_ACK_FEATURES:
+		r = get_user(features, featurep);
+		/* No features for now */
+		if (r < 0)
+			return r;
+		if (features & ~VHOST_FEATURES)
+			return -EOPNOTSUPP;
+		vhost_net_set_features(n, features);
+		return 0;
+	case VHOST_RESET_OWNER:
+		return vhost_net_reset_owner(n);
+	default:
+		return vhost_dev_ioctl(&n->dev, ioctl, arg);
+	}
+}
+
+#ifdef CONFIG_COMPAT
+static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl,
+				   unsigned long arg)
+{
+	return vhost_net_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations vhost_net_fops = {
+	.owner          = THIS_MODULE,
+	.release        = vhost_net_release,
+	.unlocked_ioctl = vhost_net_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl   = vhost_net_compat_ioctl,
+#endif
+	.open           = vhost_net_open,
+};
+
+static struct miscdevice vhost_net_misc = {
+	VHOST_NET_MINOR,
+	"vhost-net",
+	&vhost_net_fops,
+};
+
+int vhost_net_init(void)
+{
+	int r = vhost_init();
+	if (r)
+		goto err_init;
+	r = misc_register(&vhost_net_misc);
+	if (r)
+		goto err_reg;
+	return 0;
+err_reg:
+	vhost_cleanup();
+err_init:
+	return r;
+
+}
+module_init(vhost_net_init);
+
+void vhost_net_exit(void)
+{
+	misc_deregister(&vhost_net_misc);
+	vhost_cleanup();
+}
+module_exit(vhost_net_exit);
+
+MODULE_VERSION("0.0.1");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Michael S. Tsirkin");
+MODULE_DESCRIPTION("Host kernel accelerator for virtio net");
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
new file mode 100644
index 0000000..6925cc1
--- /dev/null
+++ b/drivers/vhost/vhost.c
@@ -0,0 +1,688 @@ 
+/* Copyright (C) 2009 Red Hat, Inc.
+ * Copyright (C) 2006 Rusty Russell IBM Corporation
+ *
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ *
+ * Inspiration, some code, and most witty comments come from
+ * Documentation/lguest/lguest.c, by Rusty Russell
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * Generic code for virtio server in host kernel.
+ */
+
+#include <linux/eventfd.h>
+#include <linux/vhost.h>
+#include <linux/virtio_net.h>
+#include <linux/mm.h>
+#include <linux/miscdevice.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/rcupdate.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+
+#include <linux/net.h>
+#include <linux/if_packet.h>
+#include <linux/if_arp.h>
+
+#include <net/sock.h>
+
+#include "vhost.h"
+
+enum {
+	VHOST_MEMORY_MAX_NREGIONS = 64,
+};
+
+static struct workqueue_struct *vhost_workqueue;
+
+static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
+			    poll_table *pt)
+{
+	struct vhost_poll *poll;
+	poll = container_of(pt, struct vhost_poll, table);
+
+	poll->wqh = wqh;
+	add_wait_queue(wqh, &poll->wait);
+}
+
+static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
+			     void *key)
+{
+	struct vhost_poll *poll;
+	poll = container_of(wait, struct vhost_poll, wait);
+	if (!((unsigned long)key & poll->mask))
+		return 0;
+
+	queue_work(vhost_workqueue, &poll->work);
+	return 0;
+}
+
+/* Init poll structure */
+void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
+		     unsigned long mask)
+{
+	INIT_WORK(&poll->work, func);
+	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
+	init_poll_funcptr(&poll->table, vhost_poll_func);
+	poll->mask = mask;
+}
+
+/* Start polling a file. We add ourselves to the file's wait queue. The caller
+ * must keep a reference to the file until after vhost_poll_stop is called. */
+void vhost_poll_start(struct vhost_poll *poll, struct file *file)
+{
+	unsigned long mask;
+	mask = file->f_op->poll(file, &poll->table);
+	if (mask)
+		vhost_poll_wakeup(&poll->wait, 0, 0, (void *)mask);
+}
+
+/* Stop polling a file. After this function returns, it becomes safe to drop the
+ * file reference. You must also flush afterwards. */
+void vhost_poll_stop(struct vhost_poll *poll)
+{
+	remove_wait_queue(poll->wqh, &poll->wait);
+}
+
+/* Flush any work that has been scheduled. When calling this, don't hold any
+ * locks that are also used by the callback. */
+void vhost_poll_flush(struct vhost_poll *poll)
+{
+	flush_work(&poll->work);
+}
+
+long vhost_dev_init(struct vhost_dev *dev,
+		    struct vhost_virtqueue *vqs, int nvqs)
+{
+	int i;
+	dev->vqs = vqs;
+	dev->nvqs = nvqs;
+	mutex_init(&dev->mutex);
+
+	for (i = 0; i < dev->nvqs; ++i) {
+		dev->vqs[i].dev = dev;
+		mutex_init(&dev->vqs[i].mutex);
+		if (dev->vqs[i].handle_kick)
+			vhost_poll_init(&dev->vqs[i].poll,
+					dev->vqs[i].handle_kick,
+					POLLIN);
+	}
+	return 0;
+}
+
+/* Caller should have device mutex */
+long vhost_dev_check_owner(struct vhost_dev *dev)
+{
+	/* Are you the owner? If not, I don't think you mean to do that */
+	return dev->mm == current->mm ? 0 : -EPERM;
+}
+
+/* Caller should have device mutex */
+static long vhost_dev_set_owner(struct vhost_dev *dev)
+{
+	/* Is there an owner already? */
+	if (dev->mm)
+		return -EBUSY;
+	/* No owner, become one */
+	dev->mm = get_task_mm(current);
+	return 0;
+}
+
+/* Caller should have device mutex */
+long vhost_dev_reset_owner(struct vhost_dev *dev)
+{
+	struct vhost_memory *memory;
+
+	/* Restore memory to default 1:1 mapping. */
+	memory = kmalloc(offsetof(struct vhost_memory, regions) +
+			 2 * sizeof *memory->regions, GFP_KERNEL);
+	if (!memory)
+		return -ENOMEM;
+
+	vhost_dev_cleanup(dev);
+
+	memory->nregions = 2;
+	memory->regions[0].guest_phys_addr = 1;
+	memory->regions[0].userspace_addr = 1;
+	memory->regions[0].memory_size = ~0ULL;
+	memory->regions[1].guest_phys_addr = 0;
+	memory->regions[1].userspace_addr = 0;
+	memory->regions[1].memory_size = 1;
+	dev->memory = memory;
+	return 0;
+}
+
+/* Caller should have device mutex */
+void vhost_dev_cleanup(struct vhost_dev *dev)
+{
+	int i;
+	for (i = 0; i < dev->nvqs; ++i) {
+		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
+			vhost_poll_stop(&dev->vqs[i].poll);
+			vhost_poll_flush(&dev->vqs[i].poll);
+		}
+		if (dev->vqs[i].error_ctx)
+			eventfd_ctx_put(dev->vqs[i].error_ctx);
+		if (dev->vqs[i].error)
+			fput(dev->vqs[i].error);
+		if (dev->vqs[i].kick)
+			fput(dev->vqs[i].kick);
+		if (dev->vqs[i].call_ctx)
+			eventfd_ctx_put(dev->vqs[i].call_ctx);
+		if (dev->vqs[i].call)
+			fput(dev->vqs[i].call);
+		dev->vqs[i].error_ctx = NULL;
+		dev->vqs[i].error = NULL;
+		dev->vqs[i].kick = NULL;
+		dev->vqs[i].call_ctx = NULL;
+		dev->vqs[i].call = NULL;
+	}
+	/* No one will access memory at this point */
+	kfree(dev->memory);
+	dev->memory = NULL;
+	if (dev->mm)
+		mmput(dev->mm);
+	dev->mm = NULL;
+}
+
+static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
+{
+	struct vhost_memory mem, *newmem, *oldmem;
+	unsigned long size = offsetof(struct vhost_memory, regions);
+	long r;
+	r = copy_from_user(&mem, m, size);
+	if (r)
+		return r;
+	if (mem.padding)
+		return -EOPNOTSUPP;
+	if (mem.nregions > VHOST_MEMORY_MAX_NREGIONS)
+		return -E2BIG;
+	newmem = kmalloc(size + mem.nregions * sizeof *m->regions, GFP_KERNEL);
+	if (!newmem)
+		return -ENOMEM;
+
+	memcpy(newmem, &mem, size);
+	r = copy_from_user(newmem->regions, m->regions,
+			   mem.nregions * sizeof *m->regions);
+	if (r) {
+		kfree(newmem);
+		return r;
+	}
+	oldmem = d->memory;
+	rcu_assign_pointer(d->memory, newmem);
+	synchronize_rcu();
+	kfree(oldmem);
+	return 0;
+}
+
+static int init_used(struct vhost_virtqueue *vq)
+{
+	int r = put_user(vq->used_flags, &vq->used->flags);
+	if (r)
+		return r;
+	return get_user(vq->last_used_idx, &vq->used->idx);
+}
+
+static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
+{
+	struct file *eventfp, *filep = NULL,
+		    *pollstart = NULL, *pollstop = NULL;
+	struct eventfd_ctx *ctx = NULL;
+	u32 __user *idxp = argp;
+	struct vhost_virtqueue *vq;
+	struct vhost_vring_state s;
+	struct vhost_vring_file f;
+	struct vhost_vring_addr a;
+	u32 idx;
+	long r;
+
+	r = get_user(idx, idxp);
+	if (r < 0)
+		return r;
+	if (idx >= d->nvqs)
+		return -ENOBUFS;
+
+	vq = d->vqs + idx;
+
+	mutex_lock(&vq->mutex);
+
+	switch (ioctl) {
+	case VHOST_SET_VRING_NUM:
+		r = copy_from_user(&s, argp, sizeof s);
+		if (r < 0)
+			break;
+		if (s.num > 0xffff) {
+			r = -EINVAL;
+			break;
+		}
+		vq->num = s.num;
+		break;
+	case VHOST_SET_VRING_BASE:
+		r = copy_from_user(&s, argp, sizeof s);
+		if (r < 0)
+			break;
+		if (s.num > 0xffff) {
+			r = -EINVAL;
+			break;
+		}
+		vq->avail_idx = vq->last_avail_idx = s.num;
+		break;
+	case VHOST_GET_VRING_BASE:
+		s.index = idx;
+		s.num = vq->last_avail_idx;
+		r = copy_to_user(argp, &s, sizeof s);
+		break;
+	case VHOST_SET_VRING_DESC:
+		r = copy_from_user(&a, argp, sizeof a);
+		if (r < 0)
+			break;
+		if (a.padding) {
+			r = -EOPNOTSUPP;
+			break;
+		}
+		if ((u64)(long)a.user_addr != a.user_addr) {
+			r = -EFAULT;
+			break;
+		}
+		vq->desc = (void __user *)(long)a.user_addr;
+		break;
+	case VHOST_SET_VRING_AVAIL:
+		r = copy_from_user(&a, argp, sizeof a);
+		if (r < 0)
+			break;
+		if (a.padding) {
+			r = -EOPNOTSUPP;
+			break;
+		}
+		if ((u64)(long)a.user_addr != a.user_addr) {
+			r = -EFAULT;
+			break;
+		}
+		vq->avail = (void __user *)(long)a.user_addr;
+		/* Forget the cached index value. */
+		vq->avail_idx = vq->last_avail_idx;
+		break;
+	case VHOST_SET_VRING_USED:
+		r = copy_from_user(&a, argp, sizeof a);
+		if (r < 0)
+			break;
+		if (a.padding) {
+			r = -EOPNOTSUPP;
+			break;
+		}
+		if ((u64)(long)a.user_addr != a.user_addr) {
+			r = -EFAULT;
+			break;
+		}
+		vq->used = (void __user *)(long)a.user_addr;
+		r = init_used(vq);
+		if (r)
+			break;
+		break;
+	case VHOST_SET_VRING_KICK:
+		r = copy_from_user(&f, argp, sizeof f);
+		if (r < 0)
+			break;
+		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
+		if (IS_ERR(eventfp))
+			return PTR_ERR(eventfp);
+		if (eventfp != vq->kick) {
+			pollstop = filep = vq->kick;
+			pollstart = vq->kick = eventfp;
+		} else
+			filep = eventfp;
+		break;
+	case VHOST_SET_VRING_CALL:
+		r = copy_from_user(&f, argp, sizeof f);
+		if (r < 0)
+			break;
+		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
+		if (IS_ERR(eventfp))
+			return PTR_ERR(eventfp);
+		if (eventfp != vq->call) {
+			filep = vq->call;
+			ctx = vq->call_ctx;
+			vq->call = eventfp;
+			vq->call_ctx = eventfp ?
+				eventfd_ctx_fileget(eventfp) : NULL;
+		} else
+			filep = eventfp;
+		break;
+	case VHOST_SET_VRING_ERR:
+		r = copy_from_user(&f, argp, sizeof f);
+		if (r < 0)
+			break;
+		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
+		if (IS_ERR(eventfp))
+			return PTR_ERR(eventfp);
+		if (eventfp != vq->error) {
+			filep = vq->error;
+			vq->error = eventfp;
+			ctx = vq->error_ctx;
+			vq->error_ctx = eventfp ?
+				eventfd_ctx_fileget(eventfp) : NULL;
+		} else
+			filep = eventfp;
+		break;
+	default:
+		r = -ENOIOCTLCMD;
+	}
+
+	if (pollstop && vq->handle_kick)
+		vhost_poll_stop(&vq->poll);
+
+	if (ctx)
+		eventfd_ctx_put(ctx);
+	if (filep)
+		fput(filep);
+
+	if (pollstart && vq->handle_kick)
+		vhost_poll_start(&vq->poll, vq->kick);
+
+	mutex_unlock(&vq->mutex);
+
+	if (pollstop && vq->handle_kick)
+		vhost_poll_flush(&vq->poll);
+	return 0;
+}
+
+long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	long r;
+
+	mutex_lock(&d->mutex);
+	/* If you are not the owner, you can become one */
+	if (ioctl == VHOST_SET_OWNER) {
+		r = vhost_dev_set_owner(d);
+		goto done;
+	}
+
+	/* You must be the owner to do anything else */
+	r = vhost_dev_check_owner(d);
+	if (r)
+		goto done;
+
+	switch (ioctl) {
+	case VHOST_SET_MEM_TABLE:
+		r = vhost_set_memory(d, argp);
+		break;
+	default:
+		r = vhost_set_vring(d, ioctl, argp);
+		break;
+	}
+done:
+	mutex_unlock(&d->mutex);
+	return r;
+}
+
+static const struct vhost_memory_region *find_region(struct vhost_memory *mem,
+						     __u64 addr, __u32 len)
+{
+	struct vhost_memory_region *reg;
+	int i;
+	/* linear search is not brilliant, but we really have on the order of 6
+	 * regions in practice */
+	for (i = 0; i < mem->nregions; ++i) {
+		reg = mem->regions + i;
+		if (reg->guest_phys_addr <= addr &&
+		    reg->guest_phys_addr + reg->memory_size - 1 >= addr)
+			return reg;
+	}
+	return NULL;
+}
+
+int translate_desc(struct vhost_dev *dev, u64 addr, u32 len,
+		   struct iovec iov[], int iov_size)
+{
+	const struct vhost_memory_region *reg;
+	struct vhost_memory *mem;
+	struct iovec *_iov;
+	u64 s = 0;
+	int ret = 0;
+
+	rcu_read_lock();
+
+	mem = rcu_dereference(dev->memory);
+	while ((u64)len > s) {
+		u64 size;
+		if (ret >= iov_size) {
+			ret = -ENOBUFS;
+			break;
+		}
+		reg = find_region(mem, addr, len);
+		if (!reg) {
+			ret = -EFAULT;
+			break;
+		}
+		_iov = iov + ret;
+		size = reg->memory_size - addr + reg->guest_phys_addr;
+		_iov->iov_len = min((u64)len, size);
+		_iov->iov_base = (void *)
+			(reg->userspace_addr + addr - reg->guest_phys_addr);
+		s += size;
+		addr += size;
+		++ret;
+	}
+
+	rcu_read_unlock();
+	return ret;
+}
+
+/* Each buffer in the virtqueues is actually a chain of descriptors.  This
+ * function returns the next descriptor in the chain, or vq->vring.num if we're
+ * at the end. */
+static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
+{
+	unsigned int next;
+
+	/* If this descriptor says it doesn't chain, we're done. */
+	if (!(desc->flags & VRING_DESC_F_NEXT))
+		return vq->num;
+
+	/* Check they're not leading us off end of descriptors. */
+	next = desc->next;
+	/* Make sure compiler knows to grab that: we don't want it changing! */
+	/* We will use the result as an index in an array, so most
+	 * architectures only need a compiler barrier here. */
+	read_barrier_depends();
+
+	if (next >= vq->num) {
+		vq_err(vq, "Desc next is %u > %u", next, vq->num);
+		return vq->num;
+	}
+
+	return next;
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access.  Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which
+ * is never a valid descriptor number) if none was found. */
+unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+			   struct iovec iov[],
+			   unsigned int *out_num, unsigned int *in_num)
+{
+	struct vring_desc desc;
+	unsigned int i, head;
+	u16 last_avail_idx;
+	int ret;
+
+	/* Check it isn't doing very strange things with descriptor numbers. */
+	last_avail_idx = vq->last_avail_idx;
+	if (get_user(vq->avail_idx, &vq->avail->idx)) {
+		vq_err(vq, "Failed to access avail idx at %p\n",
+		       &vq->avail->idx);
+		return vq->num;
+	}
+
+	if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
+		vq_err(vq, "Guest moved used index from %u to %u",
+		       last_avail_idx, vq->avail_idx);
+		return vq->num;
+	}
+
+	/* If there's nothing new since last we looked, return invalid. */
+	if (vq->avail_idx == last_avail_idx)
+		return vq->num;
+
+	/* Grab the next descriptor number they're advertising, and increment
+	 * the index we've seen. */
+	if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
+		vq_err(vq, "Failed to read head: idx %d address %p\n",
+		       last_avail_idx,
+		       &vq->avail->ring[last_avail_idx % vq->num]);
+		return vq->num;
+	}
+
+	/* If their number is silly, that's an error. */
+	if (head >= vq->num) {
+		vq_err(vq, "Guest says index %u > %u is available",
+		       head, vq->num);
+		return vq->num;
+	}
+
+	vq->last_avail_idx++;
+
+	/* When we start there are none of either input nor output. */
+	*out_num = *in_num = 0;
+
+	i = head;
+	do {
+		unsigned iov_count = *in_num + *out_num;
+		if (copy_from_user(&desc, vq->desc + i, sizeof desc)) {
+			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
+			       i, vq->desc + i);
+			return vq->num;
+		}
+		ret = translate_desc(dev, desc.addr, desc.len, iov + iov_count,
+				     VHOST_NET_MAX_SG - iov_count);
+		if (ret < 0) {
+			vq_err(vq, "Translation failure %d descriptor idx %d\n",
+			       ret, i);
+			return vq->num;
+		}
+		/* If this is an input descriptor, increment that count. */
+		if (desc.flags & VRING_DESC_F_WRITE)
+			*in_num += ret;
+		else {
+			/* If it's an output descriptor, they're all supposed
+			 * to come before any input descriptors. */
+			if (*in_num) {
+				vq_err(vq, "Descriptor has out after in: "
+				       "idx %d\n", i);
+				return vq->num;
+			}
+			*out_num += ret;
+		}
+	} while ((i = next_desc(vq, &desc)) != vq->num);
+	return head;
+}
+
+/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
+void vhost_discard_vq_desc(struct vhost_virtqueue *vq)
+{
+	vq->last_avail_idx--;
+}
+
+/* After we've used one of their buffers, we tell them about it.  We'll then
+ * want to send them an interrupt, using vq->call. */
+int vhost_add_used(struct vhost_virtqueue *vq,
+			  unsigned int head, int len)
+{
+	struct vring_used_elem *used;
+
+	/* The virtqueue contains a ring of used buffers.  Get a pointer to the
+	 * next entry in that used ring. */
+	used = &vq->used->ring[vq->last_used_idx % vq->num];
+	if (put_user(head, &used->id)) {
+		vq_err(vq, "Failed to write used id");
+		return -EFAULT;
+	}
+	if (put_user(len, &used->len)) {
+		vq_err(vq, "Failed to write used len");
+		return -EFAULT;
+	}
+	/* Make sure buffer is written before we update index. */
+	wmb();
+	if (put_user(vq->last_used_idx + 1, &vq->used->idx)) {
+		vq_err(vq, "Failed to increment used idx");
+		return -EFAULT;
+	}
+	vq->last_used_idx++;
+	return 0;
+}
+
+/* This actually sends the interrupt for this virtqueue */
+void vhost_trigger_irq(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+	__u16 flags = 0;
+	if (get_user(flags, &vq->avail->flags)) {
+		vq_err(vq, "Failed to get flags");
+		return;
+	}
+
+	/* If they don't want an interrupt, don't send one, unless empty. */
+	if ((flags & VRING_AVAIL_F_NO_INTERRUPT) &&
+	    (!vhost_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY) ||
+	     vq->avail_idx != vq->last_avail_idx))
+		return;
+
+	/* Send the Guest an interrupt to tell them we used something up. */
+	if (vq->call_ctx)
+		eventfd_signal(vq->call_ctx, 1);
+}
+
+/* And here's the combo meal deal.  Supersize me! */
+void vhost_add_used_and_trigger(struct vhost_dev *dev,
+				struct vhost_virtqueue *vq,
+				unsigned int head, int len)
+{
+	vhost_add_used(vq, head, len);
+	vhost_trigger_irq(dev, vq);
+}
+
+/* OK, now we need to know about added descriptors. */
+bool vhost_notify(struct vhost_virtqueue *vq)
+{
+	int r;
+	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
+		return false;
+	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
+	r = put_user(vq->used_flags, &vq->used->flags);
+	if (r)
+		vq_err(vq, "Failed to disable notification: %d\n", r);
+	/* They could have slipped one in as we were doing that: make
+	 * sure it's written, tell caller it needs to check again. */
+	mb();
+	return true;
+}
+
+/* We don't need to be notified again. */
+void vhost_no_notify(struct vhost_virtqueue *vq)
+{
+	int r;
+	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
+		return;
+	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
+	r = put_user(vq->used_flags, &vq->used->flags);
+	if (r)
+		vq_err(vq, "Failed to enable notification: %d\n", r);
+}
+
+int vhost_init(void)
+{
+	vhost_workqueue = create_workqueue("vhost");
+	if (!vhost_workqueue)
+		return -ENOMEM;
+	return 0;
+}
+
+void vhost_cleanup(void)
+{
+	destroy_workqueue(vhost_workqueue);
+}
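
For readers following the control flow, here is a minimal sketch (not part of the patch; example_drain_vq and the elided socket hand-off are hypothetical) of how a backend is expected to drive the helpers above.  A real handle_kick work function would recover its vhost_virtqueue via container_of() and then do roughly this:

	/* Hypothetical backend helper; only functions declared in vhost.h
	 * below are used. */
	static void example_drain_vq(struct vhost_dev *dev,
				     struct vhost_virtqueue *vq)
	{
		unsigned int out, in, head;

		/* Suppress further guest kicks while we drain the ring. */
		vhost_no_notify(vq);

		for (;;) {
			head = vhost_get_vq_desc(dev, vq, vq->iov, &out, &in);
			/* vq->num means "nothing there (or error)". */
			if (head == vq->num) {
				/* Re-enable notification and check once more
				 * in case a buffer slipped in meanwhile. */
				if (vhost_notify(vq)) {
					vhost_no_notify(vq);
					continue;
				}
				break;
			}

			/* ... hand vq->iov[0 .. out + in) to the backend
			 * socket here ... */

			/* Publish the used buffer; interrupt the guest unless
			 * it opted out via VRING_AVAIL_F_NO_INTERRUPT. */
			vhost_add_used_and_trigger(dev, vq, head, 0 /* len */);
		}
	}

The point is the notification dance: guest kicks are masked while the ring is being drained, and only re-armed, with one extra check, once the ring looks empty.
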
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
new file mode 100644
index 0000000..8e13d06
--- /dev/null
+++ b/drivers/vhost/vhost.h
@@ -0,0 +1,122 @@ 
+#ifndef _VHOST_H
+#define _VHOST_H
+
+#include <linux/eventfd.h>
+#include <linux/vhost.h>
+#include <linux/mm.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/poll.h>
+#include <linux/file.h>
+#include <linux/skbuff.h>
+#include <linux/uio.h>
+#include <linux/virtio_config.h>
+
+struct vhost_device;
+
+enum {
+	VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
+};
+
+/* Poll a file (eventfd or socket) */
+/* Note: there's nothing vhost specific about this structure. */
+struct vhost_poll {
+	poll_table                table;
+	wait_queue_head_t        *wqh;
+	wait_queue_t              wait;
+	/* The work_struct that handles all the actual work. */
+	struct work_struct        work;
+	unsigned long		  mask;
+};
+
+void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
+		     unsigned long mask);
+void vhost_poll_start(struct vhost_poll *poll, struct file *file);
+void vhost_poll_stop(struct vhost_poll *poll);
+void vhost_poll_flush(struct vhost_poll *poll);
+
+/* The virtqueue structure describes a queue attached to a device. */
+struct vhost_virtqueue {
+	struct vhost_dev *dev;
+
+	/* The actual ring of buffers. */
+	struct mutex mutex;
+	unsigned int num;
+	struct vring_desc __user *desc;
+	struct vring_avail __user *avail;
+	struct vring_used __user *used;
+	struct file *kick;
+	struct file *call;
+	struct file *error;
+	struct eventfd_ctx *call_ctx;
+	struct eventfd_ctx *error_ctx;
+
+	struct vhost_poll poll;
+
+	/* The routine to call when the Guest pings us, or on timeout. */
+	work_func_t handle_kick;
+
+	/* Last available index we saw. */
+	u16 last_avail_idx;
+
+	/* Caches available index value from user. */
+	u16 avail_idx;
+
+	/* Last index we used. */
+	u16 last_used_idx;
+
+	/* Used flags */
+	u16 used_flags;
+
+	struct iovec iov[VHOST_NET_MAX_SG];
+	struct iovec hdr[VHOST_NET_MAX_SG];
+};
+
+struct vhost_dev {
+	/* Readers use RCU to access the memory table pointer.
+	 * Writers take the mutex below. */
+	struct vhost_memory *memory;
+	struct mm_struct *mm;
+	struct vhost_virtqueue *vqs;
+	int nvqs;
+	struct mutex mutex;
+	unsigned acked_features;
+};
+
+long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
+long vhost_dev_check_owner(struct vhost_dev *);
+long vhost_dev_reset_owner(struct vhost_dev *);
+void vhost_dev_cleanup(struct vhost_dev *);
+long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, unsigned long arg);
+
+unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
+			   struct iovec iov[],
+			   unsigned int *out_num, unsigned int *in_num);
+void vhost_discard_vq_desc(struct vhost_virtqueue *);
+
+int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
+void vhost_trigger_irq(struct vhost_dev *, struct vhost_virtqueue *);
+void vhost_add_used_and_trigger(struct vhost_dev *, struct vhost_virtqueue *,
+				unsigned int head, int len);
+void vhost_no_notify(struct vhost_virtqueue *);
+bool vhost_notify(struct vhost_virtqueue *);
+
+int vhost_init(void);
+void vhost_cleanup(void);
+
+#define vq_err(vq, fmt, ...) do {                                  \
+		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);              \
+		if ((vq)->error_ctx)                               \
+			eventfd_signal((vq)->error_ctx, 1);        \
+	} while (0)
+
+enum {
+	VHOST_FEATURES = 1 << VIRTIO_F_NOTIFY_ON_EMPTY,
+};
+
+static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
+{
+	return dev->acked_features & (1 << bit);
+}
+
+#endif
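
As a usage note for this header, the sketch below (not part of the patch; every example_* name is hypothetical) shows how a backend module might embed struct vhost_dev, plug in its handle_kick handlers and hook the init/cleanup entry points; generic vring and memory ioctls would presumably be forwarded to vhost_dev_ioctl().

	enum { EXAMPLE_VQ_RX, EXAMPLE_VQ_TX, EXAMPLE_VQ_MAX };

	struct example_backend {
		struct vhost_dev dev;
		struct vhost_virtqueue vqs[EXAMPLE_VQ_MAX];
	};

	/* Defined by the backend; run off the shared vhost workqueue
	 * whenever the guest signals the corresponding kick eventfd. */
	static void example_handle_rx_kick(struct work_struct *work);
	static void example_handle_tx_kick(struct work_struct *work);

	static long example_open(struct example_backend *b)
	{
		b->vqs[EXAMPLE_VQ_RX].handle_kick = example_handle_rx_kick;
		b->vqs[EXAMPLE_VQ_TX].handle_kick = example_handle_tx_kick;
		return vhost_dev_init(&b->dev, b->vqs, EXAMPLE_VQ_MAX);
	}

	static void example_release(struct example_backend *b)
	{
		vhost_dev_cleanup(&b->dev);
	}

	static int __init example_init(void)
	{
		return vhost_init();	/* creates the shared workqueue */
	}

	static void __exit example_exit(void)
	{
		vhost_cleanup();
	}
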
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index dec2f18..975df9a 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -360,6 +360,7 @@  unifdef-y += uio.h
 unifdef-y += unistd.h
 unifdef-y += usbdevice_fs.h
 unifdef-y += utsname.h
+unifdef-y += vhost.h
 unifdef-y += videodev2.h
 unifdef-y += videodev.h
 unifdef-y += virtio_config.h
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index 0521177..781a8bb 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -30,6 +30,7 @@ 
 #define HPET_MINOR		228
 #define FUSE_MINOR		229
 #define KVM_MINOR		232
+#define VHOST_NET_MINOR		233
 #define MISC_DYNAMIC_MINOR	255
 
 struct device;
diff --git a/include/linux/vhost.h b/include/linux/vhost.h
new file mode 100644
index 0000000..3f441a9
--- /dev/null
+++ b/include/linux/vhost.h
@@ -0,0 +1,101 @@ 
+#ifndef _LINUX_VHOST_H
+#define _LINUX_VHOST_H
+/* Userspace interface for in-kernel virtio accelerators. */
+
+/* vhost is used to reduce the number of system calls involved in virtio.
+ *
+ * Existing virtio net code is used in the guest without modification.
+ *
+ * This header includes the interface used by the userspace hypervisor
+ * for device configuration.
+ */
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+#include <linux/ioctl.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_ring.h>
+
+struct vhost_vring_state {
+	unsigned int index;
+	unsigned int num;
+};
+
+struct vhost_vring_file {
+	unsigned int index;
+	int fd;
+};
+
+struct vhost_vring_addr {
+	unsigned int index;
+	unsigned int padding;
+	__u64 user_addr;
+};
+
+struct vhost_memory_region {
+	__u64 guest_phys_addr;
+	__u64 memory_size; /* bytes */
+	__u64 userspace_addr;
+	__u64 padding; /* read/write protection? */
+};
+
+struct vhost_memory {
+	__u32 nregions;
+	__u32 padding;
+	struct vhost_memory_region regions[0];
+};
+
+/* ioctls */
+
+#define VHOST_VIRTIO 0xAF
+
+/* Features bitmask for forward compatibility.  Transport bits are used for
+ * vhost specific features. */
+#define VHOST_GET_FEATURES	_IOR(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_ACK_FEATURES	_IOW(VHOST_VIRTIO, 0x00, __u64)
+
+/* Set current process as the (exclusive) owner of this file descriptor.  This
+ * must be called before any other vhost command.  Further calls to
+ * VHOST_SET_OWNER fail until VHOST_RESET_OWNER is called. */
+#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
+/* Give up ownership, and reset the device to default values.
+ * Allows a subsequent call to VHOST_SET_OWNER to succeed. */
+#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
+
+/* Set up/modify memory layout */
+#define VHOST_SET_MEM_TABLE	_IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
+
+/* Ring setup.  These parameters cannot be modified while the ring is running
+ * (bound to a device). */
+/* Set number of descriptors in ring */
+#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
+/* Start of array of descriptors (virtually contiguous) */
+#define VHOST_SET_VRING_DESC _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
+/* Used structure address */
+#define VHOST_SET_VRING_USED _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_addr)
+/* Available structure address */
+#define VHOST_SET_VRING_AVAIL _IOW(VHOST_VIRTIO, 0x13, struct vhost_vring_addr)
+/* Base value where queue looks for available descriptors */
+#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
+/* Get accessor: reads index, writes value in num */
+#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x14, struct vhost_vring_state)
+
+/* The following ioctls use eventfd file descriptors to signal and poll
+ * for events. */
+
+/* Set eventfd to poll for added buffers */
+#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
+/* Set eventfd to signal when buffers have been used */
+#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
+/* Set eventfd to signal an error */
+#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
+
+/* VHOST_NET specific defines */
+
+/* Attach virtio net device to a raw socket. The socket must already be
+ * bound to an ethernet device; this device will be used for transmit.
+ * Pass -1 to unbind from the socket and the transmit device.
+ * This can be used to stop the device (e.g. for migration). */
+#define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, int)
+
+#endif
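
To tie the ioctl interface together, here is an illustrative userspace sketch (not part of the patch).  The single flat memory region, the skipped eventfd error handling and the pointer-passing convention assumed for VHOST_NET_SET_SOCKET are simplifications for the example; the vhost fd is presumably obtained by opening the misc device that net.c registers under VHOST_NET_MINOR.

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <sys/eventfd.h>
	#include <linux/vhost.h>	/* pulls in linux/virtio_ring.h */

	/* Bring up vring 0 of a vhost-net device.  guest_mem maps the
	 * guest's physical memory starting at guest physical address 0;
	 * vr describes a virtio ring laid out inside that mapping;
	 * raw_sock_fd is already bound to the ethernet device. */
	static int example_setup(int vhost_fd, int raw_sock_fd,
				 void *guest_mem, uint64_t guest_mem_size,
				 struct vring *vr)
	{
		struct {
			struct vhost_memory mem;
			struct vhost_memory_region region;
		} m = {
			.mem    = { .nregions = 1 },
			.region = {
				.guest_phys_addr = 0,
				.memory_size     = guest_mem_size,
				.userspace_addr  = (uintptr_t)guest_mem,
			},
		};
		struct vhost_vring_state num   = { .index = 0, .num = vr->num };
		struct vhost_vring_state base  = { .index = 0, .num = 0 };
		struct vhost_vring_addr  desc  = { .index = 0, .user_addr = (uintptr_t)vr->desc };
		struct vhost_vring_addr  avail = { .index = 0, .user_addr = (uintptr_t)vr->avail };
		struct vhost_vring_addr  used  = { .index = 0, .user_addr = (uintptr_t)vr->used };
		struct vhost_vring_file  kick  = { .index = 0, .fd = eventfd(0, 0) };
		struct vhost_vring_file  call  = { .index = 0, .fd = eventfd(0, 0) };
		uint64_t features = 0;

		if (ioctl(vhost_fd, VHOST_SET_OWNER) ||
		    ioctl(vhost_fd, VHOST_GET_FEATURES, &features) ||
		    ioctl(vhost_fd, VHOST_ACK_FEATURES, &features) ||
		    ioctl(vhost_fd, VHOST_SET_MEM_TABLE, &m) ||
		    ioctl(vhost_fd, VHOST_SET_VRING_NUM, &num) ||
		    ioctl(vhost_fd, VHOST_SET_VRING_DESC, &desc) ||
		    ioctl(vhost_fd, VHOST_SET_VRING_AVAIL, &avail) ||
		    ioctl(vhost_fd, VHOST_SET_VRING_USED, &used) ||
		    ioctl(vhost_fd, VHOST_SET_VRING_BASE, &base) ||
		    ioctl(vhost_fd, VHOST_SET_VRING_KICK, &kick) ||
		    ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call) ||
		    /* Assumption: the _IOW(..., int) encoding implies the
		     * socket fd is passed by pointer. */
		    ioctl(vhost_fd, VHOST_NET_SET_SOCKET, &raw_sock_fd))
			return -1;
		return 0;
	}

From there the guest's unmodified virtio net driver fills the avail ring and signals the kick eventfd; to stop the device (e.g. for migration), pass -1 to VHOST_NET_SET_SOCKET and, if ownership should move, call VHOST_RESET_OWNER.
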