[v6,10/10] Documentation: Add documentation for VDUSE

Message ID 20210331080519.172-11-xieyongji@bytedance.com (mailing list archive)
State Not Applicable
Series Introduce VDUSE - vDPA Device in Userspace

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Yongji Xie March 31, 2021, 8:05 a.m. UTC
VDUSE (vDPA Device in Userspace) is a framework to support
implementing software-emulated vDPA devices in userspace. This
document is intended to clarify the VDUSE design and usage.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 Documentation/userspace-api/index.rst |   1 +
 Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
 2 files changed, 213 insertions(+)
 create mode 100644 Documentation/userspace-api/vduse.rst

Comments

Jason Wang April 8, 2021, 7:18 a.m. UTC | #1
On 2021/3/31 at 4:05 PM, Xie Yongji wrote:
> VDUSE (vDPA Device in Userspace) is a framework to support
> implementing software-emulated vDPA devices in userspace. This
> document is intended to clarify the VDUSE design and usage.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/userspace-api/index.rst |   1 +
>   Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
>   2 files changed, 213 insertions(+)
>   create mode 100644 Documentation/userspace-api/vduse.rst
>
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index acd2cc2a538d..f63119130898 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -24,6 +24,7 @@ place where this information is gathered.
>      ioctl/index
>      iommu
>      media/index
> +   vduse
>   
>   .. only::  subproject and html
>   
> diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
> new file mode 100644
> index 000000000000..8c4e2b2df8bb
> --- /dev/null
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -0,0 +1,212 @@
> +==================================
> +VDUSE - "vDPA Device in Userspace"
> +==================================
> +
> +A vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications together with a
> +vendor-specific control path. vDPA devices can be either physically located
> +on the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace.
> +
> +How VDUSE works
> +---------------
> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> +the character device (/dev/vduse/control). Then a device file with the
> +specified name (/dev/vduse/$NAME) will appear, which can be used to
> +implement the userspace vDPA device's control path and data path.
> +
> +To implement control path, a message-based communication protocol and some
> +types of control messages are introduced in the VDUSE framework:
> +
> +- VDUSE_SET_VQ_ADDR: Set the vring address of virtqueue
> +
> +- VDUSE_SET_VQ_NUM: Set the size of virtqueue
> +
> +- VDUSE_SET_VQ_READY: Set ready status of virtqueue
> +
> +- VDUSE_GET_VQ_READY: Get ready status of virtqueue
> +
> +- VDUSE_SET_VQ_STATE: Set the state for virtqueue
> +
> +- VDUSE_GET_VQ_STATE: Get the state for virtqueue
> +
> +- VDUSE_SET_FEATURES: Set virtio features supported by the driver
> +
> +- VDUSE_GET_FEATURES: Get virtio features supported by the device
> +
> +- VDUSE_SET_STATUS: Set the device status
> +
> +- VDUSE_GET_STATUS: Get the device status
> +
> +- VDUSE_SET_CONFIG: Write to device specific configuration space
> +
> +- VDUSE_GET_CONFIG: Read from device specific configuration space
> +
> +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> +
> +Those control messages are mostly based on the vdpa_config_ops in
> +include/linux/vdpa.h which defines a unified interface to control
> +different types of vdpa device. Userspace needs to read()/write()
> +on the VDUSE device file to receive/reply those control messages
> +from/to VDUSE kernel module as follows:
> +
> +.. code-block:: c
> +
> +	static int vduse_message_handler(int dev_fd)
> +	{
> +		int len;
> +		struct vduse_dev_request req;
> +		struct vduse_dev_response resp;
> +
> +		len = read(dev_fd, &req, sizeof(req));
> +		if (len != sizeof(req))
> +			return -1;
> +
> +		resp.request_id = req.request_id;
> +
> +		switch (req.type) {
> +
> +		/* handle different types of message */
> +
> +		}
> +
> +		len = write(dev_fd, &resp, sizeof(resp));
> +		if (len != sizeof(resp))
> +			return -1;
> +
> +		return 0;
> +	}
> +
> +In the data path, vDPA device's iova regions will be mapped into userspace
> +with the help of VDUSE_IOTLB_GET_FD ioctl on the VDUSE device file:
> +
> +- VDUSE_IOTLB_GET_FD: get the file descriptor to the first overlapped iova region.
> +  Userspace can access this iova region by passing fd and corresponding size, offset,
> +  perm to mmap(). For example:
> +
> +.. code-block:: c
> +
> +	static int perm_to_prot(uint8_t perm)
> +	{
> +		int prot = 0;
> +
> +		switch (perm) {
> +		case VDUSE_ACCESS_WO:
> +			prot |= PROT_WRITE;
> +			break;
> +		case VDUSE_ACCESS_RO:
> +			prot |= PROT_READ;
> +			break;
> +		case VDUSE_ACCESS_RW:
> +			prot |= PROT_READ | PROT_WRITE;
> +			break;
> +		}
> +
> +		return prot;
> +	}
> +
> +	static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> +	{
> +		int fd;
> +		void *addr;
> +		size_t size;
> +		struct vduse_iotlb_entry entry;
> +
> +		entry.start = iova;
> +		entry.last = iova + 1;
> +		fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
> +		if (fd < 0)
> +			return NULL;
> +
> +		size = entry.last - entry.start + 1;
> +		*len = entry.last - iova + 1;
> +		addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
> +			    fd, entry.offset);
> +		close(fd);
> +		if (addr == MAP_FAILED)
> +			return NULL;
> +
> +		/* do something to cache this iova region */
> +
> +		return addr + iova - entry.start;
> +	}
> +
> +Besides, the following ioctls on the VDUSE device file are provided to support
> +interrupt injection and setting up eventfd for virtqueue kicks:
> +
> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> +  by VDUSE kernel module to notify userspace to consume the vring.
> +
> +- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue
> +
> +- VDUSE_INJECT_CONFIG_IRQ: inject a config interrupt
> +
> +Register VDUSE device on vDPA bus
> +---------------------------------
> +To make the VDUSE device work, the administrator needs to use the management
> +API (netlink) to register it on the vDPA bus. Sample code is shown below:
> +
> +.. code-block:: c
> +
> +	static int netlink_add_vduse(const char *name, int device_id)
> +	{
> +		struct nl_sock *nlsock;
> +		struct nl_msg *msg;
> +		int famid;
> +
> +		nlsock = nl_socket_alloc();
> +		if (!nlsock)
> +			return -ENOMEM;
> +
> +		if (genl_connect(nlsock))
> +			goto free_sock;
> +
> +		famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> +		if (famid < 0)
> +			goto close_sock;
> +
> +		msg = nlmsg_alloc();
> +		if (!msg)
> +			goto close_sock;
> +
> +		if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> +		    VDPA_CMD_DEV_NEW, 0))
> +			goto nla_put_failure;
> +
> +		NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> +		NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> +		NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
> +
> +		if (nl_send_sync(nlsock, msg))
> +			goto close_sock;
> +
> +		nl_close(nlsock);
> +		nl_socket_free(nlsock);
> +
> +		return 0;
> +	nla_put_failure:
> +		nlmsg_free(msg);
> +	close_sock:
> +		nl_close(nlsock);
> +	free_sock:
> +		nl_socket_free(nlsock);
> +		return -1;
> +	}


Let's also explain that this can be done via the vdpa tool in iproute2.

Otherwise

Acked-by: Jason Wang <jasowang@redhat.com>


> +
> +MMU-based IOMMU Driver
> +----------------------
> +VDUSE framework implements an MMU-based on-chip IOMMU driver to support
> +mapping the kernel DMA buffer into the userspace iova region dynamically.
> +This is mainly designed for virtio-vdpa case (kernel virtio drivers).
> +
> +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> +so that the userspace process is able to use its virtual address to access
> +the DMA buffer in kernel.
> +
> +To avoid security issues, a bounce-buffering mechanism is introduced to prevent
> +userspace from directly accessing the original buffer, which may contain other
> +kernel data. During mapping and unmapping, the driver will copy the data from
> +the original buffer to the bounce buffer and back, depending on the direction of
> +the transfer. The bounce-buffer addresses will be mapped into the user address
> +space instead of the original ones.
Yongji Xie April 8, 2021, 8:09 a.m. UTC | #2
On Thu, Apr 8, 2021 at 3:18 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/3/31 at 4:05 PM, Xie Yongji wrote:
>
>
> Let's also explain that this can be done via the vdpa tool in iproute2.
>

Sure.

Thanks,
Yongji
Stefan Hajnoczi April 14, 2021, 2:14 p.m. UTC | #3
On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> VDUSE (vDPA Device in Userspace) is a framework to support
> implementing software-emulated vDPA devices in userspace. This
> document is intended to clarify the VDUSE design and usage.
> 
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>  Documentation/userspace-api/index.rst |   1 +
>  Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
>  2 files changed, 213 insertions(+)
>  create mode 100644 Documentation/userspace-api/vduse.rst

Just looking over the documentation briefly (I haven't studied the code
yet)...

> +How VDUSE works
> +---------------
> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> +the character device (/dev/vduse/control). Then a device file with the
> +specified name (/dev/vduse/$NAME) will appear, which can be used to
> +implement the userspace vDPA device's control path and data path.

Are these steps taken after sending the VDPA_CMD_DEV_NEW netlink
message? (Please consider reordering the documentation to make it clear
what the sequence of steps is.)

> +	static int netlink_add_vduse(const char *name, int device_id)
> +	{
> +		struct nl_sock *nlsock;
> +		struct nl_msg *msg;
> +		int famid;
> +
> +		nlsock = nl_socket_alloc();
> +		if (!nlsock)
> +			return -ENOMEM;
> +
> +		if (genl_connect(nlsock))
> +			goto free_sock;
> +
> +		famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> +		if (famid < 0)
> +			goto close_sock;
> +
> +		msg = nlmsg_alloc();
> +		if (!msg)
> +			goto close_sock;
> +
> +		if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> +		    VDPA_CMD_DEV_NEW, 0))
> +			goto nla_put_failure;
> +
> +		NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> +		NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> +		NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);

What are the permission/capability requirements for VDUSE?

How does VDUSE interact with namespaces?

What is the meaning of VDPA_ATTR_DEV_ID? I don't see it in Linux
v5.12-rc6 drivers/vdpa/vdpa.c:vdpa_nl_cmd_dev_add_set_doit().

> +MMU-based IOMMU Driver
> +----------------------
> +VDUSE framework implements an MMU-based on-chip IOMMU driver to support
> +mapping the kernel DMA buffer into the userspace iova region dynamically.
> +This is mainly designed for virtio-vdpa case (kernel virtio drivers).
> +
> +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> +so that the userspace process is able to use its virtual address to access
> +the DMA buffer in kernel.
> +
> +To avoid security issues, a bounce-buffering mechanism is introduced to prevent
> +userspace from directly accessing the original buffer, which may contain other
> +kernel data. During mapping and unmapping, the driver will copy the data from
> +the original buffer to the bounce buffer and back, depending on the direction of
> +the transfer. The bounce-buffer addresses will be mapped into the user address
> +space instead of the original ones.

Is mmap(2) the right interface if memory is not actually shared? Why not
just use pread(2)/pwrite(2) to make the copy explicit? That way the copy
semantics are clear. For example, don't expect to be able to busy wait
on the memory because changes will not be visible to the other side.

(I guess I'm missing something here and that mmap(2) is the right
approach, but maybe this documentation section can be clarified.)
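
To make the copy semantics in question concrete, here is a minimal
userspace model of bounce buffering as the section describes it. It is only
a sketch: the demo_* names, the direction enum, and the use of
calloc()/memcpy() are assumptions for illustration and are not taken from
the VDUSE kernel code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout: a userspace model of bounce buffering,
 * not the VDUSE kernel implementation. */
enum demo_dir { DEMO_TO_DEVICE, DEMO_FROM_DEVICE, DEMO_BIDIRECTIONAL };

struct demo_bounce {
	void *orig;		/* driver's buffer, never exposed to the device */
	void *bounce;		/* private copy that the device side may access */
	size_t len;
	enum demo_dir dir;
};

/* "Map": allocate a bounce buffer and copy in data the device must read. */
static int demo_bounce_map(struct demo_bounce *b, void *orig, size_t len,
			   enum demo_dir dir)
{
	b->bounce = calloc(1, len);
	if (!b->bounce)
		return -1;

	b->orig = orig;
	b->len = len;
	b->dir = dir;

	/* Only transfers towards the device need the original contents. */
	if (dir != DEMO_FROM_DEVICE)
		memcpy(b->bounce, orig, len);

	return 0;
}

/* "Unmap": copy back data the device wrote, then release the bounce buffer. */
static void demo_bounce_unmap(struct demo_bounce *b)
{
	if (b->dir != DEMO_TO_DEVICE)
		memcpy(b->orig, b->bounce, b->len);

	free(b->bounce);
	b->bounce = NULL;
}
```

In this model a device-side write to b->bounce does not reach the original
buffer until demo_bounce_unmap() runs, which is the reason busy-waiting on
the mapped memory would not observe updates from the other side.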
Yongji Xie April 15, 2021, 5:38 a.m. UTC | #4
On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> > VDUSE (vDPA Device in Userspace) is a framework to support
> > implementing software-emulated vDPA devices in userspace. This
> > document is intended to clarify the VDUSE design and usage.
> >
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >  Documentation/userspace-api/index.rst |   1 +
> >  Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
> >  2 files changed, 213 insertions(+)
> >  create mode 100644 Documentation/userspace-api/vduse.rst
>
> Just looking over the documentation briefly (I haven't studied the code
> yet)...
>

Thank you!

> > +How VDUSE works
> > +---------------
> > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > +the character device (/dev/vduse/control). Then a device file with the
> > +specified name (/dev/vduse/$NAME) will appear, which can be used to
> > +implement the userspace vDPA device's control path and data path.
>
> Are these steps taken after sending the VDPA_CMD_DEV_NEW netlink
> message? (Please consider reordering the documentation to make it clear
> what the sequence of steps is.)
>

No, VDUSE devices should be created before sending the
VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.

> > +     static int netlink_add_vduse(const char *name, int device_id)
> > +     {
> > +             struct nl_sock *nlsock;
> > +             struct nl_msg *msg;
> > +             int famid;
> > +
> > +             nlsock = nl_socket_alloc();
> > +             if (!nlsock)
> > +                     return -ENOMEM;
> > +
> > +             if (genl_connect(nlsock))
> > +                     goto free_sock;
> > +
> > +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > +             if (famid < 0)
> > +                     goto close_sock;
> > +
> > +             msg = nlmsg_alloc();
> > +             if (!msg)
> > +                     goto close_sock;
> > +
> > +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> > +                 VDPA_CMD_DEV_NEW, 0))
> > +                     goto nla_put_failure;
> > +
> > +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> > +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
>
> What are the permission/capability requirements for VDUSE?
>

For now I think we need privileged permission (root user), because the
userspace daemon is able to directly access the avail vring, used vring,
and descriptor table in the kernel driver.

> How does VDUSE interact with namespaces?
>

Not sure I get your point here. Do you mean how the emulated vDPA
device interacts with namespaces? This should work like hardware vDPA
devices do. The VDUSE daemon can reside outside the namespace of a
container which uses the vDPA device.

> What is the meaning of VDPA_ATTR_DEV_ID? I don't see it in Linux
> v5.12-rc6 drivers/vdpa/vdpa.c:vdpa_nl_cmd_dev_add_set_doit().
>

It means the device id (e.g. VIRTIO_ID_BLOCK) of the vDPA device and
can be found in include/uapi/linux/vdpa.h.

> > +MMU-based IOMMU Driver
> > +----------------------
> > +VDUSE framework implements an MMU-based on-chip IOMMU driver to support
> > +mapping the kernel DMA buffer into the userspace iova region dynamically.
> > +This is mainly designed for virtio-vdpa case (kernel virtio drivers).
> > +
> > +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> > +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> > +so that the userspace process is able to use its virtual address to access
> > +the DMA buffer in kernel.
> > +
> > +To avoid security issues, a bounce-buffering mechanism is introduced to prevent
> > +userspace from directly accessing the original buffer, which may contain other
> > +kernel data. During mapping and unmapping, the driver will copy the data from
> > +the original buffer to the bounce buffer and back, depending on the direction of
> > +the transfer. The bounce-buffer addresses will be mapped into the user address
> > +space instead of the original ones.
>
> Is mmap(2) the right interface if memory is not actually shared? Why not
> just use pread(2)/pwrite(2) to make the copy explicit? That way the copy
> semantics are clear. For example, don't expect to be able to busy wait
> on the memory because changes will not be visible to the other side.
>
> (I guess I'm missing something here and that mmap(2) is the right
> approach, but maybe this documentation section can be clarified.)

It's for performance considerations on the one hand. We might need to
call pread(2)/pwrite(2) multiple times for each request. On the other
hand, we can handle the virtqueue in a unified way for both vhost-vdpa
case and virtio-vdpa case. Otherwise, userspace daemon needs to know
which iova ranges need to be accessed with pread(2)/pwrite(2). And in
the future, we might be able to avoid bouncing in some cases.

Thanks,
Yongji
Stefan Hajnoczi April 15, 2021, 7:19 a.m. UTC | #5
On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
> On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> > > VDUSE (vDPA Device in Userspace) is a framework to support
> > > implementing software-emulated vDPA devices in userspace. This
> > > document is intended to clarify the VDUSE design and usage.
> > >
> > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > ---
> > >  Documentation/userspace-api/index.rst |   1 +
> > >  Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
> > >  2 files changed, 213 insertions(+)
> > >  create mode 100644 Documentation/userspace-api/vduse.rst
> >
> > Just looking over the documentation briefly (I haven't studied the code
> > yet)...
> >
> 
> Thank you!
> 
> > > +How VDUSE works
> > > +---------------
> > > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > > +the character device (/dev/vduse/control). Then a device file with the
> > > +specified name (/dev/vduse/$NAME) will appear, which can be used to
> > > +implement the userspace vDPA device's control path and data path.
> >
> > Are these steps taken after sending the VDPA_CMD_DEV_NEW netlink
> > message? (Please consider reordering the documentation to make it clear
> > what the sequence of steps is.)
> >
> 
> No, VDUSE devices should be created before sending the
> VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.

I see. Please include an overview of the steps before going into detail.
Something like:

  VDUSE devices are started as follows:

  1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
     /dev/vduse/control.

  2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
     messages will arrive while attaching the VDUSE instance to vDPA.

  3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
     instance to vDPA.

  VDUSE devices are stopped as follows:

  ...
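
Steps 2 and 3 of the start-up order above could be modelled as follows.
This is a sketch under stated assumptions: a socketpair stands in for
/dev/vduse/$NAME, a forked child plays the device daemon, and the demo_*
structs and functions are hypothetical (the real message layout comes from
the VDUSE uapi header in this series). It only shows that the attach
stand-in blocks until a message handler answers.

```c
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical message layout, for illustration only. */
struct demo_req  { uint32_t request_id; uint32_t type; };
struct demo_resp { uint32_t request_id; uint32_t result; };

/* Step 2: serve requests until the peer closes its end. */
static void demo_serve(int dev_fd)
{
	struct demo_req req;
	struct demo_resp resp;

	while (read(dev_fd, &req, sizeof(req)) == (ssize_t)sizeof(req)) {
		resp.request_id = req.request_id;
		resp.result = 0;	/* pretend every request succeeds */
		if (write(dev_fd, &resp, sizeof(resp)) != (ssize_t)sizeof(resp))
			break;
	}
}

/* Step 3 stand-in: "attach" sends one request and waits for the reply. */
static int demo_attach(int kernel_fd)
{
	struct demo_req req = { .request_id = 1, .type = 0 };
	struct demo_resp resp;

	if (write(kernel_fd, &req, sizeof(req)) != (ssize_t)sizeof(req))
		return -1;
	if (read(kernel_fd, &resp, sizeof(resp)) != (ssize_t)sizeof(resp))
		return -1;

	return resp.request_id == req.request_id ? (int)resp.result : -1;
}

/* Start the handler first, then attach; returns 0 on success. */
static int demo_start(void)
{
	int sv[2], ret;
	pid_t pid;

	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv))
		return -1;

	pid = fork();
	if (pid < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}
	if (pid == 0) {			/* child: the device daemon */
		close(sv[1]);
		demo_serve(sv[0]);
		_exit(0);
	}

	close(sv[0]);
	ret = demo_attach(sv[1]);	/* blocks until the daemon replies */
	close(sv[1]);			/* daemon sees EOF and exits */
	waitpid(pid, NULL, 0);

	return ret;
}
```

If no handler were running, demo_attach() would block in read() forever,
which mirrors why message processing has to begin before the netlink
attach is sent.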

> > > +     static int netlink_add_vduse(const char *name, int device_id)
> > > +     {
> > > +             struct nl_sock *nlsock;
> > > +             struct nl_msg *msg;
> > > +             int famid;
> > > +
> > > +             nlsock = nl_socket_alloc();
> > > +             if (!nlsock)
> > > +                     return -ENOMEM;
> > > +
> > > +             if (genl_connect(nlsock))
> > > +                     goto free_sock;
> > > +
> > > +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > > +             if (famid < 0)
> > > +                     goto close_sock;
> > > +
> > > +             msg = nlmsg_alloc();
> > > +             if (!msg)
> > > +                     goto close_sock;
> > > +
> > > +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> > > +                 VDPA_CMD_DEV_NEW, 0))
> > > +                     goto nla_put_failure;
> > > +
> > > +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > > +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> > > +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
> >
> > What are the permission/capability requirements for VDUSE?
> >
> 
> For now I think we need privileged permission (root user), because the
> userspace daemon is able to directly access the avail vring, used vring,
> and descriptor table in the kernel driver.

Please state this explicitly at the start of the document. Existing
interfaces like FUSE are designed to avoid trusting userspace. Therefore
people might think the same is the case here. It's critical that people
are aware of this before deploying VDUSE with virtio-vdpa.

We should probably pause here and think about whether it's possible to
avoid trusting userspace. Even if it takes some effort and costs some
performance it would probably be worthwhile.

Is the security situation different with vhost-vdpa? In that case it
seems more likely that the host kernel doesn't need to trust the
userspace VDUSE device.

Regarding privileges in general: userspace VDUSE processes shouldn't
need to run as root. The VDUSE device lifecycle will require privileges
to attach vhost-vdpa and virtio-vdpa devices, but the actual userspace
process that emulates the device should be able to run unprivileged.
Emulated devices are an attack surface and even if you are comfortable
with running them as root in your specific use case, it will be an issue
as soon as other people want to use VDUSE and could give VDUSE a
reputation for poor security.

> > How does VDUSE interact with namespaces?
> >
> 
> Not sure I get your point here. Do you mean how the emulated vDPA
> device interacts with namespaces? This should work like hardware vDPA
> devices do. The VDUSE daemon can reside outside the namespace of a
> container which uses the vDPA device.

Can VDUSE devices run inside containers? Are /dev/vduse/$NAME and vDPA
device names global?

> > What is the meaning of VDPA_ATTR_DEV_ID? I don't see it in Linux
> > v5.12-rc6 drivers/vdpa/vdpa.c:vdpa_nl_cmd_dev_add_set_doit().
> >
> 
> It means the device id (e.g. VIRTIO_ID_BLOCK) of the vDPA device and
> can be found in include/uapi/linux/vdpa.h.

VDPA_ATTR_DEV_ID is only used by VDPA_CMD_DEV_GET in Linux v5.12-rc6,
not by VDPA_CMD_DEV_NEW.

The example in this document uses VDPA_ATTR_DEV_ID with
VDPA_CMD_DEV_NEW. Is the example outdated?

> 
> > > +MMU-based IOMMU Driver
> > > +----------------------
> > > +VDUSE framework implements an MMU-based on-chip IOMMU driver to support
> > > +mapping the kernel DMA buffer into the userspace iova region dynamically.
> > > +This is mainly designed for virtio-vdpa case (kernel virtio drivers).
> > > +
> > > +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> > > +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> > > +so that the userspace process is able to use its virtual address to access
> > > +the DMA buffer in kernel.
> > > +
> > > +And to avoid security issue, a bounce-buffering mechanism is introduced to
> > > +prevent userspace accessing the original buffer directly which may contain other
> > > +kernel data. During the mapping, unmapping, the driver will copy the data from
> > > +the original buffer to the bounce buffer and back, depending on the direction of
> > > +the transfer. And the bounce-buffer addresses will be mapped into the user address
> > > +space instead of the original one.
> >
> > Is mmap(2) the right interface if memory is not actually shared, why not
> > just use pread(2)/pwrite(2) to make the copy explicit? That way the copy
> > semantics are clear. For example, don't expect to be able to busy wait
> > on the memory because changes will not be visible to the other side.
> >
> > (I guess I'm missing something here and that mmap(2) is the right
> > approach, but maybe this documentation section can be clarified.)
> 
> It's for performance considerations on the one hand. We might need to
> call pread(2)/pwrite(2) multiple times for each request.

Userspace can keep page-sized pread() buffers around to avoid additional
syscalls during a request.

mmap() access does reduce the number of syscalls, but it also introduces
page faults (effectively doing the page-sized pread() I mentioned
above).

It's not obvious to me that there is a fundamental difference between
the two approaches in terms of performance.

> On the other
> hand, we can handle the virtqueue in a unified way for both vhost-vdpa
> case and virtio-vdpa case. Otherwise, userspace daemon needs to know
> which iova ranges need to be accessed with pread(2)/pwrite(2). And in
> the future, we might be able to avoid bouncing in some cases.

Ah, I see. So bounce buffers are not used for vhost-vdpa?

Stefan
Yongji Xie April 15, 2021, 8:33 a.m. UTC | #6
On Thu, Apr 15, 2021 at 3:19 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
> > On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> > > > VDUSE (vDPA Device in Userspace) is a framework to support
> > > > implementing software-emulated vDPA devices in userspace. This
> > > > document is intended to clarify the VDUSE design and usage.
> > > >
> > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > > ---
> > > >  Documentation/userspace-api/index.rst |   1 +
> > > >  Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
> > > >  2 files changed, 213 insertions(+)
> > > >  create mode 100644 Documentation/userspace-api/vduse.rst
> > >
> > > Just looking over the documentation briefly (I haven't studied the code
> > > yet)...
> > >
> >
> > Thank you!
> >
> > > > +How VDUSE works
> > > > +------------
> > > > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > > > +the character device (/dev/vduse/control). Then a device file with the
> > > > +specified name (/dev/vduse/$NAME) will appear, which can be used to
> > > > +implement the userspace vDPA device's control path and data path.
> > >
> > > These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
> > > message? (Please consider reordering the documentation to make it clear
> > > what the sequence of steps are.)
> > >
> >
> > No, VDUSE devices should be created before sending the
> > VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
>
> I see. Please include an overview of the steps before going into detail.
> Something like:
>
>   VDUSE devices are started as follows:
>
>   1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
>      /dev/vduse/control.
>
>   2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
>      messages will arrive while attaching the VDUSE instance to vDPA.
>
>   3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
>      instance to vDPA.
>
>   VDUSE devices are stopped as follows:
>
>   ...
>

Sure.

> > > > +     static int netlink_add_vduse(const char *name, int device_id)
> > > > +     {
> > > > +             struct nl_sock *nlsock;
> > > > +             struct nl_msg *msg;
> > > > +             int famid;
> > > > +
> > > > +             nlsock = nl_socket_alloc();
> > > > +             if (!nlsock)
> > > > +                     return -ENOMEM;
> > > > +
> > > > +             if (genl_connect(nlsock))
> > > > +                     goto free_sock;
> > > > +
> > > > +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > > > +             if (famid < 0)
> > > > +                     goto close_sock;
> > > > +
> > > > +             msg = nlmsg_alloc();
> > > > +             if (!msg)
> > > > +                     goto close_sock;
> > > > +
> > > > +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> > > > +                 VDPA_CMD_DEV_NEW, 0))
> > > > +                     goto nla_put_failure;
> > > > +
> > > > +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > > > +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> > > > +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
> > >
> > > What are the permission/capability requirements for VDUSE?
> > >
> >
> > Now I think we need privileged permission (root user). Because
> > userspace daemon is able to access avail vring, used vring, descriptor
> > table in kernel driver directly.
>
> Please state this explicitly at the start of the document. Existing
> interfaces like FUSE are designed to avoid trusting userspace. Therefore
> people might think the same is the case here. It's critical that people
> are aware of this before deploying VDUSE with virtio-vdpa.
>
> We should probably pause here and think about whether it's possible to
> avoid trusting userspace. Even if it takes some effort and costs some
> performance it would probably be worthwhile.
>
> Is the security situation different with vhost-vdpa? In that case it
> seems more likely that the host kernel doesn't need to trust the
> userspace VDUSE device.
>

Yes.

> Regarding privileges in general: userspace VDUSE processes shouldn't
> need to run as root. The VDUSE device lifecycle will require privileges
> to attach vhost-vdpa and virtio-vdpa devices, but the actual userspace
> process that emulates the device should be able to run unprivileged.
> Emulated devices are an attack surface and even if you are comfortable
> with running them as root in your specific use case, it will be an issue
> as soon as other people want to use VDUSE and could give VDUSE a
> reputation for poor security.
>

Agreed. Rethinking the virtio-vdpa case: the security risks mainly
come from an untrusted user being able to rewrite the contents of the
avail vring, used vring, and descriptor table. But it seems that the
worst result of doing this is a broken virtqueue. I'm not sure whether
that is acceptable to the kernel.

> > > How does VDUSE interact with namespaces?
> > >
> >
> > Not sure I get your point here. Do you mean how the emulated vDPA
> > device interact with namespaces? This should work like hardware vDPA
> > devices do. VDUSE daemon can reside outside the namespace of a
> > container which uses the vDPA device.
>
> Can VDUSE devices run inside containers? Are /dev/vduse/$NAME and vDPA
> device names global?
>

I think we can run it inside containers. But there might be some
limitations. As you mentioned, the device name is global. So we need
to make sure the VDUSE daemons in different containers don't use the
same name to create vDPA devices.

> > > What is the meaning of VDPA_ATTR_DEV_ID? I don't see it in Linux
> > > v5.12-rc6 drivers/vdpa/vdpa.c:vdpa_nl_cmd_dev_add_set_doit().
> > >
> >
> > It means the device id (e.g. VIRTIO_ID_BLOCK) of the vDPA device and
> > can be found in include/uapi/linux/vdpa.h.
>
> VDPA_ATTR_DEV_ID is only used by VDPA_CMD_DEV_GET in Linux v5.12-rc6,
> not by VDPA_CMD_DEV_NEW.
>
> The example in this document uses VDPA_ATTR_DEV_ID with
> VDPA_CMD_DEV_NEW. Is the example outdated?
>

Oh, you are right. Will update it.

> >
> > > > +MMU-based IOMMU Driver
> > > > +----------------------
> > > > +VDUSE framework implements an MMU-based on-chip IOMMU driver to support
> > > > +mapping the kernel DMA buffer into the userspace iova region dynamically.
> > > > +This is mainly designed for virtio-vdpa case (kernel virtio drivers).
> > > > +
> > > > +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
> > > > +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
> > > > +so that the userspace process is able to use its virtual address to access
> > > > +the DMA buffer in kernel.
> > > > +
> > > > +And to avoid security issue, a bounce-buffering mechanism is introduced to
> > > > +prevent userspace accessing the original buffer directly which may contain other
> > > > +kernel data. During the mapping, unmapping, the driver will copy the data from
> > > > +the original buffer to the bounce buffer and back, depending on the direction of
> > > > +the transfer. And the bounce-buffer addresses will be mapped into the user address
> > > > +space instead of the original one.
> > >
> > > Is mmap(2) the right interface if memory is not actually shared, why not
> > > just use pread(2)/pwrite(2) to make the copy explicit? That way the copy
> > > semantics are clear. For example, don't expect to be able to busy wait
> > > on the memory because changes will not be visible to the other side.
> > >
> > > (I guess I'm missing something here and that mmap(2) is the right
> > > approach, but maybe this documentation section can be clarified.)
> >
> > It's for performance considerations on the one hand. We might need to
> > call pread(2)/pwrite(2) multiple times for each request.
>
> Userspace can keep page-sized pread() buffers around to avoid additional
> syscalls during a request.
>

In the indirect descriptors case, it looks like we can't use one
pread() to get all buffers?

> mmap() access does reduce the number of syscalls, but it also introduces
> page faults (effectively doing the page-sized pread() I mentioned
> above).
>

Yes, but only on the first access.

> It's not obvious to me that there is a fundamental difference between
> the two approaches in terms of performance.
>
> > On the other
> > hand, we can handle the virtqueue in a unified way for both vhost-vdpa
> > case and virtio-vdpa case. Otherwise, userspace daemon needs to know
> > which iova ranges need to be accessed with pread(2)/pwrite(2). And in
> > the future, we might be able to avoid bouncing in some cases.
>
> Ah, I see. So bounce buffers are not used for vhost-vdpa?
>

Yes.

Thanks,
Yongji
Jason Wang April 15, 2021, 8:36 a.m. UTC | #7
On 2021/4/15 3:19 PM, Stefan Hajnoczi wrote:
> On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
>> On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
>>>> VDUSE (vDPA Device in Userspace) is a framework to support
>>>> implementing software-emulated vDPA devices in userspace. This
>>>> document is intended to clarify the VDUSE design and usage.
>>>>
>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>>> ---
>>>>   Documentation/userspace-api/index.rst |   1 +
>>>>   Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
>>>>   2 files changed, 213 insertions(+)
>>>>   create mode 100644 Documentation/userspace-api/vduse.rst
>>> Just looking over the documentation briefly (I haven't studied the code
>>> yet)...
>>>
>> Thank you!
>>
>>>> +How VDUSE works
>>>> +------------
>>>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
>>>> +the character device (/dev/vduse/control). Then a device file with the
>>>> +specified name (/dev/vduse/$NAME) will appear, which can be used to
>>>> +implement the userspace vDPA device's control path and data path.
>>> These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
>>> message? (Please consider reordering the documentation to make it clear
>>> what the sequence of steps are.)
>>>
>> No, VDUSE devices should be created before sending the
>> VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
> I see. Please include an overview of the steps before going into detail.
> Something like:
>
>    VDUSE devices are started as follows:
>
>    1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
>       /dev/vduse/control.
>
>    2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
>       messages will arrive while attaching the VDUSE instance to vDPA.
>
>    3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
>       instance to vDPA.
>
>    VDUSE devices are stopped as follows:
>
>    ...
>
>>>> +     static int netlink_add_vduse(const char *name, int device_id)
>>>> +     {
>>>> +             struct nl_sock *nlsock;
>>>> +             struct nl_msg *msg;
>>>> +             int famid;
>>>> +
>>>> +             nlsock = nl_socket_alloc();
>>>> +             if (!nlsock)
>>>> +                     return -ENOMEM;
>>>> +
>>>> +             if (genl_connect(nlsock))
>>>> +                     goto free_sock;
>>>> +
>>>> +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
>>>> +             if (famid < 0)
>>>> +                     goto close_sock;
>>>> +
>>>> +             msg = nlmsg_alloc();
>>>> +             if (!msg)
>>>> +                     goto close_sock;
>>>> +
>>>> +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
>>>> +                 VDPA_CMD_DEV_NEW, 0))
>>>> +                     goto nla_put_failure;
>>>> +
>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
>>>> +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
>>> What are the permission/capability requirements for VDUSE?
>>>
>> Now I think we need privileged permission (root user). Because
>> userspace daemon is able to access avail vring, used vring, descriptor
>> table in kernel driver directly.
> Please state this explicitly at the start of the document. Existing
> interfaces like FUSE are designed to avoid trusting userspace.


There are some subtle differences here. VDUSE presents a device to the
kernel, which means the IOMMU is probably the only thing that protects
against a malicious device.


> Therefore
> people might think the same is the case here. It's critical that people
> are aware of this before deploying VDUSE with virtio-vdpa.
>
> We should probably pause here and think about whether it's possible to
> avoid trusting userspace. Even if it takes some effort and costs some
> performance it would probably be worthwhile.


Since the bounce buffer is used, the only attack surface is the coherent
area. If we want to enforce stronger isolation, we need to use a shadow
virtqueue (which I proposed in an earlier version) in this case. But
I'm not sure it's worth doing that.


>
> Is the security situation different with vhost-vdpa? In that case it
> seems more likely that the host kernel doesn't need to trust the
> userspace VDUSE device.
>
> Regarding privileges in general: userspace VDUSE processes shouldn't
> need to run as root. The VDUSE device lifecycle will require privileges
> to attach vhost-vdpa and virtio-vdpa devices, but the actual userspace
> process that emulates the device should be able to run unprivileged.
> Emulated devices are an attack surface and even if you are comfortable
> with running them as root in your specific use case, it will be an issue
> as soon as other people want to use VDUSE and could give VDUSE a
> reputation for poor security.


In this case, I think it works like any other char device:

- a privileged process creates and destroys the VDUSE device
- the fd is passed via SCM_RIGHTS to an unprivileged process that
  implements the device
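
The fd handoff described in this list can be sketched with standard
AF_UNIX ancillary data. This is a minimal illustration, not code from the
VDUSE series: send_fd() and recv_fd() are hypothetical helper names, and
the sketch assumes the privileged creator and the unprivileged device
process already share a connected AF_UNIX socket.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one open fd over a connected AF_UNIX socket. */
static int send_fd(int sock, int fd)
{
	char dummy = 'F';	/* at least one byte of real data must be sent */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns the new fd or -1. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

With this pattern, only the creator needs the privileges to open the
control device; the receiving process gets a duplicate of the descriptor
and can drop or never hold those privileges.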


>
>>> How does VDUSE interact with namespaces?
>>>
>> Not sure I get your point here. Do you mean how the emulated vDPA
>> device interact with namespaces? This should work like hardware vDPA
>> devices do. VDUSE daemon can reside outside the namespace of a
>> container which uses the vDPA device.
> Can VDUSE devices run inside containers? Are /dev/vduse/$NAME and vDPA
> device names global?


I think it's global; we can add namespace support on top.


>
>>> What is the meaning of VDPA_ATTR_DEV_ID? I don't see it in Linux
>>> v5.12-rc6 drivers/vdpa/vdpa.c:vdpa_nl_cmd_dev_add_set_doit().
>>>
>> It means the device id (e.g. VIRTIO_ID_BLOCK) of the vDPA device and
>> can be found in include/uapi/linux/vdpa.h.
> VDPA_ATTR_DEV_ID is only used by VDPA_CMD_DEV_GET in Linux v5.12-rc6,
> not by VDPA_CMD_DEV_NEW.
>
> The example in this document uses VDPA_ATTR_DEV_ID with
> VDPA_CMD_DEV_NEW. Is the example outdated?
>
>>>> +MMU-based IOMMU Driver
>>>> +----------------------
>>>> +VDUSE framework implements an MMU-based on-chip IOMMU driver to support
>>>> +mapping the kernel DMA buffer into the userspace iova region dynamically.
>>>> +This is mainly designed for virtio-vdpa case (kernel virtio drivers).
>>>> +
>>>> +The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
>>>> +The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
>>>> +so that the userspace process is able to use its virtual address to access
>>>> +the DMA buffer in kernel.
>>>> +
>>>> +And to avoid security issue, a bounce-buffering mechanism is introduced to
>>>> +prevent userspace accessing the original buffer directly which may contain other
>>>> +kernel data. During the mapping, unmapping, the driver will copy the data from
>>>> +the original buffer to the bounce buffer and back, depending on the direction of
>>>> +the transfer. And the bounce-buffer addresses will be mapped into the user address
>>>> +space instead of the original one.
>>> Is mmap(2) the right interface if memory is not actually shared, why not
>>> just use pread(2)/pwrite(2) to make the copy explicit? That way the copy
>>> semantics are clear. For example, don't expect to be able to busy wait
>>> on the memory because changes will not be visible to the other side.
>>>
>>> (I guess I'm missing something here and that mmap(2) is the right
>>> approach, but maybe this documentation section can be clarified.)
>> It's for performance considerations on the one hand. We might need to
>> call pread(2)/pwrite(2) multiple times for each request.
> Userspace can keep page-sized pread() buffers around to avoid additional
> syscalls during a request.


I'm not sure I follow. But the length of the request is not
necessarily PAGE_SIZE.


>
> mmap() access does reduce the number of syscalls, but it also introduces
> page faults (effectively doing the page-sized pread() I mentioned
> above).


You can access the data directly once the page has already been faulted
in. So mmap() should be much faster in this case.
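
To make the comparison concrete, here is a small sketch of the two access
styles being discussed. It uses an ordinary file descriptor rather than a
VDUSE device, and copy_with_pread() and map_region() are hypothetical
names chosen for illustration only.

```c
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Explicit copy: every call is one syscall, whatever the length. */
static int copy_with_pread(int fd, off_t off, void *dst, size_t len)
{
	return pread(fd, dst, len, off) == (ssize_t)len ? 0 : -1;
}

/*
 * Shared mapping: off must be page-aligned. The first touch of each
 * page takes a page fault; later accesses to that page are plain loads
 * with no syscall. Returns NULL on failure.
 */
static void *map_region(int fd, off_t off, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);

	return p == MAP_FAILED ? NULL : p;
}
```

A request whose buffers are scattered (for example via indirect
descriptors) would need one copy_with_pread() call per buffer, while a
region mapped once with map_region() can be walked directly.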


>
> It's not obvious to me that there is a fundamental difference between
> the two approaches in terms of performance.
>
>> On the other
>> hand, we can handle the virtqueue in a unified way for both vhost-vdpa
>> case and virtio-vdpa case. Otherwise, userspace daemon needs to know
>> which iova ranges need to be accessed with pread(2)/pwrite(2). And in
>> the future, we might be able to avoid bouncing in some cases.
> Ah, I see. So bounce buffers are not used for vhost-vdpa?


Yes, VDUSE can pass different fds to userspace for mmap().

Thanks


>
> Stefan
Jason Wang April 15, 2021, 9:04 a.m. UTC | #8
On 2021/4/15 4:36 PM, Jason Wang wrote:
>>>
>> Please state this explicitly at the start of the document. Existing
>> interfaces like FUSE are designed to avoid trusting userspace.
>
>
> There are some subtle differences here. VDUSE presents a device to the
> kernel, which means the IOMMU is probably the only thing that protects
> against a malicious device.
>
>
>> Therefore
>> people might think the same is the case here. It's critical that people
>> are aware of this before deploying VDUSE with virtio-vdpa.
>>
>> We should probably pause here and think about whether it's possible to
>> avoid trusting userspace. Even if it takes some effort and costs some
>> performance it would probably be worthwhile.
>
>
> Since the bounce buffer is used, the only attack surface is the
> coherent area. If we want to enforce stronger isolation, we need to use
> a shadow virtqueue (which I proposed in an earlier version) in this
> case. But I'm not sure it's worth doing that.



So this reminds me of the discussion at the end of last year. We need to
make sure we don't suffer from the same issues for VDUSE, at least:

https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152b32@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b

Or we can solve it at the virtio level, e.g. remember the DMA address
instead of depending on the addr in the descriptor ring.

Thanks


>
>
>>
>> Is the security situation different with vhost-vdpa? In that case it
>> seems more likely that the host kernel doesn't need to trust the
>> userspace VDUSE device.
Yongji Xie April 15, 2021, 11:17 a.m. UTC | #9
On Thu, Apr 15, 2021 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/4/15 4:36 PM, Jason Wang wrote:
> >>>
> >> Please state this explicitly at the start of the document. Existing
> >> interfaces like FUSE are designed to avoid trusting userspace.
> >
> >
> > There are some subtle differences here. VDUSE presents a device to the
> > kernel, which means the IOMMU is probably the only thing that protects
> > against a malicious device.
> >
> >
> >> Therefore
> >> people might think the same is the case here. It's critical that people
> >> are aware of this before deploying VDUSE with virtio-vdpa.
> >>
> >> We should probably pause here and think about whether it's possible to
> >> avoid trusting userspace. Even if it takes some effort and costs some
> >> performance it would probably be worthwhile.
> >
> >
> > Since the bounce buffer is used, the only attack surface is the
> > coherent area. If we want to enforce stronger isolation, we need to use
> > a shadow virtqueue (which I proposed in an earlier version) in this
> > case. But I'm not sure it's worth doing that.
>
>
>
> So this reminds me of the discussion at the end of last year. We need to
> make sure we don't suffer from the same issues for VDUSE, at least:
>
> https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152b32@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b
>
> Or we can solve it at the virtio level, e.g. remember the DMA address
> instead of depending on the addr in the descriptor ring.
>

I might be missing something. But VDUSE records the DMA address during
DMA mapping, so we would not do bouncing if the addr/length is invalid
during DMA unmapping. Is that enough?

Thanks,
Yongji
Stefan Hajnoczi April 15, 2021, 2:17 p.m. UTC | #10
On Thu, Apr 15, 2021 at 04:33:27PM +0800, Yongji Xie wrote:
> On Thu, Apr 15, 2021 at 3:19 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
> > > On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> > It's not obvious to me that there is a fundamental difference between
> > the two approaches in terms of performance.
> >
> > > On the other
> > > hand, we can handle the virtqueue in a unified way for both vhost-vdpa
> > > case and virtio-vdpa case. Otherwise, userspace daemon needs to know
> > > which iova ranges need to be accessed with pread(2)/pwrite(2). And in
> > > the future, we might be able to avoid bouncing in some cases.
> >
> > Ah, I see. So bounce buffers are not used for vhost-vdpa?
> >
> 
> Yes.

Okay, in that case I understand why mmap is used and it's nice to keep
virtio-vdpa and vhost-vdpa unified. Thanks!

Stefan
Stefan Hajnoczi April 15, 2021, 2:38 p.m. UTC | #11
On Thu, Apr 15, 2021 at 04:36:35PM +0800, Jason Wang wrote:
> 
> On 2021/4/15 3:19 PM, Stefan Hajnoczi wrote:
> > On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
> > > On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> > > > > VDUSE (vDPA Device in Userspace) is a framework to support
> > > > > implementing software-emulated vDPA devices in userspace. This
> > > > > document is intended to clarify the VDUSE design and usage.
> > > > > 
> > > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > > > ---
> > > > >   Documentation/userspace-api/index.rst |   1 +
> > > > >   Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
> > > > >   2 files changed, 213 insertions(+)
> > > > >   create mode 100644 Documentation/userspace-api/vduse.rst
> > > > Just looking over the documentation briefly (I haven't studied the code
> > > > yet)...
> > > > 
> > > Thank you!
> > > 
> > > > > +How VDUSE works
> > > > > +------------
> > > > > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > > > > +the character device (/dev/vduse/control). Then a device file with the
> > > > > +specified name (/dev/vduse/$NAME) will appear, which can be used to
> > > > > +implement the userspace vDPA device's control path and data path.
> > > > These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
> > > > message? (Please consider reordering the documentation to make it clear
> > > > what the sequence of steps are.)
> > > > 
> > > No, VDUSE devices should be created before sending the
> > > VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
> > I see. Please include an overview of the steps before going into detail.
> > Something like:
> > 
> >    VDUSE devices are started as follows:
> > 
> >    1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> >       /dev/vduse/control.
> > 
> >    2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> >       messages will arrive while attaching the VDUSE instance to vDPA.
> > 
> >    3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> >       instance to vDPA.
> > 
> >    VDUSE devices are stopped as follows:
> > 
> >    ...
> > 
> > > > > +     static int netlink_add_vduse(const char *name, int device_id)
> > > > > +     {
> > > > > +             struct nl_sock *nlsock;
> > > > > +             struct nl_msg *msg;
> > > > > +             int famid;
> > > > > +
> > > > > +             nlsock = nl_socket_alloc();
> > > > > +             if (!nlsock)
> > > > > +                     return -ENOMEM;
> > > > > +
> > > > > +             if (genl_connect(nlsock))
> > > > > +                     goto free_sock;
> > > > > +
> > > > > +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > > > > +             if (famid < 0)
> > > > > +                     goto close_sock;
> > > > > +
> > > > > +             msg = nlmsg_alloc();
> > > > > +             if (!msg)
> > > > > +                     goto close_sock;
> > > > > +
> > > > > +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> > > > > +                 VDPA_CMD_DEV_NEW, 0))
> > > > > +                     goto nla_put_failure;
> > > > > +
> > > > > +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > > > > +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> > > > > +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
> > > > What are the permission/capability requirements for VDUSE?
> > > > 
> > > Now I think we need privileged permission (root user). Because
> > > userspace daemon is able to access avail vring, used vring, descriptor
> > > table in kernel driver directly.
> > Please state this explicitly at the start of the document. Existing
> > interfaces like FUSE are designed to avoid trusting userspace.
> 
> 
> There are some subtle differences here. VDUSE presents a device to the
> kernel, which means the IOMMU is probably the only thing that protects
> against a malicious device.
> 
> 
> > Therefore
> > people might think the same is the case here. It's critical that people
> > are aware of this before deploying VDUSE with virtio-vdpa.
> > 
> > We should probably pause here and think about whether it's possible to
> > avoid trusting userspace. Even if it takes some effort and costs some
> > performance it would probably be worthwhile.
> 
> 
> Since the bounce buffer is used, the only attack surface is the coherent
> area. If we want to enforce stronger isolation, we need to use a shadow
> virtqueue (which I proposed in an earlier version) in this case. But I'm
> not sure it's worth doing that.

The security situation needs to be clear before merging this feature.

I think the IOMMU and vring can be made secure. What is more concerning
is the kernel code that runs on top: VIRTIO device drivers, network
stack, file systems, etc. They trust devices to an extent.

Since virtio-vdpa is a big reason for doing VDUSE in the first place I
don't think it makes sense to disable virtio-vdpa with VDUSE. A solution
is needed.

I'm going to be offline for a week and don't want to be a bottleneck.
I'll catch up when I'm back.

Stefan
Jason Wang April 16, 2021, 2:20 a.m. UTC | #12
On 2021/4/15 7:17 PM, Yongji Xie wrote:
> On Thu, Apr 15, 2021 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/4/15 4:36 PM, Jason Wang wrote:
>>>> Please state this explicitly at the start of the document. Existing
>>>> interfaces like FUSE are designed to avoid trusting userspace.
>>>
>>> There are some subtle differences here. VDUSE presents a device to the
>>> kernel, which means the IOMMU is probably the only thing that protects
>>> against a malicious device.
>>>
>>>
>>>> Therefore
>>>> people might think the same is the case here. It's critical that people
>>>> are aware of this before deploying VDUSE with virtio-vdpa.
>>>>
>>>> We should probably pause here and think about whether it's possible to
>>>> avoid trusting userspace. Even if it takes some effort and costs some
>>>> performance it would probably be worthwhile.
>>>
>>> Since the bounce buffer is used, the only attack surface is the
>>> coherent area. If we want to enforce stronger isolation, we need to use
>>> a shadow virtqueue (which I proposed in an earlier version) in this
>>> case. But I'm not sure it's worth doing that.
>>
>>
>> So this reminds me of the discussion at the end of last year. We need to
>> make sure we don't suffer from the same issues for VDUSE, at least:
>>
>> https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152b32@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b
>>
>> Or we can solve it at the virtio level, e.g. remember the DMA address
>> instead of depending on the addr in the descriptor ring.
>>
> I might be missing something. But VDUSE records the DMA address during
> DMA mapping, so we would not do bouncing if the addr/length is invalid
> during DMA unmapping. Is that enough?


E.g. a malicious device writes a buggy DMA address in the descriptor
ring, so we have:

vring_unmap_one_split(desc->addr, desc->len)
     dma_unmap_single()
         vduse_dev_unmap_page()
             vduse_domain_bounce()

And in vduse_domain_bounce() we have:

         while (size) {
                 map = &domain->bounce_maps[iova >> PAGE_SHIFT];
                 offset = offset_in_page(iova);
                 sz = min_t(size_t, PAGE_SIZE - offset, size);

This means we trust the iova, which is dangerous and exactly the issue
mentioned in the above link.

At the VDUSE level, we need to make sure the iova is legal.

At the virtio level, we should not trust desc->addr.
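
As a rough illustration of what "make sure the iova is legal" could mean
before indexing bounce_maps[], here is a standalone userspace sketch. It
is not the kernel patch: iova_range_valid() is a hypothetical name, and
the sketch assumes the bounce region is one contiguous range that starts
at IOVA 0 and is bounce_size bytes long.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: true only if [iova, iova + size) lies entirely
 * inside a bounce region covering IOVAs [0, bounce_size).
 */
static bool iova_range_valid(uint64_t iova, size_t size, uint64_t bounce_size)
{
	if (size == 0)
		return false;
	if (iova >= bounce_size)
		return false;
	/* Compare against the remaining space so iova + size cannot wrap. */
	return size <= bounce_size - iova;
}
```

A caller would reject the unmap request when the check fails instead of
walking the bounce map with an address taken from the descriptor ring.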

Thanks


>
> Thanks,
> Yongji
>
Jason Wang April 16, 2021, 2:23 a.m. UTC | #13
On 2021/4/15 10:38 PM, Stefan Hajnoczi wrote:
> On Thu, Apr 15, 2021 at 04:36:35PM +0800, Jason Wang wrote:
>> On 2021/4/15 3:19 PM, Stefan Hajnoczi wrote:
>>> On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
>>>> On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>> On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
>>>>>> VDUSE (vDPA Device in Userspace) is a framework to support
>>>>>> implementing software-emulated vDPA devices in userspace. This
>>>>>> document is intended to clarify the VDUSE design and usage.
>>>>>>
>>>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>>>>> ---
>>>>>>    Documentation/userspace-api/index.rst |   1 +
>>>>>>    Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
>>>>>>    2 files changed, 213 insertions(+)
>>>>>>    create mode 100644 Documentation/userspace-api/vduse.rst
>>>>> Just looking over the documentation briefly (I haven't studied the code
>>>>> yet)...
>>>>>
>>>> Thank you!
>>>>
>>>>>> +How VDUSE works
>>>>>> +------------
>>>>>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
>>>>>> +the character device (/dev/vduse/control). Then a device file with the
>>>>>> +specified name (/dev/vduse/$NAME) will appear, which can be used to
>>>>>> +implement the userspace vDPA device's control path and data path.
>>>>> These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
>>>>> message? (Please consider reordering the documentation to make it clear
>>>>> what the sequence of steps are.)
>>>>>
>>>> No, VDUSE devices should be created before sending the
>>>> VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
>>> I see. Please include an overview of the steps before going into detail.
>>> Something like:
>>>
>>>     VDUSE devices are started as follows:
>>>
>>>     1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
>>>        /dev/vduse/control.
>>>
>>>     2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
>>>        messages will arrive while attaching the VDUSE instance to vDPA.
>>>
>>>     3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
>>>        instance to vDPA.
>>>
>>>     VDUSE devices are stopped as follows:
>>>
>>>     ...
>>>
>>>>>> +     static int netlink_add_vduse(const char *name, int device_id)
>>>>>> +     {
>>>>>> +             struct nl_sock *nlsock;
>>>>>> +             struct nl_msg *msg;
>>>>>> +             int famid;
>>>>>> +
>>>>>> +             nlsock = nl_socket_alloc();
>>>>>> +             if (!nlsock)
>>>>>> +                     return -ENOMEM;
>>>>>> +
>>>>>> +             if (genl_connect(nlsock))
>>>>>> +                     goto free_sock;
>>>>>> +
>>>>>> +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
>>>>>> +             if (famid < 0)
>>>>>> +                     goto close_sock;
>>>>>> +
>>>>>> +             msg = nlmsg_alloc();
>>>>>> +             if (!msg)
>>>>>> +                     goto close_sock;
>>>>>> +
>>>>>> +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
>>>>>> +                 VDPA_CMD_DEV_NEW, 0))
>>>>>> +                     goto nla_put_failure;
>>>>>> +
>>>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
>>>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
>>>>>> +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
>>>>> What are the permission/capability requirements for VDUSE?
>>>>>
>>>> Now I think we need privileged permission (root user). Because
>>>> userspace daemon is able to access avail vring, used vring, descriptor
>>>> table in kernel driver directly.
>>> Please state this explicitly at the start of the document. Existing
>>> interfaces like FUSE are designed to avoid trusting userspace.
>>
>> There're some subtle difference here. VDUSE present a device to kernel which
>> means IOMMU is probably the only thing to prevent a malicous device.
>>
>>
>>> Therefore
>>> people might think the same is the case here. It's critical that people
>>> are aware of this before deploying VDUSE with virtio-vdpa.
>>>
>>> We should probably pause here and think about whether it's possible to
>>> avoid trusting userspace. Even if it takes some effort and costs some
>>> performance it would probably be worthwhile.
>>
>> Since the bounce buffer is used the only attack surface is the coherent
>> area, if we want to enforce stronger isolation we need to use shadow
>> virtqueue (which is proposed in earlier version by me) in this case. But I'm
>> not sure it's worth to do that.
> The security situation needs to be clear before merging this feature.


+1


>
> I think the IOMMU and vring can be made secure. What is more concerning
> is the kernel code that runs on top: VIRTIO device drivers, network
> stack, file systems, etc. They trust devices to an extent.
>
> Since virtio-vdpa is a big reason for doing VDUSE in the first place I
> don't think it makes sense to disable virtio-vdpa with VDUSE. A solution
> is needed.


Yes, so the case of VDUSE is similar to the case of e.g. SEV.

Neither case trusts the device, and both use some kind of software IOTLB.

That means we need protection at both the IOTLB and the virtio drivers.

Let me post patches for virtio first.


>
> I'm going to be offline for a week and don't want to be a bottleneck.
> I'll catch up when I'm back.


Thanks a lot for the comments. I think we have sufficient time to make
VDUSE safe before merging.


>
> Stefan
Yongji Xie April 16, 2021, 2:58 a.m. UTC | #14
On Fri, Apr 16, 2021 at 10:20 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/4/15 7:17 PM, Yongji Xie wrote:
> > On Thu, Apr 15, 2021 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/4/15 4:36 PM, Jason Wang wrote:
> >>>> Please state this explicitly at the start of the document. Existing
> >>>> interfaces like FUSE are designed to avoid trusting userspace.
> >>>
> >>> There're some subtle difference here. VDUSE present a device to kernel
> >>> which means IOMMU is probably the only thing to prevent a malicous
> >>> device.
> >>>
> >>>
> >>>> Therefore
> >>>> people might think the same is the case here. It's critical that people
> >>>> are aware of this before deploying VDUSE with virtio-vdpa.
> >>>>
> >>>> We should probably pause here and think about whether it's possible to
> >>>> avoid trusting userspace. Even if it takes some effort and costs some
> >>>> performance it would probably be worthwhile.
> >>>
> >>> Since the bounce buffer is used the only attack surface is the
> >>> coherent area, if we want to enforce stronger isolation we need to use
> >>> shadow virtqueue (which is proposed in earlier version by me) in this
> >>> case. But I'm not sure it's worth to do that.
> >>
> >>
> >> So this reminds me the discussion in the end of last year. We need to
> >> make sure we don't suffer from the same issues for VDUSE at least
> >>
> >> https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152b32@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b
> >>
> >> Or we can solve it at virtio level, e.g remember the dma address instead
> >> of depending on the addr in the descriptor ring
> >>
> > I might miss something. But VDUSE has recorded the dma address during
> > dma mapping, so we would not do bouncing if the addr/length is invalid
> > during dma unmapping. Is it enough?
>
>
> E.g malicous device write a buggy dma address in the descriptor ring, so
> we had:
>
> vring_unmap_one_split(desc->addr, desc->len)
>      dma_unmap_single()
>          vduse_dev_unmap_page()
>              vduse_domain_bounce()
>
> And in vduse_domain_bounce() we had:
>
>          while (size) {
>                  map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>                  offset = offset_in_page(iova);
>                  sz = min_t(size_t, PAGE_SIZE - offset, size);
>
> This means we trust the iova which is dangerous and exacly the issue
> mentioned in the above link.
>
>  From VDUSE level need to make sure iova is legal.
>

I think we already do that in vduse_domain_bounce():

    while (size) {
        map = &domain->bounce_maps[iova >> PAGE_SHIFT];

        if (WARN_ON(!map->bounce_page ||
            map->orig_phys == INVALID_PHYS_ADDR))
            return;


>  From virtio level, we should not truse desc->addr.
>

We would not touch desc->addr after vring_unmap_one_split(). So I'm
not sure what we need to do at the virtio level.

Thanks,
Yongji
Jason Wang April 16, 2021, 3:02 a.m. UTC | #15
On 2021/4/16 10:58 AM, Yongji Xie wrote:
> On Fri, Apr 16, 2021 at 10:20 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/4/15 7:17 PM, Yongji Xie wrote:
>>> On Thu, Apr 15, 2021 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/4/15 4:36 PM, Jason Wang wrote:
>>>>>> Please state this explicitly at the start of the document. Existing
>>>>>> interfaces like FUSE are designed to avoid trusting userspace.
>>>>> There're some subtle difference here. VDUSE present a device to kernel
>>>>> which means IOMMU is probably the only thing to prevent a malicous
>>>>> device.
>>>>>
>>>>>
>>>>>> Therefore
>>>>>> people might think the same is the case here. It's critical that people
>>>>>> are aware of this before deploying VDUSE with virtio-vdpa.
>>>>>>
>>>>>> We should probably pause here and think about whether it's possible to
>>>>>> avoid trusting userspace. Even if it takes some effort and costs some
>>>>>> performance it would probably be worthwhile.
>>>>> Since the bounce buffer is used the only attack surface is the
>>>>> coherent area, if we want to enforce stronger isolation we need to use
>>>>> shadow virtqueue (which is proposed in earlier version by me) in this
>>>>> case. But I'm not sure it's worth to do that.
>>>>
>>>> So this reminds me the discussion in the end of last year. We need to
>>>> make sure we don't suffer from the same issues for VDUSE at least
>>>>
>>>> https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152b32@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b
>>>>
>>>> Or we can solve it at virtio level, e.g remember the dma address instead
>>>> of depending on the addr in the descriptor ring
>>>>
>>> I might miss something. But VDUSE has recorded the dma address during
>>> dma mapping, so we would not do bouncing if the addr/length is invalid
>>> during dma unmapping. Is it enough?
>>
>> E.g malicous device write a buggy dma address in the descriptor ring, so
>> we had:
>>
>> vring_unmap_one_split(desc->addr, desc->len)
>>       dma_unmap_single()
>>           vduse_dev_unmap_page()
>>               vduse_domain_bounce()
>>
>> And in vduse_domain_bounce() we had:
>>
>>           while (size) {
>>                   map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>>                   offset = offset_in_page(iova);
>>                   sz = min_t(size_t, PAGE_SIZE - offset, size);
>>
>> This means we trust the iova which is dangerous and exacly the issue
>> mentioned in the above link.
>>
>>   From VDUSE level need to make sure iova is legal.
>>
> I think we already do that in vduse_domain_bounce():
>
>      while (size) {
>          map = &domain->bounce_maps[iova >> PAGE_SHIFT];
>
>          if (WARN_ON(!map->bounce_page ||
>              map->orig_phys == INVALID_PHYS_ADDR))
>              return;


So you don't check whether iova is legal before using it, so it's at
least a possible out-of-bounds access of bounce_maps[], isn't it? (E.g.
what happens if iova is ULLONG_MAX?)
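
To make the kind of check I mean concrete, here is a minimal userspace
sketch. It is not the VDUSE code: struct sketch_domain, bounce_size and
the sketch_* helpers are hypothetical names, and I assume the bounce
region starts at iova 0 and covers bounce_size bytes. The point is only
that the range is validated before iova is used as an array index, and
that the comparison is written so it cannot overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Hypothetical stand-in for the IOVA domain: bounce_size bytes would be
 * backed by bounce_size >> SKETCH_PAGE_SHIFT entries of bounce_maps[]. */
struct sketch_domain {
    uint64_t bounce_size;
};

/* Return true only if [iova, iova + size) lies inside the bounce region.
 * iova + size is never computed here, so a huge iova cannot wrap around. */
static bool sketch_iova_is_legal(const struct sketch_domain *d,
                                 uint64_t iova, uint64_t size)
{
    assert(d != NULL);
    if (size == 0 || size > d->bounce_size)
        return false;
    return iova <= d->bounce_size - size;
}

/* Number of bounce_maps[] slots the range touches, or 0 if it is rejected. */
static size_t sketch_bounce_pages(const struct sketch_domain *d,
                                  uint64_t iova, uint64_t size)
{
    if (!sketch_iova_is_legal(d, iova, size))
        return 0;

    /* Safe to add now: the range was checked against bounce_size above. */
    uint64_t first = iova >> SKETCH_PAGE_SHIFT;
    uint64_t last = (iova + size - 1) >> SKETCH_PAGE_SHIFT;

    return (size_t)(last - first + 1);
}
```

With something like this in front of the loop, an iova of ULLONG_MAX is
rejected instead of being shifted and used to index bounce_maps[].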


>
>
>>   From virtio level, we should not truse desc->addr.
>>
> We would not touch desc->addr after vring_unmap_one_split(). So I'm
> not sure what we need to do at the virtio level.


I think the point is to record the DMA address/len somewhere instead of
reading them from the descriptor ring.
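
Roughly, the idea could look like the sketch below. This is a userspace
model under my own assumptions, not the virtio_ring code: sketch_vq,
sketch_extra and the helper names are made up for illustration. The
driver keeps a private copy of addr/len when it maps a buffer, and the
unmap path uses that copy, so whatever the device later writes into the
shared ring does not matter.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_RING_SIZE 8

/* Descriptor as seen in the shared ring; the device can write to it. */
struct sketch_desc {
    uint64_t addr;
    uint32_t len;
};

/* Driver-private copy, kept in memory the device cannot reach. */
struct sketch_extra {
    uint64_t addr;
    uint32_t len;
};

struct sketch_vq {
    struct sketch_desc ring[SKETCH_RING_SIZE];   /* shared, untrusted */
    struct sketch_extra extra[SKETCH_RING_SIZE]; /* private, trusted */
};

/* Map time: publish the buffer to the device and remember what was mapped. */
static void sketch_add_buf(struct sketch_vq *vq, unsigned int i,
                           uint64_t dma_addr, uint32_t len)
{
    assert(i < SKETCH_RING_SIZE);
    vq->ring[i].addr = dma_addr;
    vq->ring[i].len = len;
    vq->extra[i].addr = dma_addr;
    vq->extra[i].len = len;
}

/* Unmap time: return the recorded values and ignore the ring contents. */
static struct sketch_extra sketch_unmap_args(const struct sketch_vq *vq,
                                             unsigned int i)
{
    assert(i < SKETCH_RING_SIZE);
    return vq->extra[i];
}
```

The cost is one extra addr/len pair per descriptor in driver memory.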

Thanks


>
> Thanks,
> Yongji
>
Yongji Xie April 16, 2021, 3:13 a.m. UTC | #16
On Thu, Apr 15, 2021 at 10:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 15, 2021 at 04:36:35PM +0800, Jason Wang wrote:
> >
> > On 2021/4/15 3:19 PM, Stefan Hajnoczi wrote:
> > > On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
> > > > On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> > > > > > VDUSE (vDPA Device in Userspace) is a framework to support
> > > > > > implementing software-emulated vDPA devices in userspace. This
> > > > > > document is intended to clarify the VDUSE design and usage.
> > > > > >
> > > > > > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > > > > > ---
> > > > > >   Documentation/userspace-api/index.rst |   1 +
> > > > > >   Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
> > > > > >   2 files changed, 213 insertions(+)
> > > > > >   create mode 100644 Documentation/userspace-api/vduse.rst
> > > > > Just looking over the documentation briefly (I haven't studied the code
> > > > > yet)...
> > > > >
> > > > Thank you!
> > > >
> > > > > > +How VDUSE works
> > > > > > +------------
> > > > > > +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> > > > > > +the character device (/dev/vduse/control). Then a device file with the
> > > > > > +specified name (/dev/vduse/$NAME) will appear, which can be used to
> > > > > > +implement the userspace vDPA device's control path and data path.
> > > > > These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
> > > > > message? (Please consider reordering the documentation to make it clear
> > > > > what the sequence of steps are.)
> > > > >
> > > > No, VDUSE devices should be created before sending the
> > > > VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
> > > I see. Please include an overview of the steps before going into detail.
> > > Something like:
> > >
> > >    VDUSE devices are started as follows:
> > >
> > >    1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> > >       /dev/vduse/control.
> > >
> > >    2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> > >       messages will arrive while attaching the VDUSE instance to vDPA.
> > >
> > >    3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> > >       instance to vDPA.
> > >
> > >    VDUSE devices are stopped as follows:
> > >
> > >    ...
> > >
> > > > > > +     static int netlink_add_vduse(const char *name, int device_id)
> > > > > > +     {
> > > > > > +             struct nl_sock *nlsock;
> > > > > > +             struct nl_msg *msg;
> > > > > > +             int famid;
> > > > > > +
> > > > > > +             nlsock = nl_socket_alloc();
> > > > > > +             if (!nlsock)
> > > > > > +                     return -ENOMEM;
> > > > > > +
> > > > > > +             if (genl_connect(nlsock))
> > > > > > +                     goto free_sock;
> > > > > > +
> > > > > > +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > > > > > +             if (famid < 0)
> > > > > > +                     goto close_sock;
> > > > > > +
> > > > > > +             msg = nlmsg_alloc();
> > > > > > +             if (!msg)
> > > > > > +                     goto close_sock;
> > > > > > +
> > > > > > +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> > > > > > +                 VDPA_CMD_DEV_NEW, 0))
> > > > > > +                     goto nla_put_failure;
> > > > > > +
> > > > > > +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > > > > > +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> > > > > > +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
> > > > > What are the permission/capability requirements for VDUSE?
> > > > >
> > > > Now I think we need privileged permission (root user). Because
> > > > userspace daemon is able to access avail vring, used vring, descriptor
> > > > table in kernel driver directly.
> > > Please state this explicitly at the start of the document. Existing
> > > interfaces like FUSE are designed to avoid trusting userspace.
> >
> >
> > There're some subtle difference here. VDUSE present a device to kernel which
> > means IOMMU is probably the only thing to prevent a malicous device.
> >
> >
> > > Therefore
> > > people might think the same is the case here. It's critical that people
> > > are aware of this before deploying VDUSE with virtio-vdpa.
> > >
> > > We should probably pause here and think about whether it's possible to
> > > avoid trusting userspace. Even if it takes some effort and costs some
> > > performance it would probably be worthwhile.
> >
> >
> > Since the bounce buffer is used the only attack surface is the coherent
> > area, if we want to enforce stronger isolation we need to use shadow
> > virtqueue (which is proposed in earlier version by me) in this case. But I'm
> > not sure it's worth to do that.
>
> The security situation needs to be clear before merging this feature.
>
> I think the IOMMU and vring can be made secure. What is more concerning
> is the kernel code that runs on top: VIRTIO device drivers, network
> stack, file systems, etc. They trust devices to an extent.
>

I will dig into it to see if there is any security issue.

> Since virtio-vdpa is a big reason for doing VDUSE in the first place I
> don't think it makes sense to disable virtio-vdpa with VDUSE. A solution
> is needed.
>
> I'm going to be offline for a week and don't want to be a bottleneck.
> I'll catch up when I'm back.
>

Thanks for your comments!

Thanks,
Yongji
Yongji Xie April 16, 2021, 3:18 a.m. UTC | #17
On Fri, Apr 16, 2021 at 11:03 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/4/16 10:58 AM, Yongji Xie wrote:
> > On Fri, Apr 16, 2021 at 10:20 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/4/15 7:17 PM, Yongji Xie wrote:
> >>> On Thu, Apr 15, 2021 at 5:05 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2021/4/15 4:36 PM, Jason Wang wrote:
> >>>>>> Please state this explicitly at the start of the document. Existing
> >>>>>> interfaces like FUSE are designed to avoid trusting userspace.
> >>>>> There're some subtle difference here. VDUSE present a device to kernel
> >>>>> which means IOMMU is probably the only thing to prevent a malicous
> >>>>> device.
> >>>>>
> >>>>>
> >>>>>> Therefore
> >>>>>> people might think the same is the case here. It's critical that people
> >>>>>> are aware of this before deploying VDUSE with virtio-vdpa.
> >>>>>>
> >>>>>> We should probably pause here and think about whether it's possible to
> >>>>>> avoid trusting userspace. Even if it takes some effort and costs some
> >>>>>> performance it would probably be worthwhile.
> >>>>> Since the bounce buffer is used the only attack surface is the
> >>>>> coherent area, if we want to enforce stronger isolation we need to use
> >>>>> shadow virtqueue (which is proposed in earlier version by me) in this
> >>>>> case. But I'm not sure it's worth to do that.
> >>>>
> >>>> So this reminds me the discussion in the end of last year. We need to
> >>>> make sure we don't suffer from the same issues for VDUSE at least
> >>>>
> >>>> https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152b32@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b
> >>>>
> >>>> Or we can solve it at virtio level, e.g remember the dma address instead
> >>>> of depending on the addr in the descriptor ring
> >>>>
> >>> I might miss something. But VDUSE has recorded the dma address during
> >>> dma mapping, so we would not do bouncing if the addr/length is invalid
> >>> during dma unmapping. Is it enough?
> >>
> >> E.g malicous device write a buggy dma address in the descriptor ring, so
> >> we had:
> >>
> >> vring_unmap_one_split(desc->addr, desc->len)
> >>       dma_unmap_single()
> >>           vduse_dev_unmap_page()
> >>               vduse_domain_bounce()
> >>
> >> And in vduse_domain_bounce() we had:
> >>
> >>           while (size) {
> >>                   map = &domain->bounce_maps[iova >> PAGE_SHIFT];
> >>                   offset = offset_in_page(iova);
> >>                   sz = min_t(size_t, PAGE_SIZE - offset, size);
> >>
> >> This means we trust the iova which is dangerous and exacly the issue
> >> mentioned in the above link.
> >>
> >>   From VDUSE level need to make sure iova is legal.
> >>
> > I think we already do that in vduse_domain_bounce():
> >
> >      while (size) {
> >          map = &domain->bounce_maps[iova >> PAGE_SHIFT];
> >
> >          if (WARN_ON(!map->bounce_page ||
> >              map->orig_phys == INVALID_PHYS_ADDR))
> >              return;
>
>
> So you don't check whether iova is legal before using it, so it's at
> least a possible out of bound access of the bounce_maps[] isn't it? (e.g
> what happens if iova is ULLONG_MAX).
>

Oh, yes. Will do it!

>
> >
> >
> >>   From virtio level, we should not truse desc->addr.
> >>
> > We would not touch desc->addr after vring_unmap_one_split(). So I'm
> > not sure what we need to do at the virtio level.
>
>
> I think the point is to record the dma addres/len somewhere instead of
> reading them from descriptor ring.
>

OK, I see.

Thanks,
Yongji
Yongji Xie April 16, 2021, 3:19 a.m. UTC | #18
On Fri, Apr 16, 2021 at 10:24 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/4/15 10:38 PM, Stefan Hajnoczi wrote:
> > On Thu, Apr 15, 2021 at 04:36:35PM +0800, Jason Wang wrote:
> >> On 2021/4/15 3:19 PM, Stefan Hajnoczi wrote:
> >>> On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
> >>>> On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>> On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
> >>>>>> VDUSE (vDPA Device in Userspace) is a framework to support
> >>>>>> implementing software-emulated vDPA devices in userspace. This
> >>>>>> document is intended to clarify the VDUSE design and usage.
> >>>>>>
> >>>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> >>>>>> ---
> >>>>>>    Documentation/userspace-api/index.rst |   1 +
> >>>>>>    Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
> >>>>>>    2 files changed, 213 insertions(+)
> >>>>>>    create mode 100644 Documentation/userspace-api/vduse.rst
> >>>>> Just looking over the documentation briefly (I haven't studied the code
> >>>>> yet)...
> >>>>>
> >>>> Thank you!
> >>>>
> >>>>>> +How VDUSE works
> >>>>>> +------------
> >>>>>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
> >>>>>> +the character device (/dev/vduse/control). Then a device file with the
> >>>>>> +specified name (/dev/vduse/$NAME) will appear, which can be used to
> >>>>>> +implement the userspace vDPA device's control path and data path.
> >>>>> These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
> >>>>> message? (Please consider reordering the documentation to make it clear
> >>>>> what the sequence of steps are.)
> >>>>>
> >>>> No, VDUSE devices should be created before sending the
> >>>> VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
> >>> I see. Please include an overview of the steps before going into detail.
> >>> Something like:
> >>>
> >>>     VDUSE devices are started as follows:
> >>>
> >>>     1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> >>>        /dev/vduse/control.
> >>>
> >>>     2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> >>>        messages will arrive while attaching the VDUSE instance to vDPA.
> >>>
> >>>     3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> >>>        instance to vDPA.
> >>>
> >>>     VDUSE devices are stopped as follows:
> >>>
> >>>     ...
> >>>
> >>>>>> +     static int netlink_add_vduse(const char *name, int device_id)
> >>>>>> +     {
> >>>>>> +             struct nl_sock *nlsock;
> >>>>>> +             struct nl_msg *msg;
> >>>>>> +             int famid;
> >>>>>> +
> >>>>>> +             nlsock = nl_socket_alloc();
> >>>>>> +             if (!nlsock)
> >>>>>> +                     return -ENOMEM;
> >>>>>> +
> >>>>>> +             if (genl_connect(nlsock))
> >>>>>> +                     goto free_sock;
> >>>>>> +
> >>>>>> +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> >>>>>> +             if (famid < 0)
> >>>>>> +                     goto close_sock;
> >>>>>> +
> >>>>>> +             msg = nlmsg_alloc();
> >>>>>> +             if (!msg)
> >>>>>> +                     goto close_sock;
> >>>>>> +
> >>>>>> +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
> >>>>>> +                 VDPA_CMD_DEV_NEW, 0))
> >>>>>> +                     goto nla_put_failure;
> >>>>>> +
> >>>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> >>>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> >>>>>> +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
> >>>>> What are the permission/capability requirements for VDUSE?
> >>>>>
> >>>> Now I think we need privileged permission (root user). Because
> >>>> userspace daemon is able to access avail vring, used vring, descriptor
> >>>> table in kernel driver directly.
> >>> Please state this explicitly at the start of the document. Existing
> >>> interfaces like FUSE are designed to avoid trusting userspace.
> >>
> >> There're some subtle difference here. VDUSE present a device to kernel which
> >> means IOMMU is probably the only thing to prevent a malicous device.
> >>
> >>
> >>> Therefore
> >>> people might think the same is the case here. It's critical that people
> >>> are aware of this before deploying VDUSE with virtio-vdpa.
> >>>
> >>> We should probably pause here and think about whether it's possible to
> >>> avoid trusting userspace. Even if it takes some effort and costs some
> >>> performance it would probably be worthwhile.
> >>
> >> Since the bounce buffer is used the only attack surface is the coherent
> >> area, if we want to enforce stronger isolation we need to use shadow
> >> virtqueue (which is proposed in earlier version by me) in this case. But I'm
> >> not sure it's worth to do that.
> > The security situation needs to be clear before merging this feature.
>
>
> +1
>
>
> >
> > I think the IOMMU and vring can be made secure. What is more concerning
> > is the kernel code that runs on top: VIRTIO device drivers, network
> > stack, file systems, etc. They trust devices to an extent.
> >
> > Since virtio-vdpa is a big reason for doing VDUSE in the first place I
> > don't think it makes sense to disable virtio-vdpa with VDUSE. A solution
> > is needed.
>
>
> Yes, so the case of VDUSE is something similar to the case of e.g SEV.
>
> Both cases won't trust device and use some kind of software IOTLB.
>
> That means we need to protect at both IOTLB and virtio drivers.
>
> Let me post patches for virtio first.
>

Looking forward to your patches.

Thanks.
Yongji
Jason Wang April 16, 2021, 5:39 a.m. UTC | #19
On 2021/4/16 11:19 AM, Yongji Xie wrote:
> On Fri, Apr 16, 2021 at 10:24 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/4/15 10:38 PM, Stefan Hajnoczi wrote:
>>> On Thu, Apr 15, 2021 at 04:36:35PM +0800, Jason Wang wrote:
>>>> On 2021/4/15 3:19 PM, Stefan Hajnoczi wrote:
>>>>> On Thu, Apr 15, 2021 at 01:38:37PM +0800, Yongji Xie wrote:
>>>>>> On Wed, Apr 14, 2021 at 10:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>> On Wed, Mar 31, 2021 at 04:05:19PM +0800, Xie Yongji wrote:
>>>>>>>> VDUSE (vDPA Device in Userspace) is a framework to support
>>>>>>>> implementing software-emulated vDPA devices in userspace. This
>>>>>>>> document is intended to clarify the VDUSE design and usage.
>>>>>>>>
>>>>>>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>>>>>>> ---
>>>>>>>>     Documentation/userspace-api/index.rst |   1 +
>>>>>>>>     Documentation/userspace-api/vduse.rst | 212 ++++++++++++++++++++++++++++++++++
>>>>>>>>     2 files changed, 213 insertions(+)
>>>>>>>>     create mode 100644 Documentation/userspace-api/vduse.rst
>>>>>>> Just looking over the documentation briefly (I haven't studied the code
>>>>>>> yet)...
>>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>>>> +How VDUSE works
>>>>>>>> +------------
>>>>>>>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
>>>>>>>> +the character device (/dev/vduse/control). Then a device file with the
>>>>>>>> +specified name (/dev/vduse/$NAME) will appear, which can be used to
>>>>>>>> +implement the userspace vDPA device's control path and data path.
>>>>>>> These steps are taken after sending the VDPA_CMD_DEV_NEW netlink
>>>>>>> message? (Please consider reordering the documentation to make it clear
>>>>>>> what the sequence of steps are.)
>>>>>>>
>>>>>> No, VDUSE devices should be created before sending the
>>>>>> VDPA_CMD_DEV_NEW netlink messages which might produce I/Os to VDUSE.
>>>>> I see. Please include an overview of the steps before going into detail.
>>>>> Something like:
>>>>>
>>>>>      VDUSE devices are started as follows:
>>>>>
>>>>>      1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
>>>>>         /dev/vduse/control.
>>>>>
>>>>>      2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
>>>>>         messages will arrive while attaching the VDUSE instance to vDPA.
>>>>>
>>>>>      3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
>>>>>         instance to vDPA.
>>>>>
>>>>>      VDUSE devices are stopped as follows:
>>>>>
>>>>>      ...
>>>>>
>>>>>>>> +     static int netlink_add_vduse(const char *name, int device_id)
>>>>>>>> +     {
>>>>>>>> +             struct nl_sock *nlsock;
>>>>>>>> +             struct nl_msg *msg;
>>>>>>>> +             int famid;
>>>>>>>> +
>>>>>>>> +             nlsock = nl_socket_alloc();
>>>>>>>> +             if (!nlsock)
>>>>>>>> +                     return -ENOMEM;
>>>>>>>> +
>>>>>>>> +             if (genl_connect(nlsock))
>>>>>>>> +                     goto free_sock;
>>>>>>>> +
>>>>>>>> +             famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
>>>>>>>> +             if (famid < 0)
>>>>>>>> +                     goto close_sock;
>>>>>>>> +
>>>>>>>> +             msg = nlmsg_alloc();
>>>>>>>> +             if (!msg)
>>>>>>>> +                     goto close_sock;
>>>>>>>> +
>>>>>>>> +             if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
>>>>>>>> +                 VDPA_CMD_DEV_NEW, 0))
>>>>>>>> +                     goto nla_put_failure;
>>>>>>>> +
>>>>>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
>>>>>>>> +             NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
>>>>>>>> +             NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
>>>>>>> What are the permission/capability requirements for VDUSE?
>>>>>>>
>>>>>> Now I think we need privileged permission (root user). Because
>>>>>> userspace daemon is able to access avail vring, used vring, descriptor
>>>>>> table in kernel driver directly.
>>>>> Please state this explicitly at the start of the document. Existing
>>>>> interfaces like FUSE are designed to avoid trusting userspace.
>>>> There're some subtle difference here. VDUSE present a device to kernel which
>>>> means IOMMU is probably the only thing to prevent a malicous device.
>>>>
>>>>
>>>>> Therefore
>>>>> people might think the same is the case here. It's critical that people
>>>>> are aware of this before deploying VDUSE with virtio-vdpa.
>>>>>
>>>>> We should probably pause here and think about whether it's possible to
>>>>> avoid trusting userspace. Even if it takes some effort and costs some
>>>>> performance it would probably be worthwhile.
>>>> Since the bounce buffer is used the only attack surface is the coherent
>>>> area, if we want to enforce stronger isolation we need to use shadow
>>>> virtqueue (which is proposed in earlier version by me) in this case. But I'm
>>>> not sure it's worth to do that.
>>> The security situation needs to be clear before merging this feature.
>>
>> +1
>>
>>
>>> I think the IOMMU and vring can be made secure. What is more concerning
>>> is the kernel code that runs on top: VIRTIO device drivers, network
>>> stack, file systems, etc. They trust devices to an extent.
>>>
>>> Since virtio-vdpa is a big reason for doing VDUSE in the first place I
>>> don't think it makes sense to disable virtio-vdpa with VDUSE. A solution
>>> is needed.
>>
>> Yes, so the case of VDUSE is similar to the case of e.g. SEV.
>>
>> In both cases the device is not trusted and some kind of software IOTLB is used.
>>
>> That means we need protection in both the IOTLB and the virtio drivers.
>>
>> Let me post patches for virtio first.
>>
> Looking forward to your patches.
>
> Thanks.
> Yongji
>

Fortunately, the packed ring already does this, since the descriptor table
is expected to be rewritten by the device. I just need to convert the
split ring.

Thanks

Patch

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index acd2cc2a538d..f63119130898 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -24,6 +24,7 @@  place where this information is gathered.
    ioctl/index
    iommu
    media/index
+   vduse
 
 .. only::  subproject and html
 
diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
new file mode 100644
index 000000000000..8c4e2b2df8bb
--- /dev/null
+++ b/Documentation/userspace-api/vduse.rst
@@ -0,0 +1,212 @@ 
+==================================
+VDUSE - "vDPA Device in Userspace"
+==================================
+
+A vDPA (virtio data path acceleration) device is a device whose datapath
+complies with the virtio specifications but whose control path is vendor
+specific. vDPA devices can be either physically implemented in hardware
+or emulated by software. VDUSE is a framework that makes it
+possible to implement software-emulated vDPA devices in userspace.
+
+How VDUSE works
+---------------
+Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
+the character device (/dev/vduse/control). Then a device file with the
+specified name (/dev/vduse/$NAME) will appear, which can be used to
+implement the userspace vDPA device's control path and data path.
+
+To implement the control path, the VDUSE framework introduces a message-based
+communication protocol with the following types of control messages:
+
+- VDUSE_SET_VQ_ADDR: Set the vring address of virtqueue.
+
+- VDUSE_SET_VQ_NUM: Set the size of virtqueue
+
+- VDUSE_SET_VQ_READY: Set ready status of virtqueue
+
+- VDUSE_GET_VQ_READY: Get ready status of virtqueue
+
+- VDUSE_SET_VQ_STATE: Set the state for virtqueue
+
+- VDUSE_GET_VQ_STATE: Get the state for virtqueue
+
+- VDUSE_SET_FEATURES: Set virtio features supported by the driver
+
+- VDUSE_GET_FEATURES: Get virtio features supported by the device
+
+- VDUSE_SET_STATUS: Set the device status
+
+- VDUSE_GET_STATUS: Get the device status
+
+- VDUSE_SET_CONFIG: Write to device specific configuration space
+
+- VDUSE_GET_CONFIG: Read from device specific configuration space
+
+- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
+
+These control messages are mostly based on the vdpa_config_ops in
+include/linux/vdpa.h, which defines a unified interface to control
+different types of vDPA devices. Userspace needs to read()/write()
+on the VDUSE device file to receive those control messages from, and
+reply to, the VDUSE kernel module as follows:
+
+.. code-block:: c
+
+	static int vduse_message_handler(int dev_fd)
+	{
+		int len;
+		struct vduse_dev_request req;
+		struct vduse_dev_response resp;
+
+		len = read(dev_fd, &req, sizeof(req));
+		if (len != sizeof(req))
+			return -1;
+
+		resp.request_id = req.request_id;
+
+		switch (req.type) {
+
+		/* handle different types of message */
+
+		}
+
+		len = write(dev_fd, &resp, sizeof(resp));
+		if (len != sizeof(resp))
+			return -1;
+
+		return 0;
+	}
+
+In the data path, the vDPA device's iova regions are mapped into userspace
+with the help of the VDUSE_IOTLB_GET_FD ioctl on the VDUSE device file:
+
+- VDUSE_IOTLB_GET_FD: get the file descriptor of the first iova region that overlaps the given range.
+  Userspace can access this iova region by passing the fd and the corresponding size, offset
+  and perm to mmap(). For example:
+
+.. code-block:: c
+
+	static int perm_to_prot(uint8_t perm)
+	{
+		int prot = 0;
+
+		switch (perm) {
+		case VDUSE_ACCESS_WO:
+			prot |= PROT_WRITE;
+			break;
+		case VDUSE_ACCESS_RO:
+			prot |= PROT_READ;
+			break;
+		case VDUSE_ACCESS_RW:
+			prot |= PROT_READ | PROT_WRITE;
+			break;
+		}
+
+		return prot;
+	}
+
+	static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
+	{
+		int fd;
+		void *addr;
+		size_t size;
+		struct vduse_iotlb_entry entry;
+
+		entry.start = iova;
+		entry.last = iova;
+		fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
+		if (fd < 0)
+			return NULL;
+
+		size = entry.last - entry.start + 1;
+		*len = entry.last - iova + 1;
+		addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
+			    fd, entry.offset);
+		close(fd);
+		if (addr == MAP_FAILED)
+			return NULL;
+
+		/* do something to cache this iova region */
+
+		return addr + iova - entry.start;
+	}
+
+Besides, the following ioctls on the VDUSE device file are provided to support
+interrupt injection and setting up eventfd for virtqueue kicks:
+
+- VDUSE_VQ_SETUP_KICKFD: set the kickfd for a virtqueue; this eventfd is used
+  by the VDUSE kernel module to notify userspace to consume the vring.
+
+- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue
+
+- VDUSE_INJECT_CONFIG_IRQ: inject a config interrupt
+
+Register VDUSE device on vDPA bus
+---------------------------------
+In order to make the VDUSE device work, the administrator needs to use the management
+API (netlink) to register it on the vDPA bus. Sample code is shown below:
+
+.. code-block:: c
+
+	static int netlink_add_vduse(const char *name, int device_id)
+	{
+		struct nl_sock *nlsock;
+		struct nl_msg *msg;
+		int famid;
+
+		nlsock = nl_socket_alloc();
+		if (!nlsock)
+			return -ENOMEM;
+
+		if (genl_connect(nlsock))
+			goto free_sock;
+
+		famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
+		if (famid < 0)
+			goto close_sock;
+
+		msg = nlmsg_alloc();
+		if (!msg)
+			goto close_sock;
+
+		if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
+		    VDPA_CMD_DEV_NEW, 0))
+			goto nla_put_failure;
+
+		NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
+		NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
+		NLA_PUT_U32(msg, VDPA_ATTR_DEV_ID, device_id);
+
+		if (nl_send_sync(nlsock, msg))
+			goto close_sock;
+
+		nl_close(nlsock);
+		nl_socket_free(nlsock);
+
+		return 0;
+	nla_put_failure:
+		nlmsg_free(msg);
+	close_sock:
+		nl_close(nlsock);
+	free_sock:
+		nl_socket_free(nlsock);
+		return -1;
+	}
+
+MMU-based IOMMU Driver
+----------------------
+The VDUSE framework implements an MMU-based on-chip IOMMU driver to support
+mapping the kernel DMA buffer into the userspace iova region dynamically.
+This is mainly designed for the virtio-vdpa case (kernel virtio drivers).
+
+The basic idea behind this driver is treating MMU (VA->PA) as IOMMU (IOVA->PA).
+The driver will set up MMU mapping instead of IOMMU mapping for the DMA transfer
+so that the userspace process is able to use its virtual address to access
+the DMA buffer in kernel.
+
+To avoid security issues, a bounce-buffering mechanism is introduced to
+prevent userspace from directly accessing the original buffer, which may contain other
+kernel data. During mapping and unmapping, the driver copies the data from
+the original buffer to the bounce buffer and back, depending on the direction of
+the transfer. The bounce-buffer addresses are mapped into the user address
+space instead of the original ones.
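
As a rough illustration of the bounce-buffering idea described at the end of the patch, here is a minimal userspace sketch in C. It is not the VDUSE kernel code: every name in it (the sketch_* types and functions) is hypothetical, and the copy rules are only an assumption drawn from the description above (copy in at map time when the device side will read, copy back at unmap time when the device side has written).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is NOT the VDUSE kernel code. */
enum sketch_dma_dir {
	SKETCH_DMA_TO_DEVICE,		/* driver writes, device side reads */
	SKETCH_DMA_FROM_DEVICE,		/* device side writes, driver reads */
	SKETCH_DMA_BIDIRECTIONAL,
};

struct sketch_bounce_map {
	void *orig;			/* original buffer, never exposed */
	void *bounce;			/* the only memory the device side sees */
	size_t len;
	enum sketch_dma_dir dir;
};

/*
 * Map: return the bounce buffer in place of the original one. Data is
 * copied in only if the device side is going to read it; otherwise the
 * bounce buffer is zeroed so that no stale data is exposed.
 */
static void *sketch_bounce_map_buf(struct sketch_bounce_map *m, void *orig,
				   void *bounce, size_t len,
				   enum sketch_dma_dir dir)
{
	assert(m && orig && bounce);

	m->orig = orig;
	m->bounce = bounce;
	m->len = len;
	m->dir = dir;

	if (dir == SKETCH_DMA_TO_DEVICE || dir == SKETCH_DMA_BIDIRECTIONAL)
		memcpy(bounce, orig, len);
	else
		memset(bounce, 0, len);

	return bounce;
}

/*
 * Unmap: copy data back to the original buffer only if the device side
 * was allowed to write. Writes made through a TO_DEVICE mapping are
 * dropped here and never reach the original buffer.
 */
static void sketch_bounce_unmap_buf(struct sketch_bounce_map *m)
{
	assert(m && m->orig && m->bounce);

	if (m->dir == SKETCH_DMA_FROM_DEVICE ||
	    m->dir == SKETCH_DMA_BIDIRECTIONAL)
		memcpy(m->orig, m->bounce, m->len);

	m->bounce = NULL;
}
```

In this sketch the device side only ever receives the pointer returned by sketch_bounce_map_buf(), so it cannot reach any data that sits next to the original buffer. The cost is one extra copy per direction, which is the trade-off the document accepts for the virtio-vdpa case.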