[06/11] vhost: support delayed vq creation

Message ID	1604528804-2878-7-git-send-email-michael.christie@oracle.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=x8jc=EK=vger.kernel.org=target-devel-owner@kernel.org> From: Mike Christie <michael.christie@oracle.com> To: martin.petersen@oracle.com, linux-scsi@vger.kernel.org, target-devel@vger.kernel.org, mst@redhat.com, jasowang@redhat.com, pbonzini@redhat.com, stefanha@redhat.com, virtualization@lists.linux-foundation.org Subject: [PATCH 06/11] vhost: support delayed vq creation Date: Wed, 4 Nov 2020 16:26:39 -0600 Message-Id: <1604528804-2878-7-git-send-email-michael.christie@oracle.com> In-Reply-To: <1604528804-2878-1-git-send-email-michael.christie@oracle.com> References: <1604528804-2878-1-git-send-email-michael.christie@oracle.com> Precedence: bulk
Series	vhost: vhost-scsi bug fixes \| expand [00/11,V4] vhost: vhost-scsi bug fixes [01/11] vhost scsi: add lun parser helper [02/11] vhost: remove work arg from vhost_work_flush [03/11] vhost net: use goto error handling in open [04/11] vhost: prep vhost_dev_init users to handle failures [05/11] vhost: move vq iovec allocation to dev init time [06/11] vhost: support delayed vq creation [07/11] vhost scsi: support delayed IO vq creation [08/11] vhost scsi: alloc cmds per vq instead of session [09/11] vhost scsi: fix cmd completion race [10/11] vhost scsi: Add support for LUN resets. [11/11] vhost scsi: remove extra flushes

Mike Christie Nov. 4, 2020, 10:26 p.m. UTC

This allows vq creation to be done when it's first accessed by
userspace. vhost-scsi doesn't know how many queues the user requested
until they are first setup, and we don't want to allocate resources
like the iovecs for 128 vqs when we are only using 1 or 2 most of the
time. In the next pathces, vhost-scsi will also switch to preallocating
cmds per vq instead of per lio session and we don't want to allocate
them for 127 extra vqs if they are not in use.

With this patch when a driver calls vhost_dev_init they pass in the
number of vqs that they know they need and the max they can support.
This patch has all the drivers pass in the same value for both the
initial number of vqs and the max. The next patch will convert scsi.
The other drivers like net/vsock have their vqs hard coded in the
kernel or setup/discovered via other methods like with vdpa.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/net.c   |  2 +-
 drivers/vhost/scsi.c  |  4 +--
 drivers/vhost/test.c  |  5 ++--
 drivers/vhost/vdpa.c  |  2 +-
 drivers/vhost/vhost.c | 71 ++++++++++++++++++++++++++++++++++-----------------
 drivers/vhost/vhost.h |  7 +++--
 drivers/vhost/vsock.c | 11 ++++----
 7 files changed, 66 insertions(+), 36 deletions(-)

Jason Wang Nov. 9, 2020, 4:01 a.m. UTC | #1

On 2020/11/5 上午6:26, Mike Christie wrote:
> This allows vq creation to be done when it's first accessed by
> userspace. vhost-scsi doesn't know how many queues the user requested
> until they are first setup, and we don't want to allocate resources
> like the iovecs for 128 vqs when we are only using 1 or 2 most of the
> time. In the next pathces, vhost-scsi will also switch to preallocating
> cmds per vq instead of per lio session and we don't want to allocate
> them for 127 extra vqs if they are not in use.
>
> With this patch when a driver calls vhost_dev_init they pass in the
> number of vqs that they know they need and the max they can support.
> This patch has all the drivers pass in the same value for both the
> initial number of vqs and the max. The next patch will convert scsi.
> The other drivers like net/vsock have their vqs hard coded in the
> kernel or setup/discovered via other methods like with vdpa.
>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>


This looks like a delayed vq metadata creation instead of vq itself.

Several questions:

- can we simply introduce new ioctls to set the #max_vqs for SCSI? Or 
simply introduce VRING_ENABLE ioctl for SCSI and let qemu to call that 
ioctl for 1.0 device, qemu can simply forward the ENABLE command to 
vhost-scsi, for 0.9x device qemu can simply check whether ring.addr is 
zero or not?
- if not, it's better to convert all the devices/queues to behave as if 
SCSI to have a consistent way allocate metadata
- do we need to care about vq free, e.g free vqs during reset?

Thanks


> ---
>   drivers/vhost/net.c   |  2 +-
>   drivers/vhost/scsi.c  |  4 +--
>   drivers/vhost/test.c  |  5 ++--
>   drivers/vhost/vdpa.c  |  2 +-
>   drivers/vhost/vhost.c | 71 ++++++++++++++++++++++++++++++++++-----------------
>   drivers/vhost/vhost.h |  7 +++--
>   drivers/vhost/vsock.c | 11 ++++----
>   7 files changed, 66 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index fd30b53..fce46f0 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -1316,7 +1316,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>   		n->vqs[i].rx_ring = NULL;
>   		vhost_net_buf_init(&n->vqs[i].rxq);
>   	}
> -	if (vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
> +	if (vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX, VHOST_NET_VQ_MAX,
>   			   UIO_MAXIOV + VHOST_NET_BATCH,
>   			   VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
>   			   NULL))
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index ed8baf9..5b3720e 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1632,8 +1632,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
>   		vqs[i] = &vs->vqs[i].vq;
>   		vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
>   	}
> -	r = vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ, UIO_MAXIOV,
> -			   VHOST_SCSI_WEIGHT, 0, true, NULL);
> +	r = vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ, VHOST_SCSI_MAX_VQ,
> +			   UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0, true, NULL);
>   	if (r)
>   		goto err_dev_init;
>   
> diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> index c255ae5..9d2bfa3 100644
> --- a/drivers/vhost/test.c
> +++ b/drivers/vhost/test.c
> @@ -119,8 +119,9 @@ static int vhost_test_open(struct inode *inode, struct file *f)
>   	dev = &n->dev;
>   	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
>   	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
> -	if (vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
> -			   VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL)
> +	if (vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, VHOST_TEST_VQ_MAX,
> +			   UIO_MAXIOV, VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT,
> +			   true, NULL)
>   		goto err_dev_init;
>   
>   	f->private_data = n;
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index c1615a7..6abd9d8 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -829,7 +829,7 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep)
>   		vqs[i] = &v->vqs[i];
>   		vqs[i]->handle_kick = handle_vq_kick;
>   	}
> -	r = vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
> +	r = vhost_dev_init(dev, vqs, nvqs, nvqs, 0, 0, 0, false,
>   			   vhost_vdpa_process_iotlb_msg);
>   	if (r)
>   		goto err_dev_init;
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index a4a4450..2ca2e71 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -294,7 +294,7 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
>   {
>   	int i;
>   
> -	for (i = 0; i < d->nvqs; ++i)
> +	for (i = 0; i < d->max_nvqs; ++i)
>   		__vhost_vq_meta_reset(d->vqs[i]);
>   }
>   
> @@ -331,6 +331,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>   	vq->busyloop_timeout = 0;
>   	vq->umem = NULL;
>   	vq->iotlb = NULL;
> +	vq->initialized = false;
>   	vhost_vring_call_reset(&vq->call_ctx);
>   	__vhost_vq_meta_reset(vq);
>   }
> @@ -411,7 +412,7 @@ static void vhost_dev_free_iovecs(struct vhost_dev *dev)
>   {
>   	int i;
>   
> -	for (i = 0; i < dev->nvqs; ++i)
> +	for (i = 0; i < dev->max_nvqs; ++i)
>   		vhost_vq_free_iovecs(dev->vqs[i]);
>   }
>   
> @@ -456,7 +457,7 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue *vq,
>   	return sizeof(*vq->desc) * num;
>   }
>   
> -static int vhost_vq_init(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> +static void __vhost_vq_init(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>   {
>   	vq->log = NULL;
>   	vq->indirect = NULL;
> @@ -467,12 +468,29 @@ static int vhost_vq_init(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>   
>   	if (vq->handle_kick)
>   		vhost_poll_init(&vq->poll, vq->handle_kick, EPOLLIN, dev);
> +}
> +
> +static int vhost_vq_init(struct vhost_dev *dev, int vq_idx)
> +{
> +	struct vhost_virtqueue *vq;
> +	int ret;
> +
> +	if (vq_idx >= dev->max_nvqs)
> +		return -ENOBUFS;
> +
> +	vq = dev->vqs[vq_idx];
> +	__vhost_vq_init(dev, vq);
> +	ret = vhost_vq_alloc_iovecs(dev, vq);
> +	if (ret)
> +		return ret;
>   
> -	return vhost_vq_alloc_iovecs(dev, vq);
> +	vq->initialized = true;
> +	dev->nvqs++;
> +	return 0;
>   }
>   
>   int vhost_dev_init(struct vhost_dev *dev,
> -		   struct vhost_virtqueue **vqs, int nvqs,
> +		   struct vhost_virtqueue **vqs, int nvqs, int max_nvqs,
>   		   int iov_limit, int weight, int byte_weight,
>   		   bool use_worker,
>   		   int (*msg_handler)(struct vhost_dev *dev,
> @@ -481,7 +499,8 @@ int vhost_dev_init(struct vhost_dev *dev,
>   	int i;
>   
>   	dev->vqs = vqs;
> -	dev->nvqs = nvqs;
> +	dev->nvqs = 0;
> +	dev->max_nvqs = max_nvqs;
>   	mutex_init(&dev->mutex);
>   	dev->log_ctx = NULL;
>   	dev->umem = NULL;
> @@ -499,12 +518,15 @@ int vhost_dev_init(struct vhost_dev *dev,
>   	INIT_LIST_HEAD(&dev->pending_list);
>   	spin_lock_init(&dev->iotlb_lock);
>   
> -
> -	for (i = 0; i < dev->nvqs; ++i) {
> -		if (vhost_vq_init(dev, dev->vqs[i]))
> +	for (i = 0; i < nvqs; ++i) {
> +		if (vhost_vq_init(dev, i))
>   			goto err_vq_init;
>   	}
>   
> +	for (; i < dev->max_nvqs; ++i)
> +		/* Just prep/clear the fields and set initialized=false */
> +		__vhost_vq_init(dev, dev->vqs[i]);
> +
>   	return 0;
>   
>   err_vq_init:
> @@ -652,7 +674,7 @@ void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_iotlb *umem)
>   	/* We don't need VQ locks below since vhost_dev_cleanup makes sure
>   	 * VQs aren't running.
>   	 */
> -	for (i = 0; i < dev->nvqs; ++i)
> +	for (i = 0; i < dev->max_nvqs; ++i)
>   		dev->vqs[i]->umem = umem;
>   }
>   EXPORT_SYMBOL_GPL(vhost_dev_reset_owner);
> @@ -661,7 +683,7 @@ void vhost_dev_stop(struct vhost_dev *dev)
>   {
>   	int i;
>   
> -	for (i = 0; i < dev->nvqs; ++i) {
> +	for (i = 0; i < dev->max_nvqs; ++i) {
>   		if (dev->vqs[i]->kick && dev->vqs[i]->handle_kick) {
>   			vhost_poll_stop(&dev->vqs[i]->poll);
>   			vhost_poll_flush(&dev->vqs[i]->poll);
> @@ -693,7 +715,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>   {
>   	int i;
>   
> -	for (i = 0; i < dev->nvqs; ++i) {
> +	for (i = 0; i < dev->max_nvqs; ++i) {
>   		if (dev->vqs[i]->error_ctx)
>   			eventfd_ctx_put(dev->vqs[i]->error_ctx);
>   		if (dev->vqs[i]->kick)
> @@ -787,7 +809,7 @@ static bool memory_access_ok(struct vhost_dev *d, struct vhost_iotlb *umem,
>   {
>   	int i;
>   
> -	for (i = 0; i < d->nvqs; ++i) {
> +	for (i = 0; i < d->max_nvqs; ++i) {
>   		bool ok;
>   		bool log;
>   
> @@ -999,14 +1021,14 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>   static void vhost_dev_lock_vqs(struct vhost_dev *d)
>   {
>   	int i = 0;
> -	for (i = 0; i < d->nvqs; ++i)
> +	for (i = 0; i < d->max_nvqs; ++i)
>   		mutex_lock_nested(&d->vqs[i]->mutex, i);
>   }
>   
>   static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>   {
>   	int i = 0;
> -	for (i = 0; i < d->nvqs; ++i)
> +	for (i = 0; i < d->max_nvqs; ++i)
>   		mutex_unlock(&d->vqs[i]->mutex);
>   }
>   
> @@ -1462,7 +1484,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>   	d->umem = newumem;
>   
>   	/* All memory accesses are done under some VQ mutex. */
> -	for (i = 0; i < d->nvqs; ++i) {
> +	for (i = 0; i < d->max_nvqs; ++i) {
>   		mutex_lock(&d->vqs[i]->mutex);
>   		d->vqs[i]->umem = newumem;
>   		mutex_unlock(&d->vqs[i]->mutex);
> @@ -1590,11 +1612,14 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>   	r = get_user(idx, idxp);
>   	if (r < 0)
>   		return r;
> -	if (idx >= d->nvqs)
> -		return -ENOBUFS;
>   
> -	idx = array_index_nospec(idx, d->nvqs);
> +	idx = array_index_nospec(idx, d->max_nvqs);
>   	vq = d->vqs[idx];
> +	if (!vq->initialized) {
> +		r = vhost_vq_init(d, idx);
> +		if (r)
> +			return r;
> +	}
>   
>   	if (ioctl == VHOST_SET_VRING_NUM ||
>   	    ioctl == VHOST_SET_VRING_ADDR) {
> @@ -1724,7 +1749,7 @@ int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled)
>   	oiotlb = d->iotlb;
>   	d->iotlb = niotlb;
>   
> -	for (i = 0; i < d->nvqs; ++i) {
> +	for (i = 0; i < d->max_nvqs; ++i) {
>   		struct vhost_virtqueue *vq = d->vqs[i];
>   
>   		mutex_lock(&vq->mutex);
> @@ -1771,7 +1796,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
>   			r = -EFAULT;
>   			break;
>   		}
> -		for (i = 0; i < d->nvqs; ++i) {
> +		for (i = 0; i < d->max_nvqs; ++i) {
>   			struct vhost_virtqueue *vq;
>   			void __user *base = (void __user *)(unsigned long)p;
>   			vq = d->vqs[i];
> @@ -1794,7 +1819,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
>   			break;
>   		}
>   		swap(ctx, d->log_ctx);
> -		for (i = 0; i < d->nvqs; ++i) {
> +		for (i = 0; i < d->max_nvqs; ++i) {
>   			mutex_lock(&d->vqs[i]->mutex);
>   			d->vqs[i]->log_ctx = d->log_ctx;
>   			mutex_unlock(&d->vqs[i]->mutex);
> @@ -2609,7 +2634,7 @@ void vhost_set_backend_features(struct vhost_dev *dev, u64 features)
>   	int i;
>   
>   	mutex_lock(&dev->mutex);
> -	for (i = 0; i < dev->nvqs; ++i) {
> +	for (i = 0; i < dev->max_nvqs; ++i) {
>   		vq = dev->vqs[i];
>   		mutex_lock(&vq->mutex);
>   		vq->acked_backend_features = features;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 9ad34b1..9677870 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -132,6 +132,8 @@ struct vhost_virtqueue {
>   	bool user_be;
>   #endif
>   	u32 busyloop_timeout;
> +
> +	bool initialized;
>   };
>   
>   struct vhost_msg_node {
> @@ -148,6 +150,7 @@ struct vhost_dev {
>   	struct mutex mutex;
>   	struct vhost_virtqueue **vqs;
>   	int nvqs;
> +	int max_nvqs;
>   	struct eventfd_ctx *log_ctx;
>   	struct llist_head work_list;
>   	struct task_struct *worker;
> @@ -168,8 +171,8 @@ struct vhost_dev {
>   
>   bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int total_len);
>   int vhost_dev_init(struct vhost_dev *dev, struct vhost_virtqueue **vqs,
> -		   int nvqs, int iov_limit, int weight, int byte_weight,
> -		   bool use_worker,
> +		   int nvqs, int max_nvqs, int iov_limit, int weight,
> +		   int byte_weight, bool use_worker,
>   		   int (*msg_handler)(struct vhost_dev *dev,
>   				      struct vhost_iotlb_msg *msg));
>   long vhost_dev_set_owner(struct vhost_dev *dev);
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index cae0083..bcdfd58 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -606,7 +606,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
>   {
>   	struct vhost_virtqueue **vqs;
>   	struct vhost_vsock *vsock;
> -	int ret;
> +	int ret, nvqs;
>   
>   	/* This struct is large and allocation could fail, fall back to vmalloc
>   	 * if there is no other way.
> @@ -615,7 +615,8 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
>   	if (!vsock)
>   		return -ENOMEM;
>   
> -	vqs = kmalloc_array(ARRAY_SIZE(vsock->vqs), sizeof(*vqs), GFP_KERNEL);
> +	nvqs = ARRAY_SIZE(vsock->vqs);
> +	vqs = kmalloc_array(nvqs, sizeof(*vqs), GFP_KERNEL);
>   	if (!vqs) {
>   		ret = -ENOMEM;
>   		goto out;
> @@ -630,9 +631,9 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
>   	vsock->vqs[VSOCK_VQ_TX].handle_kick = vhost_vsock_handle_tx_kick;
>   	vsock->vqs[VSOCK_VQ_RX].handle_kick = vhost_vsock_handle_rx_kick;
>   
> -	ret = vhost_dev_init(&vsock->dev, vqs, ARRAY_SIZE(vsock->vqs),
> -			     UIO_MAXIOV, VHOST_VSOCK_PKT_WEIGHT,
> -			     VHOST_VSOCK_WEIGHT, true, NULL);
> +	ret = vhost_dev_init(&vsock->dev, vqs, nvqs, nvqs, UIO_MAXIOV,
> +			     VHOST_VSOCK_PKT_WEIGHT, VHOST_VSOCK_WEIGHT, true,
> +			     NULL);
>   	if (ret)
>   		goto err_dev_init;
>

Mike Christie Nov. 9, 2020, 6:41 p.m. UTC | #2

On 11/8/20 10:01 PM, Jason Wang wrote:
> 
> On 2020/11/5 上午6:26, Mike Christie wrote:
>> This allows vq creation to be done when it's first accessed by
>> userspace. vhost-scsi doesn't know how many queues the user requested
>> until they are first setup, and we don't want to allocate resources
>> like the iovecs for 128 vqs when we are only using 1 or 2 most of the
>> time. In the next pathces, vhost-scsi will also switch to preallocating
>> cmds per vq instead of per lio session and we don't want to allocate
>> them for 127 extra vqs if they are not in use.
>>
>> With this patch when a driver calls vhost_dev_init they pass in the
>> number of vqs that they know they need and the max they can support.
>> This patch has all the drivers pass in the same value for both the
>> initial number of vqs and the max. The next patch will convert scsi.
>> The other drivers like net/vsock have their vqs hard coded in the
>> kernel or setup/discovered via other methods like with vdpa.
>>
>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> 
> 
> This looks like a delayed vq metadata creation instead of vq itself. >
> Several questions:
> 
> - can we simply introduce new ioctls to set the #max_vqs for SCSI? Or 

We can do this code wise, but I was not sure about the spec and 
VHOST_SET_OWNER's requirements. For the latter, the first ioctl cmd 
quemu does is VHOST_SET_OWNER. The vhost_dev_ioctl then requires that 
the caller be the owner to be able to do any other ioctls. The 
individual drivers like scsi, net, etc allow userspace to do ioctls 
without being the owner, and I was not sure if that is by accident or 
design.

> simply introduce VRING_ENABLE ioctl for SCSI and let qemu to call that 
> ioctl for 1.0 device, qemu can simply forward the ENABLE command to 

If it's allowed to do ioctl cmds that set vhost driver settings before 
VHOST_SET_OWNER then we can do the VRING_ENABLE calls first. For that 
though, I had one question about the spec. In section "4.1.4.3.2 Driver 
Requirements: Common configuration structure layout" it had the line:

-----
The driver MUST configure the other virtqueue fields before enabling the 
virtqueue with queue_enable.
-----

Does that apply to our case, or only when it's being used in a 
virtio_pci_common_cfg struct based command? If it applies to our case, 
then we have to change userspace to set config settings like the 
queue_size and features earlier, right?

I was thinking if we had to deal with that then maybe it's better to add 
a new ioctl like VRING_CREATE. It would be done first, singles that the 
kernel should do the vq creation (allocation, init steps we do in 
vhost_dev_init, etc).


> vhost-scsi, for 0.9x device qemu can simply check whether ring.addr is 
> zero or not
I had a question about that and my thread per vq patches.

If you want to pair this patchset down to it's bare bug fixes that 
prevent users from getting IO errors, I would do ring.addr check only 
approach for this bug fix patchset.

For my other patchset to do a vhost worker thread per vq, I think it 
would be cleanest to have a new ioctl or a userspace change. The problem 
is that we create the worker thread in vhost_dev_set_owner and at that 
time we do not know how many vqs there are going to be. For that we could:
1. Add a new VHOST_SET_OWNER_V2 that has an arg which contains the 
number of queues or worker threads
2. do the VRING_ENABLE approach if allowed spec/VHOST_SET_OWNER wise.
3. add a new ioctl that is done before VHOST_SET_OWNER

If we split it up like this, current users get the IO error bug fixes. 
This patchset gets a lot smaller. Users would then only have to upgrade 
userspace to get the multple thread feature.

Compat setup support might be cleaner too. If we did the 
VHOST_SET_OWNER_V2 or added a new ioctl then vhost_dev_set_owner just 
takes the new arg and loops num_queues/workers times to create more threads.


> - if not, it's better to convert all the devices/queues to behave as if 
> SCSI to have a consistent way allocate metadata
> - do we need to care about vq free, e.g free vqs during reset?
> 
> Thanks
> 
> 
>> ---
>>   drivers/vhost/net.c   |  2 +-
>>   drivers/vhost/scsi.c  |  4 +--
>>   drivers/vhost/test.c  |  5 ++--
>>   drivers/vhost/vdpa.c  |  2 +-
>>   drivers/vhost/vhost.c | 71 
>> ++++++++++++++++++++++++++++++++++-----------------
>>   drivers/vhost/vhost.h |  7 +++--
>>   drivers/vhost/vsock.c | 11 ++++----
>>   7 files changed, 66 insertions(+), 36 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index fd30b53..fce46f0 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -1316,7 +1316,7 @@ static int vhost_net_open(struct inode *inode, 
>> struct file *f)
>>           n->vqs[i].rx_ring = NULL;
>>           vhost_net_buf_init(&n->vqs[i].rxq);
>>       }
>> -    if (vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
>> +    if (vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX, VHOST_NET_VQ_MAX,
>>                  UIO_MAXIOV + VHOST_NET_BATCH,
>>                  VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
>>                  NULL))
>> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
>> index ed8baf9..5b3720e 100644
>> --- a/drivers/vhost/scsi.c
>> +++ b/drivers/vhost/scsi.c
>> @@ -1632,8 +1632,8 @@ static int vhost_scsi_open(struct inode *inode, 
>> struct file *f)
>>           vqs[i] = &vs->vqs[i].vq;
>>           vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
>>       }
>> -    r = vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ, UIO_MAXIOV,
>> -               VHOST_SCSI_WEIGHT, 0, true, NULL);
>> +    r = vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ, 
>> VHOST_SCSI_MAX_VQ,
>> +               UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0, true, NULL);
>>       if (r)
>>           goto err_dev_init;
>> diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
>> index c255ae5..9d2bfa3 100644
>> --- a/drivers/vhost/test.c
>> +++ b/drivers/vhost/test.c
>> @@ -119,8 +119,9 @@ static int vhost_test_open(struct inode *inode, 
>> struct file *f)
>>       dev = &n->dev;
>>       vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
>>       n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
>> -    if (vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
>> -               VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL)
>> +    if (vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, VHOST_TEST_VQ_MAX,
>> +               UIO_MAXIOV, VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT,
>> +               true, NULL)
>>           goto err_dev_init;
>>       f->private_data = n;
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index c1615a7..6abd9d8 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -829,7 +829,7 @@ static int vhost_vdpa_open(struct inode *inode, 
>> struct file *filep)
>>           vqs[i] = &v->vqs[i];
>>           vqs[i]->handle_kick = handle_vq_kick;
>>       }
>> -    r = vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
>> +    r = vhost_dev_init(dev, vqs, nvqs, nvqs, 0, 0, 0, false,
>>                  vhost_vdpa_process_iotlb_msg);
>>       if (r)
>>           goto err_dev_init;
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index a4a4450..2ca2e71 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -294,7 +294,7 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
>>   {
>>       int i;
>> -    for (i = 0; i < d->nvqs; ++i)
>> +    for (i = 0; i < d->max_nvqs; ++i)
>>           __vhost_vq_meta_reset(d->vqs[i]);
>>   }
>> @@ -331,6 +331,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>>       vq->busyloop_timeout = 0;
>>       vq->umem = NULL;
>>       vq->iotlb = NULL;
>> +    vq->initialized = false;
>>       vhost_vring_call_reset(&vq->call_ctx);
>>       __vhost_vq_meta_reset(vq);
>>   }
>> @@ -411,7 +412,7 @@ static void vhost_dev_free_iovecs(struct vhost_dev 
>> *dev)
>>   {
>>       int i;
>> -    for (i = 0; i < dev->nvqs; ++i)
>> +    for (i = 0; i < dev->max_nvqs; ++i)
>>           vhost_vq_free_iovecs(dev->vqs[i]);
>>   }
>> @@ -456,7 +457,7 @@ static size_t vhost_get_desc_size(struct 
>> vhost_virtqueue *vq,
>>       return sizeof(*vq->desc) * num;
>>   }
>> -static int vhost_vq_init(struct vhost_dev *dev, struct 
>> vhost_virtqueue *vq)
>> +static void __vhost_vq_init(struct vhost_dev *dev, struct 
>> vhost_virtqueue *vq)
>>   {
>>       vq->log = NULL;
>>       vq->indirect = NULL;
>> @@ -467,12 +468,29 @@ static int vhost_vq_init(struct vhost_dev *dev, 
>> struct vhost_virtqueue *vq)
>>       if (vq->handle_kick)
>>           vhost_poll_init(&vq->poll, vq->handle_kick, EPOLLIN, dev);
>> +}
>> +
>> +static int vhost_vq_init(struct vhost_dev *dev, int vq_idx)
>> +{
>> +    struct vhost_virtqueue *vq;
>> +    int ret;
>> +
>> +    if (vq_idx >= dev->max_nvqs)
>> +        return -ENOBUFS;
>> +
>> +    vq = dev->vqs[vq_idx];
>> +    __vhost_vq_init(dev, vq);
>> +    ret = vhost_vq_alloc_iovecs(dev, vq);
>> +    if (ret)
>> +        return ret;
>> -    return vhost_vq_alloc_iovecs(dev, vq);
>> +    vq->initialized = true;
>> +    dev->nvqs++;
>> +    return 0;
>>   }
>>   int vhost_dev_init(struct vhost_dev *dev,
>> -           struct vhost_virtqueue **vqs, int nvqs,
>> +           struct vhost_virtqueue **vqs, int nvqs, int max_nvqs,
>>              int iov_limit, int weight, int byte_weight,
>>              bool use_worker,
>>              int (*msg_handler)(struct vhost_dev *dev,
>> @@ -481,7 +499,8 @@ int vhost_dev_init(struct vhost_dev *dev,
>>       int i;
>>       dev->vqs = vqs;
>> -    dev->nvqs = nvqs;
>> +    dev->nvqs = 0;
>> +    dev->max_nvqs = max_nvqs;
>>       mutex_init(&dev->mutex);
>>       dev->log_ctx = NULL;
>>       dev->umem = NULL;
>> @@ -499,12 +518,15 @@ int vhost_dev_init(struct vhost_dev *dev,
>>       INIT_LIST_HEAD(&dev->pending_list);
>>       spin_lock_init(&dev->iotlb_lock);
>> -
>> -    for (i = 0; i < dev->nvqs; ++i) {
>> -        if (vhost_vq_init(dev, dev->vqs[i]))
>> +    for (i = 0; i < nvqs; ++i) {
>> +        if (vhost_vq_init(dev, i))
>>               goto err_vq_init;
>>       }
>> +    for (; i < dev->max_nvqs; ++i)
>> +        /* Just prep/clear the fields and set initialized=false */
>> +        __vhost_vq_init(dev, dev->vqs[i]);
>> +
>>       return 0;
>>   err_vq_init:
>> @@ -652,7 +674,7 @@ void vhost_dev_reset_owner(struct vhost_dev *dev, 
>> struct vhost_iotlb *umem)
>>       /* We don't need VQ locks below since vhost_dev_cleanup makes sure
>>        * VQs aren't running.
>>        */
>> -    for (i = 0; i < dev->nvqs; ++i)
>> +    for (i = 0; i < dev->max_nvqs; ++i)
>>           dev->vqs[i]->umem = umem;
>>   }
>>   EXPORT_SYMBOL_GPL(vhost_dev_reset_owner);
>> @@ -661,7 +683,7 @@ void vhost_dev_stop(struct vhost_dev *dev)
>>   {
>>       int i;
>> -    for (i = 0; i < dev->nvqs; ++i) {
>> +    for (i = 0; i < dev->max_nvqs; ++i) {
>>           if (dev->vqs[i]->kick && dev->vqs[i]->handle_kick) {
>>               vhost_poll_stop(&dev->vqs[i]->poll);
>>               vhost_poll_flush(&dev->vqs[i]->poll);
>> @@ -693,7 +715,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>>   {
>>       int i;
>> -    for (i = 0; i < dev->nvqs; ++i) {
>> +    for (i = 0; i < dev->max_nvqs; ++i) {
>>           if (dev->vqs[i]->error_ctx)
>>               eventfd_ctx_put(dev->vqs[i]->error_ctx);
>>           if (dev->vqs[i]->kick)
>> @@ -787,7 +809,7 @@ static bool memory_access_ok(struct vhost_dev *d, 
>> struct vhost_iotlb *umem,
>>   {
>>       int i;
>> -    for (i = 0; i < d->nvqs; ++i) {
>> +    for (i = 0; i < d->max_nvqs; ++i) {
>>           bool ok;
>>           bool log;
>> @@ -999,14 +1021,14 @@ static inline int vhost_put_used_idx(struct 
>> vhost_virtqueue *vq)
>>   static void vhost_dev_lock_vqs(struct vhost_dev *d)
>>   {
>>       int i = 0;
>> -    for (i = 0; i < d->nvqs; ++i)
>> +    for (i = 0; i < d->max_nvqs; ++i)
>>           mutex_lock_nested(&d->vqs[i]->mutex, i);
>>   }
>>   static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>>   {
>>       int i = 0;
>> -    for (i = 0; i < d->nvqs; ++i)
>> +    for (i = 0; i < d->max_nvqs; ++i)
>>           mutex_unlock(&d->vqs[i]->mutex);
>>   }
>> @@ -1462,7 +1484,7 @@ static long vhost_set_memory(struct vhost_dev 
>> *d, struct vhost_memory __user *m)
>>       d->umem = newumem;
>>       /* All memory accesses are done under some VQ mutex. */
>> -    for (i = 0; i < d->nvqs; ++i) {
>> +    for (i = 0; i < d->max_nvqs; ++i) {
>>           mutex_lock(&d->vqs[i]->mutex);
>>           d->vqs[i]->umem = newumem;
>>           mutex_unlock(&d->vqs[i]->mutex);
>> @@ -1590,11 +1612,14 @@ long vhost_vring_ioctl(struct vhost_dev *d, 
>> unsigned int ioctl, void __user *arg
>>       r = get_user(idx, idxp);
>>       if (r < 0)
>>           return r;
>> -    if (idx >= d->nvqs)
>> -        return -ENOBUFS;
>> -    idx = array_index_nospec(idx, d->nvqs);
>> +    idx = array_index_nospec(idx, d->max_nvqs);
>>       vq = d->vqs[idx];
>> +    if (!vq->initialized) {
>> +        r = vhost_vq_init(d, idx);
>> +        if (r)
>> +            return r;
>> +    }
>>       if (ioctl == VHOST_SET_VRING_NUM ||
>>           ioctl == VHOST_SET_VRING_ADDR) {
>> @@ -1724,7 +1749,7 @@ int vhost_init_device_iotlb(struct vhost_dev *d, 
>> bool enabled)
>>       oiotlb = d->iotlb;
>>       d->iotlb = niotlb;
>> -    for (i = 0; i < d->nvqs; ++i) {
>> +    for (i = 0; i < d->max_nvqs; ++i) {
>>           struct vhost_virtqueue *vq = d->vqs[i];
>>           mutex_lock(&vq->mutex);
>> @@ -1771,7 +1796,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, 
>> unsigned int ioctl, void __user *argp)
>>               r = -EFAULT;
>>               break;
>>           }
>> -        for (i = 0; i < d->nvqs; ++i) {
>> +        for (i = 0; i < d->max_nvqs; ++i) {
>>               struct vhost_virtqueue *vq;
>>               void __user *base = (void __user *)(unsigned long)p;
>>               vq = d->vqs[i];
>> @@ -1794,7 +1819,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, 
>> unsigned int ioctl, void __user *argp)
>>               break;
>>           }
>>           swap(ctx, d->log_ctx);
>> -        for (i = 0; i < d->nvqs; ++i) {
>> +        for (i = 0; i < d->max_nvqs; ++i) {
>>               mutex_lock(&d->vqs[i]->mutex);
>>               d->vqs[i]->log_ctx = d->log_ctx;
>>               mutex_unlock(&d->vqs[i]->mutex);
>> @@ -2609,7 +2634,7 @@ void vhost_set_backend_features(struct vhost_dev 
>> *dev, u64 features)
>>       int i;
>>       mutex_lock(&dev->mutex);
>> -    for (i = 0; i < dev->nvqs; ++i) {
>> +    for (i = 0; i < dev->max_nvqs; ++i) {
>>           vq = dev->vqs[i];
>>           mutex_lock(&vq->mutex);
>>           vq->acked_backend_features = features;
>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> index 9ad34b1..9677870 100644
>> --- a/drivers/vhost/vhost.h
>> +++ b/drivers/vhost/vhost.h
>> @@ -132,6 +132,8 @@ struct vhost_virtqueue {
>>       bool user_be;
>>   #endif
>>       u32 busyloop_timeout;
>> +
>> +    bool initialized;
>>   };
>>   struct vhost_msg_node {
>> @@ -148,6 +150,7 @@ struct vhost_dev {
>>       struct mutex mutex;
>>       struct vhost_virtqueue **vqs;
>>       int nvqs;
>> +    int max_nvqs;
>>       struct eventfd_ctx *log_ctx;
>>       struct llist_head work_list;
>>       struct task_struct *worker;
>> @@ -168,8 +171,8 @@ struct vhost_dev {
>>   bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int 
>> total_len);
>>   int vhost_dev_init(struct vhost_dev *dev, struct vhost_virtqueue **vqs,
>> -           int nvqs, int iov_limit, int weight, int byte_weight,
>> -           bool use_worker,
>> +           int nvqs, int max_nvqs, int iov_limit, int weight,
>> +           int byte_weight, bool use_worker,
>>              int (*msg_handler)(struct vhost_dev *dev,
>>                         struct vhost_iotlb_msg *msg));
>>   long vhost_dev_set_owner(struct vhost_dev *dev);
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index cae0083..bcdfd58 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -606,7 +606,7 @@ static int vhost_vsock_dev_open(struct inode 
>> *inode, struct file *file)
>>   {
>>       struct vhost_virtqueue **vqs;
>>       struct vhost_vsock *vsock;
>> -    int ret;
>> +    int ret, nvqs;
>>       /* This struct is large and allocation could fail, fall back to 
>> vmalloc
>>        * if there is no other way.
>> @@ -615,7 +615,8 @@ static int vhost_vsock_dev_open(struct inode 
>> *inode, struct file *file)
>>       if (!vsock)
>>           return -ENOMEM;
>> -    vqs = kmalloc_array(ARRAY_SIZE(vsock->vqs), sizeof(*vqs), 
>> GFP_KERNEL);
>> +    nvqs = ARRAY_SIZE(vsock->vqs);
>> +    vqs = kmalloc_array(nvqs, sizeof(*vqs), GFP_KERNEL);
>>       if (!vqs) {
>>           ret = -ENOMEM;
>>           goto out;
>> @@ -630,9 +631,9 @@ static int vhost_vsock_dev_open(struct inode 
>> *inode, struct file *file)
>>       vsock->vqs[VSOCK_VQ_TX].handle_kick = vhost_vsock_handle_tx_kick;
>>       vsock->vqs[VSOCK_VQ_RX].handle_kick = vhost_vsock_handle_rx_kick;
>> -    ret = vhost_dev_init(&vsock->dev, vqs, ARRAY_SIZE(vsock->vqs),
>> -                 UIO_MAXIOV, VHOST_VSOCK_PKT_WEIGHT,
>> -                 VHOST_VSOCK_WEIGHT, true, NULL);
>> +    ret = vhost_dev_init(&vsock->dev, vqs, nvqs, nvqs, UIO_MAXIOV,
>> +                 VHOST_VSOCK_PKT_WEIGHT, VHOST_VSOCK_WEIGHT, true,
>> +                 NULL);
>>       if (ret)
>>           goto err_dev_init;
>

Mike Christie Nov. 9, 2020, 8:30 p.m. UTC | #3

On 11/9/20 12:41 PM, Mike Christie wrote:
> 
> If you want to pair this patchset down to it's bare bug fixes that 
> prevent users from getting IO errors, I would do ring.addr check only 
> approach for this bug fix patchset.

Oh yeah, just so we don't have to burn an extra day, above I'm proposing 
I repost the original patchset:

https://lore.kernel.org/linux-scsi/1600712588-9514-1-git-send-email-michael.christie@oracle.com/t/

for the bug fix only patches. It will have the compile error fixed and 
Bart's comment handled.

To even trim it down more I can also drop the last 2 patches:

0007-vhost-remove-work-arg-from-vhost_work_flush.patch
0008-vhost-scsi-remove-extra-flushes.patch

and send separately in a cleanups patchset since the extra flushes it 
kills don't really hurt.

Michael S. Tsirkin Nov. 9, 2020, 10:32 p.m. UTC | #4

On Mon, Nov 09, 2020 at 02:30:36PM -0600, Mike Christie wrote:
> On 11/9/20 12:41 PM, Mike Christie wrote:
> > 
> > If you want to pair this patchset down to it's bare bug fixes that
> > prevent users from getting IO errors, I would do ring.addr check only
> > approach for this bug fix patchset.
> 
> Oh yeah, just so we don't have to burn an extra day, above I'm proposing I
> repost the original patchset:
> 
> https://lore.kernel.org/linux-scsi/1600712588-9514-1-git-send-email-michael.christie@oracle.com/t/
> 
> for the bug fix only patches. It will have the compile error fixed and
> Bart's comment handled.
> 
> To even trim it down more I can also drop the last 2 patches:
> 
> 0007-vhost-remove-work-arg-from-vhost_work_flush.patch
> 0008-vhost-scsi-remove-extra-flushes.patch
> 
> and send separately in a cleanups patchset since the extra flushes it kills
> don't really hurt.

Makes sense to me.

Jason Wang Nov. 10, 2020, 2:44 a.m. UTC | #5

On 2020/11/10 上午2:41, Mike Christie wrote:
> On 11/8/20 10:01 PM, Jason Wang wrote:
>>
>> On 2020/11/5 上午6:26, Mike Christie wrote:
>>> This allows vq creation to be done when it's first accessed by
>>> userspace. vhost-scsi doesn't know how many queues the user requested
>>> until they are first setup, and we don't want to allocate resources
>>> like the iovecs for 128 vqs when we are only using 1 or 2 most of the
>>> time. In the next pathces, vhost-scsi will also switch to preallocating
>>> cmds per vq instead of per lio session and we don't want to allocate
>>> them for 127 extra vqs if they are not in use.
>>>
>>> With this patch when a driver calls vhost_dev_init they pass in the
>>> number of vqs that they know they need and the max they can support.
>>> This patch has all the drivers pass in the same value for both the
>>> initial number of vqs and the max. The next patch will convert scsi.
>>> The other drivers like net/vsock have their vqs hard coded in the
>>> kernel or setup/discovered via other methods like with vdpa.
>>>
>>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>>
>>
>> This looks like a delayed vq metadata creation instead of vq itself. >
>> Several questions:
>>
>> - can we simply introduce new ioctls to set the #max_vqs for SCSI? Or 
>
> We can do this code wise, but I was not sure about the spec and 
> VHOST_SET_OWNER's requirements. For the latter, the first ioctl cmd 
> quemu does is VHOST_SET_OWNER. The vhost_dev_ioctl then requires that 
> the caller be the owner to be able to do any other ioctls. The 
> individual drivers like scsi, net, etc allow userspace to do ioctls 
> without being the owner, and I was not sure if that is by accident or 
> design.


My understanding is that owner is required for memory mapping 
operations. For setting #max_vqs, net will talk with the backend 
(TUN/TAP) instead of vhost. For SCSI, it looks OK to do it without an 
owner (my understanding is that vhost-scsi doesn't need to tell #max_vqs 
to the under-layer).


>
>> simply introduce VRING_ENABLE ioctl for SCSI and let qemu to call 
>> that ioctl for 1.0 device, qemu can simply forward the ENABLE command to 
>
> If it's allowed to do ioctl cmds that set vhost driver settings before 
> VHOST_SET_OWNER then we can do the VRING_ENABLE calls first. For that 
> though, I had one question about the spec. In section "4.1.4.3.2 
> Driver Requirements: Common configuration structure layout" it had the 
> line:
>
> -----
> The driver MUST configure the other virtqueue fields before enabling 
> the virtqueue with queue_enable.
> -----
>
> Does that apply to our case, or only when it's being used in a 
> virtio_pci_common_cfg struct based command? 


The spec is for virtio(pci) only, for vhost, it could choose to follow 
the spec (which should be better) or not.


> If it applies to our case, then we have to change userspace to set 
> config settings like the queue_size and features earlier, right?


Yes, or do VRING_ENABLE after queue_size and features are set.


>
> I was thinking if we had to deal with that then maybe it's better to 
> add a new ioctl like VRING_CREATE. It would be done first, singles 
> that the kernel should do the vq creation (allocation, init steps we 
> do in vhost_dev_init, etc).
>

Technically, that's possible, but it looks like a new version of vhost 
ioctl protocol which requires new features to be advertised to userspace 
to choose between the two. I'm not sure it's worth to do that.


>
>> vhost-scsi, for 0.9x device qemu can simply check whether ring.addr 
>> is zero or not
> I had a question about that and my thread per vq patches.
>
> If you want to pair this patchset down to it's bare bug fixes that 
> prevent users from getting IO errors, I would do ring.addr check only 
> approach for this bug fix patchset.


Yes, I think it would be helpful if we can limit the changes to bug fixes.


>
> For my other patchset to do a vhost worker thread per vq, I think it 
> would be cleanest to have a new ioctl or a userspace change. The 
> problem is that we create the worker thread in vhost_dev_set_owner and 
> at that time we do not know how many vqs there are going to be. For 
> that we could:
> 1. Add a new VHOST_SET_OWNER_V2 that has an arg which contains the 
> number of queues or worker threads
> 2. do the VRING_ENABLE approach if allowed spec/VHOST_SET_OWNER wise.
> 3. add a new ioctl that is done before VHOST_SET_OWNER


Can we just create one worker during SET_OWNER and for the rest during 
VRING_ENABLE?


>
> If we split it up like this, current users get the IO error bug fixes. 
> This patchset gets a lot smaller. Users would then only have to 
> upgrade userspace to get the multple thread feature.


Yes.


>
> Compat setup support might be cleaner too. If we did the 
> VHOST_SET_OWNER_V2 or added a new ioctl then vhost_dev_set_owner just 
> takes the new arg and loops num_queues/workers times to create more 
> threads.
>
>

Thanks


>> - if not, it's better to convert all the devices/queues to behave as 
>> if SCSI to have a consistent way allocate metadata
>> - do we need to care about vq free, e.g free vqs during reset?
>>
>> Thanks
>>
>>
>>> ---
>>>   drivers/vhost/net.c   |  2 +-
>>>   drivers/vhost/scsi.c  |  4 +--
>>>   drivers/vhost/test.c  |  5 ++--
>>>   drivers/vhost/vdpa.c  |  2 +-
>>>   drivers/vhost/vhost.c | 71 
>>> ++++++++++++++++++++++++++++++++++-----------------
>>>   drivers/vhost/vhost.h |  7 +++--
>>>   drivers/vhost/vsock.c | 11 ++++----
>>>   7 files changed, 66 insertions(+), 36 deletions(-)
>>>
>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>> index fd30b53..fce46f0 100644
>>> --- a/drivers/vhost/net.c
>>> +++ b/drivers/vhost/net.c
>>> @@ -1316,7 +1316,7 @@ static int vhost_net_open(struct inode *inode, 
>>> struct file *f)
>>>           n->vqs[i].rx_ring = NULL;
>>>           vhost_net_buf_init(&n->vqs[i].rxq);
>>>       }
>>> -    if (vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
>>> +    if (vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX, VHOST_NET_VQ_MAX,
>>>                  UIO_MAXIOV + VHOST_NET_BATCH,
>>>                  VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
>>>                  NULL))
>>> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
>>> index ed8baf9..5b3720e 100644
>>> --- a/drivers/vhost/scsi.c
>>> +++ b/drivers/vhost/scsi.c
>>> @@ -1632,8 +1632,8 @@ static int vhost_scsi_open(struct inode 
>>> *inode, struct file *f)
>>>           vqs[i] = &vs->vqs[i].vq;
>>>           vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
>>>       }
>>> -    r = vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ, UIO_MAXIOV,
>>> -               VHOST_SCSI_WEIGHT, 0, true, NULL);
>>> +    r = vhost_dev_init(&vs->dev, vqs, VHOST_SCSI_MAX_VQ, 
>>> VHOST_SCSI_MAX_VQ,
>>> +               UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0, true, NULL);
>>>       if (r)
>>>           goto err_dev_init;
>>> diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
>>> index c255ae5..9d2bfa3 100644
>>> --- a/drivers/vhost/test.c
>>> +++ b/drivers/vhost/test.c
>>> @@ -119,8 +119,9 @@ static int vhost_test_open(struct inode *inode, 
>>> struct file *f)
>>>       dev = &n->dev;
>>>       vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
>>>       n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
>>> -    if (vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
>>> -               VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL)
>>> +    if (vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, VHOST_TEST_VQ_MAX,
>>> +               UIO_MAXIOV, VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT,
>>> +               true, NULL)
>>>           goto err_dev_init;
>>>       f->private_data = n;
>>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>>> index c1615a7..6abd9d8 100644
>>> --- a/drivers/vhost/vdpa.c
>>> +++ b/drivers/vhost/vdpa.c
>>> @@ -829,7 +829,7 @@ static int vhost_vdpa_open(struct inode *inode, 
>>> struct file *filep)
>>>           vqs[i] = &v->vqs[i];
>>>           vqs[i]->handle_kick = handle_vq_kick;
>>>       }
>>> -    r = vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
>>> +    r = vhost_dev_init(dev, vqs, nvqs, nvqs, 0, 0, 0, false,
>>>                  vhost_vdpa_process_iotlb_msg);
>>>       if (r)
>>>           goto err_dev_init;
>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>> index a4a4450..2ca2e71 100644
>>> --- a/drivers/vhost/vhost.c
>>> +++ b/drivers/vhost/vhost.c
>>> @@ -294,7 +294,7 @@ static void vhost_vq_meta_reset(struct vhost_dev 
>>> *d)
>>>   {
>>>       int i;
>>> -    for (i = 0; i < d->nvqs; ++i)
>>> +    for (i = 0; i < d->max_nvqs; ++i)
>>>           __vhost_vq_meta_reset(d->vqs[i]);
>>>   }
>>> @@ -331,6 +331,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>>>       vq->busyloop_timeout = 0;
>>>       vq->umem = NULL;
>>>       vq->iotlb = NULL;
>>> +    vq->initialized = false;
>>>       vhost_vring_call_reset(&vq->call_ctx);
>>>       __vhost_vq_meta_reset(vq);
>>>   }
>>> @@ -411,7 +412,7 @@ static void vhost_dev_free_iovecs(struct 
>>> vhost_dev *dev)
>>>   {
>>>       int i;
>>> -    for (i = 0; i < dev->nvqs; ++i)
>>> +    for (i = 0; i < dev->max_nvqs; ++i)
>>>           vhost_vq_free_iovecs(dev->vqs[i]);
>>>   }
>>> @@ -456,7 +457,7 @@ static size_t vhost_get_desc_size(struct 
>>> vhost_virtqueue *vq,
>>>       return sizeof(*vq->desc) * num;
>>>   }
>>> -static int vhost_vq_init(struct vhost_dev *dev, struct 
>>> vhost_virtqueue *vq)
>>> +static void __vhost_vq_init(struct vhost_dev *dev, struct 
>>> vhost_virtqueue *vq)
>>>   {
>>>       vq->log = NULL;
>>>       vq->indirect = NULL;
>>> @@ -467,12 +468,29 @@ static int vhost_vq_init(struct vhost_dev 
>>> *dev, struct vhost_virtqueue *vq)
>>>       if (vq->handle_kick)
>>>           vhost_poll_init(&vq->poll, vq->handle_kick, EPOLLIN, dev);
>>> +}
>>> +
>>> +static int vhost_vq_init(struct vhost_dev *dev, int vq_idx)
>>> +{
>>> +    struct vhost_virtqueue *vq;
>>> +    int ret;
>>> +
>>> +    if (vq_idx >= dev->max_nvqs)
>>> +        return -ENOBUFS;
>>> +
>>> +    vq = dev->vqs[vq_idx];
>>> +    __vhost_vq_init(dev, vq);
>>> +    ret = vhost_vq_alloc_iovecs(dev, vq);
>>> +    if (ret)
>>> +        return ret;
>>> -    return vhost_vq_alloc_iovecs(dev, vq);
>>> +    vq->initialized = true;
>>> +    dev->nvqs++;
>>> +    return 0;
>>>   }
>>>   int vhost_dev_init(struct vhost_dev *dev,
>>> -           struct vhost_virtqueue **vqs, int nvqs,
>>> +           struct vhost_virtqueue **vqs, int nvqs, int max_nvqs,
>>>              int iov_limit, int weight, int byte_weight,
>>>              bool use_worker,
>>>              int (*msg_handler)(struct vhost_dev *dev,
>>> @@ -481,7 +499,8 @@ int vhost_dev_init(struct vhost_dev *dev,
>>>       int i;
>>>       dev->vqs = vqs;
>>> -    dev->nvqs = nvqs;
>>> +    dev->nvqs = 0;
>>> +    dev->max_nvqs = max_nvqs;
>>>       mutex_init(&dev->mutex);
>>>       dev->log_ctx = NULL;
>>>       dev->umem = NULL;
>>> @@ -499,12 +518,15 @@ int vhost_dev_init(struct vhost_dev *dev,
>>>       INIT_LIST_HEAD(&dev->pending_list);
>>>       spin_lock_init(&dev->iotlb_lock);
>>> -
>>> -    for (i = 0; i < dev->nvqs; ++i) {
>>> -        if (vhost_vq_init(dev, dev->vqs[i]))
>>> +    for (i = 0; i < nvqs; ++i) {
>>> +        if (vhost_vq_init(dev, i))
>>>               goto err_vq_init;
>>>       }
>>> +    for (; i < dev->max_nvqs; ++i)
>>> +        /* Just prep/clear the fields and set initialized=false */
>>> +        __vhost_vq_init(dev, dev->vqs[i]);
>>> +
>>>       return 0;
>>>   err_vq_init:
>>> @@ -652,7 +674,7 @@ void vhost_dev_reset_owner(struct vhost_dev 
>>> *dev, struct vhost_iotlb *umem)
>>>       /* We don't need VQ locks below since vhost_dev_cleanup makes 
>>> sure
>>>        * VQs aren't running.
>>>        */
>>> -    for (i = 0; i < dev->nvqs; ++i)
>>> +    for (i = 0; i < dev->max_nvqs; ++i)
>>>           dev->vqs[i]->umem = umem;
>>>   }
>>>   EXPORT_SYMBOL_GPL(vhost_dev_reset_owner);
>>> @@ -661,7 +683,7 @@ void vhost_dev_stop(struct vhost_dev *dev)
>>>   {
>>>       int i;
>>> -    for (i = 0; i < dev->nvqs; ++i) {
>>> +    for (i = 0; i < dev->max_nvqs; ++i) {
>>>           if (dev->vqs[i]->kick && dev->vqs[i]->handle_kick) {
>>>               vhost_poll_stop(&dev->vqs[i]->poll);
>>>               vhost_poll_flush(&dev->vqs[i]->poll);
>>> @@ -693,7 +715,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>>>   {
>>>       int i;
>>> -    for (i = 0; i < dev->nvqs; ++i) {
>>> +    for (i = 0; i < dev->max_nvqs; ++i) {
>>>           if (dev->vqs[i]->error_ctx)
>>>               eventfd_ctx_put(dev->vqs[i]->error_ctx);
>>>           if (dev->vqs[i]->kick)
>>> @@ -787,7 +809,7 @@ static bool memory_access_ok(struct vhost_dev 
>>> *d, struct vhost_iotlb *umem,
>>>   {
>>>       int i;
>>> -    for (i = 0; i < d->nvqs; ++i) {
>>> +    for (i = 0; i < d->max_nvqs; ++i) {
>>>           bool ok;
>>>           bool log;
>>> @@ -999,14 +1021,14 @@ static inline int vhost_put_used_idx(struct 
>>> vhost_virtqueue *vq)
>>>   static void vhost_dev_lock_vqs(struct vhost_dev *d)
>>>   {
>>>       int i = 0;
>>> -    for (i = 0; i < d->nvqs; ++i)
>>> +    for (i = 0; i < d->max_nvqs; ++i)
>>>           mutex_lock_nested(&d->vqs[i]->mutex, i);
>>>   }
>>>   static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>>>   {
>>>       int i = 0;
>>> -    for (i = 0; i < d->nvqs; ++i)
>>> +    for (i = 0; i < d->max_nvqs; ++i)
>>>           mutex_unlock(&d->vqs[i]->mutex);
>>>   }
>>> @@ -1462,7 +1484,7 @@ static long vhost_set_memory(struct vhost_dev 
>>> *d, struct vhost_memory __user *m)
>>>       d->umem = newumem;
>>>       /* All memory accesses are done under some VQ mutex. */
>>> -    for (i = 0; i < d->nvqs; ++i) {
>>> +    for (i = 0; i < d->max_nvqs; ++i) {
>>>           mutex_lock(&d->vqs[i]->mutex);
>>>           d->vqs[i]->umem = newumem;
>>>           mutex_unlock(&d->vqs[i]->mutex);
>>> @@ -1590,11 +1612,14 @@ long vhost_vring_ioctl(struct vhost_dev *d, 
>>> unsigned int ioctl, void __user *arg
>>>       r = get_user(idx, idxp);
>>>       if (r < 0)
>>>           return r;
>>> -    if (idx >= d->nvqs)
>>> -        return -ENOBUFS;
>>> -    idx = array_index_nospec(idx, d->nvqs);
>>> +    idx = array_index_nospec(idx, d->max_nvqs);
>>>       vq = d->vqs[idx];
>>> +    if (!vq->initialized) {
>>> +        r = vhost_vq_init(d, idx);
>>> +        if (r)
>>> +            return r;
>>> +    }
>>>       if (ioctl == VHOST_SET_VRING_NUM ||
>>>           ioctl == VHOST_SET_VRING_ADDR) {
>>> @@ -1724,7 +1749,7 @@ int vhost_init_device_iotlb(struct vhost_dev 
>>> *d, bool enabled)
>>>       oiotlb = d->iotlb;
>>>       d->iotlb = niotlb;
>>> -    for (i = 0; i < d->nvqs; ++i) {
>>> +    for (i = 0; i < d->max_nvqs; ++i) {
>>>           struct vhost_virtqueue *vq = d->vqs[i];
>>>           mutex_lock(&vq->mutex);
>>> @@ -1771,7 +1796,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, 
>>> unsigned int ioctl, void __user *argp)
>>>               r = -EFAULT;
>>>               break;
>>>           }
>>> -        for (i = 0; i < d->nvqs; ++i) {
>>> +        for (i = 0; i < d->max_nvqs; ++i) {
>>>               struct vhost_virtqueue *vq;
>>>               void __user *base = (void __user *)(unsigned long)p;
>>>               vq = d->vqs[i];
>>> @@ -1794,7 +1819,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, 
>>> unsigned int ioctl, void __user *argp)
>>>               break;
>>>           }
>>>           swap(ctx, d->log_ctx);
>>> -        for (i = 0; i < d->nvqs; ++i) {
>>> +        for (i = 0; i < d->max_nvqs; ++i) {
>>>               mutex_lock(&d->vqs[i]->mutex);
>>>               d->vqs[i]->log_ctx = d->log_ctx;
>>>               mutex_unlock(&d->vqs[i]->mutex);
>>> @@ -2609,7 +2634,7 @@ void vhost_set_backend_features(struct 
>>> vhost_dev *dev, u64 features)
>>>       int i;
>>>       mutex_lock(&dev->mutex);
>>> -    for (i = 0; i < dev->nvqs; ++i) {
>>> +    for (i = 0; i < dev->max_nvqs; ++i) {
>>>           vq = dev->vqs[i];
>>>           mutex_lock(&vq->mutex);
>>>           vq->acked_backend_features = features;
>>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>>> index 9ad34b1..9677870 100644
>>> --- a/drivers/vhost/vhost.h
>>> +++ b/drivers/vhost/vhost.h
>>> @@ -132,6 +132,8 @@ struct vhost_virtqueue {
>>>       bool user_be;
>>>   #endif
>>>       u32 busyloop_timeout;
>>> +
>>> +    bool initialized;
>>>   };
>>>   struct vhost_msg_node {
>>> @@ -148,6 +150,7 @@ struct vhost_dev {
>>>       struct mutex mutex;
>>>       struct vhost_virtqueue **vqs;
>>>       int nvqs;
>>> +    int max_nvqs;
>>>       struct eventfd_ctx *log_ctx;
>>>       struct llist_head work_list;
>>>       struct task_struct *worker;
>>> @@ -168,8 +171,8 @@ struct vhost_dev {
>>>   bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, 
>>> int total_len);
>>>   int vhost_dev_init(struct vhost_dev *dev, struct vhost_virtqueue 
>>> **vqs,
>>> -           int nvqs, int iov_limit, int weight, int byte_weight,
>>> -           bool use_worker,
>>> +           int nvqs, int max_nvqs, int iov_limit, int weight,
>>> +           int byte_weight, bool use_worker,
>>>              int (*msg_handler)(struct vhost_dev *dev,
>>>                         struct vhost_iotlb_msg *msg));
>>>   long vhost_dev_set_owner(struct vhost_dev *dev);
>>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>>> index cae0083..bcdfd58 100644
>>> --- a/drivers/vhost/vsock.c
>>> +++ b/drivers/vhost/vsock.c
>>> @@ -606,7 +606,7 @@ static int vhost_vsock_dev_open(struct inode 
>>> *inode, struct file *file)
>>>   {
>>>       struct vhost_virtqueue **vqs;
>>>       struct vhost_vsock *vsock;
>>> -    int ret;
>>> +    int ret, nvqs;
>>>       /* This struct is large and allocation could fail, fall back 
>>> to vmalloc
>>>        * if there is no other way.
>>> @@ -615,7 +615,8 @@ static int vhost_vsock_dev_open(struct inode 
>>> *inode, struct file *file)
>>>       if (!vsock)
>>>           return -ENOMEM;
>>> -    vqs = kmalloc_array(ARRAY_SIZE(vsock->vqs), sizeof(*vqs), 
>>> GFP_KERNEL);
>>> +    nvqs = ARRAY_SIZE(vsock->vqs);
>>> +    vqs = kmalloc_array(nvqs, sizeof(*vqs), GFP_KERNEL);
>>>       if (!vqs) {
>>>           ret = -ENOMEM;
>>>           goto out;
>>> @@ -630,9 +631,9 @@ static int vhost_vsock_dev_open(struct inode 
>>> *inode, struct file *file)
>>>       vsock->vqs[VSOCK_VQ_TX].handle_kick = vhost_vsock_handle_tx_kick;
>>>       vsock->vqs[VSOCK_VQ_RX].handle_kick = vhost_vsock_handle_rx_kick;
>>> -    ret = vhost_dev_init(&vsock->dev, vqs, ARRAY_SIZE(vsock->vqs),
>>> -                 UIO_MAXIOV, VHOST_VSOCK_PKT_WEIGHT,
>>> -                 VHOST_VSOCK_WEIGHT, true, NULL);
>>> +    ret = vhost_dev_init(&vsock->dev, vqs, nvqs, nvqs, UIO_MAXIOV,
>>> +                 VHOST_VSOCK_PKT_WEIGHT, VHOST_VSOCK_WEIGHT, true,
>>> +                 NULL);
>>>       if (ret)
>>>           goto err_dev_init;
>>
>

Jason Wang Nov. 10, 2020, 2:50 a.m. UTC | #6

On 2020/11/10 上午4:30, Mike Christie wrote:
> On 11/9/20 12:41 PM, Mike Christie wrote:
>>
>> If you want to pair this patchset down to it's bare bug fixes that 
>> prevent users from getting IO errors, I would do ring.addr check only 
>> approach for this bug fix patchset.
>
> Oh yeah, just so we don't have to burn an extra day, above I'm 
> proposing I repost the original patchset:
>
> https://lore.kernel.org/linux-scsi/1600712588-9514-1-git-send-email-michael.christie@oracle.com/t/ 
>
>
> for the bug fix only patches. It will have the compile error fixed and 
> Bart's comment handled.
>
> To even trim it down more I can also drop the last 2 patches:
>
> 0007-vhost-remove-work-arg-from-vhost_work_flush.patch
> 0008-vhost-scsi-remove-extra-flushes.patch
>
> and send separately in a cleanups patchset since the extra flushes it 
> kills don't really hurt.


Fine with me.

Thanks

[06/11] vhost: support delayed vq creation

Commit Message

Comments

Patch