Message ID | 20230804052954.2918915-2-fengli@smartx.com (mailing list archive)
---|---
State | New, archived
Series | Fix vhost reconnect issues
Why can’t we rather fix this by adding an “event_cb” param to vhost_user_async_close() and then call qemu_chr_fe_set_handlers() in vhost_user_async_close_bh()?

Even if calling vhost_dev_cleanup() twice is safe today, I worry future changes may easily stumble over the reconnect case and introduce crashes or double frees.

> On Aug 4, 2023, at 1:29 AM, Li Feng <fengli@smartx.com> wrote:
>
> When vhost-user is reconnecting to the backend and fails at
> get_features in vhost_dev_init(), the reconnect fails and is never
> retriggered.
>
> The reason is: when vhost-user fails at get_features,
> vhost_dev_cleanup() is called immediately.
>
> vhost_dev_cleanup() calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.
>
> The reconnect path is:
> vhost_user_blk_event
>     vhost_user_async_close(.. vhost_user_blk_disconnect ..)
>         qemu_chr_fe_set_handlers <----- clear the notifier callback
>         schedule vhost_user_async_close_bh
>
> vhost->vdev is NULL, so vhost_user_blk_disconnect() is not called,
> and the event fd callback is never reinstalled.
>
> With this patch, vhost_user_blk_disconnect() calls
> vhost_dev_cleanup() again; that is safe.
>
> All vhost-user devices have this issue, including vhost-user-blk/scsi.
>
> Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")
>
> Signed-off-by: Li Feng <fengli@smartx.com>
> ---
> hw/virtio/vhost-user.c | 10 +---------
> 1 file changed, 1 insertion(+), 9 deletions(-)
>
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 8dcf049d42..697b403fe2 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -2648,16 +2648,8 @@ typedef struct {
>  static void vhost_user_async_close_bh(void *opaque)
>  {
>      VhostAsyncCallback *data = opaque;
> -    struct vhost_dev *vhost = data->vhost;
>
> -    /*
> -     * If the vhost_dev has been cleared in the meantime there is
> -     * nothing left to do as some other path has completed the
> -     * cleanup.
> -     */
> -    if (vhost->vdev) {
> -        data->cb(data->dev);
> -    }
> +    data->cb(data->dev);
>
>      g_free(data);
>  }
> --
> 2.41.0
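The stuck-reconnect sequence described in the quoted commit message can be modeled in a few self-contained lines of C. This is a purely illustrative sketch with hypothetical names, not the actual QEMU code:

```c
#include <assert.h>
#include <string.h>

/* Illustrative model of the lost-reconnect bug: vhost_dev_cleanup()
 * memsets the device to zero, so the NULL check in the bottom half
 * skips the disconnect callback, and the handler that would re-arm
 * the event source is never reinstalled. */
struct vhost_like {
    void *vdev;                /* zeroed by cleanup's memset */
};

static int handler_installed;  /* 1 while the CHR event handler is set */

static void disconnect_cb(struct vhost_like *vhost)
{
    (void)vhost;
    handler_installed = 1;     /* stands in for qemu_chr_fe_set_handlers() */
}

static void async_close_bh(struct vhost_like *vhost)
{
    handler_installed = 0;     /* handlers were cleared before the BH ran */
    if (vhost->vdev) {         /* the check the patch removes */
        disconnect_cb(vhost);
    }
}
```

After an early get_features failure zeroes the struct, the bottom half runs with vhost->vdev == NULL and never re-arms the handler, which is the permanently stuck reconnect the patch targets.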
> On Aug 14, 2023, at 8:11 PM, Raphael Norwitz <raphael.norwitz@nutanix.com> wrote:
>
> Why can’t we rather fix this by adding an “event_cb” param to
> vhost_user_async_close and then call qemu_chr_fe_set_handlers in
> vhost_user_async_close_bh()?
>
> Even if calling vhost_dev_cleanup() twice is safe today I worry
> future changes may easily stumble over the reconnect case and
> introduce crashes or double frees.

I don’t think adding a new event_cb is good enough. qemu_chr_fe_set_handlers() has already been called in vhost_user_async_close(), and will be called again in event->cb, so why add a new event_cb?

To avoid calling vhost_dev_cleanup() twice, add an 'inited' flag to struct vhost_dev that marks whether it is initialized, like this:

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e2f6ffb446..edc80c0231 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1502,6 +1502,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
         goto fail_busyloop;
     }

+    hdev->inited = true;
     return 0;

 fail_busyloop:
@@ -1520,6 +1521,10 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
 {
     int i;

+    if (!hdev->inited) {
+        return;
+    }
+    hdev->inited = false;
     trace_vhost_dev_cleanup(hdev);

     for (i = 0; i < hdev->nvqs; ++i) {
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index ca3131b1af..74b1aec960 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -123,6 +123,7 @@ struct vhost_dev {
     /* @started: is the vhost device started? */
     bool started;
     bool log_enabled;
+    bool inited;
     uint64_t log_size;
     Error *migration_blocker;
     const VhostOps *vhost_ops;

Thanks.
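The 'inited' guard proposed above can be illustrated outside QEMU with a minimal sketch. All names here are hypothetical, and the real vhost_dev_cleanup() tears down far more state:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Minimal sketch of the guard-flag idiom: cleanup is a no-op unless
 * init fully succeeded, so calling it twice is harmless. */
struct dev {
    bool inited;
    int *vqs;                  /* resource owned while initialized */
    size_t nvqs;
};

static int dev_init(struct dev *d, size_t nvqs)
{
    d->vqs = calloc(nvqs, sizeof(*d->vqs));
    if (!d->vqs) {
        return -1;
    }
    d->nvqs = nvqs;
    d->inited = true;          /* set only once every step succeeded */
    return 0;
}

static void dev_cleanup(struct dev *d)
{
    if (!d->inited) {
        return;                /* a second call is a harmless no-op */
    }
    d->inited = false;
    free(d->vqs);
    memset(d, 0, sizeof(*d));  /* mirrors vhost_dev_cleanup() */
}
```

Because the flag is set last in init and cleared first in cleanup, a failed init followed by cleanup, or cleanup called from two paths, never double-frees.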
> On Aug 17, 2023, at 2:40 AM, Li Feng <fengli@smartx.com> wrote:
>
> I don’t think adding a new event_cb is good enough.
> qemu_chr_fe_set_handlers() has already been called in
> vhost_user_async_close(), and will be called again in event->cb, so
> why add a new event_cb?

I’m suggesting calling data->event_cb instead of data->cb if we hit the error case where vhost->vdev is NULL. Something like:

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 8dcf049d42..edf1dccd44 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -2648,6 +2648,10 @@ typedef struct {
 static void vhost_user_async_close_bh(void *opaque)
 {
     VhostAsyncCallback *data = opaque;
+
+    VirtIODevice *vdev = VIRTIO_DEVICE(data->dev);
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
     struct vhost_dev *vhost = data->vhost;

     /*
@@ -2657,6 +2661,9 @@ static void vhost_user_async_close_bh(void *opaque)
      */
     if (vhost->vdev) {
         data->cb(data->dev);
+    } else if (data->event_cb) {
+        qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, data->event_cb,
+                                 NULL, data->dev, NULL, true);
     }

     g_free(data);

data->event_cb would be vhost_user_blk_event().

I think that makes the error path a lot easier to reason about and more future proof.

> To avoid calling vhost_dev_cleanup() twice, add an 'inited' flag to
> struct vhost_dev that marks whether it is initialized.

This is better than the original, but let me know what you think of my alternative.
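The dispatch suggested above, the normal close callback while the device is still live and a fallback event callback to re-arm otherwise, can be modeled in isolation. The types and names below are hypothetical, not the QEMU API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the suggested dispatch: the bottom half runs
 * the normal close callback while the vhost device is still live, and
 * otherwise falls back to event_cb so the caller can re-install its
 * chardev event handler. */
typedef void (*vu_cb)(void *dev);

struct async_close_data {
    void *dev;
    void *vdev;        /* NULL when cleanup already zeroed the device */
    vu_cb cb;          /* normal disconnect path */
    vu_cb event_cb;    /* fallback: re-arm the event handler */
};

static int normal_calls, rearm_calls;
static void count_normal(void *dev) { (void)dev; normal_calls++; }
static void count_rearm(void *dev)  { (void)dev; rearm_calls++; }

static void async_close_bh(struct async_close_data *data)
{
    if (data->vdev) {
        data->cb(data->dev);          /* device still live */
    } else if (data->event_cb) {
        data->event_cb(data->dev);    /* error path: re-arm handler */
    }
}
```

Each device type (blk/scsi/gpio) would supply its own pair of callbacks, which is why the pattern need not be tied to VHostUserBlk.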
> On 22 Aug 2023, at 8:38 AM, Raphael Norwitz <raphael.norwitz@nutanix.com> wrote:
>
> I’m suggesting calling data->event_cb instead of data->cb if we hit
> the error case where vhost->vdev is NULL.
> [...]
> data->event_cb would be vhost_user_blk_event().
>
> I think that makes the error path a lot easier to reason about and
> more future proof.
>
> This is better than the original, but let me know what you think of
> my alternative.

vhost_user_async_close_bh() is a common function in vhost-user.c, and vhost_user_async_close() is used by vhost-user-scsi/blk/gpio. However, in your patch it’s limited to VHostUserBlk, so I think my fix is more reasonable.

Thanks,
Li
> On Aug 22, 2023, at 12:49 AM, Li Feng <fengli@smartx.com> wrote:
>
> vhost_user_async_close_bh() is a common function in vhost-user.c, and
> vhost_user_async_close() is used by vhost-user-scsi/blk/gpio.
> However, in your patch it’s limited to VHostUserBlk, so I think my
> fix is more reasonable.

I did not write out the full patch.

vhost-user-scsi/blk/gpio would each provide their own ->event_cb, just like they each provide a different ->cb today. Looking at it again, data has the chardev, so there is no need to use VIRTIO_DEVICE() or VHOST_USER_BLK().

The fix generalizes to all device types.
> On 22 Aug 2023, at 6:17 PM, Raphael Norwitz <raphael.norwitz@nutanix.com> wrote:
>
> I did not write out the full patch.
>
> vhost-user-scsi/blk/gpio would each provide their own ->event_cb,
> just like they each provide a different ->cb today. Looking at it
> again, data has the chardev, so there is no need to use
> VIRTIO_DEVICE() or VHOST_USER_BLK().
>
> The fix generalizes to all device types.

OK, I will change it in V2.