Message ID | 20200415032826.16701-2-fengli@smartx.com (mailing list archive)
---|---
State | New, archived
Series | fix crashes when inject errors to vhost-user-blk chardev
On Wed, Apr 15, 2020 at 11:28:23AM +0800, Li Feng wrote:
>
>      switch (event) {
>      case CHR_EVENT_OPENED:
> @@ -363,7 +376,16 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
>          }
>          break;
>      case CHR_EVENT_CLOSED:
> -        vhost_user_blk_disconnect(dev);
> +        /*
> +         * a close event may happen during a read/write, but vhost
> +         * code assumes the vhost_dev remains setup, so delay the
> +         * stop & clear to idle.
> +         */
> +        ctx = qemu_get_current_aio_context();
> +
> +        qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, NULL,
> +                                 NULL, NULL, NULL, false);
> +        aio_bh_schedule_oneshot(ctx, vhost_user_blk_chr_closed_bh, opaque);

This seems a bit racy. What’s to stop the async operation from executing before
the next read?

If the issue is just that the vhost_dev state is being destroyed too early, why
don’t we rather move the vhost_dev_cleanup() call from vhost_user_blk_disconnect()
to vhost_user_blk_connect()? We may need to add some state to tell if this is the
first connect or a reconnect so we don’t call vhost_dev_cleanup() on initial
connect, but as long as we call vhost_dev_cleanup() before vhost_dev_init()
every time the device reconnects it shouldn’t matter that we keep that state
around.

Thoughts?

>          break;
>      case CHR_EVENT_BREAK:
>      case CHR_EVENT_MUX_IN:
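A minimal sketch of that alternative, for illustration only: never call vhost_dev_cleanup() from the CHR_EVENT_CLOSED path, where vhost.c may still be using the vhost_dev, and instead record that a cleanup is owed and perform it right before vhost_dev_init() on the next (re)connect. The `cleanup_pending` flag and the trimmed-down function bodies are assumptions layered on hw/block/vhost-user-blk.c, not the actual patch.

```c
/* Sketch only: defer vhost_dev_cleanup() from disconnect to reconnect. */
static void vhost_user_blk_disconnect(DeviceState *dev)
{
    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
    VHostUserBlk *s = VHOST_USER_BLK(vdev);

    if (!s->connected) {
        return;
    }
    s->connected = false;

    if (s->dev.started) {
        vhost_user_blk_stop(vdev);
    }

    /* Keep the vhost_dev intact for any in-flight vhost.c read/write. */
    s->cleanup_pending = true;   /* hypothetical field, for illustration */
}

static int vhost_user_blk_connect(DeviceState *dev)
{
    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
    VHostUserBlk *s = VHOST_USER_BLK(vdev);

    if (s->connected) {
        return 0;
    }

    /* Reconnect: the previous vhost_dev is still set up; tear it down now. */
    if (s->cleanup_pending) {
        vhost_dev_cleanup(&s->dev);
        s->cleanup_pending = false;
    }

    s->dev.nvqs = s->num_queues;
    s->dev.vqs = s->vqs;

    if (vhost_dev_init(&s->dev, &s->vhost_user, VHOST_BACKEND_TYPE_USER, 0) < 0) {
        return -1;
    }
    s->connected = true;
    return 0;
}
```

With this shape, vhost_dev_cleanup() always runs before vhost_dev_init() on a reconnect, which is the invariant described above, and nothing needs to be deferred to a bottom half.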
Hi Norwitz,

Thanks for your good suggestion. I got this fix from net/vhost-user.c, which has the same issue as this case.

Your suggestion is a good option. I'm trying to do some work, but there is another crash issue ... I need some time to make your idea workable.

This is the net/vhost-user patch:

commit e7c83a885f865128ae3cf1946f8cb538b63cbfba
Author: Marc-André Lureau <marcandre.lureau@redhat.com>
Date:   Mon Feb 27 14:49:56 2017 +0400

    vhost-user: delay vhost_user_stop

    Since commit b0a335e351103bf92f3f9d0bd5759311be8156ac, a socket write
    may trigger a disconnect events, calling vhost_user_stop() and
    clearing all the vhost_dev strutures holding data that vhost.c
    functions expect to remain valid.

    Delay the cleanup to keep the vhost_dev structure valid during the
    vhost.c functions.

Feng Li

Raphael Norwitz <raphael.norwitz@nutanix.com> wrote on Fri, Apr 17, 2020 at 11:41 AM:
>
> On Wed, Apr 15, 2020 at 11:28:23AM +0800, Li Feng wrote:
> >
> >      switch (event) {
> >      case CHR_EVENT_OPENED:
> > @@ -363,7 +376,16 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
> >          }
> >          break;
> >      case CHR_EVENT_CLOSED:
> > -        vhost_user_blk_disconnect(dev);
> > +        /*
> > +         * a close event may happen during a read/write, but vhost
> > +         * code assumes the vhost_dev remains setup, so delay the
> > +         * stop & clear to idle.
> > +         */
> > +        ctx = qemu_get_current_aio_context();
> > +
> > +        qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, NULL,
> > +                                 NULL, NULL, NULL, false);
> > +        aio_bh_schedule_oneshot(ctx, vhost_user_blk_chr_closed_bh, opaque);
>
> This seems a bit racy. What’s to stop the async operation from executing before
> the next read?
>
> If the issue is just that the vhost_dev state is being destroyed too early, why
> don’t we rather move the vhost_dev_cleanup() call from vhost_user_blk_disconnect()
> to vhost_user_blk_connect()? We may need to add some state to tell if this is the
> first connect or a reconnect so we don’t call vhost_dev_cleanup() on initial
> connect, but as long as we call vhost_dev_cleanup() before vhost_dev_init()
> every time the device reconnects it shouldn’t matter that we keep that state
> around.
>
> Thoughts?
>
> >          break;
> >      case CHR_EVENT_BREAK:
> >      case CHR_EVENT_MUX_IN:
diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 17df5338e7..776b9af3eb 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -349,11 +349,24 @@ static void vhost_user_blk_disconnect(DeviceState *dev)
     vhost_dev_cleanup(&s->dev);
 }
 
+static void vhost_user_blk_event(void *opaque, QEMUChrEvent event);
+
+static void vhost_user_blk_chr_closed_bh(void *opaque)
+{
+    DeviceState *dev = opaque;
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+    vhost_user_blk_disconnect(dev);
+    qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event,
+                             NULL, (void *)dev, NULL, true);
+}
+
 static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
 {
     DeviceState *dev = opaque;
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VHostUserBlk *s = VHOST_USER_BLK(vdev);
+    AioContext *ctx;
 
     switch (event) {
     case CHR_EVENT_OPENED:
@@ -363,7 +376,16 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
         }
         break;
     case CHR_EVENT_CLOSED:
-        vhost_user_blk_disconnect(dev);
+        /*
+         * a close event may happen during a read/write, but vhost
+         * code assumes the vhost_dev remains setup, so delay the
+         * stop & clear to idle.
+         */
+        ctx = qemu_get_current_aio_context();
+
+        qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, NULL,
+                                 NULL, NULL, NULL, false);
+        aio_bh_schedule_oneshot(ctx, vhost_user_blk_chr_closed_bh, opaque);
         break;
     case CHR_EVENT_BREAK:
     case CHR_EVENT_MUX_IN:
Since commit b0a335e351103bf92f3f9d0bd5759311be8156ac, a socket write may
trigger a disconnect event, calling vhost_user_blk_disconnect() and clearing
all the vhost_dev structures. Then the next socket read will encounter an
invalid pointer to vhost_dev.

Signed-off-by: Li Feng <fengli@smartx.com>
---
 hw/block/vhost-user-blk.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)
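To make the failure sequence concrete, here is a simplified illustration of the write-then-read pattern the commit message describes; it is not the actual hw/virtio/vhost-user.c code, and the function name and trimmed message handling are assumptions.

```c
/*
 * Illustration only: a typical vhost-user request is a socket write followed
 * by a socket read, both operating on the same struct vhost_dev.
 */
static int vhost_user_get_u64_example(struct vhost_dev *dev, uint64_t *u64)
{
    VhostUserMsg msg = {
        .hdr.request = VHOST_USER_GET_FEATURES,
    };

    /*
     * If this write hits a dead socket, the chardev can raise
     * CHR_EVENT_CLOSED before returning.  Without the patch above, that
     * event ran vhost_user_blk_disconnect() -> vhost_dev_cleanup(), which
     * zeroes *dev while this function is still executing.
     */
    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
        return -1;
    }

    /* ... so the follow-up read dereferences a torn-down vhost_dev. */
    if (vhost_user_read(dev, &msg) < 0) {
        return -1;
    }

    *u64 = msg.payload.u64;
    return 0;
}
```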