diff mbox series

[v4,1/8] vhost-user.rst: Deprecate [GS]ET_STATUS

Message ID 20231004125904.110781-2-hreitz@redhat.com (mailing list archive)
State New, archived
Headers show
Series vhost-user: Back-end state migration | expand

Commit Message

Hanna Czenczek Oct. 4, 2023, 12:58 p.m. UTC
There is no clearly defined purpose for the virtio status byte in
vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
protocol extension, it is possible for SET_FEATURES to return errors
(SET_PROTOCOL_FEATURES may be called before SET_FEATURES).

As for implementations, SET_STATUS is not widely implemented.  dpdk does
implement it, but only uses it to signal feature negotiation failure.
While it does log reset requests (SET_STATUS 0) as such, it effectively
ignores them, in contrast to RESET_OWNER (which is deprecated, and today
means the same thing as RESET_DEVICE).

While qemu superficially has support for [GS]ET_STATUS, it does not
forward the guest-set status byte, but instead just makes it up
internally, and actually completely ignores what the back-end returns,
only using it as the template for a subsequent SET_STATUS to add single
bits to it.  Notably, after setting FEATURES_OK, it never reads it back
to see whether the flag is still set, which is the only way in which
dpdk uses the status byte.

As-is, no front-end or back-end can rely on the other side handling this
field in a useful manner, and it also provides no practical use over
other mechanisms the vhost-user protocol has, which are more clearly
defined.  Deprecate it.

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

Comments

Stefan Hajnoczi Oct. 5, 2023, 5:08 p.m. UTC | #1
On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> There is no clearly defined purpose for the virtio status byte in
> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
> protocol extension, it is possible for SET_FEATURES to return errors
> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
> 
> As for implementations, SET_STATUS is not widely implemented.  dpdk does
> implement it, but only uses it to signal feature negotiation failure.
> While it does log reset requests (SET_STATUS 0) as such, it effectively
> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
> means the same thing as RESET_DEVICE).
> 
> While qemu superficially has support for [GS]ET_STATUS, it does not
> forward the guest-set status byte, but instead just makes it up
> internally, and actually completely ignores what the back-end returns,
> only using it as the template for a subsequent SET_STATUS to add single
> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
> to see whether the flag is still set, which is the only way in which
> dpdk uses the status byte.
> 
> As-is, no front-end or back-end can rely on the other side handling this
> field in a useful manner, and it also provides no practical use over
> other mechanisms the vhost-user protocol has, which are more clearly
> defined.  Deprecate it.
> 
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Hanna Czenczek Oct. 6, 2023, 7:48 a.m. UTC | #2
On 05.10.23 19:15, Michael S. Tsirkin wrote:
> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>> There is no clearly defined purpose for the virtio status byte in
>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
>>> protocol extension, it is possible for SET_FEATURES to return errors
>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>
>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
>>> implement it, but only uses it to signal feature negotiation failure.
>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
>>> means the same thing as RESET_DEVICE).
>>>
>>> While qemu superficially has support for [GS]ET_STATUS, it does not
>>> forward the guest-set status byte, but instead just makes it up
>>> internally, and actually completely ignores what the back-end returns,
>>> only using it as the template for a subsequent SET_STATUS to add single
>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
>>> to see whether the flag is still set, which is the only way in which
>>> dpdk uses the status byte.
>>>
>>> As-is, no front-end or back-end can rely on the other side handling this
>>> field in a useful manner, and it also provides no practical use over
>>> other mechanisms the vhost-user protocol has, which are more clearly
>>> defined.  Deprecate it.
>>>
>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>> ---
>>>   docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>>>   1 file changed, 21 insertions(+), 7 deletions(-)
>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>
> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
> The fact current backends never check errors does not mean they never
> will. So no, not applying this.

Can this not be done with REPLY_ACK?  I.e., with the following message 
order:

1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is 
present
2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
4. SET_FEATURES with need_reply

If not, the problem is that qemu has sent SET_STATUS 0 for a while when 
the vCPUs are stopped, which generally seems to request a device reset.  
If we don’t state at least that SET_STATUS 0 is to be ignored, back-ends 
that will implement SET_STATUS later may break with at least these qemu 
versions.  But documenting that a particular use of the status byte is 
to be ignored would be really strange.

Hanna
Michael S. Tsirkin Oct. 6, 2023, 8:45 a.m. UTC | #3
On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
> On 05.10.23 19:15, Michael S. Tsirkin wrote:
> > On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
> > > On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> > > > There is no clearly defined purpose for the virtio status byte in
> > > > vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
> > > > feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
> > > > protocol extension, it is possible for SET_FEATURES to return errors
> > > > (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
> > > > 
> > > > As for implementations, SET_STATUS is not widely implemented.  dpdk does
> > > > implement it, but only uses it to signal feature negotiation failure.
> > > > While it does log reset requests (SET_STATUS 0) as such, it effectively
> > > > ignores them, in contrast to RESET_OWNER (which is deprecated, and today
> > > > means the same thing as RESET_DEVICE).
> > > > 
> > > > While qemu superficially has support for [GS]ET_STATUS, it does not
> > > > forward the guest-set status byte, but instead just makes it up
> > > > internally, and actually completely ignores what the back-end returns,
> > > > only using it as the template for a subsequent SET_STATUS to add single
> > > > bits to it.  Notably, after setting FEATURES_OK, it never reads it back
> > > > to see whether the flag is still set, which is the only way in which
> > > > dpdk uses the status byte.
> > > > 
> > > > As-is, no front-end or back-end can rely on the other side handling this
> > > > field in a useful manner, and it also provides no practical use over
> > > > other mechanisms the vhost-user protocol has, which are more clearly
> > > > defined.  Deprecate it.
> > > > 
> > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > ---
> > > >   docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
> > > >   1 file changed, 21 insertions(+), 7 deletions(-)
> > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > 
> > SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
> > The fact current backends never check errors does not mean they never
> > will. So no, not applying this.
> 
> Can this not be done with REPLY_ACK?  I.e., with the following message
> order:
> 
> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
> present
> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
> 4. SET_FEATURES with need_reply
> 
> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
> vCPUs are stopped, which generally seems to request a device reset.  If we
> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
> implement SET_STATUS later may break with at least these qemu versions.  But
> documenting that a particular use of the status byte is to be ignored would
> be really strange.
> 
> Hanna

Hmm I guess. Though just following virtio spec seems cleaner to me...
vhost-user reconfigures the state fully on start. I guess symmetry was the
point. So I don't see why SET_STATUS 0 has to be ignored.


SET_STATUS was introduced by:

commit 923b8921d210763359e96246a58658ac0db6c645
Author: Yajun Wu <yajunw@nvidia.com>
Date:   Mon Oct 17 14:44:52 2022 +0800

    vhost-user: Support vhost_dev_start

CC the author.
Hanna Czenczek Oct. 6, 2023, 9:15 a.m. UTC | #4
On 06.10.23 10:45, Michael S. Tsirkin wrote:
> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>>>> There is no clearly defined purpose for the virtio status byte in
>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
>>>>> protocol extension, it is possible for SET_FEATURES to return errors
>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>>>
>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
>>>>> implement it, but only uses it to signal feature negotiation failure.
>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
>>>>> means the same thing as RESET_DEVICE).
>>>>>
>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
>>>>> forward the guest-set status byte, but instead just makes it up
>>>>> internally, and actually completely ignores what the back-end returns,
>>>>> only using it as the template for a subsequent SET_STATUS to add single
>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
>>>>> to see whether the flag is still set, which is the only way in which
>>>>> dpdk uses the status byte.
>>>>>
>>>>> As-is, no front-end or back-end can rely on the other side handling this
>>>>> field in a useful manner, and it also provides no practical use over
>>>>> other mechanisms the vhost-user protocol has, which are more clearly
>>>>> defined.  Deprecate it.
>>>>>
>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>> ---
>>>>>    docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>>>>>    1 file changed, 21 insertions(+), 7 deletions(-)
>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
>>> The fact current backends never check errors does not mean they never
>>> will. So no, not applying this.
>> Can this not be done with REPLY_ACK?  I.e., with the following message
>> order:
>>
>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
>> present
>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
>> 4. SET_FEATURES with need_reply
>>
>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
>> vCPUs are stopped, which generally seems to request a device reset.  If we
>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
>> implement SET_STATUS later may break with at least these qemu versions.  But
>> documenting that a particular use of the status byte is to be ignored would
>> be really strange.
>>
>> Hanna
> Hmm I guess. Though just following virtio spec seems cleaner to me...
> vhost-user reconfigures the state fully on start.

Not the internal device state, though.  virtiofsd has internal state, 
and other devices like vhost-gpu back-ends would probably, too.

Stefan has recently sent a series 
(https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) 
to put the reset (RESET_DEVICE) into virtio_reset() (when we really need 
a reset).

I really don’t like our current approach with the status byte. Following 
the virtio specification to me would mean that the guest directly 
controls this byte, which it does not.  qemu makes up values as it deems 
appropriate, and this includes sending a SET_STATUS 0 when the guest is 
just paused, i.e. when the guest really doesn’t want a device reset.

That means that qemu does not treat this as a virtio device field 
(because that would mean exposing it to the guest driver), but instead 
treats it as part of the vhost(-user) protocol.  It doesn’t feel right 
to me that we use a virtio-defined feature for communication on the 
vhost level, i.e. between front-end and back-end, and not between guest 
driver and device.  I think all vhost-level protocol features should be 
fully defined in the vhost-user specification, which REPLY_ACK is.

Now, we could hand full control of the status byte to the guest, and 
that would make me content.  But I feel like that doesn’t really work, 
because qemu needs to intercept the status byte anyway (it needs to know 
when there is a reset, probably wants to know when the device is 
configured, etc.), so I don’t think having the status byte in vhost-user 
really gains us much when qemu could translate status byte changes 
to/from other vhost-user commands.

Hanna

> I guess symmetry was the
> point. So I don't see why SET_STATUS 0 has to be ignored.
>
>
> SET_STATUS was introduced by:
>
> commit 923b8921d210763359e96246a58658ac0db6c645
> Author: Yajun Wu <yajunw@nvidia.com>
> Date:   Mon Oct 17 14:44:52 2022 +0800
>
>      vhost-user: Support vhost_dev_start
>
> CC the author.
>
Michael S. Tsirkin Oct. 6, 2023, 9:26 a.m. UTC | #5
On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
> On 06.10.23 10:45, Michael S. Tsirkin wrote:
> > On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
> > > On 05.10.23 19:15, Michael S. Tsirkin wrote:
> > > > On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
> > > > > On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> > > > > > There is no clearly defined purpose for the virtio status byte in
> > > > > > vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
> > > > > > feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
> > > > > > protocol extension, it is possible for SET_FEATURES to return errors
> > > > > > (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
> > > > > > 
> > > > > > As for implementations, SET_STATUS is not widely implemented.  dpdk does
> > > > > > implement it, but only uses it to signal feature negotiation failure.
> > > > > > While it does log reset requests (SET_STATUS 0) as such, it effectively
> > > > > > ignores them, in contrast to RESET_OWNER (which is deprecated, and today
> > > > > > means the same thing as RESET_DEVICE).
> > > > > > 
> > > > > > While qemu superficially has support for [GS]ET_STATUS, it does not
> > > > > > forward the guest-set status byte, but instead just makes it up
> > > > > > internally, and actually completely ignores what the back-end returns,
> > > > > > only using it as the template for a subsequent SET_STATUS to add single
> > > > > > bits to it.  Notably, after setting FEATURES_OK, it never reads it back
> > > > > > to see whether the flag is still set, which is the only way in which
> > > > > > dpdk uses the status byte.
> > > > > > 
> > > > > > As-is, no front-end or back-end can rely on the other side handling this
> > > > > > field in a useful manner, and it also provides no practical use over
> > > > > > other mechanisms the vhost-user protocol has, which are more clearly
> > > > > > defined.  Deprecate it.
> > > > > > 
> > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > ---
> > > > > >    docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
> > > > > >    1 file changed, 21 insertions(+), 7 deletions(-)
> > > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
> > > > The fact current backends never check errors does not mean they never
> > > > will. So no, not applying this.
> > > Can this not be done with REPLY_ACK?  I.e., with the following message
> > > order:
> > > 
> > > 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
> > > present
> > > 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
> > > 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
> > > 4. SET_FEATURES with need_reply
> > > 
> > > If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
> > > vCPUs are stopped, which generally seems to request a device reset.  If we
> > > don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
> > > implement SET_STATUS later may break with at least these qemu versions.  But
> > > documenting that a particular use of the status byte is to be ignored would
> > > be really strange.
> > > 
> > > Hanna
> > Hmm I guess. Though just following virtio spec seems cleaner to me...
> > vhost-user reconfigures the state fully on start.
> 
> Not the internal device state, though.  virtiofsd has internal state, and
> other devices like vhost-gpu back-ends would probably, too.
> 
> Stefan has recently sent a series
> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
> reset).
> 
> I really don’t like our current approach with the status byte. Following the
> virtio specification to me would mean that the guest directly controls this
> byte, which it does not.  qemu makes up values as it deems appropriate, and
> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
> when the guest really doesn’t want a device reset.
> 
> That means that qemu does not treat this as a virtio device field (because
> that would mean exposing it to the guest driver), but instead treats it as
> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
> a virtio-defined feature for communication on the vhost level, i.e. between
> front-end and back-end, and not between guest driver and device.  I think
> all vhost-level protocol features should be fully defined in the vhost-user
> specification, which REPLY_ACK is.

Hmm that makes sense. Maybe we should have done what stefan's patch
is doing.

Do look at the original commit that introduced it to understand why
it was added.

> Now, we could hand full control of the status byte to the guest, and that
> would make me content.  But I feel like that doesn’t really work, because
> qemu needs to intercept the status byte anyway (it needs to know when there
> is a reset, probably wants to know when the device is configured, etc.), so
> I don’t think having the status byte in vhost-user really gains us much when
> qemu could translate status byte changes to/from other vhost-user commands.
> 
> Hanna

well it intercepts it but I think it could pass it on unchanged.


> > I guess symmetry was the
> > point. So I don't see why SET_STATUS 0 has to be ignored.
> > 
> > 
> > SET_STATUS was introduced by:
> > 
> > commit 923b8921d210763359e96246a58658ac0db6c645
> > Author: Yajun Wu <yajunw@nvidia.com>
> > Date:   Mon Oct 17 14:44:52 2022 +0800
> > 
> >      vhost-user: Support vhost_dev_start
> > 
> > CC the author.
> >
Hanna Czenczek Oct. 6, 2023, 9:47 a.m. UTC | #6
On 06.10.23 11:26, Michael S. Tsirkin wrote:
> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>>>>>> There is no clearly defined purpose for the virtio status byte in
>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
>>>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors
>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>>>>>
>>>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
>>>>>>> implement it, but only uses it to signal feature negotiation failure.
>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
>>>>>>> means the same thing as RESET_DEVICE).
>>>>>>>
>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
>>>>>>> forward the guest-set status byte, but instead just makes it up
>>>>>>> internally, and actually completely ignores what the back-end returns,
>>>>>>> only using it as the template for a subsequent SET_STATUS to add single
>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
>>>>>>> to see whether the flag is still set, which is the only way in which
>>>>>>> dpdk uses the status byte.
>>>>>>>
>>>>>>> As-is, no front-end or back-end can rely on the other side handling this
>>>>>>> field in a useful manner, and it also provides no practical use over
>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly
>>>>>>> defined.  Deprecate it.
>>>>>>>
>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>> ---
>>>>>>>     docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>>>>>>>     1 file changed, 21 insertions(+), 7 deletions(-)
>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
>>>>> The fact current backends never check errors does not mean they never
>>>>> will. So no, not applying this.
>>>> Can this not be done with REPLY_ACK?  I.e., with the following message
>>>> order:
>>>>
>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
>>>> present
>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>> 4. SET_FEATURES with need_reply
>>>>
>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
>>>> vCPUs are stopped, which generally seems to request a device reset.  If we
>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
>>>> implement SET_STATUS later may break with at least these qemu versions.  But
>>>> documenting that a particular use of the status byte is to be ignored would
>>>> be really strange.
>>>>
>>>> Hanna
>>> Hmm I guess. Though just following virtio spec seems cleaner to me...
>>> vhost-user reconfigures the state fully on start.
>> Not the internal device state, though.  virtiofsd has internal state, and
>> other devices like vhost-gpu back-ends would probably, too.
>>
>> Stefan has recently sent a series
>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
>> reset).
>>
>> I really don’t like our current approach with the status byte. Following the
>> virtio specification to me would mean that the guest directly controls this
>> byte, which it does not.  qemu makes up values as it deems appropriate, and
>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
>> when the guest really doesn’t want a device reset.
>>
>> That means that qemu does not treat this as a virtio device field (because
>> that would mean exposing it to the guest driver), but instead treats it as
>> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
>> a virtio-defined feature for communication on the vhost level, i.e. between
>> front-end and back-end, and not between guest driver and device.  I think
>> all vhost-level protocol features should be fully defined in the vhost-user
>> specification, which REPLY_ACK is.
> Hmm that makes sense. Maybe we should have done what stefan's patch
> is doing.
>
> Do look at the original commit that introduced it to understand why
> it was added.

I don’t understand why this was added to the stop/cont code, though.  If 
it is time consuming to make these changes, why are they done every time 
the VM is paused
and resumed?  It makes sense that this would be done for the initial 
configuration (where a reset also wouldn’t hurt), but here it seems wrong.

(To be clear, a reset in the stop/cont code is wrong, because it breaks 
stateful devices.)

Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as 
originally introduced was wrong even for non-stateful devices, because 
it occurred before we fetched the state (vring indices) so we could 
restore it later.  I don’t know how 923b8921d21 was tested, but if the 
back-end used for testing implemented SET_STATUS 0 as a reset, it could 
not have survived either migration or a stop/cont in general, because 
the vring indices would have been reset to 0.

What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke 
all devices that would implement them as per virtio spec, and even today 
it’s broken for stateful devices.  The mentioned performance issue is 
likely real, but we can’t address it by making up SET_STATUS calls that 
are wrong.

I concede that I didn’t think about DRIVER_OK.  Personally, I would do 
all final configuration that would happen upon a DRIVER_OK once the 
first vring is started (i.e. receives a kick).  That has the added 
benefit of being asynchronous because it doesn’t block any vhost-user 
messages (which are synchronous, and thus block downtime).

Hanna

>> Now, we could hand full control of the status byte to the guest, and that
>> would make me content.  But I feel like that doesn’t really work, because
>> qemu needs to intercept the status byte anyway (it needs to know when there
>> is a reset, probably wants to know when the device is configured, etc.), so
>> I don’t think having the status byte in vhost-user really gains us much when
>> qemu could translate status byte changes to/from other vhost-user commands.
>>
>> Hanna
> well it intercepts it but I think it could pass it on unchanged.
>
>
>>> I guess symmetry was the
>>> point. So I don't see why SET_STATUS 0 has to be ignored.
>>>
>>>
>>> SET_STATUS was introduced by:
>>>
>>> commit 923b8921d210763359e96246a58658ac0db6c645
>>> Author: Yajun Wu <yajunw@nvidia.com>
>>> Date:   Mon Oct 17 14:44:52 2022 +0800
>>>
>>>       vhost-user: Support vhost_dev_start
>>>
>>> CC the author.
>>>
Michael S. Tsirkin Oct. 6, 2023, 10:34 a.m. UTC | #7
On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
> On 06.10.23 11:26, Michael S. Tsirkin wrote:
> > On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
> > > On 06.10.23 10:45, Michael S. Tsirkin wrote:
> > > > On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
> > > > > On 05.10.23 19:15, Michael S. Tsirkin wrote:
> > > > > > On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
> > > > > > > On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> > > > > > > > There is no clearly defined purpose for the virtio status byte in
> > > > > > > > vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
> > > > > > > > feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
> > > > > > > > protocol extension, it is possible for SET_FEATURES to return errors
> > > > > > > > (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
> > > > > > > > 
> > > > > > > > As for implementations, SET_STATUS is not widely implemented.  dpdk does
> > > > > > > > implement it, but only uses it to signal feature negotiation failure.
> > > > > > > > While it does log reset requests (SET_STATUS 0) as such, it effectively
> > > > > > > > ignores them, in contrast to RESET_OWNER (which is deprecated, and today
> > > > > > > > means the same thing as RESET_DEVICE).
> > > > > > > > 
> > > > > > > > While qemu superficially has support for [GS]ET_STATUS, it does not
> > > > > > > > forward the guest-set status byte, but instead just makes it up
> > > > > > > > internally, and actually completely ignores what the back-end returns,
> > > > > > > > only using it as the template for a subsequent SET_STATUS to add single
> > > > > > > > bits to it.  Notably, after setting FEATURES_OK, it never reads it back
> > > > > > > > to see whether the flag is still set, which is the only way in which
> > > > > > > > dpdk uses the status byte.
> > > > > > > > 
> > > > > > > > As-is, no front-end or back-end can rely on the other side handling this
> > > > > > > > field in a useful manner, and it also provides no practical use over
> > > > > > > > other mechanisms the vhost-user protocol has, which are more clearly
> > > > > > > > defined.  Deprecate it.
> > > > > > > > 
> > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > ---
> > > > > > > >     docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
> > > > > > > >     1 file changed, 21 insertions(+), 7 deletions(-)
> > > > > > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
> > > > > > The fact current backends never check errors does not mean they never
> > > > > > will. So no, not applying this.
> > > > > Can this not be done with REPLY_ACK?  I.e., with the following message
> > > > > order:
> > > > > 
> > > > > 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
> > > > > present
> > > > > 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
> > > > > 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
> > > > > 4. SET_FEATURES with need_reply
> > > > > 
> > > > > If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
> > > > > vCPUs are stopped, which generally seems to request a device reset.  If we
> > > > > don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
> > > > > implement SET_STATUS later may break with at least these qemu versions.  But
> > > > > documenting that a particular use of the status byte is to be ignored would
> > > > > be really strange.
> > > > > 
> > > > > Hanna
> > > > Hmm I guess. Though just following virtio spec seems cleaner to me...
> > > > vhost-user reconfigures the state fully on start.
> > > Not the internal device state, though.  virtiofsd has internal state, and
> > > other devices like vhost-gpu back-ends would probably, too.
> > > 
> > > Stefan has recently sent a series
> > > (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
> > > put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
> > > reset).
> > > 
> > > I really don’t like our current approach with the status byte. Following the
> > > virtio specification to me would mean that the guest directly controls this
> > > byte, which it does not.  qemu makes up values as it deems appropriate, and
> > > this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
> > > when the guest really doesn’t want a device reset.
> > > 
> > > That means that qemu does not treat this as a virtio device field (because
> > > that would mean exposing it to the guest driver), but instead treats it as
> > > part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
> > > a virtio-defined feature for communication on the vhost level, i.e. between
> > > front-end and back-end, and not between guest driver and device.  I think
> > > all vhost-level protocol features should be fully defined in the vhost-user
> > > specification, which REPLY_ACK is.
> > Hmm that makes sense. Maybe we should have done what stefan's patch
> > is doing.
> > 
> > Do look at the original commit that introduced it to understand why
> > it was added.
> 
> I don’t understand why this was added to the stop/cont code, though.  If it
> is time consuming to make these changes, why are they done every time the VM
> is paused
> and resumed?  It makes sense that this would be done for the initial
> configuration (where a reset also wouldn’t hurt), but here it seems wrong.
> 
> (To be clear, a reset in the stop/cont code is wrong, because it breaks
> stateful devices.)
> 
> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
> originally introduced was wrong even for non-stateful devices, because it
> occurred before we fetched the state (vring indices) so we could restore it
> later.  I don’t know how 923b8921d21 was tested, but if the back-end used
> for testing implemented SET_STATUS 0 as a reset, it could not have survived
> either migration or a stop/cont in general, because the vring indices would
> have been reset to 0.
> 
> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
> devices that would implement them as per virtio spec, and even today it’s
> broken for stateful devices.  The mentioned performance issue is likely
> real, but we can’t address it by making up SET_STATUS calls that are wrong.
> 
> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
> final configuration that would happen upon a DRIVER_OK once the first vring
> is started (i.e. receives a kick).  That has the added benefit of being
> asynchronous because it doesn’t block any vhost-user messages (which are
> synchronous, and thus block downtime).
> 
> Hanna


For better or worse kick is per ring. It's out of spec to start rings
that were not kicked but I guess you could do configuration ...
Seems somewhat asymmetrical though.

Let's wait until next week, hopefully Yajun Wu will answer.

> > > Now, we could hand full control of the status byte to the guest, and that
> > > would make me content.  But I feel like that doesn’t really work, because
> > > qemu needs to intercept the status byte anyway (it needs to know when there
> > > is a reset, probably wants to know when the device is configured, etc.), so
> > > I don’t think having the status byte in vhost-user really gains us much when
> > > qemu could translate status byte changes to/from other vhost-user commands.
> > > 
> > > Hanna
> > well it intercepts it but I think it could pass it on unchanged.
> > 
> > 
> > > > I guess symmetry was the
> > > > point. So I don't see why SET_STATUS 0 has to be ignored.
> > > > 
> > > > 
> > > > SET_STATUS was introduced by:
> > > > 
> > > > commit 923b8921d210763359e96246a58658ac0db6c645
> > > > Author: Yajun Wu <yajunw@nvidia.com>
> > > > Date:   Mon Oct 17 14:44:52 2022 +0800
> > > > 
> > > >       vhost-user: Support vhost_dev_start
> > > > 
> > > > CC the author.
> > > >
Hanna Czenczek Oct. 6, 2023, 11:42 a.m. UTC | #8
On 06.10.23 12:34, Michael S. Tsirkin wrote:
> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>>>>>>>> There is no clearly defined purpose for the virtio status byte in
>>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
>>>>>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
>>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors
>>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>>>>>>>
>>>>>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
>>>>>>>>> implement it, but only uses it to signal feature negotiation failure.
>>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
>>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
>>>>>>>>> means the same thing as RESET_DEVICE).
>>>>>>>>>
>>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
>>>>>>>>> forward the guest-set status byte, but instead just makes it up
>>>>>>>>> internally, and actually completely ignores what the back-end returns,
>>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single
>>>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
>>>>>>>>> to see whether the flag is still set, which is the only way in which
>>>>>>>>> dpdk uses the status byte.
>>>>>>>>>
>>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this
>>>>>>>>> field in a useful manner, and it also provides no practical use over
>>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly
>>>>>>>>> defined.  Deprecate it.
>>>>>>>>>
>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>>> ---
>>>>>>>>>      docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>>>>>>>>>      1 file changed, 21 insertions(+), 7 deletions(-)
>>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
>>>>>>> The fact current backends never check errors does not mean they never
>>>>>>> will. So no, not applying this.
>>>>>> Can this not be done with REPLY_ACK?  I.e., with the following message
>>>>>> order:
>>>>>>
>>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
>>>>>> present
>>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>> 4. SET_FEATURES with need_reply
>>>>>>
>>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
>>>>>> vCPUs are stopped, which generally seems to request a device reset.  If we
>>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
>>>>>> implement SET_STATUS later may break with at least these qemu versions.  But
>>>>>> documenting that a particular use of the status byte is to be ignored would
>>>>>> be really strange.
>>>>>>
>>>>>> Hanna
>>>>> Hmm I guess. Though just following virtio spec seems cleaner to me...
>>>>> vhost-user reconfigures the state fully on start.
>>>> Not the internal device state, though.  virtiofsd has internal state, and
>>>> other devices like vhost-gpu back-ends would probably, too.
>>>>
>>>> Stefan has recently sent a series
>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
>>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
>>>> reset).
>>>>
>>>> I really don’t like our current approach with the status byte. Following the
>>>> virtio specification to me would mean that the guest directly controls this
>>>> byte, which it does not.  qemu makes up values as it deems appropriate, and
>>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
>>>> when the guest really doesn’t want a device reset.
>>>>
>>>> That means that qemu does not treat this as a virtio device field (because
>>>> that would mean exposing it to the guest driver), but instead treats it as
>>>> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
>>>> a virtio-defined feature for communication on the vhost level, i.e. between
>>>> front-end and back-end, and not between guest driver and device.  I think
>>>> all vhost-level protocol features should be fully defined in the vhost-user
>>>> specification, which REPLY_ACK is.
>>> Hmm that makes sense. Maybe we should have done what stefan's patch
>>> is doing.
>>>
>>> Do look at the original commit that introduced it to understand why
>>> it was added.
>> I don’t understand why this was added to the stop/cont code, though.  If it
>> is time consuming to make these changes, why are they done every time the VM
>> is paused
>> and resumed?  It makes sense that this would be done for the initial
>> configuration (where a reset also wouldn’t hurt), but here it seems wrong.
>>
>> (To be clear, a reset in the stop/cont code is wrong, because it breaks
>> stateful devices.)
>>
>> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
>> originally introduced was wrong even for non-stateful devices, because it
>> occurred before we fetched the state (vring indices) so we could restore it
>> later.  I don’t know how 923b8921d21 was tested, but if the back-end used
>> for testing implemented SET_STATUS 0 as a reset, it could not have survived
>> either migration or a stop/cont in general, because the vring indices would
>> have been reset to 0.
>>
>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>> devices that would implement them as per virtio spec, and even today it’s
>> broken for stateful devices.  The mentioned performance issue is likely
>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>
>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>> final configuration that would happen upon a DRIVER_OK once the first vring
>> is started (i.e. receives a kick).  That has the added benefit of being
>> asynchronous because it doesn’t block any vhost-user messages (which are
>> synchronous, and thus block downtime).
>>
>> Hanna
>
> For better or worse kick is per ring. It's out of spec to start rings
> that were not kicked but I guess you could do configuration ...
> Seems somewhat asymmetrical though.

I meant to take the first ring being started as the signal to do the 
global configuration, i.e. not do this once per vring, but once globally.

> Let's wait until next week, hopefully Yajun Wu will answer.

I mean, personally I don’t really care about the whole SET_STATUS 
thing.  It’s clear that it’s broken for stateful devices.  The fact that 
it took until 6f8be29ec17d to fix it for just any device that would 
implement it according to spec to me is a strong indication that nobody 
does implement it according to spec, and is currently only used to 
signal to some specific back-end that all rings have been set up and 
should be configured in a single block.

(By the way, our SET_STATUS call that adds ACKNOWLEDGE | DRIVER | 
DRIVER_OK is also completely against the spec, and any well-behaving 
device should reject it.  These flags must be set one after another, and 
specifically, features must be read and set after setting DRIVER, but 
before setting FEATURES_OK, and FEATURES_OK must be set before setting 
DRIVER_OK.  Any well-behaving device should error out when DRIVER_OK is 
set without FEATURES_OK set, or when FEATURES_OK is set without 
ACKNOWLEDGE | DRIVER set.)

I can just drop this patch from the migration series, because in my 
opinion it doesn’t affect it whatsoever (although I understood Stefan 
disagrees).  But honestly, I think any vhost-user back-end developer is 
well-advised to completely ignore the status byte.  Not ignoring it 
means relying on qemu’s implementation-defined behavior.

Hanna
Alex Bennée Oct. 6, 2023, 3:17 p.m. UTC | #9
Hanna Czenczek <hreitz@redhat.com> writes:

> On 06.10.23 12:34, Michael S. Tsirkin wrote:
>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
<snip>
>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>>> devices that would implement them as per virtio spec, and even today it’s
>>> broken for stateful devices.  The mentioned performance issue is likely
>>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>>
>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>>> final configuration that would happen upon a DRIVER_OK once the first vring
>>> is started (i.e. receives a kick).  That has the added benefit of being
>>> asynchronous because it doesn’t block any vhost-user messages (which are
>>> synchronous, and thus block downtime).
>>>
>>> Hanna
>>
>> For better or worse kick is per ring. It's out of spec to start rings
>> that were not kicked but I guess you could do configuration ...
>> Seems somewhat asymmetrical though.
>
> I meant to take the first ring being started as the signal to do the
> global configuration, i.e. not do this once per vring, but once
> globally.
>
>> Let's wait until next week, hopefully Yajun Wu will answer.
>
> I mean, personally I don’t really care about the whole SET_STATUS
> thing.  It’s clear that it’s broken for stateful devices.  The fact
> that it took until 6f8be29ec17d to fix it for just any device that
> would implement it according to spec to me is a strong indication that
> nobody does implement it according to spec, and is currently only used
> to signal to some specific back-end that all rings have been set up
> and should be configured in a single block.

I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT
extensions where everything is off-loaded to the vhost-user backend.
Hanna Czenczek Oct. 6, 2023, 3:47 p.m. UTC | #10
On 06.10.23 17:17, Alex Bennée wrote:
> Hanna Czenczek <hreitz@redhat.com> writes:
>
>> On 06.10.23 12:34, Michael S. Tsirkin wrote:
>>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> <snip>
>>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>>>> devices that would implement them as per virtio spec, and even today it’s
>>>> broken for stateful devices.  The mentioned performance issue is likely
>>>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>>>
>>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>>>> final configuration that would happen upon a DRIVER_OK once the first vring
>>>> is started (i.e. receives a kick).  That has the added benefit of being
>>>> asynchronous because it doesn’t block any vhost-user messages (which are
>>>> synchronous, and thus block downtime).
>>>>
>>>> Hanna
>>> For better or worse kick is per ring. It's out of spec to start rings
>>> that were not kicked but I guess you could do configuration ...
>>> Seems somewhat asymmetrical though.
>> I meant to take the first ring being started as the signal to do the
>> global configuration, i.e. not do this once per vring, but once
>> globally.
>>
>>> Let's wait until next week, hopefully Yajun Wu will answer.
>> I mean, personally I don’t really care about the whole SET_STATUS
>> thing.  It’s clear that it’s broken for stateful devices.  The fact
>> that it took until 6f8be29ec17d to fix it for just any device that
>> would implement it according to spec to me is a strong indication that
>> nobody does implement it according to spec, and is currently only used
>> to signal to some specific back-end that all rings have been set up
>> and should be configured in a single block.
> I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT
> extensions where everything is off-loaded to the vhost-user backend.

How do these back-ends work with the fact that qemu uses SET_STATUS 
incorrectly when not offloading?  Do you plan on fixing that?

(I.e. that we send SET_STATUS 0 when the VM is paused, potentially 
resetting state that is not recoverable, and that we set DRIVER and 
DRIVER_OK simultaneously.)

Hanna
Alex Bennée Oct. 6, 2023, 8:49 p.m. UTC | #11
Hanna Czenczek <hreitz@redhat.com> writes:

> On 06.10.23 17:17, Alex Bennée wrote:
>> Hanna Czenczek <hreitz@redhat.com> writes:
>>
>>> On 06.10.23 12:34, Michael S. Tsirkin wrote:
>>>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>>>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>> <snip>
>>>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>>>>> devices that would implement them as per virtio spec, and even today it’s
>>>>> broken for stateful devices.  The mentioned performance issue is likely
>>>>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>>>>
>>>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>>>>> final configuration that would happen upon a DRIVER_OK once the first vring
>>>>> is started (i.e. receives a kick).  That has the added benefit of being
>>>>> asynchronous because it doesn’t block any vhost-user messages (which are
>>>>> synchronous, and thus block downtime).
>>>>>
>>>>> Hanna
>>>> For better or worse kick is per ring. It's out of spec to start rings
>>>> that were not kicked but I guess you could do configuration ...
>>>> Seems somewhat asymmetrical though.
>>> I meant to take the first ring being started as the signal to do the
>>> global configuration, i.e. not do this once per vring, but once
>>> globally.
>>>
>>>> Let's wait until next week, hopefully Yajun Wu will answer.
>>> I mean, personally I don’t really care about the whole SET_STATUS
>>> thing.  It’s clear that it’s broken for stateful devices.  The fact
>>> that it took until 6f8be29ec17d to fix it for just any device that
>>> would implement it according to spec to me is a strong indication that
>>> nobody does implement it according to spec, and is currently only used
>>> to signal to some specific back-end that all rings have been set up
>>> and should be configured in a single block.
>> I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT
>> extensions where everything is off-loaded to the vhost-user backend.
>
> How do these back-ends work with the fact that qemu uses SET_STATUS
> incorrectly when not offloading?  Do you plan on fixing that?

Mainly having a common base implementation which does it right and
having very lightweight derivations for legacy stubs using it. The
aim is to eliminate the need for QEMU stubs entirely by fully specifying
the device from the vhost-user API. 

> (I.e. that we send SET_STATUS 0 when the VM is paused, potentially
> resetting state that is not recoverable, and that we set DRIVER and
> DRIVER_OK simultaneously.)

This is QEMU simulating a SET_STATUS rather than the guest triggering
it?

>
> Hanna
Yajun Wu Oct. 7, 2023, 2:22 a.m. UTC | #12
On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote:
> External email: Use caution opening links or attachments
>
>
> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>>>>>>>> There is no clearly defined purpose for the virtio status byte in
>>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
>>>>>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
>>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors
>>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>>>>>>>
>>>>>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
>>>>>>>>> implement it, but only uses it to signal feature negotiation failure.
>>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
>>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
>>>>>>>>> means the same thing as RESET_DEVICE).
>>>>>>>>>
>>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
>>>>>>>>> forward the guest-set status byte, but instead just makes it up
>>>>>>>>> internally, and actually completely ignores what the back-end returns,
>>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single
>>>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
>>>>>>>>> to see whether the flag is still set, which is the only way in which
>>>>>>>>> dpdk uses the status byte.
>>>>>>>>>
>>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this
>>>>>>>>> field in a useful manner, and it also provides no practical use over
>>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly
>>>>>>>>> defined.  Deprecate it.
>>>>>>>>>
>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>>> ---
>>>>>>>>>      docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>>>>>>>>>      1 file changed, 21 insertions(+), 7 deletions(-)
>>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
>>>>>>> The fact current backends never check errors does not mean they never
>>>>>>> will. So no, not applying this.
>>>>>> Can this not be done with REPLY_ACK?  I.e., with the following message
>>>>>> order:
>>>>>>
>>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
>>>>>> present
>>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>> 4. SET_FEATURES with need_reply
>>>>>>
>>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
>>>>>> vCPUs are stopped, which generally seems to request a device reset.  If we
>>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
>>>>>> implement SET_STATUS later may break with at least these qemu versions.  But
>>>>>> documenting that a particular use of the status byte is to be ignored would
>>>>>> be really strange.
>>>>>>
>>>>>> Hanna
>>>>> Hmm I guess. Though just following virtio spec seems cleaner to me...
>>>>> vhost-user reconfigures the state fully on start.
>>>> Not the internal device state, though.  virtiofsd has internal state, and
>>>> other devices like vhost-gpu back-ends would probably, too.
>>>>
>>>> Stefan has recently sent a series
>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
>>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
>>>> reset).
>>>>
>>>> I really don’t like our current approach with the status byte. Following the
>>>> virtio specification to me would mean that the guest directly controls this
>>>> byte, which it does not.  qemu makes up values as it deems appropriate, and
>>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
>>>> when the guest really doesn’t want a device reset.
>>>>
>>>> That means that qemu does not treat this as a virtio device field (because
>>>> that would mean exposing it to the guest driver), but instead treats it as
>>>> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
>>>> a virtio-defined feature for communication on the vhost level, i.e. between
>>>> front-end and back-end, and not between guest driver and device.  I think
>>>> all vhost-level protocol features should be fully defined in the vhost-user
>>>> specification, which REPLY_ACK is.
>>> Hmm that makes sense. Maybe we should have done what stefan's patch
>>> is doing.
>>>
>>> Do look at the original commit that introduced it to understand why
>>> it was added.
>> I don’t understand why this was added to the stop/cont code, though.  If it
>> is time consuming to make these changes, why are they done every time the VM
>> is paused
>> and resumed?  It makes sense that this would be done for the initial
>> configuration (where a reset also wouldn’t hurt), but here it seems wrong.
>>
>> (To be clear, a reset in the stop/cont code is wrong, because it breaks
>> stateful devices.)
>>
>> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
>> originally introduced was wrong even for non-stateful devices, because it
>> occurred before we fetched the state (vring indices) so we could restore it
>> later.  I don’t know how 923b8921d21 was tested, but if the back-end used
>> for testing implemented SET_STATUS 0 as a reset, it could not have survived
>> either migration or a stop/cont in general, because the vring indices would
>> have been reset to 0.
>>
>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>> devices that would implement them as per virtio spec, and even today it’s
>> broken for stateful devices.  The mentioned performance issue is likely
>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>
>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>> final configuration that would happen upon a DRIVER_OK once the first vring
>> is started (i.e. receives a kick).  That has the added benefit of being
>> asynchronous because it doesn’t block any vhost-user messages (which are
>> synchronous, and thus block downtime).
>>
>> Hanna
>
> For better or worse kick is per ring. It's out of spec to start rings
> that were not kicked but I guess you could do configuration ...
> Seems somewhat asymmetrical though.
>
> Let's wait until next week, hopefully Yajun Wu will answer.
The main motivation of adding VHOST_USER_SET_STATUS is to let backend 
DPDK know
when DRIVER_OK bit is valid. It's an indication of all VQ configuration 
has sent,
otherwise DPDK has to rely on first queue pair is ready, then 
receiving/applying
VQ configuration one by one.

During live migration, configuring VQ one by one is very time consuming. 
For VIRTIO
net vDPA, HW needs to know how many VQs are enabled to set 
RSS(Receive-Side Scaling).

If you don’t want SET_STATUS message, backend can remove protocol 
feature bit
VHOST_USER_PROTOCOL_F_STATUS.
DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device 
close/reset.

I'm not involved in discussion about adding SET_STATUS in Vhost 
protocol. This feature
is essential for vDPA(same as vhost-vdpa implements VHOST_VDPA_SET_STATUS).

Thanks,
Yajun
>
>>>> Now, we could hand full control of the status byte to the guest, and that
>>>> would make me content.  But I feel like that doesn’t really work, because
>>>> qemu needs to intercept the status byte anyway (it needs to know when there
>>>> is a reset, probably wants to know when the device is configured, etc.), so
>>>> I don’t think having the status byte in vhost-user really gains us much when
>>>> qemu could translate status byte changes to/from other vhost-user commands.
>>>>
>>>> Hanna
>>> well it intercepts it but I think it could pass it on unchanged.
>>>
>>>
>>>>> I guess symmetry was the
>>>>> point. So I don't see why SET_STATUS 0 has to be ignored.
>>>>>
>>>>>
>>>>> SET_STATUS was introduced by:
>>>>>
>>>>> commit 923b8921d210763359e96246a58658ac0db6c645
>>>>> Author: Yajun Wu <yajunw@nvidia.com>
>>>>> Date:   Mon Oct 17 14:44:52 2022 +0800
>>>>>
>>>>>        vhost-user: Support vhost_dev_start
>>>>>
>>>>> CC the author.
>>>>>
Hanna Czenczek Oct. 9, 2023, 8:07 a.m. UTC | #13
On 06.10.23 22:49, Alex Bennée wrote:
> Hanna Czenczek <hreitz@redhat.com> writes:
>
>> On 06.10.23 17:17, Alex Bennée wrote:
>>> Hanna Czenczek <hreitz@redhat.com> writes:
>>>
>>>> On 06.10.23 12:34, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>>>>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>> <snip>
>>>>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>>>>>> devices that would implement them as per virtio spec, and even today it’s
>>>>>> broken for stateful devices.  The mentioned performance issue is likely
>>>>>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>>>>>
>>>>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>>>>>> final configuration that would happen upon a DRIVER_OK once the first vring
>>>>>> is started (i.e. receives a kick).  That has the added benefit of being
>>>>>> asynchronous because it doesn’t block any vhost-user messages (which are
>>>>>> synchronous, and thus block downtime).
>>>>>>
>>>>>> Hanna
>>>>> For better or worse kick is per ring. It's out of spec to start rings
>>>>> that were not kicked but I guess you could do configuration ...
>>>>> Seems somewhat asymmetrical though.
>>>> I meant to take the first ring being started as the signal to do the
>>>> global configuration, i.e. not do this once per vring, but once
>>>> globally.
>>>>
>>>>> Let's wait until next week, hopefully Yajun Wu will answer.
>>>> I mean, personally I don’t really care about the whole SET_STATUS
>>>> thing.  It’s clear that it’s broken for stateful devices.  The fact
>>>> that it took until 6f8be29ec17d to fix it for just any device that
>>>> would implement it according to spec to me is a strong indication that
>>>> nobody does implement it according to spec, and is currently only used
>>>> to signal to some specific back-end that all rings have been set up
>>>> and should be configured in a single block.
>>> I'm certainly using [GS]ET_STATUS for the proposed F_TRANSPORT
>>> extensions where everything is off-loaded to the vhost-user backend.
>> How do these back-ends work with the fact that qemu uses SET_STATUS
>> incorrectly when not offloading?  Do you plan on fixing that?
> Mainly having a common base implementation which does it right and
> having very lightweight derivations for legacy stubs using it. The
> aim is to eliminate the need for QEMU stubs entirely by fully specifying
> the device from the vhost-user API.

If the current SET_STATUS use is overhauled, too, that would be good.  I 
wonder why you need the status byte, though.

>> (I.e. that we send SET_STATUS 0 when the VM is paused, potentially
>> resetting state that is not recoverable, and that we set DRIVER and
>> DRIVER_OK simultaneously.)
> This is QEMU simulating a SET_STATUS rather than the guest triggering
> it?

Yes, and the fact that we simulate it when the guest will not have 
triggered it, i.e. we reset the device (SET_STATUS 0) when the VM is 
paused.  Effectively, qemu injects virtio commands that the guest has 
never requested, which generally feels like a bad idea, because qemu 
will need to get the device back to its previous state before the guest 
is resumed, which may or may not work.  Specifically, it won’t work for 
devices that have internal state.

Furthermore, we use SET_STATUS to set ACKNOWLEDGE | DRIVER | DRIVER_OK 
simultaneously, which is wrong.  ACKNOWLEDGE | DRIVER may perhaps be set 
simultaneously, but then comes feature negotiation (setting and checking 
FEATURES_OK), and then DRIVER_OK.

Finally, how the status byte is to be used is not noted in the 
vhost-user specification, which instead points to the virtio 
specification.  I think if we keep SET_STATUS, it must be documented how 
it interacts with other vhost-user commands.  For example, how the 
FEATURES_OK protocol described in the virtio specification interacts 
with GET_FEATURES/SET_FEATURES, or whether SET_STATUS 0 and RESET_DEVICE 
are equivalent.  Currently, the only implementation of SET_STATUS I know 
(DPDK) ignores SET_STATUS 0, i.e. doesn’t do a reset.  To me that 
indicates that the spec must be clear on what these status values mean 
with regards to the vhost-user protocol as a whole.

So every software implementation with STATUS support that I know 
implements SET_STATUS wrongly right now, and that’s a problem, because 
it prevents implementations like virtiofsd from doing so correctly.

Hanna
Hanna Czenczek Oct. 9, 2023, 8:21 a.m. UTC | #14
On 07.10.23 04:22, Yajun Wu wrote:
>
> On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>>>>>>>>> There is no clearly defined purpose for the virtio status 
>>>>>>>>>> byte in
>>>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and 
>>>>>>>>>> for virtio
>>>>>>>>>> feature negotiation, we have [GS]ET_FEATURES. With the REPLY_ACK
>>>>>>>>>> protocol extension, it is possible for SET_FEATURES to return 
>>>>>>>>>> errors
>>>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>>>>>>>>
>>>>>>>>>> As for implementations, SET_STATUS is not widely 
>>>>>>>>>> implemented.  dpdk does
>>>>>>>>>> implement it, but only uses it to signal feature negotiation 
>>>>>>>>>> failure.
>>>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it 
>>>>>>>>>> effectively
>>>>>>>>>> ignores them, in contrast to RESET_OWNER (which is 
>>>>>>>>>> deprecated, and today
>>>>>>>>>> means the same thing as RESET_DEVICE).
>>>>>>>>>>
>>>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it 
>>>>>>>>>> does not
>>>>>>>>>> forward the guest-set status byte, but instead just makes it up
>>>>>>>>>> internally, and actually completely ignores what the back-end 
>>>>>>>>>> returns,
>>>>>>>>>> only using it as the template for a subsequent SET_STATUS to 
>>>>>>>>>> add single
>>>>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never 
>>>>>>>>>> reads it back
>>>>>>>>>> to see whether the flag is still set, which is the only way 
>>>>>>>>>> in which
>>>>>>>>>> dpdk uses the status byte.
>>>>>>>>>>
>>>>>>>>>> As-is, no front-end or back-end can rely on the other side 
>>>>>>>>>> handling this
>>>>>>>>>> field in a useful manner, and it also provides no practical 
>>>>>>>>>> use over
>>>>>>>>>> other mechanisms the vhost-user protocol has, which are more 
>>>>>>>>>> clearly
>>>>>>>>>> defined.  Deprecate it.
>>>>>>>>>>
>>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>>      docs/interop/vhost-user.rst | 28 
>>>>>>>>>> +++++++++++++++++++++-------
>>>>>>>>>>      1 file changed, 21 insertions(+), 7 deletions(-)
>>>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>> SET_STATUS is the only way to signal failure to acknowledge 
>>>>>>>> FEATURES_OK.
>>>>>>>> The fact current backends never check errors does not mean they 
>>>>>>>> never
>>>>>>>> will. So no, not applying this.
>>>>>>> Can this not be done with REPLY_ACK?  I.e., with the following 
>>>>>>> message
>>>>>>> order:
>>>>>>>
>>>>>>> 1. GET_FEATURES to find out whether 
>>>>>>> VHOST_USER_F_PROTOCOL_FEATURES is
>>>>>>> present
>>>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get 
>>>>>>> VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>>> 4. SET_FEATURES with need_reply
>>>>>>>
>>>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a 
>>>>>>> while when the
>>>>>>> vCPUs are stopped, which generally seems to request a device 
>>>>>>> reset.  If we
>>>>>>> don’t state at least that SET_STATUS 0 is to be ignored, 
>>>>>>> back-ends that will
>>>>>>> implement SET_STATUS later may break with at least these qemu 
>>>>>>> versions.  But
>>>>>>> documenting that a particular use of the status byte is to be 
>>>>>>> ignored would
>>>>>>> be really strange.
>>>>>>>
>>>>>>> Hanna
>>>>>> Hmm I guess. Though just following virtio spec seems cleaner to 
>>>>>> me...
>>>>>> vhost-user reconfigures the state fully on start.
>>>>> Not the internal device state, though.  virtiofsd has internal 
>>>>> state, and
>>>>> other devices like vhost-gpu back-ends would probably, too.
>>>>>
>>>>> Stefan has recently sent a series
>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) 
>>>>> to
>>>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really 
>>>>> need a
>>>>> reset).
>>>>>
>>>>> I really don’t like our current approach with the status byte. 
>>>>> Following the
>>>>> virtio specification to me would mean that the guest directly 
>>>>> controls this
>>>>> byte, which it does not.  qemu makes up values as it deems 
>>>>> appropriate, and
>>>>> this includes sending a SET_STATUS 0 when the guest is just 
>>>>> paused, i.e.
>>>>> when the guest really doesn’t want a device reset.
>>>>>
>>>>> That means that qemu does not treat this as a virtio device field 
>>>>> (because
>>>>> that would mean exposing it to the guest driver), but instead 
>>>>> treats it as
>>>>> part of the vhost(-user) protocol.  It doesn’t feel right to me 
>>>>> that we use
>>>>> a virtio-defined feature for communication on the vhost level, 
>>>>> i.e. between
>>>>> front-end and back-end, and not between guest driver and device.  
>>>>> I think
>>>>> all vhost-level protocol features should be fully defined in the 
>>>>> vhost-user
>>>>> specification, which REPLY_ACK is.
>>>> Hmm that makes sense. Maybe we should have done what stefan's patch
>>>> is doing.
>>>>
>>>> Do look at the original commit that introduced it to understand why
>>>> it was added.
>>> I don’t understand why this was added to the stop/cont code, 
>>> though.  If it
>>> is time consuming to make these changes, why are they done every 
>>> time the VM
>>> is paused
>>> and resumed?  It makes sense that this would be done for the initial
>>> configuration (where a reset also wouldn’t hurt), but here it seems 
>>> wrong.
>>>
>>> (To be clear, a reset in the stop/cont code is wrong, because it breaks
>>> stateful devices.)
>>>
>>> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
>>> originally introduced was wrong even for non-stateful devices, 
>>> because it
>>> occurred before we fetched the state (vring indices) so we could 
>>> restore it
>>> later.  I don’t know how 923b8921d21 was tested, but if the back-end 
>>> used
>>> for testing implemented SET_STATUS 0 as a reset, it could not have 
>>> survived
>>> either migration or a stop/cont in general, because the vring 
>>> indices would
>>> have been reset to 0.
>>>
>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that 
>>> broke all
>>> devices that would implement them as per virtio spec, and even today 
>>> it’s
>>> broken for stateful devices.  The mentioned performance issue is likely
>>> real, but we can’t address it by making up SET_STATUS calls that are 
>>> wrong.
>>>
>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would 
>>> do all
>>> final configuration that would happen upon a DRIVER_OK once the 
>>> first vring
>>> is started (i.e. receives a kick).  That has the added benefit of being
>>> asynchronous because it doesn’t block any vhost-user messages (which 
>>> are
>>> synchronous, and thus block downtime).
>>>
>>> Hanna
>>
>> For better or worse kick is per ring. It's out of spec to start rings
>> that were not kicked but I guess you could do configuration ...
>> Seems somewhat asymmetrical though.
>>
>> Let's wait until next week, hopefully Yajun Wu will answer.
> The main motivation of adding VHOST_USER_SET_STATUS is to let backend 
> DPDK know
> when DRIVER_OK bit is valid. It's an indication of all VQ 
> configuration has sent,
> otherwise DPDK has to rely on first queue pair is ready, then 
> receiving/applying
> VQ configuration one by one.
>
> During live migration, configuring VQ one by one is very time consuming.

One question I have here is why it wasn’t then introduced in the live 
migration code, but in the general VM stop/cont code instead. It does 
seem time-consuming to do this every time the VM is paused and resumed.

> For VIRTIO
> net vDPA, HW needs to know how many VQs are enabled to set 
> RSS(Receive-Side Scaling).
>
> If you don’t want SET_STATUS message, backend can remove protocol 
> feature bit
> VHOST_USER_PROTOCOL_F_STATUS.

The problem isn’t back-ends that don’t want the message, the problem is 
that qemu uses the message wrongly, which prevents well-behaving 
back-ends from implementing the message.

> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device 
> close/reset.

So the right thing to do for back-ends is to announce STATUS support and 
then not implement it correctly?

GET_VRING_BASE should not reset the close or reset the device, by the 
way.  It should stop that one vring, not more.  We have a RESET_DEVICE 
command for resetting.

> I'm not involved in discussion about adding SET_STATUS in Vhost 
> protocol. This feature
> is essential for vDPA(same as vhost-vdpa implements 
> VHOST_VDPA_SET_STATUS).

So from what I gather from your response is that there is only a single 
use for SET_STATUS, which is the DRIVER_OK bit.  If so, documenting that 
all other bits are to be ignored by both back-end and front-end would be 
fine by me.

I’m not fully serious about that suggestion, but I hear the strong 
implication that nothing but DRIVER_OK was of any concern, and this is 
really important to note when we talk about the status of the STATUS 
feature in vhost today.  It seems to me now that it was not intended to 
be the virtio-level status byte, but just a DRIVER_OK signalling path 
from front-end to back-end.  That makes it a vhost-level protocol 
feature to me.

Hanna

>
> Thanks,
> Yajun
>>
>>>>> Now, we could hand full control of the status byte to the guest, 
>>>>> and that
>>>>> would make me content.  But I feel like that doesn’t really work, 
>>>>> because
>>>>> qemu needs to intercept the status byte anyway (it needs to know 
>>>>> when there
>>>>> is a reset, probably wants to know when the device is configured, 
>>>>> etc.), so
>>>>> I don’t think having the status byte in vhost-user really gains us 
>>>>> much when
>>>>> qemu could translate status byte changes to/from other vhost-user 
>>>>> commands.
>>>>>
>>>>> Hanna
>>>> well it intercepts it but I think it could pass it on unchanged.
>>>>
>>>>
>>>>>> I guess symmetry was the
>>>>>> point. So I don't see why SET_STATUS 0 has to be ignored.
>>>>>>
>>>>>>
>>>>>> SET_STATUS was introduced by:
>>>>>>
>>>>>> commit 923b8921d210763359e96246a58658ac0db6c645
>>>>>> Author: Yajun Wu <yajunw@nvidia.com>
>>>>>> Date:   Mon Oct 17 14:44:52 2022 +0800
>>>>>>
>>>>>>        vhost-user: Support vhost_dev_start
>>>>>>
>>>>>> CC the author.
>>>>>>
>
Hanna Czenczek Oct. 9, 2023, 9:07 a.m. UTC | #15
On 09.10.23 10:21, Hanna Czenczek wrote:
> On 07.10.23 04:22, Yajun Wu wrote:

[...]

>> The main motivation of adding VHOST_USER_SET_STATUS is to let backend 
>> DPDK know
>> when DRIVER_OK bit is valid. It's an indication of all VQ 
>> configuration has sent,
>> otherwise DPDK has to rely on first queue pair is ready, then 
>> receiving/applying
>> VQ configuration one by one.
>>
>> During live migration, configuring VQ one by one is very time consuming.
>
> One question I have here is why it wasn’t then introduced in the live 
> migration code, but in the general VM stop/cont code instead. It does 
> seem time-consuming to do this every time the VM is paused and resumed.
>
>> For VIRTIO
>> net vDPA, HW needs to know how many VQs are enabled to set 
>> RSS(Receive-Side Scaling).
>>
>> If you don’t want SET_STATUS message, backend can remove protocol 
>> feature bit
>> VHOST_USER_PROTOCOL_F_STATUS.
>
> The problem isn’t back-ends that don’t want the message, the problem 
> is that qemu uses the message wrongly, which prevents well-behaving 
> back-ends from implementing the message.
>
>> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device 
>> close/reset.
>
> So the right thing to do for back-ends is to announce STATUS support 
> and then not implement it correctly?
>
> GET_VRING_BASE should not reset the close or reset the device, by the 
> way.  It should stop that one vring, not more.  We have a RESET_DEVICE 
> command for resetting.
>
>> I'm not involved in discussion about adding SET_STATUS in Vhost 
>> protocol. This feature
>> is essential for vDPA(same as vhost-vdpa implements 
>> VHOST_VDPA_SET_STATUS).
>
> So from what I gather from your response is that there is only a 
> single use for SET_STATUS, which is the DRIVER_OK bit.  If so, 
> documenting that all other bits are to be ignored by both back-end and 
> front-end would be fine by me.
>
> I’m not fully serious about that suggestion, but I hear the strong 
> implication that nothing but DRIVER_OK was of any concern, and this is 
> really important to note when we talk about the status of the STATUS 
> feature in vhost today.  It seems to me now that it was not intended 
> to be the virtio-level status byte, but just a DRIVER_OK signalling 
> path from front-end to back-end.  That makes it a vhost-level protocol 
> feature to me.

On second thought, it just is a pure vhost-level protocol feature, and 
has nothing to do with the virtio status byte as-is.  The only stated 
purpose is for the front-end to send DRIVER_OK after migration, but 
migration is transparent to the guest, so the guest would never change 
the status byte during migration.  Therefore, if this feature is 
essential, we will never be able to have a status byte that is 
transparently shared between guest and back-end device, i.e. the virtio 
status byte.

Cc-ing Alex on this mail, because to me, this seems like an important 
detail when he plans on using the byte in the future.  If we need a 
virtio status byte, I can’t see how we could use the existing F_STATUS 
for it.

Hanna
Hanna Czenczek Oct. 9, 2023, 9:13 a.m. UTC | #16
On 09.10.23 11:07, Hanna Czenczek wrote:
> On 09.10.23 10:21, Hanna Czenczek wrote:
>> On 07.10.23 04:22, Yajun Wu wrote:
>
> [...]
>
>>> The main motivation of adding VHOST_USER_SET_STATUS is to let 
>>> backend DPDK know
>>> when DRIVER_OK bit is valid. It's an indication of all VQ 
>>> configuration has sent,
>>> otherwise DPDK has to rely on first queue pair is ready, then 
>>> receiving/applying
>>> VQ configuration one by one.
>>>
>>> During live migration, configuring VQ one by one is very time 
>>> consuming.
>>
>> One question I have here is why it wasn’t then introduced in the live 
>> migration code, but in the general VM stop/cont code instead. It does 
>> seem time-consuming to do this every time the VM is paused and resumed.
>>
>>> For VIRTIO
>>> net vDPA, HW needs to know how many VQs are enabled to set 
>>> RSS(Receive-Side Scaling).
>>>
>>> If you don’t want SET_STATUS message, backend can remove protocol 
>>> feature bit
>>> VHOST_USER_PROTOCOL_F_STATUS.
>>
>> The problem isn’t back-ends that don’t want the message, the problem 
>> is that qemu uses the message wrongly, which prevents well-behaving 
>> back-ends from implementing the message.
>>
>>> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device 
>>> close/reset.
>>
>> So the right thing to do for back-ends is to announce STATUS support 
>> and then not implement it correctly?
>>
>> GET_VRING_BASE should not reset the close or reset the device, by the 
>> way.  It should stop that one vring, not more.  We have a 
>> RESET_DEVICE command for resetting.
>>
>>> I'm not involved in discussion about adding SET_STATUS in Vhost 
>>> protocol. This feature
>>> is essential for vDPA(same as vhost-vdpa implements 
>>> VHOST_VDPA_SET_STATUS).
>>
>> So from what I gather from your response is that there is only a 
>> single use for SET_STATUS, which is the DRIVER_OK bit.  If so, 
>> documenting that all other bits are to be ignored by both back-end 
>> and front-end would be fine by me.
>>
>> I’m not fully serious about that suggestion, but I hear the strong 
>> implication that nothing but DRIVER_OK was of any concern, and this 
>> is really important to note when we talk about the status of the 
>> STATUS feature in vhost today.  It seems to me now that it was not 
>> intended to be the virtio-level status byte, but just a DRIVER_OK 
>> signalling path from front-end to back-end.  That makes it a 
>> vhost-level protocol feature to me.
>
> On second thought, it just is a pure vhost-level protocol feature, and 
> has nothing to do with the virtio status byte as-is.  The only stated 
> purpose is for the front-end to send DRIVER_OK after migration, but 
> migration is transparent to the guest, so the guest would never change 
> the status byte during migration.  Therefore, if this feature is 
> essential, we will never be able to have a status byte that is 
> transparently shared between guest and back-end device, i.e. the 
> virtio status byte.

On third thought, scratch that.  The guest wouldn’t set it, but 
naturally, after migration, the front-end will need to restore the 
status byte from the source, so the front-end will always need to set 
it, even if it were otherwise used controlled only by the guest and the 
back-end device.  So technically, this doesn’t prevent such a use case.  
(In practice, it isn’t controlled by the guest right now, but that could 
be fixed.)

> Cc-ing Alex on this mail, because to me, this seems like an important 
> detail when he plans on using the byte in the future. If we need a 
> virtio status byte, I can’t see how we could use the existing F_STATUS 
> for it.
>
> Hanna
German Maglione Oct. 9, 2023, 10:28 a.m. UTC | #17
On Sat, Oct 7, 2023 at 4:23 AM Yajun Wu <yajunw@nvidia.com> wrote:
>
>
> On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
> >> On 06.10.23 11:26, Michael S. Tsirkin wrote:
> >>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
> >>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
> >>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
> >>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
> >>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
> >>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> >>>>>>>>> There is no clearly defined purpose for the virtio status byte in
> >>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
> >>>>>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
> >>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors
> >>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
> >>>>>>>>>
> >>>>>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
> >>>>>>>>> implement it, but only uses it to signal feature negotiation failure.
> >>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
> >>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
> >>>>>>>>> means the same thing as RESET_DEVICE).
> >>>>>>>>>
> >>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
> >>>>>>>>> forward the guest-set status byte, but instead just makes it up
> >>>>>>>>> internally, and actually completely ignores what the back-end returns,
> >>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single
> >>>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
> >>>>>>>>> to see whether the flag is still set, which is the only way in which
> >>>>>>>>> dpdk uses the status byte.
> >>>>>>>>>
> >>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this
> >>>>>>>>> field in a useful manner, and it also provides no practical use over
> >>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly
> >>>>>>>>> defined.  Deprecate it.
> >>>>>>>>>
> >>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>>>>>>> ---
> >>>>>>>>>      docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
> >>>>>>>>>      1 file changed, 21 insertions(+), 7 deletions(-)
> >>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
> >>>>>>> The fact current backends never check errors does not mean they never
> >>>>>>> will. So no, not applying this.
> >>>>>> Can this not be done with REPLY_ACK?  I.e., with the following message
> >>>>>> order:
> >>>>>>
> >>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
> >>>>>> present
> >>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
> >>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
> >>>>>> 4. SET_FEATURES with need_reply
> >>>>>>
> >>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
> >>>>>> vCPUs are stopped, which generally seems to request a device reset.  If we
> >>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
> >>>>>> implement SET_STATUS later may break with at least these qemu versions.  But
> >>>>>> documenting that a particular use of the status byte is to be ignored would
> >>>>>> be really strange.
> >>>>>>
> >>>>>> Hanna
> >>>>> Hmm I guess. Though just following virtio spec seems cleaner to me...
> >>>>> vhost-user reconfigures the state fully on start.
> >>>> Not the internal device state, though.  virtiofsd has internal state, and
> >>>> other devices like vhost-gpu back-ends would probably, too.
> >>>>
> >>>> Stefan has recently sent a series
> >>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
> >>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
> >>>> reset).
> >>>>
> >>>> I really don’t like our current approach with the status byte. Following the
> >>>> virtio specification to me would mean that the guest directly controls this
> >>>> byte, which it does not.  qemu makes up values as it deems appropriate, and
> >>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
> >>>> when the guest really doesn’t want a device reset.
> >>>>
> >>>> That means that qemu does not treat this as a virtio device field (because
> >>>> that would mean exposing it to the guest driver), but instead treats it as
> >>>> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
> >>>> a virtio-defined feature for communication on the vhost level, i.e. between
> >>>> front-end and back-end, and not between guest driver and device.  I think
> >>>> all vhost-level protocol features should be fully defined in the vhost-user
> >>>> specification, which REPLY_ACK is.
> >>> Hmm that makes sense. Maybe we should have done what stefan's patch
> >>> is doing.
> >>>
> >>> Do look at the original commit that introduced it to understand why
> >>> it was added.
> >> I don’t understand why this was added to the stop/cont code, though.  If it
> >> is time consuming to make these changes, why are they done every time the VM
> >> is paused
> >> and resumed?  It makes sense that this would be done for the initial
> >> configuration (where a reset also wouldn’t hurt), but here it seems wrong.
> >>
> >> (To be clear, a reset in the stop/cont code is wrong, because it breaks
> >> stateful devices.)
> >>
> >> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
> >> originally introduced was wrong even for non-stateful devices, because it
> >> occurred before we fetched the state (vring indices) so we could restore it
> >> later.  I don’t know how 923b8921d21 was tested, but if the back-end used
> >> for testing implemented SET_STATUS 0 as a reset, it could not have survived
> >> either migration or a stop/cont in general, because the vring indices would
> >> have been reset to 0.
> >>
> >> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
> >> devices that would implement them as per virtio spec, and even today it’s
> >> broken for stateful devices.  The mentioned performance issue is likely
> >> real, but we can’t address it by making up SET_STATUS calls that are wrong.
> >>
> >> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
> >> final configuration that would happen upon a DRIVER_OK once the first vring
> >> is started (i.e. receives a kick).  That has the added benefit of being
> >> asynchronous because it doesn’t block any vhost-user messages (which are
> >> synchronous, and thus block downtime).
> >>
> >> Hanna
> >
> > For better or worse kick is per ring. It's out of spec to start rings
> > that were not kicked but I guess you could do configuration ...
> > Seems somewhat asymmetrical though.
> >
> > Let's wait until next week, hopefully Yajun Wu will answer.
> The main motivation of adding VHOST_USER_SET_STATUS is to let backend
> DPDK know
> when DRIVER_OK bit is valid. It's an indication of all VQ configuration
> has sent,
> otherwise DPDK has to rely on first queue pair is ready, then
> receiving/applying
> VQ configuration one by one.
>
> During live migration, configuring VQ one by one is very time consuming.
> For VIRTIO
> net vDPA, HW needs to know how many VQs are enabled to set
> RSS(Receive-Side Scaling).
>
> If you don’t want SET_STATUS message, backend can remove protocol
> feature bit
> VHOST_USER_PROTOCOL_F_STATUS.
> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device
> close/reset.

This is incorrect, resetting the device on GET_VRING_BASE breaks
the stop/cont. Since you don't want to reset the VQs on stop/cont.

>
> I'm not involved in discussion about adding SET_STATUS in Vhost
> protocol. This feature
> is essential for vDPA(same as vhost-vdpa implements VHOST_VDPA_SET_STATUS).
>
> Thanks,
> Yajun
> >
> >>>> Now, we could hand full control of the status byte to the guest, and that
> >>>> would make me content.  But I feel like that doesn’t really work, because
> >>>> qemu needs to intercept the status byte anyway (it needs to know when there
> >>>> is a reset, probably wants to know when the device is configured, etc.), so
> >>>> I don’t think having the status byte in vhost-user really gains us much when
> >>>> qemu could translate status byte changes to/from other vhost-user commands.
> >>>>
> >>>> Hanna
> >>> well it intercepts it but I think it could pass it on unchanged.
> >>>
> >>>
> >>>>> I guess symmetry was the
> >>>>> point. So I don't see why SET_STATUS 0 has to be ignored.
> >>>>>
> >>>>>
> >>>>> SET_STATUS was introduced by:
> >>>>>
> >>>>> commit 923b8921d210763359e96246a58658ac0db6c645
> >>>>> Author: Yajun Wu <yajunw@nvidia.com>
> >>>>> Date:   Mon Oct 17 14:44:52 2022 +0800
> >>>>>
> >>>>>        vhost-user: Support vhost_dev_start
> >>>>>
> >>>>> CC the author.
> >>>>>
>
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://listman.redhat.com/mailman/listinfo/virtio-fs
Yajun Wu Oct. 10, 2023, 2:56 a.m. UTC | #18
On 10/9/2023 6:28 PM, German Maglione wrote:
> External email: Use caution opening links or attachments
>
>
> On Sat, Oct 7, 2023 at 4:23 AM Yajun Wu <yajunw@nvidia.com> wrote:
>>
>> On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
>>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
>>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
>>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
>>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
>>>>>>>>>>> There is no clearly defined purpose for the virtio status byte in
>>>>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
>>>>>>>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
>>>>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors
>>>>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
>>>>>>>>>>>
>>>>>>>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
>>>>>>>>>>> implement it, but only uses it to signal feature negotiation failure.
>>>>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
>>>>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
>>>>>>>>>>> means the same thing as RESET_DEVICE).
>>>>>>>>>>>
>>>>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
>>>>>>>>>>> forward the guest-set status byte, but instead just makes it up
>>>>>>>>>>> internally, and actually completely ignores what the back-end returns,
>>>>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single
>>>>>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
>>>>>>>>>>> to see whether the flag is still set, which is the only way in which
>>>>>>>>>>> dpdk uses the status byte.
>>>>>>>>>>>
>>>>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this
>>>>>>>>>>> field in a useful manner, and it also provides no practical use over
>>>>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly
>>>>>>>>>>> defined.  Deprecate it.
>>>>>>>>>>>
>>>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>>>>> ---
>>>>>>>>>>>       docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
>>>>>>>>>>>       1 file changed, 21 insertions(+), 7 deletions(-)
>>>>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
>>>>>>>>> The fact current backends never check errors does not mean they never
>>>>>>>>> will. So no, not applying this.
>>>>>>>> Can this not be done with REPLY_ACK?  I.e., with the following message
>>>>>>>> order:
>>>>>>>>
>>>>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
>>>>>>>> present
>>>>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
>>>>>>>> 4. SET_FEATURES with need_reply
>>>>>>>>
>>>>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
>>>>>>>> vCPUs are stopped, which generally seems to request a device reset.  If we
>>>>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
>>>>>>>> implement SET_STATUS later may break with at least these qemu versions.  But
>>>>>>>> documenting that a particular use of the status byte is to be ignored would
>>>>>>>> be really strange.
>>>>>>>>
>>>>>>>> Hanna
>>>>>>> Hmm I guess. Though just following virtio spec seems cleaner to me...
>>>>>>> vhost-user reconfigures the state fully on start.
>>>>>> Not the internal device state, though.  virtiofsd has internal state, and
>>>>>> other devices like vhost-gpu back-ends would probably, too.
>>>>>>
>>>>>> Stefan has recently sent a series
>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
>>>>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
>>>>>> reset).
>>>>>>
>>>>>> I really don’t like our current approach with the status byte. Following the
>>>>>> virtio specification to me would mean that the guest directly controls this
>>>>>> byte, which it does not.  qemu makes up values as it deems appropriate, and
>>>>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
>>>>>> when the guest really doesn’t want a device reset.
>>>>>>
>>>>>> That means that qemu does not treat this as a virtio device field (because
>>>>>> that would mean exposing it to the guest driver), but instead treats it as
>>>>>> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
>>>>>> a virtio-defined feature for communication on the vhost level, i.e. between
>>>>>> front-end and back-end, and not between guest driver and device.  I think
>>>>>> all vhost-level protocol features should be fully defined in the vhost-user
>>>>>> specification, which REPLY_ACK is.
>>>>> Hmm that makes sense. Maybe we should have done what stefan's patch
>>>>> is doing.
>>>>>
>>>>> Do look at the original commit that introduced it to understand why
>>>>> it was added.
>>>> I don’t understand why this was added to the stop/cont code, though.  If it
>>>> is time consuming to make these changes, why are they done every time the VM
>>>> is paused
>>>> and resumed?  It makes sense that this would be done for the initial
>>>> configuration (where a reset also wouldn’t hurt), but here it seems wrong.
>>>>
>>>> (To be clear, a reset in the stop/cont code is wrong, because it breaks
>>>> stateful devices.)
>>>>
>>>> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
>>>> originally introduced was wrong even for non-stateful devices, because it
>>>> occurred before we fetched the state (vring indices) so we could restore it
>>>> later.  I don’t know how 923b8921d21 was tested, but if the back-end used
>>>> for testing implemented SET_STATUS 0 as a reset, it could not have survived
>>>> either migration or a stop/cont in general, because the vring indices would
>>>> have been reset to 0.
>>>>
>>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
>>>> devices that would implement them as per virtio spec, and even today it’s
>>>> broken for stateful devices.  The mentioned performance issue is likely
>>>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
>>>>
>>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
>>>> final configuration that would happen upon a DRIVER_OK once the first vring
>>>> is started (i.e. receives a kick).  That has the added benefit of being
>>>> asynchronous because it doesn’t block any vhost-user messages (which are
>>>> synchronous, and thus block downtime).
>>>>
>>>> Hanna
>>> For better or worse kick is per ring. It's out of spec to start rings
>>> that were not kicked but I guess you could do configuration ...
>>> Seems somewhat asymmetrical though.
>>>
>>> Let's wait until next week, hopefully Yajun Wu will answer.
>> The main motivation of adding VHOST_USER_SET_STATUS is to let backend
>> DPDK know
>> when DRIVER_OK bit is valid. It's an indication of all VQ configuration
>> has sent,
>> otherwise DPDK has to rely on first queue pair is ready, then
>> receiving/applying
>> VQ configuration one by one.
>>
>> During live migration, configuring VQ one by one is very time consuming.
>> For VIRTIO
>> net vDPA, HW needs to know how many VQs are enabled to set
>> RSS(Receive-Side Scaling).
>>
>> If you don’t want SET_STATUS message, backend can remove protocol
>> feature bit
>> VHOST_USER_PROTOCOL_F_STATUS.
>> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device
>> close/reset.
> This is incorrect, resetting the device on GET_VRING_BASE breaks
> the stop/cont. Since you don't want to reset the VQs on stop/cont.
Sorry for the misunderstanding, dpdk vhost backend framework doesn't 
have RESET concept(only device level .dev_conf and .dev_close). On 
receiving DRIVER_OK does dev_conf, on receiving GET_VRING_BASE does 
dev_close. For every VM suspend/resume, dpdk issues dev_close then dev_conf.
>
>> I'm not involved in discussion about adding SET_STATUS in Vhost
>> protocol. This feature
>> is essential for vDPA(same as vhost-vdpa implements VHOST_VDPA_SET_STATUS).
>>
>> Thanks,
>> Yajun
>>>>>> Now, we could hand full control of the status byte to the guest, and that
>>>>>> would make me content.  But I feel like that doesn’t really work, because
>>>>>> qemu needs to intercept the status byte anyway (it needs to know when there
>>>>>> is a reset, probably wants to know when the device is configured, etc.), so
>>>>>> I don’t think having the status byte in vhost-user really gains us much when
>>>>>> qemu could translate status byte changes to/from other vhost-user commands.
>>>>>>
>>>>>> Hanna
>>>>> well it intercepts it but I think it could pass it on unchanged.
>>>>>
>>>>>
>>>>>>> I guess symmetry was the
>>>>>>> point. So I don't see why SET_STATUS 0 has to be ignored.
>>>>>>>
>>>>>>>
>>>>>>> SET_STATUS was introduced by:
>>>>>>>
>>>>>>> commit 923b8921d210763359e96246a58658ac0db6c645
>>>>>>> Author: Yajun Wu <yajunw@nvidia.com>
>>>>>>> Date:   Mon Oct 17 14:44:52 2022 +0800
>>>>>>>
>>>>>>>         vhost-user: Support vhost_dev_start
>>>>>>>
>>>>>>> CC the author.
>>>>>>>
>> _______________________________________________
>> Virtio-fs mailing list
>> Virtio-fs@redhat.com
>> https://listman.redhat.com/mailman/listinfo/virtio-fs
>
>
> --
> German
>
Yajun Wu Oct. 10, 2023, 4 a.m. UTC | #19
On 10/9/2023 5:13 PM, Hanna Czenczek wrote:
> External email: Use caution opening links or attachments
>
>
> On 09.10.23 11:07, Hanna Czenczek wrote:
>> On 09.10.23 10:21, Hanna Czenczek wrote:
>>> On 07.10.23 04:22, Yajun Wu wrote:
>> [...]
>>
>>>> The main motivation of adding VHOST_USER_SET_STATUS is to let
>>>> backend DPDK know
>>>> when DRIVER_OK bit is valid. It's an indication of all VQ
>>>> configuration has sent,
>>>> otherwise DPDK has to rely on first queue pair is ready, then
>>>> receiving/applying
>>>> VQ configuration one by one.
>>>>
>>>> During live migration, configuring VQ one by one is very time
>>>> consuming.
>>> One question I have here is why it wasn’t then introduced in the live
>>> migration code, but in the general VM stop/cont code instead. It does
>>> seem time-consuming to do this every time the VM is paused and resumed.

Yes, VM stop/cont will call vhost_net_stop/vhost_net_start. Maybe 
because there's no device level stop/cont vhost message?

>>>
>>>> For VIRTIO
>>>> net vDPA, HW needs to know how many VQs are enabled to set
>>>> RSS(Receive-Side Scaling).
>>>>
>>>> If you don’t want SET_STATUS message, backend can remove protocol
>>>> feature bit
>>>> VHOST_USER_PROTOCOL_F_STATUS.
>>> The problem isn’t back-ends that don’t want the message, the problem
>>> is that qemu uses the message wrongly, which prevents well-behaving
>>> back-ends from implementing the message.
>>>
>>>> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device
>>>> close/reset.
>>> So the right thing to do for back-ends is to announce STATUS support
>>> and then not implement it correctly?
>>>
>>> GET_VRING_BASE should not reset the close or reset the device, by the
>>> way.  It should stop that one vring, not more.  We have a
>>> RESET_DEVICE command for resetting.
I believe dpdk uses GET_VRING_BASE long before qemu has RESET_DEVICE? 
It's a compatible issue. For new backend implements, we can have better 
solution, right?
>>>> I'm not involved in discussion about adding SET_STATUS in Vhost
>>>> protocol. This feature
>>>> is essential for vDPA(same as vhost-vdpa implements
>>>> VHOST_VDPA_SET_STATUS).
>>> So from what I gather from your response is that there is only a
>>> single use for SET_STATUS, which is the DRIVER_OK bit.  If so,
>>> documenting that all other bits are to be ignored by both back-end
>>> and front-end would be fine by me.
>>>
>>> I’m not fully serious about that suggestion, but I hear the strong
>>> implication that nothing but DRIVER_OK was of any concern, and this
>>> is really important to note when we talk about the status of the
>>> STATUS feature in vhost today.  It seems to me now that it was not
>>> intended to be the virtio-level status byte, but just a DRIVER_OK
>>> signalling path from front-end to back-end.  That makes it a
>>> vhost-level protocol feature to me.
>> On second thought, it just is a pure vhost-level protocol feature, and
>> has nothing to do with the virtio status byte as-is.  The only stated
>> purpose is for the front-end to send DRIVER_OK after migration, but
>> migration is transparent to the guest, so the guest would never change
>> the status byte during migration.  Therefore, if this feature is
>> essential, we will never be able to have a status byte that is
>> transparently shared between guest and back-end device, i.e. the
>> virtio status byte.
> On third thought, scratch that.  The guest wouldn’t set it, but
> naturally, after migration, the front-end will need to restore the
> status byte from the source, so the front-end will always need to set
> it, even if it were otherwise used controlled only by the guest and the
> back-end device.  So technically, this doesn’t prevent such a use case.
> (In practice, it isn’t controlled by the guest right now, but that could
> be fixed.)
I only tested the feature with DPDK(the only backend use it today?). Max 
defined the protocol and added the corresponding code in DPDK before I 
added QEMU support. If other backend or different device type want to 
use this, we can have further discussion?
>> Cc-ing Alex on this mail, because to me, this seems like an important
>> detail when he plans on using the byte in the future. If we need a
>> virtio status byte, I can’t see how we could use the existing F_STATUS
>> for it.
>>
>> Hanna
Hanna Czenczek Oct. 10, 2023, 8:18 a.m. UTC | #20
On 10.10.23 06:00, Yajun Wu wrote:
>
> On 10/9/2023 5:13 PM, Hanna Czenczek wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 09.10.23 11:07, Hanna Czenczek wrote:
>>> On 09.10.23 10:21, Hanna Czenczek wrote:
>>>> On 07.10.23 04:22, Yajun Wu wrote:
>>> [...]
>>>
>>>>> The main motivation of adding VHOST_USER_SET_STATUS is to let
>>>>> backend DPDK know
>>>>> when DRIVER_OK bit is valid. It's an indication of all VQ
>>>>> configuration has sent,
>>>>> otherwise DPDK has to rely on first queue pair is ready, then
>>>>> receiving/applying
>>>>> VQ configuration one by one.
>>>>>
>>>>> During live migration, configuring VQ one by one is very time
>>>>> consuming.
>>>> One question I have here is why it wasn’t then introduced in the live
>>>> migration code, but in the general VM stop/cont code instead. It does
>>>> seem time-consuming to do this every time the VM is paused and 
>>>> resumed.
>
> Yes, VM stop/cont will call vhost_net_stop/vhost_net_start. Maybe 
> because there's no device level stop/cont vhost message?

No, it is because qemu will reset the status in stop/cont*, which it 
should not do.  Aside from guest-initiated resets, the only thing where 
a reset comes into play is when the back-end is changed, e.g. during 
migration.  In that case, the source back-end will see a disconnect on 
the vhost-user socket and can then do whatever uninitialization it needs 
to do, and the destination front-end will need to be reconfigured by 
qemu anyway, because it’s just a case of the destination qemu initiating 
a fresh connection to a new back-end (except that it will need to 
restore the state from the source).

*Yes, technically, dpdk will ignore that reset, but it still stops the 
device on a different message (when it should just pause processing 
vrings), so the outcome is the same.

>>>>
>>>>> For VIRTIO
>>>>> net vDPA, HW needs to know how many VQs are enabled to set
>>>>> RSS(Receive-Side Scaling).
>>>>>
>>>>> If you don’t want SET_STATUS message, backend can remove protocol
>>>>> feature bit
>>>>> VHOST_USER_PROTOCOL_F_STATUS.
>>>> The problem isn’t back-ends that don’t want the message, the problem
>>>> is that qemu uses the message wrongly, which prevents well-behaving
>>>> back-ends from implementing the message.
>>>>
>>>>> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device
>>>>> close/reset.
>>>> So the right thing to do for back-ends is to announce STATUS support
>>>> and then not implement it correctly?
>>>>
>>>> GET_VRING_BASE should not reset the close or reset the device, by the
>>>> way.  It should stop that one vring, not more.  We have a
>>>> RESET_DEVICE command for resetting.
> I believe dpdk uses GET_VRING_BASE long before qemu has RESET_DEVICE? 

I don’t think it matters who came first.  What matters is the 
specification, and that dpdk decided to rely on implementation-specific 
behavior without having all involved parties agree by matters of putting 
that in the specification.  And now dpdk clearly deviates from the 
specification as a result of that action, which can result in problems 
if the front-end doesn’t do what qemu always used to do.  (E.g. the 
front-end might just send GET_VRING_BASE for all vrings when suspending 
the guest, and then only send kicks on resume to re-start the vrings.  
dpdk would most likely be left in a state where the whole device is 
stopped, expecting DRIVER_OK.  Same thing in general for front-ends that 
don’t support F_STATUS.)

> It's a compatible issue. For new backend implements, we can have 
> better solution, right?

The fact that dpdk and qemu deviate from the specification is a problem 
as-is.

>>>>> I'm not involved in discussion about adding SET_STATUS in Vhost
>>>>> protocol. This feature
>>>>> is essential for vDPA(same as vhost-vdpa implements
>>>>> VHOST_VDPA_SET_STATUS).
>>>> So from what I gather from your response is that there is only a
>>>> single use for SET_STATUS, which is the DRIVER_OK bit.  If so,
>>>> documenting that all other bits are to be ignored by both back-end
>>>> and front-end would be fine by me.
>>>>
>>>> I’m not fully serious about that suggestion, but I hear the strong
>>>> implication that nothing but DRIVER_OK was of any concern, and this
>>>> is really important to note when we talk about the status of the
>>>> STATUS feature in vhost today.  It seems to me now that it was not
>>>> intended to be the virtio-level status byte, but just a DRIVER_OK
>>>> signalling path from front-end to back-end.  That makes it a
>>>> vhost-level protocol feature to me.
>>> On second thought, it just is a pure vhost-level protocol feature, and
>>> has nothing to do with the virtio status byte as-is.  The only stated
>>> purpose is for the front-end to send DRIVER_OK after migration, but
>>> migration is transparent to the guest, so the guest would never change
>>> the status byte during migration.  Therefore, if this feature is
>>> essential, we will never be able to have a status byte that is
>>> transparently shared between guest and back-end device, i.e. the
>>> virtio status byte.
>> On third thought, scratch that.  The guest wouldn’t set it, but
>> naturally, after migration, the front-end will need to restore the
>> status byte from the source, so the front-end will always need to set
>> it, even if it were otherwise used controlled only by the guest and the
>> back-end device.  So technically, this doesn’t prevent such a use case.
>> (In practice, it isn’t controlled by the guest right now, but that could
>> be fixed.)
> I only tested the feature with DPDK(the only backend use it today?). 
> Max defined the protocol and added the corresponding code in DPDK 
> before I added QEMU support. If other backend or different device type 
> want to use this, we can have further discussion?

So as far as I understand, the feature is supposed to rely on 
implementation-specific behavior between specifically qemu as a 
front-end and dpdk as a back-end, nothing else.  Honestly, that to me is 
a very good reason to deprecate it.  That would make it clear that any 
implementation that implements it does so because it relies on 
implementation-specific behavior from other implementations.

Option 2 is to fix it.  It is not right to use this broadly defined 
feature with its clear protocol as given in the virtio specification 
just to set and clear a single bit (DRIVER_OK).  The vhost-user 
specification points to that virtio protocol.  We must adhere to the 
protocol.  And note that we must not reset devices just because the VM 
is paused/resumed.  (That is why I wanted to deprecate SET_STATUS, so 
that Stefan’s series would introduce RESET_DEVICE where we need it, and 
we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().)

Option 3 would be to just be honest in the specification, and limit the 
scope of F_STATUS to say the only bit that matters is DRIVER_OK.  I 
would say this is not really different from deprecating, though it 
wouldn’t affect your case.  However, I understand Alex relies on a full 
status byte.  I’m still interested to know why that is.

Option 4 is of course not to do anything, and leave everything as-is, 
waiting for the next person to stir the hornet’s nest.

>>> Cc-ing Alex on this mail, because to me, this seems like an important
>>> detail when he plans on using the byte in the future. If we need a
>>> virtio status byte, I can’t see how we could use the existing F_STATUS
>>> for it.
>>>
>>> Hanna
>
German Maglione Oct. 10, 2023, 10:04 a.m. UTC | #21
On Tue, Oct 10, 2023 at 4:57 AM Yajun Wu <yajunw@nvidia.com> wrote:
>
>
> On 10/9/2023 6:28 PM, German Maglione wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > On Sat, Oct 7, 2023 at 4:23 AM Yajun Wu <yajunw@nvidia.com> wrote:
> >>
> >> On 10/6/2023 6:34 PM, Michael S. Tsirkin wrote:
> >>> External email: Use caution opening links or attachments
> >>>
> >>>
> >>> On Fri, Oct 06, 2023 at 11:47:55AM +0200, Hanna Czenczek wrote:
> >>>> On 06.10.23 11:26, Michael S. Tsirkin wrote:
> >>>>> On Fri, Oct 06, 2023 at 11:15:55AM +0200, Hanna Czenczek wrote:
> >>>>>> On 06.10.23 10:45, Michael S. Tsirkin wrote:
> >>>>>>> On Fri, Oct 06, 2023 at 09:48:14AM +0200, Hanna Czenczek wrote:
> >>>>>>>> On 05.10.23 19:15, Michael S. Tsirkin wrote:
> >>>>>>>>> On Thu, Oct 05, 2023 at 01:08:52PM -0400, Stefan Hajnoczi wrote:
> >>>>>>>>>> On Wed, Oct 04, 2023 at 02:58:57PM +0200, Hanna Czenczek wrote:
> >>>>>>>>>>> There is no clearly defined purpose for the virtio status byte in
> >>>>>>>>>>> vhost-user: For resetting, we already have RESET_DEVICE; and for virtio
> >>>>>>>>>>> feature negotiation, we have [GS]ET_FEATURES.  With the REPLY_ACK
> >>>>>>>>>>> protocol extension, it is possible for SET_FEATURES to return errors
> >>>>>>>>>>> (SET_PROTOCOL_FEATURES may be called before SET_FEATURES).
> >>>>>>>>>>>
> >>>>>>>>>>> As for implementations, SET_STATUS is not widely implemented.  dpdk does
> >>>>>>>>>>> implement it, but only uses it to signal feature negotiation failure.
> >>>>>>>>>>> While it does log reset requests (SET_STATUS 0) as such, it effectively
> >>>>>>>>>>> ignores them, in contrast to RESET_OWNER (which is deprecated, and today
> >>>>>>>>>>> means the same thing as RESET_DEVICE).
> >>>>>>>>>>>
> >>>>>>>>>>> While qemu superficially has support for [GS]ET_STATUS, it does not
> >>>>>>>>>>> forward the guest-set status byte, but instead just makes it up
> >>>>>>>>>>> internally, and actually completely ignores what the back-end returns,
> >>>>>>>>>>> only using it as the template for a subsequent SET_STATUS to add single
> >>>>>>>>>>> bits to it.  Notably, after setting FEATURES_OK, it never reads it back
> >>>>>>>>>>> to see whether the flag is still set, which is the only way in which
> >>>>>>>>>>> dpdk uses the status byte.
> >>>>>>>>>>>
> >>>>>>>>>>> As-is, no front-end or back-end can rely on the other side handling this
> >>>>>>>>>>> field in a useful manner, and it also provides no practical use over
> >>>>>>>>>>> other mechanisms the vhost-user protocol has, which are more clearly
> >>>>>>>>>>> defined.  Deprecate it.
> >>>>>>>>>>>
> >>>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>>       docs/interop/vhost-user.rst | 28 +++++++++++++++++++++-------
> >>>>>>>>>>>       1 file changed, 21 insertions(+), 7 deletions(-)
> >>>>>>>>>> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>>>> SET_STATUS is the only way to signal failure to acknowledge FEATURES_OK.
> >>>>>>>>> The fact current backends never check errors does not mean they never
> >>>>>>>>> will. So no, not applying this.
> >>>>>>>> Can this not be done with REPLY_ACK?  I.e., with the following message
> >>>>>>>> order:
> >>>>>>>>
> >>>>>>>> 1. GET_FEATURES to find out whether VHOST_USER_F_PROTOCOL_FEATURES is
> >>>>>>>> present
> >>>>>>>> 2. GET_PROTOCOL_FEATURES to hopefully get VHOST_USER_PROTOCOL_F_REPLY_ACK
> >>>>>>>> 3. SET_PROTOCOL_FEATURES to set VHOST_USER_PROTOCOL_F_REPLY_ACK
> >>>>>>>> 4. SET_FEATURES with need_reply
> >>>>>>>>
> >>>>>>>> If not, the problem is that qemu has sent SET_STATUS 0 for a while when the
> >>>>>>>> vCPUs are stopped, which generally seems to request a device reset.  If we
> >>>>>>>> don’t state at least that SET_STATUS 0 is to be ignored, back-ends that will
> >>>>>>>> implement SET_STATUS later may break with at least these qemu versions.  But
> >>>>>>>> documenting that a particular use of the status byte is to be ignored would
> >>>>>>>> be really strange.
> >>>>>>>>
> >>>>>>>> Hanna
> >>>>>>> Hmm I guess. Though just following virtio spec seems cleaner to me...
> >>>>>>> vhost-user reconfigures the state fully on start.
> >>>>>> Not the internal device state, though.  virtiofsd has internal state, and
> >>>>>> other devices like vhost-gpu back-ends would probably, too.
> >>>>>>
> >>>>>> Stefan has recently sent a series
> >>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg00709.html) to
> >>>>>> put the reset (RESET_DEVICE) into virtio_reset() (when we really need a
> >>>>>> reset).
> >>>>>>
> >>>>>> I really don’t like our current approach with the status byte. Following the
> >>>>>> virtio specification to me would mean that the guest directly controls this
> >>>>>> byte, which it does not.  qemu makes up values as it deems appropriate, and
> >>>>>> this includes sending a SET_STATUS 0 when the guest is just paused, i.e.
> >>>>>> when the guest really doesn’t want a device reset.
> >>>>>>
> >>>>>> That means that qemu does not treat this as a virtio device field (because
> >>>>>> that would mean exposing it to the guest driver), but instead treats it as
> >>>>>> part of the vhost(-user) protocol.  It doesn’t feel right to me that we use
> >>>>>> a virtio-defined feature for communication on the vhost level, i.e. between
> >>>>>> front-end and back-end, and not between guest driver and device.  I think
> >>>>>> all vhost-level protocol features should be fully defined in the vhost-user
> >>>>>> specification, which REPLY_ACK is.
> >>>>> Hmm that makes sense. Maybe we should have done what stefan's patch
> >>>>> is doing.
> >>>>>
> >>>>> Do look at the original commit that introduced it to understand why
> >>>>> it was added.
> >>>> I don’t understand why this was added to the stop/cont code, though.  If it
> >>>> is time consuming to make these changes, why are they done every time the VM
> >>>> is paused
> >>>> and resumed?  It makes sense that this would be done for the initial
> >>>> configuration (where a reset also wouldn’t hurt), but here it seems wrong.
> >>>>
> >>>> (To be clear, a reset in the stop/cont code is wrong, because it breaks
> >>>> stateful devices.)
> >>>>
> >>>> Also, note the newer commits 6f8be29ec17 and c3716f260bf.  The reset as
> >>>> originally introduced was wrong even for non-stateful devices, because it
> >>>> occurred before we fetched the state (vring indices) so we could restore it
> >>>> later.  I don’t know how 923b8921d21 was tested, but if the back-end used
> >>>> for testing implemented SET_STATUS 0 as a reset, it could not have survived
> >>>> either migration or a stop/cont in general, because the vring indices would
> >>>> have been reset to 0.
> >>>>
> >>>> What I’m saying is, 923b8921d21 introduced SET_STATUS calls that broke all
> >>>> devices that would implement them as per virtio spec, and even today it’s
> >>>> broken for stateful devices.  The mentioned performance issue is likely
> >>>> real, but we can’t address it by making up SET_STATUS calls that are wrong.
> >>>>
> >>>> I concede that I didn’t think about DRIVER_OK.  Personally, I would do all
> >>>> final configuration that would happen upon a DRIVER_OK once the first vring
> >>>> is started (i.e. receives a kick).  That has the added benefit of being
> >>>> asynchronous because it doesn’t block any vhost-user messages (which are
> >>>> synchronous, and thus block downtime).
> >>>>
> >>>> Hanna
> >>> For better or worse kick is per ring. It's out of spec to start rings
> >>> that were not kicked but I guess you could do configuration ...
> >>> Seems somewhat asymmetrical though.
> >>>
> >>> Let's wait until next week, hopefully Yajun Wu will answer.
> >> The main motivation of adding VHOST_USER_SET_STATUS is to let backend
> >> DPDK know
> >> when DRIVER_OK bit is valid. It's an indication of all VQ configuration
> >> has sent,
> >> otherwise DPDK has to rely on first queue pair is ready, then
> >> receiving/applying
> >> VQ configuration one by one.
> >>
> >> During live migration, configuring VQ one by one is very time consuming.
> >> For VIRTIO
> >> net vDPA, HW needs to know how many VQs are enabled to set
> >> RSS(Receive-Side Scaling).
> >>
> >> If you don’t want SET_STATUS message, backend can remove protocol
> >> feature bit
> >> VHOST_USER_PROTOCOL_F_STATUS.
> >> DPDK is ignoring SET_STATUS 0, but using GET_VRING_BASE to do device
> >> close/reset.
> > This is incorrect, resetting the device on GET_VRING_BASE breaks
> > the stop/cont. Since you don't want to reset the VQs on stop/cont.
> Sorry for the misunderstanding, dpdk vhost backend framework doesn't
> have RESET concept(only device level .dev_conf and .dev_close). On
> receiving DRIVER_OK does dev_conf, on receiving GET_VRING_BASE does
> dev_close. For every VM suspend/resume, dpdk issues dev_close then dev_conf.

(sorry I did not explain myself well)
I meant that resetting the VQs upon receiveng GET_VRING_BASE makes the
backend to fail if qemu continues after a "stop". I notice that in dpdk,
when it receives a GET_VRING_BASE[0], it calls 'vring_invalidate(dev, vq);'[1],
resetting the VQ[2], doing that is incorrect.

[0] https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2135
[1] https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost_user.c#L2201
[2] https://github.com/DPDK/dpdk/blob/main/lib/vhost/vhost.c#L580

> >
> >> I'm not involved in discussion about adding SET_STATUS in Vhost
> >> protocol. This feature
> >> is essential for vDPA(same as vhost-vdpa implements VHOST_VDPA_SET_STATUS).
> >>
> >> Thanks,
> >> Yajun
> >>>>>> Now, we could hand full control of the status byte to the guest, and that
> >>>>>> would make me content.  But I feel like that doesn’t really work, because
> >>>>>> qemu needs to intercept the status byte anyway (it needs to know when there
> >>>>>> is a reset, probably wants to know when the device is configured, etc.), so
> >>>>>> I don’t think having the status byte in vhost-user really gains us much when
> >>>>>> qemu could translate status byte changes to/from other vhost-user commands.
> >>>>>>
> >>>>>> Hanna
> >>>>> well it intercepts it but I think it could pass it on unchanged.
> >>>>>
> >>>>>
> >>>>>>> I guess symmetry was the
> >>>>>>> point. So I don't see why SET_STATUS 0 has to be ignored.
> >>>>>>>
> >>>>>>>
> >>>>>>> SET_STATUS was introduced by:
> >>>>>>>
> >>>>>>> commit 923b8921d210763359e96246a58658ac0db6c645
> >>>>>>> Author: Yajun Wu <yajunw@nvidia.com>
> >>>>>>> Date:   Mon Oct 17 14:44:52 2022 +0800
> >>>>>>>
> >>>>>>>         vhost-user: Support vhost_dev_start
> >>>>>>>
> >>>>>>> CC the author.
> >>>>>>>
> >> _______________________________________________
> >> Virtio-fs mailing list
> >> Virtio-fs@redhat.com
> >> https://listman.redhat.com/mailman/listinfo/virtio-fs
> >
> >
> > --
> > German
> >
>
Alex Bennée Oct. 10, 2023, 10:36 a.m. UTC | #22
Hanna Czenczek <hreitz@redhat.com> writes:

> On 10.10.23 06:00, Yajun Wu wrote:
>>
>> On 10/9/2023 5:13 PM, Hanna Czenczek wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On 09.10.23 11:07, Hanna Czenczek wrote:
>>>> On 09.10.23 10:21, Hanna Czenczek wrote:
>>>>> On 07.10.23 04:22, Yajun Wu wrote:
>>>> [...]
>>>>
<snip>
> So as far as I understand, the feature is supposed to rely on
> implementation-specific behavior between specifically qemu as a
> front-end and dpdk as a back-end, nothing else.  Honestly, that to me
> is a very good reason to deprecate it.  That would make it clear that
> any implementation that implements it does so because it relies on
> implementation-specific behavior from other implementations.
>
> Option 2 is to fix it.  It is not right to use this broadly defined
> feature with its clear protocol as given in the virtio specification
> just to set and clear a single bit (DRIVER_OK).  The vhost-user
> specification points to that virtio protocol.  We must adhere to the
> protocol.  And note that we must not reset devices just because the VM
> is paused/resumed.  (That is why I wanted to deprecate SET_STATUS, so
> that Stefan’s series would introduce RESET_DEVICE where we need it,
> and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().)
>
> Option 3 would be to just be honest in the specification, and limit
> the scope of F_STATUS to say the only bit that matters is DRIVER_OK. 
> I would say this is not really different from deprecating, though it
> wouldn’t affect your case.  However, I understand Alex relies on a
> full status byte.  I’m still interested to know why that is.

For an F_TRANSPORT backend (or whatever the final name ends up being) we
need the backend to have full control of the status byte because all the
handling of VirtIO is deferred to it. Therefor it has to handle all the
feature negotiation and indicate when the device needs resetting.

(side note: feature negotiation is another slippery area when QEMU gets
involved in gating which feature bits may or may not be exposed to the
backend. The only one it should ever mask is F_UNUSED which is used
(sic) to trigger the vhost protocol negotiation)

> Option 4 is of course not to do anything, and leave everything as-is,
> waiting for the next person to stir the hornet’s nest.
>
>>>> Cc-ing Alex on this mail, because to me, this seems like an important
>>>> detail when he plans on using the byte in the future. If we need a
>>>> virtio status byte, I can’t see how we could use the existing F_STATUS
>>>> for it.

What would we use instead of F_STATUS to query the Device Status field?

>>>>
>>>> Hanna
>>
Hanna Czenczek Oct. 10, 2023, 1:18 p.m. UTC | #23
On 10.10.23 12:36, Alex Bennée wrote:
> Hanna Czenczek <hreitz@redhat.com> writes:
>
>> On 10.10.23 06:00, Yajun Wu wrote:
>>> On 10/9/2023 5:13 PM, Hanna Czenczek wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> On 09.10.23 11:07, Hanna Czenczek wrote:
>>>>> On 09.10.23 10:21, Hanna Czenczek wrote:
>>>>>> On 07.10.23 04:22, Yajun Wu wrote:
>>>>> [...]
>>>>>
> <snip>
>> So as far as I understand, the feature is supposed to rely on
>> implementation-specific behavior between specifically qemu as a
>> front-end and dpdk as a back-end, nothing else.  Honestly, that to me
>> is a very good reason to deprecate it.  That would make it clear that
>> any implementation that implements it does so because it relies on
>> implementation-specific behavior from other implementations.
>>
>> Option 2 is to fix it.  It is not right to use this broadly defined
>> feature with its clear protocol as given in the virtio specification
>> just to set and clear a single bit (DRIVER_OK).  The vhost-user
>> specification points to that virtio protocol.  We must adhere to the
>> protocol.  And note that we must not reset devices just because the VM
>> is paused/resumed.  (That is why I wanted to deprecate SET_STATUS, so
>> that Stefan’s series would introduce RESET_DEVICE where we need it,
>> and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().)
>>
>> Option 3 would be to just be honest in the specification, and limit
>> the scope of F_STATUS to say the only bit that matters is DRIVER_OK.
>> I would say this is not really different from deprecating, though it
>> wouldn’t affect your case.  However, I understand Alex relies on a
>> full status byte.  I’m still interested to know why that is.
> For an F_TRANSPORT backend (or whatever the final name ends up being) we
> need the backend to have full control of the status byte because all the
> handling of VirtIO is deferred to it. Therefor it has to handle all the
> feature negotiation and indicate when the device needs resetting.
>
> (side note: feature negotiation is another slippery area when QEMU gets
> involved in gating which feature bits may or may not be exposed to the
> backend. The only one it should ever mask is F_UNUSED which is used
> (sic) to trigger the vhost protocol negotiation)

That’s the thing, feature negotiation is done with GET_FEATURES and 
SET_FEATURES.  Configuring F_REPLY_ACK lets SET_FEATURES return errors.

Indicating that the device needs reset is a good point, there is no 
other feature to do that.  (And something qemu currently ignores, just 
like any value the device returns through GET_STATUS, but that’s besides 
the point.)

>> Option 4 is of course not to do anything, and leave everything as-is,
>> waiting for the next person to stir the hornet’s nest.
>>
>>>>> Cc-ing Alex on this mail, because to me, this seems like an important
>>>>> detail when he plans on using the byte in the future. If we need a
>>>>> virtio status byte, I can’t see how we could use the existing F_STATUS
>>>>> for it.
> What would we use instead of F_STATUS to query the Device Status field?

We would emulate it in the front-end, just like we need to do for 
back-ends without F_STATUS.  We can’t emulate the DEVICE_NEEDS_RESET 
bit, though, that’s correct.

Given that qemu currently ignores DEVICE_NEEDS_RESET, I’m not 100 % 
convinced that your use case has a hard dependency on F_STATUS. However, 
this still does make a fair point in general that it would be useful to 
keep it.

That still leaves us with the situation that currently, the only 
implementations with F_STATUS support are qemu and dpdk, which both 
handle it incorrectly.  Furthermore, the specification leaves much to be 
desired, specifically in how F_STATUS interacts with other vhost-user 
commands (which is something I cited as a reason for my original patch), 
i.e. whether RESET_DEVICE and SET_STATUS 0 are equivalent, and whether 
failures in feature negotiation must result in both SET_FEATURES 
returning an error (with F_REPLY_ACK), and FEATURES_OK being reset in 
the status byte, or whether either is sufficient.  What happens when 
DEVICE_NEEDS_RESET is set, i.e. do we just need RESET_DEVICE / 
SET_STATUS 0, or do we also need to reset some protocol state?  (This is 
also connected to the fact that what happens on RESET_DEVICE is largely 
undefined, which I said on Stefan’s series.)

In general, because we have our own transport, we should make a note how 
it interacts with the status negotiation phases, i.e. that GET_FEATURES 
must not be called before S_ACKNOWLEDGE | S_DRIVER are set, that 
FEATURES_OK must be set after the SET_FEATURES call, and that DRIVER_OK 
must not be set without FEATURES_OK set / SET_FEATURES having returned 
success.  Here we would also answer the question about the interaction 
of F_REPLY_ACK+SET_FEATURES with F_STATUS, specifically whether an 
implementation with F_REPLY_ACK even needs to read back the status byte 
after setting FEATURES_OK because it could have got the feature 
negotiation result already as a result of the SET_FEATURES call.

After migration, can you just set all flags immediately or do we need to 
follow this step-by-step protocol?  I think we do need to do it 
step-by-step, mostly for simplicity in the back-end, i.e. that it just 
sees a normal device start-up.

We should also clarify whether SET_STATUS can fail, i.e. whether setting 
an invalid status (is setting FEATURES_OK when the device doesn’t think 
so invalid?) has SET_STATUS fail (with F_REPLY_ACK) and/or immediately 
gets the device into DEVICE_NEEDS_RESET.

We should clarify whether SET_STATUS can block.  The current use of 
DRIVER_OK seems to indicate to me that dpdk does do time-consuming 
operations when it sees DRIVER_OK (code looks like it, too) and only 
returns when that’s done, but naïvely, I would expect SET_STATUS to be 
just setting some value and doing whatever needs to be done in the 
background, not actually launching and blocking on an operation.

I think it is dangerous to just push ahead with using F_STATUS without 
acknowledging that its implementation is broken right now, and that it 
is so *on purpose* because the DRIVER_OK bit is the only thing that it 
was supposed to be used for.  Using it for its purported original use 
(actually the virtio status byte) is contradictory to that.  It’s 
probably fixable, but I think it requires taking a step back and seeing 
what needs to be done to remedy the conflict.  If you rip out all the 
existing STATUS code and replace it such that qemu will let the guest 
have full control over the status byte (except for migration, where we 
restore it on the destination, which will result in DRIVER_OK being set 
at the end, fulfilling that requirement), that will fix the 
implementation in qemu.  I think.  But the specification should be 
amended to think about all these corner cases, not least because I think 
they will also affect your implementation.

(The answers to many of the questions I raise for documentation may be 
obvious to you, based on “in virtio, it’s just an MMIO byte that’s 
written and read, so the rest follows from there”.  But evidently the 
implementation we have kind of ignores that e.g. SET_STATUS 0 is a reset 
(6f8be29ec17d44496b9ed67599bceaaba72d1864 is a work-around, not much 
more) or that there is actually a protocol to setting the status flags 
and you can’t just set them all at once, so I don’t think the answers 
are immediately obvious, and should be documented.)

As for me and the original patch: I claimed nobody really needs 
F_STATUS, you say you do, so plainly, I assumed wrong and will naturally 
take my hands off of F_STATUS and just ensure not to implement it in any 
back-end until you’ve fixed it, as Yajun has advised.  I’d still prefer 
mentioning this advice in the documentation until it’s fixed, but, you 
know, I wouldn’t be the first one to say “I now know about the quirk, so 
I can work around it, no need to tell anyone else as long as my stuff 
works”.

Hanna
Alex Bennée Oct. 10, 2023, 2:35 p.m. UTC | #24
Hanna Czenczek <hreitz@redhat.com> writes:

(adding Viresh to CC for Xen Vhost questions)

> On 10.10.23 12:36, Alex Bennée wrote:
>> Hanna Czenczek <hreitz@redhat.com> writes:
>>
>>> On 10.10.23 06:00, Yajun Wu wrote:
>>>> On 10/9/2023 5:13 PM, Hanna Czenczek wrote:
>>>>> External email: Use caution opening links or attachments
>>>>>
>>>>>
>>>>> On 09.10.23 11:07, Hanna Czenczek wrote:
>>>>>> On 09.10.23 10:21, Hanna Czenczek wrote:
>>>>>>> On 07.10.23 04:22, Yajun Wu wrote:
>>>>>> [...]
>>>>>>
>> <snip>
>>> So as far as I understand, the feature is supposed to rely on
>>> implementation-specific behavior between specifically qemu as a
>>> front-end and dpdk as a back-end, nothing else.  Honestly, that to me
>>> is a very good reason to deprecate it.  That would make it clear that
>>> any implementation that implements it does so because it relies on
>>> implementation-specific behavior from other implementations.
>>>
>>> Option 2 is to fix it.  It is not right to use this broadly defined
>>> feature with its clear protocol as given in the virtio specification
>>> just to set and clear a single bit (DRIVER_OK).  The vhost-user
>>> specification points to that virtio protocol.  We must adhere to the
>>> protocol.  And note that we must not reset devices just because the VM
>>> is paused/resumed.  (That is why I wanted to deprecate SET_STATUS, so
>>> that Stefan’s series would introduce RESET_DEVICE where we need it,
>>> and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().)
>>>
>>> Option 3 would be to just be honest in the specification, and limit
>>> the scope of F_STATUS to say the only bit that matters is DRIVER_OK.
>>> I would say this is not really different from deprecating, though it
>>> wouldn’t affect your case.  However, I understand Alex relies on a
>>> full status byte.  I’m still interested to know why that is.
>> For an F_TRANSPORT backend (or whatever the final name ends up being) we
>> need the backend to have full control of the status byte because all the
>> handling of VirtIO is deferred to it. Therefor it has to handle all the
>> feature negotiation and indicate when the device needs resetting.
>>
>> (side note: feature negotiation is another slippery area when QEMU gets
>> involved in gating which feature bits may or may not be exposed to the
>> backend. The only one it should ever mask is F_UNUSED which is used
>> (sic) to trigger the vhost protocol negotiation)
>
> That’s the thing, feature negotiation is done with GET_FEATURES and
> SET_FEATURES.  Configuring F_REPLY_ACK lets SET_FEATURES return
> errors.

OK but then what - QEMU fakes up FEATURES_OK in the Device Status field
on the behalf of the backend?

I should point out QEMU doesn't exist in some of these use case. When
using the rust-vmm backends with Xen for example there is no VMM to talk
to so we have a Xen Vhost Frontend which is entirely concerned with
setup and then once connected up leaves the backend to do its thing. I'd
rather leave the frontend as dumb as possible rather than splitting
logic between the two.

> Indicating that the device needs reset is a good point, there is no
> other feature to do that.  (And something qemu currently ignores, just
> like any value the device returns through GET_STATUS, but that’s
> besides the point.)
>
>>> Option 4 is of course not to do anything, and leave everything as-is,
>>> waiting for the next person to stir the hornet’s nest.
>>>
>>>>>> Cc-ing Alex on this mail, because to me, this seems like an important
>>>>>> detail when he plans on using the byte in the future. If we need a
>>>>>> virtio status byte, I can’t see how we could use the existing F_STATUS
>>>>>> for it.
>> What would we use instead of F_STATUS to query the Device Status field?
>
> We would emulate it in the front-end, just like we need to do for
> back-ends without F_STATUS.  We can’t emulate the DEVICE_NEEDS_RESET
> bit, though, that’s correct.
>
> Given that qemu currently ignores DEVICE_NEEDS_RESET, I’m not 100 %
> convinced that your use case has a hard dependency on F_STATUS.
> However, this still does make a fair point in general that it would be
> useful to keep it.

OK/

> That still leaves us with the situation that currently, the only
> implementations with F_STATUS support are qemu and dpdk, which both
> handle it incorrectly. 

I was going to say there is also the rust-vmm vhost-user-master crates
which we've imported:

  https://github.com/vireshk/vhost

for the Xen Vhost Frontend:

  https://github.com/vireshk/xen-vhost-frontend

but I can't actually see any handling for GET/SET_STATUS at all which
makes me wonder how we actually work. Viresh?

> Furthermore, the specification leaves much to
> be desired, specifically in how F_STATUS interacts with other
> vhost-user commands (which is something I cited as a reason for my
> original patch), i.e. whether RESET_DEVICE and SET_STATUS 0 are
> equivalent, and whether failures in feature negotiation must result in
> both SET_FEATURES returning an error (with F_REPLY_ACK), and
> FEATURES_OK being reset in the status byte, or whether either is
> sufficient.  What happens when DEVICE_NEEDS_RESET is set, i.e. do we
> just need RESET_DEVICE / SET_STATUS 0, or do we also need to reset
> some protocol state?  (This is also connected to the fact that what
> happens on RESET_DEVICE is largely undefined, which I said on Stefan’s
> series.)

I'm all for strengthening the vhost-user protocol definitions. I'm just
wary of encoding QEMU<->backend implementation details.

> In general, because we have our own transport, we should make a note
> how it interacts with the status negotiation phases, i.e. that
> GET_FEATURES must not be called before S_ACKNOWLEDGE | S_DRIVER are
> set, that FEATURES_OK must be set after the SET_FEATURES call, and
> that DRIVER_OK must not be set without FEATURES_OK set / SET_FEATURES
> having returned success.  Here we would also answer the question about
> the interaction of F_REPLY_ACK+SET_FEATURES with F_STATUS,
> specifically whether an implementation with F_REPLY_ACK even needs to
> read back the status byte after setting FEATURES_OK because it could
> have got the feature negotiation result already as a result of the
> SET_FEATURES call.

Some sequence diagrams would remove a lot of the ambiguity from parsing
the words. I wonder if there is a pretty way to do that to render nicely
in our published docs?

> After migration, can you just set all flags immediately or do we need
> to follow this step-by-step protocol?  I think we do need to do it
> step-by-step, mostly for simplicity in the back-end, i.e. that it just
> sees a normal device start-up.

Makes sense.

> We should also clarify whether SET_STATUS can fail, i.e. whether
> setting an invalid status (is setting FEATURES_OK when the device
> doesn’t think so invalid?) has SET_STATUS fail (with F_REPLY_ACK)
> and/or immediately gets the device into DEVICE_NEEDS_RESET.
>
> We should clarify whether SET_STATUS can block.  The current use of
> DRIVER_OK seems to indicate to me that dpdk does do time-consuming
> operations when it sees DRIVER_OK (code looks like it, too) and only
> returns when that’s done, but naïvely, I would expect SET_STATUS to be
> just setting some value and doing whatever needs to be done in the
> background, not actually launching and blocking on an operation.

Shouldn't the guest driver be reading the status bit until it flips? So
potentially there could be multiple GET_STATUS calls.

> I think it is dangerous to just push ahead with using F_STATUS without
> acknowledging that its implementation is broken right now, and that it
> is so *on purpose* because the DRIVER_OK bit is the only thing that it
> was supposed to be used for.  Using it for its purported original use
> (actually the virtio status byte) is contradictory to that.  It’s
> probably fixable, but I think it requires taking a step back and
> seeing what needs to be done to remedy the conflict.  If you rip out
> all the existing STATUS code and replace it such that qemu will let
> the guest have full control over the status byte (except for
> migration, where we restore it on the destination, which will result
> in DRIVER_OK being set at the end, fulfilling that requirement), that
> will fix the implementation in qemu.  I think.  But the specification
> should be amended to think about all these corner cases, not least
> because I think they will also affect your implementation.
>
> (The answers to many of the questions I raise for documentation may be
> obvious to you, based on “in virtio, it’s just an MMIO byte that’s
> written and read, so the rest follows from there”.  But evidently the
> implementation we have kind of ignores that e.g. SET_STATUS 0 is a
> reset (6f8be29ec17d44496b9ed67599bceaaba72d1864 is a work-around, not
> much more) or that there is actually a protocol to setting the status
> flags and you can’t just set them all at once, so I don’t think the
> answers are immediately obvious, and should be documented.)
>
> As for me and the original patch: I claimed nobody really needs
> F_STATUS, you say you do, so plainly, I assumed wrong and will
> naturally take my hands off of F_STATUS and just ensure not to
> implement it in any back-end until you’ve fixed it, as Yajun has
> advised.  I’d still prefer mentioning this advice in the documentation
> until it’s fixed, but, you know, I wouldn’t be the first one to say “I
> now know about the quirk, so I can work around it, no need to tell
> anyone else as long as my stuff works”.
>
> Hanna
Hanna Czenczek Oct. 13, 2023, 6:02 p.m. UTC | #25
On 10.10.23 16:35, Alex Bennée wrote:
> Hanna Czenczek <hreitz@redhat.com> writes:
>
> (adding Viresh to CC for Xen Vhost questions)
>
>> On 10.10.23 12:36, Alex Bennée wrote:
>>> Hanna Czenczek <hreitz@redhat.com> writes:
>>>
>>>> On 10.10.23 06:00, Yajun Wu wrote:
>>>>> On 10/9/2023 5:13 PM, Hanna Czenczek wrote:
>>>>>> External email: Use caution opening links or attachments
>>>>>>
>>>>>>
>>>>>> On 09.10.23 11:07, Hanna Czenczek wrote:
>>>>>>> On 09.10.23 10:21, Hanna Czenczek wrote:
>>>>>>>> On 07.10.23 04:22, Yajun Wu wrote:
>>>>>>> [...]
>>>>>>>
>>> <snip>
>>>> So as far as I understand, the feature is supposed to rely on
>>>> implementation-specific behavior between specifically qemu as a
>>>> front-end and dpdk as a back-end, nothing else.  Honestly, that to me
>>>> is a very good reason to deprecate it.  That would make it clear that
>>>> any implementation that implements it does so because it relies on
>>>> implementation-specific behavior from other implementations.
>>>>
>>>> Option 2 is to fix it.  It is not right to use this broadly defined
>>>> feature with its clear protocol as given in the virtio specification
>>>> just to set and clear a single bit (DRIVER_OK).  The vhost-user
>>>> specification points to that virtio protocol.  We must adhere to the
>>>> protocol.  And note that we must not reset devices just because the VM
>>>> is paused/resumed.  (That is why I wanted to deprecate SET_STATUS, so
>>>> that Stefan’s series would introduce RESET_DEVICE where we need it,
>>>> and we can (for now) ignore the SET_STATUS 0 in vhost_dev_stop().)
>>>>
>>>> Option 3 would be to just be honest in the specification, and limit
>>>> the scope of F_STATUS to say the only bit that matters is DRIVER_OK.
>>>> I would say this is not really different from deprecating, though it
>>>> wouldn’t affect your case.  However, I understand Alex relies on a
>>>> full status byte.  I’m still interested to know why that is.
>>> For an F_TRANSPORT backend (or whatever the final name ends up being) we
>>> need the backend to have full control of the status byte because all the
>>> handling of VirtIO is deferred to it. Therefor it has to handle all the
>>> feature negotiation and indicate when the device needs resetting.
>>>
>>> (side note: feature negotiation is another slippery area when QEMU gets
>>> involved in gating which feature bits may or may not be exposed to the
>>> backend. The only one it should ever mask is F_UNUSED which is used
>>> (sic) to trigger the vhost protocol negotiation)
>> That’s the thing, feature negotiation is done with GET_FEATURES and
>> SET_FEATURES.  Configuring F_REPLY_ACK lets SET_FEATURES return
>> errors.
> OK but then what - QEMU fakes up FEATURES_OK in the Device Status field
> on the behalf of the backend?

It does that right now.  When using qemu, vhost-user status byte is not 
exposed to the guest at all.  qemu makes it up completely, and 
effectively ignores the response from GET_STATUS completely.

(The only use of GET_STATUS is (right now): There is a function to set a 
flag in the status byte, and it calls GET_STATUS, ORs the flag in, and 
calls SET_STATUS with the result.)

> I should point out QEMU doesn't exist in some of these use case. When
> using the rust-vmm backends with Xen for example there is no VMM to talk
> to so we have a Xen Vhost Frontend which is entirely concerned with
> setup and then once connected up leaves the backend to do its thing. I'd
> rather leave the frontend as dumb as possible rather than splitting
> logic between the two.
>
>> Indicating that the device needs reset is a good point, there is no
>> other feature to do that.  (And something qemu currently ignores, just
>> like any value the device returns through GET_STATUS, but that’s
>> besides the point.)
>>
>>>> Option 4 is of course not to do anything, and leave everything as-is,
>>>> waiting for the next person to stir the hornet’s nest.
>>>>
>>>>>>> Cc-ing Alex on this mail, because to me, this seems like an important
>>>>>>> detail when he plans on using the byte in the future. If we need a
>>>>>>> virtio status byte, I can’t see how we could use the existing F_STATUS
>>>>>>> for it.
>>> What would we use instead of F_STATUS to query the Device Status field?
>> We would emulate it in the front-end, just like we need to do for
>> back-ends without F_STATUS.  We can’t emulate the DEVICE_NEEDS_RESET
>> bit, though, that’s correct.
>>
>> Given that qemu currently ignores DEVICE_NEEDS_RESET, I’m not 100 %
>> convinced that your use case has a hard dependency on F_STATUS.
>> However, this still does make a fair point in general that it would be
>> useful to keep it.
> OK/
>
>> That still leaves us with the situation that currently, the only
>> implementations with F_STATUS support are qemu and dpdk, which both
>> handle it incorrectly.
> I was going to say there is also the rust-vmm vhost-user-master crates
> which we've imported:
>
>    https://github.com/vireshk/vhost
>
> for the Xen Vhost Frontend:
>
>    https://github.com/vireshk/xen-vhost-frontend
>
> but I can't actually see any handling for GET/SET_STATUS at all which
> makes me wonder how we actually work. Viresh?

As far as I know the only back-end implementation of F_STATUS is in 
DPDK.  As I said, if anyone else implemented it right now, that would be 
dangerous, because qemu doesn’t adhere to the virtio protocol when it 
comes to the status byte.

>> Furthermore, the specification leaves much to
>> be desired, specifically in how F_STATUS interacts with other
>> vhost-user commands (which is something I cited as a reason for my
>> original patch), i.e. whether RESET_DEVICE and SET_STATUS 0 are
>> equivalent, and whether failures in feature negotiation must result in
>> both SET_FEATURES returning an error (with F_REPLY_ACK), and
>> FEATURES_OK being reset in the status byte, or whether either is
>> sufficient.  What happens when DEVICE_NEEDS_RESET is set, i.e. do we
>> just need RESET_DEVICE / SET_STATUS 0, or do we also need to reset
>> some protocol state?  (This is also connected to the fact that what
>> happens on RESET_DEVICE is largely undefined, which I said on Stefan’s
>> series.)
> I'm all for strengthening the vhost-user protocol definitions. I'm just
> wary of encoding QEMU<->backend implementation details.
>
>> In general, because we have our own transport, we should make a note
>> how it interacts with the status negotiation phases, i.e. that
>> GET_FEATURES must not be called before S_ACKNOWLEDGE | S_DRIVER are
>> set, that FEATURES_OK must be set after the SET_FEATURES call, and
>> that DRIVER_OK must not be set without FEATURES_OK set / SET_FEATURES
>> having returned success.  Here we would also answer the question about
>> the interaction of F_REPLY_ACK+SET_FEATURES with F_STATUS,
>> specifically whether an implementation with F_REPLY_ACK even needs to
>> read back the status byte after setting FEATURES_OK because it could
>> have got the feature negotiation result already as a result of the
>> SET_FEATURES call.
> Some sequence diagrams would remove a lot of the ambiguity from parsing
> the words. I wonder if there is a pretty way to do that to render nicely
> in our published docs?

I’m sure some form of SVG will work.  Somehow.  If not, it should. :)

>> After migration, can you just set all flags immediately or do we need
>> to follow this step-by-step protocol?  I think we do need to do it
>> step-by-step, mostly for simplicity in the back-end, i.e. that it just
>> sees a normal device start-up.
> Makes sense.
>
>> We should also clarify whether SET_STATUS can fail, i.e. whether
>> setting an invalid status (is setting FEATURES_OK when the device
>> doesn’t think so invalid?) has SET_STATUS fail (with F_REPLY_ACK)
>> and/or immediately gets the device into DEVICE_NEEDS_RESET.
>>
>> We should clarify whether SET_STATUS can block.  The current use of
>> DRIVER_OK seems to indicate to me that dpdk does do time-consuming
>> operations when it sees DRIVER_OK (code looks like it, too) and only
>> returns when that’s done, but naïvely, I would expect SET_STATUS to be
>> just setting some value and doing whatever needs to be done in the
>> background, not actually launching and blocking on an operation.
> Shouldn't the guest driver be reading the status bit until it flips? So
> potentially there could be multiple GET_STATUS calls.

Ah, the device will only show DRIVER_OK set once the device is ready to 
serve the driver?

Hanna
Viresh Kumar Oct. 17, 2023, 7:49 a.m. UTC | #26
On 13-10-23, 20:02, Hanna Czenczek wrote:
> On 10.10.23 16:35, Alex Bennée wrote:
> > I was going to say there is also the rust-vmm vhost-user-master crates
> > which we've imported:
> > 
> >    https://github.com/vireshk/vhost
> > 
> > for the Xen Vhost Frontend:
> > 
> >    https://github.com/vireshk/xen-vhost-frontend
> > 
> > but I can't actually see any handling for GET/SET_STATUS at all which
> > makes me wonder how we actually work. Viresh?
> 
> As far as I know the only back-end implementation of F_STATUS is in DPDK. 
> As I said, if anyone else implemented it right now, that would be dangerous,
> because qemu doesn’t adhere to the virtio protocol when it comes to the
> status byte.

Yeah, none of the Rust based Virtio backends enable `STATUS` in
`VhostUserProtocolFeatures` and so these messages are never exchanged.

The generic Rust code for the backends, doesn't even implement them.
Not sure if they should or not.
Hanna Czenczek Oct. 17, 2023, 8:13 a.m. UTC | #27
On 17.10.23 09:49, Viresh Kumar wrote:
> On 13-10-23, 20:02, Hanna Czenczek wrote:
>> On 10.10.23 16:35, Alex Bennée wrote:
>>> I was going to say there is also the rust-vmm vhost-user-master crates
>>> which we've imported:
>>>
>>>     https://github.com/vireshk/vhost
>>>
>>> for the Xen Vhost Frontend:
>>>
>>>     https://github.com/vireshk/xen-vhost-frontend
>>>
>>> but I can't actually see any handling for GET/SET_STATUS at all which
>>> makes me wonder how we actually work. Viresh?
>> As far as I know the only back-end implementation of F_STATUS is in DPDK.
>> As I said, if anyone else implemented it right now, that would be dangerous,
>> because qemu doesn’t adhere to the virtio protocol when it comes to the
>> status byte.
> Yeah, none of the Rust based Virtio backends enable `STATUS` in
> `VhostUserProtocolFeatures` and so these messages are never exchanged.
>
> The generic Rust code for the backends, doesn't even implement them.
> Not sure if they should or not.

It absolutely should not, for evidence see this whole thread.  qemu 
sends a SET_STATUS 0, which amounts to a reset, when the VM is merely 
paused[1], and when it sets status bytes, it does not set them according 
to virtio specification.  Implementing it right now means relying on and 
working around qemu’s implementation-defined spec-breaking behavior.  
Also, note that qemu ignores feature negotiation response through 
FEATURES_OK, and DEVICE_NEEDS_RESET, so unless it’s worth working around 
the problems just to get some form of DRIVER_OK information (note this 
information does not come from the driver, but qemu makes it up), I 
absolutely would not implement it.

[1] Notably, it does restore the virtio state to the best of its 
abilities when the VM is resumed, but this is all still wrong (there is 
no point in doing so much on a pause/resume, it needlessly costs time) 
and any implementation that does a reset then will rely on the 
implementation-defined behavior that qemu is actually able to restore 
all the state that the back-end would lose during a reset. Notably, 
reset is not even well-defined in the vhost-user specification.  It was 
argued, in this thread, that DPDK works just fine with this, precisely 
because it ignores SET_STATUS 0.  Finally, if virtiofsd in particular, 
as a user of the Rust crates, is reset, it would lose its internal 
state, which qemu cannot restore short of using the upcoming migration 
facilities.
diff mbox series

Patch

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 5a070adbc1..2f68e67a1a 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -1424,21 +1424,35 @@  Front-end message types
   :request payload: ``u64``
   :reply payload: N/A
 
-  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
-  successfully negotiated, this message is submitted by the front-end to
-  notify the back-end with updated device status as defined in the Virtio
+.. admonition:: Deprecated
+
+  This is no longer used. Used to be sent by the front-end to notify the
+  back-end with updated device status as defined in the Virtio
   specification.
 
+  However, its purpose in vhost-user was never well-defined; for
+  example, how or if it would replace VHOST_USER_RESET_DEVICE, or how it
+  integrates with the feature negotiation phase.  Therefore,
+  implementations in practice were less than strict in how the status
+  value was handled, which means there was actually no protocol between
+  front-end and back-end on the use of the status value.
+
+  For resetting, use VHOST_USER_RESET_DEVICE instead.  For feature
+  negotiation with acknowledgment from the device, use
+  VHOST_USER_SET_FEATURES with the :ref:`REPLY_ACK <reply_ack>` feature
+  instead.
+
 ``VHOST_USER_GET_STATUS``
   :id: 40
   :equivalent ioctl: VHOST_VDPA_GET_STATUS
   :request payload: N/A
   :reply payload: ``u64``
 
-  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
-  successfully negotiated, this message is submitted by the front-end to
-  query the back-end for its device status as defined in the Virtio
-  specification.
+.. admonition:: Deprecated
+
+  This is no longer used. Used to be sent by the front-end to query the
+  back-end for its device status as defined in the Virtio specification.
+  Deprecated together with VHOST_USER_SET_STATUS.
 
 
 Back-end message types