diff mbox series

[2/2] vDPA: conditionally read fields in virtio-net dev

Message ID 20220815092638.504528-3-lingshan.zhu@intel.com (mailing list archive)
State New, archived
Headers show
Series allow userspace to query device features | expand

Commit Message

Zhu, Lingshan Aug. 15, 2022, 9:26 a.m. UTC
Some fields of virtio-net device config space are
conditional on the feature bits, the spec says:

"The mac address field always exists
(though is only valid if VIRTIO_NET_F_MAC is set)"

"max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ
or VIRTIO_NET_F_RSS is set"

"mtu only exists if VIRTIO_NET_F_MTU is set"

so we should read MTU, MAC and MQ in the device config
space only when these feature bits are offered.

For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are
not set, the virtio device should have
one queue pair as default value, so when userspace querying queue pair numbers,
it should return mq=1 than zero.

For MTU, if VIRTIO_NET_F_MTU is not set, we should not read
MTU from the device config sapce.
RFC894 <A Standard for the Transmission of IP Datagrams over Ethernet Networks>
says:"The minimum length of the data field of a packet sent over an
Ethernet is 1500 octets, thus the maximum length of an IP datagram
sent over an Ethernet is 1500 octets.  Implementations are encouraged
to support full-length packets"

virtio spec says:"The virtio network device is a virtual ethernet card",
so the default MTU value should be 1500 for virtio-net.

For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set,
the configuration space mac entry indicates the “physical” address
of the network card, otherwise the driver would typically
generate a random local MAC address." So there is no
default MAC address if VIRTIO_NET_F_MAC not set.

This commits introduces functions vdpa_dev_net_mtu_config_fill()
and vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
It also fixes vdpa_dev_net_mq_config_fill() to report correct
MQ when _F_MQ is not present.

These functions should check devices features than driver
features, and struct vdpa_device is not needed as a parameter

The test & userspace tool output:

Feature bit VIRTIO_NET_F_MTU, VIRTIO_NET_F_RSS, VIRTIO_NET_F_MQ
and VIRTIO_NET_F_MAC can be mask out by hardcode.

However, it is challenging to "disable" the related fields
in the HW device config space, so let's just assume the values
are meaningless if the feature bits are not set.

Before this change, when feature bits for RSS, MQ, MTU and MAC
are not set, iproute2 output:
$vdpa vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false mtu 1500
  negotiated_features

without this commit, function vdpa_dev_net_config_fill()
reads all config space fields unconditionally, so let's
assume the MAC and MTU are meaningless, and it checks
MQ with driver_features, so we don't see max_vq_pairs.

After applying this commit, when feature bits for
MQ, RSS, MAC and MTU are not set,iproute2 output:
$vdpa dev config show vdpa0
vdpa0: link up link_announce false max_vq_pairs 1 mtu 1500
  negotiated_features

As explained above:
Here is no MAC, because VIRTIO_NET_F_MAC is not set,
and there is no default value for MAC. It shows
max_vq_paris = 1 because even without MQ feature,
a functional virtio-net must have one queue pair.
mtu = 1500 is the default value as ethernet
required.

This commit also add supplementary comments for
__virtio16_to_cpu(true, xxx) operations in
vdpa_dev_net_config_fill() and vdpa_fill_stats_rec()

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
---
 drivers/vdpa/vdpa.c | 60 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 47 insertions(+), 13 deletions(-)

Comments

Michael S. Tsirkin Aug. 15, 2022, 3:52 p.m. UTC | #1
On Mon, Aug 15, 2022 at 05:26:38PM +0800, Zhu Lingshan wrote:
> Some fields of virtio-net device config space are
> conditional on the feature bits, the spec says:
> 
> "The mac address field always exists
> (though is only valid if VIRTIO_NET_F_MAC is set)"
> 
> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ
> or VIRTIO_NET_F_RSS is set"
> 
> "mtu only exists if VIRTIO_NET_F_MTU is set"
> 
> so we should read MTU, MAC and MQ in the device config
> space only when these feature bits are offered.
> 
> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are
> not set, the virtio device should have
> one queue pair as default value, so when userspace querying queue pair numbers,
> it should return mq=1 than zero.
> 
> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read
> MTU from the device config sapce.
> RFC894 <A Standard for the Transmission of IP Datagrams over Ethernet Networks>
> says:"The minimum length of the data field of a packet sent over an
> Ethernet is 1500 octets, thus the maximum length of an IP datagram
> sent over an Ethernet is 1500 octets.  Implementations are encouraged
> to support full-length packets"
> 
> virtio spec says:"The virtio network device is a virtual ethernet card",
> so the default MTU value should be 1500 for virtio-net.
> 
> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set,
> the configuration space mac entry indicates the “physical” address
> of the network card, otherwise the driver would typically
> generate a random local MAC address." So there is no
> default MAC address if VIRTIO_NET_F_MAC not set.
> 
> This commits introduces functions vdpa_dev_net_mtu_config_fill()
> and vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
> It also fixes vdpa_dev_net_mq_config_fill() to report correct
> MQ when _F_MQ is not present.
> 
> These functions should check devices features than driver
> features, and struct vdpa_device is not needed as a parameter
> 
> The test & userspace tool output:
> 
> Feature bit VIRTIO_NET_F_MTU, VIRTIO_NET_F_RSS, VIRTIO_NET_F_MQ
> and VIRTIO_NET_F_MAC can be mask out by hardcode.
> 
> However, it is challenging to "disable" the related fields
> in the HW device config space, so let's just assume the values
> are meaningless if the feature bits are not set.
> 
> Before this change, when feature bits for RSS, MQ, MTU and MAC
> are not set, iproute2 output:
> $vdpa vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false mtu 1500
>   negotiated_features

where does it get 1500? what if there's e.g. 0 in the mtu field?

> without this commit, function vdpa_dev_net_config_fill()
> reads all config space fields unconditionally, so let's
> assume the MAC and MTU are meaningless, and it checks
> MQ with driver_features, so we don't see max_vq_pairs.
> 
> After applying this commit, when feature bits for
> MQ, RSS, MAC and MTU are not set,iproute2 output:
> $vdpa dev config show vdpa0
> vdpa0: link up link_announce false max_vq_pairs 1 mtu 1500
>   negotiated_features
> 
> As explained above:
> Here is no MAC, because VIRTIO_NET_F_MAC is not set,
> and there is no default value for MAC. It shows
> max_vq_paris = 1 because even without MQ feature,
> a functional virtio-net must have one queue pair.
> mtu = 1500 is the default value as ethernet
> required.
> 
> This commit also add supplementary comments for
> __virtio16_to_cpu(true, xxx) operations in
> vdpa_dev_net_config_fill() and vdpa_fill_stats_rec()
> 
> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> ---
>  drivers/vdpa/vdpa.c | 60 +++++++++++++++++++++++++++++++++++----------
>  1 file changed, 47 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index efb55a06e961..a74660b98979 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -801,19 +801,44 @@ static int vdpa_nl_cmd_dev_get_dumpit(struct sk_buff *msg, struct netlink_callba
>  	return msg->len;
>  }
>  
> -static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
> -				       struct sk_buff *msg, u64 features,
> +static int vdpa_dev_net_mq_config_fill(struct sk_buff *msg, u64 features,
>  				       const struct virtio_net_config *config)
>  {
>  	u16 val_u16;
>  
> -	if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0)
> -		return 0;
> +	if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0 &&
> +	    (features & BIT_ULL(VIRTIO_NET_F_RSS)) == 0)
> +		val_u16 = 1;
> +	else
> +		val_u16 = __virtio16_to_cpu(true, config->max_virtqueue_pairs);
>  
> -	val_u16 = le16_to_cpu(config->max_virtqueue_pairs);
>  	return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, val_u16);
>  }
>  
> +static int vdpa_dev_net_mtu_config_fill(struct sk_buff *msg, u64 features,
> +					const struct virtio_net_config *config)
> +{
> +	u16 val_u16;
> +
> +	if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
> +		val_u16 = 1500;
> +	else
> +		val_u16 = __virtio16_to_cpu(true, config->mtu);
> +
> +	return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16);
> +}
> +
> +static int vdpa_dev_net_mac_config_fill(struct sk_buff *msg, u64 features,
> +					const struct virtio_net_config *config)
> +{
> +	if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) == 0)
> +		return 0;
> +	else
> +		return  nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
> +				sizeof(config->mac), config->mac);
> +}
> +
> +
>  static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *msg)
>  {
>  	struct virtio_net_config config = {};
> @@ -822,18 +847,16 @@ static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *ms
>  
>  	vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>  
> -	if (nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR, sizeof(config.mac),
> -		    config.mac))
> -		return -EMSGSIZE;
> +	/*
> +	 * Assume little endian for now, userspace can tweak this for
> +	 * legacy guest support.
> +	 */
> +	val_u16 = __virtio16_to_cpu(true, config.status);
>  
>  	val_u16 = __virtio16_to_cpu(true, config.status);
>  	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_STATUS, val_u16))
>  		return -EMSGSIZE;
>  
> -	val_u16 = __virtio16_to_cpu(true, config.mtu);
> -	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
> -		return -EMSGSIZE;
> -
>  	features_driver = vdev->config->get_driver_features(vdev);
>  	if (nla_put_u64_64bit(msg, VDPA_ATTR_DEV_NEGOTIATED_FEATURES, features_driver,
>  			      VDPA_ATTR_PAD))
> @@ -846,7 +869,13 @@ static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *ms
>  			      VDPA_ATTR_PAD))
>  		return -EMSGSIZE;
>  
> -	return vdpa_dev_net_mq_config_fill(vdev, msg, features_driver, &config);
> +	if (vdpa_dev_net_mac_config_fill(msg, features_device, &config))
> +		return -EMSGSIZE;
> +
> +	if (vdpa_dev_net_mtu_config_fill(msg, features_device, &config))
> +		return -EMSGSIZE;
> +
> +	return vdpa_dev_net_mq_config_fill(msg, features_device, &config);
>  }
>  
>  static int
> @@ -914,6 +943,11 @@ static int vdpa_fill_stats_rec(struct vdpa_device *vdev, struct sk_buff *msg,
>  	}
>  	vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>  
> +	/*
> +	 * Assume little endian for now, userspace can tweak this for
> +	 * legacy guest support.
> +	 */
> +
>  	max_vqp = __virtio16_to_cpu(true, config.max_virtqueue_pairs);
>  	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, max_vqp))
>  		return -EMSGSIZE;
> -- 
> 2.31.1
Si-Wei Liu Aug. 15, 2022, 11:32 p.m. UTC | #2
On 8/15/2022 2:26 AM, Zhu Lingshan wrote:
> Some fields of virtio-net device config space are
> conditional on the feature bits, the spec says:
>
> "The mac address field always exists
> (though is only valid if VIRTIO_NET_F_MAC is set)"
>
> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ
> or VIRTIO_NET_F_RSS is set"
>
> "mtu only exists if VIRTIO_NET_F_MTU is set"
>
> so we should read MTU, MAC and MQ in the device config
> space only when these feature bits are offered.
>
> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are
> not set, the virtio device should have
> one queue pair as default value, so when userspace querying queue pair numbers,
> it should return mq=1 than zero.
>
> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read
> MTU from the device config sapce.
> RFC894 <A Standard for the Transmission of IP Datagrams over Ethernet Networks>
> says:"The minimum length of the data field of a packet sent over an
> Ethernet is 1500 octets, thus the maximum length of an IP datagram
> sent over an Ethernet is 1500 octets.  Implementations are encouraged
> to support full-length packets"
Noted there's a typo in the above "The *maximum* length of the data 
field of a packet sent over an Ethernet is 1500 octets ..." and the RFC 
was written 1984.
Apparently that is no longer true with the introduction of Jumbo size 
frame later in the 2000s. I'm not sure what is the point of mention this 
ancient RFC. It doesn't say default MTU of any Ethernet NIC/switch 
should be 1500 in either  case.

>
> virtio spec says:"The virtio network device is a virtual ethernet card",
Right,
> so the default MTU value should be 1500 for virtio-net.
... but it doesn't say the default is 1500. At least, not in explicit 
way. Why it can't be 1492 or even lower? In practice, if the network 
backend has a MTU higher than 1500, there's nothing wrong for guest to 
configure default MTU more than 1500.

>
> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set,
> the configuration space mac entry indicates the “physical” address
> of the network card, otherwise the driver would typically
> generate a random local MAC address." So there is no
> default MAC address if VIRTIO_NET_F_MAC not set.
>
> This commits introduces functions vdpa_dev_net_mtu_config_fill()
> and vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
> It also fixes vdpa_dev_net_mq_config_fill() to report correct
> MQ when _F_MQ is not present.
>
> These functions should check devices features than driver
> features, and struct vdpa_device is not needed as a parameter
>
> The test & userspace tool output:
>
> Feature bit VIRTIO_NET_F_MTU, VIRTIO_NET_F_RSS, VIRTIO_NET_F_MQ
> and VIRTIO_NET_F_MAC can be mask out by hardcode.
>
> However, it is challenging to "disable" the related fields
> in the HW device config space, so let's just assume the values
> are meaningless if the feature bits are not set.
>
> Before this change, when feature bits for RSS, MQ, MTU and MAC
> are not set, iproute2 output:
> $vdpa vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false mtu 1500
>    negotiated_features
>
> without this commit, function vdpa_dev_net_config_fill()
> reads all config space fields unconditionally, so let's
> assume the MAC and MTU are meaningless, and it checks
> MQ with driver_features, so we don't see max_vq_pairs.
>
> After applying this commit, when feature bits for
> MQ, RSS, MAC and MTU are not set,iproute2 output:
> $vdpa dev config show vdpa0
> vdpa0: link up link_announce false max_vq_pairs 1 mtu 1500
>    negotiated_features
>
> As explained above:
> Here is no MAC, because VIRTIO_NET_F_MAC is not set,
> and there is no default value for MAC. It shows
> max_vq_paris = 1 because even without MQ feature,
> a functional virtio-net must have one queue pair.
> mtu = 1500 is the default value as ethernet
> required.
>
> This commit also add supplementary comments for
> __virtio16_to_cpu(true, xxx) operations in
> vdpa_dev_net_config_fill() and vdpa_fill_stats_rec()
>
> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> ---
>   drivers/vdpa/vdpa.c | 60 +++++++++++++++++++++++++++++++++++----------
>   1 file changed, 47 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index efb55a06e961..a74660b98979 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -801,19 +801,44 @@ static int vdpa_nl_cmd_dev_get_dumpit(struct sk_buff *msg, struct netlink_callba
>   	return msg->len;
>   }
>   
> -static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
> -				       struct sk_buff *msg, u64 features,
> +static int vdpa_dev_net_mq_config_fill(struct sk_buff *msg, u64 features,
>   				       const struct virtio_net_config *config)
>   {
>   	u16 val_u16;
>   
> -	if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0)
> -		return 0;
> +	if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0 &&
> +	    (features & BIT_ULL(VIRTIO_NET_F_RSS)) == 0)
> +		val_u16 = 1;
> +	else
> +		val_u16 = __virtio16_to_cpu(true, config->max_virtqueue_pairs);
>   
> -	val_u16 = le16_to_cpu(config->max_virtqueue_pairs);
>   	return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, val_u16);
>   }
>   
> +static int vdpa_dev_net_mtu_config_fill(struct sk_buff *msg, u64 features,
> +					const struct virtio_net_config *config)
> +{
> +	u16 val_u16;
> +
> +	if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
> +		val_u16 = 1500;
As said, there's no virtio spec defined value for MTU. Please leave this 
field out if feature VIRTIO_NET_F_MTU is not negotiated.
> +	else
> +		val_u16 = __virtio16_to_cpu(true, config->mtu);
> +
> +	return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16);
> +}
> +
> +static int vdpa_dev_net_mac_config_fill(struct sk_buff *msg, u64 features,
> +					const struct virtio_net_config *config)
> +{
> +	if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) == 0)
> +		return 0;
> +	else
> +		return  nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
> +				sizeof(config->mac), config->mac);
> +}
> +
> +
>   static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *msg)
>   {
>   	struct virtio_net_config config = {};
> @@ -822,18 +847,16 @@ static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *ms
>   
>   	vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>   
> -	if (nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR, sizeof(config.mac),
> -		    config.mac))
> -		return -EMSGSIZE;
> +	/*
> +	 * Assume little endian for now, userspace can tweak this for
> +	 * legacy guest support.
You can leave it as a TODO for kernel (vdpa core limitation), but AFAIK 
there's nothing userspace needs to do to infer the endianness. IMHO it's 
the kernel's job to provide an abstraction rather than rely on userspace 
guessing it.

> +	 */
> +	val_u16 = __virtio16_to_cpu(true, config.status);
>   
>   	val_u16 = __virtio16_to_cpu(true, config.status);
>   	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_STATUS, val_u16))
>   		return -EMSGSIZE;
>   
> -	val_u16 = __virtio16_to_cpu(true, config.mtu);
> -	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
> -		return -EMSGSIZE;
> -
>   	features_driver = vdev->config->get_driver_features(vdev);
>   	if (nla_put_u64_64bit(msg, VDPA_ATTR_DEV_NEGOTIATED_FEATURES, features_driver,
>   			      VDPA_ATTR_PAD))
> @@ -846,7 +869,13 @@ static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *ms
>   			      VDPA_ATTR_PAD))
>   		return -EMSGSIZE;
>   
> -	return vdpa_dev_net_mq_config_fill(vdev, msg, features_driver, &config);
> +	if (vdpa_dev_net_mac_config_fill(msg, features_device, &config))
> +		return -EMSGSIZE;
> +
> +	if (vdpa_dev_net_mtu_config_fill(msg, features_device, &config))
> +		return -EMSGSIZE;
> +
> +	return vdpa_dev_net_mq_config_fill(msg, features_device, &config);
>   }
>   
>   static int
> @@ -914,6 +943,11 @@ static int vdpa_fill_stats_rec(struct vdpa_device *vdev, struct sk_buff *msg,
>   	}
>   	vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>   
> +	/*
> +	 * Assume little endian for now, userspace can tweak this for
> +	 * legacy guest support.
> +	 */
> +
Ditto.

Thanks,
-Siwei
>   	max_vqp = __virtio16_to_cpu(true, config.max_virtqueue_pairs);
>   	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, max_vqp))
>   		return -EMSGSIZE;
Zhu, Lingshan Aug. 16, 2022, 1:58 a.m. UTC | #3
On 8/16/2022 7:32 AM, Si-Wei Liu wrote:
>
>
> On 8/15/2022 2:26 AM, Zhu Lingshan wrote:
>> Some fields of virtio-net device config space are
>> conditional on the feature bits, the spec says:
>>
>> "The mac address field always exists
>> (though is only valid if VIRTIO_NET_F_MAC is set)"
>>
>> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ
>> or VIRTIO_NET_F_RSS is set"
>>
>> "mtu only exists if VIRTIO_NET_F_MTU is set"
>>
>> so we should read MTU, MAC and MQ in the device config
>> space only when these feature bits are offered.
>>
>> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are
>> not set, the virtio device should have
>> one queue pair as default value, so when userspace querying queue 
>> pair numbers,
>> it should return mq=1 than zero.
>>
>> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read
>> MTU from the device config sapce.
>> RFC894 <A Standard for the Transmission of IP Datagrams over Ethernet 
>> Networks>
>> says:"The minimum length of the data field of a packet sent over an
>> Ethernet is 1500 octets, thus the maximum length of an IP datagram
>> sent over an Ethernet is 1500 octets.  Implementations are encouraged
>> to support full-length packets"
> Noted there's a typo in the above "The *maximum* length of the data 
> field of a packet sent over an Ethernet is 1500 octets ..." and the 
> RFC was written 1984.
the spec RFC894 says it is 1500, see 
https://www.rfc-editor.org/rfc/rfc894.txt
>
> Apparently that is no longer true with the introduction of Jumbo size 
> frame later in the 2000s. I'm not sure what is the point of mention 
> this ancient RFC. It doesn't say default MTU of any Ethernet 
> NIC/switch should be 1500 in either  case.
This could be a larger number for sure, we are trying to find out the 
min value for Ethernet here, to support 1500 octets, MTU should be 1500 
at least, so I assume 1500 should be the default value for MTU
>
>>
>> virtio spec says:"The virtio network device is a virtual ethernet card",
> Right,
>> so the default MTU value should be 1500 for virtio-net.
> ... but it doesn't say the default is 1500. At least, not in explicit 
> way. Why it can't be 1492 or even lower? In practice, if the network 
> backend has a MTU higher than 1500, there's nothing wrong for guest to 
> configure default MTU more than 1500.
same as above
>
>>
>> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set,
>> the configuration space mac entry indicates the “physical” address
>> of the network card, otherwise the driver would typically
>> generate a random local MAC address." So there is no
>> default MAC address if VIRTIO_NET_F_MAC not set.
>>
>> This commits introduces functions vdpa_dev_net_mtu_config_fill()
>> and vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
>> It also fixes vdpa_dev_net_mq_config_fill() to report correct
>> MQ when _F_MQ is not present.
>>
>> These functions should check devices features than driver
>> features, and struct vdpa_device is not needed as a parameter
>>
>> The test & userspace tool output:
>>
>> Feature bit VIRTIO_NET_F_MTU, VIRTIO_NET_F_RSS, VIRTIO_NET_F_MQ
>> and VIRTIO_NET_F_MAC can be mask out by hardcode.
>>
>> However, it is challenging to "disable" the related fields
>> in the HW device config space, so let's just assume the values
>> are meaningless if the feature bits are not set.
>>
>> Before this change, when feature bits for RSS, MQ, MTU and MAC
>> are not set, iproute2 output:
>> $vdpa vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false mtu 1500
>>    negotiated_features
>>
>> without this commit, function vdpa_dev_net_config_fill()
>> reads all config space fields unconditionally, so let's
>> assume the MAC and MTU are meaningless, and it checks
>> MQ with driver_features, so we don't see max_vq_pairs.
>>
>> After applying this commit, when feature bits for
>> MQ, RSS, MAC and MTU are not set,iproute2 output:
>> $vdpa dev config show vdpa0
>> vdpa0: link up link_announce false max_vq_pairs 1 mtu 1500
>>    negotiated_features
>>
>> As explained above:
>> Here is no MAC, because VIRTIO_NET_F_MAC is not set,
>> and there is no default value for MAC. It shows
>> max_vq_paris = 1 because even without MQ feature,
>> a functional virtio-net must have one queue pair.
>> mtu = 1500 is the default value as ethernet
>> required.
>>
>> This commit also add supplementary comments for
>> __virtio16_to_cpu(true, xxx) operations in
>> vdpa_dev_net_config_fill() and vdpa_fill_stats_rec()
>>
>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>> ---
>>   drivers/vdpa/vdpa.c | 60 +++++++++++++++++++++++++++++++++++----------
>>   1 file changed, 47 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
>> index efb55a06e961..a74660b98979 100644
>> --- a/drivers/vdpa/vdpa.c
>> +++ b/drivers/vdpa/vdpa.c
>> @@ -801,19 +801,44 @@ static int vdpa_nl_cmd_dev_get_dumpit(struct 
>> sk_buff *msg, struct netlink_callba
>>       return msg->len;
>>   }
>>   -static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
>> -                       struct sk_buff *msg, u64 features,
>> +static int vdpa_dev_net_mq_config_fill(struct sk_buff *msg, u64 
>> features,
>>                          const struct virtio_net_config *config)
>>   {
>>       u16 val_u16;
>>   -    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0)
>> -        return 0;
>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0 &&
>> +        (features & BIT_ULL(VIRTIO_NET_F_RSS)) == 0)
>> +        val_u16 = 1;
>> +    else
>> +        val_u16 = __virtio16_to_cpu(true, config->max_virtqueue_pairs);
>>   -    val_u16 = le16_to_cpu(config->max_virtqueue_pairs);
>>       return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, val_u16);
>>   }
>>   +static int vdpa_dev_net_mtu_config_fill(struct sk_buff *msg, u64 
>> features,
>> +                    const struct virtio_net_config *config)
>> +{
>> +    u16 val_u16;
>> +
>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
>> +        val_u16 = 1500;
> As said, there's no virtio spec defined value for MTU. Please leave 
> this field out if feature VIRTIO_NET_F_MTU is not negotiated.
same as above
>> +    else
>> +        val_u16 = __virtio16_to_cpu(true, config->mtu);
>> +
>> +    return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16);
>> +}
>> +
>> +static int vdpa_dev_net_mac_config_fill(struct sk_buff *msg, u64 
>> features,
>> +                    const struct virtio_net_config *config)
>> +{
>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) == 0)
>> +        return 0;
>> +    else
>> +        return  nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
>> +                sizeof(config->mac), config->mac);
>> +}
>> +
>> +
>>   static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, 
>> struct sk_buff *msg)
>>   {
>>       struct virtio_net_config config = {};
>> @@ -822,18 +847,16 @@ static int vdpa_dev_net_config_fill(struct 
>> vdpa_device *vdev, struct sk_buff *ms
>>         vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>   -    if (nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR, 
>> sizeof(config.mac),
>> -            config.mac))
>> -        return -EMSGSIZE;
>> +    /*
>> +     * Assume little endian for now, userspace can tweak this for
>> +     * legacy guest support.
> You can leave it as a TODO for kernel (vdpa core limitation), but 
> AFAIK there's nothing userspace needs to do to infer the endianness. 
> IMHO it's the kernel's job to provide an abstraction rather than rely 
> on userspace guessing it.
we have discussed it in another thread, and this comment is suggested by 
MST.
>
>> +     */
>> +    val_u16 = __virtio16_to_cpu(true, config.status);
>>         val_u16 = __virtio16_to_cpu(true, config.status);
>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_STATUS, val_u16))
>>           return -EMSGSIZE;
>>   -    val_u16 = __virtio16_to_cpu(true, config.mtu);
>> -    if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
>> -        return -EMSGSIZE;
>> -
>>       features_driver = vdev->config->get_driver_features(vdev);
>>       if (nla_put_u64_64bit(msg, VDPA_ATTR_DEV_NEGOTIATED_FEATURES, 
>> features_driver,
>>                     VDPA_ATTR_PAD))
>> @@ -846,7 +869,13 @@ static int vdpa_dev_net_config_fill(struct 
>> vdpa_device *vdev, struct sk_buff *ms
>>                     VDPA_ATTR_PAD))
>>           return -EMSGSIZE;
>>   -    return vdpa_dev_net_mq_config_fill(vdev, msg, features_driver, 
>> &config);
>> +    if (vdpa_dev_net_mac_config_fill(msg, features_device, &config))
>> +        return -EMSGSIZE;
>> +
>> +    if (vdpa_dev_net_mtu_config_fill(msg, features_device, &config))
>> +        return -EMSGSIZE;
>> +
>> +    return vdpa_dev_net_mq_config_fill(msg, features_device, &config);
>>   }
>>     static int
>> @@ -914,6 +943,11 @@ static int vdpa_fill_stats_rec(struct 
>> vdpa_device *vdev, struct sk_buff *msg,
>>       }
>>       vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>   +    /*
>> +     * Assume little endian for now, userspace can tweak this for
>> +     * legacy guest support.
>> +     */
>> +
> Ditto.
same as above

Thanks
>
> Thanks,
> -Siwei
>>       max_vqp = __virtio16_to_cpu(true, config.max_virtqueue_pairs);
>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, max_vqp))
>>           return -EMSGSIZE;
>
Parav Pandit Aug. 16, 2022, 2:32 a.m. UTC | #4
> From: Zhu Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, August 15, 2022 5:27 AM
> 
> Some fields of virtio-net device config space are conditional on the feature
> bits, the spec says:
> 
> "The mac address field always exists
> (though is only valid if VIRTIO_NET_F_MAC is set)"
> 
> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ or
> VIRTIO_NET_F_RSS is set"
> 
> "mtu only exists if VIRTIO_NET_F_MTU is set"
> 
> so we should read MTU, MAC and MQ in the device config space only when
> these feature bits are offered.
Yes.

> 
> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are not set, the
> virtio device should have one queue pair as default value, so when userspace
> querying queue pair numbers, it should return mq=1 than zero.
No.
No need to treat mac and max_qps differently.
It is meaningless to differentiate when field exist/not-exists vs value valid/not valid.

> 
> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read MTU from
> the device config sapce.
> RFC894 <A Standard for the Transmission of IP Datagrams over Ethernet
> Networks> says:"The minimum length of the data field of a packet sent over
> an Ethernet is 1500 octets, thus the maximum length of an IP datagram sent
> over an Ethernet is 1500 octets.  Implementations are encouraged to support
> full-length packets"
This line in the RFC 894 of 1984 is wrong.
Errata already exists for it at [1].

[1] https://www.rfc-editor.org/errata_search.php?rfc=894&rec_status=0

> 
> virtio spec says:"The virtio network device is a virtual ethernet card", so the
> default MTU value should be 1500 for virtio-net.
> 
Practically I have seen 1500 and highe mtu.
And this derivation is not good of what should be the default mtu as above errata exists.

And I see the code below why you need to work so hard to define a default value so that _MQ and _MTU can report default values.

There is really no need for this complexity and such a long commit message.

Can we please expose feature bits as-is and report config space field which are valid?

User space will be querying both.

> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set, the
> configuration space mac entry indicates the “physical” address of the
> network card, otherwise the driver would typically generate a random local
> MAC address." So there is no default MAC address if VIRTIO_NET_F_MAC
> not set.
> 
> This commits introduces functions vdpa_dev_net_mtu_config_fill() and
> vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
> It also fixes vdpa_dev_net_mq_config_fill() to report correct MQ when
> _F_MQ is not present.
> 
Multiple changes in single patch are not good idea.
Its ok to split to smaller patches.

> +	if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
> +		val_u16 = 1500;
> +	else
> +		val_u16 = __virtio16_to_cpu(true, config->mtu);
> +
Need to work hard to find default values and that too turned out had errata.
There are more fields that doesn’t have default values.

There is no point in kernel doing this guess work, that user space can figure out of what is valid/invalid.
Zhu, Lingshan Aug. 16, 2022, 4:18 a.m. UTC | #5
On 8/16/2022 10:32 AM, Parav Pandit wrote:
>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, August 15, 2022 5:27 AM
>>
>> Some fields of virtio-net device config space are conditional on the feature
>> bits, the spec says:
>>
>> "The mac address field always exists
>> (though is only valid if VIRTIO_NET_F_MAC is set)"
>>
>> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ or
>> VIRTIO_NET_F_RSS is set"
>>
>> "mtu only exists if VIRTIO_NET_F_MTU is set"
>>
>> so we should read MTU, MAC and MQ in the device config space only when
>> these feature bits are offered.
> Yes.
>
>> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are not set, the
>> virtio device should have one queue pair as default value, so when userspace
>> querying queue pair numbers, it should return mq=1 than zero.
> No.
> No need to treat mac and max_qps differently.
> It is meaningless to differentiate when field exist/not-exists vs value valid/not valid.
as we discussed before, MQ has a default value 1, to be a functional 
virtio-net device,
while MAC has no default value, if no VIRTIO_NET_F_MAC set, the driver 
should generate
a random MAC.
>
>> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read MTU from
>> the device config sapce.
>> RFC894 <A Standard for the Transmission of IP Datagrams over Ethernet
>> Networks> says:"The minimum length of the data field of a packet sent over
>> an Ethernet is 1500 octets, thus the maximum length of an IP datagram sent
>> over an Ethernet is 1500 octets.  Implementations are encouraged to support
>> full-length packets"
> This line in the RFC 894 of 1984 is wrong.
> Errata already exists for it at [1].
>
> [1] https://www.rfc-editor.org/errata_search.php?rfc=894&rec_status=0
OK, so I think we should return nothing if _F_MTU not set, like handling 
the MAC
>
>> virtio spec says:"The virtio network device is a virtual ethernet card", so the
>> default MTU value should be 1500 for virtio-net.
>>
> Practically I have seen 1500 and highe mtu.
> And this derivation is not good of what should be the default mtu as above errata exists.
>
> And I see the code below why you need to work so hard to define a default value so that _MQ and _MTU can report default values.
>
> There is really no need for this complexity and such a long commit message.
>
> Can we please expose feature bits as-is and report config space field which are valid?
>
> User space will be querying both.
I think MAC and MTU don't have default values, so return nothing if the 
feature bits not set,
for MQ, it is still max_vq_paris == 1 by default.
>
>> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set, the
>> configuration space mac entry indicates the “physical” address of the
>> network card, otherwise the driver would typically generate a random local
>> MAC address." So there is no default MAC address if VIRTIO_NET_F_MAC
>> not set.
>>
>> This commits introduces functions vdpa_dev_net_mtu_config_fill() and
>> vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
>> It also fixes vdpa_dev_net_mq_config_fill() to report correct MQ when
>> _F_MQ is not present.
>>
> Multiple changes in single patch are not good idea.
> Its ok to split to smaller patches.
OK, I can try to split it.
>
>> +	if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
>> +		val_u16 = 1500;
>> +	else
>> +		val_u16 = __virtio16_to_cpu(true, config->mtu);
>> +
> Need to work hard to find default values and that too turned out had errata.
> There are more fields that doesn’t have default values.
>
> There is no point in kernel doing this guess work, that user space can figure out of what is valid/invalid.
It's not guest work, when guest finds no feature bits set, it can decide 
what to do. These code only for the
user space, just MST ever suggest, if there is a default value, we can 
return it from the kernel, once for all.

Thanks
>
Zhu, Lingshan Aug. 16, 2022, 4:26 a.m. UTC | #6
On 8/16/2022 9:58 AM, Zhu, Lingshan wrote:
>
>
> On 8/16/2022 7:32 AM, Si-Wei Liu wrote:
>>
>>
>> On 8/15/2022 2:26 AM, Zhu Lingshan wrote:
>>> Some fields of virtio-net device config space are
>>> conditional on the feature bits, the spec says:
>>>
>>> "The mac address field always exists
>>> (though is only valid if VIRTIO_NET_F_MAC is set)"
>>>
>>> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ
>>> or VIRTIO_NET_F_RSS is set"
>>>
>>> "mtu only exists if VIRTIO_NET_F_MTU is set"
>>>
>>> so we should read MTU, MAC and MQ in the device config
>>> space only when these feature bits are offered.
>>>
>>> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are
>>> not set, the virtio device should have
>>> one queue pair as default value, so when userspace querying queue 
>>> pair numbers,
>>> it should return mq=1 than zero.
>>>
>>> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read
>>> MTU from the device config sapce.
>>> RFC894 <A Standard for the Transmission of IP Datagrams over 
>>> Ethernet Networks>
>>> says:"The minimum length of the data field of a packet sent over an
>>> Ethernet is 1500 octets, thus the maximum length of an IP datagram
>>> sent over an Ethernet is 1500 octets.  Implementations are encouraged
>>> to support full-length packets"
>> Noted there's a typo in the above "The *maximum* length of the data 
>> field of a packet sent over an Ethernet is 1500 octets ..." and the 
>> RFC was written 1984.
> the spec RFC894 says it is 1500, see 
> https://www.rfc-editor.org/rfc/rfc894.txt
I have seen 
https://www.rfc-editor.org/errata_search.php?rfc=894&rec_status=0,

so I think we should return nothing if _F_MTU not set.

Thanks
>>
>> Apparently that is no longer true with the introduction of Jumbo size 
>> frame later in the 2000s. I'm not sure what is the point of mention 
>> this ancient RFC. It doesn't say default MTU of any Ethernet 
>> NIC/switch should be 1500 in either  case.
> This could be a larger number for sure, we are trying to find out the 
> min value for Ethernet here, to support 1500 octets, MTU should be 
> 1500 at least, so I assume 1500 should be the default value for MTU
>>
>>>
>>> virtio spec says:"The virtio network device is a virtual ethernet 
>>> card",
>> Right,
>>> so the default MTU value should be 1500 for virtio-net.
>> ... but it doesn't say the default is 1500. At least, not in explicit 
>> way. Why it can't be 1492 or even lower? In practice, if the network 
>> backend has a MTU higher than 1500, there's nothing wrong for guest 
>> to configure default MTU more than 1500.
> same as above
>>
>>>
>>> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set,
>>> the configuration space mac entry indicates the “physical” address
>>> of the network card, otherwise the driver would typically
>>> generate a random local MAC address." So there is no
>>> default MAC address if VIRTIO_NET_F_MAC not set.
>>>
>>> This commits introduces functions vdpa_dev_net_mtu_config_fill()
>>> and vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
>>> It also fixes vdpa_dev_net_mq_config_fill() to report correct
>>> MQ when _F_MQ is not present.
>>>
>>> These functions should check devices features than driver
>>> features, and struct vdpa_device is not needed as a parameter
>>>
>>> The test & userspace tool output:
>>>
>>> Feature bit VIRTIO_NET_F_MTU, VIRTIO_NET_F_RSS, VIRTIO_NET_F_MQ
>>> and VIRTIO_NET_F_MAC can be mask out by hardcode.
>>>
>>> However, it is challenging to "disable" the related fields
>>> in the HW device config space, so let's just assume the values
>>> are meaningless if the feature bits are not set.
>>>
>>> Before this change, when feature bits for RSS, MQ, MTU and MAC
>>> are not set, iproute2 output:
>>> $vdpa vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false mtu 1500
>>>    negotiated_features
>>>
>>> without this commit, function vdpa_dev_net_config_fill()
>>> reads all config space fields unconditionally, so let's
>>> assume the MAC and MTU are meaningless, and it checks
>>> MQ with driver_features, so we don't see max_vq_pairs.
>>>
>>> After applying this commit, when feature bits for
>>> MQ, RSS, MAC and MTU are not set,iproute2 output:
>>> $vdpa dev config show vdpa0
>>> vdpa0: link up link_announce false max_vq_pairs 1 mtu 1500
>>>    negotiated_features
>>>
>>> As explained above:
>>> Here is no MAC, because VIRTIO_NET_F_MAC is not set,
>>> and there is no default value for MAC. It shows
>>> max_vq_paris = 1 because even without MQ feature,
>>> a functional virtio-net must have one queue pair.
>>> mtu = 1500 is the default value as ethernet
>>> required.
>>>
>>> This commit also add supplementary comments for
>>> __virtio16_to_cpu(true, xxx) operations in
>>> vdpa_dev_net_config_fill() and vdpa_fill_stats_rec()
>>>
>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>> ---
>>>   drivers/vdpa/vdpa.c | 60 
>>> +++++++++++++++++++++++++++++++++++----------
>>>   1 file changed, 47 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
>>> index efb55a06e961..a74660b98979 100644
>>> --- a/drivers/vdpa/vdpa.c
>>> +++ b/drivers/vdpa/vdpa.c
>>> @@ -801,19 +801,44 @@ static int vdpa_nl_cmd_dev_get_dumpit(struct 
>>> sk_buff *msg, struct netlink_callba
>>>       return msg->len;
>>>   }
>>>   -static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
>>> -                       struct sk_buff *msg, u64 features,
>>> +static int vdpa_dev_net_mq_config_fill(struct sk_buff *msg, u64 
>>> features,
>>>                          const struct virtio_net_config *config)
>>>   {
>>>       u16 val_u16;
>>>   -    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0)
>>> -        return 0;
>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0 &&
>>> +        (features & BIT_ULL(VIRTIO_NET_F_RSS)) == 0)
>>> +        val_u16 = 1;
>>> +    else
>>> +        val_u16 = __virtio16_to_cpu(true, 
>>> config->max_virtqueue_pairs);
>>>   -    val_u16 = le16_to_cpu(config->max_virtqueue_pairs);
>>>       return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, val_u16);
>>>   }
>>>   +static int vdpa_dev_net_mtu_config_fill(struct sk_buff *msg, u64 
>>> features,
>>> +                    const struct virtio_net_config *config)
>>> +{
>>> +    u16 val_u16;
>>> +
>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
>>> +        val_u16 = 1500;
>> As said, there's no virtio spec defined value for MTU. Please leave 
>> this field out if feature VIRTIO_NET_F_MTU is not negotiated.
> same as above
>>> +    else
>>> +        val_u16 = __virtio16_to_cpu(true, config->mtu);
>>> +
>>> +    return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16);
>>> +}
>>> +
>>> +static int vdpa_dev_net_mac_config_fill(struct sk_buff *msg, u64 
>>> features,
>>> +                    const struct virtio_net_config *config)
>>> +{
>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) == 0)
>>> +        return 0;
>>> +    else
>>> +        return  nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
>>> +                sizeof(config->mac), config->mac);
>>> +}
>>> +
>>> +
>>>   static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, 
>>> struct sk_buff *msg)
>>>   {
>>>       struct virtio_net_config config = {};
>>> @@ -822,18 +847,16 @@ static int vdpa_dev_net_config_fill(struct 
>>> vdpa_device *vdev, struct sk_buff *ms
>>>         vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>   -    if (nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR, 
>>> sizeof(config.mac),
>>> -            config.mac))
>>> -        return -EMSGSIZE;
>>> +    /*
>>> +     * Assume little endian for now, userspace can tweak this for
>>> +     * legacy guest support.
>> You can leave it as a TODO for kernel (vdpa core limitation), but 
>> AFAIK there's nothing userspace needs to do to infer the endianness. 
>> IMHO it's the kernel's job to provide an abstraction rather than rely 
>> on userspace guessing it.
> we have discussed it in another thread, and this comment is suggested 
> by MST.
>>
>>> +     */
>>> +    val_u16 = __virtio16_to_cpu(true, config.status);
>>>         val_u16 = __virtio16_to_cpu(true, config.status);
>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_STATUS, val_u16))
>>>           return -EMSGSIZE;
>>>   -    val_u16 = __virtio16_to_cpu(true, config.mtu);
>>> -    if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
>>> -        return -EMSGSIZE;
>>> -
>>>       features_driver = vdev->config->get_driver_features(vdev);
>>>       if (nla_put_u64_64bit(msg, VDPA_ATTR_DEV_NEGOTIATED_FEATURES, 
>>> features_driver,
>>>                     VDPA_ATTR_PAD))
>>> @@ -846,7 +869,13 @@ static int vdpa_dev_net_config_fill(struct 
>>> vdpa_device *vdev, struct sk_buff *ms
>>>                     VDPA_ATTR_PAD))
>>>           return -EMSGSIZE;
>>>   -    return vdpa_dev_net_mq_config_fill(vdev, msg, 
>>> features_driver, &config);
>>> +    if (vdpa_dev_net_mac_config_fill(msg, features_device, &config))
>>> +        return -EMSGSIZE;
>>> +
>>> +    if (vdpa_dev_net_mtu_config_fill(msg, features_device, &config))
>>> +        return -EMSGSIZE;
>>> +
>>> +    return vdpa_dev_net_mq_config_fill(msg, features_device, &config);
>>>   }
>>>     static int
>>> @@ -914,6 +943,11 @@ static int vdpa_fill_stats_rec(struct 
>>> vdpa_device *vdev, struct sk_buff *msg,
>>>       }
>>>       vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>   +    /*
>>> +     * Assume little endian for now, userspace can tweak this for
>>> +     * legacy guest support.
>>> +     */
>>> +
>> Ditto.
> same as above
>
> Thanks
>>
>> Thanks,
>> -Siwei
>>>       max_vqp = __virtio16_to_cpu(true, config.max_virtqueue_pairs);
>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, max_vqp))
>>>           return -EMSGSIZE;
>>
>
Si-Wei Liu Aug. 16, 2022, 7:58 a.m. UTC | #7
On 8/15/2022 6:58 PM, Zhu, Lingshan wrote:
>
>
> On 8/16/2022 7:32 AM, Si-Wei Liu wrote:
>>
>>
>> On 8/15/2022 2:26 AM, Zhu Lingshan wrote:
>>> Some fields of virtio-net device config space are
>>> conditional on the feature bits, the spec says:
>>>
>>> "The mac address field always exists
>>> (though is only valid if VIRTIO_NET_F_MAC is set)"
>>>
>>> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ
>>> or VIRTIO_NET_F_RSS is set"
>>>
>>> "mtu only exists if VIRTIO_NET_F_MTU is set"
>>>
>>> so we should read MTU, MAC and MQ in the device config
>>> space only when these feature bits are offered.
>>>
>>> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are
>>> not set, the virtio device should have
>>> one queue pair as default value, so when userspace querying queue 
>>> pair numbers,
>>> it should return mq=1 than zero.
>>>
>>> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read
>>> MTU from the device config sapce.
>>> RFC894 <A Standard for the Transmission of IP Datagrams over 
>>> Ethernet Networks>
>>> says:"The minimum length of the data field of a packet sent over an
>>> Ethernet is 1500 octets, thus the maximum length of an IP datagram
>>> sent over an Ethernet is 1500 octets.  Implementations are encouraged
>>> to support full-length packets"
>> Noted there's a typo in the above "The *maximum* length of the data 
>> field of a packet sent over an Ethernet is 1500 octets ..." and the 
>> RFC was written 1984.
> the spec RFC894 says it is 1500, see <a 
> href="https://urldefense.com/v3/__https://www.rfc-editor.org/rfc/rfc894.txt__;!!ACWV5N9M2RV99hQ!MdgxZjw5sp5Qz-GKfwT1IWcw_L4Jo1-UekuJPFz1UrG3YuqirKz7P9ksdJFh1vB6zHJ7z8Q04fpT0-9jWXCtlWM$">https://www.rfc-editor.org/rfc/rfc894.txt</a>
>>
>> Apparently that is no longer true with the introduction of Jumbo size 
>> frame later in the 2000s. I'm not sure what is the point of mention 
>> this ancient RFC. It doesn't say default MTU of any Ethernet 
>> NIC/switch should be 1500 in either  case.
> This could be a larger number for sure, we are trying to find out the 
> min value for Ethernet here, to support 1500 octets, MTU should be 
> 1500 at least, so I assume 1500 should be the default value for MTU
>>
>>>
>>> virtio spec says:"The virtio network device is a virtual ethernet 
>>> card",
>> Right,
>>> so the default MTU value should be 1500 for virtio-net.
>> ... but it doesn't say the default is 1500. At least, not in explicit 
>> way. Why it can't be 1492 or even lower? In practice, if the network 
>> backend has a MTU higher than 1500, there's nothing wrong for guest 
>> to configure default MTU more than 1500.
> same as above
>>
>>>
>>> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set,
>>> the configuration space mac entry indicates the “physical” address
>>> of the network card, otherwise the driver would typically
>>> generate a random local MAC address." So there is no
>>> default MAC address if VIRTIO_NET_F_MAC not set.
>>>
>>> This commits introduces functions vdpa_dev_net_mtu_config_fill()
>>> and vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
>>> It also fixes vdpa_dev_net_mq_config_fill() to report correct
>>> MQ when _F_MQ is not present.
>>>
>>> These functions should check devices features than driver
>>> features, and struct vdpa_device is not needed as a parameter
>>>
>>> The test & userspace tool output:
>>>
>>> Feature bit VIRTIO_NET_F_MTU, VIRTIO_NET_F_RSS, VIRTIO_NET_F_MQ
>>> and VIRTIO_NET_F_MAC can be mask out by hardcode.
>>>
>>> However, it is challenging to "disable" the related fields
>>> in the HW device config space, so let's just assume the values
>>> are meaningless if the feature bits are not set.
>>>
>>> Before this change, when feature bits for RSS, MQ, MTU and MAC
>>> are not set, iproute2 output:
>>> $vdpa vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false mtu 1500
>>>    negotiated_features
>>>
>>> without this commit, function vdpa_dev_net_config_fill()
>>> reads all config space fields unconditionally, so let's
>>> assume the MAC and MTU are meaningless, and it checks
>>> MQ with driver_features, so we don't see max_vq_pairs.
>>>
>>> After applying this commit, when feature bits for
>>> MQ, RSS, MAC and MTU are not set,iproute2 output:
>>> $vdpa dev config show vdpa0
>>> vdpa0: link up link_announce false max_vq_pairs 1 mtu 1500
>>>    negotiated_features
>>>
>>> As explained above:
>>> Here is no MAC, because VIRTIO_NET_F_MAC is not set,
>>> and there is no default value for MAC. It shows
>>> max_vq_paris = 1 because even without MQ feature,
>>> a functional virtio-net must have one queue pair.
>>> mtu = 1500 is the default value as ethernet
>>> required.
>>>
>>> This commit also add supplementary comments for
>>> __virtio16_to_cpu(true, xxx) operations in
>>> vdpa_dev_net_config_fill() and vdpa_fill_stats_rec()
>>>
>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>> ---
>>>   drivers/vdpa/vdpa.c | 60 
>>> +++++++++++++++++++++++++++++++++++----------
>>>   1 file changed, 47 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
>>> index efb55a06e961..a74660b98979 100644
>>> --- a/drivers/vdpa/vdpa.c
>>> +++ b/drivers/vdpa/vdpa.c
>>> @@ -801,19 +801,44 @@ static int vdpa_nl_cmd_dev_get_dumpit(struct 
>>> sk_buff *msg, struct netlink_callba
>>>       return msg->len;
>>>   }
>>>   -static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
>>> -                       struct sk_buff *msg, u64 features,
>>> +static int vdpa_dev_net_mq_config_fill(struct sk_buff *msg, u64 
>>> features,
>>>                          const struct virtio_net_config *config)
>>>   {
>>>       u16 val_u16;
>>>   -    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0)
>>> -        return 0;
>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0 &&
>>> +        (features & BIT_ULL(VIRTIO_NET_F_RSS)) == 0)
>>> +        val_u16 = 1;
>>> +    else
>>> +        val_u16 = __virtio16_to_cpu(true, 
>>> config->max_virtqueue_pairs);
>>>   -    val_u16 = le16_to_cpu(config->max_virtqueue_pairs);
>>>       return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, val_u16);
>>>   }
>>>   +static int vdpa_dev_net_mtu_config_fill(struct sk_buff *msg, u64 
>>> features,
>>> +                    const struct virtio_net_config *config)
>>> +{
>>> +    u16 val_u16;
>>> +
>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
>>> +        val_u16 = 1500;
>> As said, there's no virtio spec defined value for MTU. Please leave 
>> this field out if feature VIRTIO_NET_F_MTU is not negotiated.
> same as above
>>> +    else
>>> +        val_u16 = __virtio16_to_cpu(true, config->mtu);
>>> +
>>> +    return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16);
>>> +}
>>> +
>>> +static int vdpa_dev_net_mac_config_fill(struct sk_buff *msg, u64 
>>> features,
>>> +                    const struct virtio_net_config *config)
>>> +{
>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) == 0)
>>> +        return 0;
>>> +    else
>>> +        return  nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
>>> +                sizeof(config->mac), config->mac);
>>> +}
>>> +
>>> +
>>>   static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, 
>>> struct sk_buff *msg)
>>>   {
>>>       struct virtio_net_config config = {};
>>> @@ -822,18 +847,16 @@ static int vdpa_dev_net_config_fill(struct 
>>> vdpa_device *vdev, struct sk_buff *ms
>>>         vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>   -    if (nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR, 
>>> sizeof(config.mac),
>>> -            config.mac))
>>> -        return -EMSGSIZE;
>>> +    /*
>>> +     * Assume little endian for now, userspace can tweak this for
>>> +     * legacy guest support.
>> You can leave it as a TODO for kernel (vdpa core limitation), but 
>> AFAIK there's nothing userspace needs to do to infer the endianness. 
>> IMHO it's the kernel's job to provide an abstraction rather than rely 
>> on userspace guessing it.
> we have discussed it in another thread, and this comment is suggested 
> by MST.
Can you provide the context or link? It shouldn't work like this, 
otherwise it is breaking uABI. E.g. how will a legacy/BE supporting 
kernel/device be backward compatible with older vdpa tool (which has 
knowledge of this endianness implication/assumption from day one)?

-Siwei

>>
>>> +     */
>>> +    val_u16 = __virtio16_to_cpu(true, config.status);
>>>         val_u16 = __virtio16_to_cpu(true, config.status);
>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_STATUS, val_u16))
>>>           return -EMSGSIZE;
>>>   -    val_u16 = __virtio16_to_cpu(true, config.mtu);
>>> -    if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
>>> -        return -EMSGSIZE;
>>> -
>>>       features_driver = vdev->config->get_driver_features(vdev);
>>>       if (nla_put_u64_64bit(msg, VDPA_ATTR_DEV_NEGOTIATED_FEATURES, 
>>> features_driver,
>>>                     VDPA_ATTR_PAD))
>>> @@ -846,7 +869,13 @@ static int vdpa_dev_net_config_fill(struct 
>>> vdpa_device *vdev, struct sk_buff *ms
>>>                     VDPA_ATTR_PAD))
>>>           return -EMSGSIZE;
>>>   -    return vdpa_dev_net_mq_config_fill(vdev, msg, 
>>> features_driver, &config);
>>> +    if (vdpa_dev_net_mac_config_fill(msg, features_device, &config))
>>> +        return -EMSGSIZE;
>>> +
>>> +    if (vdpa_dev_net_mtu_config_fill(msg, features_device, &config))
>>> +        return -EMSGSIZE;
>>> +
>>> +    return vdpa_dev_net_mq_config_fill(msg, features_device, &config);
>>>   }
>>>     static int
>>> @@ -914,6 +943,11 @@ static int vdpa_fill_stats_rec(struct 
>>> vdpa_device *vdev, struct sk_buff *msg,
>>>       }
>>>       vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>   +    /*
>>> +     * Assume little endian for now, userspace can tweak this for
>>> +     * legacy guest support.
>>> +     */
>>> +
>> Ditto.
> same as above
>
> Thanks
>>
>> Thanks,
>> -Siwei
>>>       max_vqp = __virtio16_to_cpu(true, config.max_virtqueue_pairs);
>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, max_vqp))
>>>           return -EMSGSIZE;
>>
>
Zhu, Lingshan Aug. 16, 2022, 9:08 a.m. UTC | #8
On 8/16/2022 3:58 PM, Si-Wei Liu wrote:
>
>
> On 8/15/2022 6:58 PM, Zhu, Lingshan wrote:
>>
>>
>> On 8/16/2022 7:32 AM, Si-Wei Liu wrote:
>>>
>>>
>>> On 8/15/2022 2:26 AM, Zhu Lingshan wrote:
>>>> Some fields of virtio-net device config space are
>>>> conditional on the feature bits, the spec says:
>>>>
>>>> "The mac address field always exists
>>>> (though is only valid if VIRTIO_NET_F_MAC is set)"
>>>>
>>>> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ
>>>> or VIRTIO_NET_F_RSS is set"
>>>>
>>>> "mtu only exists if VIRTIO_NET_F_MTU is set"
>>>>
>>>> so we should read MTU, MAC and MQ in the device config
>>>> space only when these feature bits are offered.
>>>>
>>>> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are
>>>> not set, the virtio device should have
>>>> one queue pair as default value, so when userspace querying queue 
>>>> pair numbers,
>>>> it should return mq=1 than zero.
>>>>
>>>> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read
>>>> MTU from the device config sapce.
>>>> RFC894 <A Standard for the Transmission of IP Datagrams over 
>>>> Ethernet Networks>
>>>> says:"The minimum length of the data field of a packet sent over an
>>>> Ethernet is 1500 octets, thus the maximum length of an IP datagram
>>>> sent over an Ethernet is 1500 octets.  Implementations are encouraged
>>>> to support full-length packets"
>>> Noted there's a typo in the above "The *maximum* length of the data 
>>> field of a packet sent over an Ethernet is 1500 octets ..." and the 
>>> RFC was written 1984.
>> the spec RFC894 says it is 1500, see <a 
>> href="https://urldefense.com/v3/__https://www.rfc-editor.org/rfc/rfc894.txt__;!!ACWV5N9M2RV99hQ!MdgxZjw5sp5Qz-GKfwT1IWcw_L4Jo1-UekuJPFz1UrG3YuqirKz7P9ksdJFh1vB6zHJ7z8Q04fpT0-9jWXCtlWM$">https://www.rfc-editor.org/rfc/rfc894.txt</a>
>>>
>>> Apparently that is no longer true with the introduction of Jumbo 
>>> size frame later in the 2000s. I'm not sure what is the point of 
>>> mention this ancient RFC. It doesn't say default MTU of any Ethernet 
>>> NIC/switch should be 1500 in either  case.
>> This could be a larger number for sure, we are trying to find out the 
>> min value for Ethernet here, to support 1500 octets, MTU should be 
>> 1500 at least, so I assume 1500 should be the default value for MTU
>>>
>>>>
>>>> virtio spec says:"The virtio network device is a virtual ethernet 
>>>> card",
>>> Right,
>>>> so the default MTU value should be 1500 for virtio-net.
>>> ... but it doesn't say the default is 1500. At least, not in 
>>> explicit way. Why it can't be 1492 or even lower? In practice, if 
>>> the network backend has a MTU higher than 1500, there's nothing 
>>> wrong for guest to configure default MTU more than 1500.
>> same as above
>>>
>>>>
>>>> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set,
>>>> the configuration space mac entry indicates the “physical” address
>>>> of the network card, otherwise the driver would typically
>>>> generate a random local MAC address." So there is no
>>>> default MAC address if VIRTIO_NET_F_MAC not set.
>>>>
>>>> This commits introduces functions vdpa_dev_net_mtu_config_fill()
>>>> and vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
>>>> It also fixes vdpa_dev_net_mq_config_fill() to report correct
>>>> MQ when _F_MQ is not present.
>>>>
>>>> These functions should check devices features than driver
>>>> features, and struct vdpa_device is not needed as a parameter
>>>>
>>>> The test & userspace tool output:
>>>>
>>>> Feature bit VIRTIO_NET_F_MTU, VIRTIO_NET_F_RSS, VIRTIO_NET_F_MQ
>>>> and VIRTIO_NET_F_MAC can be mask out by hardcode.
>>>>
>>>> However, it is challenging to "disable" the related fields
>>>> in the HW device config space, so let's just assume the values
>>>> are meaningless if the feature bits are not set.
>>>>
>>>> Before this change, when feature bits for RSS, MQ, MTU and MAC
>>>> are not set, iproute2 output:
>>>> $vdpa vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false mtu 
>>>> 1500
>>>>    negotiated_features
>>>>
>>>> without this commit, function vdpa_dev_net_config_fill()
>>>> reads all config space fields unconditionally, so let's
>>>> assume the MAC and MTU are meaningless, and it checks
>>>> MQ with driver_features, so we don't see max_vq_pairs.
>>>>
>>>> After applying this commit, when feature bits for
>>>> MQ, RSS, MAC and MTU are not set,iproute2 output:
>>>> $vdpa dev config show vdpa0
>>>> vdpa0: link up link_announce false max_vq_pairs 1 mtu 1500
>>>>    negotiated_features
>>>>
>>>> As explained above:
>>>> Here is no MAC, because VIRTIO_NET_F_MAC is not set,
>>>> and there is no default value for MAC. It shows
>>>> max_vq_paris = 1 because even without MQ feature,
>>>> a functional virtio-net must have one queue pair.
>>>> mtu = 1500 is the default value as ethernet
>>>> required.
>>>>
>>>> This commit also add supplementary comments for
>>>> __virtio16_to_cpu(true, xxx) operations in
>>>> vdpa_dev_net_config_fill() and vdpa_fill_stats_rec()
>>>>
>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>> ---
>>>>   drivers/vdpa/vdpa.c | 60 
>>>> +++++++++++++++++++++++++++++++++++----------
>>>>   1 file changed, 47 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
>>>> index efb55a06e961..a74660b98979 100644
>>>> --- a/drivers/vdpa/vdpa.c
>>>> +++ b/drivers/vdpa/vdpa.c
>>>> @@ -801,19 +801,44 @@ static int vdpa_nl_cmd_dev_get_dumpit(struct 
>>>> sk_buff *msg, struct netlink_callba
>>>>       return msg->len;
>>>>   }
>>>>   -static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
>>>> -                       struct sk_buff *msg, u64 features,
>>>> +static int vdpa_dev_net_mq_config_fill(struct sk_buff *msg, u64 
>>>> features,
>>>>                          const struct virtio_net_config *config)
>>>>   {
>>>>       u16 val_u16;
>>>>   -    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0)
>>>> -        return 0;
>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0 &&
>>>> +        (features & BIT_ULL(VIRTIO_NET_F_RSS)) == 0)
>>>> +        val_u16 = 1;
>>>> +    else
>>>> +        val_u16 = __virtio16_to_cpu(true, 
>>>> config->max_virtqueue_pairs);
>>>>   -    val_u16 = le16_to_cpu(config->max_virtqueue_pairs);
>>>>       return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, val_u16);
>>>>   }
>>>>   +static int vdpa_dev_net_mtu_config_fill(struct sk_buff *msg, u64 
>>>> features,
>>>> +                    const struct virtio_net_config *config)
>>>> +{
>>>> +    u16 val_u16;
>>>> +
>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
>>>> +        val_u16 = 1500;
>>> As said, there's no virtio spec defined value for MTU. Please leave 
>>> this field out if feature VIRTIO_NET_F_MTU is not negotiated.
>> same as above
>>>> +    else
>>>> +        val_u16 = __virtio16_to_cpu(true, config->mtu);
>>>> +
>>>> +    return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16);
>>>> +}
>>>> +
>>>> +static int vdpa_dev_net_mac_config_fill(struct sk_buff *msg, u64 
>>>> features,
>>>> +                    const struct virtio_net_config *config)
>>>> +{
>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) == 0)
>>>> +        return 0;
>>>> +    else
>>>> +        return  nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
>>>> +                sizeof(config->mac), config->mac);
>>>> +}
>>>> +
>>>> +
>>>>   static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, 
>>>> struct sk_buff *msg)
>>>>   {
>>>>       struct virtio_net_config config = {};
>>>> @@ -822,18 +847,16 @@ static int vdpa_dev_net_config_fill(struct 
>>>> vdpa_device *vdev, struct sk_buff *ms
>>>>         vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>>   -    if (nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR, 
>>>> sizeof(config.mac),
>>>> -            config.mac))
>>>> -        return -EMSGSIZE;
>>>> +    /*
>>>> +     * Assume little endian for now, userspace can tweak this for
>>>> +     * legacy guest support.
>>> You can leave it as a TODO for kernel (vdpa core limitation), but 
>>> AFAIK there's nothing userspace needs to do to infer the endianness. 
>>> IMHO it's the kernel's job to provide an abstraction rather than 
>>> rely on userspace guessing it.
>> we have discussed it in another thread, and this comment is suggested 
>> by MST.
> Can you provide the context or link? It shouldn't work like this, 
> otherwise it is breaking uABI. E.g. how will a legacy/BE supporting 
> kernel/device be backward compatible with older vdpa tool (which has 
> knowledge of this endianness implication/assumption from day one)?
https://www.spinics.net/lists/netdev/msg837114.html

The challenge is that the status filed is virtio16, not le16, so 
le16_to_cpu(xxx) is wrong anyway. However we can not tell whether it is 
a LE or BE device from struct vdpa_device, so for most cases, we assume 
it is LE, and leave this comment.

Thanks
>
> -Siwei
>
>>>
>>>> +     */
>>>> +    val_u16 = __virtio16_to_cpu(true, config.status);
>>>>         val_u16 = __virtio16_to_cpu(true, config.status);
>>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_STATUS, val_u16))
>>>>           return -EMSGSIZE;
>>>>   -    val_u16 = __virtio16_to_cpu(true, config.mtu);
>>>> -    if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
>>>> -        return -EMSGSIZE;
>>>> -
>>>>       features_driver = vdev->config->get_driver_features(vdev);
>>>>       if (nla_put_u64_64bit(msg, VDPA_ATTR_DEV_NEGOTIATED_FEATURES, 
>>>> features_driver,
>>>>                     VDPA_ATTR_PAD))
>>>> @@ -846,7 +869,13 @@ static int vdpa_dev_net_config_fill(struct 
>>>> vdpa_device *vdev, struct sk_buff *ms
>>>>                     VDPA_ATTR_PAD))
>>>>           return -EMSGSIZE;
>>>>   -    return vdpa_dev_net_mq_config_fill(vdev, msg, 
>>>> features_driver, &config);
>>>> +    if (vdpa_dev_net_mac_config_fill(msg, features_device, &config))
>>>> +        return -EMSGSIZE;
>>>> +
>>>> +    if (vdpa_dev_net_mtu_config_fill(msg, features_device, &config))
>>>> +        return -EMSGSIZE;
>>>> +
>>>> +    return vdpa_dev_net_mq_config_fill(msg, features_device, 
>>>> &config);
>>>>   }
>>>>     static int
>>>> @@ -914,6 +943,11 @@ static int vdpa_fill_stats_rec(struct 
>>>> vdpa_device *vdev, struct sk_buff *msg,
>>>>       }
>>>>       vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>>   +    /*
>>>> +     * Assume little endian for now, userspace can tweak this for
>>>> +     * legacy guest support.
>>>> +     */
>>>> +
>>> Ditto.
>> same as above
>>
>> Thanks
>>>
>>> Thanks,
>>> -Siwei
>>>>       max_vqp = __virtio16_to_cpu(true, config.max_virtqueue_pairs);
>>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, max_vqp))
>>>>           return -EMSGSIZE;
>>>
>>
>
Parav Pandit Aug. 16, 2022, 9:02 p.m. UTC | #9
> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Tuesday, August 16, 2022 12:19 AM
> 
> 
> On 8/16/2022 10:32 AM, Parav Pandit wrote:
> >> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >> Sent: Monday, August 15, 2022 5:27 AM
> >>
> >> Some fields of virtio-net device config space are conditional on the
> >> feature bits, the spec says:
> >>
> >> "The mac address field always exists
> >> (though is only valid if VIRTIO_NET_F_MAC is set)"
> >>
> >> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ or
> >> VIRTIO_NET_F_RSS is set"
> >>
> >> "mtu only exists if VIRTIO_NET_F_MTU is set"
> >>
> >> so we should read MTU, MAC and MQ in the device config space only
> >> when these feature bits are offered.
> > Yes.
> >
> >> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are not set,
> the
> >> virtio device should have one queue pair as default value, so when
> >> userspace querying queue pair numbers, it should return mq=1 than zero.
> > No.
> > No need to treat mac and max_qps differently.
> > It is meaningless to differentiate when field exist/not-exists vs value
> valid/not valid.
> as we discussed before, MQ has a default value 1, to be a functional virtio-
> net device, while MAC has no default value, if no VIRTIO_NET_F_MAC set,
> the driver should generate a random MAC.
> >
> >> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read MTU from
> >> the device config sapce.
> >> RFC894 <A Standard for the Transmission of IP Datagrams over Ethernet
> >> Networks> says:"The minimum length of the data field of a packet sent
> >> Networks> over
> >> an Ethernet is 1500 octets, thus the maximum length of an IP datagram
> >> sent over an Ethernet is 1500 octets.  Implementations are encouraged
> >> to support full-length packets"
> > This line in the RFC 894 of 1984 is wrong.
> > Errata already exists for it at [1].
> >
> > [1] https://www.rfc-editor.org/errata_search.php?rfc=894&rec_status=0
> OK, so I think we should return nothing if _F_MTU not set, like handling the
> MAC
> >
> >> virtio spec says:"The virtio network device is a virtual ethernet
> >> card", so the default MTU value should be 1500 for virtio-net.
> >>
> > Practically I have seen 1500 and highe mtu.
> > And this derivation is not good of what should be the default mtu as above
> errata exists.
> >
> > And I see the code below why you need to work so hard to define a default
> value so that _MQ and _MTU can report default values.
> >
> > There is really no need for this complexity and such a long commit
> message.
> >
> > Can we please expose feature bits as-is and report config space field which
> are valid?
> >
> > User space will be querying both.
> I think MAC and MTU don't have default values, so return nothing if the
> feature bits not set, 

> for MQ, it is still max_vq_paris == 1 by default.

I have stressed enough to highlight the fact that we don’t want to start digging default/no default, valid/no-valid part of the spec.
I prefer kernel to reporting fields that _exists_ in the config space and are valid.
I will let MST to handle the maintenance nightmare that this kind of patch brings in without any visible gain to user space/orchestration apps.

A logic that can be easily build in user space, should be written in user space.
I conclude my thoughts here for this discussion.

I will let MST to decide how he prefers to proceed.

>
> >> +	if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
> >> +		val_u16 = 1500;
> >> +	else
> >> +		val_u16 = __virtio16_to_cpu(true, config->mtu);
> >> +
> > Need to work hard to find default values and that too turned out had
> errata.
> > There are more fields that doesn’t have default values.
> >
> > There is no point in kernel doing this guess work, that user space can figure
> out of what is valid/invalid.
> It's not guest work, when guest finds no feature bits set, it can decide what
> to do. 

Above code of doing 1500 was probably an honest attempt to find a legitimate default value, and we saw that it doesn’t work.
This is second example after _MQ that we both agree should not return default.

And there are more fields coming in this area.
Hence, I prefer to not avoid returning such defaults for MAC, MTU, MQ and rest all fields which doesn’t _exists_.

I will let MST to decide how he prefers to proceed for every field to come next.
Thanks.
Michael S. Tsirkin Aug. 16, 2022, 9:09 p.m. UTC | #10
On Tue, Aug 16, 2022 at 09:02:17PM +0000, Parav Pandit wrote:
> 
> > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > Sent: Tuesday, August 16, 2022 12:19 AM
> > 
> > 
> > On 8/16/2022 10:32 AM, Parav Pandit wrote:
> > >> From: Zhu Lingshan <lingshan.zhu@intel.com>
> > >> Sent: Monday, August 15, 2022 5:27 AM
> > >>
> > >> Some fields of virtio-net device config space are conditional on the
> > >> feature bits, the spec says:
> > >>
> > >> "The mac address field always exists
> > >> (though is only valid if VIRTIO_NET_F_MAC is set)"
> > >>
> > >> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ or
> > >> VIRTIO_NET_F_RSS is set"
> > >>
> > >> "mtu only exists if VIRTIO_NET_F_MTU is set"
> > >>
> > >> so we should read MTU, MAC and MQ in the device config space only
> > >> when these feature bits are offered.
> > > Yes.
> > >
> > >> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are not set,
> > the
> > >> virtio device should have one queue pair as default value, so when
> > >> userspace querying queue pair numbers, it should return mq=1 than zero.
> > > No.
> > > No need to treat mac and max_qps differently.
> > > It is meaningless to differentiate when field exist/not-exists vs value
> > valid/not valid.
> > as we discussed before, MQ has a default value 1, to be a functional virtio-
> > net device, while MAC has no default value, if no VIRTIO_NET_F_MAC set,
> > the driver should generate a random MAC.
> > >
> > >> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read MTU from
> > >> the device config sapce.
> > >> RFC894 <A Standard for the Transmission of IP Datagrams over Ethernet
> > >> Networks> says:"The minimum length of the data field of a packet sent
> > >> Networks> over
> > >> an Ethernet is 1500 octets, thus the maximum length of an IP datagram
> > >> sent over an Ethernet is 1500 octets.  Implementations are encouraged
> > >> to support full-length packets"
> > > This line in the RFC 894 of 1984 is wrong.
> > > Errata already exists for it at [1].
> > >
> > > [1] https://www.rfc-editor.org/errata_search.php?rfc=894&rec_status=0
> > OK, so I think we should return nothing if _F_MTU not set, like handling the
> > MAC
> > >
> > >> virtio spec says:"The virtio network device is a virtual ethernet
> > >> card", so the default MTU value should be 1500 for virtio-net.
> > >>
> > > Practically I have seen 1500 and highe mtu.
> > > And this derivation is not good of what should be the default mtu as above
> > errata exists.
> > >
> > > And I see the code below why you need to work so hard to define a default
> > value so that _MQ and _MTU can report default values.
> > >
> > > There is really no need for this complexity and such a long commit
> > message.
> > >
> > > Can we please expose feature bits as-is and report config space field which
> > are valid?
> > >
> > > User space will be querying both.
> > I think MAC and MTU don't have default values, so return nothing if the
> > feature bits not set, 
> 
> > for MQ, it is still max_vq_paris == 1 by default.
> 
> I have stressed enough to highlight the fact that we don’t want to start digging default/no default, valid/no-valid part of the spec.
> I prefer kernel to reporting fields that _exists_ in the config space and are valid.
> I will let MST to handle the maintenance nightmare that this kind of patch brings in without any visible gain to user space/orchestration apps.
> 
> A logic that can be easily build in user space, should be written in user space.
> I conclude my thoughts here for this discussion.
> 
> I will let MST to decide how he prefers to proceed.
> 
> >
> > >> +	if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
> > >> +		val_u16 = 1500;
> > >> +	else
> > >> +		val_u16 = __virtio16_to_cpu(true, config->mtu);
> > >> +
> > > Need to work hard to find default values and that too turned out had
> > errata.
> > > There are more fields that doesn’t have default values.
> > >
> > > There is no point in kernel doing this guess work, that user space can figure
> > out of what is valid/invalid.
> > It's not guest work, when guest finds no feature bits set, it can decide what
> > to do. 
> 
> Above code of doing 1500 was probably an honest attempt to find a legitimate default value, and we saw that it doesn’t work.
> This is second example after _MQ that we both agree should not return default.
> 
> And there are more fields coming in this area.
> Hence, I prefer to not avoid returning such defaults for MAC, MTU, MQ and rest all fields which doesn’t _exists_.
> 
> I will let MST to decide how he prefers to proceed for every field to come next.
> Thanks.
> 


If MTU does not return a value without _F_MTU, and MAC does not return
a value without _F_MAC then IMO yes, number of queues should not return
a value without _F_MQ.
Si-Wei Liu Aug. 16, 2022, 11:14 p.m. UTC | #11
On 8/16/2022 2:08 AM, Zhu, Lingshan wrote:
>
>
> On 8/16/2022 3:58 PM, Si-Wei Liu wrote:
>>
>>
>> On 8/15/2022 6:58 PM, Zhu, Lingshan wrote:
>>>
>>>
>>> On 8/16/2022 7:32 AM, Si-Wei Liu wrote:
>>>>
>>>>
>>>> On 8/15/2022 2:26 AM, Zhu Lingshan wrote:
>>>>> Some fields of virtio-net device config space are
>>>>> conditional on the feature bits, the spec says:
>>>>>
>>>>> "The mac address field always exists
>>>>> (though is only valid if VIRTIO_NET_F_MAC is set)"
>>>>>
>>>>> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ
>>>>> or VIRTIO_NET_F_RSS is set"
>>>>>
>>>>> "mtu only exists if VIRTIO_NET_F_MTU is set"
>>>>>
>>>>> so we should read MTU, MAC and MQ in the device config
>>>>> space only when these feature bits are offered.
>>>>>
>>>>> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are
>>>>> not set, the virtio device should have
>>>>> one queue pair as default value, so when userspace querying queue 
>>>>> pair numbers,
>>>>> it should return mq=1 than zero.
>>>>>
>>>>> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read
>>>>> MTU from the device config sapce.
>>>>> RFC894 <A Standard for the Transmission of IP Datagrams over 
>>>>> Ethernet Networks>
>>>>> says:"The minimum length of the data field of a packet sent over an
>>>>> Ethernet is 1500 octets, thus the maximum length of an IP datagram
>>>>> sent over an Ethernet is 1500 octets.  Implementations are encouraged
>>>>> to support full-length packets"
>>>> Noted there's a typo in the above "The *maximum* length of the data 
>>>> field of a packet sent over an Ethernet is 1500 octets ..." and the 
>>>> RFC was written 1984.
>>> the spec RFC894 says it is 1500, see <a 
>>> href="https://urldefense.com/v3/__https://www.rfc-editor.org/rfc/rfc894.txt__;!!ACWV5N9M2RV99hQ!MdgxZjw5sp5Qz-GKfwT1IWcw_L4Jo1-UekuJPFz1UrG3YuqirKz7P9ksdJFh1vB6zHJ7z8Q04fpT0-9jWXCtlWM$">https://urldefense.com/v3/__https://www.rfc-editor.org/rfc/rfc894.txt__;!!ACWV5N9M2RV99hQ!KVwfun0b1Q59Ajp6O7JrB-BuEBSLyQ9e95oGq1cVG_sQIPDL0whI5frx1EGoQFznmm67RsEeJTrUdfYrmZPRFaM$ 
>>> </a>
>>>>
>>>> Apparently that is no longer true with the introduction of Jumbo 
>>>> size frame later in the 2000s. I'm not sure what is the point of 
>>>> mention this ancient RFC. It doesn't say default MTU of any 
>>>> Ethernet NIC/switch should be 1500 in either  case.
>>> This could be a larger number for sure, we are trying to find out 
>>> the min value for Ethernet here, to support 1500 octets, MTU should 
>>> be 1500 at least, so I assume 1500 should be the default value for MTU
>>>>
>>>>>
>>>>> virtio spec says:"The virtio network device is a virtual ethernet 
>>>>> card",
>>>> Right,
>>>>> so the default MTU value should be 1500 for virtio-net.
>>>> ... but it doesn't say the default is 1500. At least, not in 
>>>> explicit way. Why it can't be 1492 or even lower? In practice, if 
>>>> the network backend has a MTU higher than 1500, there's nothing 
>>>> wrong for guest to configure default MTU more than 1500.
>>> same as above
>>>>
>>>>>
>>>>> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set,
>>>>> the configuration space mac entry indicates the “physical” address
>>>>> of the network card, otherwise the driver would typically
>>>>> generate a random local MAC address." So there is no
>>>>> default MAC address if VIRTIO_NET_F_MAC not set.
>>>>>
>>>>> This commits introduces functions vdpa_dev_net_mtu_config_fill()
>>>>> and vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
>>>>> It also fixes vdpa_dev_net_mq_config_fill() to report correct
>>>>> MQ when _F_MQ is not present.
>>>>>
>>>>> These functions should check devices features than driver
>>>>> features, and struct vdpa_device is not needed as a parameter
>>>>>
>>>>> The test & userspace tool output:
>>>>>
>>>>> Feature bit VIRTIO_NET_F_MTU, VIRTIO_NET_F_RSS, VIRTIO_NET_F_MQ
>>>>> and VIRTIO_NET_F_MAC can be mask out by hardcode.
>>>>>
>>>>> However, it is challenging to "disable" the related fields
>>>>> in the HW device config space, so let's just assume the values
>>>>> are meaningless if the feature bits are not set.
>>>>>
>>>>> Before this change, when feature bits for RSS, MQ, MTU and MAC
>>>>> are not set, iproute2 output:
>>>>> $vdpa vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false mtu 
>>>>> 1500
>>>>>    negotiated_features
>>>>>
>>>>> without this commit, function vdpa_dev_net_config_fill()
>>>>> reads all config space fields unconditionally, so let's
>>>>> assume the MAC and MTU are meaningless, and it checks
>>>>> MQ with driver_features, so we don't see max_vq_pairs.
>>>>>
>>>>> After applying this commit, when feature bits for
>>>>> MQ, RSS, MAC and MTU are not set,iproute2 output:
>>>>> $vdpa dev config show vdpa0
>>>>> vdpa0: link up link_announce false max_vq_pairs 1 mtu 1500
>>>>>    negotiated_features
>>>>>
>>>>> As explained above:
>>>>> Here is no MAC, because VIRTIO_NET_F_MAC is not set,
>>>>> and there is no default value for MAC. It shows
>>>>> max_vq_paris = 1 because even without MQ feature,
>>>>> a functional virtio-net must have one queue pair.
>>>>> mtu = 1500 is the default value as ethernet
>>>>> required.
>>>>>
>>>>> This commit also add supplementary comments for
>>>>> __virtio16_to_cpu(true, xxx) operations in
>>>>> vdpa_dev_net_config_fill() and vdpa_fill_stats_rec()
>>>>>
>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>> ---
>>>>>   drivers/vdpa/vdpa.c | 60 
>>>>> +++++++++++++++++++++++++++++++++++----------
>>>>>   1 file changed, 47 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
>>>>> index efb55a06e961..a74660b98979 100644
>>>>> --- a/drivers/vdpa/vdpa.c
>>>>> +++ b/drivers/vdpa/vdpa.c
>>>>> @@ -801,19 +801,44 @@ static int vdpa_nl_cmd_dev_get_dumpit(struct 
>>>>> sk_buff *msg, struct netlink_callba
>>>>>       return msg->len;
>>>>>   }
>>>>>   -static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
>>>>> -                       struct sk_buff *msg, u64 features,
>>>>> +static int vdpa_dev_net_mq_config_fill(struct sk_buff *msg, u64 
>>>>> features,
>>>>>                          const struct virtio_net_config *config)
>>>>>   {
>>>>>       u16 val_u16;
>>>>>   -    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0)
>>>>> -        return 0;
>>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0 &&
>>>>> +        (features & BIT_ULL(VIRTIO_NET_F_RSS)) == 0)
>>>>> +        val_u16 = 1;
>>>>> +    else
>>>>> +        val_u16 = __virtio16_to_cpu(true, 
>>>>> config->max_virtqueue_pairs);
>>>>>   -    val_u16 = le16_to_cpu(config->max_virtqueue_pairs);
>>>>>       return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, 
>>>>> val_u16);
>>>>>   }
>>>>>   +static int vdpa_dev_net_mtu_config_fill(struct sk_buff *msg, 
>>>>> u64 features,
>>>>> +                    const struct virtio_net_config *config)
>>>>> +{
>>>>> +    u16 val_u16;
>>>>> +
>>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
>>>>> +        val_u16 = 1500;
>>>> As said, there's no virtio spec defined value for MTU. Please leave 
>>>> this field out if feature VIRTIO_NET_F_MTU is not negotiated.
>>> same as above
>>>>> +    else
>>>>> +        val_u16 = __virtio16_to_cpu(true, config->mtu);
>>>>> +
>>>>> +    return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16);
>>>>> +}
>>>>> +
>>>>> +static int vdpa_dev_net_mac_config_fill(struct sk_buff *msg, u64 
>>>>> features,
>>>>> +                    const struct virtio_net_config *config)
>>>>> +{
>>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) == 0)
>>>>> +        return 0;
>>>>> +    else
>>>>> +        return  nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
>>>>> +                sizeof(config->mac), config->mac);
>>>>> +}
>>>>> +
>>>>> +
>>>>>   static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, 
>>>>> struct sk_buff *msg)
>>>>>   {
>>>>>       struct virtio_net_config config = {};
>>>>> @@ -822,18 +847,16 @@ static int vdpa_dev_net_config_fill(struct 
>>>>> vdpa_device *vdev, struct sk_buff *ms
>>>>>         vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>>>   -    if (nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR, 
>>>>> sizeof(config.mac),
>>>>> -            config.mac))
>>>>> -        return -EMSGSIZE;
>>>>> +    /*
>>>>> +     * Assume little endian for now, userspace can tweak this for
>>>>> +     * legacy guest support.
>>>> You can leave it as a TODO for kernel (vdpa core limitation), but 
>>>> AFAIK there's nothing userspace needs to do to infer the 
>>>> endianness. IMHO it's the kernel's job to provide an abstraction 
>>>> rather than rely on userspace guessing it.
>>> we have discussed it in another thread, and this comment is 
>>> suggested by MST.
>> Can you provide the context or link? It shouldn't work like this, 
>> otherwise it is breaking uABI. E.g. how will a legacy/BE supporting 
>> kernel/device be backward compatible with older vdpa tool (which has 
>> knowledge of this endianness implication/assumption from day one)?
> https://urldefense.com/v3/__https://www.spinics.net/lists/netdev/msg837114.html__;!!ACWV5N9M2RV99hQ!KVwfun0b1Q59Ajp6O7JrB-BuEBSLyQ9e95oGq1cVG_sQIPDL0whI5frx1EGoQFznmm67RsEeJTrUdfYrGq7Vwjk$ 
>
> The challenge is that the status filed is virtio16, not le16, so 
> le16_to_cpu(xxx) is wrong anyway. However we can not tell whether it 
> is a LE or BE device from struct vdpa_device, so for most cases, we 
> assume it is LE, and leave this comment.
While the fix is fine, the comment is misleading in giving readers false 
hope. This is in vdpa_dev_net_config_fill() the vdpa tool query path, 
instead of calls from the VMM dealing with vhost/virtio plumbing 
specifics. I think what's missing today in vdpa core is the detection of 
guest type (legacy, transitional, or modern) regarding endianness 
through F_VERSION_1 and legacy interface access, the latter of which 
would need some assistance from VMM for sure. However, the presence of 
information via the vdpa tool query is totally orthogonal. I don't get a 
good reason for why it has to couple with endianness. How vdpa tool 
users space is supposed to tweak it? I don't get it...

-Siwei


>
> Thanks
>>
>> -Siwei
>>
>>>>
>>>>> +     */
>>>>> +    val_u16 = __virtio16_to_cpu(true, config.status);
>>>>>         val_u16 = __virtio16_to_cpu(true, config.status);
>>>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_STATUS, val_u16))
>>>>>           return -EMSGSIZE;
>>>>>   -    val_u16 = __virtio16_to_cpu(true, config.mtu);
>>>>> -    if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
>>>>> -        return -EMSGSIZE;
>>>>> -
>>>>>       features_driver = vdev->config->get_driver_features(vdev);
>>>>>       if (nla_put_u64_64bit(msg, 
>>>>> VDPA_ATTR_DEV_NEGOTIATED_FEATURES, features_driver,
>>>>>                     VDPA_ATTR_PAD))
>>>>> @@ -846,7 +869,13 @@ static int vdpa_dev_net_config_fill(struct 
>>>>> vdpa_device *vdev, struct sk_buff *ms
>>>>>                     VDPA_ATTR_PAD))
>>>>>           return -EMSGSIZE;
>>>>>   -    return vdpa_dev_net_mq_config_fill(vdev, msg, 
>>>>> features_driver, &config);
>>>>> +    if (vdpa_dev_net_mac_config_fill(msg, features_device, &config))
>>>>> +        return -EMSGSIZE;
>>>>> +
>>>>> +    if (vdpa_dev_net_mtu_config_fill(msg, features_device, &config))
>>>>> +        return -EMSGSIZE;
>>>>> +
>>>>> +    return vdpa_dev_net_mq_config_fill(msg, features_device, 
>>>>> &config);
>>>>>   }
>>>>>     static int
>>>>> @@ -914,6 +943,11 @@ static int vdpa_fill_stats_rec(struct 
>>>>> vdpa_device *vdev, struct sk_buff *msg,
>>>>>       }
>>>>>       vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>>>   +    /*
>>>>> +     * Assume little endian for now, userspace can tweak this for
>>>>> +     * legacy guest support.
>>>>> +     */
>>>>> +
>>>> Ditto.
>>> same as above
>>>
>>> Thanks
>>>>
>>>> Thanks,
>>>> -Siwei
>>>>>       max_vqp = __virtio16_to_cpu(true, config.max_virtqueue_pairs);
>>>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, max_vqp))
>>>>>           return -EMSGSIZE;
>>>>
>>>
>>
>
Zhu, Lingshan Aug. 17, 2022, 2:03 a.m. UTC | #12
On 8/17/2022 5:09 AM, Michael S. Tsirkin wrote:
> On Tue, Aug 16, 2022 at 09:02:17PM +0000, Parav Pandit wrote:
>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>> Sent: Tuesday, August 16, 2022 12:19 AM
>>>
>>>
>>> On 8/16/2022 10:32 AM, Parav Pandit wrote:
>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>> Sent: Monday, August 15, 2022 5:27 AM
>>>>>
>>>>> Some fields of virtio-net device config space are conditional on the
>>>>> feature bits, the spec says:
>>>>>
>>>>> "The mac address field always exists
>>>>> (though is only valid if VIRTIO_NET_F_MAC is set)"
>>>>>
>>>>> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ or
>>>>> VIRTIO_NET_F_RSS is set"
>>>>>
>>>>> "mtu only exists if VIRTIO_NET_F_MTU is set"
>>>>>
>>>>> so we should read MTU, MAC and MQ in the device config space only
>>>>> when these feature bits are offered.
>>>> Yes.
>>>>
>>>>> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are not set,
>>> the
>>>>> virtio device should have one queue pair as default value, so when
>>>>> userspace querying queue pair numbers, it should return mq=1 than zero.
>>>> No.
>>>> No need to treat mac and max_qps differently.
>>>> It is meaningless to differentiate when field exist/not-exists vs value
>>> valid/not valid.
>>> as we discussed before, MQ has a default value 1, to be a functional virtio-
>>> net device, while MAC has no default value, if no VIRTIO_NET_F_MAC set,
>>> the driver should generate a random MAC.
>>>>> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read MTU from
>>>>> the device config sapce.
>>>>> RFC894 <A Standard for the Transmission of IP Datagrams over Ethernet
>>>>> Networks> says:"The minimum length of the data field of a packet sent
>>>>> Networks> over
>>>>> an Ethernet is 1500 octets, thus the maximum length of an IP datagram
>>>>> sent over an Ethernet is 1500 octets.  Implementations are encouraged
>>>>> to support full-length packets"
>>>> This line in the RFC 894 of 1984 is wrong.
>>>> Errata already exists for it at [1].
>>>>
>>>> [1] https://www.rfc-editor.org/errata_search.php?rfc=894&rec_status=0
>>> OK, so I think we should return nothing if _F_MTU not set, like handling the
>>> MAC
>>>>> virtio spec says:"The virtio network device is a virtual ethernet
>>>>> card", so the default MTU value should be 1500 for virtio-net.
>>>>>
>>>> Practically I have seen 1500 and highe mtu.
>>>> And this derivation is not good of what should be the default mtu as above
>>> errata exists.
>>>> And I see the code below why you need to work so hard to define a default
>>> value so that _MQ and _MTU can report default values.
>>>> There is really no need for this complexity and such a long commit
>>> message.
>>>> Can we please expose feature bits as-is and report config space field which
>>> are valid?
>>>> User space will be querying both.
>>> I think MAC and MTU don't have default values, so return nothing if the
>>> feature bits not set,
>>> for MQ, it is still max_vq_paris == 1 by default.
>> I have stressed enough to highlight the fact that we don’t want to start digging default/no default, valid/no-valid part of the spec.
>> I prefer kernel to reporting fields that _exists_ in the config space and are valid.
>> I will let MST to handle the maintenance nightmare that this kind of patch brings in without any visible gain to user space/orchestration apps.
>>
>> A logic that can be easily build in user space, should be written in user space.
>> I conclude my thoughts here for this discussion.
>>
>> I will let MST to decide how he prefers to proceed.
>>
>>>>> +	if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
>>>>> +		val_u16 = 1500;
>>>>> +	else
>>>>> +		val_u16 = __virtio16_to_cpu(true, config->mtu);
>>>>> +
>>>> Need to work hard to find default values and that too turned out had
>>> errata.
>>>> There are more fields that doesn’t have default values.
>>>>
>>>> There is no point in kernel doing this guess work, that user space can figure
>>> out of what is valid/invalid.
>>> It's not guest work, when guest finds no feature bits set, it can decide what
>>> to do.
>> Above code of doing 1500 was probably an honest attempt to find a legitimate default value, and we saw that it doesn’t work.
>> This is second example after _MQ that we both agree should not return default.
>>
>> And there are more fields coming in this area.
>> Hence, I prefer to not avoid returning such defaults for MAC, MTU, MQ and rest all fields which doesn’t _exists_.
>>
>> I will let MST to decide how he prefers to proceed for every field to come next.
>> Thanks.
>>
>
> If MTU does not return a value without _F_MTU, and MAC does not return
> a value without _F_MAC then IMO yes, number of queues should not return
> a value without _F_MQ.
sure I can do this, but may I ask whether it is a final decision, I 
remember you supported max_queue_paris = 1 without _F_MQ before

Thanks
>
>
Zhu, Lingshan Aug. 17, 2022, 2:14 a.m. UTC | #13
On 8/17/2022 7:14 AM, Si-Wei Liu wrote:
>
>
> On 8/16/2022 2:08 AM, Zhu, Lingshan wrote:
>>
>>
>> On 8/16/2022 3:58 PM, Si-Wei Liu wrote:
>>>
>>>
>>> On 8/15/2022 6:58 PM, Zhu, Lingshan wrote:
>>>>
>>>>
>>>> On 8/16/2022 7:32 AM, Si-Wei Liu wrote:
>>>>>
>>>>>
>>>>> On 8/15/2022 2:26 AM, Zhu Lingshan wrote:
>>>>>> Some fields of virtio-net device config space are
>>>>>> conditional on the feature bits, the spec says:
>>>>>>
>>>>>> "The mac address field always exists
>>>>>> (though is only valid if VIRTIO_NET_F_MAC is set)"
>>>>>>
>>>>>> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ
>>>>>> or VIRTIO_NET_F_RSS is set"
>>>>>>
>>>>>> "mtu only exists if VIRTIO_NET_F_MTU is set"
>>>>>>
>>>>>> so we should read MTU, MAC and MQ in the device config
>>>>>> space only when these feature bits are offered.
>>>>>>
>>>>>> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are
>>>>>> not set, the virtio device should have
>>>>>> one queue pair as default value, so when userspace querying queue 
>>>>>> pair numbers,
>>>>>> it should return mq=1 than zero.
>>>>>>
>>>>>> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read
>>>>>> MTU from the device config sapce.
>>>>>> RFC894 <A Standard for the Transmission of IP Datagrams over 
>>>>>> Ethernet Networks>
>>>>>> says:"The minimum length of the data field of a packet sent over an
>>>>>> Ethernet is 1500 octets, thus the maximum length of an IP datagram
>>>>>> sent over an Ethernet is 1500 octets.  Implementations are 
>>>>>> encouraged
>>>>>> to support full-length packets"
>>>>> Noted there's a typo in the above "The *maximum* length of the 
>>>>> data field of a packet sent over an Ethernet is 1500 octets ..." 
>>>>> and the RFC was written 1984.
>>>> the spec RFC894 says it is 1500, see <a 
>>>> href="https://urldefense.com/v3/__https://www.rfc-editor.org/rfc/rfc894.txt__;!!ACWV5N9M2RV99hQ!MdgxZjw5sp5Qz-GKfwT1IWcw_L4Jo1-UekuJPFz1UrG3YuqirKz7P9ksdJFh1vB6zHJ7z8Q04fpT0-9jWXCtlWM$">https://urldefense.com/v3/__https://www.rfc-editor.org/rfc/rfc894.txt__;!!ACWV5N9M2RV99hQ!KVwfun0b1Q59Ajp6O7JrB-BuEBSLyQ9e95oGq1cVG_sQIPDL0whI5frx1EGoQFznmm67RsEeJTrUdfYrmZPRFaM$ 
>>>> </a>
>>>>>
>>>>> Apparently that is no longer true with the introduction of Jumbo 
>>>>> size frame later in the 2000s. I'm not sure what is the point of 
>>>>> mention this ancient RFC. It doesn't say default MTU of any 
>>>>> Ethernet NIC/switch should be 1500 in either  case.
>>>> This could be a larger number for sure, we are trying to find out 
>>>> the min value for Ethernet here, to support 1500 octets, MTU should 
>>>> be 1500 at least, so I assume 1500 should be the default value for MTU
>>>>>
>>>>>>
>>>>>> virtio spec says:"The virtio network device is a virtual ethernet 
>>>>>> card",
>>>>> Right,
>>>>>> so the default MTU value should be 1500 for virtio-net.
>>>>> ... but it doesn't say the default is 1500. At least, not in 
>>>>> explicit way. Why it can't be 1492 or even lower? In practice, if 
>>>>> the network backend has a MTU higher than 1500, there's nothing 
>>>>> wrong for guest to configure default MTU more than 1500.
>>>> same as above
>>>>>
>>>>>>
>>>>>> For MAC, the spec says:"If the VIRTIO_NET_F_MAC feature bit is set,
>>>>>> the configuration space mac entry indicates the “physical” address
>>>>>> of the network card, otherwise the driver would typically
>>>>>> generate a random local MAC address." So there is no
>>>>>> default MAC address if VIRTIO_NET_F_MAC not set.
>>>>>>
>>>>>> This commits introduces functions vdpa_dev_net_mtu_config_fill()
>>>>>> and vdpa_dev_net_mac_config_fill() to fill MTU and MAC.
>>>>>> It also fixes vdpa_dev_net_mq_config_fill() to report correct
>>>>>> MQ when _F_MQ is not present.
>>>>>>
>>>>>> These functions should check devices features than driver
>>>>>> features, and struct vdpa_device is not needed as a parameter
>>>>>>
>>>>>> The test & userspace tool output:
>>>>>>
>>>>>> Feature bit VIRTIO_NET_F_MTU, VIRTIO_NET_F_RSS, VIRTIO_NET_F_MQ
>>>>>> and VIRTIO_NET_F_MAC can be mask out by hardcode.
>>>>>>
>>>>>> However, it is challenging to "disable" the related fields
>>>>>> in the HW device config space, so let's just assume the values
>>>>>> are meaningless if the feature bits are not set.
>>>>>>
>>>>>> Before this change, when feature bits for RSS, MQ, MTU and MAC
>>>>>> are not set, iproute2 output:
>>>>>> $vdpa vdpa0: mac 00:e8:ca:11:be:05 link up link_announce false 
>>>>>> mtu 1500
>>>>>>    negotiated_features
>>>>>>
>>>>>> without this commit, function vdpa_dev_net_config_fill()
>>>>>> reads all config space fields unconditionally, so let's
>>>>>> assume the MAC and MTU are meaningless, and it checks
>>>>>> MQ with driver_features, so we don't see max_vq_pairs.
>>>>>>
>>>>>> After applying this commit, when feature bits for
>>>>>> MQ, RSS, MAC and MTU are not set,iproute2 output:
>>>>>> $vdpa dev config show vdpa0
>>>>>> vdpa0: link up link_announce false max_vq_pairs 1 mtu 1500
>>>>>>    negotiated_features
>>>>>>
>>>>>> As explained above:
>>>>>> Here is no MAC, because VIRTIO_NET_F_MAC is not set,
>>>>>> and there is no default value for MAC. It shows
>>>>>> max_vq_paris = 1 because even without MQ feature,
>>>>>> a functional virtio-net must have one queue pair.
>>>>>> mtu = 1500 is the default value as ethernet
>>>>>> required.
>>>>>>
>>>>>> This commit also add supplementary comments for
>>>>>> __virtio16_to_cpu(true, xxx) operations in
>>>>>> vdpa_dev_net_config_fill() and vdpa_fill_stats_rec()
>>>>>>
>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>> ---
>>>>>>   drivers/vdpa/vdpa.c | 60 
>>>>>> +++++++++++++++++++++++++++++++++++----------
>>>>>>   1 file changed, 47 insertions(+), 13 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
>>>>>> index efb55a06e961..a74660b98979 100644
>>>>>> --- a/drivers/vdpa/vdpa.c
>>>>>> +++ b/drivers/vdpa/vdpa.c
>>>>>> @@ -801,19 +801,44 @@ static int 
>>>>>> vdpa_nl_cmd_dev_get_dumpit(struct sk_buff *msg, struct 
>>>>>> netlink_callba
>>>>>>       return msg->len;
>>>>>>   }
>>>>>>   -static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
>>>>>> -                       struct sk_buff *msg, u64 features,
>>>>>> +static int vdpa_dev_net_mq_config_fill(struct sk_buff *msg, u64 
>>>>>> features,
>>>>>>                          const struct virtio_net_config *config)
>>>>>>   {
>>>>>>       u16 val_u16;
>>>>>>   -    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0)
>>>>>> -        return 0;
>>>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0 &&
>>>>>> +        (features & BIT_ULL(VIRTIO_NET_F_RSS)) == 0)
>>>>>> +        val_u16 = 1;
>>>>>> +    else
>>>>>> +        val_u16 = __virtio16_to_cpu(true, 
>>>>>> config->max_virtqueue_pairs);
>>>>>>   -    val_u16 = le16_to_cpu(config->max_virtqueue_pairs);
>>>>>>       return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, 
>>>>>> val_u16);
>>>>>>   }
>>>>>>   +static int vdpa_dev_net_mtu_config_fill(struct sk_buff *msg, 
>>>>>> u64 features,
>>>>>> +                    const struct virtio_net_config *config)
>>>>>> +{
>>>>>> +    u16 val_u16;
>>>>>> +
>>>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
>>>>>> +        val_u16 = 1500;
>>>>> As said, there's no virtio spec defined value for MTU. Please 
>>>>> leave this field out if feature VIRTIO_NET_F_MTU is not negotiated.
>>>> same as above
>>>>>> +    else
>>>>>> +        val_u16 = __virtio16_to_cpu(true, config->mtu);
>>>>>> +
>>>>>> +    return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16);
>>>>>> +}
>>>>>> +
>>>>>> +static int vdpa_dev_net_mac_config_fill(struct sk_buff *msg, u64 
>>>>>> features,
>>>>>> +                    const struct virtio_net_config *config)
>>>>>> +{
>>>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) == 0)
>>>>>> +        return 0;
>>>>>> +    else
>>>>>> +        return  nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
>>>>>> +                sizeof(config->mac), config->mac);
>>>>>> +}
>>>>>> +
>>>>>> +
>>>>>>   static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, 
>>>>>> struct sk_buff *msg)
>>>>>>   {
>>>>>>       struct virtio_net_config config = {};
>>>>>> @@ -822,18 +847,16 @@ static int vdpa_dev_net_config_fill(struct 
>>>>>> vdpa_device *vdev, struct sk_buff *ms
>>>>>>         vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>>>>   -    if (nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR, 
>>>>>> sizeof(config.mac),
>>>>>> -            config.mac))
>>>>>> -        return -EMSGSIZE;
>>>>>> +    /*
>>>>>> +     * Assume little endian for now, userspace can tweak this for
>>>>>> +     * legacy guest support.
>>>>> You can leave it as a TODO for kernel (vdpa core limitation), but 
>>>>> AFAIK there's nothing userspace needs to do to infer the 
>>>>> endianness. IMHO it's the kernel's job to provide an abstraction 
>>>>> rather than rely on userspace guessing it.
>>>> we have discussed it in another thread, and this comment is 
>>>> suggested by MST.
>>> Can you provide the context or link? It shouldn't work like this, 
>>> otherwise it is breaking uABI. E.g. how will a legacy/BE supporting 
>>> kernel/device be backward compatible with older vdpa tool (which has 
>>> knowledge of this endianness implication/assumption from day one)?
>> https://urldefense.com/v3/__https://www.spinics.net/lists/netdev/msg837114.html__;!!ACWV5N9M2RV99hQ!KVwfun0b1Q59Ajp6O7JrB-BuEBSLyQ9e95oGq1cVG_sQIPDL0whI5frx1EGoQFznmm67RsEeJTrUdfYrGq7Vwjk$ 
>>
>> The challenge is that the status filed is virtio16, not le16, so 
>> le16_to_cpu(xxx) is wrong anyway. However we can not tell whether it 
>> is a LE or BE device from struct vdpa_device, so for most cases, we 
>> assume it is LE, and leave this comment.
> While the fix is fine, the comment is misleading in giving readers 
> false hope. This is in vdpa_dev_net_config_fill() the vdpa tool query 
> path, instead of calls from the VMM dealing with vhost/virtio plumbing 
> specifics. I think what's missing today in vdpa core is the detection 
> of guest type (legacy, transitional, or modern) regarding endianness 
> through F_VERSION_1 and legacy interface access, the latter of which 
> would need some assistance from VMM for sure. However, the presence of 
> information via the vdpa tool query is totally orthogonal. I don't get 
> a good reason for why it has to couple with endianness. How vdpa tool 
> users space is supposed to tweak it? I don't get it...
Yes it is a little messy, and we can not check _F_VERSION_1 because of 
transitional devices, so maybe this is the best we can do for now
>
> -Siwei
>
>
>>
>> Thanks
>>>
>>> -Siwei
>>>
>>>>>
>>>>>> +     */
>>>>>> +    val_u16 = __virtio16_to_cpu(true, config.status);
>>>>>>         val_u16 = __virtio16_to_cpu(true, config.status);
>>>>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_STATUS, val_u16))
>>>>>>           return -EMSGSIZE;
>>>>>>   -    val_u16 = __virtio16_to_cpu(true, config.mtu);
>>>>>> -    if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
>>>>>> -        return -EMSGSIZE;
>>>>>> -
>>>>>>       features_driver = vdev->config->get_driver_features(vdev);
>>>>>>       if (nla_put_u64_64bit(msg, 
>>>>>> VDPA_ATTR_DEV_NEGOTIATED_FEATURES, features_driver,
>>>>>>                     VDPA_ATTR_PAD))
>>>>>> @@ -846,7 +869,13 @@ static int vdpa_dev_net_config_fill(struct 
>>>>>> vdpa_device *vdev, struct sk_buff *ms
>>>>>>                     VDPA_ATTR_PAD))
>>>>>>           return -EMSGSIZE;
>>>>>>   -    return vdpa_dev_net_mq_config_fill(vdev, msg, 
>>>>>> features_driver, &config);
>>>>>> +    if (vdpa_dev_net_mac_config_fill(msg, features_device, 
>>>>>> &config))
>>>>>> +        return -EMSGSIZE;
>>>>>> +
>>>>>> +    if (vdpa_dev_net_mtu_config_fill(msg, features_device, 
>>>>>> &config))
>>>>>> +        return -EMSGSIZE;
>>>>>> +
>>>>>> +    return vdpa_dev_net_mq_config_fill(msg, features_device, 
>>>>>> &config);
>>>>>>   }
>>>>>>     static int
>>>>>> @@ -914,6 +943,11 @@ static int vdpa_fill_stats_rec(struct 
>>>>>> vdpa_device *vdev, struct sk_buff *msg,
>>>>>>       }
>>>>>>       vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
>>>>>>   +    /*
>>>>>> +     * Assume little endian for now, userspace can tweak this for
>>>>>> +     * legacy guest support.
>>>>>> +     */
>>>>>> +
>>>>> Ditto.
>>>> same as above
>>>>
>>>> Thanks
>>>>>
>>>>> Thanks,
>>>>> -Siwei
>>>>>> max_vqp = __virtio16_to_cpu(true, config.max_virtqueue_pairs);
>>>>>>       if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, max_vqp))
>>>>>>           return -EMSGSIZE;
>>>>>
>>>>
>>>
>>
>
Michael S. Tsirkin Aug. 17, 2022, 8:55 a.m. UTC | #14
On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
> Yes it is a little messy, and we can not check _F_VERSION_1 because of
> transitional devices, so maybe this is the best we can do for now

I think vhost generally needs an API to declare config space endian-ness
to kernel. vdpa can reuse that too then.
Zhu, Lingshan Aug. 17, 2022, 9:13 a.m. UTC | #15
On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
>> Yes it is a little messy, and we can not check _F_VERSION_1 because of
>> transitional devices, so maybe this is the best we can do for now
> I think vhost generally needs an API to declare config space endian-ness
> to kernel. vdpa can reuse that too then.
Yes, I remember you have mentioned some IOCTL to set the endian-ness,
for vDPA, I think only the vendor driver knows the endian,
so we may need a new function vdpa_ops->get_endian().

In the last thread, we say maybe it's better to add a comment for now.
But if you think we should add a vdpa_ops->get_endian(), I can work
on it for sure!

Thanks
Zhu Lingshan
>
Michael S. Tsirkin Aug. 17, 2022, 9:39 a.m. UTC | #16
On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
> > On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
> > > Yes it is a little messy, and we can not check _F_VERSION_1 because of
> > > transitional devices, so maybe this is the best we can do for now
> > I think vhost generally needs an API to declare config space endian-ness
> > to kernel. vdpa can reuse that too then.
> Yes, I remember you have mentioned some IOCTL to set the endian-ness,
> for vDPA, I think only the vendor driver knows the endian,
> so we may need a new function vdpa_ops->get_endian().
> In the last thread, we say maybe it's better to add a comment for now.
> But if you think we should add a vdpa_ops->get_endian(), I can work
> on it for sure!
> 
> Thanks
> Zhu Lingshan

I think QEMU has to set endian-ness. No one else knows.

> >
Zhu, Lingshan Aug. 17, 2022, 9:43 a.m. UTC | #17
On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
>>
>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
>>>> Yes it is a little messy, and we can not check _F_VERSION_1 because of
>>>> transitional devices, so maybe this is the best we can do for now
>>> I think vhost generally needs an API to declare config space endian-ness
>>> to kernel. vdpa can reuse that too then.
>> Yes, I remember you have mentioned some IOCTL to set the endian-ness,
>> for vDPA, I think only the vendor driver knows the endian,
>> so we may need a new function vdpa_ops->get_endian().
>> In the last thread, we say maybe it's better to add a comment for now.
>> But if you think we should add a vdpa_ops->get_endian(), I can work
>> on it for sure!
>>
>> Thanks
>> Zhu Lingshan
> I think QEMU has to set endian-ness. No one else knows.
Yes, for SW based vhost it is true. But for HW vDPA, only
the device & driver knows the endian, I think we can not
"set" a hardware's endian.

So if you think we should add a vdpa_ops->get_endian(),
I will drop these comments in the next version of
series, and work on a new patch for get_endian().

Thanks,
Zhu Lingshan
>
Michael S. Tsirkin Aug. 17, 2022, 10:37 a.m. UTC | #18
On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
> > On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
> > > > > Yes it is a little messy, and we can not check _F_VERSION_1 because of
> > > > > transitional devices, so maybe this is the best we can do for now
> > > > I think vhost generally needs an API to declare config space endian-ness
> > > > to kernel. vdpa can reuse that too then.
> > > Yes, I remember you have mentioned some IOCTL to set the endian-ness,
> > > for vDPA, I think only the vendor driver knows the endian,
> > > so we may need a new function vdpa_ops->get_endian().
> > > In the last thread, we say maybe it's better to add a comment for now.
> > > But if you think we should add a vdpa_ops->get_endian(), I can work
> > > on it for sure!
> > > 
> > > Thanks
> > > Zhu Lingshan
> > I think QEMU has to set endian-ness. No one else knows.
> Yes, for SW based vhost it is true. But for HW vDPA, only
> the device & driver knows the endian, I think we can not
> "set" a hardware's endian.

QEMU knows the guest endian-ness and it knows that
device is accessed through the legacy interface.
It can accordingly send endian-ness to the kernel and
kernel can propagate it to the driver.

> So if you think we should add a vdpa_ops->get_endian(),
> I will drop these comments in the next version of
> series, and work on a new patch for get_endian().
> 
> Thanks,
> Zhu Lingshan

Guests don't get endian-ness from devices so this seems pointless.
Jason Wang Aug. 18, 2022, 4:15 a.m. UTC | #19
在 2022/8/17 18:37, Michael S. Tsirkin 写道:
> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
>>
>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1 because of
>>>>>> transitional devices, so maybe this is the best we can do for now
>>>>> I think vhost generally needs an API to declare config space endian-ness
>>>>> to kernel. vdpa can reuse that too then.
>>>> Yes, I remember you have mentioned some IOCTL to set the endian-ness,
>>>> for vDPA, I think only the vendor driver knows the endian,
>>>> so we may need a new function vdpa_ops->get_endian().
>>>> In the last thread, we say maybe it's better to add a comment for now.
>>>> But if you think we should add a vdpa_ops->get_endian(), I can work
>>>> on it for sure!
>>>>
>>>> Thanks
>>>> Zhu Lingshan
>>> I think QEMU has to set endian-ness. No one else knows.
>> Yes, for SW based vhost it is true. But for HW vDPA, only
>> the device & driver knows the endian, I think we can not
>> "set" a hardware's endian.
> QEMU knows the guest endian-ness and it knows that
> device is accessed through the legacy interface.
> It can accordingly send endian-ness to the kernel and
> kernel can propagate it to the driver.


I wonder if we can simply force LE and then Qemu can do the endian 
conversion?

Thanks


>
>> So if you think we should add a vdpa_ops->get_endian(),
>> I will drop these comments in the next version of
>> series, and work on a new patch for get_endian().
>>
>> Thanks,
>> Zhu Lingshan
> Guests don't get endian-ness from devices so this seems pointless.
>
Jason Wang Aug. 18, 2022, 4:18 a.m. UTC | #20
在 2022/8/17 10:03, Zhu, Lingshan 写道:
>
>
> On 8/17/2022 5:09 AM, Michael S. Tsirkin wrote:
>> On Tue, Aug 16, 2022 at 09:02:17PM +0000, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Tuesday, August 16, 2022 12:19 AM
>>>>
>>>>
>>>> On 8/16/2022 10:32 AM, Parav Pandit wrote:
>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Monday, August 15, 2022 5:27 AM
>>>>>>
>>>>>> Some fields of virtio-net device config space are conditional on the
>>>>>> feature bits, the spec says:
>>>>>>
>>>>>> "The mac address field always exists
>>>>>> (though is only valid if VIRTIO_NET_F_MAC is set)"
>>>>>>
>>>>>> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ or
>>>>>> VIRTIO_NET_F_RSS is set"
>>>>>>
>>>>>> "mtu only exists if VIRTIO_NET_F_MTU is set"
>>>>>>
>>>>>> so we should read MTU, MAC and MQ in the device config space only
>>>>>> when these feature bits are offered.
>>>>> Yes.
>>>>>
>>>>>> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are not set,
>>>> the
>>>>>> virtio device should have one queue pair as default value, so when
>>>>>> userspace querying queue pair numbers, it should return mq=1 than 
>>>>>> zero.
>>>>> No.
>>>>> No need to treat mac and max_qps differently.
>>>>> It is meaningless to differentiate when field exist/not-exists vs 
>>>>> value
>>>> valid/not valid.
>>>> as we discussed before, MQ has a default value 1, to be a 
>>>> functional virtio-
>>>> net device, while MAC has no default value, if no VIRTIO_NET_F_MAC 
>>>> set,
>>>> the driver should generate a random MAC.
>>>>>> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read MTU from
>>>>>> the device config sapce.
>>>>>> RFC894 <A Standard for the Transmission of IP Datagrams over 
>>>>>> Ethernet
>>>>>> Networks> says:"The minimum length of the data field of a packet 
>>>>>> sent
>>>>>> Networks> over
>>>>>> an Ethernet is 1500 octets, thus the maximum length of an IP 
>>>>>> datagram
>>>>>> sent over an Ethernet is 1500 octets.  Implementations are 
>>>>>> encouraged
>>>>>> to support full-length packets"
>>>>> This line in the RFC 894 of 1984 is wrong.
>>>>> Errata already exists for it at [1].
>>>>>
>>>>> [1] https://www.rfc-editor.org/errata_search.php?rfc=894&rec_status=0
>>>> OK, so I think we should return nothing if _F_MTU not set, like 
>>>> handling the
>>>> MAC
>>>>>> virtio spec says:"The virtio network device is a virtual ethernet
>>>>>> card", so the default MTU value should be 1500 for virtio-net.
>>>>>>
>>>>> Practically I have seen 1500 and highe mtu.
>>>>> And this derivation is not good of what should be the default mtu 
>>>>> as above
>>>> errata exists.
>>>>> And I see the code below why you need to work so hard to define a 
>>>>> default
>>>> value so that _MQ and _MTU can report default values.
>>>>> There is really no need for this complexity and such a long commit
>>>> message.
>>>>> Can we please expose feature bits as-is and report config space 
>>>>> field which
>>>> are valid?
>>>>> User space will be querying both.
>>>> I think MAC and MTU don't have default values, so return nothing if 
>>>> the
>>>> feature bits not set,
>>>> for MQ, it is still max_vq_paris == 1 by default.
>>> I have stressed enough to highlight the fact that we don’t want to 
>>> start digging default/no default, valid/no-valid part of the spec.
>>> I prefer kernel to reporting fields that _exists_ in the config 
>>> space and are valid.
>>> I will let MST to handle the maintenance nightmare that this kind of 
>>> patch brings in without any visible gain to user space/orchestration 
>>> apps.
>>>
>>> A logic that can be easily build in user space, should be written in 
>>> user space.
>>> I conclude my thoughts here for this discussion.
>>>
>>> I will let MST to decide how he prefers to proceed.
>>>
>>>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
>>>>>> +        val_u16 = 1500;
>>>>>> +    else
>>>>>> +        val_u16 = __virtio16_to_cpu(true, config->mtu);
>>>>>> +
>>>>> Need to work hard to find default values and that too turned out had
>>>> errata.
>>>>> There are more fields that doesn’t have default values.
>>>>>
>>>>> There is no point in kernel doing this guess work, that user space 
>>>>> can figure
>>>> out of what is valid/invalid.
>>>> It's not guest work, when guest finds no feature bits set, it can 
>>>> decide what
>>>> to do.
>>> Above code of doing 1500 was probably an honest attempt to find a 
>>> legitimate default value, and we saw that it doesn’t work.
>>> This is second example after _MQ that we both agree should not 
>>> return default.
>>>
>>> And there are more fields coming in this area.
>>> Hence, I prefer to not avoid returning such defaults for MAC, MTU, 
>>> MQ and rest all fields which doesn’t _exists_.
>>>
>>> I will let MST to decide how he prefers to proceed for every field 
>>> to come next.
>>> Thanks.
>>>
>>
>> If MTU does not return a value without _F_MTU, and MAC does not return
>> a value without _F_MAC then IMO yes, number of queues should not return
>> a value without _F_MQ.
> sure I can do this, but may I ask whether it is a final decision, I 
> remember you supported max_queue_paris = 1 without _F_MQ before


I think we just need to be consistent:

Either

1) make field conditional to align with spec

or

2) always return a value even if the feature is not set

It seems to me 1) is easier.

Thanks


>
> Thanks
>>
>>
>
Zhu, Lingshan Aug. 18, 2022, 6:38 a.m. UTC | #21
On 8/18/2022 12:18 PM, Jason Wang wrote:
>
> 在 2022/8/17 10:03, Zhu, Lingshan 写道:
>>
>>
>> On 8/17/2022 5:09 AM, Michael S. Tsirkin wrote:
>>> On Tue, Aug 16, 2022 at 09:02:17PM +0000, Parav Pandit wrote:
>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>> Sent: Tuesday, August 16, 2022 12:19 AM
>>>>>
>>>>>
>>>>> On 8/16/2022 10:32 AM, Parav Pandit wrote:
>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>> Sent: Monday, August 15, 2022 5:27 AM
>>>>>>>
>>>>>>> Some fields of virtio-net device config space are conditional on 
>>>>>>> the
>>>>>>> feature bits, the spec says:
>>>>>>>
>>>>>>> "The mac address field always exists
>>>>>>> (though is only valid if VIRTIO_NET_F_MAC is set)"
>>>>>>>
>>>>>>> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ or
>>>>>>> VIRTIO_NET_F_RSS is set"
>>>>>>>
>>>>>>> "mtu only exists if VIRTIO_NET_F_MTU is set"
>>>>>>>
>>>>>>> so we should read MTU, MAC and MQ in the device config space only
>>>>>>> when these feature bits are offered.
>>>>>> Yes.
>>>>>>
>>>>>>> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are not set,
>>>>> the
>>>>>>> virtio device should have one queue pair as default value, so when
>>>>>>> userspace querying queue pair numbers, it should return mq=1 
>>>>>>> than zero.
>>>>>> No.
>>>>>> No need to treat mac and max_qps differently.
>>>>>> It is meaningless to differentiate when field exist/not-exists vs 
>>>>>> value
>>>>> valid/not valid.
>>>>> as we discussed before, MQ has a default value 1, to be a 
>>>>> functional virtio-
>>>>> net device, while MAC has no default value, if no VIRTIO_NET_F_MAC 
>>>>> set,
>>>>> the driver should generate a random MAC.
>>>>>>> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read MTU 
>>>>>>> from
>>>>>>> the device config sapce.
>>>>>>> RFC894 <A Standard for the Transmission of IP Datagrams over 
>>>>>>> Ethernet
>>>>>>> Networks> says:"The minimum length of the data field of a packet 
>>>>>>> sent
>>>>>>> Networks> over
>>>>>>> an Ethernet is 1500 octets, thus the maximum length of an IP 
>>>>>>> datagram
>>>>>>> sent over an Ethernet is 1500 octets. Implementations are 
>>>>>>> encouraged
>>>>>>> to support full-length packets"
>>>>>> This line in the RFC 894 of 1984 is wrong.
>>>>>> Errata already exists for it at [1].
>>>>>>
>>>>>> [1] 
>>>>>> https://www.rfc-editor.org/errata_search.php?rfc=894&rec_status=0
>>>>> OK, so I think we should return nothing if _F_MTU not set, like 
>>>>> handling the
>>>>> MAC
>>>>>>> virtio spec says:"The virtio network device is a virtual ethernet
>>>>>>> card", so the default MTU value should be 1500 for virtio-net.
>>>>>>>
>>>>>> Practically I have seen 1500 and highe mtu.
>>>>>> And this derivation is not good of what should be the default mtu 
>>>>>> as above
>>>>> errata exists.
>>>>>> And I see the code below why you need to work so hard to define a 
>>>>>> default
>>>>> value so that _MQ and _MTU can report default values.
>>>>>> There is really no need for this complexity and such a long commit
>>>>> message.
>>>>>> Can we please expose feature bits as-is and report config space 
>>>>>> field which
>>>>> are valid?
>>>>>> User space will be querying both.
>>>>> I think MAC and MTU don't have default values, so return nothing 
>>>>> if the
>>>>> feature bits not set,
>>>>> for MQ, it is still max_vq_paris == 1 by default.
>>>> I have stressed enough to highlight the fact that we don’t want to 
>>>> start digging default/no default, valid/no-valid part of the spec.
>>>> I prefer kernel to reporting fields that _exists_ in the config 
>>>> space and are valid.
>>>> I will let MST to handle the maintenance nightmare that this kind 
>>>> of patch brings in without any visible gain to user 
>>>> space/orchestration apps.
>>>>
>>>> A logic that can be easily build in user space, should be written 
>>>> in user space.
>>>> I conclude my thoughts here for this discussion.
>>>>
>>>> I will let MST to decide how he prefers to proceed.
>>>>
>>>>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
>>>>>>> +        val_u16 = 1500;
>>>>>>> +    else
>>>>>>> +        val_u16 = __virtio16_to_cpu(true, config->mtu);
>>>>>>> +
>>>>>> Need to work hard to find default values and that too turned out had
>>>>> errata.
>>>>>> There are more fields that doesn’t have default values.
>>>>>>
>>>>>> There is no point in kernel doing this guess work, that user 
>>>>>> space can figure
>>>>> out of what is valid/invalid.
>>>>> It's not guest work, when guest finds no feature bits set, it can 
>>>>> decide what
>>>>> to do.
>>>> Above code of doing 1500 was probably an honest attempt to find a 
>>>> legitimate default value, and we saw that it doesn’t work.
>>>> This is second example after _MQ that we both agree should not 
>>>> return default.
>>>>
>>>> And there are more fields coming in this area.
>>>> Hence, I prefer to not avoid returning such defaults for MAC, MTU, 
>>>> MQ and rest all fields which doesn’t _exists_.
>>>>
>>>> I will let MST to decide how he prefers to proceed for every field 
>>>> to come next.
>>>> Thanks.
>>>>
>>>
>>> If MTU does not return a value without _F_MTU, and MAC does not return
>>> a value without _F_MAC then IMO yes, number of queues should not return
>>> a value without _F_MQ.
>> sure I can do this, but may I ask whether it is a final decision, I 
>> remember you supported max_queue_paris = 1 without _F_MQ before
>
>
> I think we just need to be consistent:
>
> Either
>
> 1) make field conditional to align with spec
>
> or
>
> 2) always return a value even if the feature is not set
>
> It seems to me 1) is easier.
>
> Thanks
I will pick 1

Thanks
>
>
>>
>> Thanks
>>>
>>>
>>
>
Zhu, Lingshan Aug. 18, 2022, 7:58 a.m. UTC | #22
On 8/18/2022 12:15 PM, Jason Wang wrote:
>
> 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
>> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
>>>
>>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
>>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
>>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
>>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
>>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1 
>>>>>>> because of
>>>>>>> transitional devices, so maybe this is the best we can do for now
>>>>>> I think vhost generally needs an API to declare config space 
>>>>>> endian-ness
>>>>>> to kernel. vdpa can reuse that too then.
>>>>> Yes, I remember you have mentioned some IOCTL to set the endian-ness,
>>>>> for vDPA, I think only the vendor driver knows the endian,
>>>>> so we may need a new function vdpa_ops->get_endian().
>>>>> In the last thread, we say maybe it's better to add a comment for 
>>>>> now.
>>>>> But if you think we should add a vdpa_ops->get_endian(), I can work
>>>>> on it for sure!
>>>>>
>>>>> Thanks
>>>>> Zhu Lingshan
>>>> I think QEMU has to set endian-ness. No one else knows.
>>> Yes, for SW based vhost it is true. But for HW vDPA, only
>>> the device & driver knows the endian, I think we can not
>>> "set" a hardware's endian.
>> QEMU knows the guest endian-ness and it knows that
>> device is accessed through the legacy interface.
>> It can accordingly send endian-ness to the kernel and
>> kernel can propagate it to the driver.
>
>
> I wonder if we can simply force LE and then Qemu can do the endian 
> conversion?
I think this is what we are doing now, force it to be LE, leave a comment.

QEMU will not set ENDIAN for vDPA devices, vhost_kernel_call() verifies
whether the backend is TYPE_KERNEL (we have TYPE_VDPA here),
so we can not rely on this code path.

Thanks
Zhu Lingshan
>
> Thanks
>
>
>>
>>> So if you think we should add a vdpa_ops->get_endian(),
>>> I will drop these comments in the next version of
>>> series, and work on a new patch for get_endian().
>>>
>>> Thanks,
>>> Zhu Lingshan
>> Guests don't get endian-ness from devices so this seems pointless.
>>
>
Parav Pandit Aug. 18, 2022, 5:20 p.m. UTC | #23
> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, August 18, 2022 12:19 AM

> >>>> On 8/16/2022 10:32 AM, Parav Pandit wrote:
> >>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Monday, August 15, 2022 5:27 AM
> >>>>>>
> >>>>>> Some fields of virtio-net device config space are conditional on
> >>>>>> the feature bits, the spec says:
> >>>>>>
> >>>>>> "The mac address field always exists (though is only valid if
> >>>>>> VIRTIO_NET_F_MAC is set)"
> >>>>>>
> >>>>>> "max_virtqueue_pairs only exists if VIRTIO_NET_F_MQ or
> >>>>>> VIRTIO_NET_F_RSS is set"
> >>>>>>
> >>>>>> "mtu only exists if VIRTIO_NET_F_MTU is set"
> >>>>>>
> >>>>>> so we should read MTU, MAC and MQ in the device config space
> only
> >>>>>> when these feature bits are offered.
> >>>>> Yes.
> >>>>>
> >>>>>> For MQ, if both VIRTIO_NET_F_MQ and VIRTIO_NET_F_RSS are not
> set,
> >>>> the
> >>>>>> virtio device should have one queue pair as default value, so
> >>>>>> when userspace querying queue pair numbers, it should return
> mq=1
> >>>>>> than zero.
> >>>>> No.
> >>>>> No need to treat mac and max_qps differently.
> >>>>> It is meaningless to differentiate when field exist/not-exists vs
> >>>>> value
> >>>> valid/not valid.
> >>>> as we discussed before, MQ has a default value 1, to be a
> >>>> functional virtio- net device, while MAC has no default value, if
> >>>> no VIRTIO_NET_F_MAC set, the driver should generate a random
> MAC.
> >>>>>> For MTU, if VIRTIO_NET_F_MTU is not set, we should not read MTU
> >>>>>> from the device config sapce.
> >>>>>> RFC894 <A Standard for the Transmission of IP Datagrams over
> >>>>>> Ethernet
> >>>>>> Networks> says:"The minimum length of the data field of a packet
> >>>>>> sent
> >>>>>> Networks> over
> >>>>>> an Ethernet is 1500 octets, thus the maximum length of an IP
> >>>>>> datagram sent over an Ethernet is 1500 octets.  Implementations
> >>>>>> are encouraged to support full-length packets"
> >>>>> This line in the RFC 894 of 1984 is wrong.
> >>>>> Errata already exists for it at [1].
> >>>>>
> >>>>> [1]
> >>>>> https://www.rfc-editor.org/errata_search.php?rfc=894&rec_status=0
> >>>> OK, so I think we should return nothing if _F_MTU not set, like
> >>>> handling the MAC
> >>>>>> virtio spec says:"The virtio network device is a virtual ethernet
> >>>>>> card", so the default MTU value should be 1500 for virtio-net.
> >>>>>>
> >>>>> Practically I have seen 1500 and highe mtu.
> >>>>> And this derivation is not good of what should be the default mtu
> >>>>> as above
> >>>> errata exists.
> >>>>> And I see the code below why you need to work so hard to define a
> >>>>> default
> >>>> value so that _MQ and _MTU can report default values.
> >>>>> There is really no need for this complexity and such a long commit
> >>>> message.
> >>>>> Can we please expose feature bits as-is and report config space
> >>>>> field which
> >>>> are valid?
> >>>>> User space will be querying both.
> >>>> I think MAC and MTU don't have default values, so return nothing if
> >>>> the feature bits not set, for MQ, it is still max_vq_paris == 1 by
> >>>> default.
> >>> I have stressed enough to highlight the fact that we don’t want to
> >>> start digging default/no default, valid/no-valid part of the spec.
> >>> I prefer kernel to reporting fields that _exists_ in the config
> >>> space and are valid.
> >>> I will let MST to handle the maintenance nightmare that this kind of
> >>> patch brings in without any visible gain to user space/orchestration
> >>> apps.
> >>>
> >>> A logic that can be easily build in user space, should be written in
> >>> user space.
> >>> I conclude my thoughts here for this discussion.
> >>>
> >>> I will let MST to decide how he prefers to proceed.
> >>>
> >>>>>> +    if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
> >>>>>> +        val_u16 = 1500;
> >>>>>> +    else
> >>>>>> +        val_u16 = __virtio16_to_cpu(true, config->mtu);
> >>>>>> +
> >>>>> Need to work hard to find default values and that too turned out
> >>>>> had
> >>>> errata.
> >>>>> There are more fields that doesn’t have default values.
> >>>>>
> >>>>> There is no point in kernel doing this guess work, that user space
> >>>>> can figure
> >>>> out of what is valid/invalid.
> >>>> It's not guest work, when guest finds no feature bits set, it can
> >>>> decide what to do.
> >>> Above code of doing 1500 was probably an honest attempt to find a
> >>> legitimate default value, and we saw that it doesn’t work.
> >>> This is second example after _MQ that we both agree should not
> >>> return default.
> >>>
> >>> And there are more fields coming in this area.
> >>> Hence, I prefer to not avoid returning such defaults for MAC, MTU,
> >>> MQ and rest all fields which doesn’t _exists_.
> >>>
> >>> I will let MST to decide how he prefers to proceed for every field
> >>> to come next.
> >>> Thanks.
> >>>
> >>
> >> If MTU does not return a value without _F_MTU, and MAC does not
> >> return a value without _F_MAC then IMO yes, number of queues should
> >> not return a value without _F_MQ.
> > sure I can do this, but may I ask whether it is a final decision, I
> > remember you supported max_queue_paris = 1 without _F_MQ before
> 
> 
> I think we just need to be consistent:
> 
> Either
> 
> 1) make field conditional to align with spec
> 
> or
> 
> 2) always return a value even if the feature is not set
> 
> It seems to me 1) is easier.
> 
+1
Si-Wei Liu Aug. 18, 2022, 11:20 p.m. UTC | #24
On 8/17/2022 9:15 PM, Jason Wang wrote:
>
> 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
>> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
>>>
>>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
>>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
>>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
>>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
>>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1 
>>>>>>> because of
>>>>>>> transitional devices, so maybe this is the best we can do for now
>>>>>> I think vhost generally needs an API to declare config space 
>>>>>> endian-ness
>>>>>> to kernel. vdpa can reuse that too then.
>>>>> Yes, I remember you have mentioned some IOCTL to set the endian-ness,
>>>>> for vDPA, I think only the vendor driver knows the endian,
>>>>> so we may need a new function vdpa_ops->get_endian().
>>>>> In the last thread, we say maybe it's better to add a comment for 
>>>>> now.
>>>>> But if you think we should add a vdpa_ops->get_endian(), I can work
>>>>> on it for sure!
>>>>>
>>>>> Thanks
>>>>> Zhu Lingshan
>>>> I think QEMU has to set endian-ness. No one else knows.
>>> Yes, for SW based vhost it is true. But for HW vDPA, only
>>> the device & driver knows the endian, I think we can not
>>> "set" a hardware's endian.
>> QEMU knows the guest endian-ness and it knows that
>> device is accessed through the legacy interface.
>> It can accordingly send endian-ness to the kernel and
>> kernel can propagate it to the driver.
>
>
> I wonder if we can simply force LE and then Qemu can do the endian 
> conversion?
convert from LE for config space fields only, or QEMU has to forcefully 
mediate and covert endianness for all device memory access including 
even the datapath (fields in descriptor and avail/used rings)? I hope 
it's not the latter, otherwise it loses the point to use vDPA for 
datapath acceleration.

Even if its the former, it's a little weird for vendor device to 
implement a LE config space with BE ring layout, although still possible...

-Siwei
>
> Thanks
>
>
>>
>>> So if you think we should add a vdpa_ops->get_endian(),
>>> I will drop these comments in the next version of
>>> series, and work on a new patch for get_endian().
>>>
>>> Thanks,
>>> Zhu Lingshan
>> Guests don't get endian-ness from devices so this seems pointless.
>>
>
Jason Wang Aug. 19, 2022, 12:42 a.m. UTC | #25
On Fri, Aug 19, 2022 at 7:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 8/17/2022 9:15 PM, Jason Wang wrote:
> >
> > 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
> >> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
> >>>
> >>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
> >>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
> >>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
> >>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
> >>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1
> >>>>>>> because of
> >>>>>>> transitional devices, so maybe this is the best we can do for now
> >>>>>> I think vhost generally needs an API to declare config space
> >>>>>> endian-ness
> >>>>>> to kernel. vdpa can reuse that too then.
> >>>>> Yes, I remember you have mentioned some IOCTL to set the endian-ness,
> >>>>> for vDPA, I think only the vendor driver knows the endian,
> >>>>> so we may need a new function vdpa_ops->get_endian().
> >>>>> In the last thread, we say maybe it's better to add a comment for
> >>>>> now.
> >>>>> But if you think we should add a vdpa_ops->get_endian(), I can work
> >>>>> on it for sure!
> >>>>>
> >>>>> Thanks
> >>>>> Zhu Lingshan
> >>>> I think QEMU has to set endian-ness. No one else knows.
> >>> Yes, for SW based vhost it is true. But for HW vDPA, only
> >>> the device & driver knows the endian, I think we can not
> >>> "set" a hardware's endian.
> >> QEMU knows the guest endian-ness and it knows that
> >> device is accessed through the legacy interface.
> >> It can accordingly send endian-ness to the kernel and
> >> kernel can propagate it to the driver.
> >
> >
> > I wonder if we can simply force LE and then Qemu can do the endian
> > conversion?
> convert from LE for config space fields only, or QEMU has to forcefully
> mediate and covert endianness for all device memory access including
> even the datapath (fields in descriptor and avail/used rings)?

Former. Actually, I want to force modern devices for vDPA when
developing the vDPA framework. But then we see requirements for
transitional or even legacy (e.g the Ali ENI parent). So it
complicates things a lot.

I think several ideas has been proposed:

1) Your proposal of having a vDPA specific way for
modern/transitional/legacy awareness. This seems very clean since each
transport should have the ability to do that but it still requires
some kind of mediation for the case e.g running BE legacy guest on LE
host.

2) Michael suggests using VHOST_SET_VRING_ENDIAN where it means we
need a new config ops for vDPA bus, but it doesn't solve the issue for
config space (at least from its name). We probably need a new ioctl
for both vring and config space.

or

3) revisit the idea of forcing modern only device which may simplify
things a lot

which way should we go?

> I hope
> it's not the latter, otherwise it loses the point to use vDPA for
> datapath acceleration.
>
> Even if its the former, it's a little weird for vendor device to
> implement a LE config space with BE ring layout, although still possible...

Right.

Thanks

>
> -Siwei
> >
> > Thanks
> >
> >
> >>
> >>> So if you think we should add a vdpa_ops->get_endian(),
> >>> I will drop these comments in the next version of
> >>> series, and work on a new patch for get_endian().
> >>>
> >>> Thanks,
> >>> Zhu Lingshan
> >> Guests don't get endian-ness from devices so this seems pointless.
> >>
> >
>
Michael S. Tsirkin Aug. 19, 2022, 3:52 a.m. UTC | #26
On Fri, Aug 19, 2022 at 08:42:32AM +0800, Jason Wang wrote:
> On Fri, Aug 19, 2022 at 7:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >
> >
> >
> > On 8/17/2022 9:15 PM, Jason Wang wrote:
> > >
> > > 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
> > >> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
> > >>>
> > >>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
> > >>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
> > >>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
> > >>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
> > >>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1
> > >>>>>>> because of
> > >>>>>>> transitional devices, so maybe this is the best we can do for now
> > >>>>>> I think vhost generally needs an API to declare config space
> > >>>>>> endian-ness
> > >>>>>> to kernel. vdpa can reuse that too then.
> > >>>>> Yes, I remember you have mentioned some IOCTL to set the endian-ness,
> > >>>>> for vDPA, I think only the vendor driver knows the endian,
> > >>>>> so we may need a new function vdpa_ops->get_endian().
> > >>>>> In the last thread, we say maybe it's better to add a comment for
> > >>>>> now.
> > >>>>> But if you think we should add a vdpa_ops->get_endian(), I can work
> > >>>>> on it for sure!
> > >>>>>
> > >>>>> Thanks
> > >>>>> Zhu Lingshan
> > >>>> I think QEMU has to set endian-ness. No one else knows.
> > >>> Yes, for SW based vhost it is true. But for HW vDPA, only
> > >>> the device & driver knows the endian, I think we can not
> > >>> "set" a hardware's endian.
> > >> QEMU knows the guest endian-ness and it knows that
> > >> device is accessed through the legacy interface.
> > >> It can accordingly send endian-ness to the kernel and
> > >> kernel can propagate it to the driver.
> > >
> > >
> > > I wonder if we can simply force LE and then Qemu can do the endian
> > > conversion?
> > convert from LE for config space fields only, or QEMU has to forcefully
> > mediate and covert endianness for all device memory access including
> > even the datapath (fields in descriptor and avail/used rings)?
> 
> Former. Actually, I want to force modern devices for vDPA when
> developing the vDPA framework. But then we see requirements for
> transitional or even legacy (e.g the Ali ENI parent). So it
> complicates things a lot.
> 
> I think several ideas has been proposed:
> 
> 1) Your proposal of having a vDPA specific way for
> modern/transitional/legacy awareness. This seems very clean since each
> transport should have the ability to do that but it still requires
> some kind of mediation for the case e.g running BE legacy guest on LE
> host.
> 
> 2) Michael suggests using VHOST_SET_VRING_ENDIAN where it means we
> need a new config ops for vDPA bus, but it doesn't solve the issue for
> config space (at least from its name). We probably need a new ioctl
> for both vring and config space.
> 
> or


Yea, like VHOST_SET_CONFIG_ENDIAN.



> 3) revisit the idea of forcing modern only device which may simplify
> things a lot

Problem is vhost needs VHOST_SET_CONFIG_ENDIAN too. it's not
a vdpa specific issue.

> which way should we go?
> 
> > I hope
> > it's not the latter, otherwise it loses the point to use vDPA for
> > datapath acceleration.
> >
> > Even if its the former, it's a little weird for vendor device to
> > implement a LE config space with BE ring layout, although still possible...
> 
> Right.
> 
> Thanks
> 
> >
> > -Siwei
> > >
> > > Thanks
> > >
> > >
> > >>
> > >>> So if you think we should add a vdpa_ops->get_endian(),
> > >>> I will drop these comments in the next version of
> > >>> series, and work on a new patch for get_endian().
> > >>>
> > >>> Thanks,
> > >>> Zhu Lingshan
> > >> Guests don't get endian-ness from devices so this seems pointless.
> > >>
> > >
> >
Si-Wei Liu Aug. 20, 2022, 8:55 a.m. UTC | #27
On 8/18/2022 5:42 PM, Jason Wang wrote:
> On Fri, Aug 19, 2022 at 7:20 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>>
>>
>> On 8/17/2022 9:15 PM, Jason Wang wrote:
>>> 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
>>>> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
>>>>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
>>>>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
>>>>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
>>>>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
>>>>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1
>>>>>>>>> because of
>>>>>>>>> transitional devices, so maybe this is the best we can do for now
>>>>>>>> I think vhost generally needs an API to declare config space
>>>>>>>> endian-ness
>>>>>>>> to kernel. vdpa can reuse that too then.
>>>>>>> Yes, I remember you have mentioned some IOCTL to set the endian-ness,
>>>>>>> for vDPA, I think only the vendor driver knows the endian,
>>>>>>> so we may need a new function vdpa_ops->get_endian().
>>>>>>> In the last thread, we say maybe it's better to add a comment for
>>>>>>> now.
>>>>>>> But if you think we should add a vdpa_ops->get_endian(), I can work
>>>>>>> on it for sure!
>>>>>>>
>>>>>>> Thanks
>>>>>>> Zhu Lingshan
>>>>>> I think QEMU has to set endian-ness. No one else knows.
>>>>> Yes, for SW based vhost it is true. But for HW vDPA, only
>>>>> the device & driver knows the endian, I think we can not
>>>>> "set" a hardware's endian.
>>>> QEMU knows the guest endian-ness and it knows that
>>>> device is accessed through the legacy interface.
>>>> It can accordingly send endian-ness to the kernel and
>>>> kernel can propagate it to the driver.
>>>
>>> I wonder if we can simply force LE and then Qemu can do the endian
>>> conversion?
>> convert from LE for config space fields only, or QEMU has to forcefully
>> mediate and covert endianness for all device memory access including
>> even the datapath (fields in descriptor and avail/used rings)?
> Former. Actually, I want to force modern devices for vDPA when
> developing the vDPA framework. But then we see requirements for
> transitional or even legacy (e.g the Ali ENI parent). So it
> complicates things a lot.
>
> I think several ideas has been proposed:
>
> 1) Your proposal of having a vDPA specific way for
> modern/transitional/legacy awareness. This seems very clean since each
> transport should have the ability to do that but it still requires
> some kind of mediation for the case e.g running BE legacy guest on LE
> host.
In theory it seems like so, though practically I wonder if we can just 
forbid BE legacy driver from running on modern LE host. For those who 
care about legacy BE guest, they mostly like could and should talk to 
vendor to get native BE support to achieve hardware acceleration, few of 
them would count on QEMU in mediating or emulating the datapath 
(otherwise I don't see the benefit of adopting vDPA?). I still feel that 
not every hardware vendor has to offer backward compatibility 
(transitional device) with legacy interface/behavior (BE being just 
one), this is unlike the situation on software virtio device, which has 
legacy support since day one. I think we ever discussed it before: for 
those vDPA vendors who don't offer legacy guest support, maybe we should 
mandate some feature for e.g. VERSION_1, as these devices really don't 
offer functionality of the opposite side (!VERSION_1) during negotiation.

Having it said, perhaps we should also allow vendor device to implement 
only partial support for legacy. We can define "reversed" backend 
feature to denote some part of the legacy interface/functionality not 
getting implemented by device. For instance, 
VHOST_BACKEND_F_NO_BE_VRING, VHOST_BACKEND_F_NO_BE_CONFIG, 
VHOST_BACKEND_F_NO_ALIGNED_VRING, VHOST_BACKEND_NET_F_NO_WRITEABLE_MAC, 
and et al. Not all of these missing features for legacy would be easy 
for QEMU to make up for, so QEMU can selectively emulate those at its 
best when necessary and applicable. In other word, this design shouldn't 
prevent QEMU from making up for vendor device's partial legacy support.

>
> 2) Michael suggests using VHOST_SET_VRING_ENDIAN where it means we
> need a new config ops for vDPA bus, but it doesn't solve the issue for
> config space (at least from its name). We probably need a new ioctl
> for both vring and config space.
Yep adding a new ioctl makes things better, but I think the key is not 
the new ioctl. It's whether or not we should enforce every vDPA vendor 
driver to implement all transitional interfaces to be spec compliant. If 
we allow them to reject the VHOST_SET_VRING_ENDIAN  or 
VHOST_SET_CONFIG_ENDIAN call, what could we do? We would still end up 
with same situation of either fail the guest, or trying to 
mediate/emulate, right?

Not to mention VHOST_SET_VRING_ENDIAN is rarely supported by vhost today 
- few distro kernel has CONFIG_VHOST_CROSS_ENDIAN_LEGACY enabled and 
QEMU just ignores the result. vhost doesn't necessarily depend on it to 
determine endianness it looks.
>
> or
>
> 3) revisit the idea of forcing modern only device which may simplify
> things a lot
I am not actually against forcing modern only config space, given that 
it's not hard for either QEMU or individual driver to mediate or 
emulate, and for the most part it's not conflict with the goal of 
offload or acceleration with vDPA. But forcing LE ring layout IMO would 
just kill off the potential of a very good use case. Currently for our 
use case the priority for supporting 0.9.5 guest with vDPA is slightly 
lower compared to live migration, but it is still in our TODO list.

Thanks,
-Siwei

>
> which way should we go?
>
>> I hope
>> it's not the latter, otherwise it loses the point to use vDPA for
>> datapath acceleration.
>>
>> Even if its the former, it's a little weird for vendor device to
>> implement a LE config space with BE ring layout, although still possible...
> Right.
>
> Thanks
>
>> -Siwei
>>> Thanks
>>>
>>>
>>>>> So if you think we should add a vdpa_ops->get_endian(),
>>>>> I will drop these comments in the next version of
>>>>> series, and work on a new patch for get_endian().
>>>>>
>>>>> Thanks,
>>>>> Zhu Lingshan
>>>> Guests don't get endian-ness from devices so this seems pointless.
>>>>
Zhu, Lingshan Aug. 22, 2022, 5:07 a.m. UTC | #28
On 8/20/2022 4:55 PM, Si-Wei Liu wrote:
>
>
> On 8/18/2022 5:42 PM, Jason Wang wrote:
>> On Fri, Aug 19, 2022 at 7:20 AM Si-Wei Liu <si-wei.liu@oracle.com> 
>> wrote:
>>>
>>>
>>> On 8/17/2022 9:15 PM, Jason Wang wrote:
>>>> 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
>>>>> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
>>>>>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
>>>>>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
>>>>>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1
>>>>>>>>>> because of
>>>>>>>>>> transitional devices, so maybe this is the best we can do for 
>>>>>>>>>> now
>>>>>>>>> I think vhost generally needs an API to declare config space
>>>>>>>>> endian-ness
>>>>>>>>> to kernel. vdpa can reuse that too then.
>>>>>>>> Yes, I remember you have mentioned some IOCTL to set the 
>>>>>>>> endian-ness,
>>>>>>>> for vDPA, I think only the vendor driver knows the endian,
>>>>>>>> so we may need a new function vdpa_ops->get_endian().
>>>>>>>> In the last thread, we say maybe it's better to add a comment for
>>>>>>>> now.
>>>>>>>> But if you think we should add a vdpa_ops->get_endian(), I can 
>>>>>>>> work
>>>>>>>> on it for sure!
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Zhu Lingshan
>>>>>>> I think QEMU has to set endian-ness. No one else knows.
>>>>>> Yes, for SW based vhost it is true. But for HW vDPA, only
>>>>>> the device & driver knows the endian, I think we can not
>>>>>> "set" a hardware's endian.
>>>>> QEMU knows the guest endian-ness and it knows that
>>>>> device is accessed through the legacy interface.
>>>>> It can accordingly send endian-ness to the kernel and
>>>>> kernel can propagate it to the driver.
>>>>
>>>> I wonder if we can simply force LE and then Qemu can do the endian
>>>> conversion?
>>> convert from LE for config space fields only, or QEMU has to forcefully
>>> mediate and covert endianness for all device memory access including
>>> even the datapath (fields in descriptor and avail/used rings)?
>> Former. Actually, I want to force modern devices for vDPA when
>> developing the vDPA framework. But then we see requirements for
>> transitional or even legacy (e.g the Ali ENI parent). So it
>> complicates things a lot.
>>
>> I think several ideas has been proposed:
>>
>> 1) Your proposal of having a vDPA specific way for
>> modern/transitional/legacy awareness. This seems very clean since each
>> transport should have the ability to do that but it still requires
>> some kind of mediation for the case e.g running BE legacy guest on LE
>> host.
> In theory it seems like so, though practically I wonder if we can just 
> forbid BE legacy driver from running on modern LE host. For those who 
> care about legacy BE guest, they mostly like could and should talk to 
> vendor to get native BE support to achieve hardware acceleration, few 
> of them would count on QEMU in mediating or emulating the datapath 
> (otherwise I don't see the benefit of adopting vDPA?). I still feel 
> that not every hardware vendor has to offer backward compatibility 
> (transitional device) with legacy interface/behavior (BE being just 
> one), this is unlike the situation on software virtio device, which 
> has legacy support since day one. I think we ever discussed it before: 
> for those vDPA vendors who don't offer legacy guest support, maybe we 
> should mandate some feature for e.g. VERSION_1, as these devices 
> really don't offer functionality of the opposite side (!VERSION_1) 
> during negotiation.
>
> Having it said, perhaps we should also allow vendor device to 
> implement only partial support for legacy. We can define "reversed" 
> backend feature to denote some part of the legacy 
> interface/functionality not getting implemented by device. For 
> instance, VHOST_BACKEND_F_NO_BE_VRING, VHOST_BACKEND_F_NO_BE_CONFIG, 
> VHOST_BACKEND_F_NO_ALIGNED_VRING, 
> VHOST_BACKEND_NET_F_NO_WRITEABLE_MAC, and et al. Not all of these 
> missing features for legacy would be easy for QEMU to make up for, so 
> QEMU can selectively emulate those at its best when necessary and 
> applicable. In other word, this design shouldn't prevent QEMU from 
> making up for vendor device's partial legacy support.
>
>>
>> 2) Michael suggests using VHOST_SET_VRING_ENDIAN where it means we
>> need a new config ops for vDPA bus, but it doesn't solve the issue for
>> config space (at least from its name). We probably need a new ioctl
>> for both vring and config space.
> Yep adding a new ioctl makes things better, but I think the key is not 
> the new ioctl. It's whether or not we should enforce every vDPA vendor 
> driver to implement all transitional interfaces to be spec compliant. 
> If we allow them to reject the VHOST_SET_VRING_ENDIAN  or 
> VHOST_SET_CONFIG_ENDIAN call, what could we do? We would still end up 
> with same situation of either fail the guest, or trying to 
> mediate/emulate, right?
>
> Not to mention VHOST_SET_VRING_ENDIAN is rarely supported by vhost 
> today - few distro kernel has CONFIG_VHOST_CROSS_ENDIAN_LEGACY enabled 
> and QEMU just ignores the result. vhost doesn't necessarily depend on 
> it to determine endianness it looks.
I would like to suggest to add two new config ops get/set_vq_endian() 
and get/set_config_endian() for vDPA. This is used to:
a) support VHOST_GET/SET_VRING_ENDIAN as MST suggested, and add 
VHOST_SET/GET_CONFIG_ENDIAN for vhost_vdpa.
If the device has not implemented interface to set its endianess, then 
no matter success or failure of SET_ENDIAN, QEMU knows the endian-ness 
anyway. In this case, if the device endianess does not match the guest, 
there needs a mediation layer or fail.
b) ops->get_config_endian() can always tell the endian-ness of the 
device config space after the vendor driver probing the device. So we 
can use this ops->get_config_endian() for
MTU, MAC and other fields handling in vdpa_dev_net_config_fill() and we 
don't need to set_features in vdpa_get_config_unlocked(), so no race 
conditions.
Every time ops->get_config() returned, we can tell the endian by 
ops-config_>get_endian(), we don't need set_features(xxx, 0) if features 
negotiation not done.

The question is: Do we need two pairs of ioctls for both vq and config 
space? Can config space endian-ness differ from the vqs?
c) do we need a new netlink attr telling the endian-ness to user space?

Thanks,
Zhu Lingshan
>
>>
>> or
>>
>> 3) revisit the idea of forcing modern only device which may simplify
>> things a lot
> I am not actually against forcing modern only config space, given that 
> it's not hard for either QEMU or individual driver to mediate or 
> emulate, and for the most part it's not conflict with the goal of 
> offload or acceleration with vDPA. But forcing LE ring layout IMO 
> would just kill off the potential of a very good use case. Currently 
> for our use case the priority for supporting 0.9.5 guest with vDPA is 
> slightly lower compared to live migration, but it is still in our TODO 
> list.
>
> Thanks,
> -Siwei
>
>>
>> which way should we go?
>>
>>> I hope
>>> it's not the latter, otherwise it loses the point to use vDPA for
>>> datapath acceleration.
>>>
>>> Even if its the former, it's a little weird for vendor device to
>>> implement a LE config space with BE ring layout, although still 
>>> possible...
>> Right.
>>
>> Thanks
>>
>>> -Siwei
>>>> Thanks
>>>>
>>>>
>>>>>> So if you think we should add a vdpa_ops->get_endian(),
>>>>>> I will drop these comments in the next version of
>>>>>> series, and work on a new patch for get_endian().
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Lingshan
>>>>> Guests don't get endian-ness from devices so this seems pointless.
>>>>>
>
Jason Wang Aug. 23, 2022, 3:26 a.m. UTC | #29
On Mon, Aug 22, 2022 at 1:08 PM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
>
>
>
> On 8/20/2022 4:55 PM, Si-Wei Liu wrote:
> >
> >
> > On 8/18/2022 5:42 PM, Jason Wang wrote:
> >> On Fri, Aug 19, 2022 at 7:20 AM Si-Wei Liu <si-wei.liu@oracle.com>
> >> wrote:
> >>>
> >>>
> >>> On 8/17/2022 9:15 PM, Jason Wang wrote:
> >>>> 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
> >>>>> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
> >>>>>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
> >>>>>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
> >>>>>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
> >>>>>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
> >>>>>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1
> >>>>>>>>>> because of
> >>>>>>>>>> transitional devices, so maybe this is the best we can do for
> >>>>>>>>>> now
> >>>>>>>>> I think vhost generally needs an API to declare config space
> >>>>>>>>> endian-ness
> >>>>>>>>> to kernel. vdpa can reuse that too then.
> >>>>>>>> Yes, I remember you have mentioned some IOCTL to set the
> >>>>>>>> endian-ness,
> >>>>>>>> for vDPA, I think only the vendor driver knows the endian,
> >>>>>>>> so we may need a new function vdpa_ops->get_endian().
> >>>>>>>> In the last thread, we say maybe it's better to add a comment for
> >>>>>>>> now.
> >>>>>>>> But if you think we should add a vdpa_ops->get_endian(), I can
> >>>>>>>> work
> >>>>>>>> on it for sure!
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Zhu Lingshan
> >>>>>>> I think QEMU has to set endian-ness. No one else knows.
> >>>>>> Yes, for SW based vhost it is true. But for HW vDPA, only
> >>>>>> the device & driver knows the endian, I think we can not
> >>>>>> "set" a hardware's endian.
> >>>>> QEMU knows the guest endian-ness and it knows that
> >>>>> device is accessed through the legacy interface.
> >>>>> It can accordingly send endian-ness to the kernel and
> >>>>> kernel can propagate it to the driver.
> >>>>
> >>>> I wonder if we can simply force LE and then Qemu can do the endian
> >>>> conversion?
> >>> convert from LE for config space fields only, or QEMU has to forcefully
> >>> mediate and covert endianness for all device memory access including
> >>> even the datapath (fields in descriptor and avail/used rings)?
> >> Former. Actually, I want to force modern devices for vDPA when
> >> developing the vDPA framework. But then we see requirements for
> >> transitional or even legacy (e.g the Ali ENI parent). So it
> >> complicates things a lot.
> >>
> >> I think several ideas has been proposed:
> >>
> >> 1) Your proposal of having a vDPA specific way for
> >> modern/transitional/legacy awareness. This seems very clean since each
> >> transport should have the ability to do that but it still requires
> >> some kind of mediation for the case e.g running BE legacy guest on LE
> >> host.
> > In theory it seems like so, though practically I wonder if we can just
> > forbid BE legacy driver from running on modern LE host. For those who
> > care about legacy BE guest, they mostly like could and should talk to
> > vendor to get native BE support to achieve hardware acceleration,

The problem is the hardware still needs a way to know the endian of the guest?

> > few
> > of them would count on QEMU in mediating or emulating the datapath
> > (otherwise I don't see the benefit of adopting vDPA?). I still feel
> > that not every hardware vendor has to offer backward compatibility
> > (transitional device) with legacy interface/behavior (BE being just
> > one),

Probably, I agree it is a corner case, and dealing with transitional
device for the following setups is very challenge for hardware:

- driver without IOMMU_PLATFORM support, (requiring device to send
translated request which have security implications)
- BE legacy guest on LE host, (requiring device to have a way to know
the endian)
- device specific requirement (e.g modern virtio-net mandate minimal
header length to contain mrg_rxbuf even if the device doesn't offer
it)

It is not obvious for the hardware vendor, so we may end up defecting
in the implementation. Dealing with compatibility for the transitional
devices is kind of a nightmare which there's no way for the spec to
rule the behavior of legacy devices.

> >  this is unlike the situation on software virtio device, which
> > has legacy support since day one. I think we ever discussed it before:
> > for those vDPA vendors who don't offer legacy guest support, maybe we
> > should mandate some feature for e.g. VERSION_1, as these devices
> > really don't offer functionality of the opposite side (!VERSION_1)
> > during negotiation.

I've tried something similar here (a global mandatory instead of per device).

https://lkml.org/lkml/2021/6/4/26

But for some reason, it is not applied by Michael. It would be a great
relief if we support modern devices only. Maybe it's time to revisit
this idea then we can introduce new backend features and then we can
mandate VERSION_1

> >
> > Having it said, perhaps we should also allow vendor device to
> > implement only partial support for legacy. We can define "reversed"
> > backend feature to denote some part of the legacy
> > interface/functionality not getting implemented by device. For
> > instance, VHOST_BACKEND_F_NO_BE_VRING, VHOST_BACKEND_F_NO_BE_CONFIG,
> > VHOST_BACKEND_F_NO_ALIGNED_VRING,
> > VHOST_BACKEND_NET_F_NO_WRITEABLE_MAC, and et al. Not all of these
> > missing features for legacy would be easy for QEMU to make up for, so
> > QEMU can selectively emulate those at its best when necessary and
> > applicable. In other word, this design shouldn't prevent QEMU from
> > making up for vendor device's partial legacy support.

This looks too heavyweight since it tries to provide compatibility for
legacy drivers. Considering we've introduced modern devices for 5+
years, I'd rather:

- Qemu to mediate the config space stuffs
- Shadow virtqueue to mediate the datapath (AF_XDP told us shadow ring
can perform very well if we do zero-copy).

> >
> >>
> >> 2) Michael suggests using VHOST_SET_VRING_ENDIAN where it means we
> >> need a new config ops for vDPA bus, but it doesn't solve the issue for
> >> config space (at least from its name). We probably need a new ioctl
> >> for both vring and config space.
> > Yep adding a new ioctl makes things better, but I think the key is not
> > the new ioctl. It's whether or not we should enforce every vDPA vendor
> > driver to implement all transitional interfaces to be spec compliant.

I think the answer is no since the spec allows transitional device.
And we know things will be greatly simplified if vDPA support non
transitional device only.

So we can change the question to:

1) do we need (or is it too late) to enforce non transitional device?
2) if yes, can transitional device be mediate in an efficient way?

For 1), it's probably too late but we can invent new vDPA features as
you suggest to be non transitional. Then we can:

1.1) extend the netlink API to provision non-transitonal device
1.2) work on the non-transtional device in the future
1.3) presenting transitional device via mediation

The previous transitional vDPA work as is, it's probably too late to
fix all the issue we suffer.

For 2), the key part is the datapath mediation, we can use shadow virtqueue.

> > If we allow them to reject the VHOST_SET_VRING_ENDIAN  or
> > VHOST_SET_CONFIG_ENDIAN call, what could we do? We would still end up
> > with same situation of either fail the guest, or trying to
> > mediate/emulate, right?
> >
> > Not to mention VHOST_SET_VRING_ENDIAN is rarely supported by vhost
> > today - few distro kernel has CONFIG_VHOST_CROSS_ENDIAN_LEGACY enabled
> > and QEMU just ignores the result. vhost doesn't necessarily depend on
> > it to determine endianness it looks.
> I would like to suggest to add two new config ops get/set_vq_endian()
> and get/set_config_endian() for vDPA. This is used to:
> a) support VHOST_GET/SET_VRING_ENDIAN as MST suggested, and add
> VHOST_SET/GET_CONFIG_ENDIAN for vhost_vdpa.
> If the device has not implemented interface to set its endianess, then
> no matter success or failure of SET_ENDIAN, QEMU knows the endian-ness
> anyway.

How can Qemu know the endian in this way? And if it can, there's no
need for the new API?

> In this case, if the device endianess does not match the guest,
> there needs a mediation layer or fail.
> b) ops->get_config_endian() can always tell the endian-ness of the
> device config space after the vendor driver probing the device. So we
> can use this ops->get_config_endian() for
> MTU, MAC and other fields handling in vdpa_dev_net_config_fill() and we
> don't need to set_features in vdpa_get_config_unlocked(), so no race
> conditions.
> Every time ops->get_config() returned, we can tell the endian by
> ops-config_>get_endian(), we don't need set_features(xxx, 0) if features
> negotiation not done.
>
> The question is: Do we need two pairs of ioctls for both vq and config
> space? Can config space endian-ness differ from the vqs?
> c) do we need a new netlink attr telling the endian-ness to user space?

Generally, I'm not sure this is a good design consider it provides neither:

Compatibility with the virtio spec

nor

Compatibility with the existing vhost API (VHOST_SET_VRING_ENDIAN)

Thanks

>
> Thanks,
> Zhu Lingshan
> >
> >>
> >> or
> >>
> >> 3) revisit the idea of forcing modern only device which may simplify
> >> things a lot
> > I am not actually against forcing modern only config space, given that
> > it's not hard for either QEMU or individual driver to mediate or
> > emulate, and for the most part it's not conflict with the goal of
> > offload or acceleration with vDPA. But forcing LE ring layout IMO
> > would just kill off the potential of a very good use case. Currently
> > for our use case the priority for supporting 0.9.5 guest with vDPA is
> > slightly lower compared to live migration, but it is still in our TODO
> > list.
> >
> > Thanks,
> > -Siwei
> >
> >>
> >> which way should we go?
> >>
> >>> I hope
> >>> it's not the latter, otherwise it loses the point to use vDPA for
> >>> datapath acceleration.
> >>>
> >>> Even if its the former, it's a little weird for vendor device to
> >>> implement a LE config space with BE ring layout, although still
> >>> possible...
> >> Right.
> >>
> >> Thanks
> >>
> >>> -Siwei
> >>>> Thanks
> >>>>
> >>>>
> >>>>>> So if you think we should add a vdpa_ops->get_endian(),
> >>>>>> I will drop these comments in the next version of
> >>>>>> series, and work on a new patch for get_endian().
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Zhu Lingshan
> >>>>> Guests don't get endian-ness from devices so this seems pointless.
> >>>>>
> >
>
Zhu, Lingshan Aug. 23, 2022, 6:52 a.m. UTC | #30
On 8/23/2022 11:26 AM, Jason Wang wrote:
> On Mon, Aug 22, 2022 at 1:08 PM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
>>
>>
>> On 8/20/2022 4:55 PM, Si-Wei Liu wrote:
>>>
>>> On 8/18/2022 5:42 PM, Jason Wang wrote:
>>>> On Fri, Aug 19, 2022 at 7:20 AM Si-Wei Liu <si-wei.liu@oracle.com>
>>>> wrote:
>>>>>
>>>>> On 8/17/2022 9:15 PM, Jason Wang wrote:
>>>>>> 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
>>>>>>> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
>>>>>>>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
>>>>>>>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
>>>>>>>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1
>>>>>>>>>>>> because of
>>>>>>>>>>>> transitional devices, so maybe this is the best we can do for
>>>>>>>>>>>> now
>>>>>>>>>>> I think vhost generally needs an API to declare config space
>>>>>>>>>>> endian-ness
>>>>>>>>>>> to kernel. vdpa can reuse that too then.
>>>>>>>>>> Yes, I remember you have mentioned some IOCTL to set the
>>>>>>>>>> endian-ness,
>>>>>>>>>> for vDPA, I think only the vendor driver knows the endian,
>>>>>>>>>> so we may need a new function vdpa_ops->get_endian().
>>>>>>>>>> In the last thread, we say maybe it's better to add a comment for
>>>>>>>>>> now.
>>>>>>>>>> But if you think we should add a vdpa_ops->get_endian(), I can
>>>>>>>>>> work
>>>>>>>>>> on it for sure!
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Zhu Lingshan
>>>>>>>>> I think QEMU has to set endian-ness. No one else knows.
>>>>>>>> Yes, for SW based vhost it is true. But for HW vDPA, only
>>>>>>>> the device & driver knows the endian, I think we can not
>>>>>>>> "set" a hardware's endian.
>>>>>>> QEMU knows the guest endian-ness and it knows that
>>>>>>> device is accessed through the legacy interface.
>>>>>>> It can accordingly send endian-ness to the kernel and
>>>>>>> kernel can propagate it to the driver.
>>>>>> I wonder if we can simply force LE and then Qemu can do the endian
>>>>>> conversion?
>>>>> convert from LE for config space fields only, or QEMU has to forcefully
>>>>> mediate and covert endianness for all device memory access including
>>>>> even the datapath (fields in descriptor and avail/used rings)?
>>>> Former. Actually, I want to force modern devices for vDPA when
>>>> developing the vDPA framework. But then we see requirements for
>>>> transitional or even legacy (e.g the Ali ENI parent). So it
>>>> complicates things a lot.
>>>>
>>>> I think several ideas has been proposed:
>>>>
>>>> 1) Your proposal of having a vDPA specific way for
>>>> modern/transitional/legacy awareness. This seems very clean since each
>>>> transport should have the ability to do that but it still requires
>>>> some kind of mediation for the case e.g running BE legacy guest on LE
>>>> host.
>>> In theory it seems like so, though practically I wonder if we can just
>>> forbid BE legacy driver from running on modern LE host. For those who
>>> care about legacy BE guest, they mostly like could and should talk to
>>> vendor to get native BE support to achieve hardware acceleration,
> The problem is the hardware still needs a way to know the endian of the guest?
>
>>> few
>>> of them would count on QEMU in mediating or emulating the datapath
>>> (otherwise I don't see the benefit of adopting vDPA?). I still feel
>>> that not every hardware vendor has to offer backward compatibility
>>> (transitional device) with legacy interface/behavior (BE being just
>>> one),
> Probably, I agree it is a corner case, and dealing with transitional
> device for the following setups is very challenge for hardware:
>
> - driver without IOMMU_PLATFORM support, (requiring device to send
> translated request which have security implications)
> - BE legacy guest on LE host, (requiring device to have a way to know
> the endian)
> - device specific requirement (e.g modern virtio-net mandate minimal
> header length to contain mrg_rxbuf even if the device doesn't offer
> it)
>
> It is not obvious for the hardware vendor, so we may end up defecting
> in the implementation. Dealing with compatibility for the transitional
> devices is kind of a nightmare which there's no way for the spec to
> rule the behavior of legacy devices.
>
>>>   this is unlike the situation on software virtio device, which
>>> has legacy support since day one. I think we ever discussed it before:
>>> for those vDPA vendors who don't offer legacy guest support, maybe we
>>> should mandate some feature for e.g. VERSION_1, as these devices
>>> really don't offer functionality of the opposite side (!VERSION_1)
>>> during negotiation.
> I've tried something similar here (a global mandatory instead of per device).
>
> https://lkml.org/lkml/2021/6/4/26
I think this is the best option
>
> But for some reason, it is not applied by Michael. It would be a great
> relief if we support modern devices only. Maybe it's time to revisit
> this idea then we can introduce new backend features and then we can
> mandate VERSION_1
>
>>> Having it said, perhaps we should also allow vendor device to
>>> implement only partial support for legacy. We can define "reversed"
>>> backend feature to denote some part of the legacy
>>> interface/functionality not getting implemented by device. For
>>> instance, VHOST_BACKEND_F_NO_BE_VRING, VHOST_BACKEND_F_NO_BE_CONFIG,
>>> VHOST_BACKEND_F_NO_ALIGNED_VRING,
>>> VHOST_BACKEND_NET_F_NO_WRITEABLE_MAC, and et al. Not all of these
>>> missing features for legacy would be easy for QEMU to make up for, so
>>> QEMU can selectively emulate those at its best when necessary and
>>> applicable. In other word, this design shouldn't prevent QEMU from
>>> making up for vendor device's partial legacy support.
> This looks too heavyweight since it tries to provide compatibility for
> legacy drivers. Considering we've introduced modern devices for 5+
> years, I'd rather:
>
> - Qemu to mediate the config space stuffs
So that's what I have suggested, new ops to support VHOST_SET_VRING_ENDIAN,
then after QEMU issue this IOCTL, no matter success or fail, QEMU will
know the endian. Then QEMU can mediate the config space.

And these ops(get/get_endian) can help us mediate config space fields in
net_config_fill()
> - Shadow virtqueue to mediate the datapath (AF_XDP told us shadow ring
> can perform very well if we do zero-copy).
>
>>>> 2) Michael suggests using VHOST_SET_VRING_ENDIAN where it means we
>>>> need a new config ops for vDPA bus, but it doesn't solve the issue for
>>>> config space (at least from its name). We probably need a new ioctl
>>>> for both vring and config space.
>>> Yep adding a new ioctl makes things better, but I think the key is not
>>> the new ioctl. It's whether or not we should enforce every vDPA vendor
>>> driver to implement all transitional interfaces to be spec compliant.
> I think the answer is no since the spec allows transitional device.
> And we know things will be greatly simplified if vDPA support non
> transitional device only.
>
> So we can change the question to:
>
> 1) do we need (or is it too late) to enforce non transitional device?
> 2) if yes, can transitional device be mediate in an efficient way?
>
> For 1), it's probably too late but we can invent new vDPA features as
> you suggest to be non transitional. Then we can:
>
> 1.1) extend the netlink API to provision non-transitonal device
> 1.2) work on the non-transtional device in the future
> 1.3) presenting transitional device via mediation
>
> The previous transitional vDPA work as is, it's probably too late to
> fix all the issue we suffer.
>
> For 2), the key part is the datapath mediation, we can use shadow virtqueue.
>
>>> If we allow them to reject the VHOST_SET_VRING_ENDIAN  or
>>> VHOST_SET_CONFIG_ENDIAN call, what could we do? We would still end up
>>> with same situation of either fail the guest, or trying to
>>> mediate/emulate, right?
>>>
>>> Not to mention VHOST_SET_VRING_ENDIAN is rarely supported by vhost
>>> today - few distro kernel has CONFIG_VHOST_CROSS_ENDIAN_LEGACY enabled
>>> and QEMU just ignores the result. vhost doesn't necessarily depend on
>>> it to determine endianness it looks.
>> I would like to suggest to add two new config ops get/set_vq_endian()
>> and get/set_config_endian() for vDPA. This is used to:
>> a) support VHOST_GET/SET_VRING_ENDIAN as MST suggested, and add
>> VHOST_SET/GET_CONFIG_ENDIAN for vhost_vdpa.
>> If the device has not implemented interface to set its endianess, then
>> no matter success or failure of SET_ENDIAN, QEMU knows the endian-ness
>> anyway.
> How can Qemu know the endian in this way? And if it can, there's no
> need for the new API?
If we have VHOST_SET_VRING_ENDIAN support for vDPA, then when QEMU sets
the endian of a vDPA device through VHOST_SET_VRING_ENDIAN, no matter
this ioctl success or fail, QEMU knows the endian of the device.
E.g., if QEMU sets BE, but failed, then QEMU knows the device is LE only.
>
>> In this case, if the device endianess does not match the guest,
>> there needs a mediation layer or fail.
>> b) ops->get_config_endian() can always tell the endian-ness of the
>> device config space after the vendor driver probing the device. So we
>> can use this ops->get_config_endian() for
>> MTU, MAC and other fields handling in vdpa_dev_net_config_fill() and we
>> don't need to set_features in vdpa_get_config_unlocked(), so no race
>> conditions.
>> Every time ops->get_config() returned, we can tell the endian by
>> ops-config_>get_endian(), we don't need set_features(xxx, 0) if features
>> negotiation not done.
>>
>> The question is: Do we need two pairs of ioctls for both vq and config
>> space? Can config space endian-ness differ from the vqs?
>> c) do we need a new netlink attr telling the endian-ness to user space?
> Generally, I'm not sure this is a good design consider it provides neither:
>
> Compatibility with the virtio spec
I think this is not about the spec, we implement a new pair of ops to 
set/get_endian(),
then we can handle the device config space fields properly, at least to 
vdpa_dev_net_config_fill().

E.g, we can use __virtio16_to_cpu(vdpa->get_endian(), xxx);

And this can support VHOST_SET_VRING_ENDIAN.
>
> nor
>
> Compatibility with the existing vhost API (VHOST_SET_VRING_ENDIAN)
that's true, but if the endian does not match the guest endian, it can not
work without a mediation layer anyway.

Thanks
Zhu Lingshan
>
> Thanks
>
>> Thanks,
>> Zhu Lingshan
>>>> or
>>>>
>>>> 3) revisit the idea of forcing modern only device which may simplify
>>>> things a lot
>>> I am not actually against forcing modern only config space, given that
>>> it's not hard for either QEMU or individual driver to mediate or
>>> emulate, and for the most part it's not conflict with the goal of
>>> offload or acceleration with vDPA. But forcing LE ring layout IMO
>>> would just kill off the potential of a very good use case. Currently
>>> for our use case the priority for supporting 0.9.5 guest with vDPA is
>>> slightly lower compared to live migration, but it is still in our TODO
>>> list.
>>>
>>> Thanks,
>>> -Siwei
>>>
>>>> which way should we go?
>>>>
>>>>> I hope
>>>>> it's not the latter, otherwise it loses the point to use vDPA for
>>>>> datapath acceleration.
>>>>>
>>>>> Even if its the former, it's a little weird for vendor device to
>>>>> implement a LE config space with BE ring layout, although still
>>>>> possible...
>>>> Right.
>>>>
>>>> Thanks
>>>>
>>>>> -Siwei
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>>> So if you think we should add a vdpa_ops->get_endian(),
>>>>>>>> I will drop these comments in the next version of
>>>>>>>> series, and work on a new patch for get_endian().
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Lingshan
>>>>>>> Guests don't get endian-ness from devices so this seems pointless.
>>>>>>>
Si-Wei Liu Aug. 26, 2022, 6:23 a.m. UTC | #31
On 8/22/2022 8:26 PM, Jason Wang wrote:
> On Mon, Aug 22, 2022 at 1:08 PM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
>>
>>
>> On 8/20/2022 4:55 PM, Si-Wei Liu wrote:
>>>
>>> On 8/18/2022 5:42 PM, Jason Wang wrote:
>>>> On Fri, Aug 19, 2022 at 7:20 AM Si-Wei Liu <si-wei.liu@oracle.com>
>>>> wrote:
>>>>>
>>>>> On 8/17/2022 9:15 PM, Jason Wang wrote:
>>>>>> 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
>>>>>>> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
>>>>>>>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
>>>>>>>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
>>>>>>>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1
>>>>>>>>>>>> because of
>>>>>>>>>>>> transitional devices, so maybe this is the best we can do for
>>>>>>>>>>>> now
>>>>>>>>>>> I think vhost generally needs an API to declare config space
>>>>>>>>>>> endian-ness
>>>>>>>>>>> to kernel. vdpa can reuse that too then.
>>>>>>>>>> Yes, I remember you have mentioned some IOCTL to set the
>>>>>>>>>> endian-ness,
>>>>>>>>>> for vDPA, I think only the vendor driver knows the endian,
>>>>>>>>>> so we may need a new function vdpa_ops->get_endian().
>>>>>>>>>> In the last thread, we say maybe it's better to add a comment for
>>>>>>>>>> now.
>>>>>>>>>> But if you think we should add a vdpa_ops->get_endian(), I can
>>>>>>>>>> work
>>>>>>>>>> on it for sure!
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Zhu Lingshan
>>>>>>>>> I think QEMU has to set endian-ness. No one else knows.
>>>>>>>> Yes, for SW based vhost it is true. But for HW vDPA, only
>>>>>>>> the device & driver knows the endian, I think we can not
>>>>>>>> "set" a hardware's endian.
>>>>>>> QEMU knows the guest endian-ness and it knows that
>>>>>>> device is accessed through the legacy interface.
>>>>>>> It can accordingly send endian-ness to the kernel and
>>>>>>> kernel can propagate it to the driver.
>>>>>> I wonder if we can simply force LE and then Qemu can do the endian
>>>>>> conversion?
>>>>> convert from LE for config space fields only, or QEMU has to forcefully
>>>>> mediate and covert endianness for all device memory access including
>>>>> even the datapath (fields in descriptor and avail/used rings)?
>>>> Former. Actually, I want to force modern devices for vDPA when
>>>> developing the vDPA framework. But then we see requirements for
>>>> transitional or even legacy (e.g the Ali ENI parent). So it
>>>> complicates things a lot.
>>>>
>>>> I think several ideas has been proposed:
>>>>
>>>> 1) Your proposal of having a vDPA specific way for
>>>> modern/transitional/legacy awareness. This seems very clean since each
>>>> transport should have the ability to do that but it still requires
>>>> some kind of mediation for the case e.g running BE legacy guest on LE
>>>> host.
>>> In theory it seems like so, though practically I wonder if we can just
>>> forbid BE legacy driver from running on modern LE host. For those who
>>> care about legacy BE guest, they mostly like could and should talk to
>>> vendor to get native BE support to achieve hardware acceleration,
> The problem is the hardware still needs a way to know the endian of the guest?
Hardware doesn't need to know. VMM should know by judging from VERSION_1 
feature bit negotiation and legacy interface access (with new code), and 
the target architecture endianness (the latter is existing QEMU code).
>
>>> few
>>> of them would count on QEMU in mediating or emulating the datapath
>>> (otherwise I don't see the benefit of adopting vDPA?). I still feel
>>> that not every hardware vendor has to offer backward compatibility
>>> (transitional device) with legacy interface/behavior (BE being just
>>> one),
> Probably, I agree it is a corner case, and dealing with transitional
> device for the following setups is very challenge for hardware:
>
> - driver without IOMMU_PLATFORM support, (requiring device to send
> translated request which have security implications)
Don't get better suggestion for this one, but I presume this is 
something legacy guest users should be aware of ahead in term of 
security implications.

> - BE legacy guest on LE host, (requiring device to have a way to know
> the endian)
Yes. device can tell apart with the help from VMM (judging by VERSION_1 
acknowledgement and if legacy interface is used during negotiation).

> - device specific requirement (e.g modern virtio-net mandate minimal
> header length to contain mrg_rxbuf even if the device doesn't offer
> it)
This one seems to be spec mandated transitional interface requirement? 
Which vDPA hardware vendor should take care of rather (if they do offer 
a transitional device)?
>
> It is not obvious for the hardware vendor, so we may end up defecting
> in the implementation. Dealing with compatibility for the transitional
> devices is kind of a nightmare which there's no way for the spec to
> rule the behavior of legacy devices.
The compatibility detection part is tedious I agree. That's why I 
suggested starting from the very minimal and practically feasible (for 
e.g. on x86), but just don't prohibit the possibility to extend to big 
endian or come up with quirk fixes for various cases in QEMU.

>
>>>   this is unlike the situation on software virtio device, which
>>> has legacy support since day one. I think we ever discussed it before:
>>> for those vDPA vendors who don't offer legacy guest support, maybe we
>>> should mandate some feature for e.g. VERSION_1, as these devices
>>> really don't offer functionality of the opposite side (!VERSION_1)
>>> during negotiation.
> I've tried something similar here (a global mandatory instead of per device).
>
> https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/4/26__;!!ACWV5N9M2RV99hQ!NRQPfj5o9o3MKE12ze1zfXMC-9SqwOWqF26g8RrIyUDbUmwDIwl5WQCaNiDe6aZ2yR83j-NEqRXQNXcNyOo$
>
> But for some reason, it is not applied by Michael. It would be a great
> relief if we support modern devices only. Maybe it's time to revisit
> this idea then we can introduce new backend features and then we can
> mandate VERSION_1
Probably, mandating per-device should be fine I guess.

>
>>> Having it said, perhaps we should also allow vendor device to
>>> implement only partial support for legacy. We can define "reversed"
>>> backend feature to denote some part of the legacy
>>> interface/functionality not getting implemented by device. For
>>> instance, VHOST_BACKEND_F_NO_BE_VRING, VHOST_BACKEND_F_NO_BE_CONFIG,
>>> VHOST_BACKEND_F_NO_ALIGNED_VRING,
>>> VHOST_BACKEND_NET_F_NO_WRITEABLE_MAC, and et al. Not all of these
>>> missing features for legacy would be easy for QEMU to make up for, so
>>> QEMU can selectively emulate those at its best when necessary and
>>> applicable. In other word, this design shouldn't prevent QEMU from
>>> making up for vendor device's partial legacy support.
> This looks too heavyweight since it tries to provide compatibility for
> legacy drivers.
That's just for the sake of extreme backward compatibility, but you can 
say that's not even needed if we mandate transitional device to offer 
all required interfaces for both legacy and modern guest.

>   Considering we've introduced modern devices for 5+
> years, I'd rather:
>
> - Qemu to mediate the config space stuffs
> - Shadow virtqueue to mediate the datapath (AF_XDP told us shadow ring
> can perform very well if we do zero-copy).
This is one way to achieve, though not sure we should stick the only 
hope to zero-copy, which IMHO may take a long way to realize and 
optimize to where a simple datapath passthrough can easily get to (with 
hardware acceleration of coz).

>
>>>> 2) Michael suggests using VHOST_SET_VRING_ENDIAN where it means we
>>>> need a new config ops for vDPA bus, but it doesn't solve the issue for
>>>> config space (at least from its name). We probably need a new ioctl
>>>> for both vring and config space.
>>> Yep adding a new ioctl makes things better, but I think the key is not
>>> the new ioctl. It's whether or not we should enforce every vDPA vendor
>>> driver to implement all transitional interfaces to be spec compliant.
> I think the answer is no since the spec allows transitional device.
> And we know things will be greatly simplified if vDPA support non
> transitional device only.
>
> So we can change the question to:
>
> 1) do we need (or is it too late) to enforce non transitional device?
We already have Alibaba ENI which is sort of a quasi-transitional 
device, right? In the sense it doesn't advertise VERSION_1. I know the 
other parts might not qualify it to be fully transitional, but code now 
doesn't prohibit it from supporting VERSION_1 modern interface depending 
on whatever future need.
> 2) if yes, can transitional device be mediate in an efficient way?
>
> For 1), it's probably too late but we can invent new vDPA features as
> you suggest to be non transitional. Then we can:
>
> 1.1) extend the netlink API to provision non-transitonal device
Define non-transitional: device could be either modern-only or legacy-only?
> 1.2) work on the non-transtional device in the future
> 1.3) presenting transitional device via mediation
presenting transitional on top of a modern device with VERSION_1, right? 
What if the hardware device can support legacy-compatible datapath 
natively that doesn't need mediation? Can it be done with direct 
datapath passthrough without svq involvement?

>
> The previous transitional vDPA work as is, it's probably too late to
> fix all the issue we suffer.
What do you mean work as-is, what's the nomenclature for that, 
quasi-transitional or broken-transitional? and what are the outstanding 
issues you anticipate remaining? If it is device specific or vendor 
driver specific, let it be. But I wonder if there's any generic one that 
has to be fixed in vdpa core to support a truly transitional model.
>
> For 2), the key part is the datapath mediation, we can use shadow virtqueue.
Sure. For our use case, we'd care more in providing transitional rather 
than being non-transitional. So, one device fits all.

Thanks for all the ideas. This discussion is really useful.

Best,
-Siwei
>
>>> If we allow them to reject the VHOST_SET_VRING_ENDIAN  or
>>> VHOST_SET_CONFIG_ENDIAN call, what could we do? We would still end up
>>> with same situation of either fail the guest, or trying to
>>> mediate/emulate, right?
>>>
>>> Not to mention VHOST_SET_VRING_ENDIAN is rarely supported by vhost
>>> today - few distro kernel has CONFIG_VHOST_CROSS_ENDIAN_LEGACY enabled
>>> and QEMU just ignores the result. vhost doesn't necessarily depend on
>>> it to determine endianness it looks.
>> I would like to suggest to add two new config ops get/set_vq_endian()
>> and get/set_config_endian() for vDPA. This is used to:
>> a) support VHOST_GET/SET_VRING_ENDIAN as MST suggested, and add
>> VHOST_SET/GET_CONFIG_ENDIAN for vhost_vdpa.
>> If the device has not implemented interface to set its endianess, then
>> no matter success or failure of SET_ENDIAN, QEMU knows the endian-ness
>> anyway.
> How can Qemu know the endian in this way? And if it can, there's no
> need for the new API?
>
>> In this case, if the device endianess does not match the guest,
>> there needs a mediation layer or fail.
>> b) ops->get_config_endian() can always tell the endian-ness of the
>> device config space after the vendor driver probing the device. So we
>> can use this ops->get_config_endian() for
>> MTU, MAC and other fields handling in vdpa_dev_net_config_fill() and we
>> don't need to set_features in vdpa_get_config_unlocked(), so no race
>> conditions.
>> Every time ops->get_config() returned, we can tell the endian by
>> ops-config_>get_endian(), we don't need set_features(xxx, 0) if features
>> negotiation not done.
>>
>> The question is: Do we need two pairs of ioctls for both vq and config
>> space? Can config space endian-ness differ from the vqs?
>> c) do we need a new netlink attr telling the endian-ness to user space?
> Generally, I'm not sure this is a good design consider it provides neither:
>
> Compatibility with the virtio spec
>
> nor
>
> Compatibility with the existing vhost API (VHOST_SET_VRING_ENDIAN)
>
> Thanks
>
>> Thanks,
>> Zhu Lingshan
>>>> or
>>>>
>>>> 3) revisit the idea of forcing modern only device which may simplify
>>>> things a lot
>>> I am not actually against forcing modern only config space, given that
>>> it's not hard for either QEMU or individual driver to mediate or
>>> emulate, and for the most part it's not conflict with the goal of
>>> offload or acceleration with vDPA. But forcing LE ring layout IMO
>>> would just kill off the potential of a very good use case. Currently
>>> for our use case the priority for supporting 0.9.5 guest with vDPA is
>>> slightly lower compared to live migration, but it is still in our TODO
>>> list.
>>>
>>> Thanks,
>>> -Siwei
>>>
>>>> which way should we go?
>>>>
>>>>> I hope
>>>>> it's not the latter, otherwise it loses the point to use vDPA for
>>>>> datapath acceleration.
>>>>>
>>>>> Even if its the former, it's a little weird for vendor device to
>>>>> implement a LE config space with BE ring layout, although still
>>>>> possible...
>>>> Right.
>>>>
>>>> Thanks
>>>>
>>>>> -Siwei
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>>> So if you think we should add a vdpa_ops->get_endian(),
>>>>>>>> I will drop these comments in the next version of
>>>>>>>> series, and work on a new patch for get_endian().
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Lingshan
>>>>>>> Guests don't get endian-ness from devices so this seems pointless.
>>>>>>>
Zhu, Lingshan Aug. 30, 2022, 9:43 a.m. UTC | #32
On 8/23/2022 2:52 PM, Zhu, Lingshan wrote:
>
>
> On 8/23/2022 11:26 AM, Jason Wang wrote:
>> On Mon, Aug 22, 2022 at 1:08 PM Zhu, Lingshan 
>> <lingshan.zhu@intel.com> wrote:
>>>
>>>
>>> On 8/20/2022 4:55 PM, Si-Wei Liu wrote:
>>>>
>>>> On 8/18/2022 5:42 PM, Jason Wang wrote:
>>>>> On Fri, Aug 19, 2022 at 7:20 AM Si-Wei Liu <si-wei.liu@oracle.com>
>>>>> wrote:
>>>>>>
>>>>>> On 8/17/2022 9:15 PM, Jason Wang wrote:
>>>>>>> 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
>>>>>>>> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
>>>>>>>>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
>>>>>>>>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
>>>>>>>>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1
>>>>>>>>>>>>> because of
>>>>>>>>>>>>> transitional devices, so maybe this is the best we can do for
>>>>>>>>>>>>> now
>>>>>>>>>>>> I think vhost generally needs an API to declare config space
>>>>>>>>>>>> endian-ness
>>>>>>>>>>>> to kernel. vdpa can reuse that too then.
>>>>>>>>>>> Yes, I remember you have mentioned some IOCTL to set the
>>>>>>>>>>> endian-ness,
>>>>>>>>>>> for vDPA, I think only the vendor driver knows the endian,
>>>>>>>>>>> so we may need a new function vdpa_ops->get_endian().
>>>>>>>>>>> In the last thread, we say maybe it's better to add a 
>>>>>>>>>>> comment for
>>>>>>>>>>> now.
>>>>>>>>>>> But if you think we should add a vdpa_ops->get_endian(), I can
>>>>>>>>>>> work
>>>>>>>>>>> on it for sure!
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Zhu Lingshan
>>>>>>>>>> I think QEMU has to set endian-ness. No one else knows.
>>>>>>>>> Yes, for SW based vhost it is true. But for HW vDPA, only
>>>>>>>>> the device & driver knows the endian, I think we can not
>>>>>>>>> "set" a hardware's endian.
>>>>>>>> QEMU knows the guest endian-ness and it knows that
>>>>>>>> device is accessed through the legacy interface.
>>>>>>>> It can accordingly send endian-ness to the kernel and
>>>>>>>> kernel can propagate it to the driver.
>>>>>>> I wonder if we can simply force LE and then Qemu can do the endian
>>>>>>> conversion?
>>>>>> convert from LE for config space fields only, or QEMU has to 
>>>>>> forcefully
>>>>>> mediate and covert endianness for all device memory access including
>>>>>> even the datapath (fields in descriptor and avail/used rings)?
>>>>> Former. Actually, I want to force modern devices for vDPA when
>>>>> developing the vDPA framework. But then we see requirements for
>>>>> transitional or even legacy (e.g the Ali ENI parent). So it
>>>>> complicates things a lot.
>>>>>
>>>>> I think several ideas has been proposed:
>>>>>
>>>>> 1) Your proposal of having a vDPA specific way for
>>>>> modern/transitional/legacy awareness. This seems very clean since 
>>>>> each
>>>>> transport should have the ability to do that but it still requires
>>>>> some kind of mediation for the case e.g running BE legacy guest on LE
>>>>> host.
>>>> In theory it seems like so, though practically I wonder if we can just
>>>> forbid BE legacy driver from running on modern LE host. For those who
>>>> care about legacy BE guest, they mostly like could and should talk to
>>>> vendor to get native BE support to achieve hardware acceleration,
>> The problem is the hardware still needs a way to know the endian of 
>> the guest?
>>
>>>> few
>>>> of them would count on QEMU in mediating or emulating the datapath
>>>> (otherwise I don't see the benefit of adopting vDPA?). I still feel
>>>> that not every hardware vendor has to offer backward compatibility
>>>> (transitional device) with legacy interface/behavior (BE being just
>>>> one),
>> Probably, I agree it is a corner case, and dealing with transitional
>> device for the following setups is very challenge for hardware:
>>
>> - driver without IOMMU_PLATFORM support, (requiring device to send
>> translated request which have security implications)
>> - BE legacy guest on LE host, (requiring device to have a way to know
>> the endian)
>> - device specific requirement (e.g modern virtio-net mandate minimal
>> header length to contain mrg_rxbuf even if the device doesn't offer
>> it)
>>
>> It is not obvious for the hardware vendor, so we may end up defecting
>> in the implementation. Dealing with compatibility for the transitional
>> devices is kind of a nightmare which there's no way for the spec to
>> rule the behavior of legacy devices.
>>
>>>>   this is unlike the situation on software virtio device, which
>>>> has legacy support since day one. I think we ever discussed it before:
>>>> for those vDPA vendors who don't offer legacy guest support, maybe we
>>>> should mandate some feature for e.g. VERSION_1, as these devices
>>>> really don't offer functionality of the opposite side (!VERSION_1)
>>>> during negotiation.
>> I've tried something similar here (a global mandatory instead of per 
>> device).
>>
>> https://lkml.org/lkml/2021/6/4/26
> I think this is the best option
>>
>> But for some reason, it is not applied by Michael. It would be a great
>> relief if we support modern devices only. Maybe it's time to revisit
>> this idea then we can introduce new backend features and then we can
>> mandate VERSION_1
>>
>>>> Having it said, perhaps we should also allow vendor device to
>>>> implement only partial support for legacy. We can define "reversed"
>>>> backend feature to denote some part of the legacy
>>>> interface/functionality not getting implemented by device. For
>>>> instance, VHOST_BACKEND_F_NO_BE_VRING, VHOST_BACKEND_F_NO_BE_CONFIG,
>>>> VHOST_BACKEND_F_NO_ALIGNED_VRING,
>>>> VHOST_BACKEND_NET_F_NO_WRITEABLE_MAC, and et al. Not all of these
>>>> missing features for legacy would be easy for QEMU to make up for, so
>>>> QEMU can selectively emulate those at its best when necessary and
>>>> applicable. In other word, this design shouldn't prevent QEMU from
>>>> making up for vendor device's partial legacy support.
>> This looks too heavyweight since it tries to provide compatibility for
>> legacy drivers. Considering we've introduced modern devices for 5+
>> years, I'd rather:
>>
>> - Qemu to mediate the config space stuffs
> So that's what I have suggested, new ops to support 
> VHOST_SET_VRING_ENDIAN,
> then after QEMU issue this IOCTL, no matter success or fail, QEMU will
> know the endian. Then QEMU can mediate the config space.
>
> And these ops(get/get_endian) can help us mediate config space fields in
> net_config_fill()
Update:

I am working on a RFC patch series implementing these, and will post it 
this week

Thanks,
Zhu Lingshan
>> - Shadow virtqueue to mediate the datapath (AF_XDP told us shadow ring
>> can perform very well if we do zero-copy).
>>
>>>>> 2) Michael suggests using VHOST_SET_VRING_ENDIAN where it means we
>>>>> need a new config ops for vDPA bus, but it doesn't solve the issue 
>>>>> for
>>>>> config space (at least from its name). We probably need a new ioctl
>>>>> for both vring and config space.
>>>> Yep adding a new ioctl makes things better, but I think the key is not
>>>> the new ioctl. It's whether or not we should enforce every vDPA vendor
>>>> driver to implement all transitional interfaces to be spec compliant.
>> I think the answer is no since the spec allows transitional device.
>> And we know things will be greatly simplified if vDPA support non
>> transitional device only.
>>
>> So we can change the question to:
>>
>> 1) do we need (or is it too late) to enforce non transitional device?
>> 2) if yes, can transitional device be mediate in an efficient way?
>>
>> For 1), it's probably too late but we can invent new vDPA features as
>> you suggest to be non transitional. Then we can:
>>
>> 1.1) extend the netlink API to provision non-transitonal device
>> 1.2) work on the non-transtional device in the future
>> 1.3) presenting transitional device via mediation
>>
>> The previous transitional vDPA work as is, it's probably too late to
>> fix all the issue we suffer.
>>
>> For 2), the key part is the datapath mediation, we can use shadow 
>> virtqueue.
>>
>>>> If we allow them to reject the VHOST_SET_VRING_ENDIAN  or
>>>> VHOST_SET_CONFIG_ENDIAN call, what could we do? We would still end up
>>>> with same situation of either fail the guest, or trying to
>>>> mediate/emulate, right?
>>>>
>>>> Not to mention VHOST_SET_VRING_ENDIAN is rarely supported by vhost
>>>> today - few distro kernel has CONFIG_VHOST_CROSS_ENDIAN_LEGACY enabled
>>>> and QEMU just ignores the result. vhost doesn't necessarily depend on
>>>> it to determine endianness it looks.
>>> I would like to suggest to add two new config ops get/set_vq_endian()
>>> and get/set_config_endian() for vDPA. This is used to:
>>> a) support VHOST_GET/SET_VRING_ENDIAN as MST suggested, and add
>>> VHOST_SET/GET_CONFIG_ENDIAN for vhost_vdpa.
>>> If the device has not implemented interface to set its endianess, then
>>> no matter success or failure of SET_ENDIAN, QEMU knows the endian-ness
>>> anyway.
>> How can Qemu know the endian in this way? And if it can, there's no
>> need for the new API?
> If we have VHOST_SET_VRING_ENDIAN support for vDPA, then when QEMU sets
> the endian of a vDPA device through VHOST_SET_VRING_ENDIAN, no matter
> this ioctl success or fail, QEMU knows the endian of the device.
> E.g., if QEMU sets BE, but failed, then QEMU knows the device is LE only.
>>
>>> In this case, if the device endianess does not match the guest,
>>> there needs a mediation layer or fail.
>>> b) ops->get_config_endian() can always tell the endian-ness of the
>>> device config space after the vendor driver probing the device. So we
>>> can use this ops->get_config_endian() for
>>> MTU, MAC and other fields handling in vdpa_dev_net_config_fill() and we
>>> don't need to set_features in vdpa_get_config_unlocked(), so no race
>>> conditions.
>>> Every time ops->get_config() returned, we can tell the endian by
>>> ops-config_>get_endian(), we don't need set_features(xxx, 0) if 
>>> features
>>> negotiation not done.
>>>
>>> The question is: Do we need two pairs of ioctls for both vq and config
>>> space? Can config space endian-ness differ from the vqs?
>>> c) do we need a new netlink attr telling the endian-ness to user space?
>> Generally, I'm not sure this is a good design consider it provides 
>> neither:
>>
>> Compatibility with the virtio spec
> I think this is not about the spec, we implement a new pair of ops to 
> set/get_endian(),
> then we can handle the device config space fields properly, at least 
> to vdpa_dev_net_config_fill().
>
> E.g, we can use __virtio16_to_cpu(vdpa->get_endian(), xxx);
>
> And this can support VHOST_SET_VRING_ENDIAN.
>>
>> nor
>>
>> Compatibility with the existing vhost API (VHOST_SET_VRING_ENDIAN)
> that's true, but if the endian does not match the guest endian, it can 
> not
> work without a mediation layer anyway.
>
> Thanks
> Zhu Lingshan
>>
>> Thanks
>>
>>> Thanks,
>>> Zhu Lingshan
>>>>> or
>>>>>
>>>>> 3) revisit the idea of forcing modern only device which may simplify
>>>>> things a lot
>>>> I am not actually against forcing modern only config space, given that
>>>> it's not hard for either QEMU or individual driver to mediate or
>>>> emulate, and for the most part it's not conflict with the goal of
>>>> offload or acceleration with vDPA. But forcing LE ring layout IMO
>>>> would just kill off the potential of a very good use case. Currently
>>>> for our use case the priority for supporting 0.9.5 guest with vDPA is
>>>> slightly lower compared to live migration, but it is still in our TODO
>>>> list.
>>>>
>>>> Thanks,
>>>> -Siwei
>>>>
>>>>> which way should we go?
>>>>>
>>>>>> I hope
>>>>>> it's not the latter, otherwise it loses the point to use vDPA for
>>>>>> datapath acceleration.
>>>>>>
>>>>>> Even if its the former, it's a little weird for vendor device to
>>>>>> implement a LE config space with BE ring layout, although still
>>>>>> possible...
>>>>> Right.
>>>>>
>>>>> Thanks
>>>>>
>>>>>> -Siwei
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>>> So if you think we should add a vdpa_ops->get_endian(),
>>>>>>>>> I will drop these comments in the next version of
>>>>>>>>> series, and work on a new patch for get_endian().
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Zhu Lingshan
>>>>>>>> Guests don't get endian-ness from devices so this seems pointless.
>>>>>>>>
>
Jason Wang Sept. 2, 2022, 6:03 a.m. UTC | #33
On Fri, Aug 26, 2022 at 2:24 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 8/22/2022 8:26 PM, Jason Wang wrote:
> > On Mon, Aug 22, 2022 at 1:08 PM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
> >>
> >>
> >> On 8/20/2022 4:55 PM, Si-Wei Liu wrote:
> >>>
> >>> On 8/18/2022 5:42 PM, Jason Wang wrote:
> >>>> On Fri, Aug 19, 2022 at 7:20 AM Si-Wei Liu <si-wei.liu@oracle.com>
> >>>> wrote:
> >>>>>
> >>>>> On 8/17/2022 9:15 PM, Jason Wang wrote:
> >>>>>> 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
> >>>>>>> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
> >>>>>>>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
> >>>>>>>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
> >>>>>>>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
> >>>>>>>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
> >>>>>>>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1
> >>>>>>>>>>>> because of
> >>>>>>>>>>>> transitional devices, so maybe this is the best we can do for
> >>>>>>>>>>>> now
> >>>>>>>>>>> I think vhost generally needs an API to declare config space
> >>>>>>>>>>> endian-ness
> >>>>>>>>>>> to kernel. vdpa can reuse that too then.
> >>>>>>>>>> Yes, I remember you have mentioned some IOCTL to set the
> >>>>>>>>>> endian-ness,
> >>>>>>>>>> for vDPA, I think only the vendor driver knows the endian,
> >>>>>>>>>> so we may need a new function vdpa_ops->get_endian().
> >>>>>>>>>> In the last thread, we say maybe it's better to add a comment for
> >>>>>>>>>> now.
> >>>>>>>>>> But if you think we should add a vdpa_ops->get_endian(), I can
> >>>>>>>>>> work
> >>>>>>>>>> on it for sure!
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> Zhu Lingshan
> >>>>>>>>> I think QEMU has to set endian-ness. No one else knows.
> >>>>>>>> Yes, for SW based vhost it is true. But for HW vDPA, only
> >>>>>>>> the device & driver knows the endian, I think we can not
> >>>>>>>> "set" a hardware's endian.
> >>>>>>> QEMU knows the guest endian-ness and it knows that
> >>>>>>> device is accessed through the legacy interface.
> >>>>>>> It can accordingly send endian-ness to the kernel and
> >>>>>>> kernel can propagate it to the driver.
> >>>>>> I wonder if we can simply force LE and then Qemu can do the endian
> >>>>>> conversion?
> >>>>> convert from LE for config space fields only, or QEMU has to forcefully
> >>>>> mediate and covert endianness for all device memory access including
> >>>>> even the datapath (fields in descriptor and avail/used rings)?
> >>>> Former. Actually, I want to force modern devices for vDPA when
> >>>> developing the vDPA framework. But then we see requirements for
> >>>> transitional or even legacy (e.g the Ali ENI parent). So it
> >>>> complicates things a lot.
> >>>>
> >>>> I think several ideas has been proposed:
> >>>>
> >>>> 1) Your proposal of having a vDPA specific way for
> >>>> modern/transitional/legacy awareness. This seems very clean since each
> >>>> transport should have the ability to do that but it still requires
> >>>> some kind of mediation for the case e.g running BE legacy guest on LE
> >>>> host.
> >>> In theory it seems like so, though practically I wonder if we can just
> >>> forbid BE legacy driver from running on modern LE host. For those who
> >>> care about legacy BE guest, they mostly like could and should talk to
> >>> vendor to get native BE support to achieve hardware acceleration,
> > The problem is the hardware still needs a way to know the endian of the guest?
> Hardware doesn't need to know. VMM should know by judging from VERSION_1
> feature bit negotiation and legacy interface access (with new code), and
> the target architecture endianness (the latter is existing QEMU code).
> >
> >>> few
> >>> of them would count on QEMU in mediating or emulating the datapath
> >>> (otherwise I don't see the benefit of adopting vDPA?). I still feel
> >>> that not every hardware vendor has to offer backward compatibility
> >>> (transitional device) with legacy interface/behavior (BE being just
> >>> one),
> > Probably, I agree it is a corner case, and dealing with transitional
> > device for the following setups is very challenge for hardware:
> >
> > - driver without IOMMU_PLATFORM support, (requiring device to send
> > translated request which have security implications)
> Don't get better suggestion for this one, but I presume this is
> something legacy guest users should be aware of ahead in term of
> security implications.

Probably but I think this assumption will prevent the device from
being used in a production environment.

>
> > - BE legacy guest on LE host, (requiring device to have a way to know
> > the endian)
> Yes. device can tell apart with the help from VMM (judging by VERSION_1
> acknowledgement and if legacy interface is used during negotiation).
>
> > - device specific requirement (e.g modern virtio-net mandate minimal
> > header length to contain mrg_rxbuf even if the device doesn't offer
> > it)
> This one seems to be spec mandated transitional interface requirement?

Yes.

> Which vDPA hardware vendor should take care of rather (if they do offer
> a transitional device)?

Right but this is not the only one. Section 7.4 summaries the
transitional device conformance which is a very long list for vendor
to follow.

> >
> > It is not obvious for the hardware vendor, so we may end up defecting
> > in the implementation. Dealing with compatibility for the transitional
> > devices is kind of a nightmare which there's no way for the spec to
> > rule the behavior of legacy devices.
> The compatibility detection part is tedious I agree. That's why I
> suggested starting from the very minimal and practically feasible (for
> e.g. on x86), but just don't prohibit the possibility to extend to big
> endian or come up with quirk fixes for various cases in QEMU.

This is somehow we've already been done, e.g ali ENI is limited to x86.

>
> >
> >>>   this is unlike the situation on software virtio device, which
> >>> has legacy support since day one. I think we ever discussed it before:
> >>> for those vDPA vendors who don't offer legacy guest support, maybe we
> >>> should mandate some feature for e.g. VERSION_1, as these devices
> >>> really don't offer functionality of the opposite side (!VERSION_1)
> >>> during negotiation.
> > I've tried something similar here (a global mandatory instead of per device).
> >
> > https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/4/26__;!!ACWV5N9M2RV99hQ!NRQPfj5o9o3MKE12ze1zfXMC-9SqwOWqF26g8RrIyUDbUmwDIwl5WQCaNiDe6aZ2yR83j-NEqRXQNXcNyOo$
> >
> > But for some reason, it is not applied by Michael. It would be a great
> > relief if we support modern devices only. Maybe it's time to revisit
> > this idea then we can introduce new backend features and then we can
> > mandate VERSION_1
> Probably, mandating per-device should be fine I guess.
>
> >
> >>> Having it said, perhaps we should also allow vendor device to
> >>> implement only partial support for legacy. We can define "reversed"
> >>> backend feature to denote some part of the legacy
> >>> interface/functionality not getting implemented by device. For
> >>> instance, VHOST_BACKEND_F_NO_BE_VRING, VHOST_BACKEND_F_NO_BE_CONFIG,
> >>> VHOST_BACKEND_F_NO_ALIGNED_VRING,
> >>> VHOST_BACKEND_NET_F_NO_WRITEABLE_MAC, and et al. Not all of these
> >>> missing features for legacy would be easy for QEMU to make up for, so
> >>> QEMU can selectively emulate those at its best when necessary and
> >>> applicable. In other word, this design shouldn't prevent QEMU from
> >>> making up for vendor device's partial legacy support.
> > This looks too heavyweight since it tries to provide compatibility for
> > legacy drivers.
> That's just for the sake of extreme backward compatibility, but you can
> say that's not even needed if we mandate transitional device to offer
> all required interfaces for both legacy and modern guest.
>
> >   Considering we've introduced modern devices for 5+
> > years, I'd rather:
> >
> > - Qemu to mediate the config space stuffs
> > - Shadow virtqueue to mediate the datapath (AF_XDP told us shadow ring
> > can perform very well if we do zero-copy).
> This is one way to achieve, though not sure we should stick the only
> hope to zero-copy, which IMHO may take a long way to realize and
> optimize to where a simple datapath passthrough can easily get to (with
> hardware acceleration of coz).

Note that, current shadow virtqueue is zerocopy, Qemu just need to
forward the descriptors.

>
> >
> >>>> 2) Michael suggests using VHOST_SET_VRING_ENDIAN where it means we
> >>>> need a new config ops for vDPA bus, but it doesn't solve the issue for
> >>>> config space (at least from its name). We probably need a new ioctl
> >>>> for both vring and config space.
> >>> Yep adding a new ioctl makes things better, but I think the key is not
> >>> the new ioctl. It's whether or not we should enforce every vDPA vendor
> >>> driver to implement all transitional interfaces to be spec compliant.
> > I think the answer is no since the spec allows transitional device.
> > And we know things will be greatly simplified if vDPA support non
> > transitional device only.
> >
> > So we can change the question to:
> >
> > 1) do we need (or is it too late) to enforce non transitional device?
> We already have Alibaba ENI which is sort of a quasi-transitional
> device, right? In the sense it doesn't advertise VERSION_1. I know the
> other parts might not qualify it to be fully transitional, but code now
> doesn't prohibit it from supporting VERSION_1 modern interface depending
> on whatever future need.

We can ask ENI developer for their future plan, it looks to me a
legacy only device wouldn't be interested in the future.

Zongyong, do you have the plan to implement device with VERSION_1 support?

> > 2) if yes, can transitional device be mediate in an efficient way?
> >
> > For 1), it's probably too late but we can invent new vDPA features as
> > you suggest to be non transitional. Then we can:
> >
> > 1.1) extend the netlink API to provision non-transitonal device
> Define non-transitional: device could be either modern-only or legacy-only?

According to the spec, non-transitional should be modern only.

> > 1.2) work on the non-transtional device in the future
> > 1.3) presenting transitional device via mediation
> presenting transitional on top of a modern device with VERSION_1, right?

Yes, I mean presenting/mediating a transitional device on top of a
non-transitional device.

> What if the hardware device can support legacy-compatible datapath
> natively that doesn't need mediation? Can it be done with direct
> datapath passthrough without svq involvement?

I'd like to avoid supporting legacy-only device like ENI in the
future. The major problem is that it's out of the spec thus the
behaviour is defined by the code not the spec.

>
> >
> > The previous transitional vDPA work as is, it's probably too late to
> > fix all the issue we suffer.
> What do you mean work as-is,

See above, basically I mean the behaviour is defined by the vDPA code
not (or can't) by the spec.

For example, for virtio-pci, we have:

legacy interface: BAR
modern interface: capability

So a transitional device can simple provide both of those interfaces:
E.g for device configuration space, if it is accessed via legacy
interface device know it needs to provide the config with native
endian otherwise little endian when accessing via modern interface.

For virtio-mmio, it looks to me it doesn't provide a way for
transitional device.

For vDPA, we actually don't define whether config_ops is a modern or
legacy interface. This is very tricky for the transitional device
since it tries to reuse the same interface for both legacy and modern
which make it very hard to be correct. E.g:

1) VERSION_1 trick won't work, e.g the spec allows to read device
configuration space before FEATURE_OK. So legacy driver may assume a
native endian.
2) SET_VRING_ENDIAN doesn't fix all the issue, there's still a
question what endian it is before SET_VRING_ENDIAN (or we need to
support userspace without SET_VRING_ENDIAN)
...

Things will be simplified if we mandate the config_ops as the modern
interface and provide the necessary mediation in the hypervisor.

> what's the nomenclature for that,
> quasi-transitional or broken-transitional?

If we invent new API to clarify the modern/legacy and focus on the
modern in the future, we probably don't need a name?

> and what are the outstanding
> issues you anticipate remaining?

Basically we need to check the conformance listed in section 7.4 of the spec.

> If it is device specific or vendor
> driver specific, let it be. But I wonder if there's any generic one that
> has to be fixed in vdpa core to support a truly transitional model.

Two set of config_ops? E.g

legacy_config_ops
modern_config_ops

But I'm not sure whether or not it's worthwhile.

> >
> > For 2), the key part is the datapath mediation, we can use shadow virtqueue.
> Sure. For our use case, we'd care more in providing transitional rather
> than being non-transitional. So, one device fits all.
>
> Thanks for all the ideas. This discussion is really useful.

Appreciate for the discussion.

Thanks

>
> Best,
> -Siwei
> >
> >>> If we allow them to reject the VHOST_SET_VRING_ENDIAN  or
> >>> VHOST_SET_CONFIG_ENDIAN call, what could we do? We would still end up
> >>> with same situation of either fail the guest, or trying to
> >>> mediate/emulate, right?
> >>>
> >>> Not to mention VHOST_SET_VRING_ENDIAN is rarely supported by vhost
> >>> today - few distro kernel has CONFIG_VHOST_CROSS_ENDIAN_LEGACY enabled
> >>> and QEMU just ignores the result. vhost doesn't necessarily depend on
> >>> it to determine endianness it looks.
> >> I would like to suggest to add two new config ops get/set_vq_endian()
> >> and get/set_config_endian() for vDPA. This is used to:
> >> a) support VHOST_GET/SET_VRING_ENDIAN as MST suggested, and add
> >> VHOST_SET/GET_CONFIG_ENDIAN for vhost_vdpa.
> >> If the device has not implemented interface to set its endianess, then
> >> no matter success or failure of SET_ENDIAN, QEMU knows the endian-ness
> >> anyway.
> > How can Qemu know the endian in this way? And if it can, there's no
> > need for the new API?
> >
> >> In this case, if the device endianess does not match the guest,
> >> there needs a mediation layer or fail.
> >> b) ops->get_config_endian() can always tell the endian-ness of the
> >> device config space after the vendor driver probing the device. So we
> >> can use this ops->get_config_endian() for
> >> MTU, MAC and other fields handling in vdpa_dev_net_config_fill() and we
> >> don't need to set_features in vdpa_get_config_unlocked(), so no race
> >> conditions.
> >> Every time ops->get_config() returned, we can tell the endian by
> >> ops-config_>get_endian(), we don't need set_features(xxx, 0) if features
> >> negotiation not done.
> >>
> >> The question is: Do we need two pairs of ioctls for both vq and config
> >> space? Can config space endian-ness differ from the vqs?
> >> c) do we need a new netlink attr telling the endian-ness to user space?
> > Generally, I'm not sure this is a good design consider it provides neither:
> >
> > Compatibility with the virtio spec
> >
> > nor
> >
> > Compatibility with the existing vhost API (VHOST_SET_VRING_ENDIAN)
> >
> > Thanks
> >
> >> Thanks,
> >> Zhu Lingshan
> >>>> or
> >>>>
> >>>> 3) revisit the idea of forcing modern only device which may simplify
> >>>> things a lot
> >>> I am not actually against forcing modern only config space, given that
> >>> it's not hard for either QEMU or individual driver to mediate or
> >>> emulate, and for the most part it's not conflict with the goal of
> >>> offload or acceleration with vDPA. But forcing LE ring layout IMO
> >>> would just kill off the potential of a very good use case. Currently
> >>> for our use case the priority for supporting 0.9.5 guest with vDPA is
> >>> slightly lower compared to live migration, but it is still in our TODO
> >>> list.
> >>>
> >>> Thanks,
> >>> -Siwei
> >>>
> >>>> which way should we go?
> >>>>
> >>>>> I hope
> >>>>> it's not the latter, otherwise it loses the point to use vDPA for
> >>>>> datapath acceleration.
> >>>>>
> >>>>> Even if its the former, it's a little weird for vendor device to
> >>>>> implement a LE config space with BE ring layout, although still
> >>>>> possible...
> >>>> Right.
> >>>>
> >>>> Thanks
> >>>>
> >>>>> -Siwei
> >>>>>> Thanks
> >>>>>>
> >>>>>>
> >>>>>>>> So if you think we should add a vdpa_ops->get_endian(),
> >>>>>>>> I will drop these comments in the next version of
> >>>>>>>> series, and work on a new patch for get_endian().
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Zhu Lingshan
> >>>>>>> Guests don't get endian-ness from devices so this seems pointless.
> >>>>>>>
>
Michael S. Tsirkin Sept. 2, 2022, 6:14 a.m. UTC | #34
On Fri, Sep 02, 2022 at 02:03:22PM +0800, Jason Wang wrote:
> On Fri, Aug 26, 2022 at 2:24 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >
> >
> >
> > On 8/22/2022 8:26 PM, Jason Wang wrote:
> > > On Mon, Aug 22, 2022 at 1:08 PM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
> > >>
> > >>
> > >> On 8/20/2022 4:55 PM, Si-Wei Liu wrote:
> > >>>
> > >>> On 8/18/2022 5:42 PM, Jason Wang wrote:
> > >>>> On Fri, Aug 19, 2022 at 7:20 AM Si-Wei Liu <si-wei.liu@oracle.com>
> > >>>> wrote:
> > >>>>>
> > >>>>> On 8/17/2022 9:15 PM, Jason Wang wrote:
> > >>>>>> 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
> > >>>>>>> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
> > >>>>>>>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
> > >>>>>>>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
> > >>>>>>>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
> > >>>>>>>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
> > >>>>>>>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1
> > >>>>>>>>>>>> because of
> > >>>>>>>>>>>> transitional devices, so maybe this is the best we can do for
> > >>>>>>>>>>>> now
> > >>>>>>>>>>> I think vhost generally needs an API to declare config space
> > >>>>>>>>>>> endian-ness
> > >>>>>>>>>>> to kernel. vdpa can reuse that too then.
> > >>>>>>>>>> Yes, I remember you have mentioned some IOCTL to set the
> > >>>>>>>>>> endian-ness,
> > >>>>>>>>>> for vDPA, I think only the vendor driver knows the endian,
> > >>>>>>>>>> so we may need a new function vdpa_ops->get_endian().
> > >>>>>>>>>> In the last thread, we say maybe it's better to add a comment for
> > >>>>>>>>>> now.
> > >>>>>>>>>> But if you think we should add a vdpa_ops->get_endian(), I can
> > >>>>>>>>>> work
> > >>>>>>>>>> on it for sure!
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks
> > >>>>>>>>>> Zhu Lingshan
> > >>>>>>>>> I think QEMU has to set endian-ness. No one else knows.
> > >>>>>>>> Yes, for SW based vhost it is true. But for HW vDPA, only
> > >>>>>>>> the device & driver knows the endian, I think we can not
> > >>>>>>>> "set" a hardware's endian.
> > >>>>>>> QEMU knows the guest endian-ness and it knows that
> > >>>>>>> device is accessed through the legacy interface.
> > >>>>>>> It can accordingly send endian-ness to the kernel and
> > >>>>>>> kernel can propagate it to the driver.
> > >>>>>> I wonder if we can simply force LE and then Qemu can do the endian
> > >>>>>> conversion?
> > >>>>> convert from LE for config space fields only, or QEMU has to forcefully
> > >>>>> mediate and covert endianness for all device memory access including
> > >>>>> even the datapath (fields in descriptor and avail/used rings)?
> > >>>> Former. Actually, I want to force modern devices for vDPA when
> > >>>> developing the vDPA framework. But then we see requirements for
> > >>>> transitional or even legacy (e.g the Ali ENI parent). So it
> > >>>> complicates things a lot.
> > >>>>
> > >>>> I think several ideas has been proposed:
> > >>>>
> > >>>> 1) Your proposal of having a vDPA specific way for
> > >>>> modern/transitional/legacy awareness. This seems very clean since each
> > >>>> transport should have the ability to do that but it still requires
> > >>>> some kind of mediation for the case e.g running BE legacy guest on LE
> > >>>> host.
> > >>> In theory it seems like so, though practically I wonder if we can just
> > >>> forbid BE legacy driver from running on modern LE host. For those who
> > >>> care about legacy BE guest, they mostly like could and should talk to
> > >>> vendor to get native BE support to achieve hardware acceleration,
> > > The problem is the hardware still needs a way to know the endian of the guest?
> > Hardware doesn't need to know. VMM should know by judging from VERSION_1
> > feature bit negotiation and legacy interface access (with new code), and
> > the target architecture endianness (the latter is existing QEMU code).
> > >
> > >>> few
> > >>> of them would count on QEMU in mediating or emulating the datapath
> > >>> (otherwise I don't see the benefit of adopting vDPA?). I still feel
> > >>> that not every hardware vendor has to offer backward compatibility
> > >>> (transitional device) with legacy interface/behavior (BE being just
> > >>> one),
> > > Probably, I agree it is a corner case, and dealing with transitional
> > > device for the following setups is very challenge for hardware:
> > >
> > > - driver without IOMMU_PLATFORM support, (requiring device to send
> > > translated request which have security implications)
> > Don't get better suggestion for this one, but I presume this is
> > something legacy guest users should be aware of ahead in term of
> > security implications.
> 
> Probably but I think this assumption will prevent the device from
> being used in a production environment.
> 
> >
> > > - BE legacy guest on LE host, (requiring device to have a way to know
> > > the endian)
> > Yes. device can tell apart with the help from VMM (judging by VERSION_1
> > acknowledgement and if legacy interface is used during negotiation).
> >
> > > - device specific requirement (e.g modern virtio-net mandate minimal
> > > header length to contain mrg_rxbuf even if the device doesn't offer
> > > it)
> > This one seems to be spec mandated transitional interface requirement?
> 
> Yes.
> 
> > Which vDPA hardware vendor should take care of rather (if they do offer
> > a transitional device)?
> 
> Right but this is not the only one. Section 7.4 summaries the
> transitional device conformance which is a very long list for vendor
> to follow.
> 
> > >
> > > It is not obvious for the hardware vendor, so we may end up defecting
> > > in the implementation. Dealing with compatibility for the transitional
> > > devices is kind of a nightmare which there's no way for the spec to
> > > rule the behavior of legacy devices.
> > The compatibility detection part is tedious I agree. That's why I
> > suggested starting from the very minimal and practically feasible (for
> > e.g. on x86), but just don't prohibit the possibility to extend to big
> > endian or come up with quirk fixes for various cases in QEMU.
> 
> This is somehow we've already been done, e.g ali ENI is limited to x86.
> 
> >
> > >
> > >>>   this is unlike the situation on software virtio device, which
> > >>> has legacy support since day one. I think we ever discussed it before:
> > >>> for those vDPA vendors who don't offer legacy guest support, maybe we
> > >>> should mandate some feature for e.g. VERSION_1, as these devices
> > >>> really don't offer functionality of the opposite side (!VERSION_1)
> > >>> during negotiation.
> > > I've tried something similar here (a global mandatory instead of per device).
> > >
> > > https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/4/26__;!!ACWV5N9M2RV99hQ!NRQPfj5o9o3MKE12ze1zfXMC-9SqwOWqF26g8RrIyUDbUmwDIwl5WQCaNiDe6aZ2yR83j-NEqRXQNXcNyOo$
> > >
> > > But for some reason, it is not applied by Michael. It would be a great
> > > relief if we support modern devices only. Maybe it's time to revisit
> > > this idea then we can introduce new backend features and then we can
> > > mandate VERSION_1
> > Probably, mandating per-device should be fine I guess.
> >
> > >
> > >>> Having it said, perhaps we should also allow vendor device to
> > >>> implement only partial support for legacy. We can define "reversed"
> > >>> backend feature to denote some part of the legacy
> > >>> interface/functionality not getting implemented by device. For
> > >>> instance, VHOST_BACKEND_F_NO_BE_VRING, VHOST_BACKEND_F_NO_BE_CONFIG,
> > >>> VHOST_BACKEND_F_NO_ALIGNED_VRING,
> > >>> VHOST_BACKEND_NET_F_NO_WRITEABLE_MAC, and et al. Not all of these
> > >>> missing features for legacy would be easy for QEMU to make up for, so
> > >>> QEMU can selectively emulate those at its best when necessary and
> > >>> applicable. In other word, this design shouldn't prevent QEMU from
> > >>> making up for vendor device's partial legacy support.
> > > This looks too heavyweight since it tries to provide compatibility for
> > > legacy drivers.
> > That's just for the sake of extreme backward compatibility, but you can
> > say that's not even needed if we mandate transitional device to offer
> > all required interfaces for both legacy and modern guest.
> >
> > >   Considering we've introduced modern devices for 5+
> > > years, I'd rather:
> > >
> > > - Qemu to mediate the config space stuffs
> > > - Shadow virtqueue to mediate the datapath (AF_XDP told us shadow ring
> > > can perform very well if we do zero-copy).
> > This is one way to achieve, though not sure we should stick the only
> > hope to zero-copy, which IMHO may take a long way to realize and
> > optimize to where a simple datapath passthrough can easily get to (with
> > hardware acceleration of coz).
> 
> Note that, current shadow virtqueue is zerocopy, Qemu just need to
> forward the descriptors.
> 
> >
> > >
> > >>>> 2) Michael suggests using VHOST_SET_VRING_ENDIAN where it means we
> > >>>> need a new config ops for vDPA bus, but it doesn't solve the issue for
> > >>>> config space (at least from its name). We probably need a new ioctl
> > >>>> for both vring and config space.
> > >>> Yep adding a new ioctl makes things better, but I think the key is not
> > >>> the new ioctl. It's whether or not we should enforce every vDPA vendor
> > >>> driver to implement all transitional interfaces to be spec compliant.
> > > I think the answer is no since the spec allows transitional device.
> > > And we know things will be greatly simplified if vDPA support non
> > > transitional device only.
> > >
> > > So we can change the question to:
> > >
> > > 1) do we need (or is it too late) to enforce non transitional device?
> > We already have Alibaba ENI which is sort of a quasi-transitional
> > device, right? In the sense it doesn't advertise VERSION_1. I know the
> > other parts might not qualify it to be fully transitional, but code now
> > doesn't prohibit it from supporting VERSION_1 modern interface depending
> > on whatever future need.
> 
> We can ask ENI developer for their future plan, it looks to me a
> legacy only device wouldn't be interested in the future.
> 
> Zongyong, do you have the plan to implement device with VERSION_1 support?
> 
> > > 2) if yes, can transitional device be mediate in an efficient way?
> > >
> > > For 1), it's probably too late but we can invent new vDPA features as
> > > you suggest to be non transitional. Then we can:
> > >
> > > 1.1) extend the netlink API to provision non-transitonal device
> > Define non-transitional: device could be either modern-only or legacy-only?
> 
> According to the spec, non-transitional should be modern only.
> 
> > > 1.2) work on the non-transtional device in the future
> > > 1.3) presenting transitional device via mediation
> > presenting transitional on top of a modern device with VERSION_1, right?
> 
> Yes, I mean presenting/mediating a transitional device on top of a
> non-transitional device.
> 
> > What if the hardware device can support legacy-compatible datapath
> > natively that doesn't need mediation? Can it be done with direct
> > datapath passthrough without svq involvement?
> 
> I'd like to avoid supporting legacy-only device like ENI in the
> future. The major problem is that it's out of the spec thus the
> behaviour is defined by the code not the spec.
> 
> >
> > >
> > > The previous transitional vDPA work as is, it's probably too late to
> > > fix all the issue we suffer.
> > What do you mean work as-is,
> 
> See above, basically I mean the behaviour is defined by the vDPA code
> not (or can't) by the spec.
> 
> For example, for virtio-pci, we have:
> 
> legacy interface: BAR
> modern interface: capability
> 
> So a transitional device can simple provide both of those interfaces:
> E.g for device configuration space, if it is accessed via legacy
> interface device know it needs to provide the config with native
> endian otherwise little endian when accessing via modern interface.
> 
> For virtio-mmio, it looks to me it doesn't provide a way for
> transitional device.
> 
> For vDPA, we actually don't define whether config_ops is a modern or
> legacy interface. This is very tricky for the transitional device
> since it tries to reuse the same interface for both legacy and modern
> which make it very hard to be correct. E.g:
> 
> 1) VERSION_1 trick won't work, e.g the spec allows to read device
> configuration space before FEATURE_OK. So legacy driver may assume a
> native endian.

I am trying to address this in the spec. There was a fairly narrow
window during which guests accessed config space before
writing out features. Yes before FEATURES_OK but I think
asking that guests send the features to device
before poking at config space that depends on those
features is reasonable.

Similarly, we can also change QEMU to send
features to vdpa on config access that happens before
FEATURES_OK.



> 2) SET_VRING_ENDIAN doesn't fix all the issue, there's still a
> question what endian it is before SET_VRING_ENDIAN (or we need to
> support userspace without SET_VRING_ENDIAN)
> ...
> 
> Things will be simplified if we mandate the config_ops as the modern
> interface and provide the necessary mediation in the hypervisor.
> 
> > what's the nomenclature for that,
> > quasi-transitional or broken-transitional?
> 
> If we invent new API to clarify the modern/legacy and focus on the
> modern in the future, we probably don't need a name?

OK. What will that API be like? Maybe a bit in PROTOCOL_FEATURES?

> > and what are the outstanding
> > issues you anticipate remaining?
> 
> Basically we need to check the conformance listed in section 7.4 of the spec.
> 
> > If it is device specific or vendor
> > driver specific, let it be. But I wonder if there's any generic one that
> > has to be fixed in vdpa core to support a truly transitional model.
> 
> Two set of config_ops? E.g
> 
> legacy_config_ops
> modern_config_ops
> 
> But I'm not sure whether or not it's worthwhile.
> 
> > >
> > > For 2), the key part is the datapath mediation, we can use shadow virtqueue.
> > Sure. For our use case, we'd care more in providing transitional rather
> > than being non-transitional. So, one device fits all.
> >
> > Thanks for all the ideas. This discussion is really useful.
> 
> Appreciate for the discussion.
> 
> Thanks
> 
> >
> > Best,
> > -Siwei
> > >
> > >>> If we allow them to reject the VHOST_SET_VRING_ENDIAN  or
> > >>> VHOST_SET_CONFIG_ENDIAN call, what could we do? We would still end up
> > >>> with same situation of either fail the guest, or trying to
> > >>> mediate/emulate, right?
> > >>>
> > >>> Not to mention VHOST_SET_VRING_ENDIAN is rarely supported by vhost
> > >>> today - few distro kernel has CONFIG_VHOST_CROSS_ENDIAN_LEGACY enabled
> > >>> and QEMU just ignores the result. vhost doesn't necessarily depend on
> > >>> it to determine endianness it looks.
> > >> I would like to suggest to add two new config ops get/set_vq_endian()
> > >> and get/set_config_endian() for vDPA. This is used to:
> > >> a) support VHOST_GET/SET_VRING_ENDIAN as MST suggested, and add
> > >> VHOST_SET/GET_CONFIG_ENDIAN for vhost_vdpa.
> > >> If the device has not implemented interface to set its endianess, then
> > >> no matter success or failure of SET_ENDIAN, QEMU knows the endian-ness
> > >> anyway.
> > > How can Qemu know the endian in this way? And if it can, there's no
> > > need for the new API?
> > >
> > >> In this case, if the device endianess does not match the guest,
> > >> there needs a mediation layer or fail.
> > >> b) ops->get_config_endian() can always tell the endian-ness of the
> > >> device config space after the vendor driver probing the device. So we
> > >> can use this ops->get_config_endian() for
> > >> MTU, MAC and other fields handling in vdpa_dev_net_config_fill() and we
> > >> don't need to set_features in vdpa_get_config_unlocked(), so no race
> > >> conditions.
> > >> Every time ops->get_config() returned, we can tell the endian by
> > >> ops-config_>get_endian(), we don't need set_features(xxx, 0) if features
> > >> negotiation not done.
> > >>
> > >> The question is: Do we need two pairs of ioctls for both vq and config
> > >> space? Can config space endian-ness differ from the vqs?
> > >> c) do we need a new netlink attr telling the endian-ness to user space?
> > > Generally, I'm not sure this is a good design consider it provides neither:
> > >
> > > Compatibility with the virtio spec
> > >
> > > nor
> > >
> > > Compatibility with the existing vhost API (VHOST_SET_VRING_ENDIAN)
> > >
> > > Thanks
> > >
> > >> Thanks,
> > >> Zhu Lingshan
> > >>>> or
> > >>>>
> > >>>> 3) revisit the idea of forcing modern only device which may simplify
> > >>>> things a lot
> > >>> I am not actually against forcing modern only config space, given that
> > >>> it's not hard for either QEMU or individual driver to mediate or
> > >>> emulate, and for the most part it's not conflict with the goal of
> > >>> offload or acceleration with vDPA. But forcing LE ring layout IMO
> > >>> would just kill off the potential of a very good use case. Currently
> > >>> for our use case the priority for supporting 0.9.5 guest with vDPA is
> > >>> slightly lower compared to live migration, but it is still in our TODO
> > >>> list.
> > >>>
> > >>> Thanks,
> > >>> -Siwei
> > >>>
> > >>>> which way should we go?
> > >>>>
> > >>>>> I hope
> > >>>>> it's not the latter, otherwise it loses the point to use vDPA for
> > >>>>> datapath acceleration.
> > >>>>>
> > >>>>> Even if its the former, it's a little weird for vendor device to
> > >>>>> implement a LE config space with BE ring layout, although still
> > >>>>> possible...
> > >>>> Right.
> > >>>>
> > >>>> Thanks
> > >>>>
> > >>>>> -Siwei
> > >>>>>> Thanks
> > >>>>>>
> > >>>>>>
> > >>>>>>>> So if you think we should add a vdpa_ops->get_endian(),
> > >>>>>>>> I will drop these comments in the next version of
> > >>>>>>>> series, and work on a new patch for get_endian().
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Zhu Lingshan
> > >>>>>>> Guests don't get endian-ness from devices so this seems pointless.
> > >>>>>>>
> >
Jason Wang Sept. 5, 2022, 3:54 a.m. UTC | #35
On Fri, Sep 2, 2022 at 2:14 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Sep 02, 2022 at 02:03:22PM +0800, Jason Wang wrote:
> > On Fri, Aug 26, 2022 at 2:24 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> > >
> > >
> > >
> > > On 8/22/2022 8:26 PM, Jason Wang wrote:
> > > > On Mon, Aug 22, 2022 at 1:08 PM Zhu, Lingshan <lingshan.zhu@intel.com> wrote:
> > > >>
> > > >>
> > > >> On 8/20/2022 4:55 PM, Si-Wei Liu wrote:
> > > >>>
> > > >>> On 8/18/2022 5:42 PM, Jason Wang wrote:
> > > >>>> On Fri, Aug 19, 2022 at 7:20 AM Si-Wei Liu <si-wei.liu@oracle.com>
> > > >>>> wrote:
> > > >>>>>
> > > >>>>> On 8/17/2022 9:15 PM, Jason Wang wrote:
> > > >>>>>> 在 2022/8/17 18:37, Michael S. Tsirkin 写道:
> > > >>>>>>> On Wed, Aug 17, 2022 at 05:43:22PM +0800, Zhu, Lingshan wrote:
> > > >>>>>>>> On 8/17/2022 5:39 PM, Michael S. Tsirkin wrote:
> > > >>>>>>>>> On Wed, Aug 17, 2022 at 05:13:59PM +0800, Zhu, Lingshan wrote:
> > > >>>>>>>>>> On 8/17/2022 4:55 PM, Michael S. Tsirkin wrote:
> > > >>>>>>>>>>> On Wed, Aug 17, 2022 at 10:14:26AM +0800, Zhu, Lingshan wrote:
> > > >>>>>>>>>>>> Yes it is a little messy, and we can not check _F_VERSION_1
> > > >>>>>>>>>>>> because of
> > > >>>>>>>>>>>> transitional devices, so maybe this is the best we can do for
> > > >>>>>>>>>>>> now
> > > >>>>>>>>>>> I think vhost generally needs an API to declare config space
> > > >>>>>>>>>>> endian-ness
> > > >>>>>>>>>>> to kernel. vdpa can reuse that too then.
> > > >>>>>>>>>> Yes, I remember you have mentioned some IOCTL to set the
> > > >>>>>>>>>> endian-ness,
> > > >>>>>>>>>> for vDPA, I think only the vendor driver knows the endian,
> > > >>>>>>>>>> so we may need a new function vdpa_ops->get_endian().
> > > >>>>>>>>>> In the last thread, we say maybe it's better to add a comment for
> > > >>>>>>>>>> now.
> > > >>>>>>>>>> But if you think we should add a vdpa_ops->get_endian(), I can
> > > >>>>>>>>>> work
> > > >>>>>>>>>> on it for sure!
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks
> > > >>>>>>>>>> Zhu Lingshan
> > > >>>>>>>>> I think QEMU has to set endian-ness. No one else knows.
> > > >>>>>>>> Yes, for SW based vhost it is true. But for HW vDPA, only
> > > >>>>>>>> the device & driver knows the endian, I think we can not
> > > >>>>>>>> "set" a hardware's endian.
> > > >>>>>>> QEMU knows the guest endian-ness and it knows that
> > > >>>>>>> device is accessed through the legacy interface.
> > > >>>>>>> It can accordingly send endian-ness to the kernel and
> > > >>>>>>> kernel can propagate it to the driver.
> > > >>>>>> I wonder if we can simply force LE and then Qemu can do the endian
> > > >>>>>> conversion?
> > > >>>>> convert from LE for config space fields only, or QEMU has to forcefully
> > > >>>>> mediate and covert endianness for all device memory access including
> > > >>>>> even the datapath (fields in descriptor and avail/used rings)?
> > > >>>> Former. Actually, I want to force modern devices for vDPA when
> > > >>>> developing the vDPA framework. But then we see requirements for
> > > >>>> transitional or even legacy (e.g the Ali ENI parent). So it
> > > >>>> complicates things a lot.
> > > >>>>
> > > >>>> I think several ideas has been proposed:
> > > >>>>
> > > >>>> 1) Your proposal of having a vDPA specific way for
> > > >>>> modern/transitional/legacy awareness. This seems very clean since each
> > > >>>> transport should have the ability to do that but it still requires
> > > >>>> some kind of mediation for the case e.g running BE legacy guest on LE
> > > >>>> host.
> > > >>> In theory it seems like so, though practically I wonder if we can just
> > > >>> forbid BE legacy driver from running on modern LE host. For those who
> > > >>> care about legacy BE guest, they mostly like could and should talk to
> > > >>> vendor to get native BE support to achieve hardware acceleration,
> > > > The problem is the hardware still needs a way to know the endian of the guest?
> > > Hardware doesn't need to know. VMM should know by judging from VERSION_1
> > > feature bit negotiation and legacy interface access (with new code), and
> > > the target architecture endianness (the latter is existing QEMU code).
> > > >
> > > >>> few
> > > >>> of them would count on QEMU in mediating or emulating the datapath
> > > >>> (otherwise I don't see the benefit of adopting vDPA?). I still feel
> > > >>> that not every hardware vendor has to offer backward compatibility
> > > >>> (transitional device) with legacy interface/behavior (BE being just
> > > >>> one),
> > > > Probably, I agree it is a corner case, and dealing with transitional
> > > > device for the following setups is very challenge for hardware:
> > > >
> > > > - driver without IOMMU_PLATFORM support, (requiring device to send
> > > > translated request which have security implications)
> > > Don't get better suggestion for this one, but I presume this is
> > > something legacy guest users should be aware of ahead in term of
> > > security implications.
> >
> > Probably but I think this assumption will prevent the device from
> > being used in a production environment.
> >
> > >
> > > > - BE legacy guest on LE host, (requiring device to have a way to know
> > > > the endian)
> > > Yes. device can tell apart with the help from VMM (judging by VERSION_1
> > > acknowledgement and if legacy interface is used during negotiation).
> > >
> > > > - device specific requirement (e.g modern virtio-net mandate minimal
> > > > header length to contain mrg_rxbuf even if the device doesn't offer
> > > > it)
> > > This one seems to be spec mandated transitional interface requirement?
> >
> > Yes.
> >
> > > Which vDPA hardware vendor should take care of rather (if they do offer
> > > a transitional device)?
> >
> > Right but this is not the only one. Section 7.4 summaries the
> > transitional device conformance which is a very long list for vendor
> > to follow.
> >
> > > >
> > > > It is not obvious for the hardware vendor, so we may end up defecting
> > > > in the implementation. Dealing with compatibility for the transitional
> > > > devices is kind of a nightmare which there's no way for the spec to
> > > > rule the behavior of legacy devices.
> > > The compatibility detection part is tedious I agree. That's why I
> > > suggested starting from the very minimal and practically feasible (for
> > > e.g. on x86), but just don't prohibit the possibility to extend to big
> > > endian or come up with quirk fixes for various cases in QEMU.
> >
> > This is somehow we've already been done, e.g ali ENI is limited to x86.
> >
> > >
> > > >
> > > >>>   this is unlike the situation on software virtio device, which
> > > >>> has legacy support since day one. I think we ever discussed it before:
> > > >>> for those vDPA vendors who don't offer legacy guest support, maybe we
> > > >>> should mandate some feature for e.g. VERSION_1, as these devices
> > > >>> really don't offer functionality of the opposite side (!VERSION_1)
> > > >>> during negotiation.
> > > > I've tried something similar here (a global mandatory instead of per device).
> > > >
> > > > https://urldefense.com/v3/__https://lkml.org/lkml/2021/6/4/26__;!!ACWV5N9M2RV99hQ!NRQPfj5o9o3MKE12ze1zfXMC-9SqwOWqF26g8RrIyUDbUmwDIwl5WQCaNiDe6aZ2yR83j-NEqRXQNXcNyOo$
> > > >
> > > > But for some reason, it is not applied by Michael. It would be a great
> > > > relief if we support modern devices only. Maybe it's time to revisit
> > > > this idea then we can introduce new backend features and then we can
> > > > mandate VERSION_1
> > > Probably, mandating per-device should be fine I guess.
> > >
> > > >
> > > >>> Having it said, perhaps we should also allow vendor device to
> > > >>> implement only partial support for legacy. We can define "reversed"
> > > >>> backend feature to denote some part of the legacy
> > > >>> interface/functionality not getting implemented by device. For
> > > >>> instance, VHOST_BACKEND_F_NO_BE_VRING, VHOST_BACKEND_F_NO_BE_CONFIG,
> > > >>> VHOST_BACKEND_F_NO_ALIGNED_VRING,
> > > >>> VHOST_BACKEND_NET_F_NO_WRITEABLE_MAC, and et al. Not all of these
> > > >>> missing features for legacy would be easy for QEMU to make up for, so
> > > >>> QEMU can selectively emulate those at its best when necessary and
> > > >>> applicable. In other word, this design shouldn't prevent QEMU from
> > > >>> making up for vendor device's partial legacy support.
> > > > This looks too heavyweight since it tries to provide compatibility for
> > > > legacy drivers.
> > > That's just for the sake of extreme backward compatibility, but you can
> > > say that's not even needed if we mandate transitional device to offer
> > > all required interfaces for both legacy and modern guest.
> > >
> > > >   Considering we've introduced modern devices for 5+
> > > > years, I'd rather:
> > > >
> > > > - Qemu to mediate the config space stuffs
> > > > - Shadow virtqueue to mediate the datapath (AF_XDP told us shadow ring
> > > > can perform very well if we do zero-copy).
> > > This is one way to achieve, though not sure we should stick the only
> > > hope to zero-copy, which IMHO may take a long way to realize and
> > > optimize to where a simple datapath passthrough can easily get to (with
> > > hardware acceleration of coz).
> >
> > Note that, current shadow virtqueue is zerocopy, Qemu just need to
> > forward the descriptors.
> >
> > >
> > > >
> > > >>>> 2) Michael suggests using VHOST_SET_VRING_ENDIAN where it means we
> > > >>>> need a new config ops for vDPA bus, but it doesn't solve the issue for
> > > >>>> config space (at least from its name). We probably need a new ioctl
> > > >>>> for both vring and config space.
> > > >>> Yep adding a new ioctl makes things better, but I think the key is not
> > > >>> the new ioctl. It's whether or not we should enforce every vDPA vendor
> > > >>> driver to implement all transitional interfaces to be spec compliant.
> > > > I think the answer is no since the spec allows transitional device.
> > > > And we know things will be greatly simplified if vDPA support non
> > > > transitional device only.
> > > >
> > > > So we can change the question to:
> > > >
> > > > 1) do we need (or is it too late) to enforce non transitional device?
> > > We already have Alibaba ENI which is sort of a quasi-transitional
> > > device, right? In the sense it doesn't advertise VERSION_1. I know the
> > > other parts might not qualify it to be fully transitional, but code now
> > > doesn't prohibit it from supporting VERSION_1 modern interface depending
> > > on whatever future need.
> >
> > We can ask ENI developer for their future plan, it looks to me a
> > legacy only device wouldn't be interested in the future.
> >
> > Zongyong, do you have the plan to implement device with VERSION_1 support?
> >
> > > > 2) if yes, can transitional device be mediate in an efficient way?
> > > >
> > > > For 1), it's probably too late but we can invent new vDPA features as
> > > > you suggest to be non transitional. Then we can:
> > > >
> > > > 1.1) extend the netlink API to provision non-transitonal device
> > > Define non-transitional: device could be either modern-only or legacy-only?
> >
> > According to the spec, non-transitional should be modern only.
> >
> > > > 1.2) work on the non-transtional device in the future
> > > > 1.3) presenting transitional device via mediation
> > > presenting transitional on top of a modern device with VERSION_1, right?
> >
> > Yes, I mean presenting/mediating a transitional device on top of a
> > non-transitional device.
> >
> > > What if the hardware device can support legacy-compatible datapath
> > > natively that doesn't need mediation? Can it be done with direct
> > > datapath passthrough without svq involvement?
> >
> > I'd like to avoid supporting legacy-only device like ENI in the
> > future. The major problem is that it's out of the spec thus the
> > behaviour is defined by the code not the spec.
> >
> > >
> > > >
> > > > The previous transitional vDPA work as is, it's probably too late to
> > > > fix all the issue we suffer.
> > > What do you mean work as-is,
> >
> > See above, basically I mean the behaviour is defined by the vDPA code
> > not (or can't) by the spec.
> >
> > For example, for virtio-pci, we have:
> >
> > legacy interface: BAR
> > modern interface: capability
> >
> > So a transitional device can simple provide both of those interfaces:
> > E.g for device configuration space, if it is accessed via legacy
> > interface device know it needs to provide the config with native
> > endian otherwise little endian when accessing via modern interface.
> >
> > For virtio-mmio, it looks to me it doesn't provide a way for
> > transitional device.
> >
> > For vDPA, we actually don't define whether config_ops is a modern or
> > legacy interface. This is very tricky for the transitional device
> > since it tries to reuse the same interface for both legacy and modern
> > which make it very hard to be correct. E.g:
> >
> > 1) VERSION_1 trick won't work, e.g the spec allows to read device
> > configuration space before FEATURE_OK. So legacy driver may assume a
> > native endian.
>
> I am trying to address this in the spec. There was a fairly narrow
> window during which guests accessed config space before
> writing out features. Yes before FEATURES_OK but I think
> asking that guests send the features to device
> before poking at config space that depends on those
> features is reasonable.

I'm not sure I get this. For transitional devices, we have legacy
interfaces so legacy drivers should work anyhow even without the fix.

>
> Similarly, we can also change QEMU to send
> features to vdpa on config access that happens before
> FEATURES_OK.

This only works when the guest driver sets a feature before config
space accessing. And then vDPA presents LE if VERSION_1 is negotiated?
But at the API level, vhost-vDPA still needs to handle the case when
VHOST_VDPA_GET_CONFIG is called before VHOST_SET_FEATURES.

Mandating a LE looks a better way then QEMU can mediate in the middle,
e.g converting to BE when necessary.

>
>
>
> > 2) SET_VRING_ENDIAN doesn't fix all the issue, there's still a
> > question what endian it is before SET_VRING_ENDIAN (or we need to
> > support userspace without SET_VRING_ENDIAN)
> > ...
> >
> > Things will be simplified if we mandate the config_ops as the modern
> > interface and provide the necessary mediation in the hypervisor.
> >
> > > what's the nomenclature for that,
> > > quasi-transitional or broken-transitional?
> >
> > If we invent new API to clarify the modern/legacy and focus on the
> > modern in the future, we probably don't need a name?
>
> OK. What will that API be like? Maybe a bit in PROTOCOL_FEATURES?

Something like this, a bit via VHOST_GET_BACKEND_FEATURES.

Thanks

>
> > > and what are the outstanding
> > > issues you anticipate remaining?
> >
> > Basically we need to check the conformance listed in section 7.4 of the spec.
> >
> > > If it is device specific or vendor
> > > driver specific, let it be. But I wonder if there's any generic one that
> > > has to be fixed in vdpa core to support a truly transitional model.
> >
> > Two set of config_ops? E.g
> >
> > legacy_config_ops
> > modern_config_ops
> >
> > But I'm not sure whether or not it's worthwhile.
> >
> > > >
> > > > For 2), the key part is the datapath mediation, we can use shadow virtqueue.
> > > Sure. For our use case, we'd care more in providing transitional rather
> > > than being non-transitional. So, one device fits all.
> > >
> > > Thanks for all the ideas. This discussion is really useful.
> >
> > Appreciate for the discussion.
> >
> > Thanks
> >
> > >
> > > Best,
> > > -Siwei
> > > >
> > > >>> If we allow them to reject the VHOST_SET_VRING_ENDIAN  or
> > > >>> VHOST_SET_CONFIG_ENDIAN call, what could we do? We would still end up
> > > >>> with same situation of either fail the guest, or trying to
> > > >>> mediate/emulate, right?
> > > >>>
> > > >>> Not to mention VHOST_SET_VRING_ENDIAN is rarely supported by vhost
> > > >>> today - few distro kernel has CONFIG_VHOST_CROSS_ENDIAN_LEGACY enabled
> > > >>> and QEMU just ignores the result. vhost doesn't necessarily depend on
> > > >>> it to determine endianness it looks.
> > > >> I would like to suggest to add two new config ops get/set_vq_endian()
> > > >> and get/set_config_endian() for vDPA. This is used to:
> > > >> a) support VHOST_GET/SET_VRING_ENDIAN as MST suggested, and add
> > > >> VHOST_SET/GET_CONFIG_ENDIAN for vhost_vdpa.
> > > >> If the device has not implemented interface to set its endianess, then
> > > >> no matter success or failure of SET_ENDIAN, QEMU knows the endian-ness
> > > >> anyway.
> > > > How can Qemu know the endian in this way? And if it can, there's no
> > > > need for the new API?
> > > >
> > > >> In this case, if the device endianess does not match the guest,
> > > >> there needs a mediation layer or fail.
> > > >> b) ops->get_config_endian() can always tell the endian-ness of the
> > > >> device config space after the vendor driver probing the device. So we
> > > >> can use this ops->get_config_endian() for
> > > >> MTU, MAC and other fields handling in vdpa_dev_net_config_fill() and we
> > > >> don't need to set_features in vdpa_get_config_unlocked(), so no race
> > > >> conditions.
> > > >> Every time ops->get_config() returned, we can tell the endian by
> > > >> ops-config_>get_endian(), we don't need set_features(xxx, 0) if features
> > > >> negotiation not done.
> > > >>
> > > >> The question is: Do we need two pairs of ioctls for both vq and config
> > > >> space? Can config space endian-ness differ from the vqs?
> > > >> c) do we need a new netlink attr telling the endian-ness to user space?
> > > > Generally, I'm not sure this is a good design consider it provides neither:
> > > >
> > > > Compatibility with the virtio spec
> > > >
> > > > nor
> > > >
> > > > Compatibility with the existing vhost API (VHOST_SET_VRING_ENDIAN)
> > > >
> > > > Thanks
> > > >
> > > >> Thanks,
> > > >> Zhu Lingshan
> > > >>>> or
> > > >>>>
> > > >>>> 3) revisit the idea of forcing modern only device which may simplify
> > > >>>> things a lot
> > > >>> I am not actually against forcing modern only config space, given that
> > > >>> it's not hard for either QEMU or individual driver to mediate or
> > > >>> emulate, and for the most part it's not conflict with the goal of
> > > >>> offload or acceleration with vDPA. But forcing LE ring layout IMO
> > > >>> would just kill off the potential of a very good use case. Currently
> > > >>> for our use case the priority for supporting 0.9.5 guest with vDPA is
> > > >>> slightly lower compared to live migration, but it is still in our TODO
> > > >>> list.
> > > >>>
> > > >>> Thanks,
> > > >>> -Siwei
> > > >>>
> > > >>>> which way should we go?
> > > >>>>
> > > >>>>> I hope
> > > >>>>> it's not the latter, otherwise it loses the point to use vDPA for
> > > >>>>> datapath acceleration.
> > > >>>>>
> > > >>>>> Even if its the former, it's a little weird for vendor device to
> > > >>>>> implement a LE config space with BE ring layout, although still
> > > >>>>> possible...
> > > >>>> Right.
> > > >>>>
> > > >>>> Thanks
> > > >>>>
> > > >>>>> -Siwei
> > > >>>>>> Thanks
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>>> So if you think we should add a vdpa_ops->get_endian(),
> > > >>>>>>>> I will drop these comments in the next version of
> > > >>>>>>>> series, and work on a new patch for get_endian().
> > > >>>>>>>>
> > > >>>>>>>> Thanks,
> > > >>>>>>>> Zhu Lingshan
> > > >>>>>>> Guests don't get endian-ness from devices so this seems pointless.
> > > >>>>>>>
> > >
>
diff mbox series

Patch

diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index efb55a06e961..a74660b98979 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -801,19 +801,44 @@  static int vdpa_nl_cmd_dev_get_dumpit(struct sk_buff *msg, struct netlink_callba
 	return msg->len;
 }
 
-static int vdpa_dev_net_mq_config_fill(struct vdpa_device *vdev,
-				       struct sk_buff *msg, u64 features,
+static int vdpa_dev_net_mq_config_fill(struct sk_buff *msg, u64 features,
 				       const struct virtio_net_config *config)
 {
 	u16 val_u16;
 
-	if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0)
-		return 0;
+	if ((features & BIT_ULL(VIRTIO_NET_F_MQ)) == 0 &&
+	    (features & BIT_ULL(VIRTIO_NET_F_RSS)) == 0)
+		val_u16 = 1;
+	else
+		val_u16 = __virtio16_to_cpu(true, config->max_virtqueue_pairs);
 
-	val_u16 = le16_to_cpu(config->max_virtqueue_pairs);
 	return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, val_u16);
 }
 
+static int vdpa_dev_net_mtu_config_fill(struct sk_buff *msg, u64 features,
+					const struct virtio_net_config *config)
+{
+	u16 val_u16;
+
+	if ((features & BIT_ULL(VIRTIO_NET_F_MTU)) == 0)
+		val_u16 = 1500;
+	else
+		val_u16 = __virtio16_to_cpu(true, config->mtu);
+
+	return nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16);
+}
+
+static int vdpa_dev_net_mac_config_fill(struct sk_buff *msg, u64 features,
+					const struct virtio_net_config *config)
+{
+	if ((features & BIT_ULL(VIRTIO_NET_F_MAC)) == 0)
+		return 0;
+	else
+		return  nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR,
+				sizeof(config->mac), config->mac);
+}
+
+
 static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *msg)
 {
 	struct virtio_net_config config = {};
@@ -822,18 +847,16 @@  static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *ms
 
 	vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
 
-	if (nla_put(msg, VDPA_ATTR_DEV_NET_CFG_MACADDR, sizeof(config.mac),
-		    config.mac))
-		return -EMSGSIZE;
+	/*
+	 * Assume little endian for now, userspace can tweak this for
+	 * legacy guest support.
+	 */
+	val_u16 = __virtio16_to_cpu(true, config.status);
 
 	val_u16 = __virtio16_to_cpu(true, config.status);
 	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_STATUS, val_u16))
 		return -EMSGSIZE;
 
-	val_u16 = __virtio16_to_cpu(true, config.mtu);
-	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MTU, val_u16))
-		return -EMSGSIZE;
-
 	features_driver = vdev->config->get_driver_features(vdev);
 	if (nla_put_u64_64bit(msg, VDPA_ATTR_DEV_NEGOTIATED_FEATURES, features_driver,
 			      VDPA_ATTR_PAD))
@@ -846,7 +869,13 @@  static int vdpa_dev_net_config_fill(struct vdpa_device *vdev, struct sk_buff *ms
 			      VDPA_ATTR_PAD))
 		return -EMSGSIZE;
 
-	return vdpa_dev_net_mq_config_fill(vdev, msg, features_driver, &config);
+	if (vdpa_dev_net_mac_config_fill(msg, features_device, &config))
+		return -EMSGSIZE;
+
+	if (vdpa_dev_net_mtu_config_fill(msg, features_device, &config))
+		return -EMSGSIZE;
+
+	return vdpa_dev_net_mq_config_fill(msg, features_device, &config);
 }
 
 static int
@@ -914,6 +943,11 @@  static int vdpa_fill_stats_rec(struct vdpa_device *vdev, struct sk_buff *msg,
 	}
 	vdpa_get_config_unlocked(vdev, 0, &config, sizeof(config));
 
+	/*
+	 * Assume little endian for now, userspace can tweak this for
+	 * legacy guest support.
+	 */
+
 	max_vqp = __virtio16_to_cpu(true, config.max_virtqueue_pairs);
 	if (nla_put_u16(msg, VDPA_ATTR_DEV_NET_CFG_MAX_VQP, max_vqp))
 		return -EMSGSIZE;