
[3/5] vfio/iommu_type1: Prefer to reuse domains vs match enforced cache coherency

Message ID 20220606061927.26049-4-nicolinc@nvidia.com (mailing list archive)
State New
Series Simplify vfio_iommu_type1 attach/detach routine

Commit Message

Nicolin Chen June 6, 2022, 6:19 a.m. UTC
From: Jason Gunthorpe <jgg@nvidia.com>

The KVM mechanism for controlling wbinvd is only triggered during
kvm_vfio_group_add(), meaning it is a one-shot test done once the devices
are set up.

So, there is no value in trying to push a device that could do enforced
cache coherency to a dedicated domain vs re-using an existing domain since
KVM won't be able to take advantage of it. This just wastes domain memory.

Simplify this code and eliminate the test. This removes the only logic
that needed to have a dummy domain attached prior to searching for a
matching domain and simplifies the next patches.

If someday we want to optimize this further, the better approach is to
update the Intel driver so that enforce_cache_coherency() can work on a
domain that already has IOPTEs, and then call enforce_cache_coherency()
after detaching a device from a domain to upgrade the whole domain to
enforced cache coherency mode.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
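
As a rough illustration of the future optimization mentioned in the commit
message (not part of this patch), the upgrade after a detach could look
something like the sketch below. The helper name is hypothetical; struct
vfio_domain and iommu_enforce_cache_coherency() follow the existing
vfio/iommu code, and it assumes the Intel driver's enforce_cache_coherency()
has been taught to work on a domain that already holds IOPTEs:

/*
 * Hypothetical sketch, not part of this patch: after a device detaches from
 * a reused domain, try to upgrade the remaining domain to enforced cache
 * coherency.  Assumes enforce_cache_coherency() can be applied to a domain
 * that already has IOPTEs.
 */
static void vfio_iommu_try_upgrade_coherency(struct vfio_domain *domain)
{
	if (domain->enforce_cache_coherency)
		return;		/* already enforced, nothing to do */

	/* returns false if any remaining device cannot be forced to snoop */
	domain->enforce_cache_coherency =
		iommu_enforce_cache_coherency(domain->domain);
}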

Comments

Tian, Kevin June 8, 2022, 8:28 a.m. UTC | #1
> From: Nicolin Chen
> Sent: Monday, June 6, 2022 2:19 PM
> 
> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> The KVM mechanism for controlling wbinvd is only triggered during
> kvm_vfio_group_add(), meaning it is a one-shot test done once the devices
> are setup.

It's not one-shot. kvm_vfio_update_coherency() is called in both
group_add() and group_del(). Then the coherency property is
checked dynamically in wbinvd emulation:

kvm_emulate_wbinvd()
  kvm_emulate_wbinvd_noskip()
    need_emulate_wbinvd()
      kvm_arch_has_noncoherent_dma()
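
The end of that chain is just reading a per-VM counter; roughly (paraphrased
from arch/x86/kvm/x86.c, not verbatim):

	/* Paraphrased: the "noncoherent DMA" state KVM consults is simply a
	 * per-VM counter of noncoherent DMA sources. */
	bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
	{
		return atomic_read(&kvm->arch.noncoherent_dma_count);
	}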

It's also checked when a vcpu is scheduled to a new cpu, to track the
dirty cpus that need a cache flush when emulating wbinvd on that vcpu.
See kvm_arch_vcpu_load().

	/* Address WBINVD may be executed by guest */
	if (need_emulate_wbinvd(vcpu)) {
		if (static_call(kvm_x86_has_wbinvd_exit)())
			cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);

In addition, it's also checked when deciding the effective memory
type of EPT entry. See vmx_get_mt_mask().

	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

But I doubt the above can work reliably when the property is changed
on the fly, given the above paths are triggered at different points. The
guest may end up in a mixed state where inconsistent coherency
is assumed in different emulation paths.

And in reality I don't think such a niche scenario is even tested,
given the only device imposing such a trick is the integrated Intel GPU,
which iiuc no one would try to hotplug/hot-remove to/from a guest.

Given that, I'm fine with the change in this patch. Even more, probably
we really want an explicit one-shot model so KVM can lock down
the property once it starts to consume it; then further adding a new
group which would change the coherency is explicitly rejected, and
removing an existing group leaves it intact.
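
Something along these lines, purely as a sketch (the lock-down field and
helper below are made up for illustration, not existing code):

	/* Hypothetical: once KVM has consumed the coherency property, adding
	 * a group that would flip a coherent VM to noncoherent is rejected;
	 * removing a group leaves the property intact. */
	static int kvm_vfio_check_coherency_locked(struct kvm_vfio *kv,
						   bool group_coherent)
	{
		if (kv->coherency_locked && !kv->noncoherent && !group_coherent)
			return -EPERM;	/* would flip the property after lock-down */
		return 0;
	}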

> 
> So, there is no value in trying to push a device that could do enforced
> cache coherency to a dedicated domain vs re-using an existing domain since

"an existing domain (even if it doesn't enforce coherency)", otherwise if
it's already compatible there is no question here.

> KVM won't be able to take advantage of it. This just wastes domain memory.
> 
> Simplify this code and eliminate the test. This removes the only logic
> that needed to have a dummy domain attached prior to searching for a
> matching domain and simplifies the next patches.
> 
> If someday we want to try and optimize this further the better approach is
> to update the Intel driver so that enforce_cache_coherency() can work on a
> domain that already has IOPTEs and then call the enforce_cache_coherency()
> after detaching a device from a domain to upgrade the whole domain to
> enforced cache coherency mode.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index c13b9290e357..f4e3b423a453 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2285,9 +2285,7 @@ static int vfio_iommu_type1_attach_group(void
> *iommu_data,
>  	 * testing if they're on the same bus_type.
>  	 */
>  	list_for_each_entry(d, &iommu->domain_list, next) {
> -		if (d->domain->ops == domain->domain->ops &&
> -		    d->enforce_cache_coherency ==
> -			    domain->enforce_cache_coherency) {
> +		if (d->domain->ops == domain->domain->ops) {
>  			iommu_detach_group(domain->domain, group-
> >iommu_group);
>  			if (!iommu_attach_group(d->domain,
>  						group->iommu_group)) {
> --
> 2.17.1
> 
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
Jason Gunthorpe June 8, 2022, 11:17 a.m. UTC | #2
On Wed, Jun 08, 2022 at 08:28:03AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen
> > Sent: Monday, June 6, 2022 2:19 PM
> > 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > 
> > The KVM mechanism for controlling wbinvd is only triggered during
> > kvm_vfio_group_add(), meaning it is a one-shot test done once the devices
> > are setup.
> 
> It's not one-shot. kvm_vfio_update_coherency() is called in both
> group_add() and group_del(). Then the coherency property is
> checked dynamically in wbinvd emulation:

From the perspective of managing the domains, that is still
one-shot. It doesn't get updated when individual devices are
added to/removed from domains.

> given that I'm fine with the change in this patch. Even more probably
> we really want an explicit one-shot model so KVM can lock down
> the property once it starts to consume it then further adding a new
> group which would change the coherency is explicitly rejected and
> removing an existing group leaves it intact.

Why? Once wbinvd is enabled it is compatible with all domain
configurations, so just leave it on and ignore everything at that
point.

Jason
Tian, Kevin June 8, 2022, 11:48 p.m. UTC | #3
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 8, 2022 7:17 PM
> 
> On Wed, Jun 08, 2022 at 08:28:03AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen
> > > Sent: Monday, June 6, 2022 2:19 PM
> > >
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > >
> > > The KVM mechanism for controlling wbinvd is only triggered during
> > > kvm_vfio_group_add(), meaning it is a one-shot test done once the
> devices
> > > are setup.
> >
> > It's not one-shot. kvm_vfio_update_coherency() is called in both
> > group_add() and group_del(). Then the coherency property is
> > checked dynamically in wbinvd emulation:
> 
> From the perspective of managing the domains that is still
> one-shot. It doesn't get updated when individual devices are
> added/removed to domains.

It's unchanged per-domain but dynamic per-VM when multiple
domains are added/removed (i.e. kvm->arch.noncoherent_dma_count).
It's the latter that is checked in KVM.
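
i.e. something like this (a simplified sketch of virt/kvm/vfio.c plus
arch/x86/kvm/x86.c; helper names approximate and locking omitted):

	/* Sketch, not verbatim: the per-VM property is an OR across all vfio
	 * groups, folded into the counter that the emulation paths read via
	 * kvm_arch_has_noncoherent_dma(). */
	static void kvm_vfio_update_coherency(struct kvm_device *dev)
	{
		struct kvm_vfio *kv = dev->private;
		struct kvm_vfio_group *kvg;
		bool noncoherent = false;

		/* a single group without enforced coherency taints the whole VM */
		list_for_each_entry(kvg, &kv->group_list, node)
			if (!kvm_vfio_group_is_coherent(kvg))	/* approximate name */
				noncoherent = true;

		if (noncoherent != kv->noncoherent) {
			kv->noncoherent = noncoherent;
			if (noncoherent)
				kvm_arch_register_noncoherent_dma(dev->kvm);	/* atomic_inc */
			else
				kvm_arch_unregister_noncoherent_dma(dev->kvm);	/* atomic_dec */
		}
	}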

> 
> > given that I'm fine with the change in this patch. Even more probably
> > we really want an explicit one-shot model so KVM can lock down
> > the property once it starts to consume it then further adding a new
> > group which would change the coherency is explicitly rejected and
> > removing an existing group leaves it intact.
> 
> Why? Once wbinvd is enabled it is compatible with all domain
> configurations, so just leave it on and ignore everything at that
> point.
> 

More than that. My point was to make it a static policy, so even if
wbinvd is disabled at the start we want to leave it off and not have it
affected by adding a device which doesn't have coherency. 'wbinvd off' is
not a compatible configuration, hence imo we need a way to reject adding
an incompatible device.

Thanks
Kevin
Nicolin Chen June 14, 2022, 8:45 p.m. UTC | #4
Hi Kevin,

On Wed, Jun 08, 2022 at 11:48:27PM +0000, Tian, Kevin wrote:
> > > > The KVM mechanism for controlling wbinvd is only triggered during
> > > > kvm_vfio_group_add(), meaning it is a one-shot test done once the
> > devices
> > > > are setup.
> > >
> > > It's not one-shot. kvm_vfio_update_coherency() is called in both
> > > group_add() and group_del(). Then the coherency property is
> > > checked dynamically in wbinvd emulation:
> >
> > From the perspective of managing the domains that is still
> > one-shot. It doesn't get updated when individual devices are
> > added/removed to domains.
> 
> It's unchanged per-domain but dynamic per-vm when multiple
> domains are added/removed (i.e. kvm->arch.noncoherent_dma_count).
> It's the latter being checked in the kvm.

I am going to send a v2, but I'm not quite getting the point here.
Meanwhile, Jason is on leave.

What, in your opinion, would be an accurate description here?

Thanks
Nic
Tian, Kevin June 15, 2022, 7:35 a.m. UTC | #5
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, June 15, 2022 4:45 AM
> 
> Hi Kevin,
> 
> On Wed, Jun 08, 2022 at 11:48:27PM +0000, Tian, Kevin wrote:
> > > > > The KVM mechanism for controlling wbinvd is only triggered during
> > > > > kvm_vfio_group_add(), meaning it is a one-shot test done once the
> > > devices
> > > > > are setup.
> > > >
> > > > It's not one-shot. kvm_vfio_update_coherency() is called in both
> > > > group_add() and group_del(). Then the coherency property is
> > > > checked dynamically in wbinvd emulation:
> > >
> > > From the perspective of managing the domains that is still
> > > one-shot. It doesn't get updated when individual devices are
> > > added/removed to domains.
> >
> > It's unchanged per-domain but dynamic per-vm when multiple
> > domains are added/removed (i.e. kvm->arch.noncoherent_dma_count).
> > It's the latter being checked in the kvm.
> 
> I am going to send a v2, yet not quite getting the point here.
> Meanwhile, Jason is on leave.
> 
> What, in your opinion, would be an accurate description here?
> 

Something like below:
--
The KVM mechanism for controlling wbinvd is based on an OR of
the coherency property of all devices attached to a guest, no matter
whether those devices are attached to a single domain or multiple domains.

So, there is no value in trying to push a device that could do enforced
cache coherency to a dedicated domain vs re-using an existing domain
which is non-coherent since KVM won't be able to take advantage of it. 
This just wastes domain memory.

Simplify this code and eliminate the test. This removes the only logic
that needed to have a dummy domain attached prior to searching for a
matching domain and simplifies the next patches.

It's unclear whether we want to further optimize the Intel driver to
update the domain coherency after a device is detached from it, at
least not before KVM can be verified to handle such dynamics in related
emulation paths (wbinvd, vcpu load, write_cr0, ept, etc.). In reality
we don't see a usage requiring such optimization, as the only device
which imposes such non-coherency is the Intel GPU, which doesn't even
support hotplug/hot-remove.
--

Thanks
Kevin
Nicolin Chen June 15, 2022, 11:12 p.m. UTC | #6
On Wed, Jun 15, 2022 at 07:35:00AM +0000, Tian, Kevin wrote:
> 
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, June 15, 2022 4:45 AM
> >
> > Hi Kevin,
> >
> > On Wed, Jun 08, 2022 at 11:48:27PM +0000, Tian, Kevin wrote:
> > > > > > The KVM mechanism for controlling wbinvd is only triggered during
> > > > > > kvm_vfio_group_add(), meaning it is a one-shot test done once the
> > > > devices
> > > > > > are setup.
> > > > >
> > > > > It's not one-shot. kvm_vfio_update_coherency() is called in both
> > > > > group_add() and group_del(). Then the coherency property is
> > > > > checked dynamically in wbinvd emulation:
> > > >
> > > > From the perspective of managing the domains that is still
> > > > one-shot. It doesn't get updated when individual devices are
> > > > added/removed to domains.
> > >
> > > It's unchanged per-domain but dynamic per-vm when multiple
> > > domains are added/removed (i.e. kvm->arch.noncoherent_dma_count).
> > > It's the latter being checked in the kvm.
> >
> > I am going to send a v2, yet not quite getting the point here.
> > Meanwhile, Jason is on leave.
> >
> > What, in your opinion, would be an accurate description here?
> >
> 
> Something like below:
> --
> The KVM mechanism for controlling wbinvd is based on OR of
> the coherency property of all devices attached to a guest, no matter
> those devices  are attached to a single domain or multiple domains.
> 
> So, there is no value in trying to push a device that could do enforced
> cache coherency to a dedicated domain vs re-using an existing domain
> which is non-coherent since KVM won't be able to take advantage of it.
> This just wastes domain memory.
> 
> Simplify this code and eliminate the test. This removes the only logic
> that needed to have a dummy domain attached prior to searching for a
> matching domain and simplifies the next patches.
> 
> It's unclear whether we want to further optimize the Intel driver to
> update the domain coherency after a device is detached from it, at
> least not before KVM can be verified to handle such dynamics in related
> emulation paths (wbinvd, vcpu load, write_cr0, ept, etc.). In reality
> we don't see an usage requiring such optimization as the only device
> which imposes such non-coherency is Intel GPU which even doesn't
> support hotplug/hot remove.

Thanks! I just updated that and will send v2.

Patch

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c13b9290e357..f4e3b423a453 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2285,9 +2285,7 @@  static int vfio_iommu_type1_attach_group(void *iommu_data,
 	 * testing if they're on the same bus_type.
 	 */
 	list_for_each_entry(d, &iommu->domain_list, next) {
-		if (d->domain->ops == domain->domain->ops &&
-		    d->enforce_cache_coherency ==
-			    domain->enforce_cache_coherency) {
+		if (d->domain->ops == domain->domain->ops) {
 			iommu_detach_group(domain->domain, group->iommu_group);
 			if (!iommu_attach_group(d->domain,
 						group->iommu_group)) {