[rc,v2] iommu/arm-smmu-v3: Do not use GFP_KERNEL under a spinlock

Message ID 0-v2-7988c2682634+261-smmu_cd_atomic_jgg@nvidia.com (mailing list archive)
State New, archived
Series [rc,v2] iommu/arm-smmu-v3: Do not use GFP_KERNEL under a spinlock

Commit Message

Jason Gunthorpe Feb. 15, 2024, 2:56 p.m. UTC
If the SMMU is configured to use a two level CD table then
arm_smmu_write_ctx_desc() allocates a CD table leaf internally using
GFP_KERNEL. Due to recent changes this is being done under a spinlock
taken to iterate over the device list - thus it will trigger a
sleeping-while-atomic warning:

  arm_smmu_sva_set_dev_pasid()
    mutex_lock(&sva_lock);
    __arm_smmu_sva_bind()
      arm_smmu_mmu_notifier_get()
        spin_lock_irqsave()
        arm_smmu_write_ctx_desc()
          arm_smmu_get_cd_ptr()
            arm_smmu_alloc_cd_leaf_table()
              dmam_alloc_coherent(GFP_KERNEL)

This is a 64K high-order allocation and really should not be done
atomically.
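
The general shape of the bug, as a minimal sketch (illustrative only;
broken_bind() is a hypothetical stand-in for the call chain above):

	static int broken_bind(struct arm_smmu_domain *smmu_domain, int ssid,
			       struct arm_smmu_ctx_desc *cd)
	{
		struct arm_smmu_master *master;
		unsigned long flags;
		int ret = 0;

		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
		list_for_each_entry(master, &smmu_domain->devices, domain_head) {
			/*
			 * May call dmam_alloc_coherent(.., GFP_KERNEL) to
			 * allocate a CD leaf table. GFP_KERNEL can sleep,
			 * which is illegal under a spinlock with IRQs off.
			 */
			ret = arm_smmu_write_ctx_desc(master, ssid, cd);
			if (ret)
				break;
		}
		spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
		return ret;
	}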

At the moment the rework of SVA to follow the new API is half finished.
Recently the CD table memory was moved from the domain to the master,
however we have the confusing situation where the SVA code is wrongly
using the RID domain's device list to track which CD tables the SVA is
installed in.

Remove the logic to replicate the CD across all the domain's masters
during attach. We know which master and which CD table the PASID should be
installed in.

At the moment SVA is only invoked when dma-iommu.c is in control of the
RID translation, which means we have a single iommu_domain shared across
the entire group and that iommu_domain is not shared outside the group.

For PCI cases the core code also insists on singleton groups, so there
is only ever one entry in the smmu_domain->devices list and it is equal
to the master being passed in to arm_smmu_sva_set_dev_pasid().

Only non-PCI cases may have multi-device groups. However, the core code
itself will replicate the calls to arm_smmu_sva_set_dev_pasid() across
the entire group, so we will still correctly install the CD into each
group member's master.

Thus the loop here is fully redundant and can be removed.

Removing the loop allows arm_smmu_write_ctx_desc() to be called outside
the spinlock, where it is safe to use GFP_KERNEL.
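
With the loop gone, the bind path reduces to a single CD write for the
master being attached, done outside any spinlock (taken verbatim from
the diff below):

	ret = arm_smmu_write_ctx_desc(master, pasid, cd);
	if (ret)
		goto err_put_notifier;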

Fixes: 24503148c545 ("iommu/arm-smmu-v3: Refactor write_ctx_desc")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/4e25d161-0cf8-4050-9aa3-dfa21cd63e56@moroto.mountain/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   | 29 +++++--------------
 1 file changed, 8 insertions(+), 21 deletions(-)

v2 - rewritten


base-commit: c76c50cd425e574df5a071cd7e1805d0764aabff

Comments

Will Deacon Feb. 16, 2024, 12:05 p.m. UTC | #1
On Thu, Feb 15, 2024 at 10:56:57AM -0400, Jason Gunthorpe wrote:
> [...]
> 
> Only non-PCI cases may have multi-device groups. However, the core code
> itself will replicate the calls to arm_smmu_sva_set_dev_pasid() across
> the entire group, so we will still correctly install the CD into each
> group member's master.

Are you sure about this paragraph? arm_smmu_mmu_notifier_get() will return
early if it finds an existing notifier in the 'mmu_notifiers' list for the
domain, so I don't think we'll actually get as far as installing the CD,
will we?
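
The lookup I have in mind is (roughly) the one at the top of
arm_smmu_mmu_notifier_get():

	list_for_each_entry(smmu_mn, &smmu_domain->mmu_notifiers, list) {
		if (smmu_mn->mn.mm == mm) {
			refcount_inc(&smmu_mn->refs);
			return smmu_mn;	/* the CD write is never reached */
		}
	}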

Will
Jason Gunthorpe Feb. 16, 2024, 12:36 p.m. UTC | #2
On Fri, Feb 16, 2024 at 12:05:12PM +0000, Will Deacon wrote:
> On Thu, Feb 15, 2024 at 10:56:57AM -0400, Jason Gunthorpe wrote:
> > [...]
> > 
> > Only non-PCI cases may have multi-device groups. However, the core code
> > itself will replicate the calls to arm_smmu_sva_set_dev_pasid() across
> > the entire group, so we will still correctly install the CD into each
> > group member's master.
> 
> Are you sure about this paragraph? arm_smmu_mmu_notifier_get() will return
> early if it finds an existing notifier in the 'mmu_notifiers' list for the
> domain, so I don't think we'll actually get as far as installing the CD,
> will we?

I think the paragraph is the right analysis, the code just isn't
listening very well..

Lifting up the arm_smmu_write_ctx_desc() into the caller will fix it.

Also Michael should look at it (I recall we talked about this once)
and Nicolin should test it.

BTW, I have no idea if non-PCI cases exist; everyone I know is doing
PCI SVA.

Jason
Michael Shavit Feb. 17, 2024, 12:25 p.m. UTC | #3
On Fri, Feb 16, 2024 at 8:36 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Feb 16, 2024 at 12:05:12PM +0000, Will Deacon wrote:
> > On Thu, Feb 15, 2024 at 10:56:57AM -0400, Jason Gunthorpe wrote:
> > > [...]
> >
> > Are you sure about this paragraph? arm_smmu_mmu_notifier_get() will return
> > early if it finds an existing notifier in the 'mmu_notifiers' list for the
> > domain, so I don't think we'll actually get as far as installing the CD,
> > will we?
>
> I think the paragraph is the right analysis, the code just isn't
> listening very well..
>
> Lifting up the arm_smmu_write_ctx_desc() into the caller will fix it.

Calling arm_smmu_write_ctx_desc requires the CD which we get from the
mmu_notifiers list...which makes it a bit more complicated than that.
But it does sound doable with some work (perhaps keeping it in
arm_smmu_mmu_notifier_get for now and changing the early return
logic?).

>
> Also Michael should look at it (I recall we talked about this once)
> and Nicolin should test it.


Just to make sure I follow why we're pursuing this instead of the v1
rc patch: in the non-PCI multi-device group scenario, the first call to
set_dev_pasid would only have pre-allocated for the current master, but
arm_smmu_mmu_notifier_get would then still call arm_smmu_write_ctx_desc()
for the other masters?

>
> BTW, I have no idea if non-PCI cases exist; everyone I know is doing
> PCI SVA.
>
> Jason
Jason Gunthorpe Feb. 17, 2024, 1:24 p.m. UTC | #4
On Sat, Feb 17, 2024 at 08:25:36PM +0800, Michael Shavit wrote:

> Calling arm_smmu_write_ctx_desc requires the CD which we get from the
> mmu_notifiers list...which makes it a bit more complicated than
> that.

@@ -404,9 +384,15 @@ static int __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
                goto err_free_bond;
        }
 
+       ret = arm_smmu_write_ctx_desc(master, pasid, bond->smmu_mn->cd);
+       if (ret)
+               goto err_put_notifier;
+


> > Also Michael should look at it (I recall we talked about this once)
> > and Nicolin should test it.
> 
> Just to make sure I follow why we're pursuing this instead of the v1
> rc patch: in the non-PCI multi-device group scenario, the first call to
> set_dev_pasid would only have pre-allocated for the current master, but
> arm_smmu_mmu_notifier_get would then still call arm_smmu_write_ctx_desc()
> for the other masters?

Yes, though noting I have no idea if non-PCI cases exist. But it does
make the code clearer and removes even the possibility of a sleeping
allocation under the spinlock.

Jason
Michael Shavit Feb. 19, 2024, 8:32 a.m. UTC | #5
On Sat, Feb 17, 2024 at 9:24 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Sat, Feb 17, 2024 at 08:25:36PM +0800, Michael Shavit wrote:
>
> > Calling arm_smmu_write_ctx_desc requires the CD which we get from the
> > mmu_notifiers list...which makes it a bit more complicated than
> > that.
>
> @@ -404,9 +384,15 @@ static int __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
>                 goto err_free_bond;
>         }
>
> +       ret = arm_smmu_write_ctx_desc(master, pasid, bond->smmu_mn->cd);
> +       if (ret)
> +               goto err_put_notifier;
> +
>
Oh hey that's not much more complicated :) . I'm guessing that'll be a
call to arm_smmu_mmu_notifier_put() rather than a goto on the error
path?
Speaking of... the arm_smmu_update_ctx_desc_devices() call in
arm_smmu_mmu_notifier_put() might be tricky as well if it's reached on
the error path before arm_smmu_write_ctx_desc() has been called for all
devices (in the theoretical non-PCI case).
Jason Gunthorpe Feb. 20, 2024, 12:35 a.m. UTC | #6
On Mon, Feb 19, 2024 at 04:32:40PM +0800, Michael Shavit wrote:
> On Sat, Feb 17, 2024 at 9:24 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Sat, Feb 17, 2024 at 08:25:36PM +0800, Michael Shavit wrote:
> >
> > > Calling arm_smmu_write_ctx_desc requires the CD which we get from the
> > > mmu_notifiers list...which makes it a bit more complicated than
> > > that.
> >
> > @@ -404,9 +384,15 @@ static int __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
> >                 goto err_free_bond;
> >         }
> >
> > +       ret = arm_smmu_write_ctx_desc(master, pasid, bond->smmu_mn->cd);
> > +       if (ret)
> > +               goto err_put_notifier;
> > +
> >
> Oh hey that's not much more complicated :) . I'm guessing that'll be a
> call to arm_smmu_mmu_notifier_put() rather than a goto on the error
> path?

Yes, the error unwind is the goto, which does the put and kfree(bond)

> Speaking of... the arm_smmu_update_ctx_desc_devices() call in
> arm_smmu_mmu_notifier_put() might be tricky as well if it's reached on
> the error path before arm_smmu_write_ctx_desc() has been called for all
> devices (in the theoretical non-PCI case).

It is hard to understand, but I think it is close enough to OK...

The put will pair with the get, and if this is the first member of the
group then it will do a symmetric tear down. The

	if (!refcount_dec_and_test(&smmu_mn->refs))
		return;

takes care of that.

Otherwise the puts decrement the refcount back down to zero, which
should be hit once __iommu_remove_group_pasid() has removed the PASID
from all the group members.

IOW the CDs are not cleaned up on any devices until all the group
members have the PASID removed. Clearly it is not the correct design,
but it looks like it works well enough even in error paths.

Then when it does eventually reach arm_smmu_update_ctx_desc_devices()
it wipes all the CDs of all the RID domain's masters which is the same
as the group membership.

Which will end up happening before iommu_attach_device_pasid() returns
on its error path.
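
For reference, that helper is roughly the following (a sketch from
memory, modulo the exact signature):

	static void
	arm_smmu_update_ctx_desc_devices(struct arm_smmu_domain *smmu_domain,
					 int ssid, struct arm_smmu_ctx_desc *cd)
	{
		struct arm_smmu_master *master;
		unsigned long flags;

		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
		list_for_each_entry(master, &smmu_domain->devices, domain_head)
			arm_smmu_write_ctx_desc(master, ssid, cd);
		spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
	}

Passing cd == NULL clears the entry for every master on the RID
domain's device list, which matches the group membership.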

(obviously this is all made to work logically and properly in part 2)

Jason
Will Deacon Feb. 21, 2024, 1:08 p.m. UTC | #7
On Sat, Feb 17, 2024 at 09:24:46AM -0400, Jason Gunthorpe wrote:
> On Sat, Feb 17, 2024 at 08:25:36PM +0800, Michael Shavit wrote:
> 
> > Calling arm_smmu_write_ctx_desc requires the CD which we get from the
> > mmu_notifiers list...which makes it a bit more complicated than
> > that.
> 
> @@ -404,9 +384,15 @@ static int __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
>                 goto err_free_bond;
>         }
>  
> +       ret = arm_smmu_write_ctx_desc(master, pasid, bond->smmu_mn->cd);
> +       if (ret)
> +               goto err_put_notifier;
> +

Please can you spin a v3 with this folded in?

Will

Patch

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index 05722121f00e70..64ec3f0ed8ffe0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -288,14 +288,14 @@  static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
 
 /* Allocate or get existing MMU notifier for this {domain, mm} pair */
 static struct arm_smmu_mmu_notifier *
-arm_smmu_mmu_notifier_get(struct arm_smmu_domain *smmu_domain,
+arm_smmu_mmu_notifier_get(struct arm_smmu_master *master, ioasid_t pasid,
 			  struct mm_struct *mm)
 {
 	int ret;
-	unsigned long flags;
 	struct arm_smmu_ctx_desc *cd;
 	struct arm_smmu_mmu_notifier *smmu_mn;
-	struct arm_smmu_master *master;
+	struct iommu_domain *domain = iommu_get_domain_for_dev(master->dev);
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
 
 	list_for_each_entry(smmu_mn, &smmu_domain->mmu_notifiers, list) {
 		if (smmu_mn->mn.mm == mm) {
@@ -325,19 +325,7 @@  arm_smmu_mmu_notifier_get(struct arm_smmu_domain *smmu_domain,
 		goto err_free_cd;
 	}
 
-	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
-	list_for_each_entry(master, &smmu_domain->devices, domain_head) {
-		ret = arm_smmu_write_ctx_desc(master, mm_get_enqcmd_pasid(mm),
-					      cd);
-		if (ret) {
-			list_for_each_entry_from_reverse(
-				master, &smmu_domain->devices, domain_head)
-				arm_smmu_write_ctx_desc(
-					master, mm_get_enqcmd_pasid(mm), NULL);
-			break;
-		}
-	}
-	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+	ret = arm_smmu_write_ctx_desc(master, pasid, cd);
 	if (ret)
 		goto err_put_notifier;
 
@@ -381,13 +369,12 @@  static void arm_smmu_mmu_notifier_put(struct arm_smmu_mmu_notifier *smmu_mn)
 	arm_smmu_free_shared_cd(cd);
 }
 
-static int __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
+static int __arm_smmu_sva_bind(struct device *dev, ioasid_t pasid,
+			       struct mm_struct *mm)
 {
 	int ret;
 	struct arm_smmu_bond *bond;
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
-	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
 
 	if (!master || !master->sva_enabled)
 		return -ENODEV;
@@ -398,7 +385,7 @@  static int __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
 
 	bond->mm = mm;
 
-	bond->smmu_mn = arm_smmu_mmu_notifier_get(smmu_domain, mm);
+	bond->smmu_mn = arm_smmu_mmu_notifier_get(master, pasid, mm);
 	if (IS_ERR(bond->smmu_mn)) {
 		ret = PTR_ERR(bond->smmu_mn);
 		goto err_free_bond;
@@ -590,7 +577,7 @@  static int arm_smmu_sva_set_dev_pasid(struct iommu_domain *domain,
 	struct mm_struct *mm = domain->mm;
 
 	mutex_lock(&sva_lock);
-	ret = __arm_smmu_sva_bind(dev, mm);
+	ret = __arm_smmu_sva_bind(dev, id, mm);
 	mutex_unlock(&sva_lock);
 
 	return ret;