Message ID | 0-v2-7988c2682634+261-smmu_cd_atomic_jgg@nvidia.com (mailing list archive)
---|---
State | New, archived
Series | [rc,v2] iommu/arm-smmu-v3: Do not use GFP_KERNEL under a spinlock
On Thu, Feb 15, 2024 at 10:56:57AM -0400, Jason Gunthorpe wrote:
> If the SMMU is configured to use a two-level CD table then
> arm_smmu_write_ctx_desc() allocates a CD table leaf internally using
> GFP_KERNEL. Due to recent changes this is being done under a spinlock to
> iterate over the device list - thus it will trigger a sleeping while
> atomic warning:
>
>   arm_smmu_sva_set_dev_pasid()
>     mutex_lock(&sva_lock);
>     __arm_smmu_sva_bind()
>       arm_smmu_mmu_notifier_get()
>         spin_lock_irqsave()
>         arm_smmu_write_ctx_desc()
>           arm_smmu_get_cd_ptr()
>             arm_smmu_alloc_cd_leaf_table()
>               dmam_alloc_coherent(GFP_KERNEL)
>
> This is a 64K high-order allocation and really should not be done
> atomically.
>
> At the moment the rework of the SVA code to follow the new API is half
> finished. Recently the CD table memory was moved from the domain to the
> master; however, we have the confusing situation where the SVA code is
> wrongly using the RID domain's device list to track which CD tables the
> SVA is installed in.
>
> Remove the logic to replicate the CD across all the domain's masters
> during attach. We know which master and which CD table the PASID should
> be installed in.
>
> At the moment SVA is only invoked when dma-iommu.c is in control of the
> RID translation, which means we have a single iommu_domain shared across
> the entire group and that iommu_domain is not shared outside the group.
>
> For PCI cases the core code also insists on singleton groups, so there
> is only ever one entry in the smmu_domain->devices list and it is equal
> to the master being passed in to arm_smmu_sva_set_dev_pasid().
>
> Only non-PCI cases may have multi-device groups. However, the core code
> itself will replicate the calls to arm_smmu_sva_set_dev_pasid() across
> the entire group, so we will still correctly install the CD into each
> group member's master.

Are you sure about this paragraph? arm_smmu_mmu_notifier_get() will return
early if it finds an existing notifier in the 'mmu_notifiers' list for the
domain, so I don't think we'll actually get as far as installing the CD,
will we?

Will
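For context, the early return Will is pointing at is the lookup at the top
of arm_smmu_mmu_notifier_get(); paraphrased from the loop visible in the
diff further down this thread (not verbatim driver code), it is roughly:

	/* If a notifier for this mm already exists on the domain, reuse
	 * it and return before reaching the CD write further down - so a
	 * second group member binding the same mm never gets its CD
	 * installed. */
	list_for_each_entry(smmu_mn, &smmu_domain->mmu_notifiers, list) {
		if (smmu_mn->mn.mm == mm) {
			refcount_inc(&smmu_mn->refs);
			return smmu_mn;
		}
	}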
On Fri, Feb 16, 2024 at 12:05:12PM +0000, Will Deacon wrote:
> On Thu, Feb 15, 2024 at 10:56:57AM -0400, Jason Gunthorpe wrote:
> > [...]
> > Only non-PCI cases may have multi-device groups. However, the core code
> > itself will replicate the calls to arm_smmu_sva_set_dev_pasid() across
> > the entire group, so we will still correctly install the CD into each
> > group member's master.
>
> Are you sure about this paragraph? arm_smmu_mmu_notifier_get() will return
> early if it finds an existing notifier in the 'mmu_notifiers' list for the
> domain, so I don't think we'll actually get as far as installing the CD,
> will we?

I think the paragraph is the right analysis, the code just isn't
listening very well..

Lifting arm_smmu_write_ctx_desc() up into the caller will fix it.

Also, Michael should look at it (I recall we talked about this once)
and Nicolin should test it.

BTW, I have no idea if non-PCI cases exist; everyone I know is doing
PCI SVA.

Jason
On Fri, Feb 16, 2024 at 8:36 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Fri, Feb 16, 2024 at 12:05:12PM +0000, Will Deacon wrote:
> > [...]
> > Are you sure about this paragraph? arm_smmu_mmu_notifier_get() will
> > return early if it finds an existing notifier in the 'mmu_notifiers'
> > list for the domain, so I don't think we'll actually get as far as
> > installing the CD, will we?
>
> I think the paragraph is the right analysis, the code just isn't
> listening very well..
>
> Lifting arm_smmu_write_ctx_desc() up into the caller will fix it.

Calling arm_smmu_write_ctx_desc() requires the CD, which we get from the
mmu_notifiers list... which makes it a bit more complicated than that.
But it does sound doable with some work (perhaps keeping the call in
arm_smmu_mmu_notifier_get() for now and changing the early-return
logic?).

> Also, Michael should look at it (I recall we talked about this once)
> and Nicolin should test it.

Just to make sure I follow why we're pursuing this instead of the v1 rc
patch: in the non-PCI multi-device group scenario, the first call to
set_dev_pasid() would only have pre-allocated the CD table for the
current master, but arm_smmu_mmu_notifier_get() would then still call
arm_smmu_write_ctx_desc() for the other masters?
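The alternative Michael floats here (keeping the write in
arm_smmu_mmu_notifier_get() but changing the early-return logic) might
look something like this hypothetical, never-posted variation of the
lookup loop shown earlier:

	/* Hypothetical sketch only: reuse the existing notifier but
	 * still install the CD for *this* master before returning, so
	 * every group member that calls set_dev_pasid() gets its CD
	 * written. */
	list_for_each_entry(smmu_mn, &smmu_domain->mmu_notifiers, list) {
		if (smmu_mn->mn.mm == mm) {
			ret = arm_smmu_write_ctx_desc(master, pasid,
						      smmu_mn->cd);
			if (ret)
				return ERR_PTR(ret);
			refcount_inc(&smmu_mn->refs);
			return smmu_mn;
		}
	}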
On Sat, Feb 17, 2024 at 08:25:36PM +0800, Michael Shavit wrote:
> Calling arm_smmu_write_ctx_desc() requires the CD, which we get from the
> mmu_notifiers list... which makes it a bit more complicated than that.

@@ -404,9 +384,15 @@ static int __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
 		goto err_free_bond;
 	}
 
+	ret = arm_smmu_write_ctx_desc(master, pasid, bond->smmu_mn->cd);
+	if (ret)
+		goto err_put_notifier;
+

> > Also, Michael should look at it (I recall we talked about this once)
> > and Nicolin should test it.
>
> Just to make sure I follow why we're pursuing this instead of the v1 rc
> patch: [...]

Yes, though noting I have no idea if non-PCI exists. But it does make the
code clearer and removes even the possibility of a sleeping allocation
under the spinlock.

Jason
On Sat, Feb 17, 2024 at 9:24 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Sat, Feb 17, 2024 at 08:25:36PM +0800, Michael Shavit wrote:
> > Calling arm_smmu_write_ctx_desc() requires the CD, which we get from
> > the mmu_notifiers list... which makes it a bit more complicated than
> > that.
>
> @@ -404,9 +384,15 @@ static int __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
>  		goto err_free_bond;
>  	}
> 
> +	ret = arm_smmu_write_ctx_desc(master, pasid, bond->smmu_mn->cd);
> +	if (ret)
> +		goto err_put_notifier;
> +

Oh hey, that's not much more complicated :). I'm guessing that'll be a
call to arm_smmu_mmu_notifier_put() rather than a bare goto on the error
path?

Speaking of... the arm_smmu_update_ctx_desc_devices() call in
arm_smmu_mmu_notifier_put() might be tricky as well if it's encountered
on the error path before all devices (in the theoretical non-PCI case)
had a chance to previously call arm_smmu_write_ctx_desc().
On Mon, Feb 19, 2024 at 04:32:40PM +0800, Michael Shavit wrote:
> [...]
> Oh hey, that's not much more complicated :). I'm guessing that'll be a
> call to arm_smmu_mmu_notifier_put() rather than a bare goto on the error
> path?

Yes, the error unwind is the goto, which does the put and kfree(bond).

> Speaking of... the arm_smmu_update_ctx_desc_devices() call in
> arm_smmu_mmu_notifier_put() might be tricky as well if it's encountered
> on the error path before all devices (in the theoretical non-PCI case)
> had a chance to previously call arm_smmu_write_ctx_desc().

It is hard to understand, but I think it is close enough to OK. The put
will pair with the get, and if this is the first member of the group
then it will do a symmetric teardown. The

	if (!refcount_dec_and_test(&smmu_mn->refs))
		return;

takes care of that.

Otherwise the puts accumulate the refcount back down to zero, which
should be hit once __iommu_remove_group_pasid() has removed the PASID
from all the group members. IOW, the CDs are not cleaned up on any
device until all the group members have had the PASID removed.

Clearly it is not the correct design, but it looks like it works well
enough, even in the error paths. When it does eventually reach
arm_smmu_update_ctx_desc_devices(), it wipes the CDs of all the RID
domain's masters, which is the same set as the group membership. This
will end up happening before iommu_attach_device_pasid() returns on its
error path.

(Obviously this is all made to work logically and properly in part 2.)

Jason
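Putting the two answers together, the unwind being described would look
roughly like this sketch of the tail of __arm_smmu_sva_bind() (an
illustration based on the snippet upthread, not the exact final patch):

	ret = arm_smmu_write_ctx_desc(master, pasid, bond->smmu_mn->cd);
	if (ret)
		goto err_put_notifier;

	/* ... remainder of the bind path ... */
	return 0;

err_put_notifier:
	arm_smmu_mmu_notifier_put(bond->smmu_mn);	/* the goto is the put */
err_free_bond:
	kfree(bond);					/* ... and the kfree */
	return ret;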
On Sat, Feb 17, 2024 at 09:24:46AM -0400, Jason Gunthorpe wrote:
> On Sat, Feb 17, 2024 at 08:25:36PM +0800, Michael Shavit wrote:
> > Calling arm_smmu_write_ctx_desc() requires the CD, which we get from
> > the mmu_notifiers list... which makes it a bit more complicated than
> > that.
>
> @@ -404,9 +384,15 @@ static int __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
>  		goto err_free_bond;
>  	}
> 
> +	ret = arm_smmu_write_ctx_desc(master, pasid, bond->smmu_mn->cd);
> +	if (ret)
> +		goto err_put_notifier;
> +

Please can you spin a v3 with this folded in?

Will
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index 05722121f00e70..64ec3f0ed8ffe0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -288,14 +288,14 @@ static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
 
 /* Allocate or get existing MMU notifier for this {domain, mm} pair */
 static struct arm_smmu_mmu_notifier *
-arm_smmu_mmu_notifier_get(struct arm_smmu_domain *smmu_domain,
+arm_smmu_mmu_notifier_get(struct arm_smmu_master *master, ioasid_t pasid,
 			  struct mm_struct *mm)
 {
 	int ret;
-	unsigned long flags;
 	struct arm_smmu_ctx_desc *cd;
 	struct arm_smmu_mmu_notifier *smmu_mn;
-	struct arm_smmu_master *master;
+	struct iommu_domain *domain = iommu_get_domain_for_dev(master->dev);
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
 
 	list_for_each_entry(smmu_mn, &smmu_domain->mmu_notifiers, list) {
 		if (smmu_mn->mn.mm == mm) {
@@ -325,19 +325,7 @@ arm_smmu_mmu_notifier_get(struct arm_smmu_domain *smmu_domain,
 		goto err_free_cd;
 	}
 
-	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
-	list_for_each_entry(master, &smmu_domain->devices, domain_head) {
-		ret = arm_smmu_write_ctx_desc(master, mm_get_enqcmd_pasid(mm),
-					      cd);
-		if (ret) {
-			list_for_each_entry_from_reverse(
-				master, &smmu_domain->devices, domain_head)
-				arm_smmu_write_ctx_desc(
-					master, mm_get_enqcmd_pasid(mm), NULL);
-			break;
-		}
-	}
-	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+	ret = arm_smmu_write_ctx_desc(master, pasid, cd);
 	if (ret)
 		goto err_put_notifier;
 
@@ -381,13 +369,12 @@ static void arm_smmu_mmu_notifier_put(struct arm_smmu_mmu_notifier *smmu_mn)
 	arm_smmu_free_shared_cd(cd);
 }
 
-static int __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
+static int __arm_smmu_sva_bind(struct device *dev, ioasid_t pasid,
+			       struct mm_struct *mm)
 {
 	int ret;
 	struct arm_smmu_bond *bond;
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
-	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
 
 	if (!master || !master->sva_enabled)
 		return -ENODEV;
@@ -398,7 +385,7 @@ static int __arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm)
 	bond->mm = mm;
 
-	bond->smmu_mn = arm_smmu_mmu_notifier_get(smmu_domain, mm);
+	bond->smmu_mn = arm_smmu_mmu_notifier_get(master, pasid, mm);
 	if (IS_ERR(bond->smmu_mn)) {
 		ret = PTR_ERR(bond->smmu_mn);
 		goto err_free_bond;
 	}
@@ -590,7 +577,7 @@ static int arm_smmu_sva_set_dev_pasid(struct iommu_domain *domain,
 	struct mm_struct *mm = domain->mm;
 
 	mutex_lock(&sva_lock);
-	ret = __arm_smmu_sva_bind(dev, mm);
+	ret = __arm_smmu_sva_bind(dev, id, mm);
 	mutex_unlock(&sva_lock);
 
 	return ret;
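For orientation, after this v2 diff (plus the fold-in requested above)
the locking picture becomes roughly the following - a sketch in the same
call-chain notation as the commit message, not an exact trace of the
final v3:

	arm_smmu_sva_set_dev_pasid()
	  mutex_lock(&sva_lock);
	  __arm_smmu_sva_bind(dev, id, mm)
	    arm_smmu_mmu_notifier_get(master, pasid, mm)
	      arm_smmu_write_ctx_desc()              <- no spinlock held
	        arm_smmu_get_cd_ptr()
	          arm_smmu_alloc_cd_leaf_table()
	            dmam_alloc_coherent(GFP_KERNEL)  <- safe in mutex context

Note the design choice visible in the diff: arm_smmu_mmu_notifier_get()
now takes the master and PASID and derives the smmu_domain itself via
iommu_get_domain_for_dev(), rather than having the caller pass the domain
and then iterating over every master attached to it.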
If the SMMU is configured to use a two-level CD table then
arm_smmu_write_ctx_desc() allocates a CD table leaf internally using
GFP_KERNEL. Due to recent changes this is being done under a spinlock to
iterate over the device list - thus it will trigger a sleeping while
atomic warning:

  arm_smmu_sva_set_dev_pasid()
    mutex_lock(&sva_lock);
    __arm_smmu_sva_bind()
      arm_smmu_mmu_notifier_get()
        spin_lock_irqsave()
        arm_smmu_write_ctx_desc()
          arm_smmu_get_cd_ptr()
            arm_smmu_alloc_cd_leaf_table()
              dmam_alloc_coherent(GFP_KERNEL)

This is a 64K high-order allocation and really should not be done
atomically.

At the moment the rework of the SVA code to follow the new API is half
finished. Recently the CD table memory was moved from the domain to the
master; however, we have the confusing situation where the SVA code is
wrongly using the RID domain's device list to track which CD tables the
SVA is installed in.

Remove the logic to replicate the CD across all the domain's masters
during attach. We know which master and which CD table the PASID should
be installed in.

At the moment SVA is only invoked when dma-iommu.c is in control of the
RID translation, which means we have a single iommu_domain shared across
the entire group and that iommu_domain is not shared outside the group.

For PCI cases the core code also insists on singleton groups, so there
is only ever one entry in the smmu_domain->devices list and it is equal
to the master being passed in to arm_smmu_sva_set_dev_pasid().

Only non-PCI cases may have multi-device groups. However, the core code
itself will replicate the calls to arm_smmu_sva_set_dev_pasid() across
the entire group, so we will still correctly install the CD into each
group member's master.

Thus the loop here is fully redundant and can be removed. Removing the
loop allows arm_smmu_write_ctx_desc() to be called outside the spinlock,
where it is safe to use GFP_KERNEL.

Fixes: 24503148c545 ("iommu/arm-smmu-v3: Refactor write_ctx_desc")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/4e25d161-0cf8-4050-9aa3-dfa21cd63e56@moroto.mountain/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   | 29 +++++--------------
 1 file changed, 8 insertions(+), 21 deletions(-)

v2
 - rewritten

base-commit: c76c50cd425e574df5a071cd7e1805d0764aabff
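As a closing illustration of the bug class this patch removes, here is a
minimal, self-contained sketch with a hypothetical lock and buffer (not
driver code) showing why the original call chain trips the debug check:

	#include <linux/device.h>
	#include <linux/dma-mapping.h>
	#include <linux/sizes.h>
	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(example_lock);	/* hypothetical lock */

	/* A sleeping allocation in atomic context: with
	 * CONFIG_DEBUG_ATOMIC_SLEEP the allocator's might_sleep() check
	 * fires "BUG: sleeping function called from invalid context",
	 * exactly as in the trace above. */
	static void *broken_alloc_under_lock(struct device *dev,
					     dma_addr_t *dma)
	{
		unsigned long flags;
		void *buf;

		spin_lock_irqsave(&example_lock, flags);
		/* GFP_KERNEL may sleep; illegal while a spinlock is held */
		buf = dmam_alloc_coherent(dev, SZ_64K, dma, GFP_KERNEL);
		spin_unlock_irqrestore(&example_lock, flags);
		return buf;
	}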