Message ID | 1450859136-98482-4-git-send-email-quan.xu@intel.com (mailing list archive)
---|---
State | New, archived
> From: Xu, Quan
> Sent: Wednesday, December 23, 2015 4:26 PM
>
> Now if IOTLB/Context/IETC flush is timeout, panic hypervisor.
> The coming patch set will fix it.

again, please adjust comment to reflect what this patch is doing, i.e. only
about improvement on Device-TLB flush.

> If Device-TLB flush is timeout, we'll hide the target ATS
> device and crash the domain owning this ATS device.
>
> If impacted domain is hardware domain, just throw out a warning.
>
> The hided Device will be disallowed to be further assigned to
> any domain.

hided Device -> hidden device

> Signed-off-by: Quan Xu <quan.xu@intel.com>
> ---
>  xen/drivers/passthrough/pci.c         |  2 +-
>  xen/drivers/passthrough/vtd/extern.h  |  2 +
>  xen/drivers/passthrough/vtd/qinval.c  | 80 ++++++++++++++++++++++++++++++++++-
>  xen/drivers/passthrough/vtd/x86/ats.c | 13 ++++++
>  4 files changed, 94 insertions(+), 3 deletions(-)
>
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 27b3ca7..2d7dc59 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -407,7 +407,7 @@ static void _pci_hide_device(struct pci_dev *pdev)
>      list_add(&pdev->domain_list, &dom_xen->arch.pdev_list);
>  }
>
> -int __init pci_hide_device(int bus, int devfn)
> +int pci_hide_device(int bus, int devfn)
>  {
>      struct pci_dev *pdev;
>      int rc = -ENOMEM;
> diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
> index ec9c513..a2123ce 100644
> --- a/xen/drivers/passthrough/vtd/extern.h
> +++ b/xen/drivers/passthrough/vtd/extern.h
> @@ -58,6 +58,8 @@ int ats_device(const struct pci_dev *, const struct acpi_drhd_unit *);
>
>  int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
>                           u64 addr, unsigned int size_order, u64 type);
> +int dev_invalidate_iotlb_sync(struct iommu *iommu, u16 did,
> +                              u16 seg, u8 bus, u8 devfn);
>
>  int qinval_device_iotlb(struct iommu *iommu,
>                          u32 max_invs_pend, u16 sid, u16 size, u64 addr);
> diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
> index b227e4e..7330c5d 100644
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu *iommu,
>  static int invalidate_sync(struct iommu *iommu)
>  {
>      struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> +    int rc;
>
>      if ( qi_ctrl->qinval_maddr )
> -        return queue_invalidate_wait(iommu, 0, 1, 1);
> +    {
> +        rc = queue_invalidate_wait(iommu, 0, 1, 1);
> +        if ( rc )
> +        {
> +            if ( rc == -ETIMEDOUT )
> +                panic("Queue invalidate wait descriptor was timeout.\n");
> +            return rc;
> +        }
> +    }
> +

why do you still do panic here? w/ PATCH 1/3 you should return rc to caller
directly, right?

>      return 0;
>  }
>
> @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
>      return 0;
>  }
>
> +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> +                                         u16 seg, u8 bus, u8 devfn)
> +{
> +    struct domain *d;
> +    struct pci_dev *pdev;
> +
> +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> +    ASSERT(d);
> +    for_each_pdev(d, pdev)
> +    {
> +        if ( (pdev->seg == seg) &&
> +             (pdev->bus == bus) &&
> +             (pdev->devfn == devfn) )
> +        {
> +            if ( pdev->domain )
> +            {
> +                list_del(&pdev->domain_list);
> +                pdev->domain = NULL;
> +            }
> +
> +            if ( pci_hide_device(bus, devfn) )
> +            {
> +                printk(XENLOG_ERR
> +                       "IOMMU hide device %04x:%02x:%02x error.",
> +                       seg, bus, devfn);
> +                break;
> +            }

I'm not sure whether using pci_hide_device is enough or not break other code
path here? For example, now you have errors propagated to callers. What's the
final policy against flush error? If they will finally go to cleanup some
more state relying on pdev->domain, then that code path might be broken. If
it's fine to do cleanup at this point, then just hidden is not enough. If you
look at pci_remove_device, there's quite some state to be further cleaned
up...

I didn't think about the full logic thoroughly now. But it would always be
good to not hide device now until a point where all related states have been
cleaned up in error handling path chained up.

Thanks
Kevin
> On December 25 2015, 11:06 AM, <Tian, Kevin> wrote:
> > From: Xu, Quan
> > Sent: Wednesday, December 23, 2015 4:26 PM
> >
> > Now if IOTLB/Context/IETC flush is timeout, panic hypervisor.
> > The coming patch set will fix it.
>
> again, please adjust comment to reflect what this patch is doing, i.e. only
> about improvement on Device-TLB flush.
>

Agreed.

> > If Device-TLB flush is timeout, we'll hide the target ATS device and
> > crash the domain owning this ATS device.
> >
> > If impacted domain is hardware domain, just throw out a warning.
> >
> > The hided Device will be disallowed to be further assigned to any
> > domain.
>
> hided Device -> hidden device
>

Agreed.

[...]

> > diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
> > index b227e4e..7330c5d 100644
> > --- a/xen/drivers/passthrough/vtd/qinval.c
> > +++ b/xen/drivers/passthrough/vtd/qinval.c
> > @@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu *iommu,
> >  static int invalidate_sync(struct iommu *iommu)
> >  {
> >      struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> > +    int rc;
> >
> >      if ( qi_ctrl->qinval_maddr )
> > -        return queue_invalidate_wait(iommu, 0, 1, 1);
> > +    {
> > +        rc = queue_invalidate_wait(iommu, 0, 1, 1);
> > +        if ( rc )
> > +        {
> > +            if ( rc == -ETIMEDOUT )
> > +                panic("Queue invalidate wait descriptor was timeout.\n");
> > +            return rc;
> > +        }
> > +    }
> > +
>
> why do you still do panic here? w/ PATCH 1/3 you should return rc to caller
> directly, right?
>

This panic is original in queue_invalidate_wait(). Patch 1/3 is just for the
Device-TLB flush error (actually it covers the iotlb flush error). If I fix
the VT-d iotlb/context/iec flush errors to be propagated to callers, then I
can remove this panic and return rc to the propagated caller directly.

> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> > +                                         u16 seg, u8 bus, u8 devfn)
> > +{
> > +    struct domain *d;
> > +    struct pci_dev *pdev;
> > +
> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> > +    ASSERT(d);
> > +    for_each_pdev(d, pdev)
> > +    {
> > +        if ( (pdev->seg == seg) &&
> > +             (pdev->bus == bus) &&
> > +             (pdev->devfn == devfn) )
> > +        {
> > +            if ( pdev->domain )
> > +            {
> > +                list_del(&pdev->domain_list);
> > +                pdev->domain = NULL;
> > +            }
> > +
> > +            if ( pci_hide_device(bus, devfn) )
> > +            {
> > +                printk(XENLOG_ERR
> > +                       "IOMMU hide device %04x:%02x:%02x error.",
> > +                       seg, bus, devfn);
> > +                break;
> > +            }
>
> I'm not sure whether using pci_hide_device is enough or not break other code
> path here? For example, now you have errors propagated to callers.

More information: after calling pci_hide_device(),

    ..
    pdev->domain = dom_xen;
    list_add(&pdev->domain_list, &dom_xen->arch.pdev_list);
    ..

hidden PCI devices are associated with 'dom_xen'.

Now I found it may not clear the rmrr mapping / context mapping, i.e.
iommu_remove_device() -- [hd->platform_ops is NULL for dom_xen]. But I think
it is acceptable; IMO dom_xen is a part of the hypervisor, and these mappings
would not be a security issue. Or I can remove these mappings before hiding
this device. So I think it will not break other code paths.

> What's the final policy against flush error?
>

If the Device-TLB flush is timeout, we'll hide the target ATS device and
crash the domain owning this ATS device. If the impacted domain is the
hardware domain, just throw out a warning, instead of crashing the hardware
domain. The hidden device will be disallowed to be further assigned to any
domain.

> If they will finally go to cleanup some more state
> relying on pdev->domain, then that code path might be broken. If it's fine to do
> cleanup at this point, then just hidden is not enough. If you look at
> pci_remove_device, there's quite some state to be further cleaned up...
>

As pdev->domain is 'dom_xen', for this case it is still working. Correct me
if that is not right.

BTW, a quick question: in which case is pci_remove_device() called?

> I didn't think about the full logic thoroughly now. But it would always be good
> to not hide device now until a point where all related states have been cleaned
> up in error handling path chained up.
>

Quan
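[Editor's note: for readers following the thread, a condensed sketch of the
error policy Quan restates above; it merely rearranges the patch hunks quoted
earlier (hide the ATS device, crash a guest owner, warn for the hardware
domain) and is not the final committed code.]

    /* Sketch of the proposed Device-TLB timeout policy (not final code). */
    if ( pci_hide_device(bus, devfn) )      /* re-parents the pdev to dom_xen */
        printk(XENLOG_ERR "IOMMU hide device %04x:%02x:%02x error.\n",
               seg, bus, devfn);

    if ( !is_hardware_domain(d) )
        domain_crash(d);                    /* guest owners lose the device */
    else
        printk(XENLOG_WARNING VTDPREFIX "Device-TLB flush timed out.\n");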
>>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu *iommu,
>  static int invalidate_sync(struct iommu *iommu)
>  {
>      struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> +    int rc;
>
>      if ( qi_ctrl->qinval_maddr )
> -        return queue_invalidate_wait(iommu, 0, 1, 1);
> +    {
> +        rc = queue_invalidate_wait(iommu, 0, 1, 1);
> +        if ( rc )
> +        {
> +            if ( rc == -ETIMEDOUT )
> +                panic("Queue invalidate wait descriptor was timeout.\n");

Unless I'm overlooking something this doesn't replace an existing panic(),
and I think I had indicated quite clearly that this series shouldn't add any
new ones, unless the alternative of crashing the owning domain is completely
unfeasible.

> +            return rc;
> +        }
> +    }
> +
>      return 0;
>  }

Please avoid introducing multiple return points when this can be trivially
avoided.

> @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
>      return 0;
>  }
>
> +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> +                                         u16 seg, u8 bus, u8 devfn)
> +{
> +    struct domain *d;
> +    struct pci_dev *pdev;
> +
> +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> +    ASSERT(d);

Don't you need to acquire some lock in order to safely assert this?

Also note that unused slots hold zero, i.e. there's at least a theoretical
risk of problems here when you don't also look at iommu->domid_bitmap.

> +    for_each_pdev(d, pdev)
> +    {
> +        if ( (pdev->seg == seg) &&
> +             (pdev->bus == bus) &&
> +             (pdev->devfn == devfn) )
> +        {
> +            if ( pdev->domain )

If the device is on the domain's list, can this reasonably be false? I.e.
don't you rather mean ASSERT() here?

> +            {
> +                list_del(&pdev->domain_list);

This should happen under pcidevs_lock - you need to either acquire it or
ASSERT() it being held.

> +                pdev->domain = NULL;
> +            }
> +
> +            if ( pci_hide_device(bus, devfn) )
> +            {
> +                printk(XENLOG_ERR
> +                       "IOMMU hide device %04x:%02x:%02x error.",
> +                       seg, bus, devfn);
> +                break;
> +            }
> +
> +            break;
> +        }

Redundant breaks.

> +    }
> +
> +    if ( !is_hardware_domain(d) )
> +        domain_crash(d);

I wonder whether the device hiding shouldn't also be avoided if the device is
owned by hwdom.

> @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
>              queue_invalidate_iotlb(iommu,
>                                     type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
>                                     dw, did, size_order, 0, addr);
> +
> +            /*
> +             * Synchronize with hardware for invalidation request descriptors
> +             * submitted before Device-TLB invalidate descriptor.
> +             */
> +            rc = invalidate_sync(iommu);
> +            if ( rc )
> +                return rc;
> +
>              if ( flush_dev_iotlb )
>                  ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
> -            rc = invalidate_sync(iommu);
> +
>              if ( !ret )
>                  ret = rc;
>          }

This change is because of ...?

> --- a/xen/drivers/passthrough/vtd/x86/ats.c
> +++ b/xen/drivers/passthrough/vtd/x86/ats.c
> @@ -162,6 +162,19 @@ int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
>          return -EOPNOTSUPP;
>      }
>
> +    /*
> +     * Synchronize with hardware for Device-TLB invalidate
> +     * descriptor.
> +     */
> +    ret = dev_invalidate_iotlb_sync(iommu, did, pdev->seg,
> +                                    pdev->bus, pdev->devfn);
> +    if ( ret )
> +    {
> +        dprintk(XENLOG_WARNING VTDPREFIX,
> +                "Device-TLB flush timeout.\n");

Are you aware that dprintk() compiles to nothing in non-debug builds? Plus
logging the error code and device coordinates might turn out helpful in such
cases. But first of all - if there was a timeout, we'd make it here only for
Dom0. Perhaps the printing should move into an else to the domain_crash()?
And if there was another error, the message would be outright wrong.

Jan
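[Editor's note: as an illustration of the "single return point" style Jan
asks for, invalidate_sync() could be restructured as below; this is an
editorial sketch of the review comment, not the author's eventual v5 code.]

    static int invalidate_sync(struct iommu *iommu)
    {
        struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
        int rc = 0;

        /* Single exit point: any error (incl. -ETIMEDOUT) propagates. */
        if ( qi_ctrl->qinval_maddr )
            rc = queue_invalidate_wait(iommu, 0, 1, 1);

        return rc;
    }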
Jan, thanks for your comments.

> On January 15, 2016 at 9:10, <JBeulich@suse.com> wrote:
> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> > --- a/xen/drivers/passthrough/vtd/qinval.c
> > +++ b/xen/drivers/passthrough/vtd/qinval.c
> > @@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu *iommu,
> >  static int invalidate_sync(struct iommu *iommu)
> >  {
> >      struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> > +    int rc;
> >
> >      if ( qi_ctrl->qinval_maddr )
> > -        return queue_invalidate_wait(iommu, 0, 1, 1);
> > +    {
> > +        rc = queue_invalidate_wait(iommu, 0, 1, 1);
> > +        if ( rc )
> > +        {
> > +            if ( rc == -ETIMEDOUT )
> > +                panic("Queue invalidate wait descriptor was timeout.\n");
>
> Unless I'm overlooking something this doesn't replace an existing panic(), and I
> think I had indicated quite clearly that this series shouldn't add any new ones,
> unless the alternative of crashing the owning domain is completely unfeasible.
>

I will remove it in next v5.

[...]

> > @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
> >      return 0;
> >  }
> >
> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> > +                                         u16 seg, u8 bus, u8 devfn)
> > +{
> > +    struct domain *d;
> > +    struct pci_dev *pdev;
> > +
> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> > +    ASSERT(d);
>
> Don't you need to acquire some lock in order to safely assert this?

Referring to the other use cases of 'rcu_lock_domain_by_id()', Xen checks
whether the domain is 'NULL'. Could I also replace the 'ASSERT(d)' with

+    if ( d == NULL )
+        return;

?

> Also note that unused slots hold zero, i.e. there's at least a theoretical risk of
> problems here when you don't also look at iommu->domid_bitmap.
>

I am not clear how to fix this point. Do you have a good idea? Add a lock to
'iommu->domid_bitmap'?

> > +    for_each_pdev(d, pdev)
> > +    {
> > +        if ( (pdev->seg == seg) &&
> > +             (pdev->bus == bus) &&
> > +             (pdev->devfn == devfn) )
> > +        {
> > +            if ( pdev->domain )
>
> If the device is on the domain's list, can this reasonably be false?

I can't find the false case.

> I.e. don't you rather mean ASSERT() here?

I'd better replace 'if' with ASSERT() for this case.

> > +            {
> > +                list_del(&pdev->domain_list);
>
> This should happen under pcidevs_lock - you need to either acquire it or
> ASSERT() it being held.
>

I should check whether it is under pcidevs_lock -- with
spin_is_locked(&pcidevs_lock). If it is not under pcidevs_lock, I should
acquire it. I think ASSERT() is not a good idea. Hypervisor acquires this
lock and then removes the resource.

> > +    if ( !is_hardware_domain(d) )
> > +        domain_crash(d);
>
> I wonder whether the device hiding shouldn't also be avoided if the device is
> owned by hwdom.
>

I think it should be avoided if the device is owned by hwdom. Otherwise, it
is quite similar to panic..

> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
> >              queue_invalidate_iotlb(iommu,
> >                                     type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
> >                                     dw, did, size_order, 0, addr);
> > +
> > +            /*
> > +             * Synchronize with hardware for invalidation request descriptors
> > +             * submitted before Device-TLB invalidate descriptor.
> > +             */
> > +            rc = invalidate_sync(iommu);
> > +            if ( rc )
> > +                return rc;
> > +
> >              if ( flush_dev_iotlb )
> >                  ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
> > -            rc = invalidate_sync(iommu);
> > +
> >              if ( !ret )
> >                  ret = rc;
> >          }
>
> This change is because of ...?
>

As Kevin's comments, I add 'dev_invalidate_iotlb_sync()' to
dev_invalidate_iotlb(). Then I can know which device is flush timeout.

> > --- a/xen/drivers/passthrough/vtd/x86/ats.c
> > +++ b/xen/drivers/passthrough/vtd/x86/ats.c
> > @@ -162,6 +162,19 @@ int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
> >          return -EOPNOTSUPP;
> >      }
> >
> > +    /*
> > +     * Synchronize with hardware for Device-TLB invalidate
> > +     * descriptor.
> > +     */
> > +    ret = dev_invalidate_iotlb_sync(iommu, did, pdev->seg,
> > +                                    pdev->bus, pdev->devfn);
> > +    if ( ret )
> > +    {
> > +        dprintk(XENLOG_WARNING VTDPREFIX,
> > +                "Device-TLB flush timeout.\n");
>
> Are you aware that dprintk() compiles to nothing in non-debug builds?

To be honest, I was not aware.

> Plus logging the error code and device coordinates might turn out helpful in
> such cases. But first of all - if there was a timeout, we'd make it here only for
> Dom0. Perhaps the printing should move into an else to the domain_crash()?
> And if there was another error, the message would be outright wrong.
>

IMO, the print for Dom0 is enough. I think it is not a good idea to move it
into an else to domain_crash(); domain_crash() is quite a common part.
Anyway, I can improve it at low priority.

Jan, thanks..

-Quan
>>> On 20.01.16 at 11:26, <quan.xu@intel.com> wrote:
>> On January 15, 2016 at 9:10, <JBeulich@suse.com> wrote:
>> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
>> > @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
>> >      return 0;
>> >  }
>> >
>> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
>> > +                                         u16 seg, u8 bus, u8 devfn)
>> > +{
>> > +    struct domain *d;
>> > +    struct pci_dev *pdev;
>> > +
>> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
>> > +    ASSERT(d);
>>
>> Don't you need to acquire some lock in order to safely assert this?
>
> Referring to the other use cases of 'rcu_lock_domain_by_id()', Xen checks
> whether the domain is 'NULL'.
> Could I also replace the 'ASSERT(d)' with
> +    if ( d == NULL )
> +        return;
> ?

At a first glance this doesn't look right, but in the end that's something
you need to answer.

>> Also note that unused slots hold zero, i.e. there's at least a theoretical
>> risk of problems here when you don't also look at iommu->domid_bitmap.
>>
> I am not clear how to fix this point. Do you have a good idea?
> Add a lock to 'iommu->domid_bitmap'?

How would a lock help avoiding mistaking an unused slot to mean Dom0? As
already suggested - I think you simply need to consult the bitmap along with
the domain ID array.

>> > +            {
>> > +                list_del(&pdev->domain_list);
>>
>> This should happen under pcidevs_lock - you need to either acquire it or
>> ASSERT() it being held.
>>
>
> I should check whether it is under pcidevs_lock -- with
> spin_is_locked(&pcidevs_lock). If it is not under pcidevs_lock, I should
> acquire it. I think ASSERT() is not a good idea. Hypervisor acquires this
> lock and then removes the resource.

I don't understand this last sentence.

>> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
>> >              queue_invalidate_iotlb(iommu,
>> >                                     type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
>> >                                     dw, did, size_order, 0, addr);
>> > +
>> > +            /*
>> > +             * Synchronize with hardware for invalidation request descriptors
>> > +             * submitted before Device-TLB invalidate descriptor.
>> > +             */
>> > +            rc = invalidate_sync(iommu);
>> > +            if ( rc )
>> > +                return rc;
>> > +
>> >              if ( flush_dev_iotlb )
>> >                  ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
>> > -            rc = invalidate_sync(iommu);
>> > +
>> >              if ( !ret )
>> >                  ret = rc;
>> >          }
>>
>> This change is because of ...?
>>
> As Kevin's comments, I add 'dev_invalidate_iotlb_sync()' to
> dev_invalidate_iotlb(). Then I can know which device is flush timeout.

I don't understand: How is your reply related to you moving up the
invocation of invalidate_sync()?

>> Plus logging the error code and device coordinates might turn out helpful in
>> such cases. But first of all - if there was a timeout, we'd make it here only for
>> Dom0. Perhaps the printing should move into an else to the domain_crash()?
>> And if there was another error, the message would be outright wrong.
>>
> IMO, the print for Dom0 is enough. I think it is not a good idea to move it
> into an else to domain_crash(); domain_crash() is quite a common part.
> Anyway, I can improve it at low priority.

At the very least the message should not end up being actively misleading.

Jan
> On January 20, 2016 at 7:29 pm, <JBeulich@suse.com> wrote:
> >>> On 20.01.16 at 11:26, <quan.xu@intel.com> wrote:
> >> On January 15, 2016 at 9:10, <JBeulich@suse.com> wrote:
> >> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> >> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> >> > +                                         u16 seg, u8 bus, u8 devfn)
> >> > +{
> >> > +    struct domain *d;
> >> > +    struct pci_dev *pdev;
> >> > +
> >> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> >> > +    ASSERT(d);
> >>
> >> Don't you need to acquire some lock in order to safely assert this?
> >
> > Referring to the other use cases of 'rcu_lock_domain_by_id()', Xen
> > checks whether the domain is 'NULL'.
> > Could I also replace the 'ASSERT(d)' with
> > +    if ( d == NULL )
> > +        return;
> > ?
>
> At a first glance this doesn't look right, but in the end that's something
> you need to answer.
>

Is it quite similar to whether the domain has been destroyed while the
Device-TLB is flushing? Correct me if I still don't get your point..

> >> Also note that unused slots hold zero, i.e. there's at least a
> >> theoretical risk of problems here when you don't also look at
> >> iommu->domid_bitmap.
> >>
> > I am not clear how to fix this point. Do you have a good idea?
> > Add a lock to 'iommu->domid_bitmap'?
>
> How would a lock help avoiding mistaking an unused slot to mean Dom0? As
> already suggested - I think you simply need to consult the bitmap along
> with the domain ID array.
>

Try to get domain id with iommu->domid_map[did] ?

> >> > +            {
> >> > +                list_del(&pdev->domain_list);
> >>
> >> This should happen under pcidevs_lock - you need to either acquire it
> >> or ASSERT() it being held.
> >>
> >
> > I should check whether it is under pcidevs_lock -- with
> > spin_is_locked(&pcidevs_lock). If it is not under pcidevs_lock, I should
> > acquire it. I think ASSERT() is not a good idea. Hypervisor acquires
> > this lock and then removes the resource.
>
> I don't understand this last sentence.
>

For example, in:

pci_remove_device()
{
    ...
    spin_lock(&pcidevs_lock);
    ..
    iommu_remove_device()..
    ..
    spin_unlock(&pcidevs_lock);
}

The Device-TLB flush may report an error in iommu_remove_device(); then it
is under pcidevs_lock. In this case, I need to check whether it is under
pcidevs_lock. If not, I need to acquire the pcidevs_lock.

> >> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
> >> >              queue_invalidate_iotlb(iommu,
> >> >                                     type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
> >> >                                     dw, did, size_order, 0, addr);
> >> > +
> >> > +            /*
> >> > +             * Synchronize with hardware for invalidation request descriptors
> >> > +             * submitted before Device-TLB invalidate descriptor.
> >> > +             */
> >> > +            rc = invalidate_sync(iommu);
> >> > +            if ( rc )
> >> > +                return rc;
> >> > +
> >> >              if ( flush_dev_iotlb )
> >> >                  ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
> >> > -            rc = invalidate_sync(iommu);
> >> > +
> >> >              if ( !ret )
> >> >                  ret = rc;
> >> >          }
> >>
> >> This change is because of ...?
> >>
> > As Kevin's comments, I add 'dev_invalidate_iotlb_sync()' to
> > dev_invalidate_iotlb(). Then I can know which device is flush timeout.
>
> I don't understand: How is your reply related to you moving up the
> invocation of invalidate_sync()?

This invalidate_sync() is to synchronize with hardware for invalidation
request descriptors submitted before the Device-TLB invalidate descriptor.

If I don't add this invalidate_sync(), dev_invalidate_iotlb_timeout() cannot
tell whether the timeout is a Device-TLB flush error or one of the previous
iotlb/iec/context flush errors.

> >> Plus logging the error code and device coordinates might turn out
> >> helpful in such cases. But first of all - if there was a timeout,
> >> we'd make it here only for Dom0. Perhaps the printing should move into
> >> an else to the domain_crash()?
> >> And if there was another error, the message would be outright wrong.
> >>
> > IMO, the print for Dom0 is enough. I think it is not a good idea to
> > move it into an else to domain_crash(); domain_crash() is quite a
> > common part. Anyway, I can improve it at low priority.
>
> At the very least the message should not end up being actively misleading.
>

What about the below?

+    printk(XENLOG_ERR
+           "Device %04x:%02x:%02x is flush error.",
+           seg, bus, devfn);

Quan
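[Editor's note: the ordering Quan describes can be summarized as below; this
is a sketch of the flush_iotlb_qi() flow from the patch, annotated with the
rationale given above, not verbatim code.]

    /* Sketch of the reordered flush_iotlb_qi() flow (annotated). */
    queue_invalidate_iotlb(iommu, type >> DMA_TLB_FLUSH_GRANU_OFFSET,
                           dr, dw, did, size_order, 0, addr);

    /* Drain all earlier descriptors first: a timeout here is an
     * IOTLB/IEC/context flush problem, not a Device-TLB one. */
    rc = invalidate_sync(iommu);
    if ( rc )
        return rc;

    /* Only now issue the Device-TLB flush; a timeout from here on can be
     * attributed to the ATS device being invalidated. */
    if ( flush_dev_iotlb )
        ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);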
>>> On 21.01.16 at 17:16, <quan.xu@intel.com> wrote:
>> On January 20, 2016 at 7:29 pm, <JBeulich@suse.com> wrote:
>> >>> On 20.01.16 at 11:26, <quan.xu@intel.com> wrote:
>> >> On January 15, 2016 at 9:10, <JBeulich@suse.com> wrote:
>> >> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
>> >> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
>> >> > +    ASSERT(d);
>> >>
>> >> Don't you need to acquire some lock in order to safely assert this?
>> >
>> > Referring to the other use cases of 'rcu_lock_domain_by_id()', Xen
>> > checks whether the domain is 'NULL'.
>> > Could I also replace the 'ASSERT(d)' with
>> > +    if ( d == NULL )
>> > +        return;
>> > ?
>>
>> At a first glance this doesn't look right, but in the end that's
>> something you need to answer.
>>
>
> Is it quite similar to whether the domain has been destroyed while the
> Device-TLB is flushing?

Going back to the original question I raised, you need to ask yourself: Can
d validly be NULL when you get here (e.g. due to some other processor
changing something in parallel)? If yes, you can't ASSERT(), and the next
question then is what exactly it means that it got ripped off the table. The
answer to that determines whether simply returning would be okay.

>> >> Also note that unused slots hold zero, i.e. there's at least a
>> >> theoretical risk of problems here when you don't also look at
>> >> iommu->domid_bitmap.
>> >>
>> > I am not clear how to fix this point. Do you have a good idea?
>> > Add a lock to 'iommu->domid_bitmap'?
>>
>> How would a lock help avoiding mistaking an unused slot to mean Dom0? As
>> already suggested - I think you simply need to consult the bitmap along
>> with the domain ID array.
>>
>
> Try to get domain id with iommu->domid_map[did] ?

???

>> >> > +            {
>> >> > +                list_del(&pdev->domain_list);
>> >>
>> >> This should happen under pcidevs_lock - you need to either acquire it
>> >> or ASSERT() it being held.
>> >>
>> >
>> > I should check whether it is under pcidevs_lock -- with
>> > spin_is_locked(&pcidevs_lock). If it is not under pcidevs_lock, I
>> > should acquire it. I think ASSERT() is not a good idea. Hypervisor
>> > acquires this lock and then removes the resource.
>>
>> I don't understand this last sentence.
>>
> For example, in:
>
> pci_remove_device()
> {
>     ...
>     spin_lock(&pcidevs_lock);
>     ..
>     iommu_remove_device()..
>     ..
>     spin_unlock(&pcidevs_lock);
> }
>
> The Device-TLB flush may report an error in iommu_remove_device(); then it
> is under pcidevs_lock. In this case, I need to check whether it is under
> pcidevs_lock. If not, I need to acquire the pcidevs_lock.

Ah, okay. But you can't use spin_is_locked() for that purpose.

>> >> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
>> >> >              queue_invalidate_iotlb(iommu,
>> >> >                                     type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
>> >> >                                     dw, did, size_order, 0, addr);
>> >> > +
>> >> > +            /*
>> >> > +             * Synchronize with hardware for invalidation request descriptors
>> >> > +             * submitted before Device-TLB invalidate descriptor.
>> >> > +             */
>> >> > +            rc = invalidate_sync(iommu);
>> >> > +            if ( rc )
>> >> > +                return rc;
>> >> > +
>> >> >              if ( flush_dev_iotlb )
>> >> >                  ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
>> >> > -            rc = invalidate_sync(iommu);
>> >> > +
>> >> >              if ( !ret )
>> >> >                  ret = rc;
>> >> >          }
>> >>
>> >> This change is because of ...?
>> >>
>> > As Kevin's comments, I add 'dev_invalidate_iotlb_sync()' to
>> > dev_invalidate_iotlb(). Then I can know which device is flush timeout.
>>
>> I don't understand: How is your reply related to you moving up the
>> invocation of invalidate_sync()?
>
> This invalidate_sync() is to synchronize with hardware for invalidation
> request descriptors submitted before the Device-TLB invalidate descriptor.
>
> If I don't add this invalidate_sync(), dev_invalidate_iotlb_timeout()
> cannot tell whether the timeout is a Device-TLB flush error or one of the
> previous iotlb/iec/context flush errors.

This being anything but obvious, maybe a brief comment would help?

>> >> Plus logging the error code and device coordinates might turn out
>> >> helpful in such cases. But first of all - if there was a timeout,
>> >> we'd make it here only for Dom0. Perhaps the printing should move
>> >> into an else to the domain_crash()?
>> >> And if there was another error, the message would be outright wrong.
>> >>
>> > IMO, the print for Dom0 is enough. I think it is not a good idea to
>> > move it into an else to domain_crash(); domain_crash() is quite a
>> > common part. Anyway, I can improve it at low priority.
>>
>> At the very least the message should not end up being actively misleading.
>>
> What about the below?
> +    printk(XENLOG_ERR
> +           "Device %04x:%02x:%02x is flush error.",
> +           seg, bus, devfn);

Well, for one you once again omit the error code. And then (I think others
had pointed that out already) a device cannot be flush error. Perhaps you
mean "Flush error %d on device ...\n"? Plus you absolutely should print PCI
device coordinates in the canonical way, so that grep-ing a log file for
issues with a specific device will find everything related to that device.

Jan
> On January 22, 2016 at 12:31am, <JBeulich@suse.com> wrote:
> >>> On 21.01.16 at 17:16, <quan.xu@intel.com> wrote:
> >> On January 20, 2016 at 7:29 pm, <JBeulich@suse.com> wrote:
> >> >>> On 20.01.16 at 11:26, <quan.xu@intel.com> wrote:
> >> >> On January 15, 2016 at 9:10, <JBeulich@suse.com> wrote:
> >> >> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
> >> >> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> >> >> > +    ASSERT(d);
> >> >>
> >> >> Don't you need to acquire some lock in order to safely assert this?
> >> >
> >> > Referring to the other use cases of 'rcu_lock_domain_by_id()', Xen
> >> > checks whether the domain is 'NULL'.
> >> > Could I also replace the 'ASSERT(d)' with
> >> > +    if ( d == NULL )
> >> > +        return;
> >> > ?
> >>
> >> At a first glance this doesn't look right, but in the end that's
> >> something you need to answer.
> >>
> >
> > Is it quite similar to whether the domain has been destroyed while the
> > Device-TLB is flushing?
>
> Going back to the original question I raised, you need to ask yourself:
> Can d validly be NULL when you get here (e.g. due to some other processor
> changing something in parallel)?

I think that is YES. Not all of the callers acquire the RCU lock. Such as:

$ flush_iotlb_qi() -- iommu_flush_iotlb_psi() -- __intel_iommu_iotlb_flush()
  -- intel_iommu_iotlb_flush() -- iommu_iotlb_flush()
  -- xenmem_add_to_physmap() -- do_memory_op()

When the Device-TLB is flushing for the above call path and the domain is
destroyed in parallel, then d is NULL.

> If yes, you can't ASSERT(), and the next question then is what exactly it
> means that it got ripped off the table. The answer to that determines
> whether simply returning would be okay.
>

IMO, simply returning would be okay, but the device would not be hidden.

> >> >> Also note that unused slots hold zero, i.e. there's at least a
> >> >> theoretical risk of problems here when you don't also look at
> >> >> iommu->domid_bitmap.
> >> >>
> >> > I am not clear how to fix this point. Do you have a good idea?
> >> > Add a lock to 'iommu->domid_bitmap'?
> >>
> >> How would a lock help avoiding mistaking an unused slot to mean Dom0?
> >> As already suggested - I think you simply need to consult the bitmap
> >> along with the domain ID array.
> >>
> >
> > Try to get domain id with iommu->domid_map[did] ?
>
> ???
>

+    if ( iommu->domid_map )
+        d = rcu_lock_domain_by_id(iommu->domid_map[did]);

Is it right?

> >> >> > +            {
> >> >> > +                list_del(&pdev->domain_list);
> >> >>
> >> >> This should happen under pcidevs_lock - you need to either acquire
> >> >> it or ASSERT() it being held.
> >> >>
> >> >
> >> > I should check whether it is under pcidevs_lock -- with
> >> > spin_is_locked(&pcidevs_lock). If it is not under pcidevs_lock, I
> >> > should acquire it. I think ASSERT() is not a good idea. Hypervisor
> >> > acquires this lock and then removes the resource.
> >>
> >> I don't understand this last sentence.
> >>
> > For example, in:
> >
> > pci_remove_device()
> > {
> >     ...
> >     spin_lock(&pcidevs_lock);
> >     ..
> >     iommu_remove_device()..
> >     ..
> >     spin_unlock(&pcidevs_lock);
> > }
> >
> > The Device-TLB flush may report an error in iommu_remove_device(); then
> > it is under pcidevs_lock. In this case, I need to check whether it is
> > under pcidevs_lock. If not, I need to acquire the pcidevs_lock.
>
> Ah, okay. But you can't use spin_is_locked() for that purpose.
>

Why?

If I introduce a new parameter 'lock':

+    int lock = spin_is_locked(&pcidevs_lock);

+    if ( !lock )
+        spin_lock(&pcidevs_lock);
...
+    if ( !lock )
+        spin_unlock(&pcidevs_lock);

Is it right? Jan, do you have some better idea? I think ASSERT is not right.

> >> >> This change is because of ...?
> >> >>
> >> > As Kevin's comments, I add 'dev_invalidate_iotlb_sync()' to
> >> > dev_invalidate_iotlb(). Then I can know which device is flush timeout.
> >>
> >> I don't understand: How is your reply related to you moving up the
> >> invocation of invalidate_sync()?
> >
> > This invalidate_sync() is to synchronize with hardware for invalidation
> > request descriptors submitted before the Device-TLB invalidate
> > descriptor.
> >
> > If I don't add this invalidate_sync(), dev_invalidate_iotlb_timeout()
> > cannot tell whether the timeout is a Device-TLB flush error or one of
> > the previous iotlb/iec/context flush errors.
>
> This being anything but obvious, maybe a brief comment would help?
>

Change

"Synchronize with hardware for invalidation request descriptors submitted
before Device-TLB invalidate descriptor."

to

"Synchronize invalidation completions before Device-TLB invalidation."

?

> >> >> Plus logging the error code and device coordinates might turn out
> >> >> helpful in such cases. But first of all - if there was a timeout,
> >> >> we'd make it here only for Dom0. Perhaps the printing should move
> >> >> into an else to the domain_crash()?
> >> >> And if there was another error, the message would be outright wrong.
> >> >>
> >> > IMO, the print for Dom0 is enough. I think it is not a good idea to
> >> > move it into an else to domain_crash(); domain_crash() is quite a
> >> > common part. Anyway, I can improve it at low priority.
> >>
> >> At the very least the message should not end up being actively misleading.
> >>
> > What about the below?
> > +    printk(XENLOG_ERR
> > +           "Device %04x:%02x:%02x is flush error.",
> > +           seg, bus, devfn);
>
> Well, for one you once again omit the error code. And then (I think others
> had pointed that out already) a device cannot be flush error.
> Perhaps you mean "Flush error %d on device ...\n"? Plus you absolutely
> should print PCI device coordinates in the canonical way,

What does 'print PCI device coordinates in the canonical way' mean?
Such as "0000:00:02.1"?

> so that grep-ing a log file for
> issues with a specific device will find everything related to that device.

I appreciate your kindness.

Quan
>>> On 22.01.16 at 16:57, <quan.xu@intel.com> wrote:
>> On January 22, 2016 at 12:31am, <JBeulich@suse.com> wrote:
>> >>> On 21.01.16 at 17:16, <quan.xu@intel.com> wrote:
>> >> On January 20, 2016 at 7:29 pm, <JBeulich@suse.com> wrote:
>> >> >>> On 20.01.16 at 11:26, <quan.xu@intel.com> wrote:
>> >> >> On January 15, 2016 at 9:10, <JBeulich@suse.com> wrote:
>> >> >> >>> On 23.12.15 at 09:25, <quan.xu@intel.com> wrote:
>> >> >> Also note that unused slots hold zero, i.e. there's at least a
>> >> >> theoretical risk of problems here when you don't also look at
>> >> >> iommu->domid_bitmap.
>> >> >>
>> >> > I am not clear how to fix this point. Do you have a good idea?
>> >> > Add a lock to 'iommu->domid_bitmap'?
>> >>
>> >> How would a lock help avoiding mistaking an unused slot to mean Dom0?
>> >> As already suggested - I think you simply need to consult the bitmap
>> >> along with the domain ID array.
>> >>
>> >
>> > Try to get domain id with iommu->domid_map[did] ?
>>
>> ???
>>
> +    if ( iommu->domid_map )
> +        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
>
> Is it right?

I don't see what this changes. Again - what your code has been lacking so
far is some mechanism to guarantee that what you read from domid_map[] is
actually a valid domain ID. I can only once again point your attention to
domid_bitmap, which afaics gives you exactly that valid/invalid indication.

>> >> >> > +            {
>> >> >> > +                list_del(&pdev->domain_list);
>> >> >>
>> >> >> This should happen under pcidevs_lock - you need to either acquire
>> >> >> it or ASSERT() it being held.
>> >> >>
>> >> >
>> >> > I should check whether it is under pcidevs_lock -- with
>> >> > spin_is_locked(&pcidevs_lock). If it is not under pcidevs_lock, I
>> >> > should acquire it. I think ASSERT() is not a good idea. Hypervisor
>> >> > acquires this lock and then removes the resource.
>> >>
>> >> I don't understand this last sentence.
>> >>
>> > For example, in:
>> >
>> > pci_remove_device()
>> > {
>> >     ...
>> >     spin_lock(&pcidevs_lock);
>> >     ..
>> >     iommu_remove_device()..
>> >     ..
>> >     spin_unlock(&pcidevs_lock);
>> > }
>> >
>> > The Device-TLB flush may report an error in iommu_remove_device(); then
>> > it is under pcidevs_lock. In this case, I need to check whether it is
>> > under pcidevs_lock. If not, I need to acquire the pcidevs_lock.
>>
>> Ah, okay. But you can't use spin_is_locked() for that purpose.
>>
> Why?

Because spin_is_locked() returns true if _any CPU_ holds the lock, not just
when the CPU you're running on does.

> If I introduce a new parameter 'lock':
> +    int lock = spin_is_locked(&pcidevs_lock);
>
> +    if ( !lock )
> +        spin_lock(&pcidevs_lock);
> ...
> +    if ( !lock )
> +        spin_unlock(&pcidevs_lock);
>
> Is it right?
> Jan, do you have some better idea?

If indeed different locking state is possible for different call trees, then
I'm afraid you won't get around passing down flags to indicate whether the
needed lock is already being held.

> I think ASSERT is not right.

Indeed.

>> >> >> This change is because of ...?
>> >> >>
>> >> > As Kevin's comments, I add 'dev_invalidate_iotlb_sync()' to
>> >> > dev_invalidate_iotlb(). Then I can know which device is flush timeout.
>> >>
>> >> I don't understand: How is your reply related to you moving up the
>> >> invocation of invalidate_sync()?
>> >
>> > This invalidate_sync() is to synchronize with hardware for invalidation
>> > request descriptors submitted before the Device-TLB invalidate
>> > descriptor.
>> >
>> > If I don't add this invalidate_sync(), dev_invalidate_iotlb_timeout()
>> > cannot tell whether the timeout is a Device-TLB flush error or one of
>> > the previous iotlb/iec/context flush errors.
>>
>> This being anything but obvious, maybe a brief comment would help?
>>
>
> Change
> "Synchronize with hardware for invalidation request descriptors submitted
> before Device-TLB invalidate descriptor."
> to
> "Synchronize invalidation completions before Device-TLB invalidation."
> ?

I'd even further emphasize the ordering, by saying "Before Device-TLB
invalidation we need to ..." (or perhaps it's rather "... we'd like to ...").

>> >> >> Plus logging the error code and device coordinates might turn out
>> >> >> helpful in such cases. But first of all - if there was a timeout,
>> >> >> we'd make it here only for Dom0. Perhaps the printing should move
>> >> >> into an else to the domain_crash()?
>> >> >> And if there was another error, the message would be outright wrong.
>> >> >>
>> >> > IMO, the print for Dom0 is enough. I think it is not a good idea to
>> >> > move it into an else to domain_crash(); domain_crash() is quite a
>> >> > common part. Anyway, I can improve it at low priority.
>> >>
>> >> At the very least the message should not end up being actively misleading.
>> >>
>> > What about the below?
>> > +    printk(XENLOG_ERR
>> > +           "Device %04x:%02x:%02x is flush error.",
>> > +           seg, bus, devfn);
>>
>> Well, for one you once again omit the error code. And then (I think
>> others had pointed that out already) a device cannot be flush error.
>> Perhaps you mean "Flush error %d on device ...\n"? Plus you absolutely
>> should print PCI device coordinates in the canonical way,
>
> What does 'print PCI device coordinates in the canonical way' mean?
> Such as "0000:00:02.1"?

Yes. There are numerous examples throughout the tree.

Jan
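[Editor's note: two of Jan's points lend themselves to short sketches. First,
since spin_is_locked() only says that some CPU holds the lock, the caller
would pass down a flag instead; second, the "canonical" message prints the
error code and the full seg:bus:dev.fn coordinates. Both fragments below are
illustrative only; the bool parameter name is made up.]

    /* Hypothetical flag-passing pattern: the caller states whether it
     * already holds pcidevs_lock, instead of probing spin_is_locked(). */
    static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
                                             u16 seg, u8 bus, u8 devfn,
                                             bool pcidevs_locked)
    {
        if ( !pcidevs_locked )
            spin_lock(&pcidevs_lock);

        /* ... list_del()/pci_hide_device() work goes here ... */

        if ( !pcidevs_locked )
            spin_unlock(&pcidevs_lock);
    }

    /* Canonical device coordinates plus the error code in the message. */
    printk(XENLOG_ERR VTDPREFIX
           "Flush error %d on device %04x:%02x:%02x.%u\n",
           rc, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));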
> On January 25, 2016 at 10:09pm, <JBeulich@suse.com> wrote:
> >>> On 22.01.16 at 16:57, <quan.xu@intel.com> wrote:
> >> On January 22, 2016 at 12:31am, <JBeulich@suse.com> wrote:
> >> >>> On 21.01.16 at 17:16, <quan.xu@intel.com> wrote:
> >> >> >> Also note that unused slots hold zero, i.e. there's at least a
> >> >> >> theoretical risk of problems here when you don't also look at
> >> >> >> iommu->domid_bitmap.
> >> >> >>
> >> >> > I am not clear how to fix this point. Do you have a good idea?
> >> >> > Add a lock to 'iommu->domid_bitmap'?
> >> >>
> >> >> How would a lock help avoiding mistaking an unused slot to mean
> >> >> Dom0? As already suggested - I think you simply need to consult the
> >> >> bitmap along with the domain ID array.
> >> >>
> >> >
> >> > Try to get domain id with iommu->domid_map[did] ?
> >>
> >> ???
> >>
> > +    if ( iommu->domid_map )
> > +        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> >
> > Is it right?
>
> I don't see what this changes. Again - what your code has been lacking so
> far is some mechanism to guarantee that what you read from domid_map[] is
> actually a valid domain ID. I can only once again point your attention to
> domid_bitmap, which afaics gives you exactly that valid/invalid indication.
>

Jan, sorry, I was confused about this point as I didn't understand the
domid_bitmap. I printed out the iommu->domid_bitmap[did], which was tricky
to me. Now I get it. As you mentioned, I simply need to consult the bitmap
along with the domain ID array:

+    if ( test_bit(did, iommu->domid_bitmap) && iommu->domid_map[did] >= 0 )
+        d = rcu_lock_domain_by_id(iommu->domid_map[did]);

Is it right now?

> >> >> >> > +            {
> >> >> >> > +                list_del(&pdev->domain_list);
> >> >> >>
> >> >> >> This should happen under pcidevs_lock - you need to either
> >> >> >> acquire it or ASSERT() it being held.
> >> >> >>
> >> >> >
> >> >> > I should check whether it is under pcidevs_lock -- with
> >> >> > spin_is_locked(&pcidevs_lock). If it is not under pcidevs_lock, I
> >> >> > should acquire it. I think ASSERT() is not a good idea.
> >> >> > Hypervisor acquires this lock and then removes the resource.
> >> >>
> >> >> I don't understand this last sentence.
> >> >>
> >> > For example, in:
> >> >
> >> > pci_remove_device()
> >> > {
> >> >     ...
> >> >     spin_lock(&pcidevs_lock);
> >> >     ..
> >> >     iommu_remove_device()..
> >> >     ..
> >> >     spin_unlock(&pcidevs_lock);
> >> > }
> >> >
> >> > The Device-TLB flush may report an error in iommu_remove_device();
> >> > then it is under pcidevs_lock. In this case, I need to check whether
> >> > it is under pcidevs_lock. If not, I need to acquire the pcidevs_lock.
> >>
> >> Ah, okay. But you can't use spin_is_locked() for that purpose.
> >>
> > If I introduce a new parameter 'lock':
> > +    int lock = spin_is_locked(&pcidevs_lock);
> >
> > +    if ( !lock )
> > +        spin_lock(&pcidevs_lock);
> > ...
> > +    if ( !lock )
> > +        spin_unlock(&pcidevs_lock);
> >
> > Is it right?
> > Jan, do you have some better idea?
>
> If indeed different locking state is possible for different call trees,

Indeed different. For example, it is _not_under_ the lock for the following
call tree:

$ flush_iotlb_qi() -- iommu_flush_iotlb_psi() -- __intel_iommu_iotlb_flush()
  -- intel_iommu_iotlb_flush() -- iommu_iotlb_flush()
  -- xenmem_add_to_physmap() -- do_memory_op()

It is _under_ the lock for the following call tree:

$ flush_iotlb_qi() -- iommu_flush_iotlb_dsi() -- domain_context_unmap_one()
  -- domain_context_unmap() -- reassign_device_ownership()
  -- deassign_device() -- iommu_do_pci_domctl()

> then I'm afraid you won't get around passing down flags to indicate
> whether the needed lock is already being held.
>

Agreed. At first, I am open for any solution. pcidevs_lock is quite a big
lock. For this point, it looks much better to add a new flag to delay hiding
the device. I am also afraid that it may raise further security issues.

Jan, thanks!!

-Quan
>>> On 26.01.16 at 14:47, <quan.xu@intel.com> wrote:
> As you mentioned, I simply need to consult the bitmap along with the
> domain ID array.
>
> +    if ( test_bit(did, iommu->domid_bitmap) && iommu->domid_map[did] >= 0 )
> +        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
>
> Is it right now?

Mostly, except that I don't understand the >= 0 part.

> At first, I am open for any solution.
> pcidevs_lock is quite a big lock. For this point, it looks much better to
> add a new flag to delay hiding the device.
> I am also afraid that it may raise further security issues.

Well, I'd say just go and see which one turns out to be less cumbersome
and/or less intrusive.

Jan
> On January 26, 2016 at 10:00pm, <JBeulich@suse.com> wrote:
> >>> On 26.01.16 at 14:47, <quan.xu@intel.com> wrote:
> > As you mentioned, I simply need to consult the bitmap along with the
> > domain ID array.
> >
> > +    if ( test_bit(did, iommu->domid_bitmap) && iommu->domid_map[did] >= 0 )
> > +        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> >
> > Is it right now?
>
> Mostly, except that I don't understand the >= 0 part.
>

Domain IDs should be >= 0. If it is redundant, I can remove it.

> > At first, I am open for any solution.
> > pcidevs_lock is quite a big lock. For this point, it looks much better
> > to add a new flag to delay hiding the device.
> > I am also afraid that it may raise further security issues.
>
> Well, I'd say just go and see which one turns out to be less cumbersome
> and/or less intrusive.
>

For this lock, any good idea? IMO, I can get started adding a new flag to
delay hiding the device.

-Quan
>>> On 26.01.16 at 16:27, <quan.xu@intel.com> wrote:
>> On January 26, 2016 at 10:00pm, <JBeulich@suse.com> wrote:
>> >>> On 26.01.16 at 14:47, <quan.xu@intel.com> wrote:
>> > As you mentioned, I simply need to consult the bitmap along with the
>> > domain ID array.
>> >
>> > +    if ( test_bit(did, iommu->domid_bitmap) && iommu->domid_map[did] >= 0 )
>> > +        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
>> >
>> > Is it right now?
>>
>> Mostly, except that I don't understand the >= 0 part.
>>
> Domain IDs should be >= 0.
> If it is redundant, I can remove it.

Quan, please: You have the code, so you can check whether negative values
ever get stored there. And had you checked, you'd have found that domid_map
is declared as u16 *. The "u" here, as I hope you know, stands for
"unsigned". (Of course this really should be domid_t, but I'm sure the VT-d
maintainers won't care at all about such inconsistencies.)

>> > At first, I am open for any solution.
>> > pcidevs_lock is quite a big lock. For this point, it looks much better
>> > to add a new flag to delay hiding the device.
>> > I am also afraid that it may raise further security issues.
>>
>> Well, I'd say just go and see which one turns out to be less cumbersome
>> and/or less intrusive.
>>
> For this lock, any good idea?
> IMO, I can get started adding a new flag to delay hiding the device.

Once again: Before getting started, please assess which route is going to be
the better one. Remember that we had already discussed and put aside some
form of deferring the hiding of devices, so if you come back with a patch
doing that again, you'll have to be able to explain why the alternative(s)
are worse.

Jan
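[Editor's note: putting Jan's two corrections together (consult domid_bitmap,
and drop the meaningless >= 0 test on an unsigned u16 value), the lookup
would reduce to something like the sketch below.]

    /* Sketch: only trust domid_map[did] once the bitmap marks the slot
     * as in use; an unused slot holds zero and would mistakenly map to
     * Dom0 otherwise. */
    struct domain *d = NULL;

    if ( test_bit(did, iommu->domid_bitmap) )
        d = rcu_lock_domain_by_id(iommu->domid_map[did]);

    if ( d == NULL )
        return; /* stale or never-used slot - nothing to do */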
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, January 26, 2016 11:53 PM
>
> >> > At first, I am open for any solution.
> >> > pcidevs_lock is quite a big lock. For this point, it looks much better
> >> > to add a new flag to delay hiding the device.
> >> > I am also afraid that it may raise further security issues.
> >>
> >> Well, I'd say just go and see which one turns out to be less cumbersome
> >> and/or less intrusive.
> >>
> > For this lock, any good idea?
> > IMO, I can get started adding a new flag to delay hiding the device.
>
> Once again: Before getting started, please assess which route is
> going to be the better one. Remember that we had already
> discussed and put aside some form of deferring the hiding of
> devices, so if you come back with a patch doing that again, you'll
> have to be able to explain why the alternative(s) are worse.
>

Quan, could you list the pros/cons of those alternatives based on the
discussion so far? Then we can decide which way it should be done before you
go to actual coding. The earlier suggestion on hiding the device immediately
is under the assumption that all locks have been held. If this part becomes
too complex, and you can explain clearly that deferring the hiding action
doesn't lead to any race condition, then people can see why you are
proposing defer again.

Thanks
Kevin
> On January 27, 2016 at 6:48am, <Tian, Kevin> wrote:
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Tuesday, January 26, 2016 11:53 PM
> > Once again: Before getting started, please assess which route is going
> > to be the better one. Remember that we had already discussed and put
> > aside some form of deferring the hiding of devices, so if you come
> > back with a patch doing that again, you'll have to be able to explain
> > why the alternative(s) are worse.
> >
>
> Quan, could you list the pros/cons of those alternatives based on the
> discussion so far? Then we can decide which way it should be done before
> you go to actual coding. The earlier suggestion on hiding the device
> immediately is under the assumption that all locks have been held. If this
> part becomes too complex, and you can explain clearly that deferring the
> hiding action doesn't lead to any race condition, then people can see why
> you are proposing defer again.

The following are pros/cons of those alternatives. It is also why I propose
deferring again (see the sketch after this mail).

1. Hiding the devices immediately

   Pros:
   * It makes whatever changes are needed ASAP after the Device-TLB flush
     error.

   Cons:
   * It may break the code path.
   * It may lead to a race condition.
   * Hiding the devices immediately is under the assumption that all locks
     have been held. Different locking state is possible for different call
     trees. This part becomes too complex.

2. Deferring the hiding of devices

   Pros:
   * It doesn't lead to any race condition.
   * It lets existing calling chains complete with all required error
     handling.

   Cons:
   * It needs to maintain a flag.

How to defer the hiding of devices: when a Device-TLB flush error occurs,
set a new flag and then continue the existing calling chains with all
required error handling completed. Only at a safe place can we safely invoke
a pci_hide_device() variant to hide the device. We do a lazy hide until the
next time the device is to be assigned to another guest, when the new flag
is recognized.

Further explanation:

A new flag: in my previous email, I introduced a
"pci_dev->info.is_unassignable" flag. When a Device-TLB flush error occurs,
I think it is not a good idea to modify the pci_dev, as it is uncertain
whether pcidevs_lock is acquired. I'd better introduce a new iommu bitmap
array to register the device by BDF, quite similar to 'iommu->domid_bitmap'.

Safe place: I think the beginning of reassign_device_ownership() is a safe
place to hide a device registered with the new flag;
reassign_device_ownership() runs under the lock.

pci_hide_device() variant: remove the pcidevs_lock, as the pcidevs_lock has
already been acquired in the call tree of reassign_device_ownership().

Thanks!!!
-Quan
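[Editor's note: a sketch of the deferred-hiding idea Quan outlines above.
Every name here - the per-IOMMU failure bitmap 'dev_flush_failed' and its
use - is hypothetical; the point is only that the flush-error path records a
flag, and the hiding happens later under pcidevs_lock in
reassign_device_ownership().]

    /* On flush timeout (pcidevs_lock state unknown): just record the fact
     * in a hypothetical per-IOMMU bitmap, indexed by BDF. */
    set_bit(PCI_BDF2(bus, devfn), iommu->dev_flush_failed);

    /* Later, at the "safe place" (reassign_device_ownership(), which runs
     * with pcidevs_lock already held): do the actual hiding, lazily. */
    if ( test_bit(PCI_BDF2(pdev->bus, pdev->devfn), iommu->dev_flush_failed) )
    {
        _pci_hide_device(pdev);   /* variant not re-acquiring pcidevs_lock */
        return -EPERM;            /* refuse to assign the device to a guest */
    }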
>>> On 27.01.16 at 12:09, <quan.xu@intel.com> wrote:
>> On January 27, 2016 at 6:48am, <Tian, Kevin> wrote:
>> > From: Jan Beulich [mailto:JBeulich@suse.com]
>> > Sent: Tuesday, January 26, 2016 11:53 PM
>> > Once again: Before getting started, please assess which route is going
>> > to be the better one. Remember that we had already discussed and put
>> > aside some form of deferring the hiding of devices, so if you come
>> > back with a patch doing that again, you'll have to be able to explain
>> > why the alternative(s) are worse.
>> >
>>
>> Quan, could you list the pros/cons of those alternatives based on the
>> discussion so far? Then we can decide which way it should be done before
>> you go to actual coding. The earlier suggestion on hiding the device
>> immediately is under the assumption that all locks have been held. If
>> this part becomes too complex, and you can explain clearly that deferring
>> the hiding action doesn't lead to any race condition, then people can see
>> why you are proposing defer again.
>
> The following are pros/cons of those alternatives. It is also why I
> propose deferring again.
>
> 1. Hiding the devices immediately
>
>    Pros:
>    * It makes whatever changes are needed ASAP after the Device-TLB flush
>      error.
>
>    Cons:
>    * It may break the code path.
>    * It may lead to a race condition.
>    * Hiding the devices immediately is under the assumption that all locks
>      have been held. Different locking state is possible for different
>      call trees. This part becomes too complex.

So you just repeat what you've already said before. "This part becomes too
complex" you say without any kind of proof, yet that's what we need to
understand whether the alternative of doing the locking correctly really is
this bad (and I continue to not see why it would).

Jan
> On January 27, 2016 at 7:24pm, <JBeulich@suse.com> wrote:
> >>> On 27.01.16 at 12:09, <quan.xu@intel.com> wrote:
> >> On January 27, 2016 at 6:48am, <Tian, Kevin> wrote:
> >> > From: Jan Beulich [mailto:JBeulich@suse.com]
> >> > Sent: Tuesday, January 26, 2016 11:53 PM
> >> > Once again: Before getting started, please assess which route is
> >> > going to be the better one. Remember that we had already discussed
> >> > and put aside some form of deferring the hiding of devices, so if
> >> > you come back with a patch doing that again, you'll have to be able
> >> > to explain why the alternative(s) are worse.
> >> >
> >>
> >> Quan, could you list the pros/cons of those alternatives based on the
> >> discussion so far? Then we can decide which way it should be done
> >> before you go to actual coding. The earlier suggestion on hiding the
> >> device immediately is under the assumption that all locks have been
> >> held. If this part becomes too complex, and you can explain clearly
> >> that deferring the hiding action doesn't lead to any race condition,
> >> then people can see why you are proposing defer again.
> >
> > The following are pros/cons of those alternatives. It is also why I
> > propose deferring again.
> >
> > 1. Hiding the devices immediately
> >
> >    Pros:
> >    * It makes whatever changes are needed ASAP after the Device-TLB
> >      flush error.
> >
> >    Cons:
> >    * It may break the code path.
> >    * It may lead to a race condition.
> >    * Hiding the devices immediately is under the assumption that all
> >      locks have been held. Different locking state is possible for
> >      different call trees. This part becomes too complex.
>
> So you just repeat what you've already said before. "This part becomes too
> complex" you say without any kind of proof, yet that's what we need to
> understand whether the alternative of doing the locking correctly really
> is this bad (and I continue to not see why it would).

Take pcidevs_lock as an example:

1. As I mentioned, different locking state is indeed possible for different
call trees of the Device-TLB flush. When a Device-TLB flush error occurs, it
is a real challenge to judge whether to acquire the pcidevs_lock or not.

For example, it is _not_under_ the lock for the following call tree:

$ flush_iotlb_qi() -- iommu_flush_iotlb_psi() -- __intel_iommu_iotlb_flush()
  -- intel_iommu_iotlb_flush() -- iommu_iotlb_flush()
  -- xenmem_add_to_physmap() -- do_memory_op()

It is _under_ the lock for the following call tree:

$ flush_iotlb_qi() -- iommu_flush_iotlb_dsi() -- domain_context_unmap_one()
  -- domain_context_unmap() -- reassign_device_ownership()
  -- deassign_device() -- iommu_do_pci_domctl()

2. If I try to acquire the pcidevs_lock for some _not_under_ lock call tree,
it makes things worse. As the pcidevs_lock is a big lock, frequent memory
modification may block PCI device assignment due to the pcidevs_lock. If I
try to split the pcidevs_lock into smaller locks, it may take a great deal
of time to make it stable. Also I should consider the other locks..

-Quan
>>> On 27.01.16 at 13:38, <quan.xu@intel.com> wrote:
> [...]
> 2. If I try to acquire pcidevs_lock on a call tree that does not already
> hold it, it makes things worse: pcidevs_lock is a big lock, so frequent
> memory modification may then block PCI device assignment. And if I try
> to split pcidevs_lock into smaller locks instead, it may take a great
> deal of time to make that stable.

I don't understand this, namely in the context of my suggestion to simply
pass down a flag indicating whether the lock is being held (and hence
acquiring it only in the most narrow scope if not already owning it).

Jan
> On January 27, 2016 at 9:15pm, <JBeulich@suse.com> wrote:
> [...]
> I don't understand this, namely in the context of my suggestion to simply
> pass down a flag indicating whether the lock is being held (and hence
> acquiring it only in the most narrow scope if not already owning it).

This is also an idea. BTW, does the lock refer to pcidevs_lock?

-Quan
>>> On 27.01.16 at 15:13, <quan.xu@intel.com> wrote:
> [...]
> This is also an idea. BTW, does the lock refer to pcidevs_lock?

Yes, for now I assume that only that lock actually needs special
attention.

Jan
> On January 27, 2016 10:29pm, <JBeulich@suse.com> wrote:
> [...]
> Yes, for now I assume that only that lock actually needs special
> attention.

Thanks. I think so too.

-Quan
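For concreteness, here is a minimal sketch of the flag-passing approach
settled on above. All names (flush_and_handle_timeout(),
pcidevs_lock_acquire() and so on) are illustrative stand-ins, not the
real Xen interfaces; the point is only the shape of the locking.

/*
 * Sketch of the "pass down a lock-held flag" idea.  Everything here is
 * a mock-up: the real flush paths carry more state, and Xen's actual
 * lock primitives differ.
 */
#include <errno.h>
#include <stdbool.h>

/* Stand-ins for Xen's pcidevs_lock primitives. */
static void pcidevs_lock_acquire(void) { /* spin_lock(&pcidevs_lock) */ }
static void pcidevs_lock_release(void) { /* spin_unlock(&pcidevs_lock) */ }

static int do_device_tlb_flush(void) { return -ETIMEDOUT; /* simulate */ }
static void hide_ats_device(void)    { /* unlink pdev, give to dom_xen */ }

/*
 * lock_held is true on call trees such as deassign_device(), which
 * already own pcidevs_lock, and false on trees such as
 * xenmem_add_to_physmap(), which do not.
 */
static int flush_and_handle_timeout(bool lock_held)
{
    int rc = do_device_tlb_flush();

    if ( rc == -ETIMEDOUT )
    {
        if ( !lock_held )
            pcidevs_lock_acquire();   /* narrowest possible scope */

        hide_ats_device();

        if ( !lock_held )
            pcidevs_lock_release();
    }

    return rc;                        /* always propagated to the caller */
}

int main(void)
{
    /* A lock-free caller, e.g. the add-to-physmap path. */
    return flush_and_handle_timeout(false) == -ETIMEDOUT ? 0 : 1;
}

The design point is that the flag removes the ambiguity Quan describes:
callers that already own pcidevs_lock pass true, and lock-free paths let
the lock be taken only around the device-hiding itself.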
Now if an IOTLB/Context/IETC flush times out, we panic the hypervisor.
The coming patch set will fix that.

If a Device-TLB flush times out, we hide the target ATS device and
crash the domain owning this ATS device. If the impacted domain is the
hardware domain, we just print a warning instead.

The hidden device is disallowed from being assigned to any domain from
then on.

Signed-off-by: Quan Xu <quan.xu@intel.com>
---
 xen/drivers/passthrough/pci.c         |  2 +-
 xen/drivers/passthrough/vtd/extern.h  |  2 +
 xen/drivers/passthrough/vtd/qinval.c  | 80 ++++++++++++++++++++++++++++++++++-
 xen/drivers/passthrough/vtd/x86/ats.c | 13 ++++++
 4 files changed, 94 insertions(+), 3 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 27b3ca7..2d7dc59 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -407,7 +407,7 @@ static void _pci_hide_device(struct pci_dev *pdev)
     list_add(&pdev->domain_list, &dom_xen->arch.pdev_list);
 }
 
-int __init pci_hide_device(int bus, int devfn)
+int pci_hide_device(int bus, int devfn)
 {
     struct pci_dev *pdev;
     int rc = -ENOMEM;
diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index ec9c513..a2123ce 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -58,6 +58,8 @@ int ats_device(const struct pci_dev *, const struct acpi_drhd_unit *);
 
 int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
                          u64 addr, unsigned int size_order, u64 type);
+int dev_invalidate_iotlb_sync(struct iommu *iommu, u16 did,
+                              u16 seg, u8 bus, u8 devfn);
 
 int qinval_device_iotlb(struct iommu *iommu,
                         u32 max_invs_pend, u16 sid, u16 size, u64 addr);
diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index b227e4e..7330c5d 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu *iommu,
 static int invalidate_sync(struct iommu *iommu)
 {
     struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
+    int rc;
 
     if ( qi_ctrl->qinval_maddr )
-        return queue_invalidate_wait(iommu, 0, 1, 1);
+    {
+        rc = queue_invalidate_wait(iommu, 0, 1, 1);
+        if ( rc )
+        {
+            if ( rc == -ETIMEDOUT )
+                panic("Queue invalidate wait descriptor was timeout.\n");
+            return rc;
+        }
+    }
+
     return 0;
 }
 
@@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
     return 0;
 }
 
+static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
+                                         u16 seg, u8 bus, u8 devfn)
+{
+    struct domain *d;
+    struct pci_dev *pdev;
+
+    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
+    ASSERT(d);
+    for_each_pdev(d, pdev)
+    {
+        if ( (pdev->seg == seg) &&
+             (pdev->bus == bus) &&
+             (pdev->devfn == devfn) )
+        {
+            if ( pdev->domain )
+            {
+                list_del(&pdev->domain_list);
+                pdev->domain = NULL;
+            }
+
+            if ( pci_hide_device(bus, devfn) )
+            {
+                printk(XENLOG_ERR
+                       "IOMMU hide device %04x:%02x:%02x error.",
+                       seg, bus, devfn);
+                break;
+            }
+
+            break;
+        }
+    }
+
+    if ( !is_hardware_domain(d) )
+        domain_crash(d);
+    rcu_unlock_domain(d);
+}
+
+int dev_invalidate_iotlb_sync(struct iommu *iommu, u16 did,
+                              u16 seg, u8 bus, u8 devfn)
+{
+    struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
+    int rc;
+
+    if ( qi_ctrl->qinval_maddr )
+    {
+        rc = queue_invalidate_wait(iommu, 0, 1, 1);
+        if ( rc )
+        {
+            if ( rc == -ETIMEDOUT )
+                dev_invalidate_iotlb_timeout(iommu, did, seg, bus, devfn);
+            return rc;
+        }
+    }
+
+    return 0;
+}
+
 static void queue_invalidate_iec(struct iommu *iommu, u8 granu, u8 im, u16 iidx)
 {
     unsigned long flags;
@@ -349,9 +416,18 @@ static int flush_iotlb_qi(
         queue_invalidate_iotlb(iommu,
                                type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr, dw,
                                did, size_order, 0, addr);
+
+        /*
+         * Synchronize with hardware for invalidation request descriptors
+         * submitted before Device-TLB invalidate descriptor.
+         */
+        rc = invalidate_sync(iommu);
+        if ( rc )
+            return rc;
+
         if ( flush_dev_iotlb )
             ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
-        rc = invalidate_sync(iommu);
+
         if ( !ret )
             ret = rc;
     }
diff --git a/xen/drivers/passthrough/vtd/x86/ats.c b/xen/drivers/passthrough/vtd/x86/ats.c
index 7c797f6..6299f52 100644
--- a/xen/drivers/passthrough/vtd/x86/ats.c
+++ b/xen/drivers/passthrough/vtd/x86/ats.c
@@ -162,6 +162,19 @@ int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
             return -EOPNOTSUPP;
         }
 
+        /*
+         * Synchronize with hardware for Device-TLB invalidate
+         * descriptor.
+         */
+        ret = dev_invalidate_iotlb_sync(iommu, did, pdev->seg,
+                                        pdev->bus, pdev->devfn);
+        if ( ret )
+        {
+            dprintk(XENLOG_WARNING VTDPREFIX,
+                    "Device-TLB flush timeout.\n");
+            return ret;
+        }
+
         if ( !ret )
             ret = rc;
     }
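To make the patch's control flow easier to follow in isolation, here is
a compile-and-run sketch of the timeout path. The function names mirror
the patch, but all types and bodies are mock-ups for illustration, not
the real Xen interfaces.

/*
 * Mock-up of the Device-TLB flush timeout path: on -ETIMEDOUT the
 * offending ATS device is hidden and the owning domain is crashed,
 * unless it is the hardware domain, which only gets a warning.
 * Everything below is a stand-in, not Xen code.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct domain { bool is_hardware_domain; };

static int queue_invalidate_wait(void)
{
    return -ETIMEDOUT;          /* simulate a flush that never completes */
}

static void dev_invalidate_iotlb_timeout(struct domain *d)
{
    printf("hiding ATS device (no further assignment allowed)\n");

    if ( !d->is_hardware_domain )
        printf("domain_crash()\n");
    else
        printf("hardware domain: warning only\n");
}

static int dev_invalidate_iotlb_sync(struct domain *d)
{
    int rc = queue_invalidate_wait();

    if ( rc == -ETIMEDOUT )
        dev_invalidate_iotlb_timeout(d);

    return rc;                  /* the error is propagated either way */
}

int main(void)
{
    struct domain guest = { .is_hardware_domain = false };

    return dev_invalidate_iotlb_sync(&guest) == 0 ? 0 : 1;
}

Note how, unlike the invalidate_sync() hunk above, this path reports the
timeout to the caller rather than panicking, which is what lets the
device be hidden and the owning domain crashed instead.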