Message ID   | 20210715120844.636968-2-ming.lei@redhat.com (mailing list archive)
-------------|--------------------------------------------------------------------
State        | Not Applicable
Delegated to | Bjorn Helgaas
Series       | blk-mq: fix blk_mq_alloc_request_hctx
On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote:
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -569,6 +569,7 @@ struct device {
>  #ifdef CONFIG_DMA_OPS_BYPASS
>  	bool dma_ops_bypass : 1;
>  #endif
> +	bool irq_affinity_managed : 1;
>  };

No documentation for this new field?  :(
On Thu, Jul 15, 2021 at 02:40:55PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote:
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -569,6 +569,7 @@ struct device {
> >  #ifdef CONFIG_DMA_OPS_BYPASS
> >  	bool dma_ops_bypass : 1;
> >  #endif
> > +	bool irq_affinity_managed : 1;
> >  };
>
> No documentation for this new field?

OK, will fix it in next version.

Thanks,
Ming
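For reference, a minimal sketch of what the requested documentation might look like, following the kernel-doc block that already describes the other bitfields of struct device in include/linux/device.h. The wording is an assumption; no documentation text was posted in this thread:

 * @irq_affinity_managed: Set by the bus code (PCI/MSI, platform) when the
 *		device's interrupt vectors were allocated with kernel-managed
 *		affinity, so that consumers such as blk-mq know these IRQs are
 *		shut down once all CPUs in their affinity mask go offline.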
On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote: > irq vector allocation with managed affinity may be used by driver, and > blk-mq needs this info because managed irq will be shutdown when all > CPUs in the affinity mask are offline. > > The info of using managed irq is often produced by drivers(pci subsystem, Add space between "drivers" and "(". s/pci/PCI/ Does this "managed IRQ" (or "managed affinity", not sure what the correct terminology is here) have something to do with devm? > platform device, ...), and it is consumed by blk-mq, so different subsystems > are involved in this info flow Add period at end of sentence. > Address this issue by adding one field of .irq_affinity_managed into > 'struct device'. > > Suggested-by: Christoph Hellwig <hch@lst.de> > Signed-off-by: Ming Lei <ming.lei@redhat.com> > --- > drivers/base/platform.c | 7 +++++++ > drivers/pci/msi.c | 3 +++ > include/linux/device.h | 1 + > 3 files changed, 11 insertions(+) > > diff --git a/drivers/base/platform.c b/drivers/base/platform.c > index 8640578f45e9..d28cb91d5cf9 100644 > --- a/drivers/base/platform.c > +++ b/drivers/base/platform.c > @@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev, > ptr->irq[i], ret); > goto err_free_desc; > } > + > + /* > + * mark the device as irq affinity managed if any irq affinity > + * descriptor is managed > + */ > + if (desc[i].is_managed) > + dev->dev.irq_affinity_managed = true; > } > > devres_add(&dev->dev, ptr); > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > index 3d6db20d1b2b..7ddec90b711d 100644 > --- a/drivers/pci/msi.c > +++ b/drivers/pci/msi.c > @@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, > if (flags & PCI_IRQ_AFFINITY) { > if (!affd) > affd = &msi_default_affd; > + dev->dev.irq_affinity_managed = true; This is really opaque to me. I can't tell what the connection between PCI_IRQ_AFFINITY and irq_affinity_managed is. AFAICT the only place irq_affinity_managed is ultimately used is blk_mq_hctx_notify_offline(), and there's no obvious connection between that and this code. > } else { > if (WARN_ON(affd)) > affd = NULL; > @@ -1215,6 +1216,8 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, > return nvecs; > } > > + dev->dev.irq_affinity_managed = false; > + > /* use legacy IRQ if allowed */ > if (flags & PCI_IRQ_LEGACY) { > if (min_vecs == 1 && dev->irq) { > diff --git a/include/linux/device.h b/include/linux/device.h > index 59940f1744c1..9ec6e671279e 100644 > --- a/include/linux/device.h > +++ b/include/linux/device.h > @@ -569,6 +569,7 @@ struct device { > #ifdef CONFIG_DMA_OPS_BYPASS > bool dma_ops_bypass : 1; > #endif > + bool irq_affinity_managed : 1; > }; > > /** > -- > 2.31.1 > > > _______________________________________________ > Linux-nvme mailing list > Linux-nvme@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-nvme
On Fri, Jul 16, 2021 at 03:01:54PM -0500, Bjorn Helgaas wrote: > On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote: > > irq vector allocation with managed affinity may be used by driver, and > > blk-mq needs this info because managed irq will be shutdown when all > > CPUs in the affinity mask are offline. > > > > The info of using managed irq is often produced by drivers(pci subsystem, > > Add space between "drivers" and "(". > s/pci/PCI/ OK. > > Does this "managed IRQ" (or "managed affinity", not sure what the > correct terminology is here) have something to do with devm? > > > platform device, ...), and it is consumed by blk-mq, so different subsystems > > are involved in this info flow > > Add period at end of sentence. OK. > > > Address this issue by adding one field of .irq_affinity_managed into > > 'struct device'. > > > > Suggested-by: Christoph Hellwig <hch@lst.de> > > Signed-off-by: Ming Lei <ming.lei@redhat.com> > > --- > > drivers/base/platform.c | 7 +++++++ > > drivers/pci/msi.c | 3 +++ > > include/linux/device.h | 1 + > > 3 files changed, 11 insertions(+) > > > > diff --git a/drivers/base/platform.c b/drivers/base/platform.c > > index 8640578f45e9..d28cb91d5cf9 100644 > > --- a/drivers/base/platform.c > > +++ b/drivers/base/platform.c > > @@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev, > > ptr->irq[i], ret); > > goto err_free_desc; > > } > > + > > + /* > > + * mark the device as irq affinity managed if any irq affinity > > + * descriptor is managed > > + */ > > + if (desc[i].is_managed) > > + dev->dev.irq_affinity_managed = true; > > } > > > > devres_add(&dev->dev, ptr); > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > > index 3d6db20d1b2b..7ddec90b711d 100644 > > --- a/drivers/pci/msi.c > > +++ b/drivers/pci/msi.c > > @@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, > > if (flags & PCI_IRQ_AFFINITY) { > > if (!affd) > > affd = &msi_default_affd; > > + dev->dev.irq_affinity_managed = true; > > This is really opaque to me. I can't tell what the connection between > PCI_IRQ_AFFINITY and irq_affinity_managed is. Comment for PCI_IRQ_AFFINITY is 'Auto-assign affinity', 'irq_affinity_managed' basically means that irq's affinity is managed by kernel. What blk-mq needs is exactly if PCI_IRQ_AFFINITY is applied when allocating irq vectors. When PCI_IRQ_AFFINITY is used, genirq will shutdown the irq when all CPUs in the assigned affinity are offline, then blk-mq has to drain all in-flight IOs which will be completed via this irq and prevent new IO. That is the connection. Or you think 'irq_affinity_managed' isn't named well? > > AFAICT the only place irq_affinity_managed is ultimately used is > blk_mq_hctx_notify_offline(), and there's no obvious connection > between that and this code. I believe the connection is described in comment. Thanks, Ming
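To make the connection Ming describes concrete, here is a rough sketch of the consumer side in blk-mq. The hook name blk_mq_hctx_notify_offline() is taken from Bjorn's mail; the use_managed_irq flag matches the queue-map field shown later in this thread, while blk_mq_last_online_cpu_in_hctx() is a hypothetical helper invented purely for illustration, since the blk-mq patches of this series are not quoted here:

/*
 * Sketch only: when the last online CPU mapped to a hw queue goes away and
 * the queue's IRQ is kernel-managed, genirq will shut that IRQ down, so
 * blk-mq must stop queueing and drain in-flight requests first.
 */
static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx =
		hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);

	/* hctx needn't be deactivated when its IRQ is not kernel-managed */
	if (!hctx->queue->tag_set->map[HCTX_TYPE_DEFAULT].use_managed_irq)
		return 0;

	/*
	 * Otherwise mark the hctx inactive so no new requests land on it,
	 * then drain whatever is still in flight before the managed IRQ is
	 * shut down (the drain step is elided in this sketch).
	 */
	if (blk_mq_last_online_cpu_in_hctx(cpu, hctx))	/* hypothetical helper */
		set_bit(BLK_MQ_S_INACTIVE, &hctx->state);

	return 0;
}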
On 15/07/2021 13:08, Ming Lei wrote: > irq vector allocation with managed affinity may be used by driver, and > blk-mq needs this info because managed irq will be shutdown when all > CPUs in the affinity mask are offline. > > The info of using managed irq is often produced by drivers(pci subsystem, > platform device, ...), and it is consumed by blk-mq, so different subsystems > are involved in this info flow > > Address this issue by adding one field of .irq_affinity_managed into > 'struct device'. > > Suggested-by: Christoph Hellwig <hch@lst.de> > Signed-off-by: Ming Lei <ming.lei@redhat.com> Did you consider that for PCI device we effectively have this info already: bool dev_has_managed_msi_irq(struct device *dev) { struct msi_desc *desc; list_for_each_entry(desc, dev_to_msi_list(dev), list) { if (desc->affinity && desc->affinity->is_managed) return true; } return false; } Thanks, John > --- > drivers/base/platform.c | 7 +++++++ > drivers/pci/msi.c | 3 +++ > include/linux/device.h | 1 + > 3 files changed, 11 insertions(+) > > diff --git a/drivers/base/platform.c b/drivers/base/platform.c > index 8640578f45e9..d28cb91d5cf9 100644 > --- a/drivers/base/platform.c > +++ b/drivers/base/platform.c > @@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev, > ptr->irq[i], ret); > goto err_free_desc; > } > + > + /* > + * mark the device as irq affinity managed if any irq affinity > + * descriptor is managed > + */ > + if (desc[i].is_managed) > + dev->dev.irq_affinity_managed = true; > } > > devres_add(&dev->dev, ptr); > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > index 3d6db20d1b2b..7ddec90b711d 100644 > --- a/drivers/pci/msi.c > +++ b/drivers/pci/msi.c > @@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, > if (flags & PCI_IRQ_AFFINITY) { > if (!affd) > affd = &msi_default_affd; > + dev->dev.irq_affinity_managed = true; > } else { > if (WARN_ON(affd)) > affd = NULL; > @@ -1215,6 +1216,8 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, > return nvecs; > } > > + dev->dev.irq_affinity_managed = false; > + > /* use legacy IRQ if allowed */ > if (flags & PCI_IRQ_LEGACY) { > if (min_vecs == 1 && dev->irq) { > diff --git a/include/linux/device.h b/include/linux/device.h > index 59940f1744c1..9ec6e671279e 100644 > --- a/include/linux/device.h > +++ b/include/linux/device.h > @@ -569,6 +569,7 @@ struct device { > #ifdef CONFIG_DMA_OPS_BYPASS > bool dma_ops_bypass : 1; > #endif > + bool irq_affinity_managed : 1; > }; > > /** >
On Mon, Jul 19, 2021 at 08:51:22AM +0100, John Garry wrote:
>> Address this issue by adding one field of .irq_affinity_managed into
>> 'struct device'.
>>
>> Suggested-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> Did you consider that for PCI device we effectively have this info already:
>
> bool dev_has_managed_msi_irq(struct device *dev)
> {
> 	struct msi_desc *desc;
>
> 	list_for_each_entry(desc, dev_to_msi_list(dev), list) {
> 		if (desc->affinity && desc->affinity->is_managed)
> 			return true;
> 	}
>
> 	return false;

Just walking the list seems fine to me given that this is not a
performance criticial path.  But what are the locking implications?

Also does the above imply this won't work for your platform MSI case?
On 19/07/2021 10:44, Christoph Hellwig wrote:
> On Mon, Jul 19, 2021 at 08:51:22AM +0100, John Garry wrote:
>>> Address this issue by adding one field of .irq_affinity_managed into
>>> 'struct device'.
>>>
>>> Suggested-by: Christoph Hellwig <hch@lst.de>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>
>> Did you consider that for PCI device we effectively have this info already:
>>
>> bool dev_has_managed_msi_irq(struct device *dev)
>> {
>> 	struct msi_desc *desc;
>>
>> 	list_for_each_entry(desc, dev_to_msi_list(dev), list)

I just noticed for_each_msi_entry(), which is the same

>> 		if (desc->affinity && desc->affinity->is_managed)
>> 			return true;
>> 	}
>>
>> 	return false;
>
> Just walking the list seems fine to me given that this is not a
> performance criticial path.  But what are the locking implications?

Since it would be used for sequential setup code, I didn't think any
locking was required. But would need to consider where that function
lived and whether it's public.

>
> Also does the above imply this won't work for your platform MSI case?
> .
>

Right. I think that it may be possible to reach into the platform msi
descriptors to get this info, but I am not sure it's worth it. There is
only 1x user there and there is no generic .map_queues function, so
could set the flag directly:

int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
		for_each_cpu(cpu, mask)
			qmap->mq_map[cpu] = qmap->queue_offset + queue;
	}

+	qmap->use_managed_irq = dev_has_managed_msi_irq(&pdev->dev);
}

--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -3563,6 +3563,8 @@ static int map_queues_v2_hw(struct Scsi_Host *shost)
 		qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
 
+	qmap->use_managed_irq = 1;
+
 	return 0;
 }
On Mon, Jul 19, 2021 at 11:39:53AM +0100, John Garry wrote: > On 19/07/2021 10:44, Christoph Hellwig wrote: > > On Mon, Jul 19, 2021 at 08:51:22AM +0100, John Garry wrote: > > > > Address this issue by adding one field of .irq_affinity_managed into > > > > 'struct device'. > > > > > > > > Suggested-by: Christoph Hellwig <hch@lst.de> > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com> > > > > > > Did you consider that for PCI device we effectively have this info already: > > > > > > bool dev_has_managed_msi_irq(struct device *dev) > > > { > > > struct msi_desc *desc; > > > > > > list_for_each_entry(desc, dev_to_msi_list(dev), list) > > I just noticed for_each_msi_entry(), which is the same > > > > > if (desc->affinity && desc->affinity->is_managed) > > > return true; > > > } > > > > > > return false; > > > > Just walking the list seems fine to me given that this is not a > > performance criticial path. But what are the locking implications? > > Since it would be used for sequential setup code, I didn't think any locking > was required. But would need to consider where that function lived and > whether it's public. Yeah, the allocated irq vectors should be live when running map queues. > > > > > Also does the above imply this won't work for your platform MSI case? > > . > > > > Right. I think that it may be possible to reach into the platform msi > descriptors to get this info, but I am not sure it's worth it. There is only > 1x user there and there is no generic .map_queues function, so could set the > flag directly: > > int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev > *pdev, > for_each_cpu(cpu, mask) > qmap->mq_map[cpu] = qmap->queue_offset + queue; > } > + qmap->use_managed_irq = dev_has_managed_msi_irq(&pdev->dev); > } > > --- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c > +++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c > @@ -3563,6 +3563,8 @@ static int map_queues_v2_hw(struct Scsi_Host *shost) > qmap->mq_map[cpu] = qmap->queue_offset + queue; > } > > + qmap->use_managed_irq = 1; > + > return 0; virtio can be populated via platform device too, but managed irq affinity isn't used, so seems dev_has_managed_msi_irq() is fine. Thanks, Ming
On Sat, Jul 17, 2021 at 05:30:43PM +0800, Ming Lei wrote: > On Fri, Jul 16, 2021 at 03:01:54PM -0500, Bjorn Helgaas wrote: > > On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote: > > > irq vector allocation with managed affinity may be used by driver, and > > > blk-mq needs this info because managed irq will be shutdown when all > > > CPUs in the affinity mask are offline. > > > > > > The info of using managed irq is often produced by drivers(pci subsystem, > > > > Add space between "drivers" and "(". > > s/pci/PCI/ > > OK. > > > Does this "managed IRQ" (or "managed affinity", not sure what the > > correct terminology is here) have something to do with devm? > > > > > platform device, ...), and it is consumed by blk-mq, so different subsystems > > > are involved in this info flow > > > > Add period at end of sentence. > > OK. > > > > Address this issue by adding one field of .irq_affinity_managed into > > > 'struct device'. > > > > > > Suggested-by: Christoph Hellwig <hch@lst.de> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com> > > > --- > > > drivers/base/platform.c | 7 +++++++ > > > drivers/pci/msi.c | 3 +++ > > > include/linux/device.h | 1 + > > > 3 files changed, 11 insertions(+) > > > > > > diff --git a/drivers/base/platform.c b/drivers/base/platform.c > > > index 8640578f45e9..d28cb91d5cf9 100644 > > > --- a/drivers/base/platform.c > > > +++ b/drivers/base/platform.c > > > @@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev, > > > ptr->irq[i], ret); > > > goto err_free_desc; > > > } > > > + > > > + /* > > > + * mark the device as irq affinity managed if any irq affinity > > > + * descriptor is managed > > > + */ > > > + if (desc[i].is_managed) > > > + dev->dev.irq_affinity_managed = true; > > > } > > > > > > devres_add(&dev->dev, ptr); > > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c > > > index 3d6db20d1b2b..7ddec90b711d 100644 > > > --- a/drivers/pci/msi.c > > > +++ b/drivers/pci/msi.c > > > @@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs, > > > if (flags & PCI_IRQ_AFFINITY) { > > > if (!affd) > > > affd = &msi_default_affd; > > > + dev->dev.irq_affinity_managed = true; > > > > This is really opaque to me. I can't tell what the connection between > > PCI_IRQ_AFFINITY and irq_affinity_managed is. > > Comment for PCI_IRQ_AFFINITY is 'Auto-assign affinity', > 'irq_affinity_managed' basically means that irq's affinity is managed by > kernel. > > What blk-mq needs is exactly if PCI_IRQ_AFFINITY is applied when > allocating irq vectors. When PCI_IRQ_AFFINITY is used, genirq will > shutdown the irq when all CPUs in the assigned affinity are offline, > then blk-mq has to drain all in-flight IOs which will be completed > via this irq and prevent new IO. That is the connection. > > Or you think 'irq_affinity_managed' isn't named well? I've been looking at "devm" management where there is a concept of devm-managed IRQs, and you mentioned "managed IRQ" in the commit log. But IIUC this is a completely different sort of management. > > AFAICT the only place irq_affinity_managed is ultimately used is > > blk_mq_hctx_notify_offline(), and there's no obvious connection > > between that and this code. > > I believe the connection is described in comment. You mean the comment that says "hctx needn't to be deactivated in case managed irq isn't used"? Sorry, that really doesn't explain to me why pci_alloc_irq_vectors_affinity() needs to set irq_affinity_managed. 
There's just a lot of magic here that cannot be deduced by reading the code. Nit: s/needn't to be/needn't be/ in that comment. Or maybe even "Deactivate hctx only when managed IRQ is used"? I still have no idea what that means, but at least it's easier to read :) Or maybe this is actually "draining a queue" instead of "deactivating"? Bjorn
On Mon, Jul 19 2021 at 11:44, Christoph Hellwig wrote:
> On Mon, Jul 19, 2021 at 08:51:22AM +0100, John Garry wrote:
>>> Address this issue by adding one field of .irq_affinity_managed into
>>> 'struct device'.
>>>
>>> Suggested-by: Christoph Hellwig <hch@lst.de>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>
>> Did you consider that for PCI device we effectively have this info already:
>>
>> bool dev_has_managed_msi_irq(struct device *dev)
>> {
>> 	struct msi_desc *desc;
>>
>> 	list_for_each_entry(desc, dev_to_msi_list(dev), list) {
>> 		if (desc->affinity && desc->affinity->is_managed)
>> 			return true;
>> 	}
>>
>> 	return false;
>
> Just walking the list seems fine to me given that this is not a
> performance criticial path.  But what are the locking implications?

At the moment there are none because the list is initialized in the
setup path and never modified afterwards. Though that might change
sooner than later to fix the virtio wreckage vs. MSI-X.

> Also does the above imply this won't work for your platform MSI case?

The msi descriptors are attached to struct device and independent of
platform/PCI/whatever.

Thanks,

        tglx
On Wed, Jul 21, 2021 at 09:20:00AM +0200, Thomas Gleixner wrote: > > Just walking the list seems fine to me given that this is not a > > performance criticial path. But what are the locking implications? > > At the moment there are none because the list is initialized in the > setup path and never modified afterwards. Though that might change > sooner than later to fix the virtio wreckage vs. MSI-X. What is the issue there? Either way, if we keep the helper in the IRQ code it should be easy to spot for anyone adding the locking. > > Also does the above imply this won't work for your platform MSI case? > > The msi descriptors are attached to struct device and independent of > platform/PCI/whatever. That's what I assumed, but this text from John suggested there is something odd about the platform case: "Did you consider that for PCI .."
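Putting this sub-thread together, the helper John proposed would presumably live in the MSI core as Christoph suggests. A sketch, using the for_each_msi_entry() form John mentioned and relying on Thomas's point that the descriptor list is only written during setup (hence no locking taken here):

/* Sketch: would sit next to the MSI core, e.g. kernel/irq/msi.c. */
#include <linux/msi.h>

bool dev_has_managed_msi_irq(struct device *dev)
{
	struct msi_desc *desc;

	/* The list is populated at setup time and not modified afterwards. */
	for_each_msi_entry(desc, dev) {
		if (desc->affinity && desc->affinity->is_managed)
			return true;
	}

	return false;
}
EXPORT_SYMBOL_GPL(dev_has_managed_msi_irq);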
+ Marc On 21/07/2021 08:24, Christoph Hellwig wrote: > On Wed, Jul 21, 2021 at 09:20:00AM +0200, Thomas Gleixner wrote: >>> Just walking the list seems fine to me given that this is not a >>> performance criticial path. But what are the locking implications? >> At the moment there are none because the list is initialized in the >> setup path and never modified afterwards. Though that might change >> sooner than later to fix the virtio wreckage vs. MSI-X. > What is the issue there? Either way, if we keep the helper in the > IRQ code it should be easy to spot for anyone adding the locking. > >>> Also does the above imply this won't work for your platform MSI case? >> The msi descriptors are attached to struct device and independent of >> platform/PCI/whatever. > That's what I assumed, but this text from John suggested there is > something odd about the platform case: > > "Did you consider that for PCI .." > . For this special platform MSI case there is a secondary interrupt controller (called mbigen) which generates the MSI on behalf of the device, which I think the MSI belongs to (and not the device, itself). See "latter case" mentioned in commit 91f90daa4fb2. I think Marc and Thomas can explain this much better than I could. Anyway, as I mentioned earlier, I think that this specific problem is unique and can be solved without using a function which examines the struct device.msi_list . Thanks, John
On Wed, Jul 21 2021 at 09:24, Christoph Hellwig wrote: > On Wed, Jul 21, 2021 at 09:20:00AM +0200, Thomas Gleixner wrote: >> > Just walking the list seems fine to me given that this is not a >> > performance criticial path. But what are the locking implications? >> >> At the moment there are none because the list is initialized in the >> setup path and never modified afterwards. Though that might change >> sooner than later to fix the virtio wreckage vs. MSI-X. > > What is the issue there? Either way, if we keep the helper in the > IRQ code it should be easy to spot for anyone adding the locking. https://lore.kernel.org/r/87o8bxcuxv.ffs@nanos.tec.linutronix.de TLDR: virtio allocates ONE irq on msix_enable() and then when the guest actually unmasks another entry (e.g. request_irq()), it tears down the allocated one and set's up two. On the third one this repeats .... There are only two options: 1) allocate everything upfront, which is undesired 2) append entries, which might need locking, but I'm still trying to avoid that There is another problem vs. vector exhaustion which can't be fixed that way, but that's a different story. Thanks, tglx
On Wed, Jul 21 2021 at 10:44, John Garry wrote: > On 21/07/2021 08:24, Christoph Hellwig wrote: >> On Wed, Jul 21, 2021 at 09:20:00AM +0200, Thomas Gleixner wrote: >>>> Also does the above imply this won't work for your platform MSI case? >>> The msi descriptors are attached to struct device and independent of >>> platform/PCI/whatever. >> That's what I assumed, but this text from John suggested there is >> something odd about the platform case: >> >> "Did you consider that for PCI .." >> . > > For this special platform MSI case there is a secondary interrupt > controller (called mbigen) which generates the MSI on behalf of the > device, which I think the MSI belongs to (and not the device, itself). MBIGEN is a different story because it converts wired interrupts into MSI interrupts, IOW a MSI based interrupt pin extender. I might be wrong, but I seriously doubt that any multiqueue device which wants to use affinity managed interrupts is built on top of that. Thanks, tglx
On Wed, Jul 21, 2021 at 10:14:25PM +0200, Thomas Gleixner wrote: > https://lore.kernel.org/r/87o8bxcuxv.ffs@nanos.tec.linutronix.de > > TLDR: virtio allocates ONE irq on msix_enable() and then when the guest > actually unmasks another entry (e.g. request_irq()), it tears down the > allocated one and set's up two. On the third one this repeats .... > > There are only two options: > > 1) allocate everything upfront, which is undesired > 2) append entries, which might need locking, but I'm still trying to > avoid that > > There is another problem vs. vector exhaustion which can't be fixed that > way, but that's a different story. FTI, NVMe is similar. We need one IRQ to setup the admin queue, which is used to query/set how many I/O queues are supported. Just two steps though and not unbound.
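For readers unfamiliar with that pattern, a simplified sketch of the two-step flow Christoph describes follows. This is not the actual NVMe code, and query_nr_io_queues() is a hypothetical stand-in for the admin-queue command that reports the supported I/O queue count:

#include <linux/pci.h>

/*
 * Step 1: one bootstrap vector drives the admin queue.
 * Step 2: tear it down and reallocate with managed affinity for I/O queues,
 *         keeping one unmanaged pre-vector for the admin queue.
 */
static int probe_irqs(struct pci_dev *pdev)
{
	struct irq_affinity affd = { .pre_vectors = 1 };	/* admin queue */
	unsigned int nr_io_queues;
	int ret;

	ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
	if (ret < 0)
		return ret;

	nr_io_queues = query_nr_io_queues(pdev);	/* hypothetical helper */

	pci_free_irq_vectors(pdev);
	ret = pci_alloc_irq_vectors_affinity(pdev, 2, nr_io_queues + 1,
					     PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY,
					     &affd);
	return ret < 0 ? ret : 0;
}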
On Wed, Jul 21 2021 at 22:32, Christoph Hellwig wrote: > On Wed, Jul 21, 2021 at 10:14:25PM +0200, Thomas Gleixner wrote: >> https://lore.kernel.org/r/87o8bxcuxv.ffs@nanos.tec.linutronix.de >> >> TLDR: virtio allocates ONE irq on msix_enable() and then when the >> guest OOps, sorry that should have been VFIO not virtio. >> actually unmasks another entry (e.g. request_irq()), it tears down the >> allocated one and set's up two. On the third one this repeats .... >> >> There are only two options: >> >> 1) allocate everything upfront, which is undesired >> 2) append entries, which might need locking, but I'm still trying to >> avoid that >> >> There is another problem vs. vector exhaustion which can't be fixed that >> way, but that's a different story. > > FTI, NVMe is similar. We need one IRQ to setup the admin queue, > which is used to query/set how many I/O queues are supported. Just > two steps though and not unbound. That's fine because that's controlled by the driver consistently and it (hopefully) makes sure that the admin queue is quiesced before everything is torn down after the initial query. But that's not the case for VFIO. It tears down all in use interrupts and the guest driver is completely oblivious of that. Assume the following situation: 1) VM boots with 8 present CPUs and 16 possible CPUs 2) The passed through card (PF or VF) supports multiqueue and the driver uses managed interrupts which e.g. allocates one queue and one interrupt per possible CPU. Initial setup requests all the interrupts, but only the first 8 queue interrupts are unmasked and therefore reallocated by the host which works by some definition of works because the device is quiet at that point. 3) Host admin plugs the other 8 CPUs into the guest Onlining these CPUs in the guest will unmask the dormant managed queue interrupts and cause the host to allocate the remaining 8 per queue interrupts one by one thereby tearing down _all_ previously allocated ones and then allocating one more than before. Assume that while this goes on the guest has I/O running on the already online CPUs and their associated queues. Depending on the device this either will lose interrupts or reroute them to the legacy INTx which is not handled. This might in the best case result in a few "timedout" requests, but I managed it at least once to make the device go into lala land state, i.e. it did not recover. The above can be fixed by adding an 'append' mode to the MSI code. But that does not fix the overcommit issue where the host runs out of vector space. The result is simply that the guest does not know and just continues to work on device/queues which will never ever recieve an interrupt (again). I got educated that all of this is considered unlikely and my argument that the concept of unlikely simply does not exist at cloud scale got ignored. Sure, I know it's VIRT and therefore not subject to common sense. Thanks, tglx
On Thu, Jul 22, 2021 at 12:38:07AM +0200, Thomas Gleixner wrote: > That's fine because that's controlled by the driver consistently and it > (hopefully) makes sure that the admin queue is quiesced before > everything is torn down after the initial query. Yes, it is. > The above can be fixed by adding an 'append' mode to the MSI code. So IFF we get that append mode I think it would help to simplify drivers that have unmanaged pre and post vectors, and/or do the above proving. So instead of currently requesting a single unmanaged vector, do basic setup, tear it down, request N managed vectors with an unmanaged pre-vector we could just keep the unmanaged vector, never tear it down and just append the post vectors. In the long run this could remove the need to do the pre and post vectors entirely.
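To illustrate what such an append mode could buy, a purely hypothetical sketch; PCI_IRQ_APPEND does not exist, it only names the behaviour being discussed, and the point is that the bootstrap vector from the probe sketch above would stay alive while the managed I/O vectors are added behind it, removing the need for the early teardown and the unmanaged .pre_vectors slot:

#include <linux/pci.h>

#define PCI_IRQ_APPEND		(1U << 31)	/* hypothetical flag, not upstream */

static int add_io_vectors(struct pci_dev *pdev, unsigned int nr_io_queues)
{
	/* keep the already-allocated admin vector, append managed ones */
	return pci_alloc_irq_vectors_affinity(pdev, nr_io_queues, nr_io_queues,
					      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY |
					      PCI_IRQ_APPEND, NULL);
}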
On 21/07/2021 21:22, Thomas Gleixner wrote: >>> That's what I assumed, but this text from John suggested there is >>> something odd about the platform case: >>> >>> "Did you consider that for PCI .." >>> . >> For this special platform MSI case there is a secondary interrupt >> controller (called mbigen) which generates the MSI on behalf of the >> device, which I think the MSI belongs to (and not the device, itself). > MBIGEN is a different story because it converts wired interrupts into > MSI interrupts, IOW a MSI based interrupt pin extender. > > I might be wrong, but I seriously doubt that any multiqueue device which > wants to use affinity managed interrupts is built on top of that. Actually the SCSI device for which platform device managed interrupts support was added in commit e15f2fa959f2 uses mbigen. Thanks, John
diff --git a/drivers/base/platform.c b/drivers/base/platform.c
index 8640578f45e9..d28cb91d5cf9 100644
--- a/drivers/base/platform.c
+++ b/drivers/base/platform.c
@@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev,
 				ptr->irq[i], ret);
 			goto err_free_desc;
 		}
+
+		/*
+		 * mark the device as irq affinity managed if any irq affinity
+		 * descriptor is managed
+		 */
+		if (desc[i].is_managed)
+			dev->dev.irq_affinity_managed = true;
 	}
 
 	devres_add(&dev->dev, ptr);
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 3d6db20d1b2b..7ddec90b711d 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 	if (flags & PCI_IRQ_AFFINITY) {
 		if (!affd)
 			affd = &msi_default_affd;
+		dev->dev.irq_affinity_managed = true;
 	} else {
 		if (WARN_ON(affd))
 			affd = NULL;
@@ -1215,6 +1216,8 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 		return nvecs;
 	}
 
+	dev->dev.irq_affinity_managed = false;
+
 	/* use legacy IRQ if allowed */
 	if (flags & PCI_IRQ_LEGACY) {
 		if (min_vecs == 1 && dev->irq) {
diff --git a/include/linux/device.h b/include/linux/device.h
index 59940f1744c1..9ec6e671279e 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -569,6 +569,7 @@ struct device {
 #ifdef CONFIG_DMA_OPS_BYPASS
 	bool dma_ops_bypass : 1;
 #endif
+	bool irq_affinity_managed : 1;
 };
 
 /**
irq vector allocation with managed affinity may be used by driver, and
blk-mq needs this info because managed irq will be shutdown when all
CPUs in the affinity mask are offline.

The info of using managed irq is often produced by drivers(pci subsystem,
platform device, ...), and it is consumed by blk-mq, so different subsystems
are involved in this info flow

Address this issue by adding one field of .irq_affinity_managed into
'struct device'.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/base/platform.c | 7 +++++++
 drivers/pci/msi.c       | 3 +++
 include/linux/device.h  | 1 +
 3 files changed, 11 insertions(+)
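For completeness, a sketch of the platform-device path that ends up setting the new flag: a driver obtains managed-affinity interrupts via devm_platform_get_irqs_affinity() (the interface used by hisi_sas, per the discussion above), and the hunk in drivers/base/platform.c marks the device once any returned descriptor is managed. The driver name and the vector counts below are illustrative, not taken from any real driver:

#include <linux/interrupt.h>
#include <linux/platform_device.h>

static int example_probe(struct platform_device *pdev)
{
	struct irq_affinity affd = { 0 };	/* illustrative: no pre/post vectors */
	int nvec, *irqs;

	/* min/max of 2..16 vectors are made-up numbers for this sketch */
	nvec = devm_platform_get_irqs_affinity(pdev, &affd, 2, 16, &irqs);
	if (nvec < 0)
		return nvec;

	/*
	 * irqs[0..nvec-1] now hold managed-affinity interrupt numbers, and
	 * with this patch pdev->dev.irq_affinity_managed has been set.
	 */
	return 0;
}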