Message ID | 20200908231759.13336-1-decui@microsoft.com (mailing list archive)
---|---
State | Superseded, archived
Delegated to: | Lorenzo Pieralisi
Series | [v2] PCI: hv: Fix hibernation in case interrupts are not re-created
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Tuesday, September 8, 2020 4:18 PM
>
> Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X registers,
> and we must use hv_compose_msi_msg() to ask Hyper-V to create the IOMMU
> Interrupt Remapping Table Entries. This is not an issue for a lot of
> PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers), which
> destroy and re-create the interrupts across hibernation, so
> hv_compose_msi_msg() is called automatically. However, some other PCI
> device drivers (e.g. the Nvidia driver) may not destroy and re-create
> the interrupts across hibernation, so hv_pci_resume() has to call
> hv_compose_msi_msg(), otherwise the PCI device drivers can no longer
> receive MSI/MSI-X interrupts after hibernation.
>
> Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Reviewed-by: Jake Oshins <jakeo@microsoft.com>
>
> ---
>
> Changes in v2:
> Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
> Added Jake's Reviewed-by.
>
> drivers/pci/controller/pci-hyperv.c | 44 +++++++++++++++++++++++++++++
> 1 file changed, 44 insertions(+)

Hi Lorenzo, Bjorn,
Can you please take a look at this patch? I hope it still has a chance
to make 5.9. :-)

Thanks,
-- Dexuan
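For context, the "destroy and re-create" pattern the commit message refers
to looks roughly like the sketch below. It is hypothetical (struct mydev and
all the mydev_* names are invented, and real drivers such as NVMe do much
more); the point is only that freeing the vectors in the freeze (suspend)
callback and re-allocating them in restore (resume) makes the IRQ core
re-activate the MSI/MSI-X interrupts, which on Hyper-V re-runs
hv_compose_msi_msg() without any help from the controller driver:

#include <linux/interrupt.h>
#include <linux/pci.h>
#include <linux/pm.h>

struct mydev {
	struct pci_dev *pdev;
	int irq;
};

static irqreturn_t mydev_irq_handler(int irq, void *data)
{
	return IRQ_HANDLED;
}

static int mydev_freeze(struct device *dev)
{
	struct mydev *md = dev_get_drvdata(dev);

	free_irq(md->irq, md);		/* destroy the handler... */
	pci_free_irq_vectors(md->pdev);	/* ...and the MSI/MSI-X vectors */
	return 0;
}

static int mydev_restore(struct device *dev)
{
	struct mydev *md = dev_get_drvdata(dev);
	int ret;

	/*
	 * Re-creating the vectors re-runs the MSI setup path, so
	 * Hyper-V is asked to rebuild the remapping table entries.
	 */
	ret = pci_alloc_irq_vectors(md->pdev, 1, 1,
				    PCI_IRQ_MSIX | PCI_IRQ_MSI);
	if (ret < 0)
		return ret;

	md->irq = pci_irq_vector(md->pdev, 0);
	return request_irq(md->irq, mydev_irq_handler, 0, "mydev", md);
}

static const struct dev_pm_ops mydev_pm_ops = {
	.freeze	 = mydev_freeze,
	.restore = mydev_restore,
};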
[+MarcZ - this patch needs IRQ maintainers vetting]

On Tue, Sep 08, 2020 at 04:17:59PM -0700, Dexuan Cui wrote:
> Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X registers,
> and we must use hv_compose_msi_msg() to ask Hyper-V to create the IOMMU
> Interrupt Remapping Table Entries. This is not an issue for a lot of
> PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers), which
> destroy and re-create the interrupts across hibernation, so
> hv_compose_msi_msg() is called automatically. However, some other PCI
> device drivers (e.g. the Nvidia driver) may not destroy and re-create
> the interrupts across hibernation, so hv_pci_resume() has to call
> hv_compose_msi_msg(), otherwise the PCI device drivers can no longer
> receive MSI/MSI-X interrupts after hibernation.

This looks like driver bugs and I don't think the HV controller
driver is where you should fix them.

Regardless, this commit log does not provide the information that
it should.

> Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Reviewed-by: Jake Oshins <jakeo@microsoft.com>
>
> ---
>
> Changes in v2:
> Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
> Added Jake's Reviewed-by.
>
> drivers/pci/controller/pci-hyperv.c | 44 +++++++++++++++++++++++++++++
> 1 file changed, 44 insertions(+)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index fc4c3a15e570..dd21afb5d62b 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data)
> 	pbus = pdev->bus;
> 	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
>
> +	if (hbus->state == hv_pcibus_removing) {
> +		/*
> +		 * During hibernation, when a CPU is offlined, the kernel tries
> +		 * to move the interrupt to the remaining CPUs that haven't
> +		 * been offlined yet. In this case, the below hv_do_hypercall()
> +		 * always fails since the vmbus channel has been closed, so we
> +		 * should not call the hypercall, but we still need
> +		 * pci_msi_unmask_irq() to reset the mask bit in desc->masked:
> +		 * see cpu_disable_common() -> fixup_irqs() ->
> +		 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
> +		 */
> +		pci_msi_unmask_irq(data);

This is not appropriate - it looks like a plaster to paper over an
issue with the Hyper-V hibernation code sequence. Fix that issue
instead of papering over it here.

Thanks,
Lorenzo

> +		return;
> +	}
> +
> 	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
>
> 	params = &hbus->retarget_msi_interrupt_params;
> @@ -3372,6 +3387,33 @@ static int hv_pci_suspend(struct hv_device *hdev)
> 	return 0;
> }
>
> +static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> +{
> +	struct msi_desc *entry;
> +	struct irq_data *irq_data;
> +
> +	for_each_pci_msi_entry(entry, pdev) {
> +		irq_data = irq_get_irq_data(entry->irq);
> +		if (WARN_ON_ONCE(!irq_data))
> +			return -EINVAL;
> +
> +		hv_compose_msi_msg(irq_data, &entry->msg);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Upon resume, pci_restore_msi_state() -> ... -> __pci_write_msi_msg()
> + * re-writes the MSI/MSI-X registers, but since Hyper-V doesn't trap and
> + * emulate the accesses, we have to call hv_compose_msi_msg() to ask
> + * Hyper-V to re-create the IOMMU Interrupt Remapping Table Entries.
> + */
> +static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
> +{
> +	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
> +}
> +
> static int hv_pci_resume(struct hv_device *hdev)
> {
> 	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
> @@ -3405,6 +3447,8 @@ static int hv_pci_resume(struct hv_device *hdev)
>
> 	prepopulate_bars(hbus);
>
> +	hv_pci_restore_msi_state(hbus);
> +
> 	hbus->state = hv_pcibus_installed;
> 	return 0;
> out:
> --
> 2.19.1
>
> From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Sent: Monday, September 28, 2020 3:43 AM
>
> [+MarcZ - this patch needs IRQ maintainers vetting]

Sure. Hi MarcZ, please also review the patch. Thanks!

> On Tue, Sep 08, 2020 at 04:17:59PM -0700, Dexuan Cui wrote:
> > Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X
> > registers, and we must use hv_compose_msi_msg() to ask Hyper-V to
> > create the IOMMU Interrupt Remapping Table Entries. This is not an issue
> > for a lot of PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers),
> > which destroy and re-create the interrupts across hibernation, so
> > hv_compose_msi_msg() is called automatically. However, some other
> > PCI device drivers (e.g. the Nvidia driver) may not destroy and re-create
> > the interrupts across hibernation, so hv_pci_resume() has to call
> > hv_compose_msi_msg(), otherwise the PCI device drivers can no longer
> > receive MSI/MSI-X interrupts after hibernation.
>
> This looks like driver bugs and I don't think the HV controller
> driver is where you should fix them.

IMHO this is not a PCI device driver bug, because I think a PCI device driver
is allowed to keep and re-use the MSI/MSI-X interrupts across hibernation,
otherwise we would not have pci_restore_msi_state() in pci_restore_state().

The in-tree open-source Nvidia GPU driver drivers/gpu/drm/nouveau is such
a PCI device driver that re-uses the MSI-X interrupts across hibernation.
The out-of-tree proprietary Nvidia GPU driver also does the same thing.
It looks like some other in-tree PCI device drivers also do the same thing,
though I don't remember their names offhand.

IMO it's much better to change the pci-hyperv driver once and for all, than
to change every such existing (and future?) PCI device driver.

pci_restore_msi_state() directly writes the MSI/MSI-X related registers
in __pci_write_msi_msg() and msix_mask_irq(). On a physical machine, this
works perfectly, but for a Linux VM running on a hypervisor, which typically
enables IOMMU interrupt remapping, the hypervisor usually should trap and
emulate the write accesses to the MSI/MSI-X registers, so the hypervisor
is able to create the necessary interrupt remapping table entries in the
IOMMU, and the MSI/MSI-X interrupts can work in the VM. Hyper-V is different
from other hypervisors in that it does not trap and emulate the write
accesses, and instead it uses a para-virtualized method, which requires
the VM to call hv_compose_msi_msg() to notify the hypervisor of the info
that would be passed to the hypervisor in the case of the trap-and-emulate
method.

I mean this is a Hyper-V specific problem, so IMO we should fix the pci-hyperv
driver rather than change the PCI device drivers, which work perfectly on a
physical machine and on other hypervisors. Also it can be difficult or
impossible to ask the authors of the aforementioned PCI device drivers to
destroy and re-create MSI/MSI-X across hibernation, especially for the
out-of-tree driver(s).

> Regardless, this commit log does not provide the information that
> it should.

Hi Lozenzo, I'm happy to add more info. Can you please let me know
what extra info I should provide?

> > Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > Reviewed-by: Jake Oshins <jakeo@microsoft.com>
> >
> > ---
> >
> > Changes in v2:
> > Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
> > Added Jake's Reviewed-by.
> >
> > drivers/pci/controller/pci-hyperv.c | 44 +++++++++++++++++++++++++++++
> > 1 file changed, 44 insertions(+)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > index fc4c3a15e570..dd21afb5d62b 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data)
> > 	pbus = pdev->bus;
> > 	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> >
> > +	if (hbus->state == hv_pcibus_removing) {
> > +		/*
> > +		 * During hibernation, when a CPU is offlined, the kernel tries
> > +		 * to move the interrupt to the remaining CPUs that haven't
> > +		 * been offlined yet. In this case, the below hv_do_hypercall()
> > +		 * always fails since the vmbus channel has been closed, so we
> > +		 * should not call the hypercall, but we still need
> > +		 * pci_msi_unmask_irq() to reset the mask bit in desc->masked:
> > +		 * see cpu_disable_common() -> fixup_irqs() ->
> > +		 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
> > +		 */
> > +		pci_msi_unmask_irq(data);
>
> This is not appropriate - it looks like a plaster to paper over an
> issue with the Hyper-V hibernation code sequence. Fix that issue
> instead of papering over it here.
>
> Thanks,
> Lorenzo

IMO this patch is fixing this Hyper-V specific problem. :-)
The problem is unique to Hyper-V because chip->irq_unmask() may fail in a
Linux VM running on Hyper-V.

chip->irq_unmask() cannot fail on a physical machine. I guess this is why
the return value of irq_unmask() is defined as "void" in include/linux/irq.h:

struct irq_chip {
	...
	void (*irq_mask)(struct irq_data *data);
	void (*irq_unmask)(struct irq_data *data);

As I described in the comment, in a VM on Hyper-V, chip->irq_unmask()
fails during the suspending phase of hibernation because it's called
when the non-boot CPUs are being offlined, and at this time all the devices,
including Hyper-V VMBus devices, have been "frozen" -- this is part of
the standard Linux hibernation workflow. Since Hyper-V thinks the VM has
frozen the pci-hyperv VMBus device at this moment (i.e. closed the VMBus
channel of the VMBus device), it fails chip->irq_unmask(), i.e.
hv_irq_unmask() -> hv_do_hypercall().

On a physical machine, unmasking an MSI/MSI-X register just means an MMIO
write, which I think cannot fail here.

So I think this patch is the correct fix, considering Hyper-V's unique
implementation of the MSI "chip" (i.e. Hyper-V does not trap and emulate
the MSI/MSI-X register accesses, and uses a para-virtualized method as I
explained above), and the fact that I shouldn't and can't change the
standard Linux hibernation workflow.

In hv_irq_unmask(), when I skip the hypercall in the case of
hbus->state == hv_pcibus_removing, I still need the pci_msi_unmask_irq(),
because of the sequences in kernel/irq/cpuhotplug.c:

static bool migrate_one_irq(struct irq_desc *desc)
{
	...
	if (maskchip && chip->irq_mask)
		chip->irq_mask(d);
	...
	err = irq_do_set_affinity(d, affinity, false);
	...
	if (maskchip && chip->irq_unmask)
		chip->irq_unmask(d);

Here if hv_irq_unmask() does not call pci_msi_unmask_irq(), the desc->masked
remains "true", so later after hibernation, the MSI interrupt line always
remains masked, which is incorrect.

Here the silent failure of hv_irq_unmask() does not matter since all the
non-boot CPUs are being offlined (meaning all the devices have been
frozen). Note: the correct affinity info is still updated into the
irq_data structure in migrate_one_irq() -> irq_do_set_affinity() ->
hv_set_affinity(), so when the VM resumes, hv_pci_resume() ->
hv_pci_restore_msi_state() is able to correctly restore the irqs with
the correct affinity.

I hope the explanation can help clarify things. I understand this is
not as natural as the case where Linux runs on a physical machine, but
due to the unique PCI pass-through implementation of Hyper-V, IMO this
is the only viable fix for the problem here. BTW, this patch is only
confined to the pci-hyperv driver and I believe it cannot cause any
regression.

Thanks,
-- Dexuan

> > +		return;
> > +	}
> > +
> > 	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
> >
> > 	params = &hbus->retarget_msi_interrupt_params;
> > @@ -3372,6 +3387,33 @@ static int hv_pci_suspend(struct hv_device *hdev)
> > 	return 0;
> > }
> >
> > +static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> > +{
> > +	struct msi_desc *entry;
> > +	struct irq_data *irq_data;
> > +
> > +	for_each_pci_msi_entry(entry, pdev) {
> > +		irq_data = irq_get_irq_data(entry->irq);
> > +		if (WARN_ON_ONCE(!irq_data))
> > +			return -EINVAL;
> > +
> > +		hv_compose_msi_msg(irq_data, &entry->msg);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Upon resume, pci_restore_msi_state() -> ... -> __pci_write_msi_msg()
> > + * re-writes the MSI/MSI-X registers, but since Hyper-V doesn't trap and
> > + * emulate the accesses, we have to call hv_compose_msi_msg() to ask
> > + * Hyper-V to re-create the IOMMU Interrupt Remapping Table Entries.
> > + */
> > +static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
> > +{
> > +	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
> > +}
> > +
> > static int hv_pci_resume(struct hv_device *hdev)
> > {
> > 	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
> > @@ -3405,6 +3447,8 @@ static int hv_pci_resume(struct hv_device *hdev)
> >
> > 	prepopulate_bars(hbus);
> >
> > +	hv_pci_restore_msi_state(hbus);
> > +
> > 	hbus->state = hv_pcibus_installed;
> > 	return 0;
> > out:
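To make the desc->masked argument above concrete, here is a toy user-space
model (plain C, compilable stand-alone; every name in it is invented and it
is deliberately far simpler than the real MSI code): the IRQ core caches the
mask state, and the resume path rewrites the hardware only from that cache,
so an unmask that bails out early leaves the line masked after resume:

#include <stdbool.h>
#include <stdio.h>

struct msi_state {
	bool cached_masked;	/* models desc->masked */
	bool hw_masked;		/* models the device's mask bit */
};

static void irq_mask(struct msi_state *s)
{
	s->cached_masked = true;
	s->hw_masked = true;
}

/* models the pre-patch hv_irq_unmask(): the hypercall fails and the
 * function returns early, updating neither the cache nor the hw */
static void broken_unmask(struct msi_state *s)
{
	(void)s;
}

/* models the patched hv_irq_unmask(): the hypercall is skipped, but
 * pci_msi_unmask_irq() still resets the cached mask bit */
static void fixed_unmask(struct msi_state *s)
{
	s->cached_masked = false;
}

/* models pci_restore_msi_state() on resume: the hardware mask bit is
 * rewritten from the cached state */
static void restore(struct msi_state *s)
{
	s->hw_masked = s->cached_masked;
}

int main(void)
{
	struct msi_state s = { false, false };

	irq_mask(&s);		/* migrate_one_irq(): chip->irq_mask() */
	broken_unmask(&s);	/* chip->irq_unmask() silently fails */
	restore(&s);		/* resume from hibernation */
	printf("broken: hw_masked=%d\n", s.hw_masked);	/* 1: stuck masked */

	s = (struct msi_state){ false, false };
	irq_mask(&s);
	fixed_unmask(&s);
	restore(&s);
	printf("fixed:  hw_masked=%d\n", s.hw_masked);	/* 0: unmasked */
	return 0;
}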
On Wed, Sep 30, 2020 at 12:38:04AM +0000, Dexuan Cui wrote:
> > From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> > Sent: Monday, September 28, 2020 3:43 AM
> >
> > [+MarcZ - this patch needs IRQ maintainers vetting]
>
> Sure. Hi MarcZ, please also review the patch. Thanks!
>
> > On Tue, Sep 08, 2020 at 04:17:59PM -0700, Dexuan Cui wrote:
> > > Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X
> > > registers, and we must use hv_compose_msi_msg() to ask Hyper-V to
> > > create the IOMMU Interrupt Remapping Table Entries. This is not an issue
> > > for a lot of PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers),
> > > which destroy and re-create the interrupts across hibernation, so
> > > hv_compose_msi_msg() is called automatically. However, some other
> > > PCI device drivers (e.g. the Nvidia driver) may not destroy and re-create
> > > the interrupts across hibernation, so hv_pci_resume() has to call
> > > hv_compose_msi_msg(), otherwise the PCI device drivers can no longer
> > > receive MSI/MSI-X interrupts after hibernation.
> >
> > This looks like driver bugs and I don't think the HV controller
> > driver is where you should fix them.
>
> IMHO this is not a PCI device driver bug, because I think a PCI device driver
> is allowed to keep and re-use the MSI/MSI-X interrupts across hibernation,
> otherwise we would not have pci_restore_msi_state() in pci_restore_state().
>
> The in-tree open-source Nvidia GPU driver drivers/gpu/drm/nouveau is such
> a PCI device driver that re-uses the MSI-X interrupts across hibernation.
> The out-of-tree proprietary Nvidia GPU driver also does the same thing.
> It looks like some other in-tree PCI device drivers also do the same thing,
> though I don't remember their names offhand.
>
> IMO it's much better to change the pci-hyperv driver once and for all, than
> to change every such existing (and future?) PCI device driver.
>
> pci_restore_msi_state() directly writes the MSI/MSI-X related registers
> in __pci_write_msi_msg() and msix_mask_irq(). On a physical machine, this
> works perfectly, but for a Linux VM running on a hypervisor, which typically
> enables IOMMU interrupt remapping, the hypervisor usually should trap and
> emulate the write accesses to the MSI/MSI-X registers, so the hypervisor
> is able to create the necessary interrupt remapping table entries in the
> IOMMU, and the MSI/MSI-X interrupts can work in the VM. Hyper-V is different
> from other hypervisors in that it does not trap and emulate the write
> accesses, and instead it uses a para-virtualized method, which requires
> the VM to call hv_compose_msi_msg() to notify the hypervisor of the info
> that would be passed to the hypervisor in the case of the trap-and-emulate
> method.
>
> I mean this is a Hyper-V specific problem, so IMO we should fix the pci-hyperv
> driver rather than change the PCI device drivers, which work perfectly on a
> physical machine and on other hypervisors. Also it can be difficult or
> impossible to ask the authors of the aforementioned PCI device drivers to
> destroy and re-create MSI/MSI-X across hibernation, especially for the
> out-of-tree driver(s).

Good, so why did you mention PCI drivers in the commit log if they
are not related to the problem you are fixing?

> > Regardless, this commit log does not provide the information that
> > it should.
>
> Hi Lozenzo, I'm happy to add more info. Can you please let me know
> what extra info I should provide?

s/Lozenzo/Lorenzo

The info you describe properly below, namely what the _actual_ problem
is.

> > > Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> > > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > > Reviewed-by: Jake Oshins <jakeo@microsoft.com>
> > >
> > > ---
> > >
> > > Changes in v2:
> > > Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
> > > Added Jake's Reviewed-by.
> > >
> > > drivers/pci/controller/pci-hyperv.c | 44 +++++++++++++++++++++++++++++
> > > 1 file changed, 44 insertions(+)
> > >
> > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > > index fc4c3a15e570..dd21afb5d62b 100644
> > > --- a/drivers/pci/controller/pci-hyperv.c
> > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > @@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data)
> > > 	pbus = pdev->bus;
> > > 	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> > >
> > > +	if (hbus->state == hv_pcibus_removing) {
> > > +		/*
> > > +		 * During hibernation, when a CPU is offlined, the kernel tries
> > > +		 * to move the interrupt to the remaining CPUs that haven't
> > > +		 * been offlined yet. In this case, the below hv_do_hypercall()
> > > +		 * always fails since the vmbus channel has been closed, so we
> > > +		 * should not call the hypercall, but we still need
> > > +		 * pci_msi_unmask_irq() to reset the mask bit in desc->masked:
> > > +		 * see cpu_disable_common() -> fixup_irqs() ->
> > > +		 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
> > > +		 */
> > > +		pci_msi_unmask_irq(data);
> >
> > This is not appropriate - it looks like a plaster to paper over an
> > issue with the Hyper-V hibernation code sequence. Fix that issue
> > instead of papering over it here.
> >
> > Thanks,
> > Lorenzo
>
> IMO this patch is fixing this Hyper-V specific problem. :-)
> The problem is unique to Hyper-V because chip->irq_unmask() may fail in a
> Linux VM running on Hyper-V.
>
> chip->irq_unmask() cannot fail on a physical machine. I guess this is why
> the return value of irq_unmask() is defined as "void" in include/linux/irq.h:
>
> struct irq_chip {
> 	...
> 	void (*irq_mask)(struct irq_data *data);
> 	void (*irq_unmask)(struct irq_data *data);
>
> As I described in the comment, in a VM on Hyper-V, chip->irq_unmask()
> fails during the suspending phase of hibernation because it's called
> when the non-boot CPUs are being offlined, and at this time all the devices,
> including Hyper-V VMBus devices, have been "frozen" -- this is part of
> the standard Linux hibernation workflow. Since Hyper-V thinks the VM has
> frozen the pci-hyperv VMBus device at this moment (i.e. closed the VMBus
> channel of the VMBus device), it fails chip->irq_unmask(), i.e.
> hv_irq_unmask() -> hv_do_hypercall().
>
> On a physical machine, unmasking an MSI/MSI-X register just means an MMIO
> write, which I think cannot fail here.
>
> So I think this patch is the correct fix, considering Hyper-V's unique
> implementation of the MSI "chip" (i.e. Hyper-V does not trap and emulate
> the MSI/MSI-X register accesses, and uses a para-virtualized method as I
> explained above), and the fact that I shouldn't and can't change the
> standard Linux hibernation workflow.
>
> In hv_irq_unmask(), when I skip the hypercall in the case of
> hbus->state == hv_pcibus_removing, I still need the pci_msi_unmask_irq(),
> because of the sequences in kernel/irq/cpuhotplug.c:
>
> static bool migrate_one_irq(struct irq_desc *desc)
> {
> 	...
> 	if (maskchip && chip->irq_mask)
> 		chip->irq_mask(d);
> 	...
> 	err = irq_do_set_affinity(d, affinity, false);
> 	...
> 	if (maskchip && chip->irq_unmask)
> 		chip->irq_unmask(d);
>
> Here if hv_irq_unmask() does not call pci_msi_unmask_irq(), the desc->masked
> remains "true", so later after hibernation, the MSI interrupt line always
> remains masked, which is incorrect.
>
> Here the silent failure of hv_irq_unmask() does not matter since all the
> non-boot CPUs are being offlined (meaning all the devices have been
> frozen). Note: the correct affinity info is still updated into the
> irq_data structure in migrate_one_irq() -> irq_do_set_affinity() ->
> hv_set_affinity(), so when the VM resumes, hv_pci_resume() ->
> hv_pci_restore_msi_state() is able to correctly restore the irqs with
> the correct affinity.
>
> I hope the explanation can help clarify things. I understand this is
> not as natural as the case where Linux runs on a physical machine, but
> due to the unique PCI pass-through implementation of Hyper-V, IMO this
> is the only viable fix for the problem here. BTW, this patch is only
> confined to the pci-hyperv driver and I believe it cannot cause any
> regression.

Understood, write this in the commit log and I won't nag you any further.

Side note: this issue is there because the hypercall failure triggers
an early return from hv_irq_unmask(). Is that early return really
correct? Another possibility is just logging the error, letting
hv_irq_unmask() continue, and calling pci_msi_unmask_irq() in the exit
path. Is there a hypercall return value that you can use to detect
fatal vs non-fatal (i.e. hibernation) hypercall failures?

I was confused by reading the patch since it seemed that you call
pci_msi_unmask_irq() _only_ while hibernating, which was certainly
a bug.

Thank you for explaining.

Lorenzo

> Thanks,
> -- Dexuan
>
> > > +		return;
> > > +	}
> > > +
> > > 	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
> > >
> > > 	params = &hbus->retarget_msi_interrupt_params;
> > > @@ -3372,6 +3387,33 @@ static int hv_pci_suspend(struct hv_device *hdev)
> > > 	return 0;
> > > }
> > >
> > > +static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> > > +{
> > > +	struct msi_desc *entry;
> > > +	struct irq_data *irq_data;
> > > +
> > > +	for_each_pci_msi_entry(entry, pdev) {
> > > +		irq_data = irq_get_irq_data(entry->irq);
> > > +		if (WARN_ON_ONCE(!irq_data))
> > > +			return -EINVAL;
> > > +
> > > +		hv_compose_msi_msg(irq_data, &entry->msg);
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * Upon resume, pci_restore_msi_state() -> ... -> __pci_write_msi_msg()
> > > + * re-writes the MSI/MSI-X registers, but since Hyper-V doesn't trap and
> > > + * emulate the accesses, we have to call hv_compose_msi_msg() to ask
> > > + * Hyper-V to re-create the IOMMU Interrupt Remapping Table Entries.
> > > + */
> > > +static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
> > > +{
> > > +	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
> > > +}
> > > +
> > > static int hv_pci_resume(struct hv_device *hdev)
> > > {
> > > 	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
> > > @@ -3405,6 +3447,8 @@ static int hv_pci_resume(struct hv_device *hdev)
> > >
> > > 	prepopulate_bars(hbus);
> > >
> > > +	hv_pci_restore_msi_state(hbus);
> > > +
> > > 	hbus->state = hv_pcibus_installed;
> > > 	return 0;
> > > out:
>
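On the question of telling failures apart: a sketch of what inspecting the
hypercall status inside hv_irq_unmask() could look like, assuming the
HV_STATUS_* and HV_HYPERCALL_RESULT_MASK definitions from the hyperv-tlfs.h
headers, with the rest of the function elided and the hypercall arguments
approximated from the existing code. As the next mail explains, the status
alone cannot distinguish the hibernation case from the early-removal case,
since both return HV_STATUS_INVALID_PARAMETER:

	u64 res;

	res = hv_do_hypercall(HVCALL_RETARGET_INTERRUPT | (var_size << 17),
			      params, NULL);
	switch (res & HV_HYPERCALL_RESULT_MASK) {
	case HV_STATUS_SUCCESS:
		break;
	case HV_STATUS_INVALID_PARAMETER:
		/* Seen both when the device is removed before the guest
		 * has finished initializing it and when the VMBus channel
		 * is already closed for hibernation -- the status does
		 * not tell the two apart. */
		break;
	default:
		dev_err(&hbus->hdev->device,
			"%s() failed: %#llx\n", __func__, res);
		break;
	}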
> From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Sent: Thursday, October 1, 2020 3:13 AM
> > ...
> > I mean this is a Hyper-V specific problem, so IMO we should fix the
> > pci-hyperv driver rather than change the PCI device drivers, which
> > work perfectly on a physical machine and on other hypervisors.
> > Also it can be difficult or impossible to ask the authors of the
> > aforementioned PCI device drivers to destroy and re-create
> > MSI/MSI-X across hibernation, especially for the out-of-tree driver(s).
>
> Good, so why did you mention PCI drivers in the commit log if they
> are not related to the problem you are fixing?

I mentioned the names of the PCI device drivers because IMO people want
to know how the issue can be reproduced (i.e. which PCI devices are
affected and which are not), so they know how to test this patch.

I'll remove the names of the unaffected PCI device drivers from the
commit log, and only keep the name of the Nvidia GPU drivers (which are
so far the only drivers I have identified that are affected, when a
Linux VM runs on Hyper-V and hibernates).

> > > Regardless, this commit log does not provide the information that
> > > it should.
> >
> > Hi Lozenzo, I'm happy to add more info. Can you please let me know
> > what extra info I should provide?
>
> s/Lozenzo/Lorenzo

Sorry! Will fix the typo.

> The info you describe properly below, namely what the _actual_ problem
> is.

I will send v3 with the below info.

> > Here if hv_irq_unmask() does not call pci_msi_unmask_irq(), the
> > desc->masked remains "true", so later after hibernation, the MSI
> > interrupt line always remains masked, which is incorrect.
> >
> > Here the silent failure of hv_irq_unmask() does not matter since all the
> > non-boot CPUs are being offlined (meaning all the devices have been
> > frozen). Note: the correct affinity info is still updated into the
> > irq_data structure in migrate_one_irq() -> irq_do_set_affinity() ->
> > hv_set_affinity(), so when the VM resumes, hv_pci_resume() ->
> > hv_pci_restore_msi_state() is able to correctly restore the irqs with
> > the correct affinity.
> >
> > I hope the explanation can help clarify things. I understand this is
> > not as natural as the case where Linux runs on a physical machine, but
> > due to the unique PCI pass-through implementation of Hyper-V, IMO this
> > is the only viable fix for the problem here. BTW, this patch is only
> > confined to the pci-hyperv driver and I believe it cannot cause any
> > regression.
>
> Understood, write this in the commit log and I won't nag you any further.

Ok. I treat it as an opportunity to practise and improve my writing :-)

> Side note: this issue is there because the hypercall failure triggers
> an early return from hv_irq_unmask().

Yes.

> Is that early return really correct?

Good question. IMO it's incorrect, because hv_irq_unmask() is called when
the interrupt is activated for the first time, and when the interrupt's
affinity is being changed. In these cases, we may as well call
pci_msi_unmask_irq() unconditionally, even if the hypercall fails.

BTW, AFAIK, in practice the hypercall only fails in 2 cases:

1. The device is removed when the Linux VM has not finished the device's
initialization.

2. In hibernation, the device has been disabled while the generic
hibernation code tries to migrate the interrupt, as I explained.

In the 2 cases, the hypercall returns the same error code
HV_STATUS_INVALID_PARAMETER (0x5).

> Another possibility is just logging the error, letting
> hv_irq_unmask() continue, and calling pci_msi_unmask_irq() in the exit
> path.

This is a good idea. I'll make this change in v3.

> Is there a hypercall return value that you can use to detect fatal vs
> non-fatal (i.e. hibernation) hypercall failures?

Unfortunately, IMO there is not. The spec (v6.0b)'s section 10.5.4 (page 106)
https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs
does define some return values, but IMO they're not applicable here.

> I was confused by reading the patch since it seemed that you call
> pci_msi_unmask_irq() _only_ while hibernating, which was certainly
> a bug.
>
> Thank you for explaining.
>
> Lorenzo

Thanks for reviewing! I'll post v3. Looking forward to your new comments!

Thanks,
-- Dexuan
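Putting the two agreed changes together, the restructured hv_irq_unmask()
might look roughly like this sketch (an illustration of the idea discussed
above, not the actual v3 patch; the variable setup and the hypercall
arguments are elided or approximated from the existing function):

static void hv_irq_unmask(struct irq_data *data)
{
	/* ... pdev/hbus/params lookup as in the current function ... */

	if (hbus->state == hv_pcibus_removing)
		goto unmask;	/* VMBus channel closed: hypercall would fail */

	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
	/* ... compose the retarget params as in the current function ... */
	res = hv_do_hypercall(HVCALL_RETARGET_INTERRUPT | (var_size << 17),
			      params, NULL);
	spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);

	if (res)
		dev_err(&hbus->hdev->device,
			"%s() failed: %#llx\n", __func__, res);

unmask:
	/* Reset desc->masked unconditionally in the exit path */
	pci_msi_unmask_irq(data);
}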
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index fc4c3a15e570..dd21afb5d62b 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data)
 	pbus = pdev->bus;
 	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
 
+	if (hbus->state == hv_pcibus_removing) {
+		/*
+		 * During hibernation, when a CPU is offlined, the kernel tries
+		 * to move the interrupt to the remaining CPUs that haven't
+		 * been offlined yet. In this case, the below hv_do_hypercall()
+		 * always fails since the vmbus channel has been closed, so we
+		 * should not call the hypercall, but we still need
+		 * pci_msi_unmask_irq() to reset the mask bit in desc->masked:
+		 * see cpu_disable_common() -> fixup_irqs() ->
+		 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
+		 */
+		pci_msi_unmask_irq(data);
+		return;
+	}
+
 	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
 
 	params = &hbus->retarget_msi_interrupt_params;
@@ -3372,6 +3387,33 @@ static int hv_pci_suspend(struct hv_device *hdev)
 	return 0;
 }
 
+static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
+{
+	struct msi_desc *entry;
+	struct irq_data *irq_data;
+
+	for_each_pci_msi_entry(entry, pdev) {
+		irq_data = irq_get_irq_data(entry->irq);
+		if (WARN_ON_ONCE(!irq_data))
+			return -EINVAL;
+
+		hv_compose_msi_msg(irq_data, &entry->msg);
+	}
+
+	return 0;
+}
+
+/*
+ * Upon resume, pci_restore_msi_state() -> ... -> __pci_write_msi_msg()
+ * re-writes the MSI/MSI-X registers, but since Hyper-V doesn't trap and
+ * emulate the accesses, we have to call hv_compose_msi_msg() to ask
+ * Hyper-V to re-create the IOMMU Interrupt Remapping Table Entries.
+ */
+static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
+{
+	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
+}
+
 static int hv_pci_resume(struct hv_device *hdev)
 {
 	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
@@ -3405,6 +3447,8 @@ static int hv_pci_resume(struct hv_device *hdev)
 
 	prepopulate_bars(hbus);
 
+	hv_pci_restore_msi_state(hbus);
+
 	hbus->state = hv_pcibus_installed;
 	return 0;
 out: