
[v2] PCI: hv: Fix hibernation in case interrupts are not re-created

Message ID 20200908231759.13336-1-decui@microsoft.com (mailing list archive)
State Superseded, archived
Delegated to: Lorenzo Pieralisi
Series [v2] PCI: hv: Fix hibernation in case interrupts are not re-created

Commit Message

Dexuan Cui Sept. 8, 2020, 11:17 p.m. UTC
Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X registers,
and we must use hv_compose_msi_msg() to ask Hyper-V to create the IOMMU
Interrupt Remapping Table Entries. This is not an issue for a lot of
PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers), which
destroy and re-create the interrupts across hibernation, so
hv_compose_msi_msg() is called automatically. However, some other PCI
device drivers (e.g. the Nvidia driver) may not destroy and re-create
the interrupts across hibernation, so hv_pci_resume() has to call
hv_compose_msi_msg(), otherwise the PCI device drivers can no longer
receive MSI/MSI-X interrupts after hibernation.

Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Jake Oshins <jakeo@microsoft.com>

---

Changes in v2:
    Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
    Added Jake's Reviewed-by.

 drivers/pci/controller/pci-hyperv.c | 44 +++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

Comments

Dexuan Cui Sept. 21, 2020, 5:01 p.m. UTC | #1
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Tuesday, September 8, 2020 4:18 PM
> 
> Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X registers,
> and we must use hv_compose_msi_msg() to ask Hyper-V to create the IOMMU
> Interrupt Remapping Table Entries. This is not an issue for a lot of
> PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers), which
> destroy and re-create the interrupts across hibernation, so
> hv_compose_msi_msg() is called automatically. However, some other PCI
> device drivers (e.g. the Nvidia driver) may not destroy and re-create
> the interrupts across hibernation, so hv_pci_resume() has to call
> hv_compose_msi_msg(), otherwise the PCI device drivers can no longer
> receive MSI/MSI-X interrupts after hibernation.
> 
> Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Reviewed-by: Jake Oshins <jakeo@microsoft.com>
> 
> ---
> 
> Changes in v2:
>     Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
>     Added Jake's Reviewed-by.
> 
>  drivers/pci/controller/pci-hyperv.c | 44 +++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)

Hi Lorenzo, Bjorn,
Can you please take a look at this patch? 
I hope it still could have a chance to be in 5.9. :-)

Thanks,
-- Dexuan
Lorenzo Pieralisi Sept. 28, 2020, 10:43 a.m. UTC | #2
[+MarcZ - this patch needs IRQ maintainers vetting]

On Tue, Sep 08, 2020 at 04:17:59PM -0700, Dexuan Cui wrote:
> Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X registers,
> and we must use hv_compose_msi_msg() to ask Hyper-V to create the IOMMU
> Interrupt Remapping Table Entries. This is not an issue for a lot of
> PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers), which
> destroy and re-create the interrupts across hibernation, so
> hv_compose_msi_msg() is called automatically. However, some other PCI
> device drivers (e.g. the Nvidia driver) may not destroy and re-create
> the interrupts across hibernation, so hv_pci_resume() has to call
> hv_compose_msi_msg(), otherwise the PCI device drivers can no longer
> receive MSI/MSI-X interrupts after hibernation.

This looks like driver bugs and I don't think the HV controller
driver is where you should fix them. Regardless, this commit log
does not provide the information that it should.

> Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Reviewed-by: Jake Oshins <jakeo@microsoft.com>
> 
> ---
> 
> Changes in v2:
>     Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
>     Added Jake's Reviewed-by.
> 
>  drivers/pci/controller/pci-hyperv.c | 44 +++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index fc4c3a15e570..dd21afb5d62b 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data)
>  	pbus = pdev->bus;
>  	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
>  
> +	if (hbus->state == hv_pcibus_removing) {
> +		/*
> +		 * During hibernation, when a CPU is offlined, the kernel tries
> +		 * to move the interrupt to the remaining CPUs that haven't
> +		 * been offlined yet. In this case, the below hv_do_hypercall()
> +		 * always fails since the vmbus channel has been closed, so we
> +		 * should not call the hypercall, but we still need
> +		 * pci_msi_unmask_irq() to reset the mask bit in desc->masked:
> +		 * see cpu_disable_common() -> fixup_irqs() ->
> +		 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
> +		 */
> +		pci_msi_unmask_irq(data);

This is not appropriate - it looks like a plaster to paper over an
issue with hyper-V hibernation code sequence. Fix that issue instead
of papering over it here.

Thanks,
Lorenzo

> +		return;
> +	}
> +
>  	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
>  
>  	params = &hbus->retarget_msi_interrupt_params;
> @@ -3372,6 +3387,33 @@ static int hv_pci_suspend(struct hv_device *hdev)
>  	return 0;
>  }
>  
> +static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> +{
> +	struct msi_desc *entry;
> +	struct irq_data *irq_data;
> +
> +	for_each_pci_msi_entry(entry, pdev) {
> +		irq_data = irq_get_irq_data(entry->irq);
> +		if (WARN_ON_ONCE(!irq_data))
> +			return -EINVAL;
> +
> +		hv_compose_msi_msg(irq_data, &entry->msg);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Upon resume, pci_restore_msi_state() -> ... ->  __pci_write_msi_msg()
> + * re-writes the MSI/MSI-X registers, but since Hyper-V doesn't trap and
> + * emulate the accesses, we have to call hv_compose_msi_msg() to ask
> + * Hyper-V to re-create the IOMMU Interrupt Remapping Table Entries.
> + */
> +static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
> +{
> +	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
> +}
> +
>  static int hv_pci_resume(struct hv_device *hdev)
>  {
>  	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
> @@ -3405,6 +3447,8 @@ static int hv_pci_resume(struct hv_device *hdev)
>  
>  	prepopulate_bars(hbus);
>  
> +	hv_pci_restore_msi_state(hbus);
> +
>  	hbus->state = hv_pcibus_installed;
>  	return 0;
>  out:
> -- 
> 2.19.1
>
Dexuan Cui Sept. 30, 2020, 12:38 a.m. UTC | #3
> From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Sent: Monday, September 28, 2020 3:43 AM
>
> [+MarcZ - this patch needs IRQ maintainers vetting]

Sure. Hi MarcZ, please also review the patch. Thanks!

> On Tue, Sep 08, 2020 at 04:17:59PM -0700, Dexuan Cui wrote:
> > Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X
> > registers, and we must use hv_compose_msi_msg() to ask Hyper-V to
> > create the IOMMU Interrupt Remapping Table Entries. This is not an issue
> > for a lot of PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers),
> > which destroy and re-create the interrupts across hibernation, so
> > hv_compose_msi_msg() is called automatically. However, some other
> > PCI device drivers (e.g. the Nvidia driver) may not destroy and re-create
> > the interrupts across hibernation, so hv_pci_resume() has to call
> > hv_compose_msi_msg(), otherwise the PCI device drivers can no longer
> > receive MSI/MSI-X interrupts after hibernation.
>
> This looks like driver bugs and I don't think the HV controller
> driver is where you should fix them.

IMHO this is not a PCI device driver bug, because I think a PCI device driver
is allowed to keep and re-use the MSI/MSI-X interrupts across hibernation,
otherwise we would not have pci_restore_msi_state() in pci_restore_state().

The in-tree open-source Nvidia GPU driver drivers/gpu/drm/nouveau is such
a PCI device driver that re-uses the MSI-X interrupts across hibernation.
The out-of-tree proprietary Nvidia GPU driver also does the same thing.
It looks like some other in-tree PCI device drivers also do the same thing, though
I don't remember their names offhand.

IMO it's much better to change the pci-hyperv driver once and for all, than
to change every such existing (and future?) PCI device driver.
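
To illustrate what such a driver looks like, here is a minimal sketch of a
resume path that re-uses its MSI/MSI-X vectors (a hypothetical driver -- the
foo_* names are made up; only the pci_* calls are real kernel APIs):

static int foo_resume(struct device *dev)
{
	struct pci_dev *pdev = to_pci_dev(dev);

	pci_set_power_state(pdev, PCI_D0);
	/*
	 * pci_restore_state() internally calls pci_restore_msi_state(),
	 * which re-writes the MSI/MSI-X registers from the saved
	 * entry->msg; the driver never frees and re-requests its irqs.
	 */
	pci_restore_state(pdev);

	return foo_hw_reinit(pdev);	/* device-specific re-initialization */
}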

pci_restore_msi_state() directly writes the MSI/MSI-X related registers
in __pci_write_msi_msg() and msix_mask_irq(). On a physical machine, this
works perfectly, but for a Linux VM running on a hypervisor, which typically
enables IOMMU interrupt remapping, the hypervisor usually should trap and
emulate the write accesses to the MSI/MSI-X registers, so the hypervisor
is able to create the necessary interrupt remapping table entries in the
IOMMU, and the MSI/MSI-X interrupts can work in the VM. Hyper-V is different
from other hypervisors in that it does not trap and emulate the write
accesses, and instead it uses a para-virtualized method, which requires
the VM to call hv_compose_msi_msg() to notify the hypervisor of the info
that would be passed to the hypervisor in the case of the trap-and-emulate
method.
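
Roughly, the two paths look like this (a simplified sketch of the call
chains, not the exact code):

  Physical machine (or a hypervisor that traps the writes):
    pci_restore_msi_state() -> __pci_write_msi_msg()
      -> plain config/MMIO write; a trapping hypervisor intercepts it and
         updates its interrupt remapping table.

  Hyper-V guest:
    the write is not intercepted, so the guest must also go through
    hv_compose_msi_msg(), which asks Hyper-V to create the remapping table
    entry and hands back the MSI address/data that __pci_write_msi_msg()
    then programs.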

I mean this is a Hyper-V specific problem, so IMO we should fix the pci-hyperv
driver rather than change the PCI device drivers, which work perfectly on a
physical machine and on other hypervisors. Also it can be difficult or 
impossible to ask the authors of the aforementioned PCI device drivers to
destroy and re-create MSI/MSI-X across hibernation, especially for the
out-of-tree driver(s).

> Regardless, this commit log does not provide the information that
> it should.

Hi Lozenzo, I'm happy to add more info. Can you please let me know
what extra info I should provide?

> > Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > Reviewed-by: Jake Oshins <jakeo@microsoft.com>
> >
> > ---
> >
> > Changes in v2:
> >     Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
> >     Added Jake's Reviewed-by.
> >
> >  drivers/pci/controller/pci-hyperv.c | 44 +++++++++++++++++++++++++++++
> >  1 file changed, 44 insertions(+)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c
> b/drivers/pci/controller/pci-hyperv.c
> > index fc4c3a15e570..dd21afb5d62b 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data)
> >  	pbus = pdev->bus;
> >  	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> >
> > +	if (hbus->state == hv_pcibus_removing) {
> > +		/*
> > +		 * During hibernation, when a CPU is offlined, the kernel tries
> > +		 * to move the interrupt to the remaining CPUs that haven't
> > +		 * been offlined yet. In this case, the below hv_do_hypercall()
> > +		 * always fails since the vmbus channel has been closed, so we
> > +		 * should not call the hypercall, but we still need
> > +		 * pci_msi_unmask_irq() to reset the mask bit in desc->masked:
> > +		 * see cpu_disable_common() -> fixup_irqs() ->
> > +		 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
> > +		 */
> > +		pci_msi_unmask_irq(data);
>
> This is not appropriate - it looks like a plaster to paper over an
> issue with hyper-V hibernation code sequence. Fix that issue instead
> of papering over it here.
>
> Thanks,
> Lorenzo

IMO this patch is fixing this Hyper-V specific problem. :-)

The problem is unique to Hyper-V because chip->irq_unmask() may fail in a
Linux VM running on Hyper-V. 

chip->irq_unmask() can not fail on a physical machine. I guess this is why 
the return value of irq_unmask() is defined as "void" in include/linux/irq.h:
struct irq_chip {
...
        void            (*irq_mask)(struct irq_data *data);
        void            (*irq_unmask)(struct irq_data *data);

As I described in the comment, in a VM on Hyper-V, chip->irq_unmask()
fails during the suspending phase of hibernation because it's called
when the non-boot CPUs are being offlined, and at this time all the devices,
including Hyper-V VMBus devices, have been "frozen" -- this is part of
the standard Linux hibernation workflow. Since Hyper-V thinks the VM has
frozen the pci-hyperv VMBus device at this moment (i.e. closed the VMBus
channel of the VMBus device), it fails chip->irq_unmask(), i.e.
hv_irq_unmask() -> hv_do_hypercall().

On a physical machine, unmasking an MSI/MSI-X register just means an MMIO
write, which I think can not fail here.

So I think this patch is the correct fix, considering Hyper-V's unique
implementation of the MSI "chip" (i.e. Hyper-V does not trap and emulate
the MSI/MSI-X register accesses, and uses a para-virtualized method as I
explained above), and the fact that I shouldn't and can't change the
standard Linux hibernation workflow.

In hv_irq_unmask(), when I skip the hypercall in the case of
hbus->state == hv_pcibus_removing, I still need the pci_msi_unmask_irq(),
because of the sequences in kernel/irq/cpuhotplug.c:

static bool migrate_one_irq(struct irq_desc *desc)
{
...
        if (maskchip && chip->irq_mask)
                chip->irq_mask(d);
...
        err = irq_do_set_affinity(d, affinity, false);
...
        if (maskchip && chip->irq_unmask)
                chip->irq_unmask(d);

Here if hv_irq_unmask does not call pci_msi_unmask_irq(), the desc->masked
remains "true", so later after hibernation, the MSI interrupt line always
remains masked, which is incorrect.

Here the silent failure of hv_irq_unmask() does not matter since all the
non-boot CPUs are being offlined (meaning all the devices have been
frozen). Note: the correct affinity info is still updated into the
irqdata data structure in migrate_one_irq() -> irq_do_set_affinity() ->
hv_set_affinity(), so when the VM resumes, hv_pci_resume() ->
hv_pci_restore_msi_state() is able to correctly restore the irqs with
the correct affinity.
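
To put the whole sequence together (simplified):

  Suspend side of hibernation:
    freeze all devices -> hv_pci_suspend() closes the VMBus channel
    -> offline the non-boot CPUs -> migrate_one_irq() -> hv_irq_unmask():
       the hypercall would fail here, so it is skipped, but
       pci_msi_unmask_irq() still clears desc->masked, and the new affinity
       has already been recorded in the irq_data.

  Resume side:
    hv_pci_resume() re-opens the VMBus channel
    -> hv_pci_restore_msi_state() -> hv_compose_msi_msg() re-creates the
       IOMMU Interrupt Remapping Table Entries using that recorded affinity.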

I hope the explanation can help clarify things. I understand this is not
as natural as the case where Linux runs on a physical machine, but due to
the unique PCI pass-through implementation of Hyper-V, IMO this is
the only viable fix for the problem here. BTW, this patch is only confined
to the pci-hyperv driver and I believe it cannot cause any regression.

Thanks,
-- Dexuan


> > +		return;
> > +	}
> > +
> >  	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
> >
> >  	params = &hbus->retarget_msi_interrupt_params;
> > @@ -3372,6 +3387,33 @@ static int hv_pci_suspend(struct hv_device
> *hdev)
> >  	return 0;
> >  }
> >
> > +static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> > +{
> > +	struct msi_desc *entry;
> > +	struct irq_data *irq_data;
> > +
> > +	for_each_pci_msi_entry(entry, pdev) {
> > +		irq_data = irq_get_irq_data(entry->irq);
> > +		if (WARN_ON_ONCE(!irq_data))
> > +			return -EINVAL;
> > +
> > +		hv_compose_msi_msg(irq_data, &entry->msg);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Upon resume, pci_restore_msi_state() -> ... ->  __pci_write_msi_msg()
> > + * re-writes the MSI/MSI-X registers, but since Hyper-V doesn't trap and
> > + * emulate the accesses, we have to call hv_compose_msi_msg() to ask
> > + * Hyper-V to re-create the IOMMU Interrupt Remapping Table Entries.
> > + */
> > +static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
> > +{
> > +	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
> > +}
> > +
> >  static int hv_pci_resume(struct hv_device *hdev)
> >  {
> >  	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
> > @@ -3405,6 +3447,8 @@ static int hv_pci_resume(struct hv_device *hdev)
> >
> >  	prepopulate_bars(hbus);
> >
> > +	hv_pci_restore_msi_state(hbus);
> > +
> >  	hbus->state = hv_pcibus_installed;
> >  	return 0;
> >  out:
Lorenzo Pieralisi Oct. 1, 2020, 10:13 a.m. UTC | #4
On Wed, Sep 30, 2020 at 12:38:04AM +0000, Dexuan Cui wrote:
> > From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> > Sent: Monday, September 28, 2020 3:43 AM
> >
> > [+MarcZ - this patch needs IRQ maintainers vetting]
> 
> Sure. Hi MarcZ, please also review the patch. Thanks!
> 
> > On Tue, Sep 08, 2020 at 04:17:59PM -0700, Dexuan Cui wrote:
> > > Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X
> > > registers, and we must use hv_compose_msi_msg() to ask Hyper-V to
> > > create the IOMMU Interrupt Remapping Table Entries. This is not an issue
> > > for a lot of PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers),
> > > which destroy and re-create the interrupts across hibernation, so
> > > hv_compose_msi_msg() is called automatically. However, some other
> > > PCI device drivers (e.g. the Nvidia driver) may not destroy and re-create
> > > the interrupts across hibernation, so hv_pci_resume() has to call
> > > hv_compose_msi_msg(), otherwise the PCI device drivers can no longer
> > > receive MSI/MSI-X interrupts after hibernation.
> >
> > This looks like driver bugs and I don't think the HV controller
> > driver is where you should fix them.
> 
> IMHO this is not a PCI device driver bug, because I think a PCI device driver
> is allowed to keep and re-use the MSI/MSI-X interrupts across hibernation,
> otherwise we would not have pci_restore_msi_state() in pci_restore_state().
> 
> The in-tree open-source Nvidia GPU driver drivers/gpu/drm/nouveau is such
> a PCI device driver that re-uses the MSI-X interrupts across hibernation.
> The out-of-tree proprietary Nvidia GPU driver also does the same thing.
> It looks like some other in-tree PCI device drivers also do the same thing, though
> I don't remember their names offhand.
> 
> IMO it's much better to change the pci-hyperv driver once and for all, than
> to change every such existing (and future?) PCI device driver.
> 
> pci_restore_msi_state() directly writes the MSI/MSI-X related registers
> in __pci_write_msi_msg() and msix_mask_irq(). On a physical machine, this
> works perfectly, but for a Linux VM running on a hypervisor, which typically
> enables IOMMU interrupt remapping, the hypervisor usually should trap and
> emulate the write accesses to the MSI/MSI-X registers, so the hypervisor
> is able to create the necessary interrupt remapping table entries in the
> IOMMU, and the MSI/MSI-X interrupts can work in the VM. Hyper-V is different
> from other hypervisors in that it does not trap and emulate the write
> accesses, and instead it uses a para-virtualized method, which requires
> the VM to call hv_compose_msi_msg() to notify the hypervisor of the info
> that would be passed to the hypervisor in the case of the trap-and-emulate
> method.
> 
> I mean this is a Hyper-V specific problem, so IMO we should fix the pci-hyperv
> driver rather than change the PCI device drivers, which work perfectly on a
> physical machine and on other hypervisors. Also it can be difficult or 
> impossible to ask the authors of the aforementioned PCI device drivers to
> destroy and re-create MSI/MSI-X across hibernation, especially for the
> out-of-tree driver(s).

Good, so why did you mention PCI drivers in the commit log if they
are not related to the problem you are fixing ?

> > Regardless, this commit log does not provide the information that
> > it should.
> 
> Hi Lozenzo, I'm happy to add more info. Can you please let me know
> what extra info I should provide?

s/Lozenzo/Lorenzo

The info you describe properly below, namely what the _actual_ problem
is.

> > > Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> > > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > > Reviewed-by: Jake Oshins <jakeo@microsoft.com>
> > >
> > > ---
> > >
> > > Changes in v2:
> > >     Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
> > >     Added Jake's Reviewed-by.
> > >
> > >  drivers/pci/controller/pci-hyperv.c | 44 +++++++++++++++++++++++++++++
> > >  1 file changed, 44 insertions(+)
> > >
> > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > b/drivers/pci/controller/pci-hyperv.c
> > > index fc4c3a15e570..dd21afb5d62b 100644
> > > --- a/drivers/pci/controller/pci-hyperv.c
> > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > @@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data)
> > >  	pbus = pdev->bus;
> > >  	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> > >
> > > +	if (hbus->state == hv_pcibus_removing) {
> > > +		/*
> > > +		 * During hibernation, when a CPU is offlined, the kernel tries
> > > +		 * to move the interrupt to the remaining CPUs that haven't
> > > +		 * been offlined yet. In this case, the below hv_do_hypercall()
> > > +		 * always fails since the vmbus channel has been closed, so we
> > > +		 * should not call the hypercall, but we still need
> > > +		 * pci_msi_unmask_irq() to reset the mask bit in desc->masked:
> > > +		 * see cpu_disable_common() -> fixup_irqs() ->
> > > +		 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
> > > +		 */
> > > +		pci_msi_unmask_irq(data);
> >
> > This is not appropriate - it looks like a plaster to paper over an
> > issue with hyper-V hibernation code sequence. Fix that issue instead
> > of papering over it here.
> >
> > Thanks,
> > Lorenzo
> 
> IMO this patch is fixing this Hyper-V specific problem. :-)
> The problem is unique to Hyper-V because chip->irq_unmask() may fail in a
> Linux VM running on Hyper-V. 
> 
> chip->irq_unmask() can not fail on a physical machine. I guess this is why 
> the return value of irq_unmask() is defined as "void" in include/linux/irq.h:
> struct irq_chip {
> ...
>         void            (*irq_mask)(struct irq_data *data);
>         void            (*irq_unmask)(struct irq_data *data);
> 
> As I described in the comment, in a VM on Hyper-V, chip->irq_unmask()
> fails during the suspending phase of hibernation because it's called
> when the non-boot CPUs are being offlined, and at this time all the devices,
> including Hyper-V VMBus devices, have been "frozen" -- this is part of
> the standard Linux hibernation workflow. Since Hyper-V thinks the VM has
> frozen the pci-hyperv VMBus device at this moment (i.e. closed the VMBus
> channel of the VMBus device), it fails chip->irq_unmask(), i.e.
> hv_irq_unmask() -> hv_do_hypercall().
> 
> On a physical machine, unmasking an MSI/MSI-X register just means an MMIO
> write, which I think can not fail here.
> 
> So I think this patch is the correct fix, considering Hyper-V's unique
> implementation of the MSI "chip" (i.e. Hyper-V does not trap and emulate
> the MSI/MSI-X register accesses, and uses a para-virtualized method as I
> explained above), and the fact that I shouldn't and can't change the
> standard Linux hibernation workflow.
> 
> In hv_irq_unmask(), when I skip the hypercall in the case of
> hbus->state == hv_pcibus_removing, I still need the pci_msi_unmask_irq(),
> because of the sequences in kernel/irq/cpuhotplug.c:
> 
> static bool migrate_one_irq(struct irq_desc *desc)
> {
> ...
>         if (maskchip && chip->irq_mask)
>                 chip->irq_mask(d);
> ...
>         err = irq_do_set_affinity(d, affinity, false);
> ...
>         if (maskchip && chip->irq_unmask)
>                 chip->irq_unmask(d);
> 
> Here if hv_irq_unmask does not call pci_msi_unmask_irq(), the desc->masked
> remains "true", so later after hibernation, the MSI interrupt line always
> remains masked, which is incorrect.
> 
> Here the silent failure of hv_irq_unmask() does not matter since all the
> non-boot CPUs are being offlined (meaning all the devices have been
> frozen). Note: the correct affinity info is still updated into the
> irqdata data structure in migrate_one_irq() -> irq_do_set_affinity() ->
> hv_set_affinity(), so when the VM resumes, hv_pci_resume() ->
> hv_pci_restore_msi_state() is able to correctly restore the irqs with
> the correct affinity.
> 
> I hope the explanation can help clarify things. I understand this is
> not as natural as the case where Linux runs on a physical machine, but
> due to the unique PCI pass-through implementation of Hyper-V, IMO this
> is the only viable fix for the problem here. BTW, this patch is only
> confined to the pci-hyperv driver and I believe it cannot cause any
> regression.

Understood, write this in the commit log and I won't nag you any further.

Side note: this issue is there because the hypcall failure triggers
an early return from hv_irq_unmask(). Is that early return really
correct ? Another possibility is just logging the error and let
hv_irq_unmask() continue and call pci_msi_unmask_irq() in the exit
path.

Is there a hypcall return value that you can use to detect fatal vs
non-fatal (ie hibernation) hypcall failures ?

I was confused by reading the patch since it seemed that you call
pci_msi_unmask_irq() _only_ while hibernating, which was certainly
a bug.

Thank you for explaining.

Lorenzo

> Thanks,
> -- Dexuan
> 
> 
> > > +		return;
> > > +	}
> > > +
> > >  	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
> > >
> > >  	params = &hbus->retarget_msi_interrupt_params;
> > > @@ -3372,6 +3387,33 @@ static int hv_pci_suspend(struct hv_device
> > *hdev)
> > >  	return 0;
> > >  }
> > >
> > > +static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> > > +{
> > > +	struct msi_desc *entry;
> > > +	struct irq_data *irq_data;
> > > +
> > > +	for_each_pci_msi_entry(entry, pdev) {
> > > +		irq_data = irq_get_irq_data(entry->irq);
> > > +		if (WARN_ON_ONCE(!irq_data))
> > > +			return -EINVAL;
> > > +
> > > +		hv_compose_msi_msg(irq_data, &entry->msg);
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * Upon resume, pci_restore_msi_state() -> ... ->  __pci_write_msi_msg()
> > > + * re-writes the MSI/MSI-X registers, but since Hyper-V doesn't trap and
> > > + * emulate the accesses, we have to call hv_compose_msi_msg() to ask
> > > + * Hyper-V to re-create the IOMMU Interrupt Remapping Table Entries.
> > > + */
> > > +static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
> > > +{
> > > +	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
> > > +}
> > > +
> > >  static int hv_pci_resume(struct hv_device *hdev)
> > >  {
> > >  	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
> > > @@ -3405,6 +3447,8 @@ static int hv_pci_resume(struct hv_device *hdev)
> > >
> > >  	prepopulate_bars(hbus);
> > >
> > > +	hv_pci_restore_msi_state(hbus);
> > > +
> > >  	hbus->state = hv_pcibus_installed;
> > >  	return 0;
> > >  out:
>
Dexuan Cui Oct. 1, 2020, 8:53 p.m. UTC | #5
> From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Sent: Thursday, October 1, 2020 3:13 AM
> > ...
> > I mean this is a Hyper-V specific problem, so IMO we should fix the
> > pci-hyperv driver rather than change the PCI device drivers, which 
> > work perfectly on a physical machine and on other hypervisors. 
> > Also it can be difficult or impossible to ask the authors of the 
> > aforementioned PCI device drivers to destroy and re-create 
> > MSI/MSI-X across hibernation, especially for the out-of-tree driver(s).
> 
> Good, so why did you mention PCI drivers in the commit log if they
> are not related to the problem you are fixing ?

I mentioned the names of the PCI device drivers because IMO people
want to know how the issue can be reproduced (i.e. which PCI devices
are affected and which are not), so they know how to test this patch.

I'll remove the names of the unaffected PCI device drivers from the 
commit log, and only keep the name of the Nvidia GPU drivers (which
are so far the only drivers I have identified that are affected, when
Linux VM runs on Hyper-V and hibernates).
 
> > > Regardless, this commit log does not provide the information that
> > > it should.
> >
> > Hi Lozenzo, I'm happy to add more info. Can you please let me know
> > what extra info I should provide?
> 
> s/Lozenzo/Lorenzo
Sorry! Will fix the typo.
 
> The info you describe properly below, namely what the _actual_ problem
> is.

I will send v3 with the below info.
 
> > Here if hv_irq_unmask does not call pci_msi_unmask_irq(), the
> > desc->masked remains "true", so later after hibernation, the MSI
> > interrupt line always remains masked, which is incorrect.
> >
> > Here the silent failure of hv_irq_unmask() does not matter since all the
> > non-boot CPUs are being offlined (meaning all the devices have been
> > frozen). Note: the correct affinity info is still updated into the
> > irqdata data structure in migrate_one_irq() -> irq_do_set_affinity() ->
> > hv_set_affinity(), so when the VM resumes, hv_pci_resume() ->
> > hv_pci_restore_msi_state() is able to correctly restore the irqs with
> > the correct affinity.
> >
> > I hope the explanation can help clarify things. I understand this is
> > not as natural as the case where Linux runs on a physical machine, but
> > due to the unique PCI pass-through implementation of Hyper-V, IMO this
> > is the only viable fix for the problem here. BTW, this patch is only
> > confined to the pci-hyperv driver and I believe it cannot cause any
> > regression.
> 
> Understood, write this in the commit log and I won't nag you any further.

Ok. I treat it as an opportunity to practise and improve my writing :-)
 
> Side note: this issue is there because the hypcall failure triggers
> an early return from hv_irq_unmask(). 

Yes.

> Is that early return really correct ? 

Good question. IMO it's incorrect, because hv_irq_unmask() is called 
when the interrupt is activated for the first time, and when the interrupt's
affinity is being changed. In these cases, we may as well call
pci_msi_unmask_irq() unconditionally, even if the hypercall fails.

BTW, AFAIK, in practice the hypercall only fails in 2 cases:
1. The device is removed when Linux VM has not finished the device's
initialization.
2. In hibernation, the device has been disabled while the generic
hibernation code tries to migrate the interrupt, as I explained.

In the 2 cases, the hypercall returns the same error code
HV_STATUS_INVALID_PARAMETER(0x5).

> Another possibility is just logging the error and let
> hv_irq_unmask() continue and call pci_msi_unmask_irq() in the exit
> path.

This is a good idea. I'll make this change in v3.
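
Roughly, what I have in mind for v3 is something like this (only a sketch of
the idea, not the final patch):

static void hv_irq_unmask(struct irq_data *data)
{
	...
	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
	...
	res = hv_do_hypercall(...);	/* HVCALL_RETARGET_INTERRUPT */
	/*
	 * During hibernation the VMBus channel is closed, so the hypercall
	 * is expected to fail; don't treat that case as an error.
	 */
	if (res && hbus->state != hv_pcibus_removing)
		dev_err(&hbus->hdev->device,
			"%s() failed: %#llx", __func__, res);
	spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);

	/* Always keep the generic MSI bookkeeping consistent. */
	pci_msi_unmask_irq(data);
}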
 
> Is there a hypcall return value that you can use to detect fatal vs
> non-fatal (ie hibernation) hypcall failures ?

Unfortunately, IMO there is not. The spec (v6.0b)'s section 10.5.4 (page 106)
https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs
does define some return values, but IMO they're not applicable here.

> I was confused by reading the patch since it seemed that you call
> pci_msi_unmask_irq() _only_ while hibernating, which was certainly
> a bug.
> 
> Thank you for explaining.
> 
> Lorenzo

Thanks for reviewing! I'll post v3. Looking forward to your new comments!

Thanks,
-- Dexuan

Patch

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index fc4c3a15e570..dd21afb5d62b 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1211,6 +1211,21 @@  static void hv_irq_unmask(struct irq_data *data)
 	pbus = pdev->bus;
 	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
 
+	if (hbus->state == hv_pcibus_removing) {
+		/*
+		 * During hibernation, when a CPU is offlined, the kernel tries
+		 * to move the interrupt to the remaining CPUs that haven't
+		 * been offlined yet. In this case, the below hv_do_hypercall()
+		 * always fails since the vmbus channel has been closed, so we
+		 * should not call the hypercall, but we still need
+		 * pci_msi_unmask_irq() to reset the mask bit in desc->masked:
+		 * see cpu_disable_common() -> fixup_irqs() ->
+		 * irq_migrate_all_off_this_cpu() -> migrate_one_irq().
+		 */
+		pci_msi_unmask_irq(data);
+		return;
+	}
+
 	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
 
 	params = &hbus->retarget_msi_interrupt_params;
@@ -3372,6 +3387,33 @@  static int hv_pci_suspend(struct hv_device *hdev)
 	return 0;
 }
 
+static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
+{
+	struct msi_desc *entry;
+	struct irq_data *irq_data;
+
+	for_each_pci_msi_entry(entry, pdev) {
+		irq_data = irq_get_irq_data(entry->irq);
+		if (WARN_ON_ONCE(!irq_data))
+			return -EINVAL;
+
+		hv_compose_msi_msg(irq_data, &entry->msg);
+	}
+
+	return 0;
+}
+
+/*
+ * Upon resume, pci_restore_msi_state() -> ... ->  __pci_write_msi_msg()
+ * re-writes the MSI/MSI-X registers, but since Hyper-V doesn't trap and
+ * emulate the accesses, we have to call hv_compose_msi_msg() to ask
+ * Hyper-V to re-create the IOMMU Interrupt Remapping Table Entries.
+ */
+static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
+{
+	pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
+}
+
 static int hv_pci_resume(struct hv_device *hdev)
 {
 	struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
@@ -3405,6 +3447,8 @@  static int hv_pci_resume(struct hv_device *hdev)
 
 	prepopulate_bars(hbus);
 
+	hv_pci_restore_msi_state(hbus);
+
 	hbus->state = hv_pcibus_installed;
 	return 0;
 out: