Message ID | 20230413194042.605768-1-alex.williamson@redhat.com |
---|---|
State | Accepted |
Commit | a5a6dd2624698b6e3045c3a1450874d8c790d5d9 |
Series | PCI: Extend D3hot delay for NVIDIA HDA controllers |
[+cc Mika, Sathy, Lukas since they've been looking at similar delays]

On Thu, Apr 13, 2023 at 01:40:42PM -0600, Alex Williamson wrote:
> Assignment of NVIDIA Ampere-based GPUs have seen a regression since the
> below referenced commit, where the reduced D3hot transition delay appears
> to introduce a small window where a D3hot->D0 transition followed by a bus
> reset can wedge the device.  The entire device is subsequently unavailable,
> returning -1 on config space read and is unrecoverable without a host reset.
>
> This has been observed with RTX A2000 and A5000 GPU and audio functions
> assigned to a Windows VM, where shutdown of the VM places the devices in
> D3hot prior to vfio-pci performing a bus reset when userspace releases the
> devices.  The issue has roughly a 2-3% chance of occurring per shutdown.
>
> Restoring the HDA controller d3hot_delay to the effective value before the
> below commit has been shown to resolve the issue.  NVIDIA confirms this
> change should be safe for all of their HDA controllers.
>
> Cc: Abhishek Sahu <abhsahu@nvidia.com>
> Cc: Tarun Gupta <targupta@nvidia.com>
> Fixes: 3e347969a577 ("PCI/PM: Reduce D3hot delay with usleep_range()")
> Reported-by: Zhiyi Guo <zhguo@redhat.com>
> Reviewed-by: Tarun Gupta <targupta@nvidia.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Applied to pci/reset for v6.4, thanks, Alex!

I guess there's no real risk here since we're waiting *longer*.  It
only makes NVIDIA GPU resets take longer.

Mika has some patches in flight that increase delays generically in
some cases, but I think that applies to D3cold -> D0 transitions,
which I don't *think* you're doing here.

> ---
>
> Unfortunately Tarun's reply with confirmation doesn't show up on lore,
> possibly due to html email, or else I'd provide that as a Link:.
>
>  drivers/pci/quirks.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44cab813bf95..f4e2a88729fd 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -1939,6 +1939,19 @@ static void quirk_radeon_pm(struct pci_dev *dev)
>  }
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6741, quirk_radeon_pm);
>
> +/*
> + * NVIDIA Ampere-based HDA controllers can wedge the whole device if a bus
> + * reset is performed too soon after transition to D0, extend d3hot_delay
> + * to previous effective default for all NVIDIA HDA controllers.
> + */
> +static void quirk_nvidia_hda_pm(struct pci_dev *dev)
> +{
> +	quirk_d3hot_delay(dev, 20);
> +}
> +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> +			      PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8,
> +			      quirk_nvidia_hda_pm);
> +
>  /*
>   * Ryzen5/7 XHCI controllers fail upon resume from runtime suspend or s2idle.
>   * https://bugzilla.kernel.org/show_bug.cgi?id=205587
> --
> 2.39.2
>
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 44cab813bf95..f4e2a88729fd 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -1939,6 +1939,19 @@ static void quirk_radeon_pm(struct pci_dev *dev)
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6741, quirk_radeon_pm);
 
+/*
+ * NVIDIA Ampere-based HDA controllers can wedge the whole device if a bus
+ * reset is performed too soon after transition to D0, extend d3hot_delay
+ * to previous effective default for all NVIDIA HDA controllers.
+ */
+static void quirk_nvidia_hda_pm(struct pci_dev *dev)
+{
+	quirk_d3hot_delay(dev, 20);
+}
+DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
+			      PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8,
+			      quirk_nvidia_hda_pm);
+
 /*
  * Ryzen5/7 XHCI controllers fail upon resume from runtime suspend or s2idle.
  * https://bugzilla.kernel.org/show_bug.cgi?id=205587
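
For anyone reading along who wants to see which functions the quirk touches: below is a rough user-space sketch of the matching and delay logic, not the kernel code itself. The `model_*` names and `struct model_pci_dev` are made up for illustration. The sketch assumes the class fixup compares the device's 24-bit class code, shifted right by the given class shift, against the class ID, and that quirk_d3hot_delay() only ever raises the delay. The 10 msec starting value in the usage note is assumed to be the kernel default.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in the kernel's PCI ID headers. */
#define PCI_VENDOR_ID_NVIDIA          0x10de
#define PCI_CLASS_MULTIMEDIA_HD_AUDIO 0x0403
#define PCI_ANY_ID                    (~0u)

/* Hypothetical stand-in for the few struct pci_dev fields used here. */
struct model_pci_dev {
	uint16_t vendor;
	uint32_t class_code;      /* 24 bits: base class, subclass, prog-if */
	unsigned int d3hot_delay; /* msec to wait after leaving D3hot */
};

/*
 * Assumed matching rule: vendor must match (or be a wildcard), and the
 * class code shifted right by class_shift must equal class_id.  A shift
 * of 8 drops the prog-if byte, so any HD audio function matches.
 */
bool model_fixup_matches(const struct model_pci_dev *dev, uint32_t vendor,
			 uint32_t class_id, unsigned int class_shift)
{
	bool vendor_ok = vendor == PCI_ANY_ID || vendor == dev->vendor;
	bool class_ok = class_id == PCI_ANY_ID ||
			class_id == (dev->class_code >> class_shift);

	return vendor_ok && class_ok;
}

/* Assumed behavior: only extend the delay, never shorten it. */
void model_quirk_d3hot_delay(struct model_pci_dev *dev, unsigned int delay)
{
	if (dev->d3hot_delay >= delay)
		return;

	dev->d3hot_delay = delay;
}

/* Mirrors the patch: every NVIDIA HD audio function gets 20 msec. */
void model_apply_nvidia_hda_quirk(struct model_pci_dev *dev)
{
	if (model_fixup_matches(dev, PCI_VENDOR_ID_NVIDIA,
				PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8))
		model_quirk_d3hot_delay(dev, 20);
}
```

In this model an NVIDIA function with class code 0x040300 starting at 10 msec ends up at 20 msec. The GPU function itself (class 0x030000) is left alone, as is any device that already has a longer delay.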