Message ID | 20190731201927.22054-1-lyude@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | Revert "PCI: Enable NVIDIA HDA controllers" | expand |
Also, I realized after sending this that I should clarify something so there isn't any confusion. A bunch of people on the bug that was mentioned in b516ea586d71 ("PCI: Enable NVIDIA HDA controllers") said that this worked perfectly for their P50 laptops. While I don't doubt that at all, it should be noted that the P50 quirk there is only present on a _very specific_ subset of P50 SKUs, so it's quite likely that the people in that bug report just didn't have a P50 that hits this issue. The relevant model numbers of the P50 with the flakey bioses that require this quirk should be mentioned here: https://bugzilla.kernel.org/show_bug.cgi?id=203003 On Wed, 2019-07-31 at 16:19 -0400, Lyude Paul wrote: > This reverts commit b516ea586d717472178e6ef1c152e85608b0ce32. > > While this fixes audio for a number of users, this commit has the > sideaffect of breaking the BIOS workaround that's required to make the > GPU on the nvidia P50 work, by causing the GPU's PCI device function to > stop working after it's been set to multifunction mode. > > Signed-off-by: Lyude Paul <lyude@redhat.com> > Fixes: b516ea586d71 ("PCI: Enable NVIDIA HDA controllers") > Cc: Lukas Wunner <lukas@wunner.de> > Cc: Daniel Drake <drake@endlessm.com> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: Aaron Plattner <aplattner@nvidia.com> > Cc: Peter Wu <peter@lekensteyn.nl> > Cc: Ilia Mirkin <imirkin@alum.mit.edu> > Cc: Karol Herbst <kherbst@redhat.com> > Cc: Maik Freudenberg <hhfeuer@gmx.de> > Cc: linux-pci@vger.kernel.org > --- > > I'm not really holding my breath on this patch to being accepted: > there's a good chance there's a better solution for this (and I'm going > to continue investigating for one after sending this patch), this is > more just to start a conversation on what the proper way to fix this is. > > So, I'm kind of confused about why exactly this was implemented as an > early boot quirk in the first place. If we're seeing the GPU's PCI > device, we already know the GPU is there. Shouldn't we be able to check > for the existence of the HDA device once we probe the GPU in nouveau? > This would make a lot more sense and be a lot less troublesome. I can > see that in the discussion on > > https://bugs.freedesktop.org/show_bug.cgi?id=75985 > > That people mentioned that unloading nouveau then trying to reprobe for > the audio device didn't work, but that still doesn't explain why this > was implemented as an early quirk and not as something we just do before > nouveau is setup. Can we maybe move this somewhere a little more > sensible? > > drivers/pci/quirks.c | 30 ------------------------------ > include/linux/pci_ids.h | 1 - > 2 files changed, 31 deletions(-) > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index 208aacf39329..c66c0ca446c4 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -5011,36 +5011,6 @@ DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, > PCI_ANY_ID, > PCI_CLASS_SERIAL_UNKNOWN, 8, > quirk_gpu_usb_typec_ucsi); > > -/* > - * Enable the NVIDIA GPU integrated HDA controller if the BIOS left it > - * disabled. https://devtalk.nvidia.com/default/topic/1024022 > - */ > -static void quirk_nvidia_hda(struct pci_dev *gpu) > -{ > - u8 hdr_type; > - u32 val; > - > - /* There was no integrated HDA controller before MCP89 */ > - if (gpu->device < PCI_DEVICE_ID_NVIDIA_GEFORCE_320M) > - return; > - > - /* Bit 25 at offset 0x488 enables the HDA controller */ > - pci_read_config_dword(gpu, 0x488, &val); > - if (val & BIT(25)) > - return; > - > - pci_info(gpu, "Enabling HDA controller\n"); > - pci_write_config_dword(gpu, 0x488, val | BIT(25)); > - > - /* The GPU becomes a multi-function device when the HDA is enabled */ > - pci_read_config_byte(gpu, PCI_HEADER_TYPE, &hdr_type); > - gpu->multifunction = !!(hdr_type & 0x80); > -} > -DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, > - PCI_BASE_CLASS_DISPLAY, 16, quirk_nvidia_hda); > -DECLARE_PCI_FIXUP_CLASS_RESUME_EARLY(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, > - PCI_BASE_CLASS_DISPLAY, 16, quirk_nvidia_hda); > - > /* > * Some IDT switches incorrectly flag an ACS Source Validation error on > * completions for config read requests even though PCIe r4.0, sec > diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h > index c842735a4f45..f496fb619287 100644 > --- a/include/linux/pci_ids.h > +++ b/include/linux/pci_ids.h > @@ -1336,7 +1336,6 @@ > #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP78S_SMBUS 0x0752 > #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP77_IDE 0x0759 > #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP73_SMBUS 0x07D8 > -#define PCI_DEVICE_ID_NVIDIA_GEFORCE_320M 0x08A0 > #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP79_SMBUS 0x0AA2 > #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP89_SATA 0x0D85 >
On Wed, Jul 31, 2019 at 04:19:27PM -0400, Lyude Paul wrote: > While this fixes audio for a number of users, this commit has the > sideaffect of breaking the BIOS workaround that's required to make the > GPU on the nvidia P50 work, by causing the GPU's PCI device function to > stop working after it's been set to multifunction mode. This is missing a reference to the commit introducing the P50 quirk, which is e0547c81bfcf ("PCI: Reset Lenovo ThinkPad P50 nvgpu at boot if necessary"). Please describe in more detail how the GPU's PCI function stops working. Does it respond with "all ones" when accessing MMIO? Do MMIO accesses cause the system to hang? Could you provide lspci -vvxx output for the GPU and its associated HDA controller with and without b516ea586d71? Does this machine have external display connectors via which audio can be streamed? > I'm not really holding my breath on this patch to being accepted: > there's a good chance there's a better solution for this (and I'm going > to continue investigating for one after sending this patch), this is > more just to start a conversation on what the proper way to fix this is. Posting as an RFC might have been more appropriate then. > So, I'm kind of confused about why exactly this was implemented as an > early boot quirk in the first place. If we're seeing the GPU's PCI > device, we already know the GPU is there. Shouldn't we be able to check > for the existence of the HDA device once we probe the GPU in nouveau? I think a motivation to keep this generic was to make it work with other drivers besides nouveau, specifically Nvidia's proprietary driver. nouveau might not even be enabled. > that still doesn't explain why this was implemented as an early quirk This isn't an early quirk. Those live in arch/x86/kernel/early-quirks.c. This is just a PCI quirk executed on device enumeration and on resume. Devices aren't necessarily enumerated only on boot, e.g. think Thunderbolt. Thanks, Lukas
On Wed, Jul 31, 2019 at 11:18 PM Lukas Wunner <lukas@wunner.de> wrote: > > On Wed, Jul 31, 2019 at 04:19:27PM -0400, Lyude Paul wrote: > > While this fixes audio for a number of users, this commit has the > > sideaffect of breaking the BIOS workaround that's required to make the > > GPU on the nvidia P50 work, by causing the GPU's PCI device function to > > stop working after it's been set to multifunction mode. > > This is missing a reference to the commit introducing the P50 quirk, > which is e0547c81bfcf ("PCI: Reset Lenovo ThinkPad P50 nvgpu at boot > if necessary"). > > Please describe in more detail how the GPU's PCI function stops working. > Does it respond with "all ones" when accessing MMIO? > Do MMIO accesses cause the system to hang? > > Could you provide lspci -vvxx output for the GPU and its associated > HDA controller with and without b516ea586d71? > > Does this machine have external display connectors via which audio > can be streamed? > > > > I'm not really holding my breath on this patch to being accepted: > > there's a good chance there's a better solution for this (and I'm going > > to continue investigating for one after sending this patch), this is > > more just to start a conversation on what the proper way to fix this is. > > Posting as an RFC might have been more appropriate then. > no, a revert is actually appropriate. If a commit fixes something, but breaks something else, it gets either reverted or fixed. If nobody fixes it, then revert it is. > > > So, I'm kind of confused about why exactly this was implemented as an > > early boot quirk in the first place. If we're seeing the GPU's PCI > > device, we already know the GPU is there. Shouldn't we be able to check > > for the existence of the HDA device once we probe the GPU in nouveau? > > I think a motivation to keep this generic was to make it work with > other drivers besides nouveau, specifically Nvidia's proprietary driver. > nouveau might not even be enabled. > > > > that still doesn't explain why this was implemented as an early quirk > > This isn't an early quirk. Those live in arch/x86/kernel/early-quirks.c. > This is just a PCI quirk executed on device enumeration and on resume. > Devices aren't necessarily enumerated only on boot, e.g. think Thunderbolt. > > Thanks, > > Lukas
On Wed, 2019-07-31 at 23:26 +0200, Karol Herbst wrote: > On Wed, Jul 31, 2019 at 11:18 PM Lukas Wunner <lukas@wunner.de> wrote: > > On Wed, Jul 31, 2019 at 04:19:27PM -0400, Lyude Paul wrote: > > > While this fixes audio for a number of users, this commit has the > > > sideaffect of breaking the BIOS workaround that's required to make the > > > GPU on the nvidia P50 work, by causing the GPU's PCI device function to > > > stop working after it's been set to multifunction mode. > > > > This is missing a reference to the commit introducing the P50 quirk, > > which is e0547c81bfcf ("PCI: Reset Lenovo ThinkPad P50 nvgpu at boot > > if necessary"). > > > > Please describe in more detail how the GPU's PCI function stops working. > > Does it respond with "all ones" when accessing MMIO? > > Do MMIO accesses cause the system to hang? > > > > Could you provide lspci -vvxx output for the GPU and its associated > > HDA controller with and without b516ea586d71? > > > > Does this machine have external display connectors via which audio > > can be streamed? > > > > > > > I'm not really holding my breath on this patch to being accepted: > > > there's a good chance there's a better solution for this (and I'm going > > > to continue investigating for one after sending this patch), this is > > > more just to start a conversation on what the proper way to fix this is. > > > > Posting as an RFC might have been more appropriate then. > > > > no, a revert is actually appropriate. If a commit fixes something, > but breaks something else, it gets either reverted or fixed. If nobody > fixes it, then revert it is. To answer Lukas's question btw: most of the details on how things break are back in the original commit (sorry for forgetting the reference!), there's a _lot_ of explanation there that I'd rather not retype, so just refer back to the commit and bug @ https://bugs.freedesktop.org/show_bug.cgi?id=75985 Additionally, there was some extra discussion providing some more detail in the email thread that I had with Bjorn: https://lkml.org/lkml/2019/2/12/1172 As for how this commit breaks the workaround: it seems that when we enable the HDA controller and put the GPU into multifunction mode, the function-level reset stops working and thus we can't reset the GPU anymore. Currently I can see a couple of solutions (again, please feel free to suggest more!): * Just revert the commit. We should do this if necessary, but of course I'd much rather try finding a fix first * Disable the HDA controller temporarily when a GPU reset is neded in quirk_reset_lenovo_thinkpad_p50_nvgpu(), then call the function level reset, then re-enable the HDA controller. I have no idea if this actually works yet, but I'm about to try this on my system * Get quirk_reset_lenovo_thinkpad_p50_nvgpu() to run before quirk_nvidia_hda(). This would probably be fine, but we would need to rework some stuff in the PCI subsystem (maybe it already has a way to do this? haven't checked yet) so that we could perform an flr probe early enough to perform the quirk > > > > So, I'm kind of confused about why exactly this was implemented as an > > > early boot quirk in the first place. If we're seeing the GPU's PCI > > > device, we already know the GPU is there. Shouldn't we be able to check > > > for the existence of the HDA device once we probe the GPU in nouveau? > > > > I think a motivation to keep this generic was to make it work with > > other drivers besides nouveau, specifically Nvidia's proprietary driver. > > nouveau might not even be enabled. > > > > > > > that still doesn't explain why this was implemented as an early quirk > > > > This isn't an early quirk. Those live in arch/x86/kernel/early-quirks.c. > > This is just a PCI quirk executed on device enumeration and on resume. > > Devices aren't necessarily enumerated only on boot, e.g. think > > Thunderbolt. > > > > Thanks, > > > > Lukas
On Wed, 2019-07-31 at 17:35 -0400, Lyude Paul wrote: > On Wed, 2019-07-31 at 23:26 +0200, Karol Herbst wrote: > > On Wed, Jul 31, 2019 at 11:18 PM Lukas Wunner <lukas@wunner.de> wrote: > > > On Wed, Jul 31, 2019 at 04:19:27PM -0400, Lyude Paul wrote: > > > > While this fixes audio for a number of users, this commit has the > > > > sideaffect of breaking the BIOS workaround that's required to make the > > > > GPU on the nvidia P50 work, by causing the GPU's PCI device function > > > > to > > > > stop working after it's been set to multifunction mode. > > > > > > This is missing a reference to the commit introducing the P50 quirk, > > > which is e0547c81bfcf ("PCI: Reset Lenovo ThinkPad P50 nvgpu at boot > > > if necessary"). > > > > > > Please describe in more detail how the GPU's PCI function stops working. > > > Does it respond with "all ones" when accessing MMIO? > > > Do MMIO accesses cause the system to hang? > > > > > > Could you provide lspci -vvxx output for the GPU and its associated > > > HDA controller with and without b516ea586d71? > > > > > > Does this machine have external display connectors via which audio > > > can be streamed? > > > > > > > > > > I'm not really holding my breath on this patch to being accepted: > > > > there's a good chance there's a better solution for this (and I'm > > > > going > > > > to continue investigating for one after sending this patch), this is > > > > more just to start a conversation on what the proper way to fix this > > > > is. > > > > > > Posting as an RFC might have been more appropriate then. > > > > > > > no, a revert is actually appropriate. If a commit fixes something, > > but breaks something else, it gets either reverted or fixed. If nobody > > fixes it, then revert it is. > > To answer Lukas's question btw: most of the details on how things break are > back in the original commit (sorry for forgetting the reference!), there's a > _lot_ of explanation there that I'd rather not retype, so just refer back to > the commit and bug @ https://bugs.freedesktop.org/show_bug.cgi?id=75985 > > Additionally, there was some extra discussion providing some more detail in > the email thread that I had with Bjorn: > > https://lkml.org/lkml/2019/2/12/1172 > > As for how this commit breaks the workaround: it seems that when we enable > the > HDA controller and put the GPU into multifunction mode, the function-level > reset stops working and thus we can't reset the GPU anymore. Currently I can > see a couple of solutions (again, please feel free to suggest more!): > > * Just revert the commit. We should do this if necessary, but of course I'd > much rather try finding a fix first > * Disable the HDA controller temporarily when a GPU reset is neded in > quirk_reset_lenovo_thinkpad_p50_nvgpu(), then call the function level > reset, then re-enable the HDA controller. I have no idea if this actually > works yet, but I'm about to try this on my system > * Get quirk_reset_lenovo_thinkpad_p50_nvgpu() to run before > quirk_nvidia_hda(). This would probably be fine, but we would need to > rework some stuff in the PCI subsystem (maybe it already has a way to do > this? haven't checked yet) so that we could perform an flr probe early > enough to perform the quirk Good news! After some investigation looks like that function level reset actually does work, just that after we put it in multifunction mode pci_parent_bus_reset() sees multiple devices on the bus and returns -ENOTTY as a result. So I should definitely be able to come up with a fix for this other then reverting this :). Will send out patches soon > > > > So, I'm kind of confused about why exactly this was implemented as an > > > > early boot quirk in the first place. If we're seeing the GPU's PCI > > > > device, we already know the GPU is there. Shouldn't we be able to > > > > check > > > > for the existence of the HDA device once we probe the GPU in nouveau? > > > > > > I think a motivation to keep this generic was to make it work with > > > other drivers besides nouveau, specifically Nvidia's proprietary driver. > > > nouveau might not even be enabled. > > > > > > > > > > that still doesn't explain why this was implemented as an early quirk > > > > > > This isn't an early quirk. Those live in arch/x86/kernel/early- > > > quirks.c. > > > This is just a PCI quirk executed on device enumeration and on resume. > > > Devices aren't necessarily enumerated only on boot, e.g. think > > > Thunderbolt. > > > > > > Thanks, > > > > > > Lukas
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 208aacf39329..c66c0ca446c4 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -5011,36 +5011,6 @@ DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_CLASS_SERIAL_UNKNOWN, 8, quirk_gpu_usb_typec_ucsi); -/* - * Enable the NVIDIA GPU integrated HDA controller if the BIOS left it - * disabled. https://devtalk.nvidia.com/default/topic/1024022 - */ -static void quirk_nvidia_hda(struct pci_dev *gpu) -{ - u8 hdr_type; - u32 val; - - /* There was no integrated HDA controller before MCP89 */ - if (gpu->device < PCI_DEVICE_ID_NVIDIA_GEFORCE_320M) - return; - - /* Bit 25 at offset 0x488 enables the HDA controller */ - pci_read_config_dword(gpu, 0x488, &val); - if (val & BIT(25)) - return; - - pci_info(gpu, "Enabling HDA controller\n"); - pci_write_config_dword(gpu, 0x488, val | BIT(25)); - - /* The GPU becomes a multi-function device when the HDA is enabled */ - pci_read_config_byte(gpu, PCI_HEADER_TYPE, &hdr_type); - gpu->multifunction = !!(hdr_type & 0x80); -} -DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, - PCI_BASE_CLASS_DISPLAY, 16, quirk_nvidia_hda); -DECLARE_PCI_FIXUP_CLASS_RESUME_EARLY(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, - PCI_BASE_CLASS_DISPLAY, 16, quirk_nvidia_hda); - /* * Some IDT switches incorrectly flag an ACS Source Validation error on * completions for config read requests even though PCIe r4.0, sec diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h index c842735a4f45..f496fb619287 100644 --- a/include/linux/pci_ids.h +++ b/include/linux/pci_ids.h @@ -1336,7 +1336,6 @@ #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP78S_SMBUS 0x0752 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP77_IDE 0x0759 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP73_SMBUS 0x07D8 -#define PCI_DEVICE_ID_NVIDIA_GEFORCE_320M 0x08A0 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP79_SMBUS 0x0AA2 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP89_SATA 0x0D85
This reverts commit b516ea586d717472178e6ef1c152e85608b0ce32. While this fixes audio for a number of users, this commit has the sideaffect of breaking the BIOS workaround that's required to make the GPU on the nvidia P50 work, by causing the GPU's PCI device function to stop working after it's been set to multifunction mode. Signed-off-by: Lyude Paul <lyude@redhat.com> Fixes: b516ea586d71 ("PCI: Enable NVIDIA HDA controllers") Cc: Lukas Wunner <lukas@wunner.de> Cc: Daniel Drake <drake@endlessm.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Aaron Plattner <aplattner@nvidia.com> Cc: Peter Wu <peter@lekensteyn.nl> Cc: Ilia Mirkin <imirkin@alum.mit.edu> Cc: Karol Herbst <kherbst@redhat.com> Cc: Maik Freudenberg <hhfeuer@gmx.de> Cc: linux-pci@vger.kernel.org --- I'm not really holding my breath on this patch to being accepted: there's a good chance there's a better solution for this (and I'm going to continue investigating for one after sending this patch), this is more just to start a conversation on what the proper way to fix this is. So, I'm kind of confused about why exactly this was implemented as an early boot quirk in the first place. If we're seeing the GPU's PCI device, we already know the GPU is there. Shouldn't we be able to check for the existence of the HDA device once we probe the GPU in nouveau? This would make a lot more sense and be a lot less troublesome. I can see that in the discussion on https://bugs.freedesktop.org/show_bug.cgi?id=75985 That people mentioned that unloading nouveau then trying to reprobe for the audio device didn't work, but that still doesn't explain why this was implemented as an early quirk and not as something we just do before nouveau is setup. Can we maybe move this somewhere a little more sensible? drivers/pci/quirks.c | 30 ------------------------------ include/linux/pci_ids.h | 1 - 2 files changed, 31 deletions(-)