Message ID | m2fs9lgndw.fsf@gmail.com (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Netdev Maintainers |
Series | [BUG] net, pci: 6.3-rc1-4 hangs during boot on PowerEdge R620 with igb |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
Thanks a lot for the report and for all the work you did to bisect and identify the commit.

On Fri, Mar 31, 2023 at 12:40:11PM +0100, Donald Hunter wrote:
> The 6.3-rc1 and later release candidates are hanging during boot on our
> Dell PowerEdge R620 servers with Intel I350 nics (igb).
>
> After bisecting from v6.2 to v6.3-rc1, I isolated the problem to:
>
> [6fffbc7ae1373e10b989afe23a9eeb9c49fe15c3] PCI: Honor firmware's device
> disabled status
>
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 1779582fb500..b1d80c1d7a69 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1841,6 +1841,8 @@ int pci_setup_device(struct pci_dev *dev)
>
>         pci_set_of_node(dev);
>         pci_set_acpi_fwnode(dev);
> +       if (dev->dev.fwnode && !fwnode_device_is_available(dev->dev.fwnode))
> +               return -ENODEV;
>
>         pci_dev_assign_slot(dev);

I assume this igb NIC (07:00.0) must be built-in (not a plug-in card) because it apparently has an ACPI firmware node, and there's something we don't expect about its status?

Hopefully Rob will look at this. If I were looking, I would be interested in acpidump to see what's in the DSDT.

Bjorn
On Fri, 31 Mar 2023 at 20:42, Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> I assume this igb NIC (07:00.0) must be built-in (not a plug-in card)
> because it apparently has an ACPI firmware node, and there's something
> we don't expect about its status?

Yes they are built-in, to my knowledge.

> Hopefully Rob will look at this. If I were looking, I would be
> interested in acpidump to see what's in the DSDT.

I can get an acpidump. Is there a preferred way to share the files, or just an email attachment?

> Bjorn
[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel regressions; the text you find below is based on a few templates paragraphs you might have encountered already in similar form. See link in footer if these mails annoy you.]

On 31.03.23 13:40, Donald Hunter wrote:
> The 6.3-rc1 and later release candidates are hanging during boot on our
> Dell PowerEdge R620 servers with Intel I350 nics (igb).
>
> After bisecting from v6.2 to v6.3-rc1, I isolated the problem to:
>
> [6fffbc7ae1373e10b989afe23a9eeb9c49fe15c3] PCI: Honor firmware's device
> disabled status
> [...]

Thanks for the report. To be sure the issue doesn't fall through the cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression tracking bot:

#regzbot ^introduced 6fffbc7ae1373e10b989afe23a9eeb9c49fe15c3
#regzbot title pci: / net: igb: hangs during boot on PowerEdge R620
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already discussed somewhere else? It was fixed already? You want to clarify when the regression started to happen? Or point out I got the title or something else totally wrong? Then just reply and tell me -- ideally while also telling regzbot about it, as explained by the page listed in the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing to the report (the parent of this mail). See page linked in footer for details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
On Sat, Apr 01, 2023 at 01:52:25PM +0100, Donald Hunter wrote:
> On Fri, 31 Mar 2023 at 20:42, Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > I assume this igb NIC (07:00.0) must be built-in (not a plug-in card)
> > because it apparently has an ACPI firmware node, and there's something
> > we don't expect about its status?
>
> Yes they are built-in, to my knowledge.
>
> > Hopefully Rob will look at this. If I were looking, I would be
> > interested in acpidump to see what's in the DSDT.
>
> I can get an acpidump. Is there a preferred way to share the files, or just
> an email attachment?

I think by default acpidump produces ASCII that can be directly included in email. http://vger.kernel.org/majordomo-info.html says 100K is the limit for vger mailing lists. Or you could open a report at https://bugzilla.kernel.org and attach it there, maybe along with a complete dmesg log and "sudo lspci -vv" output.

Bjorn
On Sun, 2 Apr 2023 at 23:55, Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Sat, Apr 01, 2023 at 01:52:25PM +0100, Donald Hunter wrote:
> > On Fri, 31 Mar 2023 at 20:42, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > I assume this igb NIC (07:00.0) must be built-in (not a plug-in card)
> > > because it apparently has an ACPI firmware node, and there's something
> > > we don't expect about its status?
> >
> > Yes they are built-in, to my knowledge.
> >
> > > Hopefully Rob will look at this. If I were looking, I would be
> > > interested in acpidump to see what's in the DSDT.
> >
> > I can get an acpidump. Is there a preferred way to share the files, or just
> > an email attachment?
>
> I think by default acpidump produces ASCII that can be directly
> included in email. http://vger.kernel.org/majordomo-info.html says
> 100K is the limit for vger mailing lists. Or you could open a report
> at https://bugzilla.kernel.org and attach it there, maybe along with a
> complete dmesg log and "sudo lspci -vv" output.

Apologies for the delay, I was unable to access the machine while travelling.

https://bugzilla.kernel.org/show_bug.cgi?id=217317
On Mon, Apr 10, 2023 at 04:10:54PM +0100, Donald Hunter wrote:
> On Sun, 2 Apr 2023 at 23:55, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Sat, Apr 01, 2023 at 01:52:25PM +0100, Donald Hunter wrote:
> > > On Fri, 31 Mar 2023 at 20:42, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > >
> > > > I assume this igb NIC (07:00.0) must be built-in (not a plug-in card)
> > > > because it apparently has an ACPI firmware node, and there's something
> > > > we don't expect about its status?
> > >
> > > Yes they are built-in, to my knowledge.
> > >
> > > > Hopefully Rob will look at this. If I were looking, I would be
> > > > interested in acpidump to see what's in the DSDT.
> > >
> > > I can get an acpidump. Is there a preferred way to share the files, or just
> > > an email attachment?
> >
> > I think by default acpidump produces ASCII that can be directly
> > included in email. http://vger.kernel.org/majordomo-info.html says
> > 100K is the limit for vger mailing lists. Or you could open a report
> > at https://bugzilla.kernel.org and attach it there, maybe along with a
> > complete dmesg log and "sudo lspci -vv" output.
>
> Apologies for the delay, I was unable to access the machine while travelling.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=217317

Thanks for that! Can you boot a kernel with 6fffbc7ae137 reverted with this in the kernel parameters:

  dyndbg="file drivers/acpi/* +p"

and collect the entire dmesg log?
Bjorn Helgaas <helgaas@kernel.org> writes:

> On Mon, Apr 10, 2023 at 04:10:54PM +0100, Donald Hunter wrote:
>> On Sun, 2 Apr 2023 at 23:55, Bjorn Helgaas <helgaas@kernel.org> wrote:
>> > On Sat, Apr 01, 2023 at 01:52:25PM +0100, Donald Hunter wrote:
>> > > On Fri, 31 Mar 2023 at 20:42, Bjorn Helgaas <helgaas@kernel.org> wrote:
>> > > >
>> > > > I assume this igb NIC (07:00.0) must be built-in (not a plug-in card)
>> > > > because it apparently has an ACPI firmware node, and there's something
>> > > > we don't expect about its status?
>> > >
>> > > Yes they are built-in, to my knowledge.
>> > >
>> > > > Hopefully Rob will look at this. If I were looking, I would be
>> > > > interested in acpidump to see what's in the DSDT.
>> > >
>> > > I can get an acpidump. Is there a preferred way to share the files, or just
>> > > an email attachment?
>> >
>> > I think by default acpidump produces ASCII that can be directly
>> > included in email. http://vger.kernel.org/majordomo-info.html says
>> > 100K is the limit for vger mailing lists. Or you could open a report
>> > at https://bugzilla.kernel.org and attach it there, maybe along with a
>> > complete dmesg log and "sudo lspci -vv" output.
>>
>> Apologies for the delay, I was unable to access the machine while travelling.
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=217317
>
> Thanks for that! Can you boot a kernel with 6fffbc7ae137 reverted
> with this in the kernel parameters:
>
> dyndbg="file drivers/acpi/* +p"
>
> and collect the entire dmesg log?

Added to the bugzilla report.

Thanks!
+Rafael, Andy

On Tue, Apr 11, 2023 at 7:53 AM Donald Hunter <donald.hunter@gmail.com> wrote:
>
> Bjorn Helgaas <helgaas@kernel.org> writes:
>
> > On Mon, Apr 10, 2023 at 04:10:54PM +0100, Donald Hunter wrote:
> >> On Sun, 2 Apr 2023 at 23:55, Bjorn Helgaas <helgaas@kernel.org> wrote:
> >> > On Sat, Apr 01, 2023 at 01:52:25PM +0100, Donald Hunter wrote:
> >> > > On Fri, 31 Mar 2023 at 20:42, Bjorn Helgaas <helgaas@kernel.org> wrote:
> >> > > >
> >> > > > I assume this igb NIC (07:00.0) must be built-in (not a plug-in card)
> >> > > > because it apparently has an ACPI firmware node, and there's something
> >> > > > we don't expect about its status?
> >> > >
> >> > > Yes they are built-in, to my knowledge.
> >> > >
> >> > > > Hopefully Rob will look at this. If I were looking, I would be
> >> > > > interested in acpidump to see what's in the DSDT.
> >> > >
> >> > > I can get an acpidump. Is there a preferred way to share the files, or just
> >> > > an email attachment?
> >> >
> >> > I think by default acpidump produces ASCII that can be directly
> >> > included in email. http://vger.kernel.org/majordomo-info.html says
> >> > 100K is the limit for vger mailing lists. Or you could open a report
> >> > at https://bugzilla.kernel.org and attach it there, maybe along with a
> >> > complete dmesg log and "sudo lspci -vv" output.
> >>
> >> Apologies for the delay, I was unable to access the machine while travelling.
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=217317
> >
> > Thanks for that! Can you boot a kernel with 6fffbc7ae137 reverted
> > with this in the kernel parameters:
> >
> > dyndbg="file drivers/acpi/* +p"
> >
> > and collect the entire dmesg log?
>
> Added to the bugzilla report.

Rafael, Andy,

Any ideas why fwnode_device_is_available() would return false for a built-in PCI device with a ACPI device entry? The only thing I see in the log is it looks like the parent PCI bridge/bus doesn't have ACPI device entry (based on "[ 0.913389] pci_bus 0000:07: No ACPI support"). For DT, if the parent doesn't have a node, then the child can't. Not sure on ACPI.

Rob
On Tue, Apr 11, 2023 at 02:02:03PM -0500, Rob Herring wrote:
> On Tue, Apr 11, 2023 at 7:53 AM Donald Hunter <donald.hunter@gmail.com> wrote:
> > Bjorn Helgaas <helgaas@kernel.org> writes:
> > > On Mon, Apr 10, 2023 at 04:10:54PM +0100, Donald Hunter wrote:
> > >> On Sun, 2 Apr 2023 at 23:55, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >> > On Sat, Apr 01, 2023 at 01:52:25PM +0100, Donald Hunter wrote:
> > >> > > On Fri, 31 Mar 2023 at 20:42, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >> > > >
> > >> > > > I assume this igb NIC (07:00.0) must be built-in (not a plug-in card)
> > >> > > > because it apparently has an ACPI firmware node, and there's something
> > >> > > > we don't expect about its status?
> > >> > >
> > >> > > Yes they are built-in, to my knowledge.
> > >> > >
> > >> > > > Hopefully Rob will look at this. If I were looking, I would be
> > >> > > > interested in acpidump to see what's in the DSDT.
> > >> > >
> > >> > > I can get an acpidump. Is there a preferred way to share the files, or just
> > >> > > an email attachment?
> > >> >
> > >> > I think by default acpidump produces ASCII that can be directly
> > >> > included in email. http://vger.kernel.org/majordomo-info.html says
> > >> > 100K is the limit for vger mailing lists. Or you could open a report
> > >> > at https://bugzilla.kernel.org and attach it there, maybe along with a
> > >> > complete dmesg log and "sudo lspci -vv" output.
> > >>
> > >> Apologies for the delay, I was unable to access the machine while travelling.
> > >>
> > >> https://bugzilla.kernel.org/show_bug.cgi?id=217317
> > >
> > > Thanks for that! Can you boot a kernel with 6fffbc7ae137 reverted
> > > with this in the kernel parameters:
> > >
> > > dyndbg="file drivers/acpi/* +p"
> > >
> > > and collect the entire dmesg log?
> >
> > Added to the bugzilla report.
>
> Rafael, Andy, Any ideas why fwnode_device_is_available() would return
> false for a built-in PCI device with a ACPI device entry? The only
> thing I see in the log is it looks like the parent PCI bridge/bus
> doesn't have ACPI device entry (based on "[ 0.913389] pci_bus
> 0000:07: No ACPI support"). For DT, if the parent doesn't have a node,
> then the child can't. Not sure on ACPI.

Thanks for the Cc'ing. I haven't checked anything yet, but from the above it sounds like a BIOS issue. If PCI has no ACPI companion tree, then why the heck one of the devices has the entry? I'm not even sure this is allowed by ACPI specification, but as I said, I just solely used the above mail.
On Wed, Apr 12, 2023 at 04:20:33PM +0300, Andy Shevchenko wrote:
> On Tue, Apr 11, 2023 at 02:02:03PM -0500, Rob Herring wrote:
> > On Tue, Apr 11, 2023 at 7:53 AM Donald Hunter <donald.hunter@gmail.com> wrote:
> > > Bjorn Helgaas <helgaas@kernel.org> writes:
> > > > On Mon, Apr 10, 2023 at 04:10:54PM +0100, Donald Hunter wrote:
> > > >> On Sun, 2 Apr 2023 at 23:55, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > >> > On Sat, Apr 01, 2023 at 01:52:25PM +0100, Donald Hunter wrote:
> > > >> > > On Fri, 31 Mar 2023 at 20:42, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > >> > > >
> > > >> > > > I assume this igb NIC (07:00.0) must be built-in (not a plug-in card)
> > > >> > > > because it apparently has an ACPI firmware node, and there's something
> > > >> > > > we don't expect about its status?
> > > >> > >
> > > >> > > Yes they are built-in, to my knowledge.
> > > >> > >
> > > >> > > > Hopefully Rob will look at this. If I were looking, I would be
> > > >> > > > interested in acpidump to see what's in the DSDT.
> > > >> > >
> > > >> > > I can get an acpidump. Is there a preferred way to share the files, or just
> > > >> > > an email attachment?
> > > >> >
> > > >> > I think by default acpidump produces ASCII that can be directly
> > > >> > included in email. http://vger.kernel.org/majordomo-info.html says
> > > >> > 100K is the limit for vger mailing lists. Or you could open a report
> > > >> > at https://bugzilla.kernel.org and attach it there, maybe along with a
> > > >> > complete dmesg log and "sudo lspci -vv" output.
> > > >>
> > > >> Apologies for the delay, I was unable to access the machine while travelling.
> > > >>
> > > >> https://bugzilla.kernel.org/show_bug.cgi?id=217317
> > > >
> > > > Thanks for that! Can you boot a kernel with 6fffbc7ae137 reverted
> > > > with this in the kernel parameters:
> > > >
> > > > dyndbg="file drivers/acpi/* +p"
> > > >
> > > > and collect the entire dmesg log?
> > >
> > > Added to the bugzilla report.
> >
> > Rafael, Andy, Any ideas why fwnode_device_is_available() would return
> > false for a built-in PCI device with a ACPI device entry? The only
> > thing I see in the log is it looks like the parent PCI bridge/bus
> > doesn't have ACPI device entry (based on "[ 0.913389] pci_bus
> > 0000:07: No ACPI support"). For DT, if the parent doesn't have a node,
> > then the child can't. Not sure on ACPI.
>
> Thanks for the Cc'ing. I haven't checked anything yet, but from the above it
> sounds like a BIOS issue. If PCI has no ACPI companion tree, then why the heck
> one of the devices has the entry? I'm not even sure this is allowed by ACPI
> specification, but as I said, I just solely used the above mail.

ACPI r6.5, sec 6.3.7, about _STA says:

  - Bit [0] - Set if the device is present.
  - Bit [1] - Set if the device is enabled and decoding its resources.
  - Bit [3] - Set if the device is functioning properly (cleared if device failed its diagnostics).

  ...

  If a device is present on an enumerable bus, then _STA must not return 0. In that case, bit[0] must be set and if the status of the device can be determined through a bus-specific enumeration and discovery mechanism, it must be reflected by the values of bit[1] and bit[3], even though the OSPM is not required to take them into account.

Since PCI *is* an enumerable bus, I don't think we can use _STA to decide whether a PCI device is present.

We can use _STA to decide whether a host bridge is present, of course, but that doesn't help here because the host bridge in question is PNP0A08:00 that leads to [bus 00-3d], and it is present.

I don't know exactly what path led to the igb issue, but I don't think we need to figure that out. I think we just need to avoid the use of _STA in fwnode_device_is_available().

6fffbc7ae137 ("PCI: Honor firmware's device disabled status") appeared in v6.3-rc1, so I think we need to revert or fix it before v6.3, which will probably be tagged Sunday (and I'll be on vacation Friday-Monday).

Bjorn
On Wed, Apr 19, 2023 at 9:34 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Apr 12, 2023 at 04:20:33PM +0300, Andy Shevchenko wrote:
> > On Tue, Apr 11, 2023 at 02:02:03PM -0500, Rob Herring wrote:
> > > On Tue, Apr 11, 2023 at 7:53 AM Donald Hunter <donald.hunter@gmail.com> wrote:
> > > > Bjorn Helgaas <helgaas@kernel.org> writes:
> > > > > On Mon, Apr 10, 2023 at 04:10:54PM +0100, Donald Hunter wrote:
> > > > >> On Sun, 2 Apr 2023 at 23:55, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > >> > On Sat, Apr 01, 2023 at 01:52:25PM +0100, Donald Hunter wrote:
> > > > >> > > On Fri, 31 Mar 2023 at 20:42, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > >> > > >
> > > > >> > > > I assume this igb NIC (07:00.0) must be built-in (not a plug-in card)
> > > > >> > > > because it apparently has an ACPI firmware node, and there's something
> > > > >> > > > we don't expect about its status?
> > > > >> > >
> > > > >> > > Yes they are built-in, to my knowledge.
> > > > >> > >
> > > > >> > > > Hopefully Rob will look at this. If I were looking, I would be
> > > > >> > > > interested in acpidump to see what's in the DSDT.
> > > > >> > >
> > > > >> > > I can get an acpidump. Is there a preferred way to share the files, or just
> > > > >> > > an email attachment?
> > > > >> >
> > > > >> > I think by default acpidump produces ASCII that can be directly
> > > > >> > included in email. http://vger.kernel.org/majordomo-info.html says
> > > > >> > 100K is the limit for vger mailing lists. Or you could open a report
> > > > >> > at https://bugzilla.kernel.org and attach it there, maybe along with a
> > > > >> > complete dmesg log and "sudo lspci -vv" output.
> > > > >>
> > > > >> Apologies for the delay, I was unable to access the machine while travelling.
> > > > >>
> > > > >> https://bugzilla.kernel.org/show_bug.cgi?id=217317
> > > > >
> > > > > Thanks for that! Can you boot a kernel with 6fffbc7ae137 reverted
> > > > > with this in the kernel parameters:
> > > > >
> > > > > dyndbg="file drivers/acpi/* +p"
> > > > >
> > > > > and collect the entire dmesg log?
> > > >
> > > > Added to the bugzilla report.
> > >
> > > Rafael, Andy, Any ideas why fwnode_device_is_available() would return
> > > false for a built-in PCI device with a ACPI device entry? The only
> > > thing I see in the log is it looks like the parent PCI bridge/bus
> > > doesn't have ACPI device entry (based on "[ 0.913389] pci_bus
> > > 0000:07: No ACPI support"). For DT, if the parent doesn't have a node,
> > > then the child can't. Not sure on ACPI.
> >
> > Thanks for the Cc'ing. I haven't checked anything yet, but from the above it
> > sounds like a BIOS issue. If PCI has no ACPI companion tree, then why the heck
> > one of the devices has the entry? I'm not even sure this is allowed by ACPI
> > specification, but as I said, I just solely used the above mail.
>
> ACPI r6.5, sec 6.3.7, about _STA says:
>
>   - Bit [0] - Set if the device is present.
>   - Bit [1] - Set if the device is enabled and decoding its resources.
>   - Bit [3] - Set if the device is functioning properly (cleared if
>     device failed its diagnostics).
>
>   ...
>
>   If a device is present on an enumerable bus, then _STA must not
>   return 0. In that case, bit[0] must be set and if the status of the
>   device can be determined through a bus-specific enumeration and
>   discovery mechanism, it must be reflected by the values of bit[1]
>   and bit[3], even though the OSPM is not required to take them into
>   account.
>
> Since PCI *is* an enumerable bus, I don't think we can use _STA to
> decide whether a PCI device is present.

You are right, _STA can't be used for that.

> We can use _STA to decide whether a host bridge is present, of course,
> but that doesn't help here because the host bridge in question is
> PNP0A08:00 that leads to [bus 00-3d], and it is present.
>
> I don't know exactly what path led to the igb issue, but I don't think
> we need to figure that out. I think we just need to avoid the use of
> _STA in fwnode_device_is_available().

I agree. It is incorrect.

> 6fffbc7ae137 ("PCI: Honor firmware's device disabled status") appeared
> in v6.3-rc1, so I think we need to revert or fix it before v6.3, which
> will probably be tagged Sunday (and I'll be on vacation
> Friday-Monday).

Yes, please revert this one ASAP.

Cheers, Rafael
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 1779582fb500..b1d80c1d7a69 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1841,6 +1841,8 @@ int pci_setup_device(struct pci_dev *dev)
 
 	pci_set_of_node(dev);
 	pci_set_acpi_fwnode(dev);
+	if (dev->dev.fwnode && !fwnode_device_is_available(dev->dev.fwnode))
+		return -ENODEV;
 
 	pci_dev_assign_slot(dev);