Message ID    | 20210125194138.1103155-1-prarit@redhat.com (mailing list archive)
--------------|-------------------------------------------------------------------
State         | Not Applicable
Delegated to  | Bjorn Helgaas
Series        | pci-driver: Add driver load messages
On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote:
> There are two situations where driver load messages are helpful.
>
> 1) Some drivers silently load on devices and debugging driver or system
> failures in these cases is difficult. While some drivers (networking
> for example) may not completely initialize when the PCI driver probe() function
> has returned, it is still useful to have some idea of driver completion.

Sorry, probably it is me, but I don't understand this use case.
Are you adding global to whole kernel command line boot argument to debug
what and when?

During boot:
If device success, you will see it in /sys/bus/pci/[drivers|devices]/*.
If device fails, you should get an error from that device (fix the
device to return an error), or something immediately won't work and
you won't see it in sysfs.

During run:
We have many other solutions to get debug prints during run, for example
tracing, which is possible to toggle dynamically.

Right now, my laptop will print 34 prints on boot and endless amount during
day-to-day usage.

➜  kernel git:(rdma-next) ✗ lspci |wc -l
34

> 2) Storage and Network device vendors have relatively short lives for
> some of their hardware. Some devices may continue to function but are
> problematic due to out-of-date firmware or other issues. Maintaining
> a database of the hardware is out-of-the-question in the kernel as it would
> require constant updating. Outputting a message in the log would allow
> different OSes to determine if the problem hardware was truly supported or not.

And rely on some dmesg output as a true source of supported/not supported and
making this ABI which needs knob in command line. ?

> Add optional driver load messages from the PCI core that indicates which
> driver was loaded, on which slot, and on which device.

Why don't you add simple pr_debug(..) without any knob? You will be able
to enable/disable it through dynamic prints facility.

Thanks
Leon Romanovsky <leon@kernel.org> wrote:
> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote:
> > There are two situations where driver load messages are helpful.
> >
> > 1) Some drivers silently load on devices and debugging driver or system
> > failures in these cases is difficult. While some drivers (networking
> > for example) may not completely initialize when the PCI driver probe() function
> > has returned, it is still useful to have some idea of driver completion.
>
> Sorry, probably it is me, but I don't understand this use case.
> Are you adding global to whole kernel command line boot argument to debug
> what and when?
>
> During boot:
> If device success, you will see it in /sys/bus/pci/[drivers|devices]/*.
> If device fails, you should get an error from that device (fix the
> device to return an error), or something immediately won't work and
> you won't see it in sysfs.
>

What if there is a panic during boot? There's no way to get to sysfs.
That's the case where this is helpful.

> During run:
> We have many other solutions to get debug prints during run, for example
> tracing, which is possible to toggle dynamically.
>
> Right now, my laptop will print 34 prints on boot and endless amount during
> day-to-day usage.
>
> ➜  kernel git:(rdma-next) ✗ lspci |wc -l
> 34
>
> > 2) Storage and Network device vendors have relatively short lives for
> > some of their hardware. Some devices may continue to function but are
> > problematic due to out-of-date firmware or other issues. Maintaining
> > a database of the hardware is out-of-the-question in the kernel as it would
> > require constant updating. Outputting a message in the log would allow
> > different OSes to determine if the problem hardware was truly supported or not.
>
> And rely on some dmesg output as a true source of supported/not supported and
> making this ABI which needs knob in command line. ?

Yes. The console log being saved would work as a true source of load
messages to be interpreted by an OS tool. But I see your point about the
knob below...

>
> > Add optional driver load messages from the PCI core that indicates which
> > driver was loaded, on which slot, and on which device.
>
> Why don't you add simple pr_debug(..) without any knob? You will be able
> to enable/disable it through dynamic prints facility.

Good point. I'll wait for more feedback and submit a v2 with pr_debug.

P.

>
> Thanks
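For context on the knob-less variant Leon suggests, a minimal sketch of what such a print in local_pci_probe() could look like is shown below. The message text is invented for illustration and is not a posted patch; pci_dev and pci_drv are the local variable names used in that function.

/*
 * Sketch only: a debug print before the driver's probe() call in
 * local_pci_probe(), with no command-line option.
 */
pci_dbg(pci_dev, "%s loading\n", pci_drv->name);

/*
 * With CONFIG_DYNAMIC_DEBUG the print is compiled in but silent by
 * default, and can be toggled at runtime through the dynamic debug
 * control file, e.g.:
 *
 *   echo 'file pci-driver.c +p' > /sys/kernel/debug/dynamic_debug/control
 */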
On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote: > Leon Romanovsky <leon@kernel.org> wrote: > > On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote: > > > There are two situations where driver load messages are helpful. > > > > > > 1) Some drivers silently load on devices and debugging driver or system > > > failures in these cases is difficult. While some drivers (networking > > > for example) may not completely initialize when the PCI driver probe() function > > > has returned, it is still useful to have some idea of driver completion. > > > > Sorry, probably it is me, but I don't understand this use case. > > Are you adding global to whole kernel command line boot argument to debug > > what and when? > > > > During boot: > > If device success, you will see it in /sys/bus/pci/[drivers|devices]/*. > > If device fails, you should get an error from that device (fix the > > device to return an error), or something immediately won't work and > > you won't see it in sysfs. > > > > What if there is a panic during boot? There's no way to get to sysfs. > That's the case where this is helpful. How? If you have kernel panic, it means you have much more worse problem than not-supported device. If kernel panic was caused by the driver, you will see call trace related to it. If kernel panic was caused by something else, supported/not supported won't help here. > > > During run: > > We have many other solutions to get debug prints during run, for example > > tracing, which is possible to toggle dynamically. > > > > Right now, my laptop will print 34 prints on boot and endless amount during > > day-to-day usage. > > > > ➜ kernel git:(rdma-next) ✗ lspci |wc -l > > 34 > > > > > > > > 2) Storage and Network device vendors have relatively short lives for > > > some of their hardware. Some devices may continue to function but are > > > problematic due to out-of-date firmware or other issues. Maintaining > > > a database of the hardware is out-of-the-question in the kernel as it would > > > require constant updating. Outputting a message in the log would allow > > > different OSes to determine if the problem hardware was truly supported or not. > > > > And rely on some dmesg output as a true source of supported/not supported and > > making this ABI which needs knob in command line. ? > > Yes. The console log being saved would work as a true source of load > messages to be interpreted by an OS tool. But I see your point about the > knob below... You will need much more stronger claim than the above if you want to proceed ABI path through dmesg prints. > > > > > > > > > Add optional driver load messages from the PCI core that indicates which > > > driver was loaded, on which slot, and on which device. > > > > Why don't you add simple pr_debug(..) without any knob? You will be able > > to enable/disable it through dynamic prints facility. > > Good point. I'll wait for more feedback and submit a v2 with pr_debug. Just to be clear, none of this can be ABI and any kernel print can be changed or removed any minute without any announcement. Thanks > > P. > > > > > Thanks >
On 1/26/21 8:14 AM, Leon Romanovsky wrote: > On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote: >> Leon Romanovsky <leon@kernel.org> wrote: >>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote: >>>> There are two situations where driver load messages are helpful. >>>> >>>> 1) Some drivers silently load on devices and debugging driver or system >>>> failures in these cases is difficult. While some drivers (networking >>>> for example) may not completely initialize when the PCI driver probe() function >>>> has returned, it is still useful to have some idea of driver completion. >>> >>> Sorry, probably it is me, but I don't understand this use case. >>> Are you adding global to whole kernel command line boot argument to debug >>> what and when? >>> >>> During boot: >>> If device success, you will see it in /sys/bus/pci/[drivers|devices]/*. >>> If device fails, you should get an error from that device (fix the >>> device to return an error), or something immediately won't work and >>> you won't see it in sysfs. >>> >> >> What if there is a panic during boot? There's no way to get to sysfs. >> That's the case where this is helpful. > > How? If you have kernel panic, it means you have much more worse problem > than not-supported device. If kernel panic was caused by the driver, you > will see call trace related to it. If kernel panic was caused by > something else, supported/not supported won't help here. I still have no idea *WHICH* device it was that the panic occurred on. > >> >>> During run: >>> We have many other solutions to get debug prints during run, for example >>> tracing, which is possible to toggle dynamically. >>> >>> Right now, my laptop will print 34 prints on boot and endless amount during >>> day-to-day usage. >>> >>> ➜ kernel git:(rdma-next) ✗ lspci |wc -l >>> 34 >>> >>>> >>>> 2) Storage and Network device vendors have relatively short lives for >>>> some of their hardware. Some devices may continue to function but are >>>> problematic due to out-of-date firmware or other issues. Maintaining >>>> a database of the hardware is out-of-the-question in the kernel as it would >>>> require constant updating. Outputting a message in the log would allow >>>> different OSes to determine if the problem hardware was truly supported or not. >>> >>> And rely on some dmesg output as a true source of supported/not supported and >>> making this ABI which needs knob in command line. ? >> >> Yes. The console log being saved would work as a true source of load >> messages to be interpreted by an OS tool. But I see your point about the >> knob below... > > You will need much more stronger claim than the above if you want to proceed > ABI path through dmesg prints. > See my answer below. I agree with you on the ABI statement. >> >>> >>>> >>>> Add optional driver load messages from the PCI core that indicates which >>>> driver was loaded, on which slot, and on which device. >>> >>> Why don't you add simple pr_debug(..) without any knob? You will be able >>> to enable/disable it through dynamic prints facility. >> >> Good point. I'll wait for more feedback and submit a v2 with pr_debug. > > Just to be clear, none of this can be ABI and any kernel print can > be changed or removed any minute without any announcement. Yes, that's absolutely the case and I agree with you that nothing can guarantee ABI of those pr_debug() statements. They are *debug* after all. P. > > Thanks > >> >> P. >> >>> >>> Thanks >> >
On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote: > > > On 1/26/21 8:14 AM, Leon Romanovsky wrote: > > On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote: > >> Leon Romanovsky <leon@kernel.org> wrote: > >>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote: > >>>> There are two situations where driver load messages are helpful. > >>>> > >>>> 1) Some drivers silently load on devices and debugging driver or system > >>>> failures in these cases is difficult. While some drivers (networking > >>>> for example) may not completely initialize when the PCI driver probe() function > >>>> has returned, it is still useful to have some idea of driver completion. > >>> > >>> Sorry, probably it is me, but I don't understand this use case. > >>> Are you adding global to whole kernel command line boot argument to debug > >>> what and when? > >>> > >>> During boot: > >>> If device success, you will see it in /sys/bus/pci/[drivers|devices]/*. > >>> If device fails, you should get an error from that device (fix the > >>> device to return an error), or something immediately won't work and > >>> you won't see it in sysfs. > >>> > >> > >> What if there is a panic during boot? There's no way to get to sysfs. > >> That's the case where this is helpful. > > > > How? If you have kernel panic, it means you have much more worse problem > > than not-supported device. If kernel panic was caused by the driver, you > > will see call trace related to it. If kernel panic was caused by > > something else, supported/not supported won't help here. > > I still have no idea *WHICH* device it was that the panic occurred on. The kernel panic is printed from the driver. There is one driver loaded for all same PCI devices which are probed without relation to their number. If you have host with ten same cards, you will see one driver and this is where the problem and not in supported/not-supported device. > > > >> > >>> During run: > >>> We have many other solutions to get debug prints during run, for example > >>> tracing, which is possible to toggle dynamically. > >>> > >>> Right now, my laptop will print 34 prints on boot and endless amount during > >>> day-to-day usage. > >>> > >>> ➜ kernel git:(rdma-next) ✗ lspci |wc -l > >>> 34 > >>> > >>>> > >>>> 2) Storage and Network device vendors have relatively short lives for > >>>> some of their hardware. Some devices may continue to function but are > >>>> problematic due to out-of-date firmware or other issues. Maintaining > >>>> a database of the hardware is out-of-the-question in the kernel as it would > >>>> require constant updating. Outputting a message in the log would allow > >>>> different OSes to determine if the problem hardware was truly supported or not. > >>> > >>> And rely on some dmesg output as a true source of supported/not supported and > >>> making this ABI which needs knob in command line. ? > >> > >> Yes. The console log being saved would work as a true source of load > >> messages to be interpreted by an OS tool. But I see your point about the > >> knob below... > > > > You will need much more stronger claim than the above if you want to proceed > > ABI path through dmesg prints. > > > > See my answer below. I agree with you on the ABI statement. > > >> > >>> > >>>> > >>>> Add optional driver load messages from the PCI core that indicates which > >>>> driver was loaded, on which slot, and on which device. > >>> > >>> Why don't you add simple pr_debug(..) without any knob? 
You will be able > >>> to enable/disable it through dynamic prints facility. > >> > >> Good point. I'll wait for more feedback and submit a v2 with pr_debug. > > > > Just to be clear, none of this can be ABI and any kernel print can > > be changed or removed any minute without any announcement. > > Yes, that's absolutely the case and I agree with you that nothing can guarantee > ABI of those pr_debug() statements. They are *debug* after all. You missed the point. ALL pr*() prints are not ABI, without relation to their level. Thanks > > P. > > > > > Thanks > > > >> > >> P. > >> > >>> > >>> Thanks > >> > > >
On 1/26/21 8:53 AM, Leon Romanovsky wrote: > On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote: >> >> >> On 1/26/21 8:14 AM, Leon Romanovsky wrote: >>> On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote: >>>> Leon Romanovsky <leon@kernel.org> wrote: >>>>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote: >>>>>> There are two situations where driver load messages are helpful. >>>>>> >>>>>> 1) Some drivers silently load on devices and debugging driver or system >>>>>> failures in these cases is difficult. While some drivers (networking >>>>>> for example) may not completely initialize when the PCI driver probe() function >>>>>> has returned, it is still useful to have some idea of driver completion. >>>>> >>>>> Sorry, probably it is me, but I don't understand this use case. >>>>> Are you adding global to whole kernel command line boot argument to debug >>>>> what and when? >>>>> >>>>> During boot: >>>>> If device success, you will see it in /sys/bus/pci/[drivers|devices]/*. >>>>> If device fails, you should get an error from that device (fix the >>>>> device to return an error), or something immediately won't work and >>>>> you won't see it in sysfs. >>>>> >>>> >>>> What if there is a panic during boot? There's no way to get to sysfs. >>>> That's the case where this is helpful. >>> >>> How? If you have kernel panic, it means you have much more worse problem >>> than not-supported device. If kernel panic was caused by the driver, you >>> will see call trace related to it. If kernel panic was caused by >>> something else, supported/not supported won't help here. >> >> I still have no idea *WHICH* device it was that the panic occurred on. > > The kernel panic is printed from the driver. There is one driver loaded > for all same PCI devices which are probed without relation to their > number.> > If you have host with ten same cards, you will see one driver and this > is where the problem and not in supported/not-supported device. That's true, but you can also have different cards loading the same driver. See, for example, any PCI_IDs list in a driver. For example, 10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02) 20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02) Both load the megaraid driver and have different profiles within the driver. I have no idea which one actually panicked until removing one card. It's MUCH worse when debugging new hardware and getting a panic from, for example, the uncore code which binds to a PCI mapped device. One device might work and the next one doesn't. And then you can multiply that by seeing *many* panics at once and trying to determine if the problem was on one specific socket, die, or core. > >>> >>>> >>>>> During run: >>>>> We have many other solutions to get debug prints during run, for example >>>>> tracing, which is possible to toggle dynamically. >>>>> >>>>> Right now, my laptop will print 34 prints on boot and endless amount during >>>>> day-to-day usage. >>>>> >>>>> ➜ kernel git:(rdma-next) ✗ lspci |wc -l >>>>> 34 >>>>> >>>>>> >>>>>> 2) Storage and Network device vendors have relatively short lives for >>>>>> some of their hardware. Some devices may continue to function but are >>>>>> problematic due to out-of-date firmware or other issues. Maintaining >>>>>> a database of the hardware is out-of-the-question in the kernel as it would >>>>>> require constant updating. 
Outputting a message in the log would allow >>>>>> different OSes to determine if the problem hardware was truly supported or not. >>>>> >>>>> And rely on some dmesg output as a true source of supported/not supported and >>>>> making this ABI which needs knob in command line. ? >>>> >>>> Yes. The console log being saved would work as a true source of load >>>> messages to be interpreted by an OS tool. But I see your point about the >>>> knob below... >>> >>> You will need much more stronger claim than the above if you want to proceed >>> ABI path through dmesg prints. >>> >> >> See my answer below. I agree with you on the ABI statement. >> >>>> >>>>> >>>>>> >>>>>> Add optional driver load messages from the PCI core that indicates which >>>>>> driver was loaded, on which slot, and on which device. >>>>> >>>>> Why don't you add simple pr_debug(..) without any knob? You will be able >>>>> to enable/disable it through dynamic prints facility. >>>> >>>> Good point. I'll wait for more feedback and submit a v2 with pr_debug. >>> >>> Just to be clear, none of this can be ABI and any kernel print can >>> be changed or removed any minute without any announcement. >> >> Yes, that's absolutely the case and I agree with you that nothing can guarantee >> ABI of those pr_debug() statements. They are *debug* after all. > > You missed the point. ALL pr*() prints are not ABI, without relation to their level. > Yes, I understood that. I'm just emphasizing your ABI concern. P. > Thanks > >> >> P. >> >>> >>> Thanks >>> >>>> >>>> P. >>>> >>>>> >>>>> Thanks >>>> >>> >> >
Hi Prarit, On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote: > On 1/26/21 8:53 AM, Leon Romanovsky wrote: > > On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote: > >> On 1/26/21 8:14 AM, Leon Romanovsky wrote: > >>> On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote: > >>>> Leon Romanovsky <leon@kernel.org> wrote: > >>>>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote: > >>>>>> There are two situations where driver load messages are helpful. > >>>>>> > >>>>>> 1) Some drivers silently load on devices and debugging driver or system > >>>>>> failures in these cases is difficult. While some drivers (networking > >>>>>> for example) may not completely initialize when the PCI driver probe() function > >>>>>> has returned, it is still useful to have some idea of driver completion. > >>>>> > >>>>> Sorry, probably it is me, but I don't understand this use case. > >>>>> Are you adding global to whole kernel command line boot argument to debug > >>>>> what and when? > >>>>> > >>>>> During boot: > >>>>> If device success, you will see it in /sys/bus/pci/[drivers|devices]/*. > >>>>> If device fails, you should get an error from that device (fix the > >>>>> device to return an error), or something immediately won't work and > >>>>> you won't see it in sysfs. > >>>> > >>>> What if there is a panic during boot? There's no way to get to sysfs. > >>>> That's the case where this is helpful. > >>> > >>> How? If you have kernel panic, it means you have much more worse problem > >>> than not-supported device. If kernel panic was caused by the driver, you > >>> will see call trace related to it. If kernel panic was caused by > >>> something else, supported/not supported won't help here. > >> > >> I still have no idea *WHICH* device it was that the panic occurred on. > > > > The kernel panic is printed from the driver. There is one driver loaded > > for all same PCI devices which are probed without relation to their > > number.> > > If you have host with ten same cards, you will see one driver and this > > is where the problem and not in supported/not-supported device. > > That's true, but you can also have different cards loading the same driver. > See, for example, any PCI_IDs list in a driver. > > For example, > > 10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02) > 20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02) > > Both load the megaraid driver and have different profiles within the > driver. I have no idea which one actually panicked until removing > one card. > > It's MUCH worse when debugging new hardware and getting a panic > from, for example, the uncore code which binds to a PCI mapped > device. One device might work and the next one doesn't. And then > you can multiply that by seeing *many* panics at once and trying to > determine if the problem was on one specific socket, die, or core. Would a dev_panic() interface that identified the device and driver help with this? For driver_load_messages, it doesn't seem necessarily PCI-specific. If we want a message like that, maybe it could be in driver_probe_device() or similar? There are already a few pr_debug() calls in that path. There are some enabled by initcall_debug that include the return value from the probe; would those be close to what you're looking for? Bjorn
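To make Bjorn's point about the generic probe path concrete, the sketch below shows the kind of bus-agnostic message that could live there. It is illustrative only: the helper name and call site are assumptions, not the actual drivers/base/dd.c code, and the initcall_debug behaviour is summarized rather than quoted.

#include <linux/device.h>

/*
 * Hypothetical helper, loosely modeled on the generic probe path.
 * A dev_dbg() here identifies the device and driver for every bus type,
 * not just PCI; with initcall_debug the driver core already logs the
 * probe return value at a similar point.
 */
static int probe_with_message(struct device_driver *drv, struct device *dev,
			      int (*do_probe)(struct device *dev))
{
	int ret;

	dev_dbg(dev, "probing with driver %s\n", drv->name);
	ret = do_probe(dev);
	dev_dbg(dev, "probe of %s returned %d\n", drv->name, ret);
	return ret;
}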
On 1/26/21 10:12 AM, Bjorn Helgaas wrote: > Hi Prarit, > > On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote: >> On 1/26/21 8:53 AM, Leon Romanovsky wrote: >>> On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote: >>>> On 1/26/21 8:14 AM, Leon Romanovsky wrote: >>>>> On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote: >>>>>> Leon Romanovsky <leon@kernel.org> wrote: >>>>>>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote: >>>>>>>> There are two situations where driver load messages are helpful. >>>>>>>> >>>>>>>> 1) Some drivers silently load on devices and debugging driver or system >>>>>>>> failures in these cases is difficult. While some drivers (networking >>>>>>>> for example) may not completely initialize when the PCI driver probe() function >>>>>>>> has returned, it is still useful to have some idea of driver completion. >>>>>>> >>>>>>> Sorry, probably it is me, but I don't understand this use case. >>>>>>> Are you adding global to whole kernel command line boot argument to debug >>>>>>> what and when? >>>>>>> >>>>>>> During boot: >>>>>>> If device success, you will see it in /sys/bus/pci/[drivers|devices]/*. >>>>>>> If device fails, you should get an error from that device (fix the >>>>>>> device to return an error), or something immediately won't work and >>>>>>> you won't see it in sysfs. >>>>>> >>>>>> What if there is a panic during boot? There's no way to get to sysfs. >>>>>> That's the case where this is helpful. >>>>> >>>>> How? If you have kernel panic, it means you have much more worse problem >>>>> than not-supported device. If kernel panic was caused by the driver, you >>>>> will see call trace related to it. If kernel panic was caused by >>>>> something else, supported/not supported won't help here. >>>> >>>> I still have no idea *WHICH* device it was that the panic occurred on. >>> >>> The kernel panic is printed from the driver. There is one driver loaded >>> for all same PCI devices which are probed without relation to their >>> number.> >>> If you have host with ten same cards, you will see one driver and this >>> is where the problem and not in supported/not-supported device. >> >> That's true, but you can also have different cards loading the same driver. >> See, for example, any PCI_IDs list in a driver. >> >> For example, >> >> 10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02) >> 20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02) >> >> Both load the megaraid driver and have different profiles within the >> driver. I have no idea which one actually panicked until removing >> one card. >> >> It's MUCH worse when debugging new hardware and getting a panic >> from, for example, the uncore code which binds to a PCI mapped >> device. One device might work and the next one doesn't. And then >> you can multiply that by seeing *many* panics at once and trying to >> determine if the problem was on one specific socket, die, or core. > > Would a dev_panic() interface that identified the device and driver > help with this? It would but, > > For driver_load_messages, it doesn't seem necessarily PCI-specific. > If we want a message like that, maybe it could be in > driver_probe_device() or similar? There are already a few pr_debug() > calls in that path. There are some enabled by initcall_debug that > include the return value from the probe; would those be close to what > you're looking for? I think this drivers/base/dd.c:727 might suffice. 
Let me try some tests with that and get back to you. Thanks for the pointers, P. > > Bjorn >
On 1/26/21 10:12 AM, Bjorn Helgaas wrote: > Hi Prarit, > > On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote: >> On 1/26/21 8:53 AM, Leon Romanovsky wrote: >>> On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote: >>>> On 1/26/21 8:14 AM, Leon Romanovsky wrote: >>>>> On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote: >>>>>> Leon Romanovsky <leon@kernel.org> wrote: >>>>>>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote: >>>>>>>> There are two situations where driver load messages are helpful. >>>>>>>> >>>>>>>> 1) Some drivers silently load on devices and debugging driver or system >>>>>>>> failures in these cases is difficult. While some drivers (networking >>>>>>>> for example) may not completely initialize when the PCI driver probe() function >>>>>>>> has returned, it is still useful to have some idea of driver completion. >>>>>>> >>>>>>> Sorry, probably it is me, but I don't understand this use case. >>>>>>> Are you adding global to whole kernel command line boot argument to debug >>>>>>> what and when? >>>>>>> >>>>>>> During boot: >>>>>>> If device success, you will see it in /sys/bus/pci/[drivers|devices]/*. >>>>>>> If device fails, you should get an error from that device (fix the >>>>>>> device to return an error), or something immediately won't work and >>>>>>> you won't see it in sysfs. >>>>>> >>>>>> What if there is a panic during boot? There's no way to get to sysfs. >>>>>> That's the case where this is helpful. >>>>> >>>>> How? If you have kernel panic, it means you have much more worse problem >>>>> than not-supported device. If kernel panic was caused by the driver, you >>>>> will see call trace related to it. If kernel panic was caused by >>>>> something else, supported/not supported won't help here. >>>> >>>> I still have no idea *WHICH* device it was that the panic occurred on. >>> >>> The kernel panic is printed from the driver. There is one driver loaded >>> for all same PCI devices which are probed without relation to their >>> number.> >>> If you have host with ten same cards, you will see one driver and this >>> is where the problem and not in supported/not-supported device. >> >> That's true, but you can also have different cards loading the same driver. >> See, for example, any PCI_IDs list in a driver. >> >> For example, >> >> 10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02) >> 20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02) >> >> Both load the megaraid driver and have different profiles within the >> driver. I have no idea which one actually panicked until removing >> one card. >> >> It's MUCH worse when debugging new hardware and getting a panic >> from, for example, the uncore code which binds to a PCI mapped >> device. One device might work and the next one doesn't. And then >> you can multiply that by seeing *many* panics at once and trying to >> determine if the problem was on one specific socket, die, or core. > > Would a dev_panic() interface that identified the device and driver > help with this? > ^^ the more I look at this problem, the more a dev_panic() that would output a device specific message at panic time is what I really need. > For driver_load_messages, it doesn't seem necessarily PCI-specific. > If we want a message like that, maybe it could be in > driver_probe_device() or similar? There are already a few pr_debug() > calls in that path. 
There are some enabled by initcall_debug that > include the return value from the probe; would those be close to what > you're looking for? I took a look at those, and unfortunately they do not meet my requirements. Ultimately, at panic time, I need to know that a driver was loaded on a device at a specific location in the PCI space. The driver_probe_device() pr_debug calls tell me the location and the driver, but not anything to uniquely identify the device (ie, the PCI vendor and device IDs). It sounds like you've had some thoughts about a dev_panic() implementation. Care to share them with me? I'm more than willing to implement it but just want to get your more experienced view of what is needed. P. > > Bjorn >
On Thu, Feb 18, 2021 at 01:36:35PM -0500, Prarit Bhargava wrote: > On 1/26/21 10:12 AM, Bjorn Helgaas wrote: > > On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote: > >> On 1/26/21 8:53 AM, Leon Romanovsky wrote: > >>> On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote: > >>>> On 1/26/21 8:14 AM, Leon Romanovsky wrote: > >>>>> On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote: > >>>>>> Leon Romanovsky <leon@kernel.org> wrote: > >>>>>>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote: > >>>>>>>> There are two situations where driver load messages are helpful. > >>>>>>>> > >>>>>>>> 1) Some drivers silently load on devices and debugging driver or system > >>>>>>>> failures in these cases is difficult. While some drivers (networking > >>>>>>>> for example) may not completely initialize when the PCI driver probe() function > >>>>>>>> has returned, it is still useful to have some idea of driver completion. > >>>>>>> > >>>>>>> Sorry, probably it is me, but I don't understand this use case. > >>>>>>> Are you adding global to whole kernel command line boot argument to debug > >>>>>>> what and when? > >>>>>>> > >>>>>>> During boot: > >>>>>>> If device success, you will see it in /sys/bus/pci/[drivers|devices]/*. > >>>>>>> If device fails, you should get an error from that device (fix the > >>>>>>> device to return an error), or something immediately won't work and > >>>>>>> you won't see it in sysfs. > >>>>>> > >>>>>> What if there is a panic during boot? There's no way to get to sysfs. > >>>>>> That's the case where this is helpful. > >>>>> > >>>>> How? If you have kernel panic, it means you have much more worse problem > >>>>> than not-supported device. If kernel panic was caused by the driver, you > >>>>> will see call trace related to it. If kernel panic was caused by > >>>>> something else, supported/not supported won't help here. > >>>> > >>>> I still have no idea *WHICH* device it was that the panic occurred on. > >>> > >>> The kernel panic is printed from the driver. There is one driver loaded > >>> for all same PCI devices which are probed without relation to their > >>> number.> > >>> If you have host with ten same cards, you will see one driver and this > >>> is where the problem and not in supported/not-supported device. > >> > >> That's true, but you can also have different cards loading the same driver. > >> See, for example, any PCI_IDs list in a driver. > >> > >> For example, > >> > >> 10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02) > >> 20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02) > >> > >> Both load the megaraid driver and have different profiles within the > >> driver. I have no idea which one actually panicked until removing > >> one card. > >> > >> It's MUCH worse when debugging new hardware and getting a panic > >> from, for example, the uncore code which binds to a PCI mapped > >> device. One device might work and the next one doesn't. And > >> then you can multiply that by seeing *many* panics at once and > >> trying to determine if the problem was on one specific socket, > >> die, or core. > > > > Would a dev_panic() interface that identified the device and > > driver help with this? > > ^^ the more I look at this problem, the more a dev_panic() that > would output a device specific message at panic time is what I > really need. > > > For driver_load_messages, it doesn't seem necessarily > > PCI-specific. 
If we want a message like that, maybe it could be > > in driver_probe_device() or similar? There are already a few > > pr_debug() calls in that path. There are some enabled by > > initcall_debug that include the return value from the probe; would > > those be close to what you're looking for? > > I took a look at those, and unfortunately they do not meet my > requirements. Ultimately, at panic time, I need to know that a > driver was loaded on a device at a specific location in the PCI > space. > > The driver_probe_device() pr_debug calls tell me the location and > the driver, but not anything to uniquely identify the device (ie, > the PCI vendor and device IDs). > > It sounds like you've had some thoughts about a dev_panic() > implementation. Care to share them with me? I'm more than willing > to implement it but just want to get your more experienced view of > what is needed. A dev_panic() might indeed be useful. It would be nice to capture the device information at the *first* crash instead of having to boot again with initcall_debug and try to reproduce the crash. include/linux/dev_printk.h has a whole mess of #defines for dev_info() et al, and maybe that could be extended for dev_panic(). Or maybe the dev_WARN() definition at the bottom, which basically just pastes the driver and device info at the beginning of the string, would be better. Even if you do add a dev_panic(), I would also take a look through drivers/base/dd.c and similar files to see whether the pr_warn/pr_debug/etc they do could be converted to dev_warn/dev_dbg/etc. $ git grep -A2 pr_ drivers/base | grep dev_name finds several likely-looking candidates. Here are some past patches along that line: 64b3eaf3715d ("xenbus: Use dev_printk() when possible") a88a7b3eb076 ("vfio: Use dev_printk() when possible") 780da9e4f5bf ("iommu: Use dev_printk() when possible") Bjorn
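One concrete reading of the dev_WARN() suggestion, for illustration only: dev_panic() does not exist in mainline, and the macro below is a sketch built on dev_driver_string()/dev_name() the same way dev_WARN() is.

#include <linux/dev_printk.h>
#include <linux/kernel.h>

/*
 * Sketch of a dev_panic(), modeled on the dev_WARN() definition at the
 * bottom of include/linux/dev_printk.h: paste the driver and device name
 * in front of the message, then panic.
 */
#define dev_panic(dev, format, arg...) \
	panic("%s %s: " format, dev_driver_string(dev), dev_name(dev), ## arg)

A call such as dev_panic(&pdev->dev, "probe failed\n") would then print something like "megaraid_sas 0000:10:00.0: probe failed", tying the crash to a specific slot and driver.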
On 2/18/21 2:06 PM, Bjorn Helgaas wrote: > On Thu, Feb 18, 2021 at 01:36:35PM -0500, Prarit Bhargava wrote: >> On 1/26/21 10:12 AM, Bjorn Helgaas wrote: >>> On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote: >>>> On 1/26/21 8:53 AM, Leon Romanovsky wrote: >>>>> On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote: >>>>>> On 1/26/21 8:14 AM, Leon Romanovsky wrote: >>>>>>> On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote: >>>>>>>> Leon Romanovsky <leon@kernel.org> wrote: >>>>>>>>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote: >>>>>>>>>> There are two situations where driver load messages are helpful. >>>>>>>>>> >>>>>>>>>> 1) Some drivers silently load on devices and debugging driver or system >>>>>>>>>> failures in these cases is difficult. While some drivers (networking >>>>>>>>>> for example) may not completely initialize when the PCI driver probe() function >>>>>>>>>> has returned, it is still useful to have some idea of driver completion. >>>>>>>>> >>>>>>>>> Sorry, probably it is me, but I don't understand this use case. >>>>>>>>> Are you adding global to whole kernel command line boot argument to debug >>>>>>>>> what and when? >>>>>>>>> >>>>>>>>> During boot: >>>>>>>>> If device success, you will see it in /sys/bus/pci/[drivers|devices]/*. >>>>>>>>> If device fails, you should get an error from that device (fix the >>>>>>>>> device to return an error), or something immediately won't work and >>>>>>>>> you won't see it in sysfs. >>>>>>>> >>>>>>>> What if there is a panic during boot? There's no way to get to sysfs. >>>>>>>> That's the case where this is helpful. >>>>>>> >>>>>>> How? If you have kernel panic, it means you have much more worse problem >>>>>>> than not-supported device. If kernel panic was caused by the driver, you >>>>>>> will see call trace related to it. If kernel panic was caused by >>>>>>> something else, supported/not supported won't help here. >>>>>> >>>>>> I still have no idea *WHICH* device it was that the panic occurred on. >>>>> >>>>> The kernel panic is printed from the driver. There is one driver loaded >>>>> for all same PCI devices which are probed without relation to their >>>>> number.> >>>>> If you have host with ten same cards, you will see one driver and this >>>>> is where the problem and not in supported/not-supported device. >>>> >>>> That's true, but you can also have different cards loading the same driver. >>>> See, for example, any PCI_IDs list in a driver. >>>> >>>> For example, >>>> >>>> 10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02) >>>> 20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02) >>>> >>>> Both load the megaraid driver and have different profiles within the >>>> driver. I have no idea which one actually panicked until removing >>>> one card. >>>> >>>> It's MUCH worse when debugging new hardware and getting a panic >>>> from, for example, the uncore code which binds to a PCI mapped >>>> device. One device might work and the next one doesn't. And >>>> then you can multiply that by seeing *many* panics at once and >>>> trying to determine if the problem was on one specific socket, >>>> die, or core. >>> >>> Would a dev_panic() interface that identified the device and >>> driver help with this? >> >> ^^ the more I look at this problem, the more a dev_panic() that >> would output a device specific message at panic time is what I >> really need. Bjorn, I went down this road a bit and had a realization. 
The issue isn't with printing something at panic time, but the *data* that
is output.  Each PCI device is associated with a struct device.  That
device struct's name is output for dev_dbg(), etc., commands.  The PCI
subsystem sets the device struct name at drivers/pci/probe.c:1799:

	dev_set_name(&dev->dev, "%04x:%02x:%02x.%d", pci_domain_nr(dev->bus),
		     dev->bus->number, PCI_SLOT(dev->devfn),
		     PCI_FUNC(dev->devfn));

My problem really is that the above information is insufficient when I (or
a user) need to debug a system.  The complexities of debugging multiple
broken driver loads would be much easier if I didn't have to constantly
add this output manually :).

Would you be okay with adding a *debug* parameter to expand the device
name to include the vendor & device ID pair?  FWIW, I'm somewhat against
yet-another-kernel-option but that's really the information I need.  I
could then add dev_dbg() statements in the local_pci_probe() function.

Thoughts?

P.
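A minimal sketch of the dev_dbg()-style print described here, assuming it were added around the probe call in local_pci_probe(): the helper and message format are invented for illustration, and pci_dbg() is simply dev_dbg() on &pdev->dev.

#include <linux/pci.h>

/*
 * Hypothetical helper showing the information wanted at probe time: the
 * device address (printed by dev_dbg() via dev_name()) plus the
 * vendor/device ID pair.  Compiled out unless dynamic debug or DEBUG is
 * enabled, so no kernel parameter is needed.
 */
static void pci_report_probe(struct pci_dev *pdev, const char *drv_name)
{
	pci_dbg(pdev, "%s probing device [%04x:%04x]\n",
		drv_name, pdev->vendor, pdev->device);
}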
On Thu, Mar 04, 2021 at 09:42:44AM -0500, Prarit Bhargava wrote: > > > On 2/18/21 2:06 PM, Bjorn Helgaas wrote: > > On Thu, Feb 18, 2021 at 01:36:35PM -0500, Prarit Bhargava wrote: > >> On 1/26/21 10:12 AM, Bjorn Helgaas wrote: > >>> On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote: > >>>> On 1/26/21 8:53 AM, Leon Romanovsky wrote: > >>>>> On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote: > >>>>>> On 1/26/21 8:14 AM, Leon Romanovsky wrote: > >>>>>>> On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote: > >>>>>>>> Leon Romanovsky <leon@kernel.org> wrote: > >>>>>>>>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote: > >>>>>>>>>> There are two situations where driver load messages are helpful. > >>>>>>>>>> > >>>>>>>>>> 1) Some drivers silently load on devices and debugging driver or system > >>>>>>>>>> failures in these cases is difficult. While some drivers (networking > >>>>>>>>>> for example) may not completely initialize when the PCI driver probe() function > >>>>>>>>>> has returned, it is still useful to have some idea of driver completion. > >>>>>>>>> > >>>>>>>>> Sorry, probably it is me, but I don't understand this use case. > >>>>>>>>> Are you adding global to whole kernel command line boot argument to debug > >>>>>>>>> what and when? > >>>>>>>>> > >>>>>>>>> During boot: > >>>>>>>>> If device success, you will see it in /sys/bus/pci/[drivers|devices]/*. > >>>>>>>>> If device fails, you should get an error from that device (fix the > >>>>>>>>> device to return an error), or something immediately won't work and > >>>>>>>>> you won't see it in sysfs. > >>>>>>>> > >>>>>>>> What if there is a panic during boot? There's no way to get to sysfs. > >>>>>>>> That's the case where this is helpful. > >>>>>>> > >>>>>>> How? If you have kernel panic, it means you have much more worse problem > >>>>>>> than not-supported device. If kernel panic was caused by the driver, you > >>>>>>> will see call trace related to it. If kernel panic was caused by > >>>>>>> something else, supported/not supported won't help here. > >>>>>> > >>>>>> I still have no idea *WHICH* device it was that the panic occurred on. > >>>>> > >>>>> The kernel panic is printed from the driver. There is one driver loaded > >>>>> for all same PCI devices which are probed without relation to their > >>>>> number.> > >>>>> If you have host with ten same cards, you will see one driver and this > >>>>> is where the problem and not in supported/not-supported device. > >>>> > >>>> That's true, but you can also have different cards loading the same driver. > >>>> See, for example, any PCI_IDs list in a driver. > >>>> > >>>> For example, > >>>> > >>>> 10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02) > >>>> 20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02) > >>>> > >>>> Both load the megaraid driver and have different profiles within the > >>>> driver. I have no idea which one actually panicked until removing > >>>> one card. > >>>> > >>>> It's MUCH worse when debugging new hardware and getting a panic > >>>> from, for example, the uncore code which binds to a PCI mapped > >>>> device. One device might work and the next one doesn't. And > >>>> then you can multiply that by seeing *many* panics at once and > >>>> trying to determine if the problem was on one specific socket, > >>>> die, or core. > >>> > >>> Would a dev_panic() interface that identified the device and > >>> driver help with this? 
> >> > >> ^^ the more I look at this problem, the more a dev_panic() that > >> would output a device specific message at panic time is what I > >> really need. > > I went down this road a bit and had a realization. The issue isn't > with printing something at panic time, but the *data* that is > output. Each PCI device is associated with a struct device. That > device struct's name is output for dev_dbg(), etc., commands. The > PCI subsystem sets the device struct name at drivers/pci/probe.c: > 1799 > > dev_set_name(&dev->dev, "%04x:%02x:%02x.%d", pci_domain_nr(dev->bus), > dev->bus->number, PCI_SLOT(dev->devfn), > PCI_FUNC(dev->devfn)); > > My problem really is that the above information is insufficient when > I (or a user) need to debug a system. The complexities of debugging > multiple broken driver loads would be much easier if I didn't have > to constantly add this output manually :). This *should* already be in the dmesg log: pci 0000:00:00.0: [8086:5910] type 00 class 0x060000 pci 0000:00:01.0: [8086:1901] type 01 class 0x060400 pci 0000:00:02.0: [8086:591b] type 00 class 0x030000 So if you had a dev_panic(), that message would include the bus/device/function number, and that would be enough to find the vendor/device ID from when the device was first enumerated. Or are you saying you can't get the part of the dmesg log that contains those vendor/device IDs? > Would you be okay with adding a *debug* parameter to expand the > device name to include the vendor & device ID pair? FWIW, I'm > somewhat against yet-another-kernel-option but that's really the > information I need. I could then add dev_dbg() statements in the > local_pci_probe() function.
On 3/4/21 10:50 AM, Bjorn Helgaas wrote: > On Thu, Mar 04, 2021 at 09:42:44AM -0500, Prarit Bhargava wrote: >> >> >> On 2/18/21 2:06 PM, Bjorn Helgaas wrote: >>> On Thu, Feb 18, 2021 at 01:36:35PM -0500, Prarit Bhargava wrote: >>>> On 1/26/21 10:12 AM, Bjorn Helgaas wrote: >>>>> On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote: >>>>>> On 1/26/21 8:53 AM, Leon Romanovsky wrote: >>>>>>> On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote: >>>>>>>> On 1/26/21 8:14 AM, Leon Romanovsky wrote: >>>>>>>>> On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote: >>>>>>>>>> Leon Romanovsky <leon@kernel.org> wrote: >>>>>>>>>>> On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote: >>>>>>>>>>>> There are two situations where driver load messages are helpful. >>>>>>>>>>>> >>>>>>>>>>>> 1) Some drivers silently load on devices and debugging driver or system >>>>>>>>>>>> failures in these cases is difficult. While some drivers (networking >>>>>>>>>>>> for example) may not completely initialize when the PCI driver probe() function >>>>>>>>>>>> has returned, it is still useful to have some idea of driver completion. >>>>>>>>>>> >>>>>>>>>>> Sorry, probably it is me, but I don't understand this use case. >>>>>>>>>>> Are you adding global to whole kernel command line boot argument to debug >>>>>>>>>>> what and when? >>>>>>>>>>> >>>>>>>>>>> During boot: >>>>>>>>>>> If device success, you will see it in /sys/bus/pci/[drivers|devices]/*. >>>>>>>>>>> If device fails, you should get an error from that device (fix the >>>>>>>>>>> device to return an error), or something immediately won't work and >>>>>>>>>>> you won't see it in sysfs. >>>>>>>>>> >>>>>>>>>> What if there is a panic during boot? There's no way to get to sysfs. >>>>>>>>>> That's the case where this is helpful. >>>>>>>>> >>>>>>>>> How? If you have kernel panic, it means you have much more worse problem >>>>>>>>> than not-supported device. If kernel panic was caused by the driver, you >>>>>>>>> will see call trace related to it. If kernel panic was caused by >>>>>>>>> something else, supported/not supported won't help here. >>>>>>>> >>>>>>>> I still have no idea *WHICH* device it was that the panic occurred on. >>>>>>> >>>>>>> The kernel panic is printed from the driver. There is one driver loaded >>>>>>> for all same PCI devices which are probed without relation to their >>>>>>> number.> >>>>>>> If you have host with ten same cards, you will see one driver and this >>>>>>> is where the problem and not in supported/not-supported device. >>>>>> >>>>>> That's true, but you can also have different cards loading the same driver. >>>>>> See, for example, any PCI_IDs list in a driver. >>>>>> >>>>>> For example, >>>>>> >>>>>> 10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02) >>>>>> 20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02) >>>>>> >>>>>> Both load the megaraid driver and have different profiles within the >>>>>> driver. I have no idea which one actually panicked until removing >>>>>> one card. >>>>>> >>>>>> It's MUCH worse when debugging new hardware and getting a panic >>>>>> from, for example, the uncore code which binds to a PCI mapped >>>>>> device. One device might work and the next one doesn't. And >>>>>> then you can multiply that by seeing *many* panics at once and >>>>>> trying to determine if the problem was on one specific socket, >>>>>> die, or core. 
>>>>> >>>>> Would a dev_panic() interface that identified the device and >>>>> driver help with this? >>>> >>>> ^^ the more I look at this problem, the more a dev_panic() that >>>> would output a device specific message at panic time is what I >>>> really need. >> >> I went down this road a bit and had a realization. The issue isn't >> with printing something at panic time, but the *data* that is >> output. Each PCI device is associated with a struct device. That >> device struct's name is output for dev_dbg(), etc., commands. The >> PCI subsystem sets the device struct name at drivers/pci/probe.c: >> 1799 >> >> dev_set_name(&dev->dev, "%04x:%02x:%02x.%d", pci_domain_nr(dev->bus), >> dev->bus->number, PCI_SLOT(dev->devfn), >> PCI_FUNC(dev->devfn)); >> >> My problem really is that the above information is insufficient when >> I (or a user) need to debug a system. The complexities of debugging >> multiple broken driver loads would be much easier if I didn't have >> to constantly add this output manually :). > > This *should* already be in the dmesg log: > > pci 0000:00:00.0: [8086:5910] type 00 class 0x060000 > pci 0000:00:01.0: [8086:1901] type 01 class 0x060400 > pci 0000:00:02.0: [8086:591b] type 00 class 0x030000 > > So if you had a dev_panic(), that message would include the > bus/device/function number, and that would be enough to find the > vendor/device ID from when the device was first enumerated. > > Or are you saying you can't get the part of the dmesg log that > contains those vendor/device IDs? /me hangs head in shame I didn't notice that until now. :) Uh thanks for the polite hit with a cluebat :) I *think* that will work. Let me try some additional driver failure tests. P. > >> Would you be okay with adding a *debug* parameter to expand the >> device name to include the vendor & device ID pair? FWIW, I'm >> somewhat against yet-another-kernel-option but that's really the >> information I need. I could then add dev_dbg() statements in the >> local_pci_probe() function. >
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 23f209c8c19a..32ecee6a4ef0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3793,6 +3793,8 @@
 		nomio		[S390] Do not use MIO instructions.
 		norid		[S390] ignore the RID field and force use of
 				one PCI domain per PCI function
+		driver_load_messages
+				Output driver load status messages.
 
 	pcie_aspm=	[PCIE] Forcibly enable or disable PCIe Active State Power
 			Management.
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 8b587fc97f7b..35d5b6973578 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -23,6 +23,8 @@
 #include "pci.h"
 #include "pcie/portdrv.h"
 
+bool driver_load_messages;
+
 struct pci_dynid {
 	struct list_head node;
 	struct pci_device_id id;
@@ -305,10 +307,20 @@ static long local_pci_probe(void *_ddi)
 	 */
 	pm_runtime_get_sync(dev);
 	pci_dev->driver = pci_drv;
+	if (driver_load_messages)
+		pci_info(pci_dev, "loading on device [%0x:%0x]\n",
+			 pci_dev->vendor, pci_dev->device);
 	rc = pci_drv->probe(pci_dev, ddi->id);
-	if (!rc)
+	if (!rc) {
+		if (driver_load_messages)
+			pci_info(pci_dev, "loaded on device [%0x:%0x]\n",
+				 pci_dev->vendor, pci_dev->device);
 		return rc;
+	}
 	if (rc < 0) {
+		if (driver_load_messages)
+			pci_info(pci_dev, "failed (%d) on device [%0x:%0x]\n",
+				 rc, pci_dev->vendor, pci_dev->device);
 		pci_dev->driver = NULL;
 		pm_runtime_put_sync(dev);
 		return rc;
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e578d34095e9..c64b3b6e1e8d 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6547,6 +6547,8 @@ static int __init pci_setup(char *str)
 				pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);
 			} else if (!strncmp(str, "disable_acs_redir=", 18)) {
 				disable_acs_redir_param = str + 18;
+			} else if (!strncmp(str, "driver_load_messages", 24)) {
+				driver_load_messages = true;
 			} else {
 				pr_err("PCI: Unknown option `%s'\n", str);
 			}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index f86cae9aa1f4..db1218e188ac 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -690,4 +690,5 @@ static inline int pci_acpi_program_hp_params(struct pci_dev *dev)
 
 extern const struct attribute_group aspm_ctrl_attr_group;
 #endif
+extern bool driver_load_messages;
 #endif /* DRIVERS_PCI_H */
There are two situations where driver load messages are helpful.

1) Some drivers silently load on devices and debugging driver or system
failures in these cases is difficult.  While some drivers (networking for
example) may not completely initialize when the PCI driver probe() function
has returned, it is still useful to have some idea of driver completion.

2) Storage and Network device vendors have relatively short lives for some
of their hardware.  Some devices may continue to function but are
problematic due to out-of-date firmware or other issues.  Maintaining a
database of the hardware is out of the question in the kernel as it would
require constant updating.  Outputting a message in the log would allow
different OSes to determine if the problem hardware was truly supported or
not.

Add optional driver load messages from the PCI core that indicate which
driver was loaded, on which slot, and on which device.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Myron Stowe <mstowe@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-doc@vger.kernel.org
---
 Documentation/admin-guide/kernel-parameters.txt |  2 ++
 drivers/pci/pci-driver.c                        | 14 +++++++++++++-
 drivers/pci/pci.c                               |  2 ++
 drivers/pci/pci.h                               |  1 +
 4 files changed, 18 insertions(+), 1 deletion(-)