Message ID: 20191204212325.c4k47p5hrnn3vpb5@redhat.com (mailing list archive)
State: New, archived
Series: EDAC: skx_common: downgrade message importance on missing PCI device
> EDAC skx: Can't get tolm/tohm
>
> for each CPU core, which is noisy. This patch makes it a debug message.

This looks like we call skx_init() once per core. Do we keep calling it
because the calls are failing? Or do we do that even when calls succeed?

I was only really expecting that skx_init() would be called once.

-Tony
> This looks like we call skx_init() once per core. Do we keep calling it because
> the calls are failing? Or do we do that even when calls succeed?
>
> I was only really expecting that skx_init() would be called once.

So (by experimentation) it seems that if the module load fails it
will be retried num_online_cpus times (though not bound to each
CPU in turn ... it will maybe try the init call on the same CPU multiple
times, but miss running on some CPUs).

If the load succeeds, then whoever is repeating the load decides
to stop.

-Tony
On Tue, Dec 10, 2019 at 12:02:45AM +0000, Luck, Tony wrote:
> > This looks like we call skx_init() once per core. Do we keep calling it because
> > the calls are failing? Or do we do that even when calls succeed?
> >
> > I was only really expecting that skx_init() would be called once.
>
> So (by experimentation) it seems that if the module load fails it
> will be retried num_online_cpus times (though not bound to each
> CPU in turn ... it will maybe try the init call on the same CPU multiple
> times, but miss running on some CPUs).
>
> If the load succeeds, then whoever is repeating the load decides
> to stop.

That's the result of our conversion to MODULE_DEVICE_TABLE to match CPU
models. So it tries once on each CPU:

https://lkml.kernel.org/r/20191107103857.GC19501@zn.tnic

I have no clean solution for this except maybe remembering the return
value of the first instance probing in the edac core module and then
asking it... it ain't pretty though.
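A minimal sketch of the mechanism being described here, assuming the usual
shape of such tables rather than quoting skx_base.c verbatim: the x86cpu
MODULE_DEVICE_TABLE entry makes udev fire one modprobe per matching logical
CPU, so a failing skx_init() gets retried per CPU while a successful load
turns the later modprobes into no-ops.

#include <linux/module.h>
#include <linux/mod_devicetable.h>
#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>

/* One MODALIAS uevent fires per logical CPU that matches this table. */
static const struct x86_cpu_id skx_cpuids[] = {
	{ X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X, 0, 0 },
	{ }
};
MODULE_DEVICE_TABLE(x86cpu, skx_cpuids);

static int __init skx_init(void)
{
	/* Runs on every load attempt, i.e. potentially once per CPU uevent. */
	if (!x86_match_cpu(skx_cpuids))
		return -ENODEV;

	/* ... look up the memory controller PCI devices, register MCs ... */
	return 0;
}
module_init(skx_init);

MODULE_LICENSE("GPL v2");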
On Tue, Dec 10, 2019 at 12:02:45AM +0000, Luck, Tony wrote:
> > This looks like we call skx_init() once per core. Do we keep calling it because
> > the calls are failing? Or do we do that even when calls succeed?
> >
> > I was only really expecting that skx_init() would be called once.
>
> So (by experimentation) it seems that if the module load fails it
> will be retried num_online_cpus times (though not bound to each
> CPU in turn ... it will maybe try the init call on the same CPU multiple
> times, but miss running on some CPUs).
>
> If the load succeeds, then whoever is repeating the load decides
> to stop.

Or silently fails to load the module again for all online cpus.
On Tue, Dec 10, 2019 at 10:00:13AM +0100, Borislav Petkov wrote:
> On Tue, Dec 10, 2019 at 12:02:45AM +0000, Luck, Tony wrote:
> > > This looks like we call skx_init() once per core. Do we keep calling it because
> > > the calls are failing? Or do we do that even when calls succeed?
> > >
> > > I was only really expecting that skx_init() would be called once.
> >
> > So (by experimentation) it seems that if the module load fails it
> > will be retried num_online_cpus times (though not bound to each
> > CPU in turn ... it will maybe try the init call on the same CPU multiple
> > times, but miss running on some CPUs).
> >
> > If the load succeeds, then whoever is repeating the load decides
> > to stop.
>
> That's the result of our conversion to MODULE_DEVICE_TABLE to match CPU
> models. So it tries once on each CPU:
>
> https://lkml.kernel.org/r/20191107103857.GC19501@zn.tnic
>
> I have no clean solution for this except maybe remembering the return
> value of the first instance probing in the edac core module and then
> asking it... it ain't pretty though.

The other option I can think about is just allowing the module to load
in this situation: the CPU is present but you have no memory controller
PCI devices present. Some drivers will allow loading without having a
device present, without errors. It's no cleaner than having to come up
with something modutils can pick up as a hint to not try to load more
than once.

Or we could just downgrade this specific message since it's a very common
case and let the more unusual situations report more than once.
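For illustration, a hypothetical sketch of that first alternative, not what
was eventually merged (the early-return value is the assumption here): the
probe path would treat a missing memory-controller PCI device as "nothing
to drive" rather than a fatal error, so the module stays loaded quietly and
modprobe is not retried.

	/* Hypothetical "load anyway" variant -- NOT the merged behaviour. */
	pdev = pci_get_device(PCI_VENDOR_ID_INTEL, did, NULL);
	if (!pdev) {
		edac_dbg(2, "Can't get tolm/tohm\n");
		return 0;	/* no hardware: succeed, register nothing */
	}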
On Mon, Jan 06, 2020 at 10:12:42AM -0500, 'Aristeu Rozanski' wrote:
> The other option I can think about is just allowing the module to load
> in this situation: the CPU is present but you have no memory controller
> PCI devices present. Some drivers will allow loading without having a
> device present, without errors. It's no cleaner than having to come up
> with something modutils can pick up as a hint to not try to load more
> than once.

Yeah, but having a driver loaded when there's no hw to drive is also not
pretty.

> Or we could just downgrade this specific message since it's a very common
> case and let the more unusual situations report more than once.

Yap, we did a similar thing for amd64_edac:

  7fdfee926be7 ("EDAC/amd64: Get rid of the ECC disabled long message")

but instead of downgrading prio we actually went and killed it directly.

:-)
On Mon, Jan 06, 2020 at 05:17:32PM +0100, Borislav Petkov wrote:
> On Mon, Jan 06, 2020 at 10:12:42AM -0500, 'Aristeu Rozanski' wrote:
> > The other option I can think about is just allowing the module to load
> > in this situation: the CPU is present but you have no memory controller
> > PCI devices present. Some drivers will allow loading without having a
> > device present, without errors. It's no cleaner than having to come up
> > with something modutils can pick up as a hint to not try to load more
> > than once.
>
> Yeah, but having a driver loaded when there's no hw to drive is also not
> pretty.
>
> > Or we could just downgrade this specific message since it's a very common
> > case and let the more unusual situations report more than once.
>
> Yap, we did a similar thing for amd64_edac:
>
>   7fdfee926be7 ("EDAC/amd64: Get rid of the ECC disabled long message")
>
> but instead of downgrading prio we actually went and killed it directly.
>
> :-)

OK, will resubmit this patch just removing the messages then.

Thanks!
On Mon, Jan 06, 2020 at 11:20:14AM -0500, 'Aristeu Rozanski' wrote:
> OK, will resubmit this patch just removing the messages then.
I'm not saying you should blindly remove them. They might be useful for
debugging purposes so you should consider that usage angle first. In the
AMD case, the message was really useless.
On Mon, Jan 06, 2020 at 05:23:06PM +0100, Borislav Petkov wrote:
> I'm not saying you should blindly remove them. They might be useful for
> debugging purposes so you should consider that usage angle first. In the
> AMD case, the message was really useless.

Ah, I thought you had an objection to this patch :)
Do you mind considering it for inclusion?
On Tue, Jan 07, 2020 at 10:51:09AM -0500, 'Aristeu Rozanski' wrote:
> On Mon, Jan 06, 2020 at 05:23:06PM +0100, Borislav Petkov wrote:
> > I'm not saying you should blindly remove them. They might be useful for
> > debugging purposes so you should consider that usage angle first. In the
> > AMD case, the message was really useless.
>
> Ah, I thought you had an objection to this patch :)
> Do you mind considering it for inclusion?

That's Tony's call as he's maintaining the Intel side of EDAC.
On Tue, Jan 07, 2020 at 05:45:28PM +0100, Borislav Petkov wrote:
> On Tue, Jan 07, 2020 at 10:51:09AM -0500, 'Aristeu Rozanski' wrote:
> > On Mon, Jan 06, 2020 at 05:23:06PM +0100, Borislav Petkov wrote:
> > > I'm not saying you should blindly remove them. They might be useful for
> > > debugging purposes so you should consider that usage angle first. In the
> > > AMD case, the message was really useless.
> >
> > Ah, I thought you had an objection to this patch :)
> > Do you mind considering it for inclusion?
>
> That's Tony's call as he's maintaining the Intel side of EDAC.

Already applied to git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git
in edac-for-next branch. Sorry, should have sent you an "Applied" message.

-Tony
On Tue, Jan 07, 2020 at 01:43:10PM -0800, Luck, Tony wrote:
> Already applied to git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git
> in edac-for-next branch. Sorry, should have sent you an "Applied" message.

Thanks Tony.
diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c
index 95662a4ff4c4..99bbaf629b8d 100644
--- a/drivers/edac/skx_common.c
+++ b/drivers/edac/skx_common.c
@@ -256,7 +256,7 @@ int skx_get_hi_lo(unsigned int did, int off[], u64 *tolm, u64 *tohm)
 
 	pdev = pci_get_device(PCI_VENDOR_ID_INTEL, did, NULL);
 	if (!pdev) {
-		skx_printk(KERN_ERR, "Can't get tolm/tohm\n");
+		edac_dbg(2, "Can't get tolm/tohm\n");
 		return -ENODEV;
 	}
 
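What the downgrade means in practice, paraphrasing the edac_dbg() gating
from the EDAC core headers (a sketch, not a verbatim quote): without
CONFIG_EDAC_DEBUG the call compiles away entirely, and with it the message
is only emitted when the edac_debug_level module parameter is at least the
requested level, 2 here.

/* Paraphrased sketch of the edac_dbg() gating in the EDAC core headers. */
#ifdef CONFIG_EDAC_DEBUG
#define edac_dbg(level, fmt, ...)					\
do {									\
	if (level <= edac_debug_level)					\
		edac_printk(KERN_DEBUG, EDAC_DEBUG, fmt, ##__VA_ARGS__);\
} while (0)
#else
#define edac_dbg(level, fmt, ...)	do { } while (0)
#endif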
Both the skx_edac and i10nm_edac drivers are loaded based on the matching
CPU being available, which leads to the module being automatically loaded
in virtual machines as well. That load will fail due to the missing PCI
devices. In both drivers the first function to make use of the PCI devices
is skx_get_hi_lo(), which will simply print

	EDAC skx: Can't get tolm/tohm

for each CPU core, which is noisy. This patch makes it a debug message.

Signed-off-by: Aristeu Rozanski <aris@redhat.com>