Message ID | 20210505173027.78428-1-wangglei@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | EDAC: update edac printk wrappers to use printk_ratelimited. | expand |
On Wed, May 05, 2021 at 10:30:27AM -0700, Lei Wang wrote: > Update printk to the ratelimited version, so that in some corner cases when > vast of CE errors show up, the kernel logging can be suppressed. Err, why?
Hi Boris, We found a corner case in production environment that there are ~500 CE errors per second. The SoC otherwise functions just fine. Making printk ratelimited reduced CE error logging to < 20 per second. Though this is just one case so far, we think moving to printk_ratelimited could benefit broader use as well, by helping control the amount of kernel logging. In most running condition, the error rate is way below the limit. And in an error case like this one, vast error logging would not provide much valuable details, rather it's storming the kernel logging. Thanks, -Lei -----Original Message----- From: Borislav Petkov <bp@alien8.de> Sent: Wednesday, May 5, 2021 11:01 AM To: wangglei <wangglei@gmail.com> Cc: mchehab@kernel.org; tony.luck@intel.com; james.morse@arm.com; rric@kernel.org; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org; Lei Wang (DPLAT) <Wang.Lei@microsoft.com>; Hang Li <hangl@microsoft.com>; tyhicks@linux.microsoft.com; Brandon Waller <bwaller@microsoft.com> Subject: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited. On Wed, May 05, 2021 at 10:30:27AM -0700, Lei Wang wrote: > Update printk to the ratelimited version, so that in some corner cases > when vast of CE errors show up, the kernel logging can be suppressed. Err, why? -- Regards/Gruss, Boris. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.kernel.org%2Ftglx%2Fnotes-about-netiquette&data=04%7C01%7CWang.Lei%40microsoft.com%7C71421584bc2a43951df908d90fefc1b9%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637558344708605379%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=fJG0%2Fdk8VCVNIGS0kM2BZDHAXLVcq4CLHEajhND0rzg%3D&reserved=0
Hi Lei, On Wed, May 05, 2021 at 07:02:14PM +0000, Lei Wang (DPLAT) wrote: > Hi Boris, first of all, please do not top-post. > We found a corner case in production environment that there are ~500 > CE errors per second. The SoC otherwise functions just fine. Making > printk ratelimited reduced CE error logging to < 20 per second. If you want to avoid CE logs flooding dmesg, there's a couple of things you can do: 1. Use drivers/ras/cec.c 2. Do not load EDAC drivers at all since you don't care about the error reports, apparently. 3. Fix the CE source: replace the DIMMs, etc. > Though this is just one case so far, we think moving to > printk_ratelimited could benefit broader use as well, by helping > control the amount of kernel logging. No, this will make EDAC driver loading output incomplete when some of the messages are omitted due to the ratelimiting. And no, this is not going to happen. HTH.
On 2021-05-05 21:45:01, Borislav Petkov wrote: > Hi Lei, > > On Wed, May 05, 2021 at 07:02:14PM +0000, Lei Wang (DPLAT) wrote: > > Hi Boris, > > first of all, please do not top-post. > > > We found a corner case in production environment that there are ~500 > > CE errors per second. The SoC otherwise functions just fine. Making > > printk ratelimited reduced CE error logging to < 20 per second. > > If you want to avoid CE logs flooding dmesg, there's a couple of things > you can do: > > 1. Use drivers/ras/cec.c > > 2. Do not load EDAC drivers at all since you don't care about the error > reports, apparently. Lei, if you don't care about the CE error messages at all, there's also an edac_mc_log_ce module parameter that can be used to quiet the message emitted from edac_ce_error(): https://www.kernel.org/doc/html/latest/admin-guide/ras.html#module-parameters > 3. Fix the CE source: replace the DIMMs, etc. > > > Though this is just one case so far, we think moving to > > printk_ratelimited could benefit broader use as well, by helping > > control the amount of kernel logging. > > No, this will make EDAC driver loading output incomplete when some of > the messages are omitted due to the ratelimiting. And no, this is not > going to happen. Boris, I agree that a more surgical approach is needed than this if Lei still needs some traces of the CE error messages in the logs. Would it be any more acceptable to add an edac_mc_printk_ratelimited() macro, which uses printk_ratelimited(), and then call that new macro from edac_ce_error()? If you still don't want those CE errors ratelimited by default, perhaps a new, non-default mode (2) could be added to the edac_mc_log_ce module parameter that uses the ratelimited variant? Tyler > > HTH. > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette >
On Wed, May 05, 2021 at 03:23:57PM -0500, Tyler Hicks wrote: > Would it be any more acceptable to add an > edac_mc_printk_ratelimited() macro, which uses printk_ratelimited(), > and then call that new macro from edac_ce_error()? You guys are way off here: the intent of EDAC drivers is to accurately report errors for purposes of counting them and doing analysis on that collected data as to whether components are going wrong - not to ratelimit them as some nuisance output. With breaking the EDAC reporting, you're barking up the wrong tree - if you don't want to see those errors, do not load the drivers. It is that simple.
On 2021-05-05 23:04:44, Borislav Petkov wrote: > On Wed, May 05, 2021 at 03:23:57PM -0500, Tyler Hicks wrote: > > Would it be any more acceptable to add an > > edac_mc_printk_ratelimited() macro, which uses printk_ratelimited(), > > and then call that new macro from edac_ce_error()? > > You guys are way off here: the intent of EDAC drivers is to accurately > report errors for purposes of counting them and doing analysis on > that collected data as to whether components are going wrong - not to > ratelimit them as some nuisance output. > > With breaking the EDAC reporting, you're barking up the wrong tree - if > you don't want to see those errors, do not load the drivers. It is that > simple. As I understand it, the idea here wasn't to treat the log messages as a nuisance that should be completely squelched. The counters are monitored and provide the definitive way to detect large scale problems but the CE log messages can be an easier-to-discover way for humans to identify potential problems when, for example, centralized log aggregation and indexing is in place. The thought was that the full stream of log messages isn't necessary to notice that there's a problem when they are being emitted at such a high rate (500 per second). They're just filling up disk space and/or wasting networking bandwidth at that point. Of course, the best course of action here is to service the machine but there's still a period of time between the CE errors popping up and the machine being serviced. Tyler
On Wed, May 05, 2021 at 04:48:46PM -0500, Tyler Hicks wrote: > The thought was that the full stream of log messages isn't necessary to > notice that there's a problem when they are being emitted at such a high > rate (500 per second). They're just filling up disk space and/or wasting > networking bandwidth at that point. I already asked about this but lemme point it out again: have you guys looked at drivers/ras/cec.c ? With that there won't be *any* error reports in dmesg and it will even poison and offline pages which generate excessive errors so that ... > Of course, the best course of action here is to service the machine > but there's still a period of time between the CE errors popping up > and the machine being serviced. ... you'll have ample time to service the machine.
On 2021-05-06 00:02:44, Borislav Petkov wrote: > On Wed, May 05, 2021 at 04:48:46PM -0500, Tyler Hicks wrote: > > The thought was that the full stream of log messages isn't necessary to > > notice that there's a problem when they are being emitted at such a high > > rate (500 per second). They're just filling up disk space and/or wasting > > networking bandwidth at that point. > > I already asked about this but lemme point it out again: have you guys > looked at drivers/ras/cec.c ? We'll have a closer look. Thanks for the pointer! Tyler > > With that there won't be *any* error reports in dmesg and it will even > poison and offline pages which generate excessive errors so that ... > > > Of course, the best course of action here is to service the machine > > but there's still a period of time between the CE errors popping up > > and the machine being serviced. > > ... you'll have ample time to service the machine. > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette >
On 2021-05-05 17:16:11, Tyler Hicks wrote: > On 2021-05-06 00:02:44, Borislav Petkov wrote: > > On Wed, May 05, 2021 at 04:48:46PM -0500, Tyler Hicks wrote: > > > The thought was that the full stream of log messages isn't necessary to > > > notice that there's a problem when they are being emitted at such a high > > > rate (500 per second). They're just filling up disk space and/or wasting > > > networking bandwidth at that point. > > > > I already asked about this but lemme point it out again: have you guys > > looked at drivers/ras/cec.c ? > > We'll have a closer look. Thanks for the pointer! This is x86-specific and not applicable in our situation. Tyler > > Tyler > > > > > With that there won't be *any* error reports in dmesg and it will even > > poison and offline pages which generate excessive errors so that ... > > > > > Of course, the best course of action here is to service the machine > > > but there's still a period of time between the CE errors popping up > > > and the machine being serviced. > > > > ... you'll have ample time to service the machine. > > > > -- > > Regards/Gruss, > > Boris. > > > > https://people.kernel.org/tglx/notes-about-netiquette > >
On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote: > This is x86-specific That's because it is used by x86 currently. It shouldn't be hard to use it on another arch though as the machinery is pretty generic. > and not applicable in our situation. What is your situation? ARM?
On 2021-05-06 00:55:00, Borislav Petkov wrote: > On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote: > > This is x86-specific > > That's because it is used by x86 currently. It shouldn't be hard to use > it on another arch though as the machinery is pretty generic. > > > and not applicable in our situation. > > What is your situation? ARM? Yes, though I'm not sure if those additional features are important/useful enough for us to generalize that driver. The main motivation here was just to prevent storage/network from being flooded by obviously-bad nodes that haven't been offlined yet. :) Lei and others on cc will need to evaluate porting cec.c and what it will gain them. Thanks again. Tyler > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette >
>> What is your situation? ARM? > > Yes, though I'm not sure if those additional features are > important/useful enough for us to generalize that driver. The main > motivation here was just to prevent storage/network from being flooded > by obviously-bad nodes that haven't been offlined yet. :) > > Lei and others on cc will need to evaluate porting cec.c and what it > will gain them. Thanks again. Tyler, You might also look at the x86 "storm" detection code (tl;dr version "If error interrupts are coming too fast, turn off the interrupts and poll"). -Tony
Em Wed, 5 May 2021 18:01:52 -0500 Tyler Hicks <tyhicks@linux.microsoft.com> escreveu: > On 2021-05-06 00:55:00, Borislav Petkov wrote: > > On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote: > > > This is x86-specific > > > > That's because it is used by x86 currently. It shouldn't be hard to use > > it on another arch though as the machinery is pretty generic. > > > > > and not applicable in our situation. > > > > What is your situation? ARM? > > Yes, though I'm not sure if those additional features are > important/useful enough for us to generalize that driver. The main > motivation here was just to prevent storage/network from being flooded > by obviously-bad nodes that haven't been offlined yet. :) Well, if a machine starts to produce 500+ errors per second, then it should be offlined as soon as possible, as otherwise bad results will be produced ;-) See, the CE error reporting mechanism is meant to be used together with some error correction code algorithm like the ones used on ECC memories. Such algorithms are designed to detect a single errored bit with a change usually at the ~10⁻4 to 10^-7 order (the precision depends on how many bits are used and what algorithm is used), but if there are two wrong bits at the same word, the chance to detect is a lot lower. So, keeping the server enabled up to the point that it would consume enough resources at the storage/network to bother someone sounds a terrible idea, as sooner or later it will miss an error or produce an uncorrected event ;-) Besides that, if you're running rasdaemon to capture the hardware errors, the storage will also be flooded by something like that, even if you disable them from going to syslog via sys/module/edac_core/parameters/edac_mc_log_ce. Now, the question is: are those 500+ errors per second a real hardware problem, or is it due to some broken error report mechanism? In the latter case, the driver or the hardware that it is producing invalid errors should be fixed. > > Lei and others on cc will need to evaluate porting cec.c and what it > will gain them. Thanks again. Regards, Mauro
diff --git a/drivers/edac/edac_mc.h b/drivers/edac/edac_mc.h index 881b00eadf7a..355529317980 100644 --- a/drivers/edac/edac_mc.h +++ b/drivers/edac/edac_mc.h @@ -46,19 +46,19 @@ #endif #define edac_printk(level, prefix, fmt, arg...) \ - printk(level "EDAC " prefix ": " fmt, ##arg) + printk_ratelimited(level "EDAC " prefix ": " fmt, ##arg) #define edac_mc_printk(mci, level, fmt, arg...) \ - printk(level "EDAC MC%d: " fmt, mci->mc_idx, ##arg) + printk_ratelimited(level "EDAC MC%d: " fmt, mci->mc_idx, ##arg) #define edac_mc_chipset_printk(mci, level, prefix, fmt, arg...) \ - printk(level "EDAC " prefix " MC%d: " fmt, mci->mc_idx, ##arg) + printk_ratelimited(level "EDAC " prefix " MC%d: " fmt, mci->mc_idx, ##arg) #define edac_device_printk(ctl, level, fmt, arg...) \ - printk(level "EDAC DEVICE%d: " fmt, ctl->dev_idx, ##arg) + printk_ratelimited(level "EDAC DEVICE%d: " fmt, ctl->dev_idx, ##arg) #define edac_pci_printk(ctl, level, fmt, arg...) \ - printk(level "EDAC PCI%d: " fmt, ctl->pci_idx, ##arg) + printk_ratelimited(level "EDAC PCI%d: " fmt, ctl->pci_idx, ##arg) /* prefixes for edac_printk() and edac_mc_printk() */ #define EDAC_MC "MC"
Update printk to the ratelimited version, so that in some corner cases when vast of CE errors show up, the kernel logging can be suppressed. Signed-off-by: Lei Wang <wangglei@gmail.com> --- drivers/edac/edac_mc.h | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)