Message ID | 20250115074301.3514927-1-pandoh@google.com (mailing list archive) |
---|---|
Series | Rate limit AER logs/IRQs |
Hi Jon,

Can you share the base commit used here? I would like to try the patchset.

Regards,
Terry

On 1/15/2025 1:42 AM, Jon Pan-Doh wrote:
> Proposal
> ========
>
> When using native AER, spammy devices can flood kernel logs with AER errors
> and slow/stall execution. Add per-device per-error-severity ratelimits
> for more robust error logging. Allow userspace to configure ratelimits
> via sysfs knobs.
>
> Motivation
> ==========
>
> Several OCP members have issues with inconsistent PCIe error handling,
> exacerbated at datacenter scale (myriad of devices).
> OCP HW/Fault Management subproject set out to solve this by
> standardizing industry:
>
> - PCIe error handling best practices
> - Fault Management/RAS (incl. PCIe errors)
>
> Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
> rasdaemon) to collect/pass on to repairability services is part of the
> roadmap.
>
> Background
> ==========
>
> AER error spam has been observed many times, both publicly (e.g. [1], [2],
> [3]) and privately. While it usually occurs with correctable errors, it can
> happen with uncorrectable errors (e.g. during new HW bringup).
>
> There have been previous attempts to add ratelimits to AER logs ([4],
> [5]). The most recent attempt[5] has many similarities with the proposed
> approach.
>
> Patch organization
> ==================
> 1-3 AER logging cleanup
> 4-7 Ratelimits and sysfs knobs
> 8 Sysfs cleanup (RFC that breaks existing ABI/can be dropped)
>
> Outstanding work
> ================
> Cleanup:
> - Consolidate aer_print_error() and pci_print_error() path
> - Elevate log level logic out of print functions[6]
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
> [4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/
> [5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
> [6] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@oracle.com/
>
> Jon Pan-Doh (8):
>   PCI/AER: Remove aer_print_port_info
>   PCI/AER: Move AER stat collection out of __aer_print_error
>   PCI/AER: Rename struct aer_stats to aer_info
>   PCI/AER: Introduce ratelimit for error logs
>   PCI/AER: Introduce ratelimit for AER IRQs
>   PCI/AER: Add AER sysfs attributes for ratelimits
>   PCI/AER: Update AER sysfs ABI filename
>   PCI/AER: Move AER sysfs attributes into separate directory
>
>  ...es-aer_stats => sysfs-bus-pci-devices-aer} |  50 +++-
>  Documentation/PCI/pcieaer-howto.rst           |  10 +-
>  drivers/pci/pci-sysfs.c                       |   2 +-
>  drivers/pci/pci.h                             |   2 +-
>  drivers/pci/pcie/aer.c                        | 227 +++++++++++++-----
>  include/linux/pci.h                           |   2 +-
>  6 files changed, 216 insertions(+), 77 deletions(-)
>  rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (69%)
>
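For readers following the thread, the "per-device per-error-severity
ratelimits" idea in the quoted proposal can be pictured with a small
userspace sketch. This is not code from the series. It assumes a simple
interval/burst scheme (at most `burst` log lines per `interval_ms` window),
which is only loosely modelled on how the kernel's generic ratelimit helper
behaves. Every name below (`struct dev_aer_limits`, `aer_log_allowed()`,
the `SEV_*` constants) is hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical severity classes; one limiter is kept per class. */
enum aer_sev { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL, SEV_COUNT };

/* Interval/burst limiter: at most `burst` events per `interval_ms` window. */
struct ratelimit {
	uint64_t interval_ms;     /* window length; 0 disables limiting */
	unsigned int burst;       /* events allowed per window */
	uint64_t window_start_ms; /* start of the current window */
	unsigned int printed;     /* events allowed in the current window */
	unsigned int missed;      /* events suppressed so far (cumulative) */
};

/* Assumed per-device state: independent limiters for each severity. */
struct dev_aer_limits {
	struct ratelimit rl[SEV_COUNT];
};

static void dev_aer_limits_init(struct dev_aer_limits *d,
				uint64_t interval_ms, unsigned int burst)
{
	for (int i = 0; i < SEV_COUNT; i++)
		d->rl[i] = (struct ratelimit){
			.interval_ms = interval_ms,
			.burst = burst,
		};
}

/* Returns true if an error of severity `sev` may be logged at `now_ms`. */
static bool aer_log_allowed(struct dev_aer_limits *d, enum aer_sev sev,
			    uint64_t now_ms)
{
	assert(sev < SEV_COUNT);
	struct ratelimit *rl = &d->rl[sev];

	if (rl->interval_ms == 0)
		return true;	/* limiting disabled */

	/* Open a fresh window once the previous one has expired. */
	if (now_ms - rl->window_start_ms >= rl->interval_ms) {
		rl->window_start_ms = now_ms;
		rl->printed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}
```

In this sketch a flood of correctable errors exhausts only the correctable
limiter of that one device, so a later fatal error on the same device, or
any error on another device, would still be logged. A sysfs knob as
described in the proposal would correspond to letting userspace change
`interval_ms` and `burst`; how the series actually exposes that is not
shown here.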
On Thu, Jan 23, 2025 at 7:18 AM Bowman, Terry <terry.bowman@amd.com> wrote:
> Can you share the base commit used here? I would like to try the patchset.

Sure, it's 7f5b6a8ec18e3add4c74682f60b90c31bdf849f2 ("Merge tag
'pci-v6.13-fixes-3' of
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci").

As Karolina pointed out[1], there is a chance of conflicts (e.g. TLP
log/print consolidation series) with pci/err and pci-next branches.
The next version will be rebased on top of one of those branches.

Thanks,
Jon

[1] https://lore.kernel.org/linux-pci/8be04d4f-c9e8-4ed2-bf6a-3550d51eb972@oracle.com/
On Thu, Jan 23, 2025 at 10:46:29PM -0800, Jon Pan-Doh wrote:
> On Thu, Jan 23, 2025 at 7:18AM Bowman, Terry <terry.bowman@amd.com> wrote:
> > Can you share the base commit used here? I would like to try the patchset.
>
> Sure, it's 7f5b6a8ec18e3add4c74682f60b90c31bdf849f2 ("Merge tag
> 'pci-v6.13-fixes-3' of
> git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci").
>
> As Karolina pointed out[1], there is a chance of conflicts (e.g. TLP
> log/print consolidation series) with pci/err and pci-next branches.
> The next version will be rebased on top of one of those branches.

Patch sets that are intended to be applied to pci.git should generally
be based on the most recent -rc1 release because pci.git contains a
collection of topic branches (each based on rc1) which are then merged
together to form the pull request for Linus.

Note that there are several other patch sets in flight which target AER:

Terry's CXL error handling (now at v5):
https://lore.kernel.org/linux-pci/20250107143852.3692571-1-terry.bowman@amd.com/

Shuai's endpoint error reporting (now at v2):
https://lore.kernel.org/linux-pci/20241112135419.59491-1-xueshuai@linux.alibaba.com/

You may want to double-check that your changes do not collide with
these other in-flight patch sets. Terry's is most far along and
may be applied in the upcoming cycle, though it's unclear to me
whether that'll be through the pci or cxl tree. Probably the former
to avoid merge conflicts?

Thanks,

Lukas
Hi Jon,

On 15/01/2025 08:42, Jon Pan-Doh wrote:
> Proposal
> ========
>
> When using native AER, spammy devices can flood kernel logs with AER errors
> and slow/stall execution. Add per-device per-error-severity ratelimits
> for more robust error logging. Allow userspace to configure ratelimits
> via sysfs knobs.

Do you have any update on the series?

I'm aware that a lot is happening in the AER code right now, so I was
thinking if it would be helpful to split up the series to get the logs
ratelimiting in sooner. There are some concerns about disabling error
generation that should be discussed, but I don't want them to block the
logs ratelimit changes. I think it would be good to fix this first to
save people (myself included) from overflown syslogs.

All the best,
Karolina

>
> Motivation
> ==========
>
> Several OCP members have issues with inconsistent PCIe error handling,
> exacerbated at datacenter scale (myriad of devices).
> OCP HW/Fault Management subproject set out to solve this by
> standardizing industry:
>
> - PCIe error handling best practices
> - Fault Management/RAS (incl. PCIe errors)
>
> Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
> rasdaemon) to collect/pass on to repairability services is part of the
> roadmap.
>
> Background
> ==========
>
> AER error spam has been observed many times, both publicly (e.g. [1], [2],
> [3]) and privately. While it usually occurs with correctable errors, it can
> happen with uncorrectable errors (e.g. during new HW bringup).
>
> There have been previous attempts to add ratelimits to AER logs ([4],
> [5]). The most recent attempt[5] has many similarities with the proposed
> approach.
>
> Patch organization
> ==================
> 1-3 AER logging cleanup
> 4-7 Ratelimits and sysfs knobs
> 8 Sysfs cleanup (RFC that breaks existing ABI/can be dropped)
>
> Outstanding work
> ================
> Cleanup:
> - Consolidate aer_print_error() and pci_print_error() path
> - Elevate log level logic out of print functions[6]
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
> [4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/
> [5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
> [6] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@oracle.com/
>
> Jon Pan-Doh (8):
>   PCI/AER: Remove aer_print_port_info
>   PCI/AER: Move AER stat collection out of __aer_print_error
>   PCI/AER: Rename struct aer_stats to aer_info
>   PCI/AER: Introduce ratelimit for error logs
>   PCI/AER: Introduce ratelimit for AER IRQs
>   PCI/AER: Add AER sysfs attributes for ratelimits
>   PCI/AER: Update AER sysfs ABI filename
>   PCI/AER: Move AER sysfs attributes into separate directory
>
>  ...es-aer_stats => sysfs-bus-pci-devices-aer} |  50 +++-
>  Documentation/PCI/pcieaer-howto.rst           |  10 +-
>  drivers/pci/pci-sysfs.c                       |   2 +-
>  drivers/pci/pci.h                             |   2 +-
>  drivers/pci/pcie/aer.c                        | 227 +++++++++++++-----
>  include/linux/pci.h                           |   2 +-
>  6 files changed, 216 insertions(+), 77 deletions(-)
>  rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (69%)
>
On Thu, Feb 6, 2025 at 5:32 AM Karolina Stolarek <karolina.stolarek@oracle.com> wrote:
> Do you have any update on the series?
>
> I'm aware that a lot is happening in the AER code right now, so I was
> thinking if it would be helpful to split up the series to get the logs
> ratelimiting in sooner. There are some concerns about disabling error
> generation that should be discussed, but I don't want them to block the
> logs ratelimit changes. I think it would be good to fix this first to
> save people (myself included) from overflown syslogs.

Sorry for the delayed response. I was on vacation and hadn't had time
to address the comments. I think splitting the series into log
ratelimits vs. irq ratelimits is a good idea as we continue to discuss
the latter. I'll aim to send out v2 by the end of week.

One outstanding item (mentioned off-list) is Bjorn's desire to
consolidate the logging paths (aer_print_error() for AER IRQ and
pci_print_error() for CXL/GHES) as a prerequisite (clean up/reduce
tech debt). Maybe you could help with this?

Thanks,
Jon
On 13/02/2025 00:19, Jon Pan-Doh wrote:
>
> Sorry for the delayed response. I was on vacation and hadn't had time
> to address the comments.

It's alright, there's no need to apologize for it :)

> I think splitting the series into log
> ratelimits vs. irq ratelimits is a good idea as we continue to discuss
> the latter. I'll aim to send out v2 by the end of week.

OK, sounds good

> One outstanding item (mentioned off-list) is Bjorn's desire to
> consolidate the logging paths (aer_print_error() for AER IRQ and
> pci_print_error() for CXL/GHES) as a prerequisite (clean up/reduce
> tech debt). Maybe you could help with this?

I'd need to dive into CXL part and ramp up, but I think that's something
I can help with. Does it mean that you'd rebase this series on the top
of the proposed cleanup?

All the best,
Karolina

>
> Thanks,
> Jon
On Thu, Feb 13, 2025 at 8:00 AM Karolina Stolarek <karolina.stolarek@oracle.com> wrote:
> I'd need to dive into CXL part and ramp up, but I think that's something
> I can help with. Does it mean that you'd rebase this series on the top
> of the proposed cleanup?

Yeah. Either that or you can append to the beginning of this series as
the first few patches are AER cleanup. The former is probably easier
depending on how large the patch(es) are (i.e. I will rebase the
ratelimit series on top of AER log cleanup). It may even make sense to
absorb the first few patches of this series into the cleanup effort.

Thanks for the help,
Jon