Message ID: 20230116193902.1315236-1-jiaqiyan@google.com
Series: Introduce per NUMA node memory error statistics
On Mon, 16 Jan 2023 19:38:59 +0000 Jiaqi Yan <jiaqiyan@google.com> wrote: > Background > ========== > In the RFC for Kernel Support of Memory Error Detection [1], one advantage > of software-based scanning over hardware patrol scrubber is the ability > to make statistics visible to system administrators. The statistics > include 2 categories: > * Memory error statistics, for example, how many memory error are > encountered, how many of them are recovered by the kernel. Note these > memory errors are non-fatal to kernel: during the machine check > exception (MCE) handling kernel already classified MCE's severity to > be unnecessary to panic (but either action required or optional). > * Scanner statistics, for example how many times the scanner have fully > scanned a NUMA node, how many errors are first detected by the scanner. > > The memory error statistics are useful to userspace and actually not > specific to scanner detected memory errors, and are the focus of this RFC. I assume this is a leftover and this is no longer "RFC". I'd normally sit back and await reviewer input, but this series is simple, so I'll slurp it up so we get some testing while that review is ongoing.
On Mon, Jan 16, 2023 at 12:13 PM Andrew Morton <akpm@linux-foundation.org> wrote: > On Mon, 16 Jan 2023 19:38:59 +0000 Jiaqi Yan <jiaqiyan@google.com> wrote: > > > Background > > ========== > > In the RFC for Kernel Support of Memory Error Detection [1], one > advantage > > of software-based scanning over hardware patrol scrubber is the ability > > to make statistics visible to system administrators. The statistics > > include 2 categories: > > * Memory error statistics, for example, how many memory error are > > encountered, how many of them are recovered by the kernel. Note these > > memory errors are non-fatal to kernel: during the machine check > > exception (MCE) handling kernel already classified MCE's severity to > > be unnecessary to panic (but either action required or optional). > > * Scanner statistics, for example how many times the scanner have fully > > scanned a NUMA node, how many errors are first detected by the scanner. > > > > The memory error statistics are useful to userspace and actually not > > specific to scanner detected memory errors, and are the focus of this > RFC. > > I assume this is a leftover and this is no longer "RFC". > > I'd normally sit back and await reviewer input, but this series is > simple, so I'll slurp it up so we get some testing while that review is > ongoing. > Ah, yes, my typo, my intent is PATCH. I did test the patches on several test hosts I have, but more testing is always better. Thanks, Andrew!
On Mon, Jan 16, 2023 at 07:38:59PM +0000, Jiaqi Yan wrote: > Background > ========== > In the RFC for Kernel Support of Memory Error Detection [1], one advantage > of software-based scanning over hardware patrol scrubber is the ability > to make statistics visible to system administrators. The statistics > include 2 categories: > * Memory error statistics, for example, how many memory error are > encountered, how many of them are recovered by the kernel. Note these > memory errors are non-fatal to kernel: during the machine check > exception (MCE) handling kernel already classified MCE's severity to > be unnecessary to panic (but either action required or optional). > * Scanner statistics, for example how many times the scanner have fully > scanned a NUMA node, how many errors are first detected by the scanner. > > The memory error statistics are useful to userspace and actually not > specific to scanner detected memory errors, and are the focus of this RFC. > > Motivation > ========== > Memory error stats are important to userspace but insufficient in kernel > today. Datacenter administrators can better monitor a machine's memory > health with the visible stats. For example, while memory errors are > inevitable on servers with 10+ TB memory, starting server maintenance > when there are only 1~2 recovered memory errors could be overreacting; > in cloud production environment maintenance usually means live migrate > all the workload running on the server and this usually causes nontrivial > disruption to the customer. Providing insight into the scope of memory > errors on a system helps to determine the appropriate follow-up action. > In addition, the kernel's existing memory error stats need to be > standardized so that userspace can reliably count on their usefulness. > > Today kernel provides following memory error info to userspace, but they > are not sufficient or have disadvantages: > * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total, > not per NUMA node stats though > * ras:memory_failure_event: only available after explicitly enabled > * /dev/mcelog provides many useful info about the MCEs, but doesn't > capture how memory_failure recovered memory MCEs > * kernel logs: userspace needs to process log text > > Exposing memory error stats is also a good start for the in-kernel memory > error detector. Today the data source of memory error stats are either > direct memory error consumption, or hardware patrol scrubber detection > (when signaled as UCNA; these signaled as SRAO are not handled by > memory_failure). Sorry, I don't follow this "(...)" part, so let me question. I thought that SRAO events are handled by memory_failure and UCNA events are not, so does this say the opposite? Other than that, the whole description sounds nice and convincing to me. Thank you for your work. - Naoya Horiguchi > Once in-kernel memory scanner is implemented, it will be > the main source as it is usually configured to scan memory DIMMs constantly > and faster than hardware patrol scrubber. > > How Implemented > =============== > As Naoya pointed out [2], exposing memory error statistics to userspace > is useful independent of software or hardware scanner. Therefore we > implement the memory error statistics independent of the in-kernel memory > error detector. 
It exposes the following per NUMA node memory error > counters: > > /sys/devices/system/node/node${X}/memory_failure/pages_poisoned > /sys/devices/system/node/node${X}/memory_failure/pages_recovered > /sys/devices/system/node/node${X}/memory_failure/pages_ignored > /sys/devices/system/node/node${X}/memory_failure/pages_failed > /sys/devices/system/node/node${X}/memory_failure/pages_delayed > > These counters describe how many raw pages are poisoned and after the > attempted recoveries by the kernel, their resolutions: how many are > recovered, ignored, failed, or delayed respectively. This approach can be > easier to extend for future use cases than /proc/meminfo, trace event, > and log. The following math holds for the statistics: > * pages_poisoned = pages_recovered + pages_ignored + pages_failed + > pages_delayed > * pages_poisoned * page_size = /proc/meminfo/HardwareCorrupted > These memory error stats are reset during machine boot. > > The 1st commit introduces these sysfs entries. The 2nd commit populates > memory error stats every time memory_failure finishes memory error > recovery. The 3rd commit adds documentations for introduced stats. > > [1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af > [2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6 > > Jiaqi Yan (3): > mm: memory-failure: Add memory failure stats to sysfs > mm: memory-failure: Bump memory failure stats to pglist_data > mm: memory-failure: Document memory failure stats > > Documentation/ABI/stable/sysfs-devices-node | 39 +++++++++++ > drivers/base/node.c | 3 + > include/linux/mm.h | 5 ++ > include/linux/mmzone.h | 28 ++++++++ > mm/memory-failure.c | 71 +++++++++++++++++++++ > 5 files changed, 146 insertions(+) > > -- > 2.39.0.314.g84b9a713c41-goog
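As a concrete illustration of the interface described above: the per-node counters and the documented invariant can be checked from userspace with a small C program along the following lines. This is only a sketch for illustration (not part of the series); it assumes the sysfs layout quoted above and hardcodes node 0.

#include <stdio.h>

/*
 * Read one per-node memory_failure counter; the path layout follows the
 * cover letter. Returns -1 if the file is missing (e.g. a kernel without
 * this series or without CONFIG_MEMORY_FAILURE).
 */
static long read_mf_stat(int node, const char *name)
{
        char path[128];
        long val;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/memory_failure/%s",
                 node, name);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (fscanf(f, "%ld", &val) != 1)
                val = -1;
        fclose(f);
        return val;
}

int main(void)
{
        int node = 0;  /* illustrative: check node 0 only */
        long poisoned  = read_mf_stat(node, "pages_poisoned");
        long recovered = read_mf_stat(node, "pages_recovered");
        long ignored   = read_mf_stat(node, "pages_ignored");
        long failed    = read_mf_stat(node, "pages_failed");
        long delayed   = read_mf_stat(node, "pages_delayed");

        printf("node%d: poisoned=%ld recovered=%ld ignored=%ld failed=%ld delayed=%ld\n",
               node, poisoned, recovered, ignored, failed, delayed);
        printf("invariant %s\n",
               poisoned == recovered + ignored + failed + delayed ?
               "holds" : "does not hold");
        return 0;
}

Built with any C compiler, it prints the five counters for node 0 and reports whether pages_poisoned equals the sum of the four resolution counters.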
On Tue, Jan 17, 2023 at 1:19 AM HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote: > > On Mon, Jan 16, 2023 at 07:38:59PM +0000, Jiaqi Yan wrote: > > Background > > ========== > > In the RFC for Kernel Support of Memory Error Detection [1], one advantage > > of software-based scanning over hardware patrol scrubber is the ability > > to make statistics visible to system administrators. The statistics > > include 2 categories: > > * Memory error statistics, for example, how many memory error are > > encountered, how many of them are recovered by the kernel. Note these > > memory errors are non-fatal to kernel: during the machine check > > exception (MCE) handling kernel already classified MCE's severity to > > be unnecessary to panic (but either action required or optional). > > * Scanner statistics, for example how many times the scanner have fully > > scanned a NUMA node, how many errors are first detected by the scanner. > > > > The memory error statistics are useful to userspace and actually not > > specific to scanner detected memory errors, and are the focus of this RFC. > > > > Motivation > > ========== > > Memory error stats are important to userspace but insufficient in kernel > > today. Datacenter administrators can better monitor a machine's memory > > health with the visible stats. For example, while memory errors are > > inevitable on servers with 10+ TB memory, starting server maintenance > > when there are only 1~2 recovered memory errors could be overreacting; > > in cloud production environment maintenance usually means live migrate > > all the workload running on the server and this usually causes nontrivial > > disruption to the customer. Providing insight into the scope of memory > > errors on a system helps to determine the appropriate follow-up action. > > In addition, the kernel's existing memory error stats need to be > > standardized so that userspace can reliably count on their usefulness. > > > > Today kernel provides following memory error info to userspace, but they > > are not sufficient or have disadvantages: > > * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total, > > not per NUMA node stats though > > * ras:memory_failure_event: only available after explicitly enabled > > * /dev/mcelog provides many useful info about the MCEs, but doesn't > > capture how memory_failure recovered memory MCEs > > * kernel logs: userspace needs to process log text > > > > Exposing memory error stats is also a good start for the in-kernel memory > > error detector. Today the data source of memory error stats are either > > direct memory error consumption, or hardware patrol scrubber detection > > (when signaled as UCNA; these signaled as SRAO are not handled by > > memory_failure). > > Sorry, I don't follow this "(...)" part, so let me question. I thought that > SRAO events are handled by memory_failure and UCNA events are not, so does > this say the opposite? I think UCNA is definitely handled by memory_failure, but I was not correct about SRAO. According to Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2, Section 15.6.3: SRAO can be signaled **either via MCE or CMCI**.
For SRAO signaled via **machine check exception**, my reading of the current x86 MCE code is this: 1) kill_current_task is initialized to 0, and as long as the restart IP is valid (MCG_STATUS_RIPV = 1), it remains 0: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/mce/core.c#n1473 2) after classifying severity, worst should be MCE_AO_SEVERITY: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/mce/core.c#n1496 3) therefore, do_machine_check just skips kill_me_now or kill_me_maybe, and directly goto out: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/mce/core.c#n1539 For UCNA and SRAO signaled via CMCI, the CMCI handler should eventually call into memory_failure via uc_decode_notifier (MCE_UCNA_SEVERITY == MCE_DEFERRED_SEVERITY): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/mce/core.c#n579 So it seems the signaling mechanism matters. > > Other than that, the whole description sounds nice and convincing to me. > Thank you for your work. > > - Naoya Horiguchi > > > Once in-kernel memory scanner is implemented, it will be > > the main source as it is usually configured to scan memory DIMMs constantly > > and faster than hardware patrol scrubber. > > > > How Implemented > > =============== > > As Naoya pointed out [2], exposing memory error statistics to userspace > > is useful independent of software or hardware scanner. Therefore we > > implement the memory error statistics independent of the in-kernel memory > > error detector. It exposes the following per NUMA node memory error > > counters: > > > > /sys/devices/system/node/node${X}/memory_failure/pages_poisoned > > /sys/devices/system/node/node${X}/memory_failure/pages_recovered > > /sys/devices/system/node/node${X}/memory_failure/pages_ignored > > /sys/devices/system/node/node${X}/memory_failure/pages_failed > > /sys/devices/system/node/node${X}/memory_failure/pages_delayed > > > > These counters describe how many raw pages are poisoned and after the > > attempted recoveries by the kernel, their resolutions: how many are > > recovered, ignored, failed, or delayed respectively. This approach can be > > easier to extend for future use cases than /proc/meminfo, trace event, > > and log. The following math holds for the statistics: > > * pages_poisoned = pages_recovered + pages_ignored + pages_failed + > > pages_delayed > > * pages_poisoned * page_size = /proc/meminfo/HardwareCorrupted > > These memory error stats are reset during machine boot. > > > > The 1st commit introduces these sysfs entries. The 2nd commit populates > > memory error stats every time memory_failure finishes memory error > > recovery. The 3rd commit adds documentations for introduced stats.
> > > > [1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af > > [2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6 > > > > Jiaqi Yan (3): > > mm: memory-failure: Add memory failure stats to sysfs > > mm: memory-failure: Bump memory failure stats to pglist_data > > mm: memory-failure: Document memory failure stats > > > > Documentation/ABI/stable/sysfs-devices-node | 39 +++++++++++ > > drivers/base/node.c | 3 + > > include/linux/mm.h | 5 ++ > > include/linux/mmzone.h | 28 ++++++++ > > mm/memory-failure.c | 71 +++++++++++++++++++++ > > 5 files changed, 146 insertions(+) > > > > -- > > 2.39.0.314.g84b9a713c41-goog
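For illustration, a decoder-chain notifier of the kind the CMCI/UCNA path above relies on could look like the minimal sketch below. It is modeled loosely on the in-tree uc_decode_notifier but simplified: the function and variable names are hypothetical, the priority choice is an assumption, and it skips the severity filtering the real notifier performs (those severity constants live in the mce internals). It only shows the shape of the call: by the time the blocking notifier chain runs we are in process context, and the poisoned pfn is handed to memory_failure() with flags == 0, i.e. without MF_ACTION_REQUIRED.

#include <linux/notifier.h>
#include <linux/mm.h>
#include <asm/mce.h>

/*
 * Hypothetical notifier on the MCE decoder chain. Runs in process
 * context; reports the poisoned pfn as a deferred (non action-required)
 * error. Return value of memory_failure() is ignored in this sketch.
 */
static int deferred_poison_notify(struct notifier_block *nb,
                                  unsigned long val, void *data)
{
        struct mce *m = data;

        /* Only act when the record carries a usable address. */
        if (!m || !(m->status & MCI_STATUS_ADDRV))
                return NOTIFY_DONE;

        memory_failure(m->addr >> PAGE_SHIFT, 0);
        return NOTIFY_OK;
}

static struct notifier_block deferred_poison_nb = {
        .notifier_call  = deferred_poison_notify,
        .priority       = MCE_PRIO_UC,  /* assumption: reuse the UC priority slot */
};

/* Registered, e.g. from an init path, with:
 *      mce_register_decode_chain(&deferred_poison_nb);
 */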
> For SRAO signaled via **machine check exception**, my reading of the > current x86 MCE code is this: ... > 3) therefore, do_machine_check just skips kill_me_now or > kill_me_maybe, and directly goto out: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/mce/core.c#n1539 That does appear to be what we do. But it looks like a regression from older behavior. An SRAO machine check *ought* to call memory_failure() without the MF_ACTION_REQUIRED bit set in flags. -Tony
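In code terms, the distinction Tony draws is only the flags argument passed to memory_failure(). A hypothetical helper (illustrative, not an existing kernel function) makes it explicit:

#include <linux/mm.h>

/*
 * Illustrative helper only: SRAR-style consumption errors report the
 * page with MF_ACTION_REQUIRED set, while deferred errors (UCNA, or
 * SRAO as discussed here) should pass flags == 0.
 */
static int report_hwpoison(unsigned long pfn, bool action_required)
{
        return memory_failure(pfn, action_required ? MF_ACTION_REQUIRED : 0);
}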
On Tue, Jan 17, 2023 at 10:34 AM Luck, Tony <tony.luck@intel.com> wrote: > > > For SRAO signaled via **machine check exception**, my reading of the > > current x86 MCE code is this: > ... > > 3) therefore, do_machine_check just skips kill_me_now or > > kill_me_maybe, and directly goto out: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/mce/core.c#n1539 > > That does appear to be what we do. But it looks like a regression from older > behavior. An SRAO machine check *ought* to call memory_failure() without > the MF_ACTION_REQUIRED bit set in flags. > > -Tony > Oh, maybe SRAO signaled via MCE calls memory_failure() with these async code paths? 1. __mc_scan_banks => mce_log => mce_gen_pool_add + irq_work_queue(mce_irq_work) 2. mce_irq_work_cb => mce_schedule_work => schedule_work(&mce_work) 3. mce_work => mce_gen_pool_process => blocking_notifier_call_chain(&x86_mce_decoder_chain, 0, mce) => mce_uc_nb => uc_decode_notifier => memory_failure
> Oh, maybe SRAO signaled via MCE calls memory_failure() with these > async code paths? > > 1. __mc_scan_banks => mce_log => mce_gen_pool_add + irq_work_queue(mce_irq_work) > > 2. mce_irq_work_cb => mce_schedule_work => schedule_work(&mce_work) > > 3. mce_work => mce_gen_pool_process => > blocking_notifier_call_chain(&x86_mce_decoder_chain, 0, mce) > => mce_uc_nb => uc_decode_notifier => memory_failure Yes. That's right. Both SRAO (#MC) and UCNA (CMCI) will follow that path. So memory_failure() is called from the kthread context for that notifier. -Tony
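To see why memory_failure() ends up running in kthread/workqueue context on that path, here is a self-contained sketch of the same deferral pattern: an exception or CMCI handler may only queue an irq_work; its callback schedules ordinary work; and the work function walks a blocking notifier chain, where sleeping callees such as memory_failure() are safe. This is an illustrative module with made-up names (the real chain is x86_mce_decoder_chain), not the mce code itself.

#include <linux/module.h>
#include <linux/irq_work.h>
#include <linux/workqueue.h>
#include <linux/notifier.h>

/* Stand-in for x86_mce_decoder_chain. */
static BLOCKING_NOTIFIER_HEAD(demo_decode_chain);

/* Step 3: ordinary process context; blocking notifiers may sleep here. */
static void demo_work_fn(struct work_struct *work)
{
        blocking_notifier_call_chain(&demo_decode_chain, 0, NULL);
}
static DECLARE_WORK(demo_work, demo_work_fn);

/* Step 2: irq_work callback, still atomic context, so punt to a workqueue. */
static struct irq_work demo_irq_work;
static void demo_irq_work_fn(struct irq_work *w)
{
        schedule_work(&demo_work);
}

/* Step 1: the role the #MC/CMCI handler plays after logging the record. */
static void demo_log_error(void)
{
        irq_work_queue(&demo_irq_work);
}

static int __init demo_init(void)
{
        init_irq_work(&demo_irq_work, demo_irq_work_fn);
        demo_log_error();               /* exercise the chain once */
        return 0;
}

static void __exit demo_exit(void)
{
        irq_work_sync(&demo_irq_work);
        flush_work(&demo_work);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");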
Awesome, thanks Tony, I will correct my cover letter. On Wed, Jan 18, 2023 at 9:51 AM Luck, Tony <tony.luck@intel.com> wrote: > > > Oh, maybe SRAO signaled via MCE calls memory_failure() with these > > async code paths? > > > > 1. __mc_scan_banks => mce_log => mce_gen_pool_add + irq_work_queue(mce_irq_work) > > > > 2. mce_irq_work_cb => mce_schedule_work => schedule_work(&mce_work) > > > > 3. mce_work => mce_gen_pool_process => > > blocking_notifier_call_chain(&x86_mce_decoder_chain, 0, mce) > > => mce_uc_nb => uc_decode_notifier => memory_failure > > Yes. That's right. Both SRAO (#MC) and UCNA (CMCI) will follow that path. > So memory_failure() is called from the kthread context for that notifier. > > -Tony
On Wed, Jan 18, 2023 at 03:33:00PM -0800, Jiaqi Yan wrote: > Awesome, thanks Tony, I will correct my cover letter. > > On Wed, Jan 18, 2023 at 9:51 AM Luck, Tony <tony.luck@intel.com> wrote: > > > > > Oh, maybe SRAO signaled via MCE calls memory_failure() with these > > > async code paths? > > > > > > 1. __mc_scan_banks => mce_log => mce_gen_pool_add + irq_work_queue(mce_irq_work) > > > > > > 2. mce_irq_work_cb => mce_schedule_work => schedule_work(&mce_work) > > > > > > 3. mce_work => mce_gen_pool_process => > > > blocking_notifier_call_chain(&x86_mce_decoder_chain, 0, mce) > > > => mce_uc_nb => uc_decode_notifier => memory_failure > > > > Yes. That's right. Both SRAO (#MC) and UCNA (CMCI) will follow that path. > > So memory_failure() is called from the kthread context for that notifier. I misunderstood the behavior of UCNA (CMCI), so thank you for confirming the behaviors, both of you. - Naoya Horiguchi