Message ID | 1305619719-7480-6-git-send-email-ying.huang@intel.com
---|---
State | New, archived
* Huang Ying <ying.huang@intel.com> wrote:

> memory_failure() is the entry point for HWPoison memory error recovery. It must be called in process context. But commonly hardware memory errors are notified via MCE or NMI, so some delayed execution mechanism must be used. In the MCE handler, a work queue + ring buffer mechanism is used.
>
> In addition to MCE, now APEI (ACPI Platform Error Interface) GHES (Generic Hardware Error Source) can be used to report memory errors too. To add support for APEI GHES memory recovery, a mechanism similar to that of MCE is implemented. memory_failure_queue() is the new entry point that can be called in IRQ context. The next step is to make the MCE handler use this interface too.
>
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Wu Fengguang <fengguang.wu@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/mm.h  |    1
>  mm/memory-failure.c |   92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 93 insertions(+)

I have to say i disagree with how this is designed and how this is exposed to user-space - and i pointed this out before.

It's up to Len whether you muck up drivers/acpi/ but here you are patching mm/ again ...

I just had a quick look into the current affairs of mm/memory-failure.c and it has become an *even* nastier collection of hacks since the last time i commented on its uglies.

Special hack upon special hack, totally disorganized code, special-purpose, partly ioctl driven opaque information extraction to user-space using the erst-dbg device interface. We have all the maintenance overhead and little of the gains from hw error event features...

In this patch you add:

+struct memory_failure_entry {
+	unsigned long pfn;
+	int trapno;
+	int flags;
+};

Instead of exposing this event to other users who might be interested in these events - such as the RAS daemon under development by Boris.

We have a proper framework (ring-buffer, NMI execution, etc.) for reporting events, why are you not using (and extending) it instead of creating this nasty looking, isolated, ACPI specific low level feature?

Thanks,

	Ingo
On 05/17/2011 04:46 PM, Ingo Molnar wrote:
>
> * Huang Ying <ying.huang@intel.com> wrote:
>
>> memory_failure() is the entry point for HWPoison memory error recovery. It must be called in process context. But commonly hardware memory errors are notified via MCE or NMI, so some delayed execution mechanism must be used. In the MCE handler, a work queue + ring buffer mechanism is used.
>>
>> In addition to MCE, now APEI (ACPI Platform Error Interface) GHES (Generic Hardware Error Source) can be used to report memory errors too. To add support for APEI GHES memory recovery, a mechanism similar to that of MCE is implemented. memory_failure_queue() is the new entry point that can be called in IRQ context. The next step is to make the MCE handler use this interface too.
>>
>> Signed-off-by: Huang Ying <ying.huang@intel.com>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Wu Fengguang <fengguang.wu@intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>  include/linux/mm.h  |    1
>>  mm/memory-failure.c |   92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 93 insertions(+)
>
> I have to say i disagree with how this is designed and how this is exposed to user-space - and i pointed this out before.
>
> It's up to Len whether you muck up drivers/acpi/ but here you are patching mm/ again ...
>
> I just had a quick look into the current affairs of mm/memory-failure.c and it has become an *even* nastier collection of hacks since the last time i commented on its uglies.
>
> Special hack upon special hack, totally disorganized code, special-purpose, partly ioctl driven opaque information extraction to user-space using the erst-dbg device interface. We have all the maintenance overhead and little of the gains from hw error event features...

Like the name suggests, erst-dbg is only for debugging. It is not a user space interface. The user space interface used by APEI now is printk.

> In this patch you add:
>
> +struct memory_failure_entry {
> +	unsigned long pfn;
> +	int trapno;
> +	int flags;
> +};
>
> Instead of exposing this event to other users who might be interested in these events - such as the RAS daemon under development by Boris.
>
> We have a proper framework (ring-buffer, NMI execution, etc.) for reporting events, why are you not using (and extending) it instead of creating this nasty looking, isolated, ACPI specific low level feature?

This patch has nothing to do with hardware error event reporting. It is just about hardware error recovery.

Best Regards,
Huang Ying
* Huang Ying <ying.huang@intel.com> wrote:

> On 05/17/2011 04:46 PM, Ingo Molnar wrote:
>>
>> * Huang Ying <ying.huang@intel.com> wrote:
>>
>>> memory_failure() is the entry point for HWPoison memory error recovery. It must be called in process context. But commonly hardware memory errors are notified via MCE or NMI, so some delayed execution mechanism must be used. In the MCE handler, a work queue + ring buffer mechanism is used.
>>>
>>> In addition to MCE, now APEI (ACPI Platform Error Interface) GHES (Generic Hardware Error Source) can be used to report memory errors too. To add support for APEI GHES memory recovery, a mechanism similar to that of MCE is implemented. memory_failure_queue() is the new entry point that can be called in IRQ context. The next step is to make the MCE handler use this interface too.
>>>
>>> Signed-off-by: Huang Ying <ying.huang@intel.com>
>>> Cc: Andi Kleen <ak@linux.intel.com>
>>> Cc: Wu Fengguang <fengguang.wu@intel.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> ---
>>>  include/linux/mm.h  |    1
>>>  mm/memory-failure.c |   92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  2 files changed, 93 insertions(+)
>>
>> I have to say i disagree with how this is designed and how this is exposed to user-space - and i pointed this out before.
>>
>> It's up to Len whether you muck up drivers/acpi/ but here you are patching mm/ again ...
>>
>> I just had a quick look into the current affairs of mm/memory-failure.c and it has become an *even* nastier collection of hacks since the last time i commented on its uglies.
>>
>> Special hack upon special hack, totally disorganized code, special-purpose, partly ioctl driven opaque information extraction to user-space using the erst-dbg device interface. We have all the maintenance overhead and little of the gains from hw error event features...
>
> Like the name suggests, erst-dbg is only for debugging. [...]

Great, if printk does everything then can the debugging code be removed so that tooling does not accidentally make non-debugging use of it? I can write a patch for that.

> [...] It is not a user space interface. The user space interface used by APEI now is printk.

We definitely want printks obviously and primarily - often that is the only thing the admin sees, and most of the time there's no automatable 'policy action' anyway: human intervention is still the most common 'action' that is performed on exceptional system events.

Does all the (unspecified) tooling you are enabling here work based off printk only, or does it perhaps make use of the erst-dbg hack? :-)

[ Wrt. printks we definitely would like to have a printk free-form-ASCII event gateway for tooling that wants to use printk events in the regular flow of events that are not available via the syslog - Steve sent a print-string-event patch for that some time ago and that works well. ]

>> In this patch you add:
>>
>> +struct memory_failure_entry {
>> +	unsigned long pfn;
>> +	int trapno;
>> +	int flags;
>> +};
>>
>> Instead of exposing this event to other users who might be interested in these events - such as the RAS daemon under development by Boris.
>>
>> We have a proper framework (ring-buffer, NMI execution, etc.) for reporting events, why are you not using (and extending) it instead of creating this nasty looking, isolated, ACPI specific low level feature?
>
> This patch has nothing to do with hardware error event reporting. It is just about hardware error recovery.

Hardware error event reporting and recovery go hand in hand. First is the event, the second is the action.

Your structure demonstrates this already: it's called memory_failure_entry. It does:

+ * This function is called by the low level hardware error handler
+ * when it detects hardware memory corruption of a page. It schedules
+ * the recovery of the error page, including dropping pages, killing
+ * processes etc.

So based off an error event it does one from a short list of in-kernel policy actions.

If put into a proper framework this would be a lot more widely useful: we could for example trigger the killing of tasks (and other policy action) if other (bad) events are triggered - not just the ones that fit into the narrow ACPI scheme you have here.

Certain fatal IO errors would be an example, or SLAB memory corruptions or OOM errors - or any other event we are able to report today.

So why are we not working towards integrating this into our event reporting/handling framework, as i suggested it from day one on when you started posting these patches?

Thanks,

	Ingo
On 05/17/2011 05:26 PM, Ingo Molnar wrote:
>
> * Huang Ying <ying.huang@intel.com> wrote:
>
>> On 05/17/2011 04:46 PM, Ingo Molnar wrote:
>>>
>>> * Huang Ying <ying.huang@intel.com> wrote:
>>>
>>>> memory_failure() is the entry point for HWPoison memory error recovery. It must be called in process context. But commonly hardware memory errors are notified via MCE or NMI, so some delayed execution mechanism must be used. In the MCE handler, a work queue + ring buffer mechanism is used.
>>>>
>>>> In addition to MCE, now APEI (ACPI Platform Error Interface) GHES (Generic Hardware Error Source) can be used to report memory errors too. To add support for APEI GHES memory recovery, a mechanism similar to that of MCE is implemented. memory_failure_queue() is the new entry point that can be called in IRQ context. The next step is to make the MCE handler use this interface too.
>>>>
>>>> Signed-off-by: Huang Ying <ying.huang@intel.com>
>>>> Cc: Andi Kleen <ak@linux.intel.com>
>>>> Cc: Wu Fengguang <fengguang.wu@intel.com>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> ---
>>>>  include/linux/mm.h  |    1
>>>>  mm/memory-failure.c |   92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 93 insertions(+)
>>>
>>> I have to say i disagree with how this is designed and how this is exposed to user-space - and i pointed this out before.
>>>
>>> It's up to Len whether you muck up drivers/acpi/ but here you are patching mm/ again ...
>>>
>>> I just had a quick look into the current affairs of mm/memory-failure.c and it has become an *even* nastier collection of hacks since the last time i commented on its uglies.
>>>
>>> Special hack upon special hack, totally disorganized code, special-purpose, partly ioctl driven opaque information extraction to user-space using the erst-dbg device interface. We have all the maintenance overhead and little of the gains from hw error event features...
>>
>> Like the name suggests, erst-dbg is only for debugging. [...]
>
> Great, if printk does everything then can the debugging code be removed so that tooling does not accidentally make non-debugging use of it? I can write a patch for that.

The erst-dbg is only used for us to test whether the BIOS ERST implementation works. If you have concerns about its mis-usage, how about moving it to debugfs to make it clear that it is not an API, just for debugging?

>> [...] It is not a user space interface. The user space interface used by APEI now is printk.
>
> We definitely want printks obviously and primarily - often that is the only thing the admin sees, and most of the time there's no automatable 'policy action' anyway: human intervention is still the most common 'action' that is performed on exceptional system events.
>
> Does all the (unspecified) tooling you are enabling here work based off printk only, or does it perhaps make use of the erst-dbg hack? :-)

The only tool that makes use of erst-dbg is the debugging tool that tests the BIOS ERST implementation. There is absolutely NO other tool I am enabling that uses erst-dbg.

> [ Wrt. printks we definitely would like to have a printk free-form-ASCII event gateway for tooling that wants to use printk events in the regular flow of events that are not available via the syslog - Steve sent a print-string-event patch for that some time ago and that works well. ]

Thanks for the reminder, I will take a look at it.

>>> In this patch you add:
>>>
>>> +struct memory_failure_entry {
>>> +	unsigned long pfn;
>>> +	int trapno;
>>> +	int flags;
>>> +};
>>>
>>> Instead of exposing this event to other users who might be interested in these events - such as the RAS daemon under development by Boris.
>>>
>>> We have a proper framework (ring-buffer, NMI execution, etc.) for reporting events, why are you not using (and extending) it instead of creating this nasty looking, isolated, ACPI specific low level feature?
>>
>> This patch has nothing to do with hardware error event reporting. It is just about hardware error recovery.
>
> Hardware error event reporting and recovery go hand in hand. First is the event, the second is the action.
>
> Your structure demonstrates this already: it's called memory_failure_entry. It does:
>
> + * This function is called by the low level hardware error handler
> + * when it detects hardware memory corruption of a page. It schedules
> + * the recovery of the error page, including dropping pages, killing
> + * processes etc.
>
> So based off an error event it does one from a short list of in-kernel policy actions.
>
> If put into a proper framework this would be a lot more widely useful: we could for example trigger the killing of tasks (and other policy action) if other (bad) events are triggered - not just the ones that fit into the narrow ACPI scheme you have here.
>
> Certain fatal IO errors would be an example, or SLAB memory corruptions or OOM errors - or any other event we are able to report today.
>
> So why are we not working towards integrating this into our event reporting/handling framework, as i suggested it from day one on when you started posting these patches?

The memory_failure_queue() introduced in this patch is general, that is, it can be used not only by ACPI/APEI, but also by any other hardware error handlers, including your event reporting/handling framework.

Best Regards,
Huang Ying
* Huang Ying <ying.huang@intel.com> wrote:

>> So why are we not working towards integrating this into our event reporting/handling framework, as i suggested it from day one on when you started posting these patches?
>
> The memory_failure_queue() introduced in this patch is general, that is, it can be used not only by ACPI/APEI, but also by any other hardware error handlers, including your event reporting/handling framework.

Well, the bit you are steadfastly ignoring is what i have made clear well before you started adding these facilities: THEY ALREADY EXIST to a large degree :-)

So you were and are duplicating code instead of using and extending existing event processing facilities. It does not matter one little bit that the code you added is partly 'generic', it's still overlapping and duplicated.

Thanks,

	Ingo
On Fri, May 20, 2011 at 7:56 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Huang Ying <ying.huang@intel.com> wrote:
>
>>> So why are we not working towards integrating this into our event reporting/handling framework, as i suggested it from day one on when you started posting these patches?
>>
>> The memory_failure_queue() introduced in this patch is general, that is, it can be used not only by ACPI/APEI, but also by any other hardware error handlers, including your event reporting/handling framework.
>
> Well, the bit you are steadfastly ignoring is what i have made clear well before you started adding these facilities: THEY ALREADY EXIST to a large degree :-)
>
> So you were and are duplicating code instead of using and extending existing event processing facilities. It does not matter one little bit that the code you added is partly 'generic', it's still overlapping and duplicated.

How to do hardware error recovery in your perf framework? IMHO, it can be something as follows:

- The NMI handler runs for the hardware error; hardware error information is collected and put into a ring buffer, and an irq_work is triggered for further work.
- In the irq_work handler, memory_failure_queue() is called to do the real recovery work for recoverable memory errors in the ring buffer.

What's your idea about hardware error recovery in perf?

Best Regards,
Huang Ying
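[ A minimal sketch of the two-stage flow Huang describes, assuming the memory_failure_queue() interface from this patch. The ghes_* names and the 16-entry fifo are illustrative assumptions; init_irq_work()/irq_work_queue(), the kfifo helpers, and memory_failure_queue() are the real interfaces being discussed: ]

#include <linux/irq_work.h>
#include <linux/kfifo.h>
#include <linux/mm.h>

struct ghes_mem_err {
	unsigned long pfn;
	int trapno;
};

/* Single NMI producer assumed; no locking in this sketch. */
static DEFINE_KFIFO(ghes_mem_err_fifo, struct ghes_mem_err, 16);
static struct irq_work ghes_mem_err_work;

/* NMI context: only record the error and kick the irq_work. */
static void ghes_nmi_handler(unsigned long pfn, int trapno)
{
	struct ghes_mem_err e = { .pfn = pfn, .trapno = trapno };

	kfifo_put(&ghes_mem_err_fifo, &e);
	irq_work_queue(&ghes_mem_err_work);
}

/* IRQ context: hand each recorded error to the recovery machinery. */
static void ghes_mem_err_func(struct irq_work *work)
{
	struct ghes_mem_err e;

	while (kfifo_get(&ghes_mem_err_fifo, &e))
		memory_failure_queue(e.pfn, e.trapno, 0);
}

static int __init ghes_mem_err_init(void)
{
	init_irq_work(&ghes_mem_err_work, ghes_mem_err_func);
	return 0;
}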
* huang ying <huang.ying.caritas@gmail.com> wrote:

> On Fri, May 20, 2011 at 7:56 PM, Ingo Molnar <mingo@elte.hu> wrote:
>>
>> * Huang Ying <ying.huang@intel.com> wrote:
>>
>>>> So why are we not working towards integrating this into our event reporting/handling framework, as i suggested it from day one on when you started posting these patches?
>>>
>>> The memory_failure_queue() introduced in this patch is general, that is, it can be used not only by ACPI/APEI, but also by any other hardware error handlers, including your event reporting/handling framework.
>>
>> Well, the bit you are steadfastly ignoring is what i have made clear well before you started adding these facilities: THEY ALREADY EXIST to a large degree :-)
>>
>> So you were and are duplicating code instead of using and extending existing event processing facilities. It does not matter one little bit that the code you added is partly 'generic', it's still overlapping and duplicated.
>
> How to do hardware error recovery in your perf framework? IMHO, it can be something as follows:
>
> - The NMI handler runs for the hardware error; hardware error information is collected and put into a ring buffer, and an irq_work is triggered for further work.
> - In the irq_work handler, memory_failure_queue() is called to do the real recovery work for recoverable memory errors in the ring buffer.
>
> What's your idea about hardware error recovery in perf?

The first step, the whole irq_work and ring buffer already looks largely duplicated: you can collect into a perf event ring-buffer from NMI context like the regular perf events do.

The generalization that *would* make sense is not at the irq_work level really, instead we could generalize a 'struct event' for kernel internal producers and consumers of events that have no explicit PMU connection.

This new 'struct event' would be slimmer and would only contain the fields and features that generic event consumers and producers need. Tracing events could be updated to use these kinds of slimmer events.

It would still plug nicely into existing event ABIs, would work with event filters, etc. so the tooling side would remain focused and unified.

Something like that. It is rather clear by now that splitting out irq_work was a mistake. But mistakes can be fixed and some really nice code could come out of it! Would you be interested in looking into this?

Thanks,

	Ingo
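[ Ingo's 'struct event' is only a proposal at this point; a purely hypothetical sketch of what such a slimmed-down kernel-internal event and its producer/consumer hooks might look like - none of these names exist in the kernel: ]

#include <linux/list.h>
#include <linux/types.h>

/*
 * Hypothetical "slim" kernel-internal event: no PMU state, just
 * enough for in-kernel producers and consumers, while remaining
 * compatible with the tracing record format for the existing ABIs.
 */
struct event {
	u32			type;	/* event id, shared with tracing */
	u32			size;	/* size of the trailing record */
	struct list_head	list;	/* per-consumer queueing */
	char			data[];	/* payload, tracepoint format */
};

/* Illustrative (equally hypothetical) producer/consumer API: */
int event_register_consumer(u32 type, void (*cb)(struct event *ev));
void event_emit(struct event *ev);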
On Sun, May 22, 2011 at 6:00 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * huang ying <huang.ying.caritas@gmail.com> wrote:
>
>> On Fri, May 20, 2011 at 7:56 PM, Ingo Molnar <mingo@elte.hu> wrote:
>>>
>>> * Huang Ying <ying.huang@intel.com> wrote:
>>>
>>>>> So why are we not working towards integrating this into our event reporting/handling framework, as i suggested it from day one on when you started posting these patches?
>>>>
>>>> The memory_failure_queue() introduced in this patch is general, that is, it can be used not only by ACPI/APEI, but also by any other hardware error handlers, including your event reporting/handling framework.
>>>
>>> Well, the bit you are steadfastly ignoring is what i have made clear well before you started adding these facilities: THEY ALREADY EXIST to a large degree :-)
>>>
>>> So you were and are duplicating code instead of using and extending existing event processing facilities. It does not matter one little bit that the code you added is partly 'generic', it's still overlapping and duplicated.
>>
>> How to do hardware error recovery in your perf framework? IMHO, it can be something as follows:
>>
>> - The NMI handler runs for the hardware error; hardware error information is collected and put into a ring buffer, and an irq_work is triggered for further work.
>> - In the irq_work handler, memory_failure_queue() is called to do the real recovery work for recoverable memory errors in the ring buffer.
>>
>> What's your idea about hardware error recovery in perf?
>
> The first step, the whole irq_work and ring buffer already looks largely duplicated: you can collect into a perf event ring-buffer from NMI context like the regular perf events do.

Why duplicated? perf uses the general irq_work too.

> The generalization that *would* make sense is not at the irq_work level really, instead we could generalize a 'struct event' for kernel internal producers and consumers of events that have no explicit PMU connection.
>
> This new 'struct event' would be slimmer and would only contain the fields and features that generic event consumers and producers need. Tracing events could be updated to use these kinds of slimmer events.
>
> It would still plug nicely into existing event ABIs, would work with event filters, etc. so the tooling side would remain focused and unified.
>
> Something like that. It is rather clear by now that splitting out irq_work was a mistake. But mistakes can be fixed and some really nice code could come out of it! Would you be interested in looking into this?

Yes. This can transfer hardware error data from kernel to user space. Then, how to do hardware error recovery in this big picture? IMHO, we will need to call something like memory_failure_queue() in IRQ context for memory errors.

Best Regards,
Huang Ying
* huang ying <huang.ying.caritas@gmail.com> wrote:

> On Sun, May 22, 2011 at 6:00 PM, Ingo Molnar <mingo@elte.hu> wrote:
>>
>> * huang ying <huang.ying.caritas@gmail.com> wrote:
>>
>>> On Fri, May 20, 2011 at 7:56 PM, Ingo Molnar <mingo@elte.hu> wrote:
>>>>
>>>> * Huang Ying <ying.huang@intel.com> wrote:
>>>>
>>>>>> So why are we not working towards integrating this into our event reporting/handling framework, as i suggested it from day one on when you started posting these patches?
>>>>>
>>>>> The memory_failure_queue() introduced in this patch is general, that is, it can be used not only by ACPI/APEI, but also by any other hardware error handlers, including your event reporting/handling framework.
>>>>
>>>> Well, the bit you are steadfastly ignoring is what i have made clear well before you started adding these facilities: THEY ALREADY EXIST to a large degree :-)
>>>>
>>>> So you were and are duplicating code instead of using and extending existing event processing facilities. It does not matter one little bit that the code you added is partly 'generic', it's still overlapping and duplicated.
>>>
>>> How to do hardware error recovery in your perf framework? IMHO, it can be something as follows:
>>>
>>> - The NMI handler runs for the hardware error; hardware error information is collected and put into a ring buffer, and an irq_work is triggered for further work.
>>> - In the irq_work handler, memory_failure_queue() is called to do the real recovery work for recoverable memory errors in the ring buffer.
>>>
>>> What's your idea about hardware error recovery in perf?
>>
>> The first step, the whole irq_work and ring buffer already looks largely duplicated: you can collect into a perf event ring-buffer from NMI context like the regular perf events do.
>
> Why duplicated? perf uses the general irq_work too.

Yes, of course, because - if you still remember - Peter split irq_work out of perf events:

  e360adbe2924: irq_work: Add generic hardirq context callbacks

  | Perf currently has such a mechanism, so extract that and provide it as a
  | generic feature, independent of perf so that others may also benefit.

:-)

But in hindsight the level of abstraction (for this usecase) was set too low, because we lose wider access to the actual events themselves:

>> The generalization that *would* make sense is not at the irq_work level really, instead we could generalize a 'struct event' for kernel internal producers and consumers of events that have no explicit PMU connection.
>>
>> This new 'struct event' would be slimmer and would only contain the fields and features that generic event consumers and producers need. Tracing events could be updated to use these kinds of slimmer events.
>>
>> It would still plug nicely into existing event ABIs, would work with event filters, etc. so the tooling side would remain focused and unified.
>>
>> Something like that. It is rather clear by now that splitting out irq_work was a mistake. But mistakes can be fixed and some really nice code could come out of it! Would you be interested in looking into this?
>
> Yes. This can transfer hardware error data from kernel to user space. Then, how to do hardware error recovery in this big picture? IMHO, we will need to call something like memory_failure_queue() in IRQ context for memory errors.

That's where 'active filters' come into the picture - see my other mail (that was in the context of unidentified NMI errors/events) where i outlined how they would work in this case and elsewhere. Via active filters we could share most of the code, gain access to the events and still have kernel driven policy action.

Thanks,

	Ingo
On 05/22/2011 09:25 PM, Ingo Molnar wrote:
>>> The generalization that *would* make sense is not at the irq_work level really, instead we could generalize a 'struct event' for kernel internal producers and consumers of events that have no explicit PMU connection.
>>>
>>> This new 'struct event' would be slimmer and would only contain the fields and features that generic event consumers and producers need. Tracing events could be updated to use these kinds of slimmer events.
>>>
>>> It would still plug nicely into existing event ABIs, would work with event filters, etc. so the tooling side would remain focused and unified.
>>>
>>> Something like that. It is rather clear by now that splitting out irq_work was a mistake. But mistakes can be fixed and some really nice code could come out of it! Would you be interested in looking into this?
>>
>> Yes. This can transfer hardware error data from kernel to user space. Then, how to do hardware error recovery in this big picture? IMHO, we will need to call something like memory_failure_queue() in IRQ context for memory errors.
>
> That's where 'active filters' come into the picture - see my other mail (that was in the context of unidentified NMI errors/events) where i outlined how they would work in this case and elsewhere. Via active filters we could share most of the code, gain access to the events and still have kernel driven policy action.

Is that something as follows?

- The NMI handler runs for the hardware error; hardware error information is collected and put into the perf ring buffer as an 'event'.
- Some 'active filters' are run for each 'event' in NMI context.
- Some operations cannot be done in the NMI handler, so they are delayed to an IRQ handler (this can be done with something like irq_work).
- Some other 'active filters' are run for each 'event' in IRQ context. (For memory errors, we can call memory_failure_queue() here.)

Where some 'active filters' are kernel built-in, and some 'active filters' can be customized via the kernel command line or from user space.

If my understanding as above is correct, I think this is a general and complex solution. It is a little hard for users to understand which 'active filters' are in effect. They may need some runtime assistance to understand the code (maybe /sys/events/active_filters, which lists all filters in effect now), because that is hard to see only by reading the source code. Anyway, this is a design style choice.

There are still some issues I don't know how to solve in the above framework:

- If there are two processes requesting the same type of hardware error events, one hardware error event will be copied to two ring buffers (each for one process), but the 'active filters' should be run only once for each hardware error event.

- How to deal with ring-buffer overflow? For example, the ring-buffer is full of corrected memory errors, and now a recoverable memory error occurs but it cannot be put into the perf ring buffer because of ring-buffer overflow; how to deal with the recoverable memory error?

Best Regards,
Huang Ying
* Huang Ying <ying.huang@intel.com> wrote:

>> That's where 'active filters' come into the picture - see my other mail (that was in the context of unidentified NMI errors/events) where i outlined how they would work in this case and elsewhere. Via active filters we could share most of the code, gain access to the events and still have kernel driven policy action.
>
> Is that something as follows?
>
> - The NMI handler runs for the hardware error; hardware error information is collected and put into the perf ring buffer as an 'event'.

Correct.

Note that for MCE errors we want the 'persistent event' framework Boris has posted: we want these events to be buffered up to a point even if there is no tool listening in on them:

- this gives us boot-time MCE error coverage
- this protects us against a logging daemon being restarted and events getting lost

> - Some 'active filters' are run for each 'event' in NMI context.

Yeah. Whether it's a human-ASCII space 'filter' or really just a callback you register with that event is secondary - both would work.

> - Some operations cannot be done in the NMI handler, so they are delayed to an IRQ handler (this can be done with something like irq_work).

Yes.

> - Some other 'active filters' are run for each 'event' in IRQ context. (For memory errors, we can call memory_failure_queue() here.)

Correct.

> Where some 'active filters' are kernel built-in, and some 'active filters' can be customized via the kernel command line or from user space.

Yes.

> If my understanding as above is correct, I think this is a general and complex solution. It is a little hard for users to understand which 'active filters' are in effect. They may need some runtime assistance to understand the code (maybe /sys/events/active_filters, which lists all filters in effect now), because that is hard to see only by reading the source code. Anyway, this is a design style choice.

I don't think it's complex: the built-in rules are in plain sight (can be in the source code or can even be explicitly registered callbacks), the configuration/tooling installed rules will be as complex as the admin or tool wants them to be.

> There are still some issues I don't know how to solve in the above framework:
>
> - If there are two processes requesting the same type of hardware error events, one hardware error event will be copied to two ring buffers (each for one process), but the 'active filters' should be run only once for each hardware error event.

With persistent events 'active filters' should only be attached to the central persistent event.

> - How to deal with ring-buffer overflow? For example, the ring-buffer is full of corrected memory errors, and now a recoverable memory error occurs but it cannot be put into the perf ring buffer because of ring-buffer overflow; how to deal with the recoverable memory error?

The solution is to make it large enough. With *every* queueing solution there will be some sort of queue size limit.

Thanks,

	Ingo
>> - The NMI handler runs for the hardware error; hardware error information is collected and put into the perf ring buffer as an 'event'.
>
> Correct.
>
> Note that for MCE errors we want the 'persistent event' framework Boris has posted: we want these events to be buffered up to a point even if there is no tool listening in on them:

This is a very opportune time to have this discussion. I've been working on getting "in context" recoverable errors working. Sandybridge Server platforms will allow for recovery for both instruction and data fetches in the current execution context. These are flagged in the machine check bank with the "AR" (Action Required) bit set to 1 (along with several other bits making up a recognizable signature).

The critical feature here is that we must not return from the machine check handler to the context that tripped over the error. In the case of the data fault, we'll just re-execute the same access and take another machine check. In the case of the instruction fault there is no valid context to return to (MCGSTATUS.RIPV is zero).

There are a couple of cases where recovery is possible:

1) The memory error was found while executing user mode code.

The code I have now for recovery makes use of TIF_MCE_NOTIFY to make sure that we don't return to the user, but instead end up in arch/x86/kernel/signal.c:do_notify_resume() where we arrange to have the process handle its own recovery (using mm/memory-failure.c to figure out the type of page, and probably resulting in the mapping out of the page and sending SIGBUS to the process).

In your proposed solution, we'd generate an event that would be handled by some process/daemon ... but how would we ensure that the affected process does not run in the mean time? Could we create some analogous method to the ptrace stopped state, and hand control of the affected process to the daemon that gets the event?

2) The memory error was found in certain special sections of the kernel for which recovery is possible (e.g. while copying to/from user memory, perhaps also page copy and page clear).

Here I don't have a solution. TIF_MCE_NOTIFY isn't checked when returning from do_machine_check() to kernel code. In a CONFIG_PREEMPT=y kernel, all of the recoverable cases ought to be in places where pre-emption is allowed ... so perhaps we can also use the stop-and-switch option here?

-Tony
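[ For context, the TIF_MCE_NOTIFY mechanism Tony refers to works roughly like this - a simplified paraphrase of the era's x86 code, not verbatim: ]

#include <linux/thread_info.h>
#include <asm/mce.h>

/*
 * 1) In do_machine_check(), on a recoverable user-mode error, flag
 *    the task so it cannot re-enter user mode before recovery runs.
 */
static void mce_flag_current_task(void)
{
	set_thread_flag(TIF_MCE_NOTIFY);
}

/*
 * 2) On the return-to-user path, do_notify_resume() checks the flag
 *    before any user instruction can execute again.
 */
static void sketch_do_notify_resume(unsigned long thread_info_flags)
{
	if (thread_info_flags & _TIF_MCE_NOTIFY)
		mce_notify_process();	/* run recovery, possibly SIGBUS */
}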
On 05/23/2011 07:01 PM, Ingo Molnar wrote:
>> If my understanding as above is correct, I think this is a general and complex solution. It is a little hard for users to understand which 'active filters' are in effect. They may need some runtime assistance to understand the code (maybe /sys/events/active_filters, which lists all filters in effect now), because that is hard to see only by reading the source code. Anyway, this is a design style choice.
>
> I don't think it's complex: the built-in rules are in plain sight (can be in the source code or can even be explicitly registered callbacks), the configuration/tooling installed rules will be as complex as the admin or tool wants them to be.
>
>> There are still some issues I don't know how to solve in the above framework:
>>
>> - If there are two processes requesting the same type of hardware error events, one hardware error event will be copied to two ring buffers (each for one process), but the 'active filters' should be run only once for each hardware error event.
>
> With persistent events 'active filters' should only be attached to the central persistent event.

OK. I see.

>> - How to deal with ring-buffer overflow? For example, the ring-buffer is full of corrected memory errors, and now a recoverable memory error occurs but it cannot be put into the perf ring buffer because of ring-buffer overflow; how to deal with the recoverable memory error?
>
> The solution is to make it large enough. With *every* queueing solution there will be some sort of queue size limit.

Another solution could be:

Create two ring-buffers. One is for logging and will be read by the RAS daemon; the other is for recovery, where the event record will be removed from the ring-buffer after all 'active filters' have been run on it. Even if the RAS daemon is restarted or hangs, recoverable errors can be taken care of.

Best Regards,
Huang Ying
* Huang Ying <ying.huang@intel.com> wrote:

>>> - How to deal with ring-buffer overflow? For example, the ring-buffer is full of corrected memory errors, and now a recoverable memory error occurs but it cannot be put into the perf ring buffer because of ring-buffer overflow; how to deal with the recoverable memory error?
>>
>> The solution is to make it large enough. With *every* queueing solution there will be some sort of queue size limit.
>
> Another solution could be:
>
> Create two ring-buffers. One is for logging and will be read by the RAS daemon; the other is for recovery, where the event record will be removed from the ring-buffer after all 'active filters' have been run on it. Even if the RAS daemon is restarted or hangs, recoverable errors can be taken care of.

Well, filters will always be executed since they execute when the event is inserted - not when it's extracted.

So if you worry about losing *filter* executions (and dependent policy action) - there should be no loss there, ever.

But yes, the scheme you outline would work as well: a counting-only event with a filter specified - this will do no buffering at all.

So ... to get the ball rolling in this area one of you guys active in RAS should really try a first approximation for the active filter approach: add a test-TRACE_EVENT() for the errors you are interested in and define a convenient way to register policy action with post-filter events. This should work even without having the 'active' portion defined at the ABI and filter-string level.

Thanks,

	Ingo
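[ A minimal sketch of such a test-TRACE_EVENT() might look like the following; the event name and fields are illustrative, and the usual include/trace/events/ header boilerplate (TRACE_SYSTEM etc.) is omitted: ]

#include <linux/tracepoint.h>

/* Illustrative test event for memory errors; names are hypothetical. */
TRACE_EVENT(mem_error,

	TP_PROTO(unsigned long pfn, int severity),

	TP_ARGS(pfn, severity),

	TP_STRUCT__entry(
		__field(unsigned long,	pfn)
		__field(int,		severity)
	),

	TP_fast_assign(
		__entry->pfn	  = pfn;
		__entry->severity = severity;
	),

	TP_printk("pfn=%#lx severity=%d", __entry->pfn, __entry->severity)
);

[ The error handler would then emit trace_mem_error(pfn, severity) at the detection site, and a filter string such as "severity > 1" could be attached to the event; the 'active' part - registering an in-kernel policy callback to run on post-filter events - is the piece being asked for as a prototype. ]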
On 05/24/2011 10:48 AM, Ingo Molnar wrote:
>
> * Huang Ying <ying.huang@intel.com> wrote:
>
>>>> - How to deal with ring-buffer overflow? For example, the ring-buffer is full of corrected memory errors, and now a recoverable memory error occurs but it cannot be put into the perf ring buffer because of ring-buffer overflow; how to deal with the recoverable memory error?
>>>
>>> The solution is to make it large enough. With *every* queueing solution there will be some sort of queue size limit.
>>
>> Another solution could be:
>>
>> Create two ring-buffers. One is for logging and will be read by the RAS daemon; the other is for recovery, where the event record will be removed from the ring-buffer after all 'active filters' have been run on it. Even if the RAS daemon is restarted or hangs, recoverable errors can be taken care of.
>
> Well, filters will always be executed since they execute when the event is inserted - not when it's extracted.

For filters executed in NMI context, they can be executed when the event is inserted, with no need for buffering. But filters executed in deferred IRQ context need to be executed when the event is extracted.

> So if you worry about losing *filter* executions (and dependent policy action) - there should be no loss there, ever.
>
> But yes, the scheme you outline would work as well: a counting-only event with a filter specified - this will do no buffering at all.
>
> So ... to get the ball rolling in this area one of you guys active in RAS should really try a first approximation for the active filter approach: add a test-TRACE_EVENT() for the errors you are interested in and define a convenient way to register policy action with post-filter events. This should work even without having the 'active' portion defined at the ABI and filter-string level.

Best Regards,
Huang Ying
* Huang Ying <ying.huang@intel.com> wrote:

> On 05/24/2011 10:48 AM, Ingo Molnar wrote:
>>
>> * Huang Ying <ying.huang@intel.com> wrote:
>>
>>>>> - How to deal with ring-buffer overflow? For example, the ring-buffer is full of corrected memory errors, and now a recoverable memory error occurs but it cannot be put into the perf ring buffer because of ring-buffer overflow; how to deal with the recoverable memory error?
>>>>
>>>> The solution is to make it large enough. With *every* queueing solution there will be some sort of queue size limit.
>>>
>>> Another solution could be:
>>>
>>> Create two ring-buffers. One is for logging and will be read by the RAS daemon; the other is for recovery, where the event record will be removed from the ring-buffer after all 'active filters' have been run on it. Even if the RAS daemon is restarted or hangs, recoverable errors can be taken care of.
>>
>> Well, filters will always be executed since they execute when the event is inserted - not when it's extracted.
>
> For filters executed in NMI context, they can be executed when the event is inserted, with no need for buffering. But filters executed in deferred IRQ context need to be executed when the event is extracted.

Correct.

Thanks,

	Ingo
* Luck, Tony <tony.luck@intel.com> wrote:

> In your proposed solution, we'd generate an event that would be handled by some process/daemon ... but how would we ensure that the affected process does not run in the mean time? Could we create some analogous method to the ptrace stopped state, and hand control of the affected process to the daemon that gets the event?

Ok, i think there is a bit of a misunderstanding here - which is not a surprise really: we made generic arguments all along with very few specifics.

The RAS daemon would deal with 'slow' policy action: fully recovered events. It would also log various events so that people can do post mortem etc.

The main point of defining events here is so that there's a single method of transport and a single flexible method of defining and extracting events.

Some of the event processing would occur in the kernel: in code that knows about memory_failure() and calls it while making sure we do not execute any user-space instruction.

Some of the code would execute *very* early and in a very atomic way, still in NMI context: panicking the box if the error is so severe.

Neither of these are steps that the RAS daemon can or wants to handle.

The RAS tools would interact with the regular perf facilities setting and configuring the various RAS related events. They'd handle the 'severity' config bits, they'd initiate testing (injection), etc.

Ideally the RAS daemon and tools would do what syslog does (and more), with more structured events. At the end of the day most of the 'policy action' is taken by humans anyway, who want to take a look at some ASCII output. So printk() integration and obvious ASCII output for everything is important along the way.

> 2) The memory error was found in certain special sections of the kernel for which recovery is possible (e.g. while copying to/from user memory, perhaps also page copy and page clear).
>
> Here I don't have a solution. TIF_MCE_NOTIFY isn't checked when returning from do_machine_check() to kernel code.

Well, since we are already in interrupt context (albeit in a very atomic NMI context), sending a self-IPI is not strictly necessary. We could fix up the return address and jump to the right handler straight away during the IRET.

A self-IPI might also not execute *immediately* - there's always the chance of APIC related delays.

> In a CONFIG_PREEMPT=y kernel, all of the recoverable cases ought to be in places where pre-emption is allowed ... so perhaps we can also use the stop-and-switch option here?

Yes, these are generally preemptible cases - and if they are not we can make the error fatal (we do not have to handle *every* complex case, giving up is a fair answer as well - we do not want rare code to be complex really).

But you don't need to stop-and-switch: just stack-nesting on top of whatever preemptible code was running there would be enough, wouldn't it? That stops a task from executing until the decision has been made whether it can continue or not.

Thanks,

	Ingo
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1664,6 +1664,7 @@ enum mf_flags {
 };
 extern void memory_failure(unsigned long pfn, int trapno);
 extern int __memory_failure(unsigned long pfn, int trapno, int flags);
+extern void memory_failure_queue(unsigned long pfn, int trapno, int flags);
 extern int unpoison_memory(unsigned long pfn);
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -52,6 +52,7 @@
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
 #include <linux/memory_hotplug.h>
+#include <linux/kfifo.h>
 #include "internal.h"
 
 int sysctl_memory_failure_early_kill __read_mostly = 0;
@@ -1182,6 +1183,97 @@ void memory_failure(unsigned long pfn, i
 	__memory_failure(pfn, trapno, 0);
 }
 
+#define MEMORY_FAILURE_FIFO_ORDER	4
+#define MEMORY_FAILURE_FIFO_SIZE	(1 << MEMORY_FAILURE_FIFO_ORDER)
+
+struct memory_failure_entry {
+	unsigned long pfn;
+	int trapno;
+	int flags;
+};
+
+struct memory_failure_cpu {
+	DECLARE_KFIFO(fifo, struct memory_failure_entry,
+		      MEMORY_FAILURE_FIFO_SIZE);
+	spinlock_t lock;
+	struct work_struct work;
+};
+
+static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
+
+/**
+ * memory_failure_queue - Schedule handling memory failure of a page.
+ * @pfn: Page Number of the corrupted page
+ * @trapno: Trap number reported in the signal to user space.
+ * @flags: Flags for memory failure handling
+ *
+ * This function is called by the low level hardware error handler
+ * when it detects hardware memory corruption of a page. It schedules
+ * the recovery of the error page, including dropping pages, killing
+ * processes etc.
+ *
+ * The function is primarily of use for corruptions that
+ * happen outside the current execution context (e.g. when
+ * detected by a background scrubber)
+ *
+ * Can run in IRQ context.
+ */
+void memory_failure_queue(unsigned long pfn, int trapno, int flags)
+{
+	struct memory_failure_cpu *mf_cpu;
+	unsigned long proc_flags;
+	struct memory_failure_entry entry = {
+		.pfn = pfn,
+		.trapno = trapno,
+		.flags = flags,
+	};
+
+	mf_cpu = &get_cpu_var(memory_failure_cpu);
+	spin_lock_irqsave(&mf_cpu->lock, proc_flags);
+	if (kfifo_put(&mf_cpu->fifo, &entry))
+		schedule_work_on(smp_processor_id(), &mf_cpu->work);
+	else
+		pr_err("Memory failure: buffer overflow when recovering memory failure at %#lx\n",
+		       pfn);
+	spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
+	put_cpu_var(memory_failure_cpu);
+}
+EXPORT_SYMBOL_GPL(memory_failure_queue);
+
+static void memory_failure_work_func(struct work_struct *work)
+{
+	struct memory_failure_cpu *mf_cpu;
+	struct memory_failure_entry entry = { 0, };
+	unsigned long proc_flags;
+	int gotten;
+
+	mf_cpu = &__get_cpu_var(memory_failure_cpu);
+	for (;;) {
+		spin_lock_irqsave(&mf_cpu->lock, proc_flags);
+		gotten = kfifo_get(&mf_cpu->fifo, &entry);
+		spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
+		if (!gotten)
+			break;
+		__memory_failure(entry.pfn, entry.trapno, entry.flags);
+	}
+}
+
+static int __init memory_failure_init(void)
+{
+	struct memory_failure_cpu *mf_cpu;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		mf_cpu = &per_cpu(memory_failure_cpu, cpu);
+		spin_lock_init(&mf_cpu->lock);
+		INIT_KFIFO(mf_cpu->fifo);
+		INIT_WORK(&mf_cpu->work, memory_failure_work_func);
+	}
+
+	return 0;
+}
+core_initcall(memory_failure_init);
+
 /**
  * unpoison_memory - Unpoison a previously poisoned page
  * @pfn: Page number of the to be unpoisoned page
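[ For reference, a hypothetical caller-side sketch of the new entry point; the ghes_* name is illustrative, only memory_failure_queue() itself is the interface added above: ]

#include <linux/mm.h>
#include <linux/types.h>

/* An IRQ-context error handler hands a corrupted page to the
 * recovery machinery and returns immediately; the actual recovery
 * runs later from the per-CPU work queue in process context. */
static void ghes_handle_memory_failure(u64 paddr, int trapno)
{
	unsigned long pfn = paddr >> PAGE_SHIFT;

	memory_failure_queue(pfn, trapno, 0);
}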
memory_failure() is the entry point for HWPoison memory error recovery. It must be called in process context. But commonly hardware memory errors are notified via MCE or NMI, so some delayed execution mechanism must be used. In the MCE handler, a work queue + ring buffer mechanism is used.

In addition to MCE, now APEI (ACPI Platform Error Interface) GHES (Generic Hardware Error Source) can be used to report memory errors too. To add support for APEI GHES memory recovery, a mechanism similar to that of MCE is implemented. memory_failure_queue() is the new entry point that can be called in IRQ context. The next step is to make the MCE handler use this interface too.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mm.h  |    1
 mm/memory-failure.c |   92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)