[5/9] HWPoison: add memory_failure_queue()

Message ID	1305619719-7480-6-git-send-email-ying.huang@intel.com (mailing list archive)
State	New, archived
Headers	show Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by demeter1.kernel.org (8.14.4/8.14.3) with ESMTP id p4H8Br46003980 for <patchwork-acpi@patchwork.kernel.org>; Tue, 17 May 2011 08:11:54 GMT From: Huang Ying <ying.huang@intel.com> To: Len Brown <lenb@kernel.org> Cc: linux-kernel@vger.kernel.org, Andi Kleen <andi@firstfloor.org>, Tony Luck <tony.luck@intel.com>, ying.huang@intel.com, linux-acpi@vger.kernel.org, Andi Kleen <ak@linux.intel.com>, Wu Fengguang <fengguang.wu@intel.com>, Andrew Morton <akpm@linux-foundation.org> Subject: [PATCH 5/9] HWPoison: add memory_failure_queue() Date: Tue, 17 May 2011 16:08:35 +0800 Message-Id: <1305619719-7480-6-git-send-email-ying.huang@intel.com> In-Reply-To: <1305619719-7480-1-git-send-email-ying.huang@intel.com> References: <1305619719-7480-1-git-send-email-ying.huang@intel.com> Sender: linux-acpi-owner@vger.kernel.org Precedence: bulk

Huang, Ying May 17, 2011, 8:08 a.m. UTC

memory_failure() is the entry point for HWPoison memory error
recovery.  It must be called in process context.  But commonly
hardware memory errors are notified via MCE or NMI, so some delayed
execution mechanism must be used.  In MCE handler, a work queue + ring
buffer mechanism is used.

In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
(Generic Hardware Error Source) can be used to report memory errors
too.  To add support to APEI GHES memory recovery, a mechanism similar
to that of MCE is implemented.  memory_failure_queue() is the new
entry point that can be called in IRQ context.  The next step is to
make MCE handler uses this interface too.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mm.h  |    1 
 mm/memory-failure.c |   92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+)

--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 17, 2011, 8:46 a.m. UTC | #1

* Huang Ying <ying.huang@intel.com> wrote:

> memory_failure() is the entry point for HWPoison memory error
> recovery.  It must be called in process context.  But commonly
> hardware memory errors are notified via MCE or NMI, so some delayed
> execution mechanism must be used.  In MCE handler, a work queue + ring
> buffer mechanism is used.
> 
> In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
> (Generic Hardware Error Source) can be used to report memory errors
> too.  To add support to APEI GHES memory recovery, a mechanism similar
> to that of MCE is implemented.  memory_failure_queue() is the new
> entry point that can be called in IRQ context.  The next step is to
> make MCE handler uses this interface too.
> 
> Signed-off-by: Huang Ying <ying.huang@intel.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Wu Fengguang <fengguang.wu@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/mm.h  |    1 
>  mm/memory-failure.c |   92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 93 insertions(+)

I have to say i disagree with how this is designed and how this is exposed to 
user-space - and i pointed this out before.

It's up to Len whether you muck up drivers/acpi/ but here you are patching mm/ 
again ...

I just had a quick look into the current affairs of mm/memory-inject.c and it 
has become an *even* nastier collection of hacks since the last time i 
commented on its uglies.

Special hack upon special hack, totally disorganized code, special-purpose, 
partly ioctl driven opaque information extraction to user-space using the 
erst-dbg device interface. We have all the maintenance overhead and little of 
the gains from hw error event features...

In this patch you add:

+struct memory_failure_entry {
+       unsigned long pfn;
+       int trapno;
+       int flags;
+};

Instead of exposing this event to other users who might be interested in these 
events - such as the RAS daemon under development by Boris.

We have a proper framework (ring-buffer, NMI execution, etc.) for reporting 
events, why are you not using (and extending) it instead of creating this nasty 
looking, isolated, ACPI specific low level feature?

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Huang, Ying May 17, 2011, 8:52 a.m. UTC | #2

On 05/17/2011 04:46 PM, Ingo Molnar wrote:
> 
> * Huang Ying <ying.huang@intel.com> wrote:
> 
>> memory_failure() is the entry point for HWPoison memory error
>> recovery.  It must be called in process context.  But commonly
>> hardware memory errors are notified via MCE or NMI, so some delayed
>> execution mechanism must be used.  In MCE handler, a work queue + ring
>> buffer mechanism is used.
>>
>> In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
>> (Generic Hardware Error Source) can be used to report memory errors
>> too.  To add support to APEI GHES memory recovery, a mechanism similar
>> to that of MCE is implemented.  memory_failure_queue() is the new
>> entry point that can be called in IRQ context.  The next step is to
>> make MCE handler uses this interface too.
>>
>> Signed-off-by: Huang Ying <ying.huang@intel.com>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Wu Fengguang <fengguang.wu@intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>  include/linux/mm.h  |    1 
>>  mm/memory-failure.c |   92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 93 insertions(+)
> 
> I have to say i disagree with how this is designed and how this is exposed to 
> user-space - and i pointed this out before.
> 
> It's up to Len whether you muck up drivers/acpi/ but here you are patching mm/ 
> again ...
> 
> I just had a quick look into the current affairs of mm/memory-inject.c and it 
> has become an *even* nastier collection of hacks since the last time i 
> commented on its uglies.
> 
> Special hack upon special hack, totally disorganized code, special-purpose, 
> partly ioctl driven opaque information extraction to user-space using the 
> erst-dbg device interface. We have all the maintenance overhead and little of 
> the gains from hw error event features...

Like the name suggested, erst-dbg is only for debugging.  It is not a
user space interface.  The user space interface used by APEI now is printk.

> In this patch you add:
> 
> +struct memory_failure_entry {
> +       unsigned long pfn;
> +       int trapno;
> +       int flags;
> +};
> 
> Instead of exposing this event to other users who might be interested in these 
> events - such as the RAS daemon under development by Boris.
> 
> We have a proper framework (ring-buffer, NMI execution, etc.) for reporting 
> events, why are you not using (and extending) it instead of creating this nasty 
> looking, isolated, ACPI specific low level feature?

This patch has nothing to do with hardware error event reporting.  It is
just about hardware error recovering.

Best Regards,
Huang Ying
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 17, 2011, 9:26 a.m. UTC | #3

* Huang Ying <ying.huang@intel.com> wrote:

> On 05/17/2011 04:46 PM, Ingo Molnar wrote:
> > 
> > * Huang Ying <ying.huang@intel.com> wrote:
> > 
> >> memory_failure() is the entry point for HWPoison memory error
> >> recovery.  It must be called in process context.  But commonly
> >> hardware memory errors are notified via MCE or NMI, so some delayed
> >> execution mechanism must be used.  In MCE handler, a work queue + ring
> >> buffer mechanism is used.
> >>
> >> In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
> >> (Generic Hardware Error Source) can be used to report memory errors
> >> too.  To add support to APEI GHES memory recovery, a mechanism similar
> >> to that of MCE is implemented.  memory_failure_queue() is the new
> >> entry point that can be called in IRQ context.  The next step is to
> >> make MCE handler uses this interface too.
> >>
> >> Signed-off-by: Huang Ying <ying.huang@intel.com>
> >> Cc: Andi Kleen <ak@linux.intel.com>
> >> Cc: Wu Fengguang <fengguang.wu@intel.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> ---
> >>  include/linux/mm.h  |    1 
> >>  mm/memory-failure.c |   92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  2 files changed, 93 insertions(+)
> > 
> > I have to say i disagree with how this is designed and how this is exposed to 
> > user-space - and i pointed this out before.
> > 
> > It's up to Len whether you muck up drivers/acpi/ but here you are patching mm/ 
> > again ...
> > 
> > I just had a quick look into the current affairs of mm/memory-inject.c and it 
> > has become an *even* nastier collection of hacks since the last time i 
> > commented on its uglies.
> > 
> > Special hack upon special hack, totally disorganized code, special-purpose, 
> > partly ioctl driven opaque information extraction to user-space using the 
> > erst-dbg device interface. We have all the maintenance overhead and little of 
> > the gains from hw error event features...
> 
> Like the name suggested, erst-dbg is only for debugging. [...]

Great, if printk does everything then can the debugging code be removed so that 
tooling does not accidentally make non-debugging use of it? I can write a patch 
for that.

> [...]  It is not a user space interface.  The user space interface used by 
> APEI now is printk.

We definitely want printks obviously and primarily - often that is the only 
thing the admin sees, and most of the time there's no automatable 'policy 
action' anyway: human intervention is still the most common 'action' that is 
performed on exceptional system events.

Does all the (unspecified) tooling you are enabling here work based off on 
printk only, or does it perhaps make use of the erst-dbg hack? :-)

[ Wrt. printks we definitely would like to have a printk free-form-ASCII event 
  gateway for tooling wants to use printk events in the regular flow of events 
  that are not available via the syslog - Steve sent a print-string-event patch 
  for that some time ago and that works well. ]

> > In this patch you add:
> > 
> > +struct memory_failure_entry {
> > +       unsigned long pfn;
> > +       int trapno;
> > +       int flags;
> > +};
> > 
> > Instead of exposing this event to other users who might be interested in these 
> > events - such as the RAS daemon under development by Boris.
> > 
> > We have a proper framework (ring-buffer, NMI execution, etc.) for reporting 
> > events, why are you not using (and extending) it instead of creating this nasty 
> > looking, isolated, ACPI specific low level feature?
> 
> This patch has nothing to do with hardware error event reporting.  It is just 
> about hardware error recovering.

Hardware error event reporting and recovery go hand in hand. First is the 
event, the second is the action.

Your structure demonstrates this already: it's called memory_failure_entry. It 
does:

+ * This function is called by the low level hardware error handler
+ * when it detects hardware memory corruption of a page. It schedules
+ * the recovering of error page, including dropping pages, killing
+ * processes etc.

So based off an error event it does one from a short list of in-kernel policy 
actions.

If put into a proper framework this would be a lot more widely useful: we could 
for example trigger the killing of tasks (and other policy action) if other 
(bad) events are triggered - not just the ones that fit into the narrow ACPI 
scheme you have here.

Certain fatal IO errors would be an example, or SLAB memory corruptions or OOM 
errors - or any other event we are able to report today.

So why are we not working towards integrating this into our event 
reporting/handling framework, as i suggested it from day one on when you 
started posting these patches?

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Huang, Ying May 18, 2011, 1:10 a.m. UTC | #4

On 05/17/2011 05:26 PM, Ingo Molnar wrote:
> 
> * Huang Ying <ying.huang@intel.com> wrote:
> 
>> On 05/17/2011 04:46 PM, Ingo Molnar wrote:
>>>
>>> * Huang Ying <ying.huang@intel.com> wrote:
>>>
>>>> memory_failure() is the entry point for HWPoison memory error
>>>> recovery.  It must be called in process context.  But commonly
>>>> hardware memory errors are notified via MCE or NMI, so some delayed
>>>> execution mechanism must be used.  In MCE handler, a work queue + ring
>>>> buffer mechanism is used.
>>>>
>>>> In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
>>>> (Generic Hardware Error Source) can be used to report memory errors
>>>> too.  To add support to APEI GHES memory recovery, a mechanism similar
>>>> to that of MCE is implemented.  memory_failure_queue() is the new
>>>> entry point that can be called in IRQ context.  The next step is to
>>>> make MCE handler uses this interface too.
>>>>
>>>> Signed-off-by: Huang Ying <ying.huang@intel.com>
>>>> Cc: Andi Kleen <ak@linux.intel.com>
>>>> Cc: Wu Fengguang <fengguang.wu@intel.com>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> ---
>>>>  include/linux/mm.h  |    1 
>>>>  mm/memory-failure.c |   92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 93 insertions(+)
>>>
>>> I have to say i disagree with how this is designed and how this is exposed to 
>>> user-space - and i pointed this out before.
>>>
>>> It's up to Len whether you muck up drivers/acpi/ but here you are patching mm/ 
>>> again ...
>>>
>>> I just had a quick look into the current affairs of mm/memory-inject.c and it 
>>> has become an *even* nastier collection of hacks since the last time i 
>>> commented on its uglies.
>>>
>>> Special hack upon special hack, totally disorganized code, special-purpose, 
>>> partly ioctl driven opaque information extraction to user-space using the 
>>> erst-dbg device interface. We have all the maintenance overhead and little of 
>>> the gains from hw error event features...
>>
>> Like the name suggested, erst-dbg is only for debugging. [...]
> 
> Great, if printk does everything then can the debugging code be removed so that 
> tooling does not accidentally make non-debugging use of it? I can write a patch 
> for that.

The erst-dbg is only used for us to test whether the BIOS ERST
implementation works.  If you have concerns about its mis-usage, how
about moving it to debugfs to make it clear that it is not a API, just
for debugging?

>> [...]  It is not a user space interface.  The user space interface used by 
>> APEI now is printk.
> 
> We definitely want printks obviously and primarily - often that is the only 
> thing the admin sees, and most of the time there's no automatable 'policy 
> action' anyway: human intervention is still the most common 'action' that is 
> performed on exceptional system events.
> 
> Does all the (unspecified) tooling you are enabling here work based off on 
> printk only, or does it perhaps make use of the erst-dbg hack? :-)

The only tool makes use of erst-dbg is the debugging tool to test BIOS
ERST implementation.  There is absolutely NO other tool I am enabling
uses erst-dbg.

> [ Wrt. printks we definitely would like to have a printk free-form-ASCII event 
>   gateway for tooling wants to use printk events in the regular flow of events 
>   that are not available via the syslog - Steve sent a print-string-event patch 
>   for that some time ago and that works well. ]

Thanks for your reminding, I will take a look at it.

>>> In this patch you add:
>>>
>>> +struct memory_failure_entry {
>>> +       unsigned long pfn;
>>> +       int trapno;
>>> +       int flags;
>>> +};
>>>
>>> Instead of exposing this event to other users who might be interested in these 
>>> events - such as the RAS daemon under development by Boris.
>>>
>>> We have a proper framework (ring-buffer, NMI execution, etc.) for reporting 
>>> events, why are you not using (and extending) it instead of creating this nasty 
>>> looking, isolated, ACPI specific low level feature?
>>
>> This patch has nothing to do with hardware error event reporting.  It is just 
>> about hardware error recovering.
> 
> Hardware error event reporting and recovery go hand in hand. First is the 
> event, the second is the action.
> 
> Your structure demonstrates this already: it's called memory_failure_entry. It 
> does:
> 
> + * This function is called by the low level hardware error handler
> + * when it detects hardware memory corruption of a page. It schedules
> + * the recovering of error page, including dropping pages, killing
> + * processes etc.
> 
> So based off an error event it does one from a short list of in-kernel policy 
> actions.
> 
> If put into a proper framework this would be a lot more widely useful: we could 
> for example trigger the killing of tasks (and other policy action) if other 
> (bad) events are triggered - not just the ones that fit into the narrow ACPI 
> scheme you have here.
> 
> Certain fatal IO errors would be an example, or SLAB memory corruptions or OOM 
> errors - or any other event we are able to report today.
> 
> So why are we not working towards integrating this into our event 
> reporting/handling framework, as i suggested it from day one on when you 
> started posting these patches?

The memory_failure_queue() introduced in this patch is general, that is,
it can be used not only by ACPI/APEI, but also any other hardware error
handlers, including your event reporting/handling framework.

Best Regards,
Huang Ying
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 20, 2011, 11:56 a.m. UTC | #5

* Huang Ying <ying.huang@intel.com> wrote:

> > So why are we not working towards integrating this into our event 
> > reporting/handling framework, as i suggested it from day one on when you 
> > started posting these patches?
> 
> The memory_failure_queue() introduced in this patch is general, that is, it 
> can be used not only by ACPI/APEI, but also any other hardware error 
> handlers, including your event reporting/handling framework.

Well, the bit you are steadfastly ignoring is what i have made clear well 
before you started adding these facilities: THEY ALREADY EXISTS to a large 
degree :-)

So you were and are duplicating code instead of using and extending existing 
event processing facilities. It does not matter one little bit that the code 
you added is partly 'generic', it's still overlapping and duplicated.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

huang ying May 22, 2011, 8:14 a.m. UTC | #6

On Fri, May 20, 2011 at 7:56 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Huang Ying <ying.huang@intel.com> wrote:
>
>> > So why are we not working towards integrating this into our event
>> > reporting/handling framework, as i suggested it from day one on when you
>> > started posting these patches?
>>
>> The memory_failure_queue() introduced in this patch is general, that is, it
>> can be used not only by ACPI/APEI, but also any other hardware error
>> handlers, including your event reporting/handling framework.
>
> Well, the bit you are steadfastly ignoring is what i have made clear well
> before you started adding these facilities: THEY ALREADY EXISTS to a large
> degree :-)
>
> So you were and are duplicating code instead of using and extending existing
> event processing facilities. It does not matter one little bit that the code
> you added is partly 'generic', it's still overlapping and duplicated.

How to do hardware error recovering in your perf framework?  IMHO,  it
can be something as follow:

- NMI handler run for the hardware error, where hardware error
information is collected and put into a ring buffer, an irq_work is
triggered for further work
- In irq_work handler, memory_failure_queue() is called to do the real
recovering work for recoverable memory error in ring buffer.

What's your idea about hardware error recovering in perf?

Best Regards,
Huang Ying
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 22, 2011, 10 a.m. UTC | #7

* huang ying <huang.ying.caritas@gmail.com> wrote:

> On Fri, May 20, 2011 at 7:56 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Huang Ying <ying.huang@intel.com> wrote:
> >
> >> > So why are we not working towards integrating this into our event
> >> > reporting/handling framework, as i suggested it from day one on when you
> >> > started posting these patches?
> >>
> >> The memory_failure_queue() introduced in this patch is general, that is, it
> >> can be used not only by ACPI/APEI, but also any other hardware error
> >> handlers, including your event reporting/handling framework.
> >
> > Well, the bit you are steadfastly ignoring is what i have made clear well
> > before you started adding these facilities: THEY ALREADY EXISTS to a large
> > degree :-)
> >
> > So you were and are duplicating code instead of using and extending existing
> > event processing facilities. It does not matter one little bit that the code
> > you added is partly 'generic', it's still overlapping and duplicated.
> 
> How to do hardware error recovering in your perf framework?  IMHO, it can be 
> something as follow:
> 
> - NMI handler run for the hardware error, where hardware error
> information is collected and put into a ring buffer, an irq_work is
> triggered for further work
> - In irq_work handler, memory_failure_queue() is called to do the real
> recovering work for recoverable memory error in ring buffer.
> 
> What's your idea about hardware error recovering in perf?

The first step, the whole irq_work and ring buffer already looks largely 
duplicated: you can collect into a perf event ring-buffer from NMI context like 
the regular perf events do.

The generalization that *would* make sense is not at the irq_work level really, 
instead we could generalize a 'struct event' for kernel internal producers and 
consumers of events that have no explicit PMU connection.

This new 'struct event' would be slimmer and would only contain the fields and 
features that generic event consumers and producers need. Tracing events could 
be updated to use these kinds of slimmer events.

It would still plug nicely into existing event ABIs, would work with event 
filters, etc. so the tooling side would remain focused and unified.

Something like that. It is rather clear by now that splitting out irq_work was 
a mistake. But mistakes can be fixed and some really nice code could come out 
of it! Would you be interested in looking into this?

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

huang ying May 22, 2011, 12:32 p.m. UTC | #8

On Sun, May 22, 2011 at 6:00 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * huang ying <huang.ying.caritas@gmail.com> wrote:
>
>> On Fri, May 20, 2011 at 7:56 PM, Ingo Molnar <mingo@elte.hu> wrote:
>> >
>> > * Huang Ying <ying.huang@intel.com> wrote:
>> >
>> >> > So why are we not working towards integrating this into our event
>> >> > reporting/handling framework, as i suggested it from day one on when you
>> >> > started posting these patches?
>> >>
>> >> The memory_failure_queue() introduced in this patch is general, that is, it
>> >> can be used not only by ACPI/APEI, but also any other hardware error
>> >> handlers, including your event reporting/handling framework.
>> >
>> > Well, the bit you are steadfastly ignoring is what i have made clear well
>> > before you started adding these facilities: THEY ALREADY EXISTS to a large
>> > degree :-)
>> >
>> > So you were and are duplicating code instead of using and extending existing
>> > event processing facilities. It does not matter one little bit that the code
>> > you added is partly 'generic', it's still overlapping and duplicated.
>>
>> How to do hardware error recovering in your perf framework?  IMHO, it can be
>> something as follow:
>>
>> - NMI handler run for the hardware error, where hardware error
>> information is collected and put into a ring buffer, an irq_work is
>> triggered for further work
>> - In irq_work handler, memory_failure_queue() is called to do the real
>> recovering work for recoverable memory error in ring buffer.
>>
>> What's your idea about hardware error recovering in perf?
>
> The first step, the whole irq_work and ring buffer already looks largely
> duplicated: you can collect into a perf event ring-buffer from NMI context like
> the regular perf events do.

Why duplicated? perf uses the general irq_work too.

> The generalization that *would* make sense is not at the irq_work level really,
> instead we could generalize a 'struct event' for kernel internal producers and
> consumers of events that have no explicit PMU connection.
>
> This new 'struct event' would be slimmer and would only contain the fields and
> features that generic event consumers and producers need. Tracing events could
> be updated to use these kinds of slimmer events.
>
> It would still plug nicely into existing event ABIs, would work with event
> filters, etc. so the tooling side would remain focused and unified.
>
> Something like that. It is rather clear by now that splitting out irq_work was
> a mistake. But mistakes can be fixed and some really nice code could come out
> of it! Would you be interested in looking into this?

Yes.  This can transfer hardware error data from kernel to user space.
 Then, how to do hardware error recovering in this big picture?  IMHO,
we will need to call something like memory_failure_queue() in IRQ
context for memory error.

Best Regards,
Huang Ying
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 22, 2011, 1:25 p.m. UTC | #9

* huang ying <huang.ying.caritas@gmail.com> wrote:

> On Sun, May 22, 2011 at 6:00 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * huang ying <huang.ying.caritas@gmail.com> wrote:
> >
> >> On Fri, May 20, 2011 at 7:56 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >> >
> >> > * Huang Ying <ying.huang@intel.com> wrote:
> >> >
> >> >> > So why are we not working towards integrating this into our event
> >> >> > reporting/handling framework, as i suggested it from day one on when you
> >> >> > started posting these patches?
> >> >>
> >> >> The memory_failure_queue() introduced in this patch is general, that is, it
> >> >> can be used not only by ACPI/APEI, but also any other hardware error
> >> >> handlers, including your event reporting/handling framework.
> >> >
> >> > Well, the bit you are steadfastly ignoring is what i have made clear well
> >> > before you started adding these facilities: THEY ALREADY EXISTS to a large
> >> > degree :-)
> >> >
> >> > So you were and are duplicating code instead of using and extending existing
> >> > event processing facilities. It does not matter one little bit that the code
> >> > you added is partly 'generic', it's still overlapping and duplicated.
> >>
> >> How to do hardware error recovering in your perf framework?  IMHO, it can be
> >> something as follow:
> >>
> >> - NMI handler run for the hardware error, where hardware error
> >> information is collected and put into a ring buffer, an irq_work is
> >> triggered for further work
> >> - In irq_work handler, memory_failure_queue() is called to do the real
> >> recovering work for recoverable memory error in ring buffer.
> >>
> >> What's your idea about hardware error recovering in perf?
> >
> > The first step, the whole irq_work and ring buffer already looks largely
> > duplicated: you can collect into a perf event ring-buffer from NMI context like
> > the regular perf events do.
> 
> Why duplicated? perf uses the general irq_work too.

Yes, of course, because - if you still remember - Peter split irq_work out of 
perf events:

 e360adbe2924: irq_work: Add generic hardirq context callbacks

 |
 | Perf currently has such a mechanism, so extract that and provide it as a 
 | generic feature, independent of perf so that others may also benefit.
 |

:-)

But in hindsight the level of abstraction (for this usecase) was set too low, 
because we lose wider access to the actual events themselves:

> > The generalization that *would* make sense is not at the irq_work level 
> > really, instead we could generalize a 'struct event' for kernel internal 
> > producers and consumers of events that have no explicit PMU connection.
> >
> > This new 'struct event' would be slimmer and would only contain the fields 
> > and features that generic event consumers and producers need. Tracing 
> > events could be updated to use these kinds of slimmer events.
> >
> > It would still plug nicely into existing event ABIs, would work with event 
> > filters, etc. so the tooling side would remain focused and unified.
> >
> > Something like that. It is rather clear by now that splitting out irq_work 
> > was a mistake. But mistakes can be fixed and some really nice code could 
> > come out of it! Would you be interested in looking into this?
> 
> Yes.  This can transfer hardware error data from kernel to user space. Then, 
> how to do hardware error recovering in this big picture?  IMHO, we will need 
> to call something like memory_failure_queue() in IRQ context for memory 
> error.

That's where 'active filters' come into the picture - see my other mail (that 
was in the context of unidentified NMI errors/events) where i outlined how they 
would work in this case and elsewhere. Via active filters we could share most 
of the code, gain access to the events and still have kernel driven policy 
action.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Huang, Ying May 23, 2011, 2:38 a.m. UTC | #10

On 05/22/2011 09:25 PM, Ingo Molnar wrote:
>>> The generalization that *would* make sense is not at the irq_work level 
>>> really, instead we could generalize a 'struct event' for kernel internal 
>>> producers and consumers of events that have no explicit PMU connection.
>>>
>>> This new 'struct event' would be slimmer and would only contain the fields 
>>> and features that generic event consumers and producers need. Tracing 
>>> events could be updated to use these kinds of slimmer events.
>>>
>>> It would still plug nicely into existing event ABIs, would work with event 
>>> filters, etc. so the tooling side would remain focused and unified.
>>>
>>> Something like that. It is rather clear by now that splitting out irq_work 
>>> was a mistake. But mistakes can be fixed and some really nice code could 
>>> come out of it! Would you be interested in looking into this?
>>
>> Yes.  This can transfer hardware error data from kernel to user space. Then, 
>> how to do hardware error recovering in this big picture?  IMHO, we will need 
>> to call something like memory_failure_queue() in IRQ context for memory 
>> error.
> 
> That's where 'active filters' come into the picture - see my other mail (that 
> was in the context of unidentified NMI errors/events) where i outlined how they 
> would work in this case and elsewhere. Via active filters we could share most 
> of the code, gain access to the events and still have kernel driven policy 
> action.

Is that something as follow?

- NMI handler run for the hardware error, where hardware error
information is collected and put into perf ring buffer as 'event'.

- Some 'active filters' are run for each 'event' in NMI context.

- Some operations can not be done in NMI handler, so they are delayed to
an IRQ handler (can be done with something like irq_work).

- Some other 'active filters' are run for each 'event' in IRQ context.
(For memory error, we can call memory_failure_queue() here).

Where some 'active filters' are kernel built-in, some 'active filters'
can be customized via kernel command line or by user space.


If my understanding as above is correct, I think this is a general and
complex solution.  It is a little hard for user to understand which
'active filters' are in effect.  He may need some runtime assistant to
understand the code (maybe /sys/events/active_filters, which list all
filters in effect now), because that is hard only by reading the source
code.  Anyway, this is a design style choice.

There are still some issues, I don't know how to solve in above framework.

- If there are two processes request the same type of hardware error
events.  One hardware error event will be copied to two ring buffers
(each for one process),  but the 'active filters' should be run only
once for each hardware error event.

- How to deal with ring-buffer overflow?  For example, there is full of
corrected memory error in ring-buffer, and now a recoverable memory
error occurs but it can not be put into perf ring buffer because of
ring-buffer overflow, how to deal with the recoverable memory error?

Best Regards,
Huang Ying
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 23, 2011, 11:01 a.m. UTC | #11

* Huang Ying <ying.huang@intel.com> wrote:

> > That's where 'active filters' come into the picture - see my other mail 
> > (that was in the context of unidentified NMI errors/events) where i 
> > outlined how they would work in this case and elsewhere. Via active filters 
> > we could share most of the code, gain access to the events and still have 
> > kernel driven policy action.
> 
> Is that something as follow?
> 
> - NMI handler run for the hardware error, where hardware error
>   information is collected and put into perf ring buffer as 'event'.

Correct.

Note that for MCE errors we want the 'persistent event' framework Boris has 
posted: we want these events to be buffered up to a point even if there is no 
tool listening in on them:

 - this gives us boot-time MCE error coverage

 - this protects us against a logging daemon being restarted and events
   getting lost

> - Some 'active filters' are run for each 'event' in NMI context.

Yeah. Whether it's a human-ASCII space 'filter' or really just a callback you 
register with that event is secondary - both would work.

> - Some operations can not be done in NMI handler, so they are delayed to
>   an IRQ handler (can be done with something like irq_work).

Yes.

> - Some other 'active filters' are run for each 'event' in IRQ context.
>   (For memory error, we can call memory_failure_queue() here).

Correct.

> Where some 'active filters' are kernel built-in, some 'active filters' can be 
> customized via kernel command line or by user space.

Yes.

> If my understanding as above is correct, I think this is a general and 
> complex solution.  It is a little hard for user to understand which 'active 
> filters' are in effect.  He may need some runtime assistant to understand the 
> code (maybe /sys/events/active_filters, which list all filters in effect 
> now), because that is hard only by reading the source code.  Anyway, this is 
> a design style choice.

I don't think it's complex: the built-in rules are in plain sight (can be in 
the source code or can even be explicitly registered callbacks), the 
configuration/tooling installed rules will be as complex as the admin or tool 
wants them to be.

> There are still some issues, I don't know how to solve in above framework.
> 
> - If there are two processes request the same type of hardware error
>   events.  One hardware error event will be copied to two ring buffers (each 
>   for one process), but the 'active filters' should be run only once for each 
>   hardware error event.

With persistent events 'active filters' should only be attached to the central 
persistent event.

> - How to deal with ring-buffer overflow?  For example, there is full of 
>   corrected memory error in ring-buffer, and now a recoverable memory error 
>   occurs but it can not be put into perf ring buffer because of ring-buffer 
>   overflow, how to deal with the recoverable memory error?

The solution is to make it large enough. With *every* queueing solution there 
will be some sort of queue size limit.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Tony Luck May 23, 2011, 4:45 p.m. UTC | #12

>> - NMI handler run for the hardware error, where hardware error
>>   information is collected and put into perf ring buffer as 'event'.
>
> Correct.
>
> Note that for MCE errors we want the 'persistent event' framework Boris has 
> posted: we want these events to be buffered up to a point even if there is no 
> tool listening in on them:

This is a very opportune time to have this discussion. I've been working
on getting "in context" recoverable errors working. Sandybridge Server
platforms will allow for recovery for both instruction and data fetches
in the current execution context. These are flagged in the machine check
bank with the "AR" (Action Required) set to 1 (along with several other
bits making up a recognizable signature).

The critical feature here is that we must not return from the machine
check handler to the context that tripped over the error. In the case
of the data fault, we'll just re-execute the same access and take
another machine check. In the case of the instruction fault there is
no valid context to return to (MCGSTATUS.RIPV is zero).

There are a couple of cases where recovery is possible:

1) The memory error was found while executing user mode code.

The code I have now for recovery makes use of TIF_MCE_NOTIFY to
make sure that we don't return to the user, but instead end up
in arch/x86/kernel/signal.c:do_notify_resume() where we arrange
to have the process handle its own recovery (using mm/memory-failure.c
to figure out the type of page, and probably resulting in the mapping
out of the page and sending SIGBUS to the process).

In your proposed solution, we'd generate an event that would be handled
by some process/daemon ... but how would we ensure that the affected
process does not run in the mean time? Could we create some analogous
method to the ptrace stopped state, and hand control of the affected
process to the daemon that gets the event?

2) The memory error was found in certain special sections of the
   kernel for which recovery is possible (e.g. while copying to/from
   user memory, perhaps also page copy and page clear).

Here I don't have a solution. TIF_MCE_NOTIFY isn't checked when returning
from do_machine_check() to kernel code.

In a CONFIG_PREEMPT=y kernel, all of the recoverable cases ought to be
in places where pre-emption is allowed ... so perhaps we can also use
the stop-and-switch option here?

-Tony

--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Huang, Ying May 24, 2011, 2:10 a.m. UTC | #13

On 05/23/2011 07:01 PM, Ingo Molnar wrote:
>> If my understanding as above is correct, I think this is a general and 
>> complex solution.  It is a little hard for user to understand which 'active 
>> filters' are in effect.  He may need some runtime assistant to understand the 
>> code (maybe /sys/events/active_filters, which list all filters in effect 
>> now), because that is hard only by reading the source code.  Anyway, this is 
>> a design style choice.
> 
> I don't think it's complex: the built-in rules are in plain sight (can be in 
> the source code or can even be explicitly registered callbacks), the 
> configuration/tooling installed rules will be as complex as the admin or tool 
> wants them to be.
> 
>> There are still some issues, I don't know how to solve in above framework.
>>
>> - If there are two processes request the same type of hardware error
>>   events.  One hardware error event will be copied to two ring buffers (each 
>>   for one process), but the 'active filters' should be run only once for each 
>>   hardware error event.
> 
> With persistent events 'active filters' should only be attached to the central 
> persistent event.

OK. I see.

>> - How to deal with ring-buffer overflow?  For example, there is full of 
>>   corrected memory error in ring-buffer, and now a recoverable memory error 
>>   occurs but it can not be put into perf ring buffer because of ring-buffer 
>>   overflow, how to deal with the recoverable memory error?
> 
> The solution is to make it large enough. With *every* queueing solution there 
> will be some sort of queue size limit.

Another solution could be:

Create two ring-buffer. One is for logging and will be read by RAS
daemon; the other is for recovering, the event record will be removed
from the ring-buffer after all 'active filters' have been run on it.
Even RAS daemon being restarted or hang, recoverable error can be taken
cared of.

Best Regards,
Huang Ying
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 24, 2011, 2:48 a.m. UTC | #14

* Huang Ying <ying.huang@intel.com> wrote:

> >> - How to deal with ring-buffer overflow?  For example, there is full of 
> >>   corrected memory error in ring-buffer, and now a recoverable memory error 
> >>   occurs but it can not be put into perf ring buffer because of ring-buffer 
> >>   overflow, how to deal with the recoverable memory error?
> > 
> > The solution is to make it large enough. With *every* queueing solution there 
> > will be some sort of queue size limit.
> 
> Another solution could be:
> 
> Create two ring-buffer. One is for logging and will be read by RAS
> daemon; the other is for recovering, the event record will be removed
> from the ring-buffer after all 'active filters' have been run on it.
> Even RAS daemon being restarted or hang, recoverable error can be taken
> cared of.

Well, filters will always be executed since they execute when the event is 
inserted - not when it's extracted.

So if you worry about losing *filter* executions (and dependent policy action) 
- there should be no loss there, ever.

But yes, the scheme you outline would work as well: a counting-only event with 
a filter specified - this will do no buffering at all.

So ... to get the ball rolling in this area one of you guys active in RAS 
should really try a first approximation for the active filter approach: add a 
test-TRACE_EVENT() for the errors you are interested in and define a convenient 
way to register policy action with post-filter events. This should work even 
without having the 'active' portion defined at the ABI and filter-string level.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Huang, Ying May 24, 2011, 3:07 a.m. UTC | #15

On 05/24/2011 10:48 AM, Ingo Molnar wrote:
> 
> * Huang Ying <ying.huang@intel.com> wrote:
> 
>>>> - How to deal with ring-buffer overflow?  For example, there is full of 
>>>>   corrected memory error in ring-buffer, and now a recoverable memory error 
>>>>   occurs but it can not be put into perf ring buffer because of ring-buffer 
>>>>   overflow, how to deal with the recoverable memory error?
>>>
>>> The solution is to make it large enough. With *every* queueing solution there 
>>> will be some sort of queue size limit.
>>
>> Another solution could be:
>>
>> Create two ring-buffer. One is for logging and will be read by RAS
>> daemon; the other is for recovering, the event record will be removed
>> from the ring-buffer after all 'active filters' have been run on it.
>> Even RAS daemon being restarted or hang, recoverable error can be taken
>> cared of.
> 
> Well, filters will always be executed since they execute when the event is 
> inserted - not when it's extracted.

For filters executed in NMI context, they can be executed when the event
is inserted, no need for buffering.  But for filters executed in
deferred IRQ context, they need to be executed when event's extracted.

> So if you worry about losing *filter* executions (and dependent policy action) 
> - there should be no loss there, ever.
> 
> But yes, the scheme you outline would work as well: a counting-only event with 
> a filter specified - this will do no buffering at all.
> 
> So ... to get the ball rolling in this area one of you guys active in RAS 
> should really try a first approximation for the active filter approach: add a 
> test-TRACE_EVENT() for the errors you are interested in and define a convenient 
> way to register policy action with post-filter events. This should work even 
> without having the 'active' portion defined at the ABI and filter-string level.

Best Regards,
Huang Ying
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 24, 2011, 4:24 a.m. UTC | #16

* Huang Ying <ying.huang@intel.com> wrote:

> On 05/24/2011 10:48 AM, Ingo Molnar wrote:
> > 
> > * Huang Ying <ying.huang@intel.com> wrote:
> > 
> >>>> - How to deal with ring-buffer overflow?  For example, there is full of 
> >>>>   corrected memory error in ring-buffer, and now a recoverable memory error 
> >>>>   occurs but it can not be put into perf ring buffer because of ring-buffer 
> >>>>   overflow, how to deal with the recoverable memory error?
> >>>
> >>> The solution is to make it large enough. With *every* queueing solution there 
> >>> will be some sort of queue size limit.
> >>
> >> Another solution could be:
> >>
> >> Create two ring-buffer. One is for logging and will be read by RAS
> >> daemon; the other is for recovering, the event record will be removed
> >> from the ring-buffer after all 'active filters' have been run on it.
> >> Even RAS daemon being restarted or hang, recoverable error can be taken
> >> cared of.
> > 
> > Well, filters will always be executed since they execute when the event is 
> > inserted - not when it's extracted.
> 
> For filters executed in NMI context, they can be executed when the event
> is inserted, no need for buffering.  But for filters executed in
> deferred IRQ context, they need to be executed when event's extracted.

Correct.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ingo Molnar May 25, 2011, 2:08 p.m. UTC | #17

* Luck, Tony <tony.luck@intel.com> wrote:

> In your proposed solution, we'd generate an event that would be 
> handled by some process/daemon ... but how would we ensure that the 
> affected process does not run in the mean time? Could we create 
> some analogous method to the ptrace stopped state, and hand control 
> of the affected process to the daemon that gets the event?

Ok, i think there is a bit of a misunderstanding here - which is not 
a surprise really: we made generic arguments all along with very few 
specifics.

The RAS daemon would deal with 'slow' policy action: fully recovered 
events. It would also log various events so that people can do post 
mortem etc.

The main point of defining events here is so that there's a single 
method of transport and a single flexible method of defining and 
extracting events.

Some of the event processing would occur in the kernel: in code that 
knows about memory_failure() and calls it while making sure we do not 
execute any user-space instruction.

Some of the code would execute *very* early and in a very atomic way, 
still in NMI context: panicing the box if the error is so severe.

Neither of these are steps that the RAS daemon can or wants to 
handle.

The RAS tools would interact with the regular perf facilities setting 
and configuring the various RAS related events. They'd handle the 
'severity' config bits, they'd initiate testing (injection), etc.

Ideally the RAS daemon and tools would do what syslog does (and 
more), with more structured events. In the end of the day most of the 
'policy action' is taken by humans anyway, who want to take a look at 
some ASCII output. So printk() integration and obvious ASCII output 
for everything is important along the way.

> 2) The memory error was found in certain special sections of the
>    kernel for which recovery is possible (e.g. while copying to/from
>    user memory, perhaps also page copy and page clear).
> 
> Here I don't have a solution. TIF_MCE_NOTIFY isn't checked when 
> returning from do_machine_check() to kernel code.

Well, since we are already in interrupt context (albeit in a very 
atomic NMI context), sending a self-IPI is not strictly necessary. We 
could fix up the return address and jump to the right handler 
straight away during the IRET.

A self-IPI might also not execute *immediately* - there's always the 
chance of APIC related delays.

> In a CONFIG_PREEMPT=y kernel, all of the recoverable cases ought to 
> be in places where pre-emption is allowed ... so perhaps we can 
> also use the stop-and-switch option here?

Yes, these are generally preemptible cases - and if they are not we 
can make the error fatal (we do not have to handle *every* complex 
case, giving up is a fair answer as well - we do not want rare code 
to be complex really).

But you don't need to stop-and-switch: just stack-nesting on top of 
whatever preemptible code was running there would be enough, wouldnt 
it? That stops a task from executing until the decision has been made 
whether it can continue or not.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[5/9] HWPoison: add memory_failure_queue()

Commit Message

Comments

Patch