
[v2] ACPI: APEI: do not add task_work to kernel thread to avoid memory leak

Message ID 20220924074953.83064-1-xueshuai@linux.alibaba.com (mailing list archive)
State Mainlined, archived
Series [v2] ACPI: APEI: do not add task_work to kernel thread to avoid memory leak

Commit Message

Shuai Xue Sept. 24, 2022, 7:49 a.m. UTC
If an error is detected as a result of a user-space process accessing a
corrupted memory location, the CPU may take an abort. The platform
firmware then reports it to the kernel via NMI-like notifications, e.g.
NOTIFY_SEA, NOTIFY_SOFTWARE_DELEGATED, etc.

For NMI-like notifications, commit 7f17b4a121d0 ("ACPI: APEI: Kick the
memory_failure() queue for synchronous errors") keeps track of whether
memory_failure() work was queued, and makes task_work pending to flush out
the queue so that the work is processed before returning to user space.

The code uses init_mm to check whether the error occurred in user space:

    if (current->mm != &init_mm)

The condition is always true, because _nobody_ ever has "init_mm" as a real
VM any more.

In addition to aborts, errors can also be signaled as asynchronous
exceptions, such as interrupts and SErrors. In such cases, the interrupted
current process could be any kind of thread. When a kernel thread is
interrupted, the ghes_kick_task_work() callback deferred via task_work will
never be processed because the exception return path goes through
ret_to_kernel() instead of ret_to_user(). Consequently, the estatus_node
allocated from ghes_estatus_pool in ghes_in_nmi_queue_one_entry() will not
be freed. After around 200 allocations on our platform, ghes_estatus_pool
will run out of memory and ghes_in_nmi_queue_one_entry() returns -ENOMEM.
As a result, events fail to be processed:

    sdei: event 805 on CPU 113 failed with error: -2

Finally, a lot of unhandled events may cause platform firmware to exceed
some threshold and reboot.

The condition should generally just do

    if (current->mm)

as described in active_mm.rst documentation.

Then, if an asynchronous error is detected while a kernel thread is running
(e.g. when detected by a background scrubber), task_work is not added to it,
as the original patch intended.

Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
changes since v1:
- describe the side effect and give more details

 drivers/acpi/apei/ghes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Rafael J. Wysocki Sept. 24, 2022, 5:17 p.m. UTC | #1
On Sat, Sep 24, 2022 at 9:50 AM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>
> If an error is detected as a result of a user-space process accessing a
> corrupted memory location, the CPU may take an abort. The platform
> firmware then reports it to the kernel via NMI-like notifications, e.g.
> NOTIFY_SEA, NOTIFY_SOFTWARE_DELEGATED, etc.
>
> For NMI-like notifications, commit 7f17b4a121d0 ("ACPI: APEI: Kick the
> memory_failure() queue for synchronous errors") keeps track of whether
> memory_failure() work was queued, and makes task_work pending to flush out
> the queue so that the work is processed before returning to user space.
>
> The code uses init_mm to check whether the error occurred in user space:
>
>     if (current->mm != &init_mm)
>
> The condition is always true, because _nobody_ ever has "init_mm" as a real
> VM any more.
>
> In addition to aborts, errors can also be signaled as asynchronous
> exceptions, such as interrupts and SErrors. In such cases, the interrupted
> current process could be any kind of thread. When a kernel thread is
> interrupted, the ghes_kick_task_work() callback deferred via task_work will
> never be processed because the exception return path goes through
> ret_to_kernel() instead of ret_to_user(). Consequently, the estatus_node
> allocated from ghes_estatus_pool in ghes_in_nmi_queue_one_entry() will not
> be freed. After around 200 allocations on our platform, ghes_estatus_pool
> will run out of memory and ghes_in_nmi_queue_one_entry() returns -ENOMEM.
> As a result, events fail to be processed:
>
>     sdei: event 805 on CPU 113 failed with error: -2
>
> Finally, a lot of unhandled events may cause platform firmware to exceed
> some threshold and reboot.
>
> The condition should generally just do
>
>     if (current->mm)
>
> as described in active_mm.rst documentation.
>
> Then, if an asynchronous error is detected while a kernel thread is running
> (e.g. when detected by a background scrubber), task_work is not added to it,
> as the original patch intended.
>
> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>

I need the APEI code reviewers to tell me that this is all OK.

> ---
> changes since v1:
> - describe the side effect and give more details
>
>  drivers/acpi/apei/ghes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index d91ad378c00d..80ad530583c9 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -985,7 +985,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>                                 ghes_estatus_cache_add(generic, estatus);
>                 }
>
> -               if (task_work_pending && current->mm != &init_mm) {
> +               if (task_work_pending && current->mm) {
>                         estatus_node->task_work.func = ghes_kick_task_work;
>                         estatus_node->task_work_cpu = smp_processor_id();
>                         ret = task_work_add(current, &estatus_node->task_work,
> --
> 2.20.1.12.g72788fdb
>
Shuai Xue Sept. 26, 2022, 11:35 a.m. UTC | #2
On 2022/9/25 1:17 AM, Rafael J. Wysocki wrote:
> On Sat, Sep 24, 2022 at 9:50 AM Shuai Xue <xueshuai@linux.alibaba.com> wrote:
>>
>> If an error is detected as a result of a user-space process accessing a
>> corrupted memory location, the CPU may take an abort. The platform
>> firmware then reports it to the kernel via NMI-like notifications, e.g.
>> NOTIFY_SEA, NOTIFY_SOFTWARE_DELEGATED, etc.
>>
>> For NMI-like notifications, commit 7f17b4a121d0 ("ACPI: APEI: Kick the
>> memory_failure() queue for synchronous errors") keeps track of whether
>> memory_failure() work was queued, and makes task_work pending to flush out
>> the queue so that the work is processed before returning to user space.
>>
>> The code uses init_mm to check whether the error occurred in user space:
>>
>>     if (current->mm != &init_mm)
>>
>> The condition is always true, because _nobody_ ever has "init_mm" as a real
>> VM any more.
>>
>> In addition to aborts, errors can also be signaled as asynchronous
>> exceptions, such as interrupts and SErrors. In such cases, the interrupted
>> current process could be any kind of thread. When a kernel thread is
>> interrupted, the ghes_kick_task_work() callback deferred via task_work will
>> never be processed because the exception return path goes through
>> ret_to_kernel() instead of ret_to_user(). Consequently, the estatus_node
>> allocated from ghes_estatus_pool in ghes_in_nmi_queue_one_entry() will not
>> be freed. After around 200 allocations on our platform, ghes_estatus_pool
>> will run out of memory and ghes_in_nmi_queue_one_entry() returns -ENOMEM.
>> As a result, events fail to be processed:
>>
>>     sdei: event 805 on CPU 113 failed with error: -2
>>
>> Finally, a lot of unhandled events may cause platform firmware to exceed
>> some threshold and reboot.
>>
>> The condition should generally just do
>>
>>     if (current->mm)
>>
>> as described in active_mm.rst documentation.
>>
>> Then, if an asynchronous error is detected while a kernel thread is running
>> (e.g. when detected by a background scrubber), task_work is not added to it,
>> as the original patch intended.
>>
>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> 
> I need the APEI code reviewers to tell me that this is all OK.

Thank you for your reply. OK, let's wait for the reviewers' comments.

Best Regards,
Shuai


> 
>> ---
>> changes since v1:
>> - describe the side effect and give more details
>>
>>  drivers/acpi/apei/ghes.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index d91ad378c00d..80ad530583c9 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -985,7 +985,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
>>                                 ghes_estatus_cache_add(generic, estatus);
>>                 }
>>
>> -               if (task_work_pending && current->mm != &init_mm) {
>> +               if (task_work_pending && current->mm) {
>>                         estatus_node->task_work.func = ghes_kick_task_work;
>>                         estatus_node->task_work_cpu = smp_processor_id();
>>                         ret = task_work_add(current, &estatus_node->task_work,
>> --
>> 2.20.1.12.g72788fdb
>>
Tony Luck Sept. 26, 2022, 3:20 p.m. UTC | #3
>>
>> -               if (task_work_pending && current->mm != &init_mm) {
>> +               if (task_work_pending && current->mm) {
>>                         estatus_node->task_work.func = ghes_kick_task_work;
>>                         estatus_node->task_work_cpu = smp_processor_id();
>>                         ret = task_work_add(current, &estatus_node->task_work,

It seems that you are getting errors reported while running kernel threads. This fix avoids
pointlessly adding "task_work" that will never be processed because kernel threads never
return to user mode.

But maybe something else needs to be done? The code was, and with this fix still is,
taking no action for the error. That doesn't seem right.

Are you injecting errors to create this scenario? What does your test do?

-Tony
Shuai Xue Sept. 27, 2022, 3:50 a.m. UTC | #4
On 2022/9/26 11:20 PM, Luck, Tony wrote:
>>>
>>> -               if (task_work_pending && current->mm != &init_mm) {
>>> +               if (task_work_pending && current->mm) {
>>>                         estatus_node->task_work.func = ghes_kick_task_work;
>>>                         estatus_node->task_work_cpu = smp_processor_id();
>>>                         ret = task_work_add(current, &estatus_node->task_work,
> 
> It seems that you are getting errors reported while running kernel threads. This fix avoids
> pointlessly adding "task_work" that will never be processed because kernel threads never
> return to user mode.

Yes, you are right.

> 
> But maybe something else needs to be done? The code was, and with this fix still is,
> taking no action for the error. That doesn't seem right.

Sorry, I don't think so. As far as I know, on the Arm platform, a hardware error can signal
exceptions including:

- Synchronous External Abort (SEA), e.g. data abort or instruction abort
- Asynchronous External Abort (signalled as SError), e.g. the L2 can generate an SError for
  error responses from the interconnect for a Device or Non-cacheable store
- Fault Handling and Error Recovery interrupts, e.g. DDR mainline/demand/scrubber error interrupts

When the error signals an asynchronous exception (SError or interrupt), any kind of thread can
be interrupted, including kernel threads. Because asynchronous exceptions are signaled in the
background, the errors are detected outside of the current execution context.

The GHES driver always queues work to handle memory failure of a page in memory_failure_queue().
If a kernel thread is interrupted:

- Without this fix, the added task_work will never be processed, so the work will not
  be canceled.
- With this fix, the task_work will not be added.

In conclusion, the error will be handled in a kworker with or without this fix.

The point of the fix is that:

- The condition is useless because it is always true, and I do not think that is what the original
  patch intended to do.
- The current code leads to memory leaks: the estatus_node will not be freed when task_work is
  added to a kernel thread (see the sketch below).
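
For illustration, here is a simplified sketch of the relevant tail of ghes_proc_in_irq() with the
fix applied (paraphrased from mainline, error paths trimmed; treat it as a sketch rather than the
exact upstream code):

    /* estatus_node was allocated from ghes_estatus_pool in NMI context */
    if (task_work_pending && current->mm) {
            /* user context: kick the queue before returning to user space */
            estatus_node->task_work.func = ghes_kick_task_work;
            estatus_node->task_work_cpu = smp_processor_id();
            ret = task_work_add(current, &estatus_node->task_work,
                                TWA_RESUME);
            if (ret)
                    estatus_node->task_work.func = NULL;
    }

    /* kernel thread (or task_work_add() failure): no task_work was added,
     * so the node must be freed here instead of in ghes_kick_task_work() */
    if (!estatus_node->task_work.func)
            gen_pool_free(ghes_estatus_pool,
                          (unsigned long)estatus_node, node_len);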


> 
> Are you injecting errors to create this scenario? 

Yes, I am injecting errors to create such a scenario. After 200 injections, ghes_estatus_pool
will run out of memory and ghes_in_nmi_queue_one_entry() returns -ENOMEM. Finally, a lot of unhandled
events may cause the platform firmware to exceed some threshold and reboot.

> What does your test do?

My injection includes two steps (a rough trigger sketch follows the list):

- mmap device memory to user space in a driver
- inject an uncorrectable error and trigger a write, like ras-tools does
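
A rough, hypothetical sketch of the user-space trigger side (the device node name below is made up
for illustration; the real mapping is done by our test driver, and the store mirrors what ras-tools does):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            /* hypothetical char device exported by the test driver */
            int fd = open("/dev/einj_test_dev", O_RDWR);
            if (fd < 0)
                    return 1;

            /* map one page of device memory into user space */
            volatile uint8_t *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    return 1;

            /* an uncorrectable error is injected into this page out of band,
             * then this store consumes/triggers it */
            p[0] = 0x55;

            munmap((void *)p, 4096);
            close(fd);
            return 0;
    }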

I have open-sourced the code and you can find it here[1]. It's forked from your repo and mainly based
on your code :)

By the way, do you have any plans to extend ras-tools to the Arm platform? And is there a mailing list
for ras-tools? I sent a mail about adding test cases to ras-tools several weeks ago, but got no response.
Maybe your mailbox regarded it as spam.


[1] https://gitee.com/anolis/ras-tools/tree/arm-devel
Tony Luck Sept. 27, 2022, 5:47 p.m. UTC | #5
I follow and agree with everything up until:

> In conclusion, the error will be handled in a kworker with or without this fix.

Who handles the error if the interrupt happens during the execution of a kthread?

It isn't handled during the interrupt (it can't be).

Can't use the task_work_add() trick to handle it (because this thread never returns to user mode).

So how is the error handled?

-Tony
Shuai Xue Sept. 29, 2022, 2:33 a.m. UTC | #6
On 2022/9/28 1:47 AM, Luck, Tony wrote:
> I follow and agree with everything up until:
> 
>> In conclusion, the error will be handled in a kworker with or without this fix.

> 
> It isn't handled during the interrupt (it can't be).

Yes, it is not handled during the interrupt, and it does not have to be.

>
> Who handles the error if the interrupt happens during the execution of a kthread?

As I mentioned, the GHES driver always queues work onto a workqueue to handle memory
failure of a page in memory_failure_queue(), so the **worker will be scheduled and
handle the memory failure later**.
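
Roughly speaking, memory_failure_queue() just records the pfn and schedules a per-CPU work item.
The sketch below is simplified from mm/memory-failure.c (locking and error reporting omitted), so
take it as an outline rather than the exact code:

    /* defer memory_failure() handling to a per-CPU workqueue item */
    void memory_failure_queue(unsigned long pfn, int flags)
    {
            struct memory_failure_cpu *mf_cpu;
            struct memory_failure_entry entry = {
                    .pfn   = pfn,
                    .flags = flags,
            };

            mf_cpu = &get_cpu_var(memory_failure_cpu);
            if (kfifo_put(&mf_cpu->fifo, entry))
                    /* memory_failure_work_func() will drain the fifo later */
                    schedule_work_on(smp_processor_id(), &mf_cpu->work);
            put_cpu_var(memory_failure_cpu);
    }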

> 
> Can't use the task_work_add() trick to handle it (because this thread never returns to user mode).

Yes, it cannot. And this is the key point of the fix.

> 
> So how is the error handled?
> 

The workflow to handle a hardware error is summarized below:

-----------------------------------------------------------------------------
[ghes_sdei_critical_callback: current swapper/3, CPU 3]
ghes_sdei_critical_callback
    => __ghes_sdei_callback
        => ghes_in_nmi_queue_one_entry 		// peek and read estatus
        => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work
[ghes_sdei_critical_callback: return]
-----------------------------------------------------------------------------
[ghes_proc_in_irq: current swapper/3, CPU 3]
            => ghes_do_proc
                => ghes_handle_memory_failure
                    => ghes_do_memory_failure
                        => memory_failure_queue	 // put work task on current CPU
                            => if (kfifo_put(&mf_cpu->fifo, entry))
                                  schedule_work_on(smp_processor_id(), &mf_cpu->work);
            => task_work_add(current, &estatus_node->task_work, TWA_RESUME); // fix here, always added to current
[ghes_proc_in_irq: return]
-----------------------------------------------------------------------------
// kworker preempts swapper/3 on CPU 3 due to RESCHED flag
[memory_failure_work_func: current kworker, CPU 3]	
     => memory_failure_work_func(&mf_cpu->work)
        => while kfifo_get(&mf_cpu->fifo, &entry);	// until get no work
            => soft/hard offline
-----------------------------------------------------------------------------

STEP 0: The firmware notifies the kernel of the hardware error through SDEI
(ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).

STEP 1: In the SDEI callback (or any NMI-like handler), memory from ghes_estatus_pool is
used to save the estatus, which is added to ghes_estatus_llist. The swapper running on
CPU 3 is interrupted. irq_work_queue() causes ghes_proc_in_irq() to run in IRQ
context, where each estatus in ghes_estatus_llist is processed.

STEP 2: In IRQ context, ghes_proc_in_irq() queues the memory failure work on the current CPU's
workqueue and adds a task_work to sync with that workqueue.

STEP 3: The kworker preempts the currently running thread and gets CPU 3. Then the memory
failure is processed in the kworker.

(STEP 4, for a user thread: ghes_kick_task_work() is called as task_work to ensure any
queued workqueue work has been done before returning to user space. The estatus_node is freed.)

If the task_work is not added, estatus_node->task_work.func remains NULL, and the estatus_node
is freed in STEP 2.
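
For reference, ghes_kick_task_work() is roughly the following (paraphrased from
drivers/acpi/apei/ghes.c around this kernel version; details may differ slightly):

    /* runs as task_work on return to user space: flush the queued
     * memory_failure() work on the CPU that saw the error, then free
     * the estatus node back to ghes_estatus_pool */
    static void ghes_kick_task_work(struct callback_head *head)
    {
            struct acpi_hest_generic_status *estatus;
            struct ghes_estatus_node *estatus_node;
            u32 node_len;

            estatus_node = container_of(head, struct ghes_estatus_node,
                                        task_work);
            if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
                    memory_failure_queue_kick(estatus_node->task_work_cpu);

            estatus = GHES_ESTATUS_FROM_NODE(estatus_node);
            node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus));
            gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node,
                          node_len);
    }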

Hope this helps make the problem clearer. You can also check the stacks dumped in the key
functions in the flow above.

Best Regards,
Shuai


---------------------------------------------------------------------------------------
dump_stack() is added in:
- __ghes_sdei_callback()
- ghes_proc_in_irq()
- memory_failure_queue_kick()
- memory_failure_work_func()
- memory_failure()

[  485.457761] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G            E      6.0.0-rc5+ #33
[  485.457769] Hardware name: xxxx
[  485.457771] Call trace:
[  485.457772]  dump_backtrace+0xe8/0x12c
[  485.457779]  show_stack+0x20/0x50
[  485.457781]  dump_stack_lvl+0x68/0x84
[  485.457785]  dump_stack+0x18/0x34
[  485.457787]  __ghes_sdei_callback+0x24/0x64
[  485.457789]  ghes_sdei_critical_callback+0x5c/0x94
[  485.457792]  sdei_event_handler+0x28/0x90
[  485.457795]  do_sdei_event+0x74/0x160
[  485.457797]  __sdei_handler+0x60/0xf0
[  485.457799]  __sdei_asm_handler+0xbc/0x18c
[  485.457801]  cpu_do_idle+0x14/0x80
[  485.457802]  default_idle_call+0x50/0x114
[  485.457804]  cpuidle_idle_call+0x16c/0x1c0
[  485.457806]  do_idle+0xb8/0x110
[  485.457808]  cpu_startup_entry+0x2c/0x34
[  485.457809]  secondary_start_kernel+0xf0/0x144
[  485.457812]  __secondary_switched+0xb0/0xb4

[  485.459513] EDAC MC0: 1 UE multi-symbol chipkill ECC on unknown memory (node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0 page:0x89c033 offset:0x400 grain:1 - APEI location: node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory)
[  485.459523] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[  485.470607] {2}[Hardware Error]: event severity: recoverable
[  485.476252] {2}[Hardware Error]:  precise tstamp: 2022-09-29 09:31:27
[  485.482678] {2}[Hardware Error]:  Error 0, type: recoverable
[  485.488322] {2}[Hardware Error]:   section_type: memory error
[  485.494052] {2}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[  485.503081] {2}[Hardware Error]:   physical_address: 0x000000089c033400
[  485.509680] {2}[Hardware Error]:   node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0
[  485.521487] {2}[Hardware Error]:   error_type: 5, multi-symbol chipkill ECC

[  485.528439] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G            E      6.0.0-rc5+ #33
[  485.528440] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[  485.528441] Call trace:
[  485.528441]  dump_backtrace+0xe8/0x12c
[  485.528443]  show_stack+0x20/0x50
[  485.528444]  dump_stack_lvl+0x68/0x84
[  485.528446]  dump_stack+0x18/0x34
[  485.528448]  ghes_proc_in_irq+0x220/0x250
[  485.528450]  irq_work_single+0x30/0x80
[  485.528453]  irq_work_run_list+0x4c/0x70
[  485.528455]  irq_work_run+0x28/0x44
[  485.528457]  do_handle_IPI+0x2b4/0x2f0
[  485.528459]  ipi_handler+0x24/0x34
[  485.528461]  handle_percpu_devid_irq+0x90/0x1c4
[  485.528463]  generic_handle_domain_irq+0x34/0x50
[  485.528465]  __gic_handle_irq_from_irqson.isra.0+0x130/0x230
[  485.528468]  gic_handle_irq+0x2c/0x60
[  485.528469]  call_on_irq_stack+0x2c/0x38
[  485.528471]  do_interrupt_handler+0x88/0x90
[  485.528472]  el1_interrupt+0x48/0xb0
[  485.528475]  el1h_64_irq_handler+0x18/0x24
[  485.528476]  el1h_64_irq+0x74/0x78
[  485.528477]  __do_softirq+0xa4/0x358
[  485.528478]  __irq_exit_rcu+0x110/0x13c
[  485.528479]  irq_exit_rcu+0x18/0x24
[  485.528480]  el1_interrupt+0x4c/0xb0
[  485.528482]  el1h_64_irq_handler+0x18/0x24
[  485.528483]  el1h_64_irq+0x74/0x78
[  485.528484]  arch_cpu_idle+0x18/0x40
[  485.528485]  default_idle_call+0x50/0x114
[  485.528487]  cpuidle_idle_call+0x16c/0x1c0
[  485.528488]  do_idle+0xb8/0x110
[  485.528489]  cpu_startup_entry+0x2c/0x34
[  485.528491]  secondary_start_kernel+0xf0/0x144
[  485.528493]  __secondary_switched+0xb0/0xb4

[  485.528511] CPU: 3 PID: 12696 Comm: kworker/3:0 Tainted: G            E      6.0.0-rc5+ #33
[  485.528513] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[  485.528514] Workqueue: events memory_failure_work_func
[  485.528518] Call trace:
[  485.528519]  dump_backtrace+0xe8/0x12c
[  485.528520]  show_stack+0x20/0x50
[  485.528521]  dump_stack_lvl+0x68/0x84
[  485.528523]  dump_stack+0x18/0x34
[  485.528525]  memory_failure_work_func+0xec/0x180
[  485.528527]  process_one_work+0x1f4/0x460
[  485.528528]  worker_thread+0x188/0x3e4
[  485.528530]  kthread+0xd0/0xd4
[  485.528532]  ret_from_fork+0x10/0x20

[  485.528533] CPU: 3 PID: 12696 Comm: kworker/3:0 Tainted: G            E      6.0.0-rc5+ #33
[  485.528534] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[  485.528535] Workqueue: events memory_failure_work_func
[  485.528537] Call trace:
[  485.528538]  dump_backtrace+0xe8/0x12c
[  485.528539]  show_stack+0x20/0x50
[  485.528540]  dump_stack_lvl+0x68/0x84
[  485.528541]  dump_stack+0x18/0x34
[  485.528543]  memory_failure+0x50/0x438
[  485.528544]  memory_failure_work_func+0x174/0x180
[  485.528546]  process_one_work+0x1f4/0x460
[  485.528547]  worker_thread+0x188/0x3e4
[  485.528548]  kthread+0xd0/0xd4
[  485.528550]  ret_from_fork+0x10/0x20
[  485.530622] Memory failure: 0x89c033: recovery action for dirty LRU page: Recovered
Tony Luck Sept. 29, 2022, 8:52 p.m. UTC | #7
Thanks for your patient explanations.

> STEP 2: In IRQ context, ghes_proc_in_irq() queues the memory failure work on the current CPU's
> workqueue and adds a task_work to sync with that workqueue.

Why is there a difference if the interrupted task was a user task vs. a kernel thread?

It seems arbitrary. If the error can be handled in the kernel thread case without
a task_work_add() to the current process, can't all errors be handled this way?

The current thread likely has nothing to do with the error. Just a matter of chance
on what is running when the NMI is delivered, right?

-Tony
Shuai Xue Sept. 30, 2022, 2:52 a.m. UTC | #8
On 2022/9/30 4:52 AM, Luck, Tony wrote:
> Thanks for your patient explanations.

You are welcome :)

> 
>> STEP 2: In IRQ context, ghes_proc_in_irq() queues the memory failure work on the current CPU's
>> workqueue and adds a task_work to sync with that workqueue.
> 
> Why is there a difference if the interrupted task was a user task vs. a kernel thread?
> 
> It seems arbitrary. If the error can be handled in the kernel thread case without
> a task_work_add() to the current process, can't all errors be handled this way?

I'm afraid not. The kworker in the workqueue runs asynchronously with respect to ret_to_user()
of the interrupted task. If we return to user space before the queued memory_failure() work
is processed, we will take the fault again when the error is signaled by a synchronous
external abort. This loop may cause the platform firmware to exceed some threshold and
reboot.

When a user task consumes poison data, a synchronous external abort will be signaled,
for example with "einj_mem_uc single" in ras-tools. In such a case, the handling flow
is as below:

----------------------------------STEP 0-------------------------------------------
[ghes_sdei_critical_callback: current einj_mem_uc, local cpu]
ghes_sdei_critical_callback
    => __ghes_sdei_callback
        => ghes_in_nmi_queue_one_entry: peek and read estatus
        => irq_work_queue(&ghes_proc_irq_work) // ghes_proc_in_irq - irq_work
[ghes_sdei_critical_callback: return]
-----------------------------------STEP 1------------------------------------------
[ghes_proc_in_irq: current einj_mem_uc, local cpu]
            => ghes_do_proc
                => ghes_handle_memory_failure
                    => ghes_do_memory_failure
                        => memory_failure_queue	- put work task on a specific cpu
                            => if (kfifo_put(&mf_cpu->fifo, entry))
                                  schedule_work_on(smp_processor_id(), &mf_cpu->work);
            => task_work_add(current, &estatus_node->task_work, TWA_RESUME);
[ghes_proc_in_irq: return]
-----------------------------------STEP 3------------------------------------------
// kworker preempts einj_mem_uc on local cpu due to RESCHED flag
[memory_failure_work_func: current kworker, local cpu]	
     => memory_failure_work_func(&mf_cpu->work)
        => while kfifo_get(&mf_cpu->fifo, &entry);	// until get no work
            => soft/hard offline
------------------------------------STEP 4-----------------------------------------
[ghes_kick_task_work: current einj_mem_uc, other cpu]
                => memory_failure_queue_kick
                    => cancel_work_sync //wait memory_failure_work_func finish
                    => memory_failure_work_func(&mf_cpu->work)
                        => kfifo_get(&mf_cpu->fifo, &entry); // no work here
------------------------------------STEP 5-----------------------------------------
[current einj_mem_uc returned to userspace]
                => Killed by SIGBUS

STEP 4 adds a task_work to ensure the queued memory_failure() work is processed before
returning to user space, and the interrupted user task is then killed by a SIGBUS signal.

If we delete STEP 4, the interrupted user task may return to user space before the queued
memory_failure() work completes and consume the poison data again.
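
The kick in STEP 4 is essentially the following (simplified from memory_failure_queue_kick() in
mm/memory-failure.c; the exact code may differ):

    /* synchronously flush any queued memory_failure() work for @cpu */
    void memory_failure_queue_kick(int cpu)
    {
            struct memory_failure_cpu *mf_cpu;

            mf_cpu = &per_cpu(memory_failure_cpu, cpu);
            cancel_work_sync(&mf_cpu->work);
            /* drain the fifo in the context of the task returning to user space */
            memory_failure_work_func(&mf_cpu->work);
    }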


> 
> The current thread likely has nothing to do with the error. Just a matter of chance
> on what is running when the NMI is delivered, right?

Yes, the error is actually handled in the workqueue. I think the point is that a
synchronous exception signaled by a synchronous external abort must be handled
synchronously; otherwise, it will be signaled again.

Best Regards,
Shuai
Tony Luck Sept. 30, 2022, 3:52 p.m. UTC | #9
> Yes, the error is actually handled in the workqueue. I think the point is that a
> synchronous exception signaled by a synchronous external abort must be handled
> synchronously; otherwise, it will be signaled again.

Ok. Got it now. Thanks.

For Rafael:

Reviewed-by: Tony Luck <tony.luck@intel.com>

-Tony
Rafael J. Wysocki Oct. 4, 2022, 2:07 p.m. UTC | #10
On Fri, Sep 30, 2022 at 5:52 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> > Yes, the error is actually handled in the workqueue. I think the point is that a
> > synchronous exception signaled by a synchronous external abort must be handled
> > synchronously; otherwise, it will be signaled again.
>
> Ok. Got it now. Thanks.
>
> For Rafael:
>
> Reviewed-by: Tony Luck <tony.luck@intel.com>

Applied as 6.1-rc material, thanks!
Shuai Xue Oct. 13, 2022, 7:05 a.m. UTC | #11
Hi, Tony,

On 2022/9/27 11:50 AM, Shuai Xue wrote:
> 
> 
> On 2022/9/26 11:20 PM, Luck, Tony wrote:
>>>>
>>>> -               if (task_work_pending && current->mm != &init_mm) {
>>>> +               if (task_work_pending && current->mm) {
>>>>                         estatus_node->task_work.func = ghes_kick_task_work;
>>>>                         estatus_node->task_work_cpu = smp_processor_id();
>>>>                         ret = task_work_add(current, &estatus_node->task_work,
>>
>> It seems that you are getting errors reported while running kernel threads. This fix avoids
>> pointlessly adding "task_work" that will never be processed because kernel threads never
>> return to user mode.
> 
> Yes, you are right.
> 
>>
>> But maybe something else needs to be done? The code was, and with this fix still is,
>> taking no action for the error. That doesn't seem right.
> 
> Sorry, I don't think so. As far as I know, on the Arm platform, a hardware error can signal
> exceptions including:
>
> - Synchronous External Abort (SEA), e.g. data abort or instruction abort
> - Asynchronous External Abort (signalled as SError), e.g. the L2 can generate an SError for
>   error responses from the interconnect for a Device or Non-cacheable store
> - Fault Handling and Error Recovery interrupts, e.g. DDR mainline/demand/scrubber error interrupts
>
> When the error signals an asynchronous exception (SError or interrupt), any kind of thread can
> be interrupted, including kernel threads. Because asynchronous exceptions are signaled in the
> background, the errors are detected outside of the current execution context.
>
> The GHES driver always queues work to handle memory failure of a page in memory_failure_queue().
> If a kernel thread is interrupted:
>
> - Without this fix, the added task_work will never be processed, so the work will not
>   be canceled.
> - With this fix, the task_work will not be added.
>
> In conclusion, the error will be handled in a kworker with or without this fix.
>
> The point of the fix is that:
>
> - The condition is useless because it is always true, and I do not think that is what the original
>   patch intended to do.
> - The current code leads to memory leaks: the estatus_node will not be freed when task_work is
>   added to a kernel thread.
> 
> 
>>
>> Are you injecting errors to create this scenario? 
> 
> Yes, I am injecting errors to create such a scenario. After 200 injections, ghes_estatus_pool
> will run out of memory and ghes_in_nmi_queue_one_entry() returns -ENOMEM. Finally, a lot of unhandled
> events may cause the platform firmware to exceed some threshold and reboot.
> 
>> What does your test do?
> 
> My injection includes two steps:
>
> - mmap device memory to user space in a driver
> - inject an uncorrectable error and trigger a write, like ras-tools does
>
> I have open-sourced the code and you can find it here[1]. It's forked from your repo and mainly based
> on your code :)
>
> By the way, do you have any plans to extend ras-tools to the Arm platform? And is there a mailing list
> for ras-tools? I sent a mail about adding test cases to ras-tools several weeks ago, but got no response.
> Maybe your mailbox regarded it as spam.
> 

Thank you for your review, but I still have a pending question.

Do you have any interest in extending ras-tools to the Arm platform? I forked an arm-devel branch[1]
from your repo:

- port x86 arch-specific cases to the Arm platform
- add some common cases like hugetlb and threads, and Arm-specific cases like prefetch, strb, etc., which
  are helpful for testing hardware and firmware RAS problems we encountered.

I would be pleased to contribute this code to your upstream repo, and I look forward to seeing a more
powerful, cross-platform tool to inject and debug RAS capabilities on both x86 and Arm platforms.

I really appreciate your great work, and look forward to your reply. Thank you.

Best Regards,
Shuai

> [1] https://gitee.com/anolis/ras-tools/tree/arm-devel
Tony Luck Oct. 13, 2022, 5:18 p.m. UTC | #12
> Thank you for your review, but I still have a pending question.
>
> Do you have any interest in extending ras-tools to the Arm platform? I forked an arm-devel branch[1]
> from your repo:
>
> - port x86 arch-specific cases to the Arm platform
> - add some common cases like hugetlb and threads, and Arm-specific cases like prefetch, strb, etc., which
>   are helpful for testing hardware and firmware RAS problems we encountered.
>
> I would be pleased to contribute this code to your upstream repo, and I look forward to seeing a more
> powerful, cross-platform tool to inject and debug RAS capabilities on both x86 and Arm platforms.
>
> I really appreciate your great work, and look forward to your reply. Thank you.
>
> Best Regards,
> Shuai
>
> > [1] https://gitee.com/anolis/ras-tools/tree/arm-devel

Shuai,

Sorry I didn't follow up on this part.  Yes, I'm happy to take your (excellent) changes.

I did a "git fetch" from your repo and the a "git merge" of all except the README and LICENSE
commits at the tip of your repo. Those I manually cherry-picked (since the last section of your
README said "clone of Tony's repo" ... which makes no sense in my repo).

Resulting tree has been pushed out to git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

Please check that I didn't break anything.

-Tony
Shuai Xue Oct. 14, 2022, 1:23 p.m. UTC | #13
On 2022/10/14 1:18 AM, Luck, Tony wrote:
>> Thank you for your review, but I still have a pending question.
>>
>> Do you have any interest in extending ras-tools to the Arm platform? I forked an arm-devel branch[1]
>> from your repo:
>>
>> - port x86 arch-specific cases to the Arm platform
>> - add some common cases like hugetlb and threads, and Arm-specific cases like prefetch, strb, etc., which
>>   are helpful for testing hardware and firmware RAS problems we encountered.
>>
>> I would be pleased to contribute this code to your upstream repo, and I look forward to seeing a more
>> powerful, cross-platform tool to inject and debug RAS capabilities on both x86 and Arm platforms.
>>
>> I really appreciate your great work, and look forward to your reply. Thank you.
>>
>> Best Regards,
>> Shuai
>>
>>> [1] https://gitee.com/anolis/ras-tools/tree/arm-devel
> 
> Shuai,
> 
> Sorry I didn't follow up on this part.  Yes, I'm happy to take your (excellent) changes.
> 
> I did a "git fetch" from your repo and the a "git merge" of all except the README and LICENSE
> commits at the tip of your repo. Those I manually cherry-picked (since the last section of your
> README said "clone of Tony's repo" ... which makes no sense in my repo).
> 
> Resulting tree has been pushed out to git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

Thank you. I am glad to see it actually happen :)

> 
> Please check that I didn't break anything.
> 
> -Tony

I rechecked it and everything looks good except hornet.

hornet uses PTRACE_GETREGS to read the tracee's general-purpose registers, but that does not work
on Arm. I will send a new patch to extend it to the Arm platform.
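
For what it's worth, the portable route on arm64 is usually PTRACE_GETREGSET with NT_PRSTATUS.
A rough, untested sketch of what such a patch might use:

    #include <elf.h>          /* NT_PRSTATUS */
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/uio.h>      /* struct iovec */
    #include <asm/ptrace.h>   /* struct user_pt_regs on arm64 */

    /* read the tracee's general-purpose registers on arm64 */
    static long read_gp_regs(pid_t pid, struct user_pt_regs *regs)
    {
            struct iovec iov = {
                    .iov_base = regs,
                    .iov_len  = sizeof(*regs),
            };

            /* PTRACE_GETREGSET is available where PTRACE_GETREGS is not */
            return ptrace(PTRACE_GETREGSET, pid, (void *)NT_PRSTATUS, &iov);
    }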

Cheers,
Shuai

Patch

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index d91ad378c00d..80ad530583c9 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -985,7 +985,7 @@  static void ghes_proc_in_irq(struct irq_work *irq_work)
 				ghes_estatus_cache_add(generic, estatus);
 		}
 
-		if (task_work_pending && current->mm != &init_mm) {
+		if (task_work_pending && current->mm) {
 			estatus_node->task_work.func = ghes_kick_task_work;
 			estatus_node->task_work_cpu = smp_processor_id();
 			ret = task_work_add(current, &estatus_node->task_work,