diff mbox

[v6,6/7] KVM: arm64: allow get exception information from userspace

Message ID 1503916701-13516-7-git-send-email-gengdongjiu@huawei.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dongjiu Geng Aug. 28, 2017, 10:38 a.m. UTC
when userspace gets SIGBUS signal, it does not know whether
this is a synchronous external abort or SError, so needs
to get the exception syndrome. This patch allows userspace
can get this values. For syndrome, only give userspace
syndrome EC and ISS.

Now we move the synchronous external abort injection logic to
userspace, when userspace injects the SEA exception to guest
OS, it needs to specify the far_el1 value, so this patch gives
the exception virtual address to userspace.

change since v5:
1. modify some patch description

Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
Signed-off-by: Quanming Wu <wuquanming@huawei.com>
---
 arch/arm64/include/uapi/asm/kvm.h |  5 +++++
 arch/arm64/kvm/guest.c            | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

Comments

James Morse Sept. 7, 2017, 4:30 p.m. UTC | #1
Hi Dongjiu Geng,

On 28/08/17 11:38, Dongjiu Geng wrote:
> when userspace gets SIGBUS signal, it does not know whether
> this is a synchronous external abort or SError,

Why would Qemu/kvmtool need to know if the original notification (if there was
one) was synchronous or asynchronous? This is between firmware and the kernel.


I think I can see why you need this: to choose whether to emulate SEA or SEI,
but what if the guest wasn't running? Or the guest was running, but it wasn't
guest-memory that is affected.

What happens if the dram-scrub hardware spots an error in guest memory, but the
guest wasn't running? KVM won't have a relevant ESR value to give you.

What happens if we start swapping a page of guest memory to disk, and discover
the memory is corrupt. This is synchronous, but it wasn't the guest, and KVM
still can't give you an ESR.

What about CPER records discovered through the polled interface? What happens if
I write a PFN into the corrupt-pfn sysfs interface?


I think what you need is some way of knowing if the BUS_MCEERR_A* was directly
caused by a user-space (or guest) access, and if so was it a data or instruction
fetch. These can become SEA notifications.

KVM's user-space shouldn't be a special-case where the kernel behaves
differently: if we tinker with this it needs to make sense for all user space
processes and mean something on all architectures.

I think this information could be useful to other users of these signals, e.g. a
JVM could silently regenerate/reload code/data for a non-direct-access fault
instead of exit-ing (or throwing an exception) for a direct access.

For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by an
access or not. When the mm code gets -EHWPOISON when trying to resolve a
user-space fault we know it was due to a direct-access. (I don't know if/how x86
can know if it was code or data). Faulting guest accesses through KVM are just a
special version of this where KVM fixes-up stage2.

... but for any of this to work we need the address of the corrupt memory.
(-> cover letter)


Thanks,

James
Dongjiu Geng Sept. 13, 2017, 7:32 a.m. UTC | #2
Hi James,


On 2017/9/8 0:30, James Morse wrote:
> Hi Dongjiu Geng,
> 
> On 28/08/17 11:38, Dongjiu Geng wrote:
>> when userspace gets SIGBUS signal, it does not know whether
>> this is a synchronous external abort or SError,
> 
> Why would Qemu/kvmtool need to know if the original notification (if there was
> one) was synchronous or asynchronous? This is between firmware and the kernel.
there are two reasons:

1. Let us firstly discuss the SEA and SEI, there are different workflow for the two different Errors.
2. when record the CPER in the user space, it needs to know the error type, because SEA and SEI are different Error source,
   so they have different offset in the APEI table, that is to say they will be recorded to different place of the APEI table.


         etc/acpi/tables                               etc/hardware_errors
        ====================                    ==========================================
    + +--------------------------+            +------------------+
    | | HEST                     |            |    address       |              +--------------+
    | +--------------------------+            |    registers     |              | Error Status |
    | | GHES0                    |            | +----------------+              | Data Block 0 |
    | +--------------------------+ +--------->| |status_address0 |------------->| +------------+
    | | .................        | |          | +----------------+              | |  CPER      |
    | | error_status_address-----+-+ +------->| |status_address1 |----------+   | |  CPER      |
    | | .................        |   |        | +----------------+          |   | |  ....      |
    | | read_ack_register--------+-+ |        |  .............   |          |   | |  CPER      |
    | | read_ack_preserve        | | |        +------------------+          |   | +-+------------+
    | | read_ack_write           | | | +----->| |status_address10|--------+ |   | Error Status |
    + +--------------------------+ | | |      | +----------------+        | |   | Data Block 1 |
    | | GHES1                    | +-+-+----->| | ack_value0     |        | +-->| +------------+
    + +--------------------------+   | |      | +----------------+        |     | |  CPER      |
    | | .................        |   | | +--->| | ack_value1     |        |     | |  CPER      |
    | | error_status_address-----+---+ | |    | +----------------+        |     | |  ....      |
    | | .................        |     | |    | |  ............. |        |     | |  CPER      |
    | | read_ack_register--------+-----+-+    | +----------------+        |     +-+------------+
    | | read_ack_preserve        |     |   +->| | ack_value10    |        |     | |..........  |
    | | read_ack_write           |     |   |  | +----------------+        |     | +------------+
    + +--------------------------|     |   |                              |     | Error Status |
    | | ...............          |     |   |                              |     | Data Block 10|
    + +--------------------------+     |   |                              +---->| +------------+
    | | GHES10                   |     |   |                                    | |  CPER      |
    + +--------------------------+     |   |                                    | |  CPER      |
    | | .................        |     |   |                                    | |  ....      |
    | | error_status_address-----+-----+   |                                    | |  CPER      |
    | | .................        |         |                                    +-+------------+
    | | read_ack_register--------+---------+
    | | read_ack_preserve        |
    | | read_ack_write           |
    + +--------------------------+

> 
> 
> I think I can see why you need this: to choose whether to emulate SEA or SEI,
emulating SEA or SEI is one reason, another reason is that the CPER will be recorded to different place of APEI.


> but what if the guest wasn't running? Or the guest was running, but it wasn't
> guest-memory that is affected.
If the guest was not running, host firmware will directly notify EL1 host kernel to handle the error, not notify hypervisor
only if the guest was running host firmware can notify the Error to hypervisor.

If the user space is Qemu, and the error is from Qemu, and guest-memory is not involve.
I will not handle it, please see the code for arm64.

void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
{
    ram_addr_t ram_addr;
    hwaddr paddr;

    ARMCPU *cpu = ARM_CPU(c);
    CPUARMState *env = &cpu->env;
    assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
    if (addr) {
        ram_addr = qemu_ram_addr_from_host(addr);
        if (ram_addr != RAM_ADDR_INVALID &&
            kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
            kvm_cpu_synchronize_state(c);
            kvm_hwpoison_page_add(ram_addr);
            if (is_abort_sea(env->exception.syndrome)) {
                ghes_update_guest(ACPI_HEST_NOTIFY_SEA, paddr);
                kvm_inject_arm_sea(c);
            } else if (is_abort_sei(env->exception.syndrome)) {
                ghes_update_guest(ACPI_HEST_NOTIFY_SEI, paddr);
                kvm_inject_arm_sei(c);
            }
            return;
        }
        fprintf(stderr, "Hardware memory error for memory used by "
                "QEMU itself instead of guest system!\n");
    }

    if (code == BUS_MCEERR_AR) {
        fprintf(stderr, "Hardware memory error!\n");
        exit(1);
    }
}


For the x86, it also does not handle it, it only print "Hardware memory error for memory used by QEMU itself instead of guest system!"

void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
{
    X86CPU *cpu = X86_CPU(c);
    CPUX86State *env = &cpu->env;
    ram_addr_t ram_addr;
    hwaddr paddr;

    /* If we get an action required MCE, it has been injected by KVM
     * while the VM was running.  An action optional MCE instead should
     * be coming from the main thread, which qemu_init_sigbus identifies
     * as the "early kill" thread.
     */
    assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);

    if ((env->mcg_cap & MCG_SER_P) && addr) {
        ram_addr = qemu_ram_addr_from_host(addr);
        if (ram_addr != RAM_ADDR_INVALID &&
            kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
            kvm_hwpoison_page_add(ram_addr);
            kvm_mce_inject(cpu, paddr, code);
            return;
        }

        fprintf(stderr, "Hardware memory error for memory used by "
                "QEMU itself instead of guest system!\n");
    }

    if (code == BUS_MCEERR_AR) {
        hardware_memory_error();
    }

    /* Hope we are lucky for AO MCE */
}


> 
> What happens if the dram-scrub hardware spots an error in guest memory, but the
> guest wasn't running? KVM won't have a relevant ESR value to give you.
if the dram-scrub hardware spots an error in guest memory, it will generate
IRQ in DDR controller, not SEA or SEI exception. I still do not consider the GSIV.
For GSIV, may be we can only handle it in the host OS.


> 
> What happens if we start swapping a page of guest memory to disk, and discover
> the memory is corrupt. This is synchronous, but it wasn't the guest, and KVM
> still can't give you an ESR.
I think this Error is reported by IRQ(GSIV), GSIV is not SEA/SEI, we should not give the ESR to them.


> 
> What about CPER records discovered through the polled interface? What happens if
> I write a PFN into the corrupt-pfn sysfs interface?
I do not understand this question.
I think in the process it should report SEA notification when CPU consume the error page.


> 
> 
> I think what you need is some way of knowing if the BUS_MCEERR_A* was directly
> caused by a user-space (or guest) access, and if so was it a data or instruction
when user space received the signal, it will judge whether the memory address is user-space (or guest) address


> fetch. These can become SEA notifications.
In fact, it can be SEI, not always SEA, why it will always SEA notifications?
If the memory properties of data is device type, it may become SEI notification.


> 
> KVM's user-space shouldn't be a special-case where the kernel behaves
> differently: if we tinker with this it needs to make sense for all user space
> processes and mean something on all architectures.
> 
> I think this information could be useful to other users of these signals, e.g. a
> JVM could silently regenerate/reload code/data for a non-direct-access fault
> instead of exit-ing (or throwing an exception) for a direct access.
> 
> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by an
> access or not. When the mm code gets -EHWPOISON when trying to resolve a

Because of that, so I allow  userspace getting exception information

> user-space fault we know it was due to a direct-access. (I don't know if/how x86
> can know if it was code or data). Faulting guest accesses through KVM are just a
> special version of this where KVM fixes-up stage2.
> 
> ... but for any of this to work we need the address of the corrupt memory.
> (-> cover letter)
> 
> 
> Thanks,
> 
> James
> 
> .
>
James Morse Sept. 14, 2017, 1 p.m. UTC | #3
Hi gengdongjiu,

(re-ordered hunks)

On 13/09/17 08:32, gengdongjiu wrote:
> On 2017/9/8 0:30, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by
>> an access or not.

Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered via
some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is
x86's kernel-first handling, which nicely matches this 'direct access' problem.
BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). Powerpc
also triggers these directly, both from what look to be synchronous paths, so I
think its fair to equate BUS_MCEERR_AR to a synchronous access and BUS_MCEERR_AO
to something_else.

I don't think we need anything else.


>> When the mm code gets -EHWPOISON when trying to resolve a
>
> Because of that, so I allow  userspace getting exception information

... and there are cases where you can't get the exception information, and other
cases where it wasn't an exception at all.

[...]


>> What happens if the dram-scrub hardware spots an error in guest memory, but
>> the guest wasn't running? KVM won't have a relevant ESR value to give you.

> if the dram-scrub hardware spots an error in guest memory, it will generate
> IRQ in DDR controller, not SEA or SEI exception. I still do not consider the
> GSIV. For GSIV, may be we can only handle it in the host OS.

Great example: this IRQ pulls us out of a guest, we tromp through APEI and then
memory_failure(), the memory happened to belong to the same guest
(coincidence!), we send it some signal and now its user-space's problem.

Your KVM_REG_ARM64_FAULT mechanism is going to return stale data, even though
the notification interrupted the guest, and it was guest memory that was
affected. KVM doesn't have a relevant ESR.


I'm strongly against exposing 'which notification type' this error originally
came from because:
* it doesn't matter once we've got the CPER records,
* there isn't always an answer (there are/will-be other ways of tripping
  memory_failure())
* it creates ABI between firwmare, host userspace and guest userspace.
  Firmware's choice of notification type shouldn't affect anything other than
  the host kernel.


On 13/09/17 08:32, gengdongjiu wrote:
> On 2017/9/8 0:30, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> when userspace gets SIGBUS signal, it does not know whether
>>> this is a synchronous external abort or SError,
>>
>> Why would Qemu/kvmtool need to know if the original notification (if there was
>> one) was synchronous or asynchronous? This is between firmware and the kernel.

> there are two reasons:
> 
> 1. Let us firstly discuss the SEA and SEI, there are different workflow for the two different Errors.
> 2. when record the CPER in the user space, it needs to know the error type, because SEA and SEI are different Error source,
>    so they have different offset in the APEI table, that is to say they will be recorded to different place of the APEI table.

user-space can choose whether to use SEA or SEI, it doesn't have to choose the
same notification type that firmware used, which in turn doesn't have to be the
same as that used by the CPU to notify firmware.

The choice only matters because these notifications hang on an existing pieces
of the Arm-architecture, so the notification can only add to the architecturally
defined meaning. (i.e. You can only send an SEA for something that can already
be described as a synchronous external abort).

Once we get to user-space, for memory_failure() notifications, (which so far is
all we are talking about here), the only thing that could matter is whether the
guest hit a PG_hwpoison page as a stage2 fault. These can be described as
Synchronous-External-Abort.

The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU
because it can't always make an error synchronous. For memory_failure()
notifications to a KVM guest we really can do this, and we already have this
behaviour for free. An example:

A guest touches some hardware:poisoned memory, for whatever reason the CPU can't
put the world back together to make this a synchronous exception, so it reports
it to firmware as an SError-interrupt.
Linux gets an APEI notification and memory_failure() causes the affected page to
be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.

Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO->
action optional, probably asynchronous.

But in our example it wasn't really asynchronous, that was just a property of
the original CPU->firmware notification. What happens? The guest vcpu is re-run,
it re-runs the same instructions (this was a contained error so KVM's ELR points
at/before the instruction that steps in the problem). This time KVM takes a
stage2 fault, which the mm code will refuse to fixup because the relevant page
was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with
SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA.




>          etc/acpi/tables                               etc/hardware_errors
>         ====================                    ==========================================
>     + +--------------------------+            +------------------+
>     | | HEST                     |            |    address       |              +--------------+
>     | +--------------------------+            |    registers     |              | Error Status |
>     | | GHES0                    |            | +----------------+              | Data Block 0 |
>     | +--------------------------+ +--------->| |status_address0 |------------->| +------------+
>     | | .................        | |          | +----------------+              | |  CPER      |
>     | | error_status_address-----+-+ +------->| |status_address1 |----------+   | |  CPER      |
>     | | .................        |   |        | +----------------+          |   | |  ....      |
>     | | read_ack_register--------+-+ |        |  .............   |          |   | |  CPER      |
>     | | read_ack_preserve        | | |        +------------------+          |   | +-+------------+
>     | | read_ack_write           | | | +----->| |status_address10|--------+ |   | Error Status |
>     + +--------------------------+ | | |      | +----------------+        | |   | Data Block 1 |
>     | | GHES1                    | +-+-+----->| | ack_value0     |        | +-->| +------------+
>     + +--------------------------+   | |      | +----------------+        |     | |  CPER      |
>     | | .................        |   | | +--->| | ack_value1     |        |     | |  CPER      |
>     | | error_status_address-----+---+ | |    | +----------------+        |     | |  ....      |
>     | | .................        |     | |    | |  ............. |        |     | |  CPER      |
>     | | read_ack_register--------+-----+-+    | +----------------+        |     +-+------------+
>     | | read_ack_preserve        |     |   +->| | ack_value10    |        |     | |..........  |
>     | | read_ack_write           |     |   |  | +----------------+        |     | +------------+
>     + +--------------------------|     |   |                              |     | Error Status |
>     | | ...............          |     |   |                              |     | Data Block 10|
>     + +--------------------------+     |   |                              +---->| +------------+
>     | | GHES10                   |     |   |                                    | |  CPER      |
>     + +--------------------------+     |   |                                    | |  CPER      |
>     | | .................        |     |   |                                    | |  ....      |
>     | | error_status_address-----+-----+   |                                    | |  CPER      |
>     | | .................        |         |                                    +-+------------+
>     | | read_ack_register--------+---------+
>     | | read_ack_preserve        |
>     | | read_ack_write           |
>     + +--------------------------+
> 

(nice ascii art!)

>> I think I can see why you need this: to choose whether to emulate SEA or SEI,

> emulating SEA or SEI is one reason, another reason is that the CPER will be recorded to different place of APEI.

(This doesn't matter: Generate the CPER records after you've chosen the
notification and this isn't a problem. Or map your 'Error Status Data Blocks'
to status_address* depending on usage not in a fixed 1:1 way)


>> I think what you need is some way of knowing if the BUS_MCEERR_A* was directly
>> caused by a user-space (or guest) access, and if so was it a data or instruction

> when user space received the signal, it will judge whether the memory address is user-space (or guest) address

>> fetch. These can become SEA notifications.

> In fact, it can be SEI, not always SEA, why it will always SEA notifications?
> If the memory properties of data is device type, it may become SEI notification.

Let's take a step back: in what scenario should we use an emulated-SEA instead
of an emulated-SEI? (forget what the CPU and firmware did, this is up to Qemu to
decide).

It can use SEA if this is a valid Synchronous-external-abort. Stage 2 faults are
synchronous exceptions, if you hit a PG_hwpoision page on this path you can
report this back to the guest as a Synchronous-external-abort/SEA.
How do you know? You get SIGBUS_MCEERR_AR from KVM.


Thanks,

James
Dongjiu Geng Sept. 18, 2017, 1:36 p.m. UTC | #4
James,
   Thanks for your comments, hope we can make the solution better.

On 2017/9/14 21:00, James Morse wrote:
> Hi gengdongjiu,
> 
> (re-ordered hunks)
> 
> On 13/09/17 08:32, gengdongjiu wrote:
>> On 2017/9/8 0:30, James Morse wrote:
>>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by
>>> an access or not.
> 
> Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered via
> some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is
> x86's kernel-first handling, which nicely matches this 'direct access' problem.
> BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). Powerpc
> also triggers these directly, both from what look to be synchronous paths, so I
> think its fair to equate BUS_MCEERR_AR to a synchronous access and BUS_MCEERR_AO
> to something_else.

James, thanks for your explanation.
can I understand that your meaning that "BUS_MCEERR_AR" stands for synchronous access and BUS_MCEERR_AO stands for asynchronous access?
Then for "BUS_MCEERR_AO", how to distinguish it is asynchronous data access(SError) and PCIE AER error?
In the user space, we can check the si_code, if it is "BUS_MCEERR_AR", we use SEA notification type for the guest;
if it is "BUS_MCEERR_AO", we use SEI notification type for the guest.
Because there are only two values for si_code("BUS_MCEERR_AR" and BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) notification type?


> 
> I don't think we need anything else.
> 
> 
>>> When the mm code gets -EHWPOISON when trying to resolve a
>>
>> Because of that, so I allow  userspace getting exception information
> 
> ... and there are cases where you can't get the exception information, and other
> cases where it wasn't an exception at all.
> 
> [...]
> 
> 
>>> What happens if the dram-scrub hardware spots an error in guest memory, but
>>> the guest wasn't running? KVM won't have a relevant ESR value to give you.
> 
>> if the dram-scrub hardware spots an error in guest memory, it will generate
>> IRQ in DDR controller, not SEA or SEI exception. I still do not consider the
>> GSIV. For GSIV, may be we can only handle it in the host OS.
> 
> Great example: this IRQ pulls us out of a guest, we tromp through APEI and then
> memory_failure(), the memory happened to belong to the same guest
> (coincidence!), we send it some signal and now its user-space's problem.
> 
> Your KVM_REG_ARM64_FAULT mechanism is going to return stale data, even though
> the notification interrupted the guest, and it was guest memory that was
> affected. KVM doesn't have a relevant ESR.
> 
> 
> I'm strongly against exposing 'which notification type' this error originally
> came from because:
> * it doesn't matter once we've got the CPER records,
> * there isn't always an answer (there are/will-be other ways of tripping
>   memory_failure())
> * it creates ABI between firwmare, host userspace and guest userspace.
>   Firmware's choice of notification type shouldn't affect anything other than
>   the host kernel.
> 
> 
> On 13/09/17 08:32, gengdongjiu wrote:
>> On 2017/9/8 0:30, James Morse wrote:
>>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>>> when userspace gets SIGBUS signal, it does not know whether
>>>> this is a synchronous external abort or SError,
>>>
>>> Why would Qemu/kvmtool need to know if the original notification (if there was
>>> one) was synchronous or asynchronous? This is between firmware and the kernel.
> 
>> there are two reasons:
>>
>> 1. Let us firstly discuss the SEA and SEI, there are different workflow for the two different Errors.
>> 2. when record the CPER in the user space, it needs to know the error type, because SEA and SEI are different Error source,
>>    so they have different offset in the APEI table, that is to say they will be recorded to different place of the APEI table.
> 
> user-space can choose whether to use SEA or SEI, it doesn't have to choose the
> same notification type that firmware used, which in turn doesn't have to be the
> same as that used by the CPU to notify firmware.
> 
> The choice only matters because these notifications hang on an existing pieces
> of the Arm-architecture, so the notification can only add to the architecturally
> defined meaning. (i.e. You can only send an SEA for something that can already
> be described as a synchronous external abort).
> 
> Once we get to user-space, for memory_failure() notifications, (which so far is
> all we are talking about here), the only thing that could matter is whether the
> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
> Synchronous-External-Abort.
> 
> The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU
> because it can't always make an error synchronous. For memory_failure()
> notifications to a KVM guest we really can do this, and we already have this
> behaviour for free. An example:
> 
> A guest touches some hardware:poisoned memory, for whatever reason the CPU can't
> put the world back together to make this a synchronous exception, so it reports
> it to firmware as an SError-interrupt.

> Linux gets an APEI notification and memory_failure() causes the affected page to
> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
> 
> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO->
> action optional, probably asynchronous.
If so, in this case, Qemu/kvmtool only got a little information(receive a SIGBUS), for this SIGBUS,
it only include the SIGBUS_MCEERR_AO, error address. not include other information,
only according the SIGBUS_MCEERR_AO and error address, user space does not know whether to use IRQ or POLLed notification.
for example, SIGBUS_MCEERR_AO means asynchronous access, user space can use SEI, IRQ or POLLed notification.
so user space will be confused to use which method.

I think if we use such solution, user space only judging SIGBUS_MCEERR_A* is not enough.
how we provide other extra information to let it choose the proper notification?

> 
> But in our example it wasn't really asynchronous, that was just a property of
> the original CPU->firmware notification. What happens? The guest vcpu is re-run,
> it re-runs the same instructions (this was a contained error so KVM's ELR points
> at/before the instruction that steps in the problem). This time KVM takes a
> stage2 fault, which the mm code will refuse to fixup because the relevant page
> was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with
> SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA.
> 
> 
> 
> 
>>          etc/acpi/tables                               etc/hardware_errors
>>         ====================                    ==========================================
>>     + +--------------------------+            +------------------+
>>     | | HEST                     |            |    address       |              +--------------+
>>     | +--------------------------+            |    registers     |              | Error Status |
>>     | | GHES0                    |            | +----------------+              | Data Block 0 |
>>     | +--------------------------+ +--------->| |status_address0 |------------->| +------------+
>>     | | .................        | |          | +----------------+              | |  CPER      |
>>     | | error_status_address-----+-+ +------->| |status_address1 |----------+   | |  CPER      |
>>     | | .................        |   |        | +----------------+          |   | |  ....      |
>>     | | read_ack_register--------+-+ |        |  .............   |          |   | |  CPER      |
>>     | | read_ack_preserve        | | |        +------------------+          |   | +-+------------+
>>     | | read_ack_write           | | | +----->| |status_address10|--------+ |   | Error Status |
>>     + +--------------------------+ | | |      | +----------------+        | |   | Data Block 1 |
>>     | | GHES1                    | +-+-+----->| | ack_value0     |        | +-->| +------------+
>>     + +--------------------------+   | |      | +----------------+        |     | |  CPER      |
>>     | | .................        |   | | +--->| | ack_value1     |        |     | |  CPER      |
>>     | | error_status_address-----+---+ | |    | +----------------+        |     | |  ....      |
>>     | | .................        |     | |    | |  ............. |        |     | |  CPER      |
>>     | | read_ack_register--------+-----+-+    | +----------------+        |     +-+------------+
>>     | | read_ack_preserve        |     |   +->| | ack_value10    |        |     | |..........  |
>>     | | read_ack_write           |     |   |  | +----------------+        |     | +------------+
>>     + +--------------------------|     |   |                              |     | Error Status |
>>     | | ...............          |     |   |                              |     | Data Block 10|
>>     + +--------------------------+     |   |                              +---->| +------------+
>>     | | GHES10                   |     |   |                                    | |  CPER      |
>>     + +--------------------------+     |   |                                    | |  CPER      |
>>     | | .................        |     |   |                                    | |  ....      |
>>     | | error_status_address-----+-----+   |                                    | |  CPER      |
>>     | | .................        |         |                                    +-+------------+
>>     | | read_ack_register--------+---------+
>>     | | read_ack_preserve        |
>>     | | read_ack_write           |
>>     + +--------------------------+
>>
> 
> (nice ascii art!)
> 
>>> I think I can see why you need this: to choose whether to emulate SEA or SEI,
> 
>> emulating SEA or SEI is one reason, another reason is that the CPER will be recorded to different place of APEI.
> 
> (This doesn't matter: Generate the CPER records after you've chosen the
> notification and this isn't a problem. Or map your 'Error Status Data Blocks'
> to status_address* depending on usage not in a fixed 1:1 way)
> 
> 
>>> I think what you need is some way of knowing if the BUS_MCEERR_A* was directly
>>> caused by a user-space (or guest) access, and if so was it a data or instruction
> 
>> when user space received the signal, it will judge whether the memory address is user-space (or guest) address
> 
>>> fetch. These can become SEA notifications.
> 
>> In fact, it can be SEI, not always SEA, why it will always SEA notifications?
>> If the memory properties of data is device type, it may become SEI notification.
> 
> Let's take a step back: in what scenario should we use an emulated-SEA instead
> of an emulated-SEI? (forget what the CPU and firmware did, this is up to Qemu to
> decide).
> 
> It can use SEA if this is a valid Synchronous-external-abort. Stage 2 faults are
> synchronous exceptions, if you hit a PG_hwpoision page on this path you can
> report this back to the guest as a Synchronous-external-abort/SEA.
> How do you know? You get SIGBUS_MCEERR_AR from KVM.
I understand your idea, I agree we can use SEA notification for guest if it is SIGBUS_MCEERR_AR.
I am worried about the SIGBUS_MCEERR_AO.
if it is SIGBUS_MCEERR_AO, we can choose SEI, IRQ or POLLed notification. so according to what principles to choose that?

In my platform, there is another issue.
for the stage2 fault, we get the IPA from the HPFAR_EL2 register,
but for  huawei's CPU, if it is data Error(DFSC[5:0] is 0b010000), not translation error(DFSC[5:0] is 0b0101xx),
the HPFAR_EL2 is NULL, so the IPA is not recorded, in our current KVM code, we get the IPA from the HPFAR_EL2, so
we can not get the right IPA value, because its value is zero.I do not know whether you have same issue.

Although hpfar_el2 does not record IPA, but host firmware can still record the PA
If call memory_failure(), memory_failure can translate the PA to host VA, then deliver host VA to Qemu. Qemu can translate the host VA to IPA.
so we rely on memory_failure() to get the IPA.



> 
> 
> Thanks,
> 
> James
> 
> .
>
Dongjiu Geng Sept. 21, 2017, 7:55 a.m. UTC | #5
Hi James

On 2017/9/14 21:00, James Morse wrote:
> Hi gengdongjiu,

> user-space can choose whether to use SEA or SEI, it doesn't have to choose the
> same notification type that firmware used, which in turn doesn't have to be the
> same as that used by the CPU to notify firmware.
> 
> The choice only matters because these notifications hang on an existing pieces
> of the Arm-architecture, so the notification can only add to the architecturally
> defined meaning. (i.e. You can only send an SEA for something that can already
> be described as a synchronous external abort).
> 
> Once we get to user-space, for memory_failure() notifications, (which so far is
> all we are talking about here), the only thing that could matter is whether the
> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
> Synchronous-External-Abort.
> 
> The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU
> because it can't always make an error synchronous. For memory_failure()
> notifications to a KVM guest we really can do this, and we already have this
> behaviour for free. An example:
> 
> A guest touches some hardware:poisoned memory, for whatever reason the CPU can't
> put the world back together to make this a synchronous exception, so it reports
> it to firmware as an SError-interrupt.
> Linux gets an APEI notification and memory_failure() causes the affected page to
> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
> 
> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO->
> action optional, probably asynchronous.
> 
> But in our example it wasn't really asynchronous, that was just a property of
> the original CPU->firmware notification. What happens? The guest vcpu is re-run,
> it re-runs the same instructions (this was a contained error so KVM's ELR points
> at/before the instruction that steps in the problem). This time KVM takes a
> stage2 fault, which the mm code will refuse to fixup because the relevant page
> was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with
> SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA.

CC Achin

I have some personal opinion, if you think it is not right, hope you can point out.

Synchronous External Abort and SError Interrupt are hardware exception(hardware concept), which is independent of software notification,
in armv8 without RAS, the two concepts already exist. In the APEI spec, in order to better describe the two exceptions, so use SEA and SEI notification to stand for them.

SEA notification stands for Synchronous External Abort, so may be it is not only a notification, it also stands for a hardware error type.
SEI notification stands for SError Interrupt, so may be it is not only a notification, it also stands for a hardware error type.

In the OS, it has different handling flow to the two exception(two notification):
when the guest OS running, if the hardware generates a Synchronous External Abort, we told the guest OS this error is SError Interrupt instead of Synchronous External Abort.
guest OS uses SEI notification handling flow to deal with it, I am not sure whether it will have problem, because the true hardware exception is Synchronous External Abort,
but software treats it as SError interrupt to handle.

In the mainline code, it does not have SEI notification support, the reason I think it is because of the error address record by firmware is not accurate(SError Interrupt is asynchronous exception).
so if treat a hardware Synchronous External Abort as SError interrupt(SEI). The default OS behavior for SEI is PANIC, that is to say, when hardware triggers a Synchronous External Abort(SEA), if guest
treat it as SError interrupt(SEI), the OS will be panic. in fact, it can be recoverable instead of Panic.

I ever added a patch to support the SEI notification, but not sure whether it is can be accepted by open source, until now, not receive response.
James Morse Sept. 22, 2017, 4:39 p.m. UTC | #6
Hi gengdongjiu,

On 18/09/17 14:36, gengdongjiu wrote:
> On 2017/9/14 21:00, James Morse wrote:
>> On 13/09/17 08:32, gengdongjiu wrote:
>>> On 2017/9/8 0:30, James Morse wrote:
>>>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by
>>>> an access or not.
>>
>> Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered via
>> some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is
>> x86's kernel-first handling, which nicely matches this 'direct access' problem.
>> BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). Powerpc
>> also triggers these directly, both from what look to be synchronous paths, so I
>> think its fair to equate BUS_MCEERR_AR to a synchronous access and BUS_MCEERR_AO
>> to something_else.
> 
> James, thanks for your explanation.
> can I understand that your meaning that "BUS_MCEERR_AR" stands for synchronous access and BUS_MCEERR_AO stands for asynchronous access?

Not 'stands for', as the AR is Action-Required and AO Action-Optional. My point
was I can't find a case where Action-Required is used for an error that isn't
synchronous.

We should run this past the people who maintain the existing BUS_MCEERR_AR
users, in case its just a severity to them.


> Then for "BUS_MCEERR_AO", how to distinguish it is asynchronous data access(SError) and PCIE AER error?

How would userspace get one of these memory errors for a PCIe error?


> In the user space, we can check the si_code, if it is "BUS_MCEERR_AR", we use SEA notification type for the guest;
> if it is "BUS_MCEERR_AO", we use SEI notification type for the guest.
> Because there are only two values for si_code("BUS_MCEERR_AR" and BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) notification type?

This is for Qemu/kvmtool to decide, it depends on what sort of machine they are
emulating.

For example, the physical machine's memory-controller may notify the CPU about
memory errors by triggering SError trapped to EL3, or with a dedicated FIQ, also
routed to EL3. By the time this gets to the host kernel the distinction doesn't
matter. The host has handled the error.

For a guest, your memory-controller is effectively the host kernel. It will give
you an BUS_MCEERR_AO signal for any guest memory that is affected, and a
BUS_MCEERR_AR if the guest directly accesses a page of affected memory.

What Qemu/kvmtool do with this is up to them. If they're emulating a machine
with no RAS features, printing an error and exit.

Otherwise BUS_MCEERR_AR could be notified as one of the flavours of IRQ, unless
the affected vcpu has interrupts masked, in which case an SEA notification gives
you some NMI-like behaviour.

For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My choice would
be IRQ, as you can't know if the guest supports SEI and it would be a shame to
kill it with an SError if the affected memory was free. SEA for synchronous
errors is still a good choice even if the guest doesn't support it as that
memory is still gone so its still a valid guest:Synchronous-external-abort.


[...]

>>> 1. Let us firstly discuss the SEA and SEI, there are different workflow for the two different Errors.

>> user-space can choose whether to use SEA or SEI, it doesn't have to choose the
>> same notification type that firmware used, which in turn doesn't have to be the
>> same as that used by the CPU to notify firmware.
>>
>> The choice only matters because these notifications hang on an existing pieces
>> of the Arm-architecture, so the notification can only add to the architecturally
>> defined meaning. (i.e. You can only send an SEA for something that can already
>> be described as a synchronous external abort).
>>
>> Once we get to user-space, for memory_failure() notifications, (which so far is
>> all we are talking about here), the only thing that could matter is whether the
>> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
>> Synchronous-External-Abort.
>>
>> The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU
>> because it can't always make an error synchronous. For memory_failure()
>> notifications to a KVM guest we really can do this, and we already have this
>> behaviour for free. An example:
>>
>> A guest touches some hardware:poisoned memory, for whatever reason the CPU can't
>> put the world back together to make this a synchronous exception, so it reports
>> it to firmware as an SError-interrupt.
> 
>> Linux gets an APEI notification and memory_failure() causes the affected page to
>> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
>>
>> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO->
>> action optional, probably asynchronous.

> If so, in this case, Qemu/kvmtool only got a little information(receive a SIGBUS), for this SIGBUS,
> it only include the SIGBUS_MCEERR_AO, error address. not include other information,
> only according the SIGBUS_MCEERR_AO and error address, user space does not know whether to use IRQ or POLLed notification.

The kernel can't tell it which to use: user space has to decide. This has to be
a property of the machine you are emulating, not the machine you happen to be
running on.

What happens if the notification came using a future notification type that user
space doesn't know about.
What if user space does know about this type, but the guest doesn't.
What if you migrate to a machine that uses a new notification type that you
didn't advertise to the guest via the HEST at boot time.

These dependencies have to break somewhere, and the right place is between the
host kernel and host user-space. This way whatever Qemu/kvmtool do will work in
the above 'what-ifs'.


> for example, SIGBUS_MCEERR_AO means asynchronous access, user space can use SEI, IRQ or POLLed notification.
> so user space will be confused to use which method.

There isn't a wrong choice here. I suggest always-use-IRQ. Its faster than
POLLed, but won't kill a guest that doesn't support NOTIFY_SEI.


> I think if we use such solution, user space only judging SIGBUS_MCEERR_A* is not enough.
> how we provide other extra information to let it choose the proper notification?

Forget the original notification. This physical machine's hardware configuration
and how its memory controller is wired up to report errors should not be
relevant to Qemu/kvmtool.

You need to decide how your emulated platform reports errors, you may want to
make it a configuration option which defaults to something safe.

[...]

> In my platform, there is another issue.
> for the stage2 fault, we get the IPA from the HPFAR_EL2 register,
> but for  huawei's CPU, if it is data Error(DFSC[5:0] is 0b010000),

'Synchronous External Abort, not on a translation table walk'

> not translation error(DFSC[5:0] is 0b0101xx),

(the set of external abort, parity or ECC errors that we get from the
page-table-walker)

> the HPFAR_EL2 is NULL, so the IPA is not recorded, in our current KVM code, we get the IPA from the HPFAR_EL2, so
> we can not get the right IPA value, because its value is zero.I do not know whether you have same issue.

This is something the ARM-ARM allows, so we have to live with it in software.

For external aborts the ESR has a 'FnV' bit 10 that for your first DSFSC
'Synchronous External Abort, not on a translation table walk' indicates there is
no FAR, (or presumably HPFAR). I assume you have this bit set in the ESR.

This shouldn't be a problem, for firmware-first we should take the address from
the CPER records as this also gives us a range. For kernel-first we'd take
whatever is in the v8.2 RAS ERR records. Its only if this wasn't a RAS error
that we're likely to print out this address as we kill-the-task/panic.


> Although hpfar_el2 does not record IPA, but host firmware can still record the PA

I agree, it can get the PA from the v8.2 RAS ERR registers and hand it to the OS
using CPER.


> If call memory_failure(), memory_failure can translate the PA to host VA, then deliver
> host VA to Qemu.

Yes, this is how it works for any user-space process as two processes sharing
the same page may map it in different locations.


> Qemu can translate the host VA to IPA. so we rely on memory_failure() to get
> the IPA.

Yes. I don't see why this is a problem: The kernel isn't going to pass RAS
events into the guest, so it never needs to know the IPA.

Instead we notify user-space about ranges of memory affected by
memory_failure(), KVM's user-space isn't a special case here.

As you point out, if Qemu wants to notify the guest it can calculate the IPA and
either use CPER for firmware-first, or in the future, update some representation
of the v8.2 ERR records once we can virtualise kernel-first.

(I'm not sure I understand your point here, but I don't think we disagree,)


Thanks,

James
James Morse Sept. 22, 2017, 4:51 p.m. UTC | #7
Hi gengdongjiu,

On 21/09/17 08:55, gengdongjiu wrote:
> On 2017/9/14 21:00, James Morse wrote:
>> user-space can choose whether to use SEA or SEI, it doesn't have to choose the
>> same notification type that firmware used, which in turn doesn't have to be the
>> same as that used by the CPU to notify firmware.
>>
>> The choice only matters because these notifications hang on an existing pieces
>> of the Arm-architecture, so the notification can only add to the architecturally
>> defined meaning. (i.e. You can only send an SEA for something that can already
>> be described as a synchronous external abort).
>>
>> Once we get to user-space, for memory_failure() notifications, (which so far is
>> all we are talking about here), the only thing that could matter is whether the
>> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
>> Synchronous-External-Abort.
>>
>> The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU
>> because it can't always make an error synchronous. For memory_failure()
>> notifications to a KVM guest we really can do this, and we already have this
>> behaviour for free. An example:
>>
>> A guest touches some hardware:poisoned memory, for whatever reason the CPU can't
>> put the world back together to make this a synchronous exception, so it reports
>> it to firmware as an SError-interrupt.
>> Linux gets an APEI notification and memory_failure() causes the affected page to
>> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
>>
>> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO->
>> action optional, probably asynchronous.
>>
>> But in our example it wasn't really asynchronous, that was just a property of
>> the original CPU->firmware notification. What happens? The guest vcpu is re-run,
>> it re-runs the same instructions (this was a contained error so KVM's ELR points
>> at/before the instruction that steps in the problem). This time KVM takes a
>> stage2 fault, which the mm code will refuse to fixup because the relevant page
>> was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with
>> SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA.
> 
> CC Achin
> 
> I have some personal opinion, if you think it is not right, hope you can point out.
> 
> Synchronous External Abort and SError Interrupt are hardware exception(hardware concept),
> which is independent of software notification,
> in armv8 without RAS, the two concepts already exist. In the APEI spec, in order to
> better describe the two exceptions, so use SEA and SEI notification to stand
for them.

> SEA notification stands for Synchronous External Abort, so may be it is not only a
> notification, it also stands for a hardware error type.
> SEI notification stands for SError Interrupt, so may be it is not only a notification,
> it also stands for a hardware error type.

> In the OS, it has different handling flow to the two exception(two notification):
> when the guest OS running, if the hardware generates a Synchronous External Abort, we
> told the guest OS this error is SError Interrupt instead of Synchronous
External Abort.

This should only happen when APEI doesn't claim the external-abort as a RAS
notification. If there were CPER records to process then the error is handled by
the host, and we can return to the guest.

If this wasn't a firmware-first notification, then you're right KVM hands the
guest an asynchronous external abort. This could be considered a bug in KVM. (we
can discuss with Marc and Christoffer what it should do), but:

I'm not sure what scenario you could see this in: surely all your
CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first
notifications. So they should always be claimed by APEI.


> guest OS uses SEI notification handling flow to deal with it, I am not sure whether it
> will have problem, because the true hardware exception is Synchronous External
> Abort, but software treats it as SError interrupt to handle.

Once you're into a guest the original 'true hardware exception' shouldn't
matter. In this scenario KVM has handed the guest an SError, our question is 'is
it an SEI notification?':

For firmware first the guest OS should poke around in the CPER buffers, find
nothing to do, and return to the arch code for (future) kernel-first.
For kernel first the guest OS should trawl through the v8.2 ERR registers, find
nothing to do, and continue to the default case:

By default, we should panic on SError, unless its classified as a non-fatal RAS
error. (I'm tempted to pr_warn_once() if we get RAS notifications but there is
no work to do).


What you may be seeing is some awkwardness with the change in the SError ESR
with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
were all impdef so it didn't matter).
With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
means 'classified as a RAS error ... unknown!'.

I have a patch in the upcoming SError/RAS series that changes KVMs virtual-abort
code to specify an impdef ESR for this path.


> In the mainline code, it does not have SEI notification support, the reason I 
> think it is because of the error address record by firmware is not accurate
> (SError Interrupt is asynchronous exception).

Yes, while we don't expect a FAR with an SError, but we do expect a valid
representation of the RAS error in either the CPER records or the v8.2. ERR
registers (or both). If we have neither of those, its not a RAS error and we
should panic.


> so if treat a hardware Synchronous External Abort as SError interrupt(SEI). 
> The default OS behavior for SEI is PANIC, that is to say, when hardware triggers
> a Synchronous External Abort(SEA), if guest treat it as SError interrupt(SEI),
> the OS will be panic. in fact, it can be recoverable instead of Panic.

If its a RAS error APEI (or in the future, the kernel-first handler), should
claim the error, so the guest never sees it. If you are hitting this behaviour
in KVM, then it wasn't a RAS error.


> I ever added a patch to support the SEI notification, but not sure whether
> it is can be accepted by open source, until now, not receive response.

The patch you posted during the merge window made no sense on its own, so must
replace $one_of the other patches in your v5 (or was it v6)... I'll get to it...

Because the SEI notification depends on v8.2 I'd like to get the SError/RAS
series posted (currently re-testing), then I'll pick up enough of the patches
you've posted for a consolidated version of the series, and we can take the
discussion from there.

I'd still like to know what your firmware does if the normal-world believes its
masked physical-SError and you want to hand it an SEI notification.


Thanks,

James
Dongjiu Geng Sept. 27, 2017, 11:07 a.m. UTC | #8
Hi James,
   Sorry for my late response, thank you very much for comments.

On 2017/9/23 0:51, James Morse wrote:
[.....]
>>
>> CC Achin
>>
>> I have some personal opinion, if you think it is not right, hope you can point out.
>>
>> Synchronous External Abort and SError Interrupt are hardware exception(hardware concept),
>> which is independent of software notification,
>> in armv8 without RAS, the two concepts already exist. In the APEI spec, in order to
>> better describe the two exceptions, so use SEA and SEI notification to stand
> for them.
> 
>> SEA notification stands for Synchronous External Abort, so may be it is not only a
>> notification, it also stands for a hardware error type.
>> SEI notification stands for SError Interrupt, so may be it is not only a notification,
>> it also stands for a hardware error type.
> 
>> In the OS, it has different handling flow to the two exception(two notification):
>> when the guest OS running, if the hardware generates a Synchronous External Abort, we
>> told the guest OS this error is SError Interrupt instead of Synchronous
> External Abort.
> 
> This should only happen when APEI doesn't claim the external-abort as a RAS
> notification. If there were CPER records to process then the error is handled by
> the host, and we can return to the guest.

consider again. I think you should be right.
In the firmware-first solution, firmware will shield all kinds of errors and record them to the CPER buffer.


> 
> If this wasn't a firmware-first notification, then you're right KVM hands the
> guest an asynchronous external abort. This could be considered a bug in KVM. (we
> can discuss with Marc and Christoffer what it should do), but:
> 
> I'm not sure what scenario you could see this in: surely all your
> CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first
> notifications. So they should always be claimed by APEI.

Yes, if it is firmware-first we should not exist such issue.

> 
> 
>> guest OS uses SEI notification handling flow to deal with it, I am not sure whether it
>> will have problem, because the true hardware exception is Synchronous External
>> Abort, but software treats it as SError interrupt to handle.
> 
> Once you're into a guest the original 'true hardware exception' shouldn't
> matter. In this scenario KVM has handed the guest an SError, our question is 'is
> it an SEI notification?':
> 
> For firmware first the guest OS should poke around in the CPER buffers, find
> nothing to do, and return to the arch code for (future) kernel-first.
> For kernel first the guest OS should trawl through the v8.2 ERR registers, find
> nothing to do, and continue to the default case:
> 
> By default, we should panic on SError, unless its classified as a non-fatal RAS
> error. (I'm tempted to pr_warn_once() if we get RAS notifications but there is
> no work to do).

understand, thanks.

> 
> 
> What you may be seeing is some awkwardness with the change in the SError ESR
> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
> were all impdef so it didn't matter).
> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
> means 'classified as a RAS error ... unknown!'.
> 
> I have a patch in the upcoming SError/RAS series that changes KVMs virtual-abort
> code to specify an impdef ESR for this path.
Before I remember Marc and you suggest me specify the an impdef ESR (set the vsesr_el2) in
the userspace,
now do you want to specify an impdef ESR in KVM instead of usrspace?
if setting the vsesr_el2 in the KVM, whether user-space need to specify?
May be we can combine the patches that specify an impdef ESR(set vsesr_el2) patch to one.

> 
> 
>> In the mainline code, it does not have SEI notification support, the reason I 
>> think it is because of the error address record by firmware is not accurate
>> (SError Interrupt is asynchronous exception).
> 
> Yes, while we don't expect a FAR with an SError, but we do expect a valid
> representation of the RAS error in either the CPER records or the v8.2. ERR
> registers (or both). If we have neither of those, its not a RAS error and we
> should panic.
> 
> 
>> so if treat a hardware Synchronous External Abort as SError interrupt(SEI). 
>> The default OS behavior for SEI is PANIC, that is to say, when hardware triggers
>> a Synchronous External Abort(SEA), if guest treat it as SError interrupt(SEI),
>> the OS will be panic. in fact, it can be recoverable instead of Panic.
> 
> If its a RAS error APEI (or in the future, the kernel-first handler), should
> claim the error, so the guest never sees it. If you are hitting this behaviour
> in KVM, then it wasn't a RAS error.
> 
> 
>> I ever added a patch to support the SEI notification, but not sure whether
>> it is can be accepted by open source, until now, not receive response.
> 
> The patch you posted during the merge window made no sense on its own, so must
> replace $one_of the other patches in your v5 (or was it v6)... I'll get to it...
yes, thanks.
Because SEI notification support does not depend on the RAS virtualization, I break them out of this
series.
they are here(Today Tyler have some comments, I will update it)

https://patchwork.kernel.org/patch/9950953/
https://patchwork.kernel.org/patch/9952493/

> 
> Because the SEI notification depends on v8.2 I'd like to get the SError/RAS
> series posted (currently re-testing), then I'll pick up enough of the patches
> you've posted for a consolidated version of the series, and we can take the
> discussion from there.
James, it is great, we can make a consolidated version of the series.

> 
> I'd still like to know what your firmware does if the normal-world believes its
> masked physical-SError and you want to hand it an SEI notification.
firstly the physical-SError that happened in the EL2/EL1/EL1 can not be masked if SCR_EL3.EA is set.

when trap to EL3, firmware will record the error to APEI CPER from reading ERR* RAS registers.

(1) if HCR_EL2.TEA is set to 1, exception come from EL0, El1. firmware knows this SError come from guest OS, copy the elr_el3 to elr_el2, copy ESR_El3 to ESR_EL2.
    if the SError exception come from guest EL0 or EL1, set ELR_EL3 with VBAR_EL2 + 0x580(one EL2 SEI entry point),

    execute "ERET", then jump to EL2 hypervisor.

(2)if the SError exception come EL2 hypervisor, copy the elr_el3 to elr_el2, copy ESR_El3 to ESR_EL, set ELR_EL3 with VBAR_EL2+0x380(one EL2 SEI entry point),

   execute "ERET", then jump to EL2 hypervisor.


(2) if HCR_EL2.TEA is set to 0, SError come from host EL1 ,EL0, copy ELR_EL3 to ELR_EL1,ESR_EL3 to ESR_EL1
    if the SError come from host EL0, set ELR_EL3 with VBAR_EL1+0x580
    if the SError come from host EL1, set ELR_EL3 with VBAR_EL1+0x380

  execute "ERET", then jump to host EL1.


> 
> Thanks,
> 
> James
> 
> .
>
gengdongjiu Sept. 27, 2017, 3:37 p.m. UTC | #9
>> What you may be seeing is some awkwardness with the change in the SError ESR
>> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
>> were all impdef so it didn't matter).
>> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
>> means 'classified as a RAS error ... unknown!'.
>>
>> I have a patch in the upcoming SError/RAS series that changes KVMs virtual-abort
>> code to specify an impdef ESR for this path.
> Before I remember Marc and you suggest me specify the an impdef ESR (set the vsesr_el2) in
> the userspace,

I pasted Marc's propose and your suggestion that set VSESR_EL2(specify
virtual SError syndrome) by the user space.
https://lkml.org/lkml/2017/3/20/441
https://lkml.org/lkml/2017/3/20/516


> now do you want to specify an impdef ESR in KVM instead of usrspace?
> if setting the vsesr_el2 in the KVM, whether user-space need to specify?
> May be we can combine the patches that specify an impdef ESR(set vsesr_el2) patch to one.
>
James Morse Oct. 6, 2017, 5:31 p.m. UTC | #10
Hi gengdongjiu,

On 27/09/17 12:07, gengdongjiu wrote:
> On 2017/9/23 0:51, James Morse wrote:
>> If this wasn't a firmware-first notification, then you're right KVM hands the
>> guest an asynchronous external abort. This could be considered a bug in KVM. (we
>> can discuss with Marc and Christoffer what it should do), but:
>>
>> I'm not sure what scenario you could see this in: surely all your
>> CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first
>> notifications. So they should always be claimed by APEI.

> Yes, if it is firmware-first we should not exist such issue.

[...]

>> What you may be seeing is some awkwardness with the change in the SError ESR
>> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
>> were all impdef so it didn't matter).
>> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
>> means 'classified as a RAS error ... unknown!'.

>> I have a patch in the upcoming SError/RAS series that changes KVMs virtual-abort
>> code to specify an impdef ESR for this path.

https://www.spinics.net/lists/arm-kernel/msg609589.html


> Before I remember Marc and you suggest me specify the an impdef ESR (set the vsesr_el2) in
> the userspace,

If Qemu/kvmtool wants to emulate a machine that notifies the guest about
firmware-first RAS Errors using SError/SEI, it needs to decide when to send
these SError and what ESR to specify.


> now do you want to specify an impdef ESR in KVM instead of usrspace?

No, that patch is just trying to fixup the existing cases where KVM already
injects an SError. I'm just trying to keep the behaviour the-same:

Before the RAS Extensions the guest would always get an impdef SError ESR.
After the RAS Extensions KVM has to pick an ESR, as otherwise the guest gets the
hardware-reset value of VSESR_EL2. On the FVP this is all-zeros, which is a RAS
encoding. That patch changes it to still be an impdef SError ESR.


> if setting the vsesr_el2 in the KVM, whether user-space need to specify?

I think we need a KVM CAP API to do this. With that patch it can be wired into
pend_guest_serror(), which takes the ESR to make pending if the CPU supports it.

It's not clear to me whether its useful for user-space to make an SError pending
even if it can't set the ESR....

[...]

>> Because the SEI notification depends on v8.2 I'd like to get the SError/RAS
>> series posted (currently re-testing), then I'll pick up enough of the patches
>> you've posted for a consolidated version of the series, and we can take the
>> discussion from there.

> James, it is great, we can make a consolidated version of the series.

We need to get some of the wider issues fixed first, the big-ugly one is
memory_failure_queue() not being NMI safe. (this isn't a problem for SEA as this
would only become re-entrant if the kernel text was corrupt). It is a problem
for SEI and SDEI.


>> I'd still like to know what your firmware does if the normal-world believes its
>> masked physical-SError and you want to hand it an SEI notification.

Aha, thanks for this:

> firstly the physical-SError that happened in the EL2/EL1/EL1 can not be masked if SCR_EL3.EA is set.

Masked for the CPU because the CPU can deliver the SError to EL3.

What about software? Code at EL1 and EL2 each have a PSTATE.A bit they may have
set. HCR_EL2.VSE respects EL1's PSTATE.A ... the question is does your firmware
respect the PSTATE.A value of the exception level that SError are routed to?


> when trap to EL3, firmware will record the error to APEI CPER from reading ERR* RAS registers.
> 
> (1) if HCR_EL2.TEA is set to 1, exception come from EL0, El1. firmware knows this

HCR_EL2.TEA covers synchronous-external-aborts. For SError you need to check
HCR_EL2.AMO. Some crazy hypervisor may set one and not the other.


> SError come from guest OS, copy the elr_el3 to elr_el2, copy ESR_El3 to
> ESR_EL2.

The EC value in the ELR describes current/lower exception level, you need to
re-encode this for EL2 if the exception came from EL2.


>     if the SError exception come from guest EL0 or EL1, set ELR_EL3 with VBAR_EL2 + 0x580(one EL2 SEI entry point),
> 
>     execute "ERET", then jump to EL2 hypervisor.
>
> (2)if the SError exception come EL2 hypervisor, copy the elr_el3 to elr_el2, copy ESR_El3 to ESR_EL,
> set ELR_EL3 with VBAR_EL2+0x380(one EL2 SEI entry point),
> 
>    execute "ERET", then jump to EL2 hypervisor.

This SError came from EL2. You _must_ check SPSR_EL3.A is clear before returning
to the EL2 SError vector.

EL2 believes it has masked SError, it does this because it can't handle one
right now. If your firmware jumps in anyway - its game over.

We mask SError in entry.S when we take an exception and when we return from an
exception. This is so that we can read/write the ELR/SPSR without them changing
under our feet. If your firmware overwrites these values - we've lost them, and
can never return to the context we interrupted.


> (2) if HCR_EL2.TEA

It's an SError, these are routed by HCR_EL2.AMO.


> is set to 0, SError come from host EL1 ,EL0, copy ELR_EL3 to ELR_EL1,ESR_EL3 to ESR_EL1

Again, the EC needs re-encoding.


>     if the SError come from host EL0, set ELR_EL3 with VBAR_EL1+0x580
>     if the SError come from host EL1, set ELR_EL3 with VBAR_EL1+0x380
> 
>   execute "ERET", then jump to host EL1.

What if the SError came from EL2 and HCR_EL2.AMO is clear. This means SError are
routed to EL1, and are implicitly masked when executing at EL2.

This is the same SError-is-masked issue as above.

What does your firmware do when physical SError is masked and you want to hand
it an SEI notification?


You need to route your fake-SError based on HCR_EL2.AMO and your
fake-External-Abort based on HCR_EL2.TEA.

You also need a triage step at EL3 before you try and notify the normal world.
If you take an uncontainable error to EL3 there is very little point telling the
OS - its probably already on fire. In this case firmware needs to reboot the
machine, and generate a summary of the error for the BERT.


Thanks,

James
Dongjiu Geng Oct. 19, 2017, 7:49 a.m. UTC | #11
On 2017/10/7 1:31, James Morse wrote:
> Hi gengdongjiu,
> 
> On 27/09/17 12:07, gengdongjiu wrote:
>> On 2017/9/23 0:51, James Morse wrote:
>>> If this wasn't a firmware-first notification, then you're right KVM hands the
>>> guest an asynchronous external abort. This could be considered a bug in KVM. (we
>>> can discuss with Marc and Christoffer what it should do), but:
>>>
>>> I'm not sure what scenario you could see this in: surely all your
>>> CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first
>>> notifications. So they should always be claimed by APEI.
> 
>> Yes, if it is firmware-first we should not exist such issue.
> 
> [...]
> 
>>> What you may be seeing is some awkwardness with the change in the SError ESR
>>> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
>>> were all impdef so it didn't matter).
>>> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
>>> means 'classified as a RAS error ... unknown!'.
> 
>>> I have a patch in the upcoming SError/RAS series that changes KVMs virtual-abort
>>> code to specify an impdef ESR for this path.
> 
> https://www.spinics.net/lists/arm-kernel/msg609589.html
> 
> 
>> Before I remember Marc and you suggest me specify the an impdef ESR (set the vsesr_el2) in
>> the userspace,
> 
> If Qemu/kvmtool wants to emulate a machine that notifies the guest about
> firmware-first RAS Errors using SError/SEI, it needs to decide when to send
> these SError and what ESR to specify.
yes, it is. I agree.

> 
> 
>> now do you want to specify an impdef ESR in KVM instead of usrspace?
> 
> No, that patch is just trying to fixup the existing cases where KVM already
> injects an SError. I'm just trying to keep the behaviour the-same:
> 
> Before the RAS Extensions the guest would always get an impdef SError ESR.
> After the RAS Extensions KVM has to pick an ESR, as otherwise the guest gets the
> hardware-reset value of VSESR_EL2. On the FVP this is all-zeros, which is a RAS
> encoding. That patch changes it to still be an impdef SError ESR.
> 
> 
>> if setting the vsesr_el2 in the KVM, whether user-space need to specify?
> 
> I think we need a KVM CAP API to do this. With that patch it can be wired into
> pend_guest_serror(), which takes the ESR to make pending if the CPU supports it.

For this CAP API, I have made a patch in the new series patches.
> 
> It's not clear to me whether its useful for user-space to make an SError pending
> even if it can't set the ESR....
why it can not set the ESR?
In the KVM, we can return a encoding fault info to userspace, then user space can
specify its own ESR for guest.
I already made a patch for it.


> 
> [...]
> 
>>> Because the SEI notification depends on v8.2 I'd like to get the SError/RAS
>>> series posted (currently re-testing), then I'll pick up enough of the patches
>>> you've posted for a consolidated version of the series, and we can take the
>>> discussion from there.
> 
>> James, it is great, we can make a consolidated version of the series.
> 
> We need to get some of the wider issues fixed first, the big-ugly one is
> memory_failure_queue() not being NMI safe. (this isn't a problem for SEA as this
> would only become re-entrant if the kernel text was corrupt). It is a problem
> for SEI and SDEI.
 For memory_failure_queue(), I think the big problem is it is in a process context, not error handling context.
there are two context. and the memory_failure_queue() is scheduled later than the error handling.


> 
> 
>>> I'd still like to know what your firmware does if the normal-world believes its
>>> masked physical-SError and you want to hand it an SEI notification.
> 
> Aha, thanks for this:
> 
>> firstly the physical-SError that happened in the EL2/EL1/EL1 can not be masked if SCR_EL3.EA is set.
> 
> Masked for the CPU because the CPU can deliver the SError to EL3.
> 
> What about software? Code at EL1 and EL2 each have a PSTATE.A bit they may have
> set. HCR_EL2.VSE respects EL1's PSTATE.A ... the question is does your firmware
> respect the PSTATE.A value of the exception level that SError are routed to?

Before route to the target EL, software set the spsr_el3.A to 1, then "eret", the PSTATE.A will be to 1.

Note:
PSTATE.A is shared by different EL, in the hardware, it is one register, not many registers.
spsr_elx has more registers, such as spsr_el1, spsr_el2, spsr_el3.


> 
> 
>> when trap to EL3, firmware will record the error to APEI CPER from reading ERR* RAS registers.
>>
>> (1) if HCR_EL2.TEA is set to 1, exception come from EL0, El1. firmware knows this
> 
> HCR_EL2.TEA covers synchronous-external-aborts. For SError you need to check
> HCR_EL2.AMO. Some crazy hypervisor may set one and not the other.
sorry, it is typo issue, should check HCR_EL2.AMO
> 
> 
>> SError come from guest OS, copy the elr_el3 to elr_el2, copy ESR_El3 to
>> ESR_EL2.
> 
> The EC value in the ELR describes current/lower exception level, you need to
> re-encode this for EL2 if the exception came from EL2.

Regardless it was trapped from EL1 or EL2. they are all from lower exception level.
it should be not encoding, such as Instruction Abort from a lower Exception level.
For both EL1 or EL2, the EC is 0b100000

EC == 0b100000: Instruction Abort from a lower Exception level.
Of course, if need re-encode for some cases, will do it.

> 
> 
>>     if the SError exception come from guest EL0 or EL1, set ELR_EL3 with VBAR_EL2 + 0x580(one EL2 SEI entry point),
>>
>>     execute "ERET", then jump to EL2 hypervisor.
>>
>> (2)if the SError exception come EL2 hypervisor, copy the elr_el3 to elr_el2, copy ESR_El3 to ESR_EL,
>> set ELR_EL3 with VBAR_EL2+0x380(one EL2 SEI entry point),
>>
>>    execute "ERET", then jump to EL2 hypervisor.
> 
> This SError came from EL2. You _must_ check SPSR_EL3.A is clear before returning
> to the EL2 SError vector.
before jumping, it will set the SPSR_EL3.A to 1.

> 
> EL2 believes it has masked SError, it does this because it can't handle one
> right now. If your firmware jumps in anyway - its game over.
> 
> We mask SError in entry.S when we take an exception and when we return from an
> exception. This is so that we can read/write the ELR/SPSR without them changing
> under our feet. If your firmware overwrites these values - we've lost them, and
> can never return to the context we interrupted.
In fact, we do not want to return to the context which is interrupted in Hawei's platform.
we will kill the APP or VM, not return to the context that interrupted.

> 
> 
>> (2) if HCR_EL2.TEA
> 
> It's an SError, these are routed by HCR_EL2.AMO.
typo issue, should be HCR_EL2.AMO.

> 
> 
>> is set to 0, SError come from host EL1 ,EL0, copy ELR_EL3 to ELR_EL1,ESR_EL3 to ESR_EL1
> 
> Again, the EC needs re-encoding.

For EL3, the EL1 and EL0 are all low level, why the EC needs re-encoding?


> 
> 
>>     if the SError come from host EL0, set ELR_EL3 with VBAR_EL1+0x580
>>     if the SError come from host EL1, set ELR_EL3 with VBAR_EL1+0x380
>>
>>   execute "ERET", then jump to host EL1.
> 
> What if the SError came from EL2 and HCR_EL2.AMO is clear. This means SError are
> routed to EL1, and are implicitly masked when executing at EL2.
No, before jump, it will also check the spsr_elx.M to determine jump to which level.


> 
> This is the same SError-is-masked issue as above.
> 
> What does your firmware do when physical SError is masked and you want to hand
> it an SEI notification?

Do you mean spsr_el3.A =1 or PSTATE.A = 1?
we just jump to the corresponding vector entry, not care spsr_el3.A or PSTATE.A is set or not set.

> 
> 
> You need to route your fake-SError based on HCR_EL2.AMO and your
> fake-External-Abort based on HCR_EL2.TEA.
yes, it should. above is typo issue.


> 
> You also need a triage step at EL3 before you try and notify the normal world.
> If you take an uncontainable error to EL3 there is very little point telling the
> OS - its probably already on fire. In this case firmware needs to reboot the
> machine, and generate a summary of the error for the BERT.

yes, it should and we already done that.

For some errors, BIOS will firstly handle it,
and then return to the interrupted context, OS even does not know this error happen.

> 
> 
> Thanks,
> 
> James
> 
> .
>
diff mbox

Patch

diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 9f3ca24bbcc6..514261f682b8 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -181,6 +181,11 @@  struct kvm_arch_memory_slot {
 #define KVM_REG_ARM64_SYSREG_OP2_MASK	0x0000000000000007
 #define KVM_REG_ARM64_SYSREG_OP2_SHIFT	0
 
+/* AArch64 fault registers */
+#define KVM_REG_ARM64_FAULT		(0x0014 << KVM_REG_ARM_COPROC_SHIFT)
+#define KVM_REG_ARM64_FAULT_ESR_EC_ISS	(0)
+#define KVM_REG_ARM64_FAULT_FAR		(1)
+
 #define ARM64_SYS_REG_SHIFT_MASK(x,n) \
 	(((x) << KVM_REG_ARM64_SYSREG_ ## n ## _SHIFT) & \
 	KVM_REG_ARM64_SYSREG_ ## n ## _MASK)
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 5c7f657dd207..020a644b20d7 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -128,6 +128,39 @@  static int set_core_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
 out:
 	return err;
 }
+static int get_fault_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
+{
+	void __user *uaddr = (void __user *)(unsigned long)reg->addr;
+	u32 value;
+	u32 id = reg->id & ~(KVM_REG_ARCH_MASK |
+			KVM_REG_SIZE_MASK | KVM_REG_ARM64_FAULT);
+
+	switch (id) {
+	case KVM_REG_ARM64_FAULT_ESR_EC_ISS:
+		/*
+		 * The user space may need to know the fault exception class
+		 * ESR_ELx.EC and instruction specific syndrome ESR_ELx.ISS
+		 */
+		value = kvm_vcpu_get_hsr(vcpu) & (ESR_ELx_EC_MASK | ESR_ELx_ISS_MASK);
+		if (copy_to_user(uaddr, &value, KVM_REG_SIZE(reg->id)) != 0)
+			return -EFAULT;
+		break;
+	case KVM_REG_ARM64_FAULT_FAR:
+		/*
+		 * When user space injects synchronized abort, it needs
+		 * to inject the faulting virtual address to guest OS, so
+		 * needs to get it from kvm.
+		 */
+		if (copy_to_user(uaddr, &(vcpu->arch.fault.far_el2),
+				KVM_REG_SIZE(reg->id)) != 0)
+			return -EFAULT;
+		break;
+	default:
+		return -ENOENT;
+	}
+	return 0;
+}
+
 
 int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
@@ -243,6 +276,9 @@  int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
 	if ((reg->id & KVM_REG_ARM_COPROC_MASK) == KVM_REG_ARM_CORE)
 		return get_core_reg(vcpu, reg);
 
+	if ((reg->id & KVM_REG_ARM_COPROC_MASK) == KVM_REG_ARM64_FAULT)
+		return get_fault_reg(vcpu, reg);
+
 	if (is_timer_reg(reg->id))
 		return get_timer_reg(vcpu, reg);