ppc/xive: Save/restore state of the External interrupt

Message ID	20220426101125.297064-1-fbarrat@linux.ibm.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org> From: Frederic Barrat <fbarrat@linux.ibm.com> To: clg@kaod.org, danielhb413@gmail.com, qemu-ppc@nongnu.org, qemu-devel@nongnu.org Subject: [PATCH] ppc/xive: Save/restore state of the External interrupt Date: Tue, 26 Apr 2022 12:11:25 +0200 Message-Id: <20220426101125.297064-1-fbarrat@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=148.163.156.1; envelope-from=fbarrat@linux.ibm.com; helo=mx0a-001b2d01.pphosted.com X-Spam_score_int: -19 X-Spam_score: -2.0 X-Spam_bar: -- X-Spam_report: (-2.0 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_EF=-0.1, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Cc: npiggin@gmail.com Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>
Series	ppc/xive: Save/restore state of the External interrupt \| expand ppc/xive: Save/restore state of the External interrupt

Frederic Barrat April 26, 2022, 10:11 a.m. UTC

When pulling an OS context from a vcpu, we should lower the External
interrupt if it is pending. Otherwise, it may be delivered in the
hypervisor context, which is unexpected. It can lead to an infinite
loop where the hypervisor catches the External exception, looks for an
interrupt, doesn't find any, exits the exception handler, repeat...

It also means that when pushing a new OS context on a vcpu, we need to
always check the restored Interrupt Pending Buffer (IPB), potentially
merge it with any escalation interrupt, and re-apply the External
interrupt signal if needed.

Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
---

Cedric: I'm wondering the reason behind commit
8256870ada9379abfd1f5b2c209ad01092dd0904, which makes the PIPR field
of the OS context read-only. The comment says it is re-evaluated from
the IPB when pushing a context, but I don't think it's true on P9
if there's no escalation. It's not a problem on P10 because we always
re-evaluate the PIPR if CPPR>0.
In any case, it's no longer an issue with this patch, but I'm
curious as to why restoring the PIPR was a problem in the first place.


 hw/intc/xive.c        | 26 +++++++++++++++++++++++---
 hw/intc/xive2.c       | 14 ++++++++------
 include/hw/ppc/xive.h |  1 +
 3 files changed, 32 insertions(+), 9 deletions(-)

Cédric Le Goater April 26, 2022, 12:35 p.m. UTC | #1

Hello Frederic,

On 4/26/22 12:11, Frederic Barrat wrote:
> When pulling an OS context from a vcpu, we should lower the External
> interrupt if it is pending. Otherwise, it may be delivered in the
> hypervisor context, which is unexpected. It can lead to an infinite
> loop where the hypervisor catches the External exception, looks for an
> interrupt, doesn't find any, exits the exception handler, repeat...

Nice ! I have been chasing this one for a while and couldn't corner it.

Setting an environment for this scenario is a bit painful. I had a
pseries disk image including a nested guest for it. Depending on the
need, I would run the image under KVM (for updates) or under QEMU
PowerNV for dev/tests.

I thne switched to a simpler buildroot env.

> It also means that when pushing a new OS context on a vcpu, we need to
> always check the restored Interrupt Pending Buffer (IPB), potentially
> merge it with any escalation interrupt, and re-apply the External
> interrupt signal if needed.

See below for a proposal to split this patch in 2 or 3 patches.

  
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
> ---
> 
> Cedric: I'm wondering the reason behind commit
> 8256870ada9379abfd1f5b2c209ad01092dd0904, which makes the PIPR field
> of the OS context read-only.

The comments is related to direct load/store on the TCTX registers.
When using the magic offsets, the logic is different.


> The comment says it is re-evaluated from
> the IPB when pushing a context, but I don't think it's true on P9
> if there's no escalation. It's not a problem on P10 because we always
> re-evaluate the PIPR if CPPR>0.
> In any case, it's no longer an issue with this patch, but I'm
> curious as to why restoring the PIPR was a problem in the first place.
> 
> 
>   hw/intc/xive.c        | 26 +++++++++++++++++++++++---
>   hw/intc/xive2.c       | 14 ++++++++------
>   include/hw/ppc/xive.h |  1 +
>   3 files changed, 32 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index b8e4c7294d..8430db687f 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -114,6 +114,21 @@ static void xive_tctx_notify(XiveTCTX *tctx, uint8_t ring)
>       }
>   }
>   
> +void xive_tctx_eval_ext_signal(XiveTCTX *tctx)

I would call it xive_tctx_reset_os_signal()

> +{
> +    uint8_t *regs = &tctx->regs[TM_QW1_OS];
> +
> +    /*
> +     * Used when pulling an OS context.
> +     * Lower the External interrupt if it's pending. It is necessary
> +     * to avoid catching it in the hypervisor context. It should be
> +     * raised again when re-pushing the OS context.
> +     */
> +    if (regs[TM_PIPR] < regs[TM_CPPR]) {

This seems useless. Since we want the signal down always.

> +        qemu_irq_lower(xive_tctx_output(tctx, TM_QW1_OS));
> +    }
> +}
> +
>   static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
>   {
>       uint8_t *regs = &tctx->regs[ring];
> @@ -388,6 +403,8 @@ static uint64_t xive_tm_pull_os_ctx(XivePresenter *xptr, XiveTCTX *tctx,
>       /* Invalidate CAM line */
>       qw1w2_new = xive_set_field32(TM_QW1W2_VO, qw1w2, 0);
>       xive_tctx_set_os_cam(tctx, qw1w2_new);
> +
> +    xive_tctx_eval_ext_signal(tctx);


These change forcing the signal down when the OS context is pulled
should be in its own patch.

>       return qw1w2;
>   }
>   
> @@ -413,10 +430,13 @@ static void xive_tctx_need_resend(XiveRouter *xrtr, XiveTCTX *tctx,
>           /* Reset the NVT value */
>           nvt.w4 = xive_set_field32(NVT_W4_IPB, nvt.w4, 0);
>           xive_router_write_nvt(xrtr, nvt_blk, nvt_idx, &nvt, 4);
> -
> -        /* Merge in current context */
> -        xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
>       }
> +    /*
> +     * Always merge in current context. Even if there's no escalation,
> +     * it will check with the IPB value restored before pushing the OS
> +     * context and set the External interrupt signal if needed.
> +     */
> +    xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);

That's another patch and I am not sure it is needed.

An IPB value is recorded in the NVT when an interrupt is delivered
while a vCPU is not dispatched. With this change, you would force
the update in any case when the context is pushed.

Is that to close a potential race window on an interrupt being
delivered to a vCPU just before it is dispatched by the HV ?

>   }
>   
>   /*
> diff --git a/hw/intc/xive2.c b/hw/intc/xive2.c
> index 3aff42a69e..b7f1917cd2 100644
> --- a/hw/intc/xive2.c
> +++ b/hw/intc/xive2.c
> @@ -269,6 +269,7 @@ uint64_t xive2_tm_pull_os_ctx(XivePresenter *xptr, XiveTCTX *tctx,
>           xive2_tctx_save_os_ctx(xrtr, tctx, nvp_blk, nvp_idx);
>       }
>   
> +    xive_tctx_eval_ext_signal(tctx);
>       return qw1w2;
>   }
>   
> @@ -316,7 +317,6 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx,
>   {
>       Xive2Nvp nvp;
>       uint8_t ipb;
> -    uint8_t cppr = 0;
>   
>       /*
>        * Grab the associated thread interrupt context registers in the
> @@ -337,7 +337,7 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx,
>       /* Automatically restore thread context registers */
>       if (xive2_router_get_config(xrtr) & XIVE2_VP_SAVE_RESTORE &&
>           do_restore) {
> -        cppr = xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp);
> +        xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp);
>       }
>   
>       ipb = xive_get_field32(NVP2_W2_IPB, nvp.w2);
> @@ -346,10 +346,12 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx,
>           xive2_router_write_nvp(xrtr, nvp_blk, nvp_idx, &nvp, 2);
>       }
>   
> -    /* An IPB or CPPR change can trigger a resend */
> -    if (ipb || cppr) {
> -        xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
> -    }
> +    /*
> +     * Always merge in current context. Even if there's no escalation,
> +     * it will check with the IPB value restored and set the External
> +     * interrupt signal if needed.
> +     */
> +    xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);


I guess this fine. The test on the cppr is just an optimisation, if I am
correct. But it needs to be in another patch.

Same previous comment for ibp.

Thanks,

C.

>   }
>   
>   /*
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 126e4e2c3a..91512ed5e6 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -527,6 +527,7 @@ Object *xive_tctx_create(Object *cpu, XivePresenter *xptr, Error **errp);
>   void xive_tctx_reset(XiveTCTX *tctx);
>   void xive_tctx_destroy(XiveTCTX *tctx);
>   void xive_tctx_ipb_update(XiveTCTX *tctx, uint8_t ring, uint8_t ipb);
> +void xive_tctx_eval_ext_signal(XiveTCTX *tctx);
>   
>   /*
>    * KVM XIVE device helpers

Cédric Le Goater April 26, 2022, 1:54 p.m. UTC | #2

On 4/26/22 12:11, Frederic Barrat wrote:
> When pulling an OS context from a vcpu, we should lower the External
> interrupt if it is pending. Otherwise, it may be delivered in the
> hypervisor context, which is unexpected. It can lead to an infinite
> loop where the hypervisor catches the External exception, looks for an
> interrupt, doesn't find any, exits the exception handler, repeat...
> 
> It also means that when pushing a new OS context on a vcpu, we need to
> always check the restored Interrupt Pending Buffer (IPB), potentially
> merge it with any escalation interrupt, and re-apply the External
> interrupt signal if needed.
> 
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
> ---
> 
> Cedric: I'm wondering the reason behind commit
> 8256870ada9379abfd1f5b2c209ad01092dd0904, which makes the PIPR field
> of the OS context read-only. The comment says it is re-evaluated from
> the IPB when pushing a context, but I don't think it's true on P9
> if there's no escalation. It's not a problem on P10 because we always
> re-evaluate the PIPR if CPPR>0.
> In any case, it's no longer an issue with this patch, but I'm
> curious as to why restoring the PIPR was a problem in the first place.
> 
> 
>   hw/intc/xive.c        | 26 +++++++++++++++++++++++---
>   hw/intc/xive2.c       | 14 ++++++++------
>   include/hw/ppc/xive.h |  1 +
>   3 files changed, 32 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index b8e4c7294d..8430db687f 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -114,6 +114,21 @@ static void xive_tctx_notify(XiveTCTX *tctx, uint8_t ring)
>       }
>   }
>   
> +void xive_tctx_eval_ext_signal(XiveTCTX *tctx)
> +{
> +    uint8_t *regs = &tctx->regs[TM_QW1_OS];
> +
> +    /*
> +     * Used when pulling an OS context.
> +     * Lower the External interrupt if it's pending. It is necessary
> +     * to avoid catching it in the hypervisor context. It should be
> +     * raised again when re-pushing the OS context.
> +     */
> +    if (regs[TM_PIPR] < regs[TM_CPPR]) {
> +        qemu_irq_lower(xive_tctx_output(tctx, TM_QW1_OS));
> +    }
> +}
> +
>   static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
>   {
>       uint8_t *regs = &tctx->regs[ring];
> @@ -388,6 +403,8 @@ static uint64_t xive_tm_pull_os_ctx(XivePresenter *xptr, XiveTCTX *tctx,
>       /* Invalidate CAM line */
>       qw1w2_new = xive_set_field32(TM_QW1W2_VO, qw1w2, 0);
>       xive_tctx_set_os_cam(tctx, qw1w2_new);
> +
> +    xive_tctx_eval_ext_signal(tctx);
>       return qw1w2;
>   }
>   
> @@ -413,10 +430,13 @@ static void xive_tctx_need_resend(XiveRouter *xrtr, XiveTCTX *tctx,


xive_tctx_need_resend() is restarting the event notification
sequence that was previously initiated in :

   xive_router_end_notify()
     xive_presenter_notify()

and postponed because the target was not found.

We could rename it to :

   xive_presenter_notify_complete/finish()

or find a way to share the same code with a boolean to skip
the matching part. to be discussed.
   

Thanks,

C.



>           /* Reset the NVT value */
>           nvt.w4 = xive_set_field32(NVT_W4_IPB, nvt.w4, 0);
>           xive_router_write_nvt(xrtr, nvt_blk, nvt_idx, &nvt, 4);
> -
> -        /* Merge in current context */
> -        xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
>       }
> +    /*
> +     * Always merge in current context. Even if there's no escalation,
> +     * it will check with the IPB value restored before pushing the OS
> +     * context and set the External interrupt signal if needed.
> +     */
> +    xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
>   }
>   
>   /*
> diff --git a/hw/intc/xive2.c b/hw/intc/xive2.c
> index 3aff42a69e..b7f1917cd2 100644
> --- a/hw/intc/xive2.c
> +++ b/hw/intc/xive2.c
> @@ -269,6 +269,7 @@ uint64_t xive2_tm_pull_os_ctx(XivePresenter *xptr, XiveTCTX *tctx,
>           xive2_tctx_save_os_ctx(xrtr, tctx, nvp_blk, nvp_idx);
>       }
>   
> +    xive_tctx_eval_ext_signal(tctx);
>       return qw1w2;
>   }
>   
> @@ -316,7 +317,6 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx,
>   {
>       Xive2Nvp nvp;
>       uint8_t ipb;
> -    uint8_t cppr = 0;
>   
>       /*
>        * Grab the associated thread interrupt context registers in the
> @@ -337,7 +337,7 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx,
>       /* Automatically restore thread context registers */
>       if (xive2_router_get_config(xrtr) & XIVE2_VP_SAVE_RESTORE &&
>           do_restore) {
> -        cppr = xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp);
> +        xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp);
>       }
>   
>       ipb = xive_get_field32(NVP2_W2_IPB, nvp.w2);
> @@ -346,10 +346,12 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx,
>           xive2_router_write_nvp(xrtr, nvp_blk, nvp_idx, &nvp, 2);
>       }
>   
> -    /* An IPB or CPPR change can trigger a resend */
> -    if (ipb || cppr) {
> -        xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
> -    }
> +    /*
> +     * Always merge in current context. Even if there's no escalation,
> +     * it will check with the IPB value restored and set the External
> +     * interrupt signal if needed.
> +     */
> +    xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
>   }
>   
>   /*
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 126e4e2c3a..91512ed5e6 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -527,6 +527,7 @@ Object *xive_tctx_create(Object *cpu, XivePresenter *xptr, Error **errp);
>   void xive_tctx_reset(XiveTCTX *tctx);
>   void xive_tctx_destroy(XiveTCTX *tctx);
>   void xive_tctx_ipb_update(XiveTCTX *tctx, uint8_t ring, uint8_t ipb);
> +void xive_tctx_eval_ext_signal(XiveTCTX *tctx);
>   
>   /*
>    * KVM XIVE device helpers

Frederic Barrat April 26, 2022, 2:49 p.m. UTC | #3

Hello Cédric!

Thanks for the detail review!



On 26/04/2022 14:35, Cédric Le Goater wrote:
> Hello Frederic,
> 
> On 4/26/22 12:11, Frederic Barrat wrote:
>> When pulling an OS context from a vcpu, we should lower the External
>> interrupt if it is pending. Otherwise, it may be delivered in the
>> hypervisor context, which is unexpected. It can lead to an infinite
>> loop where the hypervisor catches the External exception, looks for an
>> interrupt, doesn't find any, exits the exception handler, repeat...
> 
> Nice ! I have been chasing this one for a while and couldn't corner it.
> 
> Setting an environment for this scenario is a bit painful. I had a
> pseries disk image including a nested guest for it. Depending on the
> need, I would run the image under KVM (for updates) or under QEMU
> PowerNV for dev/tests.
> 
> I thne switched to a simpler buildroot env.
> 
>> It also means that when pushing a new OS context on a vcpu, we need to
>> always check the restored Interrupt Pending Buffer (IPB), potentially
>> merge it with any escalation interrupt, and re-apply the External
>> interrupt signal if needed.
> 
> See below for a proposal to split this patch in 2 or 3 patches.


Agreed, I will resend splitting it.


>> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
>> ---
>>
>> Cedric: I'm wondering the reason behind commit
>> 8256870ada9379abfd1f5b2c209ad01092dd0904, which makes the PIPR field
>> of the OS context read-only.
> 
> The comments is related to direct load/store on the TCTX registers.
> When using the magic offsets, the logic is different.


OK and to summarize our discussion on this: the PIPR field is read-only 
because that is what the hardware says. Which means we need to make sure 
we *always* re-compute the PIPR when pushing an OS context, since it is 
not restored the same way as the other registers (mostly CPPR, IPB). 
More on that below.



>> The comment says it is re-evaluated from
>> the IPB when pushing a context, but I don't think it's true on P9
>> if there's no escalation. It's not a problem on P10 because we always
>> re-evaluate the PIPR if CPPR>0.
>> In any case, it's no longer an issue with this patch, but I'm
>> curious as to why restoring the PIPR was a problem in the first place.
>>
>>
>>   hw/intc/xive.c        | 26 +++++++++++++++++++++++---
>>   hw/intc/xive2.c       | 14 ++++++++------
>>   include/hw/ppc/xive.h |  1 +
>>   3 files changed, 32 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index b8e4c7294d..8430db687f 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -114,6 +114,21 @@ static void xive_tctx_notify(XiveTCTX *tctx, 
>> uint8_t ring)
>>       }
>>   }
>> +void xive_tctx_eval_ext_signal(XiveTCTX *tctx)
> 
> I would call it xive_tctx_reset_os_signal()


OK


>> +{
>> +    uint8_t *regs = &tctx->regs[TM_QW1_OS];
>> +
>> +    /*
>> +     * Used when pulling an OS context.
>> +     * Lower the External interrupt if it's pending. It is necessary
>> +     * to avoid catching it in the hypervisor context. It should be
>> +     * raised again when re-pushing the OS context.
>> +     */
>> +    if (regs[TM_PIPR] < regs[TM_CPPR]) {
> 
> This seems useless. Since we want the signal down always.


Agreed the test is ugly though it works and avoid calling 
qemu_irq_lower() when it's not needed. I'll check whether qemu offers a 
way to test the state of a signal (which is what we really want) 
otherwise I'll always lower the signal unconditionnaly.


>> +        qemu_irq_lower(xive_tctx_output(tctx, TM_QW1_OS));
>> +    }
>> +}
>> +
>>   static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t 
>> cppr)
>>   {
>>       uint8_t *regs = &tctx->regs[ring];
>> @@ -388,6 +403,8 @@ static uint64_t xive_tm_pull_os_ctx(XivePresenter 
>> *xptr, XiveTCTX *tctx,
>>       /* Invalidate CAM line */
>>       qw1w2_new = xive_set_field32(TM_QW1W2_VO, qw1w2, 0);
>>       xive_tctx_set_os_cam(tctx, qw1w2_new);
>> +
>> +    xive_tctx_eval_ext_signal(tctx);
> 
> 
> These change forcing the signal down when the OS context is pulled
> should be in its own patch.


OK


>>       return qw1w2;
>>   }
>> @@ -413,10 +430,13 @@ static void xive_tctx_need_resend(XiveRouter 
>> *xrtr, XiveTCTX *tctx,
>>           /* Reset the NVT value */
>>           nvt.w4 = xive_set_field32(NVT_W4_IPB, nvt.w4, 0);
>>           xive_router_write_nvt(xrtr, nvt_blk, nvt_idx, &nvt, 4);
>> -
>> -        /* Merge in current context */
>> -        xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
>>       }
>> +    /*
>> +     * Always merge in current context. Even if there's no escalation,
>> +     * it will check with the IPB value restored before pushing the OS
>> +     * context and set the External interrupt signal if needed.
>> +     */
>> +    xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
> 
> That's another patch and I am not sure it is needed.
> 
> An IPB value is recorded in the NVT when an interrupt is delivered
> while a vCPU is not dispatched. With this change, you would force
> the update in any case when the context is pushed.
> 
> Is that to close a potential race window on an interrupt being
> delivered to a vCPU just before it is dispatched by the HV ?


Not quite, there's no race involved. Consider the following the scenario 
that I can hit fairly easily:

1. an External interrupt is raised while the guest is on the CPU. We now 
have IPB!=0 and PIPR!=0

2. the guest enters the interrupt handler but before it can ack the 
interrupt (special mmio read "TM_SPC_ACK_OS_REG"), another hypervisor 
interrupt is raised. For example the Hypervisor Decrementer (but I've 
also hit it with others). The hypervisor interrupt is delivered 
immediately in that case and forces a guest exit, so that the hypervisor 
can process the hypervisor interrupt. So we leave the guest and pull the 
OS context with IPB and PIPR!=0. That's the state which is saved, either 
by KVM (P9)) or by the hw (P10 with vp-save-restore).

3. Some time later, we enter the guest again and restore the interrupt 
state, either from KVM (P9) or automatically (P10). So we call:

xive_tm_push_os_ctx()
   --> xive_tctx_need_resend()

In xive_tctx_need_resend(), we check the NVT to see if we had an 
escalation. If we do, then we update the IPB and PIPR and we are fine.
But if we don't have an escalation, we don't call 
xive_tctx_ipb_update(). The IPB value will be correct because it was 
just restored, but we won't recompute the PIPR register.

So it seems to me that irrespective of the separate issue of 
lowering/raising the External interrupt signal when pulling/pushing an 
OS context, we already need to always go through xive_tctx_ipb_update() 
to recompute the PIPR (and it's agreed that's a reason enough to be a 
separate patch).

Now, if on top of that, we add the fact that we may need to raise the 
External interrupt signal when pushing an OS context, then calling 
xive_tctx_ipb_update() is our natural way of doing that. So that's a 
second reason where it helps.


> 
>>   }
>>   /*
>> diff --git a/hw/intc/xive2.c b/hw/intc/xive2.c
>> index 3aff42a69e..b7f1917cd2 100644
>> --- a/hw/intc/xive2.c
>> +++ b/hw/intc/xive2.c
>> @@ -269,6 +269,7 @@ uint64_t xive2_tm_pull_os_ctx(XivePresenter *xptr, 
>> XiveTCTX *tctx,
>>           xive2_tctx_save_os_ctx(xrtr, tctx, nvp_blk, nvp_idx);
>>       }
>> +    xive_tctx_eval_ext_signal(tctx);
>>       return qw1w2;
>>   }
>> @@ -316,7 +317,6 @@ static void xive2_tctx_need_resend(Xive2Router 
>> *xrtr, XiveTCTX *tctx,
>>   {
>>       Xive2Nvp nvp;
>>       uint8_t ipb;
>> -    uint8_t cppr = 0;
>>       /*
>>        * Grab the associated thread interrupt context registers in the
>> @@ -337,7 +337,7 @@ static void xive2_tctx_need_resend(Xive2Router 
>> *xrtr, XiveTCTX *tctx,
>>       /* Automatically restore thread context registers */
>>       if (xive2_router_get_config(xrtr) & XIVE2_VP_SAVE_RESTORE &&
>>           do_restore) {
>> -        cppr = xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, 
>> nvp_idx, &nvp);
>> +        xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp);
>>       }
>>       ipb = xive_get_field32(NVP2_W2_IPB, nvp.w2);
>> @@ -346,10 +346,12 @@ static void xive2_tctx_need_resend(Xive2Router 
>> *xrtr, XiveTCTX *tctx,
>>           xive2_router_write_nvp(xrtr, nvp_blk, nvp_idx, &nvp, 2);
>>       }
>> -    /* An IPB or CPPR change can trigger a resend */
>> -    if (ipb || cppr) {
>> -        xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
>> -    }
>> +    /*
>> +     * Always merge in current context. Even if there's no escalation,
>> +     * it will check with the IPB value restored and set the External
>> +     * interrupt signal if needed.
>> +     */
>> +    xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
> 
> 
> I guess this fine. The test on the cppr is just an optimisation, if I am
> correct. But it needs to be in another patch.
> 
> Same previous comment for ibp.


The scenario I described above still stands on P10. In which case 
checking the CPPR value becomes a moot point if we always call 
xive_tctx_ipb_update(). And the compiler yells if it's unused.
Thanks!

   Fred


> 
> Thanks,
> 
> C.
> 
>>   }
>>   /*
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 126e4e2c3a..91512ed5e6 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -527,6 +527,7 @@ Object *xive_tctx_create(Object *cpu, 
>> XivePresenter *xptr, Error **errp);
>>   void xive_tctx_reset(XiveTCTX *tctx);
>>   void xive_tctx_destroy(XiveTCTX *tctx);
>>   void xive_tctx_ipb_update(XiveTCTX *tctx, uint8_t ring, uint8_t ipb);
>> +void xive_tctx_eval_ext_signal(XiveTCTX *tctx);
>>   /*
>>    * KVM XIVE device helpers
>

Cédric Le Goater April 26, 2022, 3:24 p.m. UTC | #4

>>> +{
>>> +    uint8_t *regs = &tctx->regs[TM_QW1_OS];
>>> +
>>> +    /*
>>> +     * Used when pulling an OS context.
>>> +     * Lower the External interrupt if it's pending. It is necessary
>>> +     * to avoid catching it in the hypervisor context. It should be
>>> +     * raised again when re-pushing the OS context.
>>> +     */
>>> +    if (regs[TM_PIPR] < regs[TM_CPPR]) {
>>
>> This seems useless. Since we want the signal down always.
> 
> Agreed the test is ugly though it works and avoid calling qemu_irq_lower() when it's not needed. I'll check whether qemu offers a way to test the state of a signal (which is what we really want) otherwise I'll always lower the signal unconditionnaly.

Not really, it's a function call. or test the NSR may be ?

Just lower it, I guess. I don't think this will impact much performance.

[ ... ]

>>>       return qw1w2;
>>>   }
>>> @@ -413,10 +430,13 @@ static void xive_tctx_need_resend(XiveRouter *xrtr, XiveTCTX *tctx,
>>>           /* Reset the NVT value */
>>>           nvt.w4 = xive_set_field32(NVT_W4_IPB, nvt.w4, 0);
>>>           xive_router_write_nvt(xrtr, nvt_blk, nvt_idx, &nvt, 4);
>>> -
>>> -        /* Merge in current context */
>>> -        xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
>>>       }
>>> +    /*
>>> +     * Always merge in current context. Even if there's no escalation,
>>> +     * it will check with the IPB value restored before pushing the OS
>>> +     * context and set the External interrupt signal if needed.
>>> +     */
>>> +    xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
>>
>> That's another patch and I am not sure it is needed.
>>
>> An IPB value is recorded in the NVT when an interrupt is delivered
>> while a vCPU is not dispatched. With this change, you would force
>> the update in any case when the context is pushed.
>>
>> Is that to close a potential race window on an interrupt being
>> delivered to a vCPU just before it is dispatched by the HV ?
> 
> 
> Not quite, there's no race involved. Consider the following the scenario that I can hit fairly easily:
> 
> 1. an External interrupt is raised while the guest is on the CPU. We now have IPB!=0 and PIPR!=0
> 
> 2. the guest enters the interrupt handler but before it can ack the interrupt (special mmio read "TM_SPC_ACK_OS_REG"), another hypervisor interrupt is raised. For example the Hypervisor Decrementer (but I've also hit it with others). The hypervisor interrupt is delivered immediately in that case and forces a guest exit, so that the hypervisor can process the hypervisor interrupt. So we leave the guest and pull the OS context with IPB and PIPR!=0. 
>
> That's the state which is saved, either by KVM (P9)) 

yes.

> or by the hw (P10 with vp-save-restore).

yes.
  
> 3. Some time later, we enter the guest again and restore the interrupt state, either from KVM (P9) or automatically (P10). So we call:
> 
> xive_tm_push_os_ctx()
>    --> xive_tctx_need_resend()
> 
> In xive_tctx_need_resend(), we check the NVT to see if we had an escalation. If we do, then we update the IPB and PIPR and we are fine.
> But if we don't have an escalation, we don't call xive_tctx_ipb_update(). The IPB value will be correct because it was just restored, but we won't recompute the PIPR register.

Indeed.

> So it seems to me that irrespective of the separate issue of lowering/raising the External interrupt signal when pulling/pushing an OS context, we already need to always go through xive_tctx_ipb_update() to recompute the PIPR (and it's agreed that's a reason enough to be a separate patch).

I think you just wrote the commit log :)

> Now, if on top of that, we add the fact that we may need to raise the External interrupt signal when pushing an OS context, then calling xive_tctx_ipb_update() is our natural way of doing that. So that's a second reason where it helps.

I would be interested to know if you can start an emulated QEMU PowerNV
system (2cpus) with a KVM guest (1 vcpu) and sustain some network load
host<->guest with a ping -f for instance.

Thanks,

C.


> 
>>
>>>   }
>>>   /*
>>> diff --git a/hw/intc/xive2.c b/hw/intc/xive2.c
>>> index 3aff42a69e..b7f1917cd2 100644
>>> --- a/hw/intc/xive2.c
>>> +++ b/hw/intc/xive2.c
>>> @@ -269,6 +269,7 @@ uint64_t xive2_tm_pull_os_ctx(XivePresenter *xptr, XiveTCTX *tctx,
>>>           xive2_tctx_save_os_ctx(xrtr, tctx, nvp_blk, nvp_idx);
>>>       }
>>> +    xive_tctx_eval_ext_signal(tctx);
>>>       return qw1w2;
>>>   }
>>> @@ -316,7 +317,6 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx,
>>>   {
>>>       Xive2Nvp nvp;
>>>       uint8_t ipb;
>>> -    uint8_t cppr = 0;
>>>       /*
>>>        * Grab the associated thread interrupt context registers in the
>>> @@ -337,7 +337,7 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx,
>>>       /* Automatically restore thread context registers */
>>>       if (xive2_router_get_config(xrtr) & XIVE2_VP_SAVE_RESTORE &&
>>>           do_restore) {
>>> -        cppr = xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp);
>>> +        xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp);
>>>       }
>>>       ipb = xive_get_field32(NVP2_W2_IPB, nvp.w2);
>>> @@ -346,10 +346,12 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx,
>>>           xive2_router_write_nvp(xrtr, nvp_blk, nvp_idx, &nvp, 2);
>>>       }
>>> -    /* An IPB or CPPR change can trigger a resend */
>>> -    if (ipb || cppr) {
>>> -        xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
>>> -    }
>>> +    /*
>>> +     * Always merge in current context. Even if there's no escalation,
>>> +     * it will check with the IPB value restored and set the External
>>> +     * interrupt signal if needed.
>>> +     */
>>> +    xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb);
>>
>>
>> I guess this fine. The test on the cppr is just an optimisation, if I am
>> correct. But it needs to be in another patch.
>>
>> Same previous comment for ibp.
> 
> 
> The scenario I described above still stands on P10. In which case checking the CPPR value becomes a moot point if we always call xive_tctx_ipb_update(). And the compiler yells if it's unused.
> Thanks!
> 
>    Fred
> 
> 
>>
>> Thanks,
>>
>> C.
>>
>>>   }
>>>   /*
>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>> index 126e4e2c3a..91512ed5e6 100644
>>> --- a/include/hw/ppc/xive.h
>>> +++ b/include/hw/ppc/xive.h
>>> @@ -527,6 +527,7 @@ Object *xive_tctx_create(Object *cpu, XivePresenter *xptr, Error **errp);
>>>   void xive_tctx_reset(XiveTCTX *tctx);
>>>   void xive_tctx_destroy(XiveTCTX *tctx);
>>>   void xive_tctx_ipb_update(XiveTCTX *tctx, uint8_t ring, uint8_t ipb);
>>> +void xive_tctx_eval_ext_signal(XiveTCTX *tctx);
>>>   /*
>>>    * KVM XIVE device helpers
>>

Frederic Barrat April 28, 2022, 8:04 a.m. UTC | #5

On 26/04/2022 17:24, Cédric Le Goater wrote:
> 
> I would be interested to know if you can start an emulated QEMU PowerNV
> system (2cpus) with a KVM guest (1 vcpu) and sustain some network load
> host<->guest with a ping -f for instance.

We're not there yet unfortunately. It runs for a while until the powernv 
host hard lockups. Something else to debug...

   Fred

ppc/xive: Save/restore state of the External interrupt

Commit Message

Comments

Patch