Message ID | 20220426101125.297064-1-fbarrat@linux.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ppc/xive: Save/restore state of the External interrupt | expand |
Hello Frederic, On 4/26/22 12:11, Frederic Barrat wrote: > When pulling an OS context from a vcpu, we should lower the External > interrupt if it is pending. Otherwise, it may be delivered in the > hypervisor context, which is unexpected. It can lead to an infinite > loop where the hypervisor catches the External exception, looks for an > interrupt, doesn't find any, exits the exception handler, repeat... Nice ! I have been chasing this one for a while and couldn't corner it. Setting an environment for this scenario is a bit painful. I had a pseries disk image including a nested guest for it. Depending on the need, I would run the image under KVM (for updates) or under QEMU PowerNV for dev/tests. I thne switched to a simpler buildroot env. > It also means that when pushing a new OS context on a vcpu, we need to > always check the restored Interrupt Pending Buffer (IPB), potentially > merge it with any escalation interrupt, and re-apply the External > interrupt signal if needed. See below for a proposal to split this patch in 2 or 3 patches. > Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> > --- > > Cedric: I'm wondering the reason behind commit > 8256870ada9379abfd1f5b2c209ad01092dd0904, which makes the PIPR field > of the OS context read-only. The comments is related to direct load/store on the TCTX registers. When using the magic offsets, the logic is different. > The comment says it is re-evaluated from > the IPB when pushing a context, but I don't think it's true on P9 > if there's no escalation. It's not a problem on P10 because we always > re-evaluate the PIPR if CPPR>0. > In any case, it's no longer an issue with this patch, but I'm > curious as to why restoring the PIPR was a problem in the first place. > > > hw/intc/xive.c | 26 +++++++++++++++++++++++--- > hw/intc/xive2.c | 14 ++++++++------ > include/hw/ppc/xive.h | 1 + > 3 files changed, 32 insertions(+), 9 deletions(-) > > diff --git a/hw/intc/xive.c b/hw/intc/xive.c > index b8e4c7294d..8430db687f 100644 > --- a/hw/intc/xive.c > +++ b/hw/intc/xive.c > @@ -114,6 +114,21 @@ static void xive_tctx_notify(XiveTCTX *tctx, uint8_t ring) > } > } > > +void xive_tctx_eval_ext_signal(XiveTCTX *tctx) I would call it xive_tctx_reset_os_signal() > +{ > + uint8_t *regs = &tctx->regs[TM_QW1_OS]; > + > + /* > + * Used when pulling an OS context. > + * Lower the External interrupt if it's pending. It is necessary > + * to avoid catching it in the hypervisor context. It should be > + * raised again when re-pushing the OS context. > + */ > + if (regs[TM_PIPR] < regs[TM_CPPR]) { This seems useless. Since we want the signal down always. > + qemu_irq_lower(xive_tctx_output(tctx, TM_QW1_OS)); > + } > +} > + > static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr) > { > uint8_t *regs = &tctx->regs[ring]; > @@ -388,6 +403,8 @@ static uint64_t xive_tm_pull_os_ctx(XivePresenter *xptr, XiveTCTX *tctx, > /* Invalidate CAM line */ > qw1w2_new = xive_set_field32(TM_QW1W2_VO, qw1w2, 0); > xive_tctx_set_os_cam(tctx, qw1w2_new); > + > + xive_tctx_eval_ext_signal(tctx); These change forcing the signal down when the OS context is pulled should be in its own patch. > return qw1w2; > } > > @@ -413,10 +430,13 @@ static void xive_tctx_need_resend(XiveRouter *xrtr, XiveTCTX *tctx, > /* Reset the NVT value */ > nvt.w4 = xive_set_field32(NVT_W4_IPB, nvt.w4, 0); > xive_router_write_nvt(xrtr, nvt_blk, nvt_idx, &nvt, 4); > - > - /* Merge in current context */ > - xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); > } > + /* > + * Always merge in current context. Even if there's no escalation, > + * it will check with the IPB value restored before pushing the OS > + * context and set the External interrupt signal if needed. > + */ > + xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); That's another patch and I am not sure it is needed. An IPB value is recorded in the NVT when an interrupt is delivered while a vCPU is not dispatched. With this change, you would force the update in any case when the context is pushed. Is that to close a potential race window on an interrupt being delivered to a vCPU just before it is dispatched by the HV ? > } > > /* > diff --git a/hw/intc/xive2.c b/hw/intc/xive2.c > index 3aff42a69e..b7f1917cd2 100644 > --- a/hw/intc/xive2.c > +++ b/hw/intc/xive2.c > @@ -269,6 +269,7 @@ uint64_t xive2_tm_pull_os_ctx(XivePresenter *xptr, XiveTCTX *tctx, > xive2_tctx_save_os_ctx(xrtr, tctx, nvp_blk, nvp_idx); > } > > + xive_tctx_eval_ext_signal(tctx); > return qw1w2; > } > > @@ -316,7 +317,6 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx, > { > Xive2Nvp nvp; > uint8_t ipb; > - uint8_t cppr = 0; > > /* > * Grab the associated thread interrupt context registers in the > @@ -337,7 +337,7 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx, > /* Automatically restore thread context registers */ > if (xive2_router_get_config(xrtr) & XIVE2_VP_SAVE_RESTORE && > do_restore) { > - cppr = xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp); > + xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp); > } > > ipb = xive_get_field32(NVP2_W2_IPB, nvp.w2); > @@ -346,10 +346,12 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx, > xive2_router_write_nvp(xrtr, nvp_blk, nvp_idx, &nvp, 2); > } > > - /* An IPB or CPPR change can trigger a resend */ > - if (ipb || cppr) { > - xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); > - } > + /* > + * Always merge in current context. Even if there's no escalation, > + * it will check with the IPB value restored and set the External > + * interrupt signal if needed. > + */ > + xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); I guess this fine. The test on the cppr is just an optimisation, if I am correct. But it needs to be in another patch. Same previous comment for ibp. Thanks, C. > } > > /* > diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h > index 126e4e2c3a..91512ed5e6 100644 > --- a/include/hw/ppc/xive.h > +++ b/include/hw/ppc/xive.h > @@ -527,6 +527,7 @@ Object *xive_tctx_create(Object *cpu, XivePresenter *xptr, Error **errp); > void xive_tctx_reset(XiveTCTX *tctx); > void xive_tctx_destroy(XiveTCTX *tctx); > void xive_tctx_ipb_update(XiveTCTX *tctx, uint8_t ring, uint8_t ipb); > +void xive_tctx_eval_ext_signal(XiveTCTX *tctx); > > /* > * KVM XIVE device helpers
On 4/26/22 12:11, Frederic Barrat wrote: > When pulling an OS context from a vcpu, we should lower the External > interrupt if it is pending. Otherwise, it may be delivered in the > hypervisor context, which is unexpected. It can lead to an infinite > loop where the hypervisor catches the External exception, looks for an > interrupt, doesn't find any, exits the exception handler, repeat... > > It also means that when pushing a new OS context on a vcpu, we need to > always check the restored Interrupt Pending Buffer (IPB), potentially > merge it with any escalation interrupt, and re-apply the External > interrupt signal if needed. > > Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> > --- > > Cedric: I'm wondering the reason behind commit > 8256870ada9379abfd1f5b2c209ad01092dd0904, which makes the PIPR field > of the OS context read-only. The comment says it is re-evaluated from > the IPB when pushing a context, but I don't think it's true on P9 > if there's no escalation. It's not a problem on P10 because we always > re-evaluate the PIPR if CPPR>0. > In any case, it's no longer an issue with this patch, but I'm > curious as to why restoring the PIPR was a problem in the first place. > > > hw/intc/xive.c | 26 +++++++++++++++++++++++--- > hw/intc/xive2.c | 14 ++++++++------ > include/hw/ppc/xive.h | 1 + > 3 files changed, 32 insertions(+), 9 deletions(-) > > diff --git a/hw/intc/xive.c b/hw/intc/xive.c > index b8e4c7294d..8430db687f 100644 > --- a/hw/intc/xive.c > +++ b/hw/intc/xive.c > @@ -114,6 +114,21 @@ static void xive_tctx_notify(XiveTCTX *tctx, uint8_t ring) > } > } > > +void xive_tctx_eval_ext_signal(XiveTCTX *tctx) > +{ > + uint8_t *regs = &tctx->regs[TM_QW1_OS]; > + > + /* > + * Used when pulling an OS context. > + * Lower the External interrupt if it's pending. It is necessary > + * to avoid catching it in the hypervisor context. It should be > + * raised again when re-pushing the OS context. > + */ > + if (regs[TM_PIPR] < regs[TM_CPPR]) { > + qemu_irq_lower(xive_tctx_output(tctx, TM_QW1_OS)); > + } > +} > + > static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr) > { > uint8_t *regs = &tctx->regs[ring]; > @@ -388,6 +403,8 @@ static uint64_t xive_tm_pull_os_ctx(XivePresenter *xptr, XiveTCTX *tctx, > /* Invalidate CAM line */ > qw1w2_new = xive_set_field32(TM_QW1W2_VO, qw1w2, 0); > xive_tctx_set_os_cam(tctx, qw1w2_new); > + > + xive_tctx_eval_ext_signal(tctx); > return qw1w2; > } > > @@ -413,10 +430,13 @@ static void xive_tctx_need_resend(XiveRouter *xrtr, XiveTCTX *tctx, xive_tctx_need_resend() is restarting the event notification sequence that was previously initiated in : xive_router_end_notify() xive_presenter_notify() and postponed because the target was not found. We could rename it to : xive_presenter_notify_complete/finish() or find a way to share the same code with a boolean to skip the matching part. to be discussed. Thanks, C. > /* Reset the NVT value */ > nvt.w4 = xive_set_field32(NVT_W4_IPB, nvt.w4, 0); > xive_router_write_nvt(xrtr, nvt_blk, nvt_idx, &nvt, 4); > - > - /* Merge in current context */ > - xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); > } > + /* > + * Always merge in current context. Even if there's no escalation, > + * it will check with the IPB value restored before pushing the OS > + * context and set the External interrupt signal if needed. > + */ > + xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); > } > > /* > diff --git a/hw/intc/xive2.c b/hw/intc/xive2.c > index 3aff42a69e..b7f1917cd2 100644 > --- a/hw/intc/xive2.c > +++ b/hw/intc/xive2.c > @@ -269,6 +269,7 @@ uint64_t xive2_tm_pull_os_ctx(XivePresenter *xptr, XiveTCTX *tctx, > xive2_tctx_save_os_ctx(xrtr, tctx, nvp_blk, nvp_idx); > } > > + xive_tctx_eval_ext_signal(tctx); > return qw1w2; > } > > @@ -316,7 +317,6 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx, > { > Xive2Nvp nvp; > uint8_t ipb; > - uint8_t cppr = 0; > > /* > * Grab the associated thread interrupt context registers in the > @@ -337,7 +337,7 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx, > /* Automatically restore thread context registers */ > if (xive2_router_get_config(xrtr) & XIVE2_VP_SAVE_RESTORE && > do_restore) { > - cppr = xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp); > + xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp); > } > > ipb = xive_get_field32(NVP2_W2_IPB, nvp.w2); > @@ -346,10 +346,12 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx, > xive2_router_write_nvp(xrtr, nvp_blk, nvp_idx, &nvp, 2); > } > > - /* An IPB or CPPR change can trigger a resend */ > - if (ipb || cppr) { > - xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); > - } > + /* > + * Always merge in current context. Even if there's no escalation, > + * it will check with the IPB value restored and set the External > + * interrupt signal if needed. > + */ > + xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); > } > > /* > diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h > index 126e4e2c3a..91512ed5e6 100644 > --- a/include/hw/ppc/xive.h > +++ b/include/hw/ppc/xive.h > @@ -527,6 +527,7 @@ Object *xive_tctx_create(Object *cpu, XivePresenter *xptr, Error **errp); > void xive_tctx_reset(XiveTCTX *tctx); > void xive_tctx_destroy(XiveTCTX *tctx); > void xive_tctx_ipb_update(XiveTCTX *tctx, uint8_t ring, uint8_t ipb); > +void xive_tctx_eval_ext_signal(XiveTCTX *tctx); > > /* > * KVM XIVE device helpers
Hello Cédric! Thanks for the detail review! On 26/04/2022 14:35, Cédric Le Goater wrote: > Hello Frederic, > > On 4/26/22 12:11, Frederic Barrat wrote: >> When pulling an OS context from a vcpu, we should lower the External >> interrupt if it is pending. Otherwise, it may be delivered in the >> hypervisor context, which is unexpected. It can lead to an infinite >> loop where the hypervisor catches the External exception, looks for an >> interrupt, doesn't find any, exits the exception handler, repeat... > > Nice ! I have been chasing this one for a while and couldn't corner it. > > Setting an environment for this scenario is a bit painful. I had a > pseries disk image including a nested guest for it. Depending on the > need, I would run the image under KVM (for updates) or under QEMU > PowerNV for dev/tests. > > I thne switched to a simpler buildroot env. > >> It also means that when pushing a new OS context on a vcpu, we need to >> always check the restored Interrupt Pending Buffer (IPB), potentially >> merge it with any escalation interrupt, and re-apply the External >> interrupt signal if needed. > > See below for a proposal to split this patch in 2 or 3 patches. Agreed, I will resend splitting it. >> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> >> --- >> >> Cedric: I'm wondering the reason behind commit >> 8256870ada9379abfd1f5b2c209ad01092dd0904, which makes the PIPR field >> of the OS context read-only. > > The comments is related to direct load/store on the TCTX registers. > When using the magic offsets, the logic is different. OK and to summarize our discussion on this: the PIPR field is read-only because that is what the hardware says. Which means we need to make sure we *always* re-compute the PIPR when pushing an OS context, since it is not restored the same way as the other registers (mostly CPPR, IPB). More on that below. >> The comment says it is re-evaluated from >> the IPB when pushing a context, but I don't think it's true on P9 >> if there's no escalation. It's not a problem on P10 because we always >> re-evaluate the PIPR if CPPR>0. >> In any case, it's no longer an issue with this patch, but I'm >> curious as to why restoring the PIPR was a problem in the first place. >> >> >> hw/intc/xive.c | 26 +++++++++++++++++++++++--- >> hw/intc/xive2.c | 14 ++++++++------ >> include/hw/ppc/xive.h | 1 + >> 3 files changed, 32 insertions(+), 9 deletions(-) >> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c >> index b8e4c7294d..8430db687f 100644 >> --- a/hw/intc/xive.c >> +++ b/hw/intc/xive.c >> @@ -114,6 +114,21 @@ static void xive_tctx_notify(XiveTCTX *tctx, >> uint8_t ring) >> } >> } >> +void xive_tctx_eval_ext_signal(XiveTCTX *tctx) > > I would call it xive_tctx_reset_os_signal() OK >> +{ >> + uint8_t *regs = &tctx->regs[TM_QW1_OS]; >> + >> + /* >> + * Used when pulling an OS context. >> + * Lower the External interrupt if it's pending. It is necessary >> + * to avoid catching it in the hypervisor context. It should be >> + * raised again when re-pushing the OS context. >> + */ >> + if (regs[TM_PIPR] < regs[TM_CPPR]) { > > This seems useless. Since we want the signal down always. Agreed the test is ugly though it works and avoid calling qemu_irq_lower() when it's not needed. I'll check whether qemu offers a way to test the state of a signal (which is what we really want) otherwise I'll always lower the signal unconditionnaly. >> + qemu_irq_lower(xive_tctx_output(tctx, TM_QW1_OS)); >> + } >> +} >> + >> static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t >> cppr) >> { >> uint8_t *regs = &tctx->regs[ring]; >> @@ -388,6 +403,8 @@ static uint64_t xive_tm_pull_os_ctx(XivePresenter >> *xptr, XiveTCTX *tctx, >> /* Invalidate CAM line */ >> qw1w2_new = xive_set_field32(TM_QW1W2_VO, qw1w2, 0); >> xive_tctx_set_os_cam(tctx, qw1w2_new); >> + >> + xive_tctx_eval_ext_signal(tctx); > > > These change forcing the signal down when the OS context is pulled > should be in its own patch. OK >> return qw1w2; >> } >> @@ -413,10 +430,13 @@ static void xive_tctx_need_resend(XiveRouter >> *xrtr, XiveTCTX *tctx, >> /* Reset the NVT value */ >> nvt.w4 = xive_set_field32(NVT_W4_IPB, nvt.w4, 0); >> xive_router_write_nvt(xrtr, nvt_blk, nvt_idx, &nvt, 4); >> - >> - /* Merge in current context */ >> - xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); >> } >> + /* >> + * Always merge in current context. Even if there's no escalation, >> + * it will check with the IPB value restored before pushing the OS >> + * context and set the External interrupt signal if needed. >> + */ >> + xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); > > That's another patch and I am not sure it is needed. > > An IPB value is recorded in the NVT when an interrupt is delivered > while a vCPU is not dispatched. With this change, you would force > the update in any case when the context is pushed. > > Is that to close a potential race window on an interrupt being > delivered to a vCPU just before it is dispatched by the HV ? Not quite, there's no race involved. Consider the following the scenario that I can hit fairly easily: 1. an External interrupt is raised while the guest is on the CPU. We now have IPB!=0 and PIPR!=0 2. the guest enters the interrupt handler but before it can ack the interrupt (special mmio read "TM_SPC_ACK_OS_REG"), another hypervisor interrupt is raised. For example the Hypervisor Decrementer (but I've also hit it with others). The hypervisor interrupt is delivered immediately in that case and forces a guest exit, so that the hypervisor can process the hypervisor interrupt. So we leave the guest and pull the OS context with IPB and PIPR!=0. That's the state which is saved, either by KVM (P9)) or by the hw (P10 with vp-save-restore). 3. Some time later, we enter the guest again and restore the interrupt state, either from KVM (P9) or automatically (P10). So we call: xive_tm_push_os_ctx() --> xive_tctx_need_resend() In xive_tctx_need_resend(), we check the NVT to see if we had an escalation. If we do, then we update the IPB and PIPR and we are fine. But if we don't have an escalation, we don't call xive_tctx_ipb_update(). The IPB value will be correct because it was just restored, but we won't recompute the PIPR register. So it seems to me that irrespective of the separate issue of lowering/raising the External interrupt signal when pulling/pushing an OS context, we already need to always go through xive_tctx_ipb_update() to recompute the PIPR (and it's agreed that's a reason enough to be a separate patch). Now, if on top of that, we add the fact that we may need to raise the External interrupt signal when pushing an OS context, then calling xive_tctx_ipb_update() is our natural way of doing that. So that's a second reason where it helps. > >> } >> /* >> diff --git a/hw/intc/xive2.c b/hw/intc/xive2.c >> index 3aff42a69e..b7f1917cd2 100644 >> --- a/hw/intc/xive2.c >> +++ b/hw/intc/xive2.c >> @@ -269,6 +269,7 @@ uint64_t xive2_tm_pull_os_ctx(XivePresenter *xptr, >> XiveTCTX *tctx, >> xive2_tctx_save_os_ctx(xrtr, tctx, nvp_blk, nvp_idx); >> } >> + xive_tctx_eval_ext_signal(tctx); >> return qw1w2; >> } >> @@ -316,7 +317,6 @@ static void xive2_tctx_need_resend(Xive2Router >> *xrtr, XiveTCTX *tctx, >> { >> Xive2Nvp nvp; >> uint8_t ipb; >> - uint8_t cppr = 0; >> /* >> * Grab the associated thread interrupt context registers in the >> @@ -337,7 +337,7 @@ static void xive2_tctx_need_resend(Xive2Router >> *xrtr, XiveTCTX *tctx, >> /* Automatically restore thread context registers */ >> if (xive2_router_get_config(xrtr) & XIVE2_VP_SAVE_RESTORE && >> do_restore) { >> - cppr = xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, >> nvp_idx, &nvp); >> + xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp); >> } >> ipb = xive_get_field32(NVP2_W2_IPB, nvp.w2); >> @@ -346,10 +346,12 @@ static void xive2_tctx_need_resend(Xive2Router >> *xrtr, XiveTCTX *tctx, >> xive2_router_write_nvp(xrtr, nvp_blk, nvp_idx, &nvp, 2); >> } >> - /* An IPB or CPPR change can trigger a resend */ >> - if (ipb || cppr) { >> - xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); >> - } >> + /* >> + * Always merge in current context. Even if there's no escalation, >> + * it will check with the IPB value restored and set the External >> + * interrupt signal if needed. >> + */ >> + xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); > > > I guess this fine. The test on the cppr is just an optimisation, if I am > correct. But it needs to be in another patch. > > Same previous comment for ibp. The scenario I described above still stands on P10. In which case checking the CPPR value becomes a moot point if we always call xive_tctx_ipb_update(). And the compiler yells if it's unused. Thanks! Fred > > Thanks, > > C. > >> } >> /* >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h >> index 126e4e2c3a..91512ed5e6 100644 >> --- a/include/hw/ppc/xive.h >> +++ b/include/hw/ppc/xive.h >> @@ -527,6 +527,7 @@ Object *xive_tctx_create(Object *cpu, >> XivePresenter *xptr, Error **errp); >> void xive_tctx_reset(XiveTCTX *tctx); >> void xive_tctx_destroy(XiveTCTX *tctx); >> void xive_tctx_ipb_update(XiveTCTX *tctx, uint8_t ring, uint8_t ipb); >> +void xive_tctx_eval_ext_signal(XiveTCTX *tctx); >> /* >> * KVM XIVE device helpers >
>>> +{ >>> + uint8_t *regs = &tctx->regs[TM_QW1_OS]; >>> + >>> + /* >>> + * Used when pulling an OS context. >>> + * Lower the External interrupt if it's pending. It is necessary >>> + * to avoid catching it in the hypervisor context. It should be >>> + * raised again when re-pushing the OS context. >>> + */ >>> + if (regs[TM_PIPR] < regs[TM_CPPR]) { >> >> This seems useless. Since we want the signal down always. > > Agreed the test is ugly though it works and avoid calling qemu_irq_lower() when it's not needed. I'll check whether qemu offers a way to test the state of a signal (which is what we really want) otherwise I'll always lower the signal unconditionnaly. Not really, it's a function call. or test the NSR may be ? Just lower it, I guess. I don't think this will impact much performance. [ ... ] >>> return qw1w2; >>> } >>> @@ -413,10 +430,13 @@ static void xive_tctx_need_resend(XiveRouter *xrtr, XiveTCTX *tctx, >>> /* Reset the NVT value */ >>> nvt.w4 = xive_set_field32(NVT_W4_IPB, nvt.w4, 0); >>> xive_router_write_nvt(xrtr, nvt_blk, nvt_idx, &nvt, 4); >>> - >>> - /* Merge in current context */ >>> - xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); >>> } >>> + /* >>> + * Always merge in current context. Even if there's no escalation, >>> + * it will check with the IPB value restored before pushing the OS >>> + * context and set the External interrupt signal if needed. >>> + */ >>> + xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); >> >> That's another patch and I am not sure it is needed. >> >> An IPB value is recorded in the NVT when an interrupt is delivered >> while a vCPU is not dispatched. With this change, you would force >> the update in any case when the context is pushed. >> >> Is that to close a potential race window on an interrupt being >> delivered to a vCPU just before it is dispatched by the HV ? > > > Not quite, there's no race involved. Consider the following the scenario that I can hit fairly easily: > > 1. an External interrupt is raised while the guest is on the CPU. We now have IPB!=0 and PIPR!=0 > > 2. the guest enters the interrupt handler but before it can ack the interrupt (special mmio read "TM_SPC_ACK_OS_REG"), another hypervisor interrupt is raised. For example the Hypervisor Decrementer (but I've also hit it with others). The hypervisor interrupt is delivered immediately in that case and forces a guest exit, so that the hypervisor can process the hypervisor interrupt. So we leave the guest and pull the OS context with IPB and PIPR!=0. > > That's the state which is saved, either by KVM (P9)) yes. > or by the hw (P10 with vp-save-restore). yes. > 3. Some time later, we enter the guest again and restore the interrupt state, either from KVM (P9) or automatically (P10). So we call: > > xive_tm_push_os_ctx() > --> xive_tctx_need_resend() > > In xive_tctx_need_resend(), we check the NVT to see if we had an escalation. If we do, then we update the IPB and PIPR and we are fine. > But if we don't have an escalation, we don't call xive_tctx_ipb_update(). The IPB value will be correct because it was just restored, but we won't recompute the PIPR register. Indeed. > So it seems to me that irrespective of the separate issue of lowering/raising the External interrupt signal when pulling/pushing an OS context, we already need to always go through xive_tctx_ipb_update() to recompute the PIPR (and it's agreed that's a reason enough to be a separate patch). I think you just wrote the commit log :) > Now, if on top of that, we add the fact that we may need to raise the External interrupt signal when pushing an OS context, then calling xive_tctx_ipb_update() is our natural way of doing that. So that's a second reason where it helps. I would be interested to know if you can start an emulated QEMU PowerNV system (2cpus) with a KVM guest (1 vcpu) and sustain some network load host<->guest with a ping -f for instance. Thanks, C. > >> >>> } >>> /* >>> diff --git a/hw/intc/xive2.c b/hw/intc/xive2.c >>> index 3aff42a69e..b7f1917cd2 100644 >>> --- a/hw/intc/xive2.c >>> +++ b/hw/intc/xive2.c >>> @@ -269,6 +269,7 @@ uint64_t xive2_tm_pull_os_ctx(XivePresenter *xptr, XiveTCTX *tctx, >>> xive2_tctx_save_os_ctx(xrtr, tctx, nvp_blk, nvp_idx); >>> } >>> + xive_tctx_eval_ext_signal(tctx); >>> return qw1w2; >>> } >>> @@ -316,7 +317,6 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx, >>> { >>> Xive2Nvp nvp; >>> uint8_t ipb; >>> - uint8_t cppr = 0; >>> /* >>> * Grab the associated thread interrupt context registers in the >>> @@ -337,7 +337,7 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx, >>> /* Automatically restore thread context registers */ >>> if (xive2_router_get_config(xrtr) & XIVE2_VP_SAVE_RESTORE && >>> do_restore) { >>> - cppr = xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp); >>> + xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp); >>> } >>> ipb = xive_get_field32(NVP2_W2_IPB, nvp.w2); >>> @@ -346,10 +346,12 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx, >>> xive2_router_write_nvp(xrtr, nvp_blk, nvp_idx, &nvp, 2); >>> } >>> - /* An IPB or CPPR change can trigger a resend */ >>> - if (ipb || cppr) { >>> - xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); >>> - } >>> + /* >>> + * Always merge in current context. Even if there's no escalation, >>> + * it will check with the IPB value restored and set the External >>> + * interrupt signal if needed. >>> + */ >>> + xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); >> >> >> I guess this fine. The test on the cppr is just an optimisation, if I am >> correct. But it needs to be in another patch. >> >> Same previous comment for ibp. > > > The scenario I described above still stands on P10. In which case checking the CPPR value becomes a moot point if we always call xive_tctx_ipb_update(). And the compiler yells if it's unused. > Thanks! > > Fred > > >> >> Thanks, >> >> C. >> >>> } >>> /* >>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h >>> index 126e4e2c3a..91512ed5e6 100644 >>> --- a/include/hw/ppc/xive.h >>> +++ b/include/hw/ppc/xive.h >>> @@ -527,6 +527,7 @@ Object *xive_tctx_create(Object *cpu, XivePresenter *xptr, Error **errp); >>> void xive_tctx_reset(XiveTCTX *tctx); >>> void xive_tctx_destroy(XiveTCTX *tctx); >>> void xive_tctx_ipb_update(XiveTCTX *tctx, uint8_t ring, uint8_t ipb); >>> +void xive_tctx_eval_ext_signal(XiveTCTX *tctx); >>> /* >>> * KVM XIVE device helpers >>
On 26/04/2022 17:24, Cédric Le Goater wrote: > > I would be interested to know if you can start an emulated QEMU PowerNV > system (2cpus) with a KVM guest (1 vcpu) and sustain some network load > host<->guest with a ping -f for instance. We're not there yet unfortunately. It runs for a while until the powernv host hard lockups. Something else to debug... Fred
diff --git a/hw/intc/xive.c b/hw/intc/xive.c index b8e4c7294d..8430db687f 100644 --- a/hw/intc/xive.c +++ b/hw/intc/xive.c @@ -114,6 +114,21 @@ static void xive_tctx_notify(XiveTCTX *tctx, uint8_t ring) } } +void xive_tctx_eval_ext_signal(XiveTCTX *tctx) +{ + uint8_t *regs = &tctx->regs[TM_QW1_OS]; + + /* + * Used when pulling an OS context. + * Lower the External interrupt if it's pending. It is necessary + * to avoid catching it in the hypervisor context. It should be + * raised again when re-pushing the OS context. + */ + if (regs[TM_PIPR] < regs[TM_CPPR]) { + qemu_irq_lower(xive_tctx_output(tctx, TM_QW1_OS)); + } +} + static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr) { uint8_t *regs = &tctx->regs[ring]; @@ -388,6 +403,8 @@ static uint64_t xive_tm_pull_os_ctx(XivePresenter *xptr, XiveTCTX *tctx, /* Invalidate CAM line */ qw1w2_new = xive_set_field32(TM_QW1W2_VO, qw1w2, 0); xive_tctx_set_os_cam(tctx, qw1w2_new); + + xive_tctx_eval_ext_signal(tctx); return qw1w2; } @@ -413,10 +430,13 @@ static void xive_tctx_need_resend(XiveRouter *xrtr, XiveTCTX *tctx, /* Reset the NVT value */ nvt.w4 = xive_set_field32(NVT_W4_IPB, nvt.w4, 0); xive_router_write_nvt(xrtr, nvt_blk, nvt_idx, &nvt, 4); - - /* Merge in current context */ - xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); } + /* + * Always merge in current context. Even if there's no escalation, + * it will check with the IPB value restored before pushing the OS + * context and set the External interrupt signal if needed. + */ + xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); } /* diff --git a/hw/intc/xive2.c b/hw/intc/xive2.c index 3aff42a69e..b7f1917cd2 100644 --- a/hw/intc/xive2.c +++ b/hw/intc/xive2.c @@ -269,6 +269,7 @@ uint64_t xive2_tm_pull_os_ctx(XivePresenter *xptr, XiveTCTX *tctx, xive2_tctx_save_os_ctx(xrtr, tctx, nvp_blk, nvp_idx); } + xive_tctx_eval_ext_signal(tctx); return qw1w2; } @@ -316,7 +317,6 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx, { Xive2Nvp nvp; uint8_t ipb; - uint8_t cppr = 0; /* * Grab the associated thread interrupt context registers in the @@ -337,7 +337,7 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx, /* Automatically restore thread context registers */ if (xive2_router_get_config(xrtr) & XIVE2_VP_SAVE_RESTORE && do_restore) { - cppr = xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp); + xive2_tctx_restore_os_ctx(xrtr, tctx, nvp_blk, nvp_idx, &nvp); } ipb = xive_get_field32(NVP2_W2_IPB, nvp.w2); @@ -346,10 +346,12 @@ static void xive2_tctx_need_resend(Xive2Router *xrtr, XiveTCTX *tctx, xive2_router_write_nvp(xrtr, nvp_blk, nvp_idx, &nvp, 2); } - /* An IPB or CPPR change can trigger a resend */ - if (ipb || cppr) { - xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); - } + /* + * Always merge in current context. Even if there's no escalation, + * it will check with the IPB value restored and set the External + * interrupt signal if needed. + */ + xive_tctx_ipb_update(tctx, TM_QW1_OS, ipb); } /* diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h index 126e4e2c3a..91512ed5e6 100644 --- a/include/hw/ppc/xive.h +++ b/include/hw/ppc/xive.h @@ -527,6 +527,7 @@ Object *xive_tctx_create(Object *cpu, XivePresenter *xptr, Error **errp); void xive_tctx_reset(XiveTCTX *tctx); void xive_tctx_destroy(XiveTCTX *tctx); void xive_tctx_ipb_update(XiveTCTX *tctx, uint8_t ring, uint8_t ipb); +void xive_tctx_eval_ext_signal(XiveTCTX *tctx); /* * KVM XIVE device helpers
When pulling an OS context from a vcpu, we should lower the External interrupt if it is pending. Otherwise, it may be delivered in the hypervisor context, which is unexpected. It can lead to an infinite loop where the hypervisor catches the External exception, looks for an interrupt, doesn't find any, exits the exception handler, repeat... It also means that when pushing a new OS context on a vcpu, we need to always check the restored Interrupt Pending Buffer (IPB), potentially merge it with any escalation interrupt, and re-apply the External interrupt signal if needed. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> --- Cedric: I'm wondering the reason behind commit 8256870ada9379abfd1f5b2c209ad01092dd0904, which makes the PIPR field of the OS context read-only. The comment says it is re-evaluated from the IPB when pushing a context, but I don't think it's true on P9 if there's no escalation. It's not a problem on P10 because we always re-evaluate the PIPR if CPPR>0. In any case, it's no longer an issue with this patch, but I'm curious as to why restoring the PIPR was a problem in the first place. hw/intc/xive.c | 26 +++++++++++++++++++++++--- hw/intc/xive2.c | 14 ++++++++------ include/hw/ppc/xive.h | 1 + 3 files changed, 32 insertions(+), 9 deletions(-)