Message ID: 1344033008.24037.67.camel@pasglop (mailing list archive)
State: New, archived
On Sat, Aug 04, 2012 at 08:30:08AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2012-08-03 at 14:41 -0300, Marcelo Tosatti wrote:
> > > Hrm, except that doing KVM_RUN with a signal is very cumbersome to do
> > > and I couldn't quite find the logic in qemu to do it ... but I might
> > > just have missed it. I can see indeed that in the migration case you
> > > want to actually complete the operation rather than just "abort it".
> > >
> > > Any chance you can point me to the code that performs that trick qemu
> > > side for migration?
> >
> > kvm-all.c:
> >
> >     kvm_arch_pre_run(env, run);
> >     if (env->exit_request) {
> >         DPRINTF("interrupt exit requested\n");
> >         /*
> >          * KVM requires us to reenter the kernel after IO exits to
> >          * complete instruction emulation. This self-signal will ensure
> >          * that we leave ASAP again.
> >          */
> >         qemu_cpu_kick_self();
> >     }
> >
> > > Anthony seems to think that for reset we can just abort the operation
> > > state in the kernel when the MP state changes.
>
> Ok, I see now. This is scary, really... set a flag somewhere, pray that
> we are in the right part of the loop, or elsewhere... oh well.
>
> In fact there are some fun (though harmless) bits to be found; look at
> x86 for example:
>
>     if (env->exception_injected == EXCP08_DBLE) {
>         /* this means triple fault */
>         qemu_system_reset_request();
>         env->exit_request = 1;
>         return 0;
>     }
>
> Well, qemu_system_reset_request() calls cpu_stop_current(), which calls
> cpu_exit(), which also does:
>
>     env->exit_request = 1;
>
> So obviously it will be well set by that time :-)
>
> Now I can see how having it set during kvm_arch_process_async_events()
> will work, as this is called right before the run loop. Similarly,
> EXIT_MMIO and EXIT_IO would work, so they are a non-issue.
>
> Our problem with PAPR doing resets via hcalls, I suspect, is that our
> kvm_arch_handle_exit() does:
>
>     case KVM_EXIT_PAPR_HCALL:
>         dprintf("handle PAPR hypercall\n");
>         run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
>                                               run->papr_hcall.args);
>         ret = 1;
>         break;
>
> See the ret = 1 here? That means that the caller (the kvm_cpu_exec()
> main loop in kvm-all.c) will exit immediately upon return. If we had
> instead returned 0, it would have looped, seen the exit_request set by
> qemu_system_reset_request(), which would have then done the signal + run
> trick, completed the hypercall, and returned this time in a clean state.
>
> So it seems (I don't have the machine at hand to test right now) that by
> just changing that ret = 1 to ret = 0, we might fix our problem,
> including the case where another vcpu is taking a hypercall at the time
> of the reset (that other vcpu will also need to complete the operation
> before stopping).
>
> David, is there any reason why you did ret = 1 to begin with? For
> things like reset etc., the exit will happen as a result of
> env->exit_request being set by cpu_exit().

Erm, there might have been, but I don't remember what it was.

> Are there other cases where you wish the hcall to go back to the main
> loop? In that case, shouldn't it set env->exit_request rather than
> returning 1 at that point? (H_CEDE for example?)

So, I'm still trying to nut out the implications for H_CEDE, and think
if there are any other hypercalls that might want to block the guest
for a time. We were considering blocking H_PUT_TCE if qemu devices
had active dma maps on the previously mapped iovas. I'm not sure if
the discussions that led to the inclusion of the qemu IOMMU code
decided that was wholly unnecessary or just not necessary for the
time being.

> Paul, David, I don't have time to test that before Tuesday (I'm away on
> Monday) but if you have a chance, revert the kernel change we did and
> try this qemu patch for reset:
>
> --- a/target-ppc/kvm.c
> +++ b/target-ppc/kvm.c
> @@ -766,7 +766,7 @@ int kvm_arch_handle_exit(CPUPPCState *env, struct kvm_run *r
>          dprintf("handle PAPR hypercall\n");
>          run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
>                                                run->papr_hcall.args);
> -        ret = 1;
> +        ret = 0;
>          break;
>  #endif
>      default:

So, usually, I think ret = 0 is correct: it lets us handle the hypercall
and get back to the guest without further fuss. I think right now
ret = 0 is always correct, since H_CEDE is always handled in kernel.
But if we ever have qemu-implemented hypercalls which want to "block",
then we might need a facility to do something else. Maybe. Like I say,
I haven't convinced myself I've thought of all the implications yet.

> It might also need something like:
>
> diff --git a/hw/spapr_hcall.c b/hw/spapr_hcall.c
> index a5990a9..d4d7fb0 100644
> --- a/hw/spapr_hcall.c
> +++ b/hw/spapr_hcall.c
> @@ -545,6 +545,7 @@ static target_ulong h_cede(CPUPPCState *env, sPAPREnvironmen
>      hreg_compute_hflags(env);
>      if (!cpu_has_work(env)) {
>          env->halted = 1;
> +        env->exit_request = 1;
>      }
>      return H_SUCCESS;
>  }
>
> Though I don't think H_CEDE ever goes down to qemu, does it?

No, though it might be worth doing that anyway, just in case we ever
have some whacky case where the H_CEDE does get handed on from the
kernel.

Ok, I'll send a patch and we can worry about the blocking qemu
hypercall case later, since it doesn't exist now.
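For readers following along, the ret semantics under discussion come from the kvm_cpu_exec() loop in qemu's kvm-all.c. A simplified sketch of that loop, paraphrased from the QEMU of that era (names and structure approximate, not a verbatim copy), shows why ret = 1 skips the exit_request check while ret = 0 loops back to it:

    /*
     * Simplified sketch of qemu's kvm_cpu_exec() loop (kvm-all.c, circa
     * 2012), paraphrased -- names and structure are approximate, not a
     * verbatim copy.
     */
    int kvm_cpu_exec(CPUArchState *env)
    {
        struct kvm_run *run = env->kvm_run;
        int ret;

        do {
            kvm_arch_pre_run(env, run);
            if (env->exit_request) {
                /* Self-signal so the next KVM_RUN returns as soon as any
                 * pending in-kernel emulation has completed. */
                qemu_cpu_kick_self();
            }

            kvm_vcpu_ioctl(env, KVM_RUN, 0);    /* enter the guest */
            kvm_arch_post_run(env, run);

            switch (run->exit_reason) {
            case KVM_EXIT_IO:
            case KVM_EXIT_MMIO:
                /* Handled here; ret = 0 loops again, re-checking
                 * env->exit_request above before re-entering the guest. */
                ret = 0;
                break;
            default:
                /* Arch hook. Returning 1 here leaves the loop immediately,
                 * skipping the exit_request check -- the behaviour Ben
                 * identifies as the problem for hcall-driven reset. */
                ret = kvm_arch_handle_exit(env, run);
                break;
            }
        } while (ret == 0);

        env->exit_request = 0;
        return ret;
    }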
On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> So, I'm still trying to nut out the implications for H_CEDE, and think
> if there are any other hypercalls that might want to block the guest
> for a time. We were considering blocking H_PUT_TCE if qemu devices
> had active dma maps on the previously mapped iovas. I'm not sure if
> the discussions that led to the inclusion of the qemu IOMMU code
> decided that was wholly unnecessary or just not necessary for the
> time being.

For "sleeping hcalls", they will simply have to set exit_request to
complete the hcall from the kernel perspective, leaving us in a state
where the kernel is about to restart at srr0 + 4, along with some other
flag (stop or halt) to actually freeze the vcpu.

If such an "async" hcall decides to return an error, it can then set
gpr3 directly using ioctls before restarting the vcpu.

Cheers,
Ben.
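A minimal sketch of the error-return path Ben describes, using the raw KVM register ioctls (illustrative only: fail_async_hcall() is an invented name, and H_HARDWARE is just one example of a PAPR error code a caller might pass; inside qemu one would normally go through kvm_arch_put_registers() rather than raw ioctls):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /*
     * Illustrative sketch only: complete a "sleeping" hcall with an error
     * return. By this point the kernel has already advanced the vcpu past
     * the hcall (srr0 + 4); we poke the return value into r3 and let the
     * vcpu run again.
     */
    static int fail_async_hcall(int vcpu_fd, unsigned long error)
    {
        struct kvm_regs regs;

        if (ioctl(vcpu_fd, KVM_GET_REGS, &regs) < 0)
            return -1;
        regs.gpr[3] = error;        /* PAPR hcall return value lives in r3 */
        if (ioctl(vcpu_fd, KVM_SET_REGS, &regs) < 0)
            return -1;
        return ioctl(vcpu_fd, KVM_RUN, 0);   /* restart the vcpu */
    }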
On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> > So, I'm still trying to nut out the implications for H_CEDE, and think
> > if there are any other hypercalls that might want to block the guest
> > for a time. We were considering blocking H_PUT_TCE if qemu devices
> > had active dma maps on the previously mapped iovas. I'm not sure if
> > the discussions that led to the inclusion of the qemu IOMMU code
> > decided that was wholly unnecessary or just not necessary for the
> > time being.
>
> For "sleeping hcalls", they will simply have to set exit_request to
> complete the hcall from the kernel perspective, leaving us in a state
> where the kernel is about to restart at srr0 + 4, along with some other
> flag (stop or halt) to actually freeze the vcpu.
>
> If such an "async" hcall decides to return an error, it can then set
> gpr3 directly using ioctls before restarting the vcpu.

Yeah, I'd pretty much convinced myself of that by the end of
yesterday. I hope to send patches implementing these fixes today.

There are also some questions about why our in-kernel H_CEDE works
kind of differently from x86's hlt instruction implementation (which
comes out to qemu unless the irqchip is in-kernel as well). I don't
think we have an urgent problem there though.
On 08/07/2012 04:32 AM, David Gibson wrote:
> On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
>> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
>> > So, I'm still trying to nut out the implications for H_CEDE, and think
>> > if there are any other hypercalls that might want to block the guest
>> > for a time. We were considering blocking H_PUT_TCE if qemu devices
>> > had active dma maps on the previously mapped iovas. I'm not sure if
>> > the discussions that led to the inclusion of the qemu IOMMU code
>> > decided that was wholly unnecessary or just not necessary for the
>> > time being.
>>
>> For "sleeping hcalls", they will simply have to set exit_request to
>> complete the hcall from the kernel perspective, leaving us in a state
>> where the kernel is about to restart at srr0 + 4, along with some other
>> flag (stop or halt) to actually freeze the vcpu.
>>
>> If such an "async" hcall decides to return an error, it can then set
>> gpr3 directly using ioctls before restarting the vcpu.
>
> Yeah, I'd pretty much convinced myself of that by the end of
> yesterday. I hope to send patches implementing these fixes today.
>
> There are also some questions about why our in-kernel H_CEDE works
> kind of differently from x86's hlt instruction implementation (which
> comes out to qemu unless the irqchip is in-kernel as well). I don't
> think we have an urgent problem there though.

It's the other way round: hlt sleeps in the kernel unless the irqchip
is not in the kernel. Meaning the normal state of things is to sleep
in the kernel (whether or not you have an emulated interrupt controller
in the kernel -- the term irqchip-in-kernel is overloaded for x86).
On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
> On 08/07/2012 04:32 AM, David Gibson wrote:
> > On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
> >> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> >> > So, I'm still trying to nut out the implications for H_CEDE, and think
> >> > if there are any other hypercalls that might want to block the guest
> >> > for a time. We were considering blocking H_PUT_TCE if qemu devices
> >> > had active dma maps on the previously mapped iovas. I'm not sure if
> >> > the discussions that led to the inclusion of the qemu IOMMU code
> >> > decided that was wholly unnecessary or just not necessary for the
> >> > time being.
> >>
> >> For "sleeping hcalls", they will simply have to set exit_request to
> >> complete the hcall from the kernel perspective, leaving us in a state
> >> where the kernel is about to restart at srr0 + 4, along with some other
> >> flag (stop or halt) to actually freeze the vcpu.
> >>
> >> If such an "async" hcall decides to return an error, it can then set
> >> gpr3 directly using ioctls before restarting the vcpu.
> >
> > Yeah, I'd pretty much convinced myself of that by the end of
> > yesterday. I hope to send patches implementing these fixes today.
> >
> > There are also some questions about why our in-kernel H_CEDE works
> > kind of differently from x86's hlt instruction implementation (which
> > comes out to qemu unless the irqchip is in-kernel as well). I don't
> > think we have an urgent problem there though.
>
> It's the other way round: hlt sleeps in the kernel unless the irqchip
> is not in the kernel.

That's the same as what I said.

We never have irqchip in kernel (because we haven't written that yet)
but we still sleep in-kernel for CEDE. I haven't spotted any problem
with that, but now I'm wondering if there is one, since x86 doesn't do
it in what seems like the analogous situation.

It's possible this works because our decrementer (timer) interrupts
are different at the core level from external interrupts coming from
the PIC, and *are* handled in kernel, but I haven't actually followed
the logic to work out if this is the case.

> Meaning the normal state of things is to sleep in
> the kernel (whether or not you have an emulated interrupt controller
> in the kernel -- the term irqchip-in-kernel is overloaded for x86).

Uh.. overloaded in what way?
On 08/07/2012 03:14 PM, David Gibson wrote:
> On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
>> On 08/07/2012 04:32 AM, David Gibson wrote:
>> > On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
>> >> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
>> >> > So, I'm still trying to nut out the implications for H_CEDE, and think
>> >> > if there are any other hypercalls that might want to block the guest
>> >> > for a time. We were considering blocking H_PUT_TCE if qemu devices
>> >> > had active dma maps on the previously mapped iovas. I'm not sure if
>> >> > the discussions that led to the inclusion of the qemu IOMMU code
>> >> > decided that was wholly unnecessary or just not necessary for the
>> >> > time being.
>> >>
>> >> For "sleeping hcalls", they will simply have to set exit_request to
>> >> complete the hcall from the kernel perspective, leaving us in a state
>> >> where the kernel is about to restart at srr0 + 4, along with some other
>> >> flag (stop or halt) to actually freeze the vcpu.
>> >>
>> >> If such an "async" hcall decides to return an error, it can then set
>> >> gpr3 directly using ioctls before restarting the vcpu.
>> >
>> > Yeah, I'd pretty much convinced myself of that by the end of
>> > yesterday. I hope to send patches implementing these fixes today.
>> >
>> > There are also some questions about why our in-kernel H_CEDE works
>> > kind of differently from x86's hlt instruction implementation (which
>> > comes out to qemu unless the irqchip is in-kernel as well). I don't
>> > think we have an urgent problem there though.
>>
>> It's the other way round: hlt sleeps in the kernel unless the irqchip
>> is not in the kernel.
>
> That's the same as what I said.

I meant to stress that the normal model, which other archs should
emulate, is sleep-in-kernel.

> We never have irqchip in kernel (because we haven't written that yet)
> but we still sleep in-kernel for CEDE. I haven't spotted any problem
> with that, but now I'm wondering if there is one, since x86 doesn't do
> it in what seems like the analogous situation.
>
> It's possible this works because our decrementer (timer) interrupts
> are different at the core level from external interrupts coming from
> the PIC, and *are* handled in kernel, but I haven't actually followed
> the logic to work out if this is the case.
>
>> Meaning the normal state of things is to sleep in
>> the kernel (whether or not you have an emulated interrupt controller
>> in the kernel -- the term irqchip-in-kernel is overloaded for x86).
>
> Uh.. overloaded in what way?

On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
the two PICs are emulated in the kernel. Now the IOAPIC and the PICs
correspond to non-x86 interrupt controllers, but the local APIC is more
tightly coupled to the core. Interrupt acceptance by the core is an
operation that involves synchronous communication with the local APIC:
the APIC presents the interrupt, the core accepts it based on the value
of the interrupt enable flag and possibly a register (CR8), then the
APIC updates the ISR and IRR.

The upshot is that if the local APIC is in userspace, interrupts must
be synchronous with vcpu execution, so KVM_INTERRUPT is a vcpu ioctl
and HLT is emulated in userspace (so that local APIC emulation can
check if an interrupt wakes it up or not). As soon as the local APIC is
emulated in the kernel, HLT can be emulated there as well, and
interrupts become asynchronous (KVM_IRQ_LINE, a vm ioctl).

So irqchip_in_kernel, for most discussions, really means whether
interrupt queuing is synchronous or asynchronous. It has nothing to do
with the interrupt controllers per se. All non-x86 archs always have
irqchip_in_kernel() in this sense.

Peter has started to fix up this naming mess in qemu. I guess we should
do the same for the kernel (except for ABIs) and document it, because
it keeps generating confusion.
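The synchronous/asynchronous split Avi describes is visible in the shape of the two ioctls themselves. A sketch using the standard linux/kvm.h definitions (the wrapper names are invented for the example):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* Synchronous model (irqchip in userspace): KVM_INTERRUPT is a *vcpu*
     * ioctl, so userspace must queue the interrupt while that vcpu is out
     * of KVM_RUN -- injection is synchronous with vcpu execution. */
    static int inject_sync(int vcpu_fd, unsigned irq)
    {
        struct kvm_interrupt intr = { .irq = irq };
        return ioctl(vcpu_fd, KVM_INTERRUPT, &intr);
    }

    /* Asynchronous model (irqchip in kernel): KVM_IRQ_LINE is a *vm*
     * ioctl -- any thread may raise or lower a line at any time, and the
     * in-kernel irqchip decides when the vcpu actually takes it. */
    static int set_line_async(int vm_fd, unsigned irq, int level)
    {
        struct kvm_irq_level line = { .irq = irq, .level = level };
        return ioctl(vm_fd, KVM_IRQ_LINE, &line);
    }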
On Tue, 2012-08-07 at 16:13 +0300, Avi Kivity wrote:
>
> Peter has started to fix up this naming mess in qemu. I guess we should
> do the same for the kernel (except for ABIs) and document it, because
> it keeps generating confusion.

Ok, so our current situation is that the XICS and MPIC are emulated in
userspace entirely, but the link between them and the VCPU is the
asynchronous EE line, so we are fine.

I'm currently working on moving the XICS into the kernel for performance
reasons; however, just like the ARM VGIC, I can't seem to find a way to
"fit" the "generic" irqchip code in there. It's just not generic at all,
and quite x86-centric :-)

So for now I'm just doing my own version of CREATE_IRQCHIP to create it
and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
stuff (which we really don't need).

That's a bit of a problem vs. some of the code qemu-side, such as in
virtio-pci, which does seem to be written around the model exposed by
the x86 stuff and relies on doing such mappings, so I think we'll have
to butcher some of that.

Cheers,
Ben.
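For reference, the in-kernel XICS interface that eventually landed upstream (in 2013, after this thread) did indeed bypass the generic irqchip code: the XICS is created as a KVM device and each vcpu attaches via a capability. A rough sketch from the later kernel ABI -- not what Ben's work-in-progress looked like at the time, and the wrapper name is invented:

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /*
     * Rough sketch of the in-kernel XICS ABI that eventually merged
     * (KVM_CREATE_DEVICE + KVM_CAP_IRQ_XICS): create the XICS as a KVM
     * device, then attach each vcpu with its XICS server number.
     */
    static int create_xics_and_attach(int vm_fd, int vcpu_fd, unsigned server)
    {
        struct kvm_create_device cd = { .type = KVM_DEV_TYPE_XICS };
        struct kvm_enable_cap cap = { .cap = KVM_CAP_IRQ_XICS };

        if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
            return -1;
        cap.args[0] = cd.fd;      /* the XICS device fd */
        cap.args[1] = server;     /* this vcpu's XICS server number */
        return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
    }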
On Tue, Aug 07, 2012 at 04:13:49PM +0300, Avi Kivity wrote:
> On 08/07/2012 03:14 PM, David Gibson wrote:
> > On Tue, Aug 07, 2012 at 11:46:35AM +0300, Avi Kivity wrote:
> >> On 08/07/2012 04:32 AM, David Gibson wrote:
> >> > On Tue, Aug 07, 2012 at 06:57:57AM +1000, Benjamin Herrenschmidt wrote:
> >> >> On Mon, 2012-08-06 at 13:13 +1000, David Gibson wrote:
> >> >> > So, I'm still trying to nut out the implications for H_CEDE, and think
> >> >> > if there are any other hypercalls that might want to block the guest
> >> >> > for a time. We were considering blocking H_PUT_TCE if qemu devices
> >> >> > had active dma maps on the previously mapped iovas. I'm not sure if
> >> >> > the discussions that led to the inclusion of the qemu IOMMU code
> >> >> > decided that was wholly unnecessary or just not necessary for the
> >> >> > time being.
> >> >>
> >> >> For "sleeping hcalls", they will simply have to set exit_request to
> >> >> complete the hcall from the kernel perspective, leaving us in a state
> >> >> where the kernel is about to restart at srr0 + 4, along with some other
> >> >> flag (stop or halt) to actually freeze the vcpu.
> >> >>
> >> >> If such an "async" hcall decides to return an error, it can then set
> >> >> gpr3 directly using ioctls before restarting the vcpu.
> >> >
> >> > Yeah, I'd pretty much convinced myself of that by the end of
> >> > yesterday. I hope to send patches implementing these fixes today.
> >> >
> >> > There are also some questions about why our in-kernel H_CEDE works
> >> > kind of differently from x86's hlt instruction implementation (which
> >> > comes out to qemu unless the irqchip is in-kernel as well). I don't
> >> > think we have an urgent problem there though.
> >>
> >> It's the other way round: hlt sleeps in the kernel unless the irqchip
> >> is not in the kernel.
> >
> > That's the same as what I said.
>
> I meant to stress that the normal model, which other archs should
> emulate, is sleep-in-kernel.

Ok.

> > We never have irqchip in kernel (because we haven't written that yet)
> > but we still sleep in-kernel for CEDE. I haven't spotted any problem
> > with that, but now I'm wondering if there is one, since x86 doesn't do
> > it in what seems like the analogous situation.
> >
> > It's possible this works because our decrementer (timer) interrupts
> > are different at the core level from external interrupts coming from
> > the PIC, and *are* handled in kernel, but I haven't actually followed
> > the logic to work out if this is the case.
> >
> >> Meaning the normal state of things is to sleep in
> >> the kernel (whether or not you have an emulated interrupt controller
> >> in the kernel -- the term irqchip-in-kernel is overloaded for x86).
> >
> > Uh.. overloaded in what way?
>
> On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
> the two PICs are emulated in the kernel. Now the IOAPIC and the PICs
> correspond to non-x86 interrupt controllers, but the local APIC is more
> tightly coupled to the core. Interrupt acceptance by the core is an
> operation that involves synchronous communication with the local APIC:
> the APIC presents the interrupt, the core accepts it based on the value
> of the interrupt enable flag and possibly a register (CR8), then the
> APIC updates the ISR and IRR.
>
> The upshot is that if the local APIC is in userspace, interrupts must
> be synchronous with vcpu execution, so KVM_INTERRUPT is a vcpu ioctl
> and HLT is emulated in userspace (so that local APIC emulation can
> check if an interrupt wakes it up or not).

Sorry, still not 100% getting it. When the vcpu is actually running
code, that synchronous communication must still be accomplished via
the KVM_INTERRUPT ioctl, yes? So what makes HLT different, such that
the communication can't be accomplished in that case?

> As soon as the local APIC is
> emulated in the kernel, HLT can be emulated there as well, and
> interrupts become asynchronous (KVM_IRQ_LINE, a vm ioctl).
>
> So irqchip_in_kernel, for most discussions, really means whether
> interrupt queuing is synchronous or asynchronous. It has nothing to do
> with the interrupt controllers per se. All non-x86 archs always have
> irqchip_in_kernel() in this sense.
>
> Peter has started to fix up this naming mess in qemu. I guess we should
> do the same for the kernel (except for ABIs) and document it, because
> it keeps generating confusion.

Ok.
On 08/08/2012 12:09 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2012-08-07 at 16:13 +0300, Avi Kivity wrote:
>>
>> Peter has started to fix up this naming mess in qemu. I guess we should
>> do the same for the kernel (except for ABIs) and document it, because
>> it keeps generating confusion.
>
> Ok, so our current situation is that the XICS and MPIC are emulated in
> userspace entirely, but the link between them and the VCPU is the
> asynchronous EE line, so we are fine.
>
> I'm currently working on moving the XICS into the kernel for performance
> reasons; however, just like the ARM VGIC, I can't seem to find a way to
> "fit" the "generic" irqchip code in there. It's just not generic at all,
> and quite x86-centric :-)

The generic code is for the two apic architectures: x86 and ia64. Don't
try to shoehorn it in, you'll damage both the shoe and your foot.

> So for now I'm just doing my own version of CREATE_IRQCHIP to create it
> and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
> stuff (which we really don't need).

You mean KVM_IRQ_LINE. KVM_INTERRUPT is a synchronous vcpu ioctl.

> That's a bit of a problem vs. some of the code qemu-side, such as in
> virtio-pci, which does seem to be written around the model exposed by
> the x86 stuff and relies on doing such mappings, so I think we'll have
> to butcher some of that.

Can you elaborate? virtio-pci is pci-centric, there should be nothing
x86-specific there.
On 08/08/2012 03:49 AM, David Gibson wrote:
>> > We never have irqchip in kernel (because we haven't written that yet)
>> > but we still sleep in-kernel for CEDE. I haven't spotted any problem
>> > with that, but now I'm wondering if there is one, since x86 doesn't do
>> > it in what seems like the analogous situation.
>> >
>> > It's possible this works because our decrementer (timer) interrupts
>> > are different at the core level from external interrupts coming from
>> > the PIC, and *are* handled in kernel, but I haven't actually followed
>> > the logic to work out if this is the case.
>> >
>> >> Meaning the normal state of things is to sleep in
>> >> the kernel (whether or not you have an emulated interrupt controller
>> >> in the kernel -- the term irqchip-in-kernel is overloaded for x86).
>> >
>> > Uh.. overloaded in what way?
>>
>> On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
>> the two PICs are emulated in the kernel. Now the IOAPIC and the PICs
>> correspond to non-x86 interrupt controllers, but the local APIC is more
>> tightly coupled to the core. Interrupt acceptance by the core is an
>> operation that involves synchronous communication with the local APIC:
>> the APIC presents the interrupt, the core accepts it based on the value
>> of the interrupt enable flag and possibly a register (CR8), then the
>> APIC updates the ISR and IRR.
>>
>> The upshot is that if the local APIC is in userspace, interrupts must
>> be synchronous with vcpu execution, so KVM_INTERRUPT is a vcpu ioctl
>> and HLT is emulated in userspace (so that local APIC emulation can
>> check if an interrupt wakes it up or not).
>
> Sorry, still not 100% getting it. When the vcpu is actually running
> code, that synchronous communication must still be accomplished via
> the KVM_INTERRUPT ioctl, yes? So what makes HLT different, such that
> the communication can't be accomplished in that case?

No, you're correct. HLT could have been emulated in userspace, it just
wasn't. The correct statement is that HLT was arbitrarily chosen to be
emulated in userspace with the synchronous model, but the asynchronous
model forced it into the kernel.
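To make the "arbitrary choice" concrete: under the synchronous model, userspace HLT emulation amounts to roughly the following (the helper name userspace_apic_has_pending_interrupt() is invented for the sketch; in real qemu this logic is spread across kvm_arch_handle_exit() and the halt handling in the vcpu loop):

    /*
     * Illustrative only: userspace HLT handling with a userspace local
     * APIC. The APIC emulation decides whether an interrupt is pending;
     * if not, the vcpu stays halted and its thread sleeps.
     */
    static int handle_hlt_in_userspace(CPUX86State *env)
    {
        if (userspace_apic_has_pending_interrupt(env)) {
            return 0;        /* an interrupt is pending: resume the guest */
        }
        env->halted = 1;     /* nothing pending: vcpu goes idle, and the  */
        return 0;            /* main loop sleeps until an irq arrives     */
    }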
On Wed, 2012-08-08 at 11:52 +0300, Avi Kivity wrote:
> > So for now I'm just doing my own version of CREATE_IRQCHIP to create it
> > and KVM_INTERRUPT to trigger the various interrupts. None of the mapping
> > stuff (which we really don't need).
>
> You mean KVM_IRQ_LINE. KVM_INTERRUPT is a synchronous vcpu ioctl.

Yes, sorry, brain and fingers not agreeing :-)

> > That's a bit of a problem vs. some of the code qemu-side, such as in
> > virtio-pci, which does seem to be written around the model exposed by
> > the x86 stuff and relies on doing such mappings, so I think we'll have
> > to butcher some of that.
>
> Can you elaborate? virtio-pci is pci-centric, there should be nothing
> x86-specific there.

The kvm_irqchip_add_msi_route & co. is a problem; it more or less dives
into the x86/ia64-specific routing stuff at a generic level, so I'll
have to hook things up differently in qemu, I suspect. Not a huge deal,
we can sort it out.

Cheers,
Ben.
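The coupling Ben mentions looked roughly like this in the virtio-pci MSI-X fast path of that period (paraphrased sketch; these helpers existed in qemu then, but their exact signatures shifted between releases, so treat names and types as approximate, and the wrapper function as invented):

    /*
     * Paraphrased sketch, assuming qemu-internal headers and types.
     */
    static int bind_vector_to_irqfd(VirtIOPCIProxy *proxy, VirtQueue *vq,
                                    unsigned vector)
    {
        /* The MSI message the guest programmed for this vector. */
        MSIMessage msg = msix_get_message(&proxy->pci_dev, vector);

        /* Allocate a GSI route for it in the in-kernel routing table --
         * the x86/ia64-centric step Ben is referring to. */
        int virq = kvm_irqchip_add_msi_route(kvm_state, msg);
        if (virq < 0) {
            return virq;    /* no kernel routing: fall back to userspace */
        }

        /* Bind the queue's guest notifier (an eventfd) to the route so
         * the kernel can inject the MSI without bouncing through qemu. */
        EventNotifier *n = virtio_queue_get_guest_notifier(vq);
        return kvm_irqchip_add_irqfd(kvm_state, event_notifier_get_fd(n), virq);
    }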
On Wed, Aug 08, 2012 at 11:58:58AM +0300, Avi Kivity wrote:
> On 08/08/2012 03:49 AM, David Gibson wrote:
> >> > We never have irqchip in kernel (because we haven't written that yet)
> >> > but we still sleep in-kernel for CEDE. I haven't spotted any problem
> >> > with that, but now I'm wondering if there is one, since x86 doesn't do
> >> > it in what seems like the analogous situation.
> >> >
> >> > It's possible this works because our decrementer (timer) interrupts
> >> > are different at the core level from external interrupts coming from
> >> > the PIC, and *are* handled in kernel, but I haven't actually followed
> >> > the logic to work out if this is the case.
> >> >
> >> >> Meaning the normal state of things is to sleep in
> >> >> the kernel (whether or not you have an emulated interrupt controller
> >> >> in the kernel -- the term irqchip-in-kernel is overloaded for x86).
> >> >
> >> > Uh.. overloaded in what way?
> >>
> >> On x86, irqchip-in-kernel means that the local APICs, the IOAPIC, and
> >> the two PICs are emulated in the kernel. Now the IOAPIC and the PICs
> >> correspond to non-x86 interrupt controllers, but the local APIC is more
> >> tightly coupled to the core. Interrupt acceptance by the core is an
> >> operation that involves synchronous communication with the local APIC:
> >> the APIC presents the interrupt, the core accepts it based on the value
> >> of the interrupt enable flag and possibly a register (CR8), then the
> >> APIC updates the ISR and IRR.
> >>
> >> The upshot is that if the local APIC is in userspace, interrupts must
> >> be synchronous with vcpu execution, so KVM_INTERRUPT is a vcpu ioctl
> >> and HLT is emulated in userspace (so that local APIC emulation can
> >> check if an interrupt wakes it up or not).
> >
> > Sorry, still not 100% getting it. When the vcpu is actually running
> > code, that synchronous communication must still be accomplished via
> > the KVM_INTERRUPT ioctl, yes? So what makes HLT different, such that
> > the communication can't be accomplished in that case?
>
> No, you're correct. HLT could have been emulated in userspace, it just
> wasn't. The correct statement is that HLT was arbitrarily chosen to be
> emulated in userspace with the synchronous model, but the asynchronous
> model forced it into the kernel.

Aha! Ok, understood. Uh, assuming you meant kernelspace, not userspace
in the first line there, anyway. Ok, so I am now reassured that our
current handling of CEDE in kernelspace does not cause problems.
On 08/08/2012 02:59 PM, David Gibson wrote:
>>
>> No, you're correct. HLT could have been emulated in userspace, it just
>> wasn't. The correct statement is that HLT was arbitrarily chosen to be
>> emulated in userspace with the synchronous model, but the asynchronous
>> model forced it into the kernel.
>
> Aha! Ok, understood. Uh, assuming you meant kernelspace, not userspace
> in the first line there, anyway.

I did.

> Ok, so I am now reassured that our
> current handling of CEDE in kernelspace does not cause problems.

Great. It's a real pity the original local APIC implementation was in
userspace; it causes no end of confusion, and in fact was broken for
SMP until recently, even though it is 7 years old.
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -766,7 +766,7 @@ int kvm_arch_handle_exit(CPUPPCState *env, struct kvm_run *r
         dprintf("handle PAPR hypercall\n");
         run->papr_hcall.ret = spapr_hypercall(env, run->papr_hcall.nr,
                                               run->papr_hcall.args);
-        ret = 1;
+        ret = 0;
         break;
 #endif
     default:

It might also need something like:

diff --git a/hw/spapr_hcall.c b/hw/spapr_hcall.c
index a5990a9..d4d7fb0 100644
--- a/hw/spapr_hcall.c
+++ b/hw/spapr_hcall.c
@@ -545,6 +545,7 @@ static target_ulong h_cede(CPUPPCState *env, sPAPREnvironmen
     hreg_compute_hflags(env);
     if (!cpu_has_work(env)) {
         env->halted = 1;
+        env->exit_request = 1;
     }
     return H_SUCCESS;