Message ID | 1456751763-24244-1-git-send-email-a.rigo@virtualopensystems.com (mailing list archive)
State      | New, archived
On 29 February 2016 at 13:16, Alvise Rigo <a.rigo@virtualopensystems.com> wrote:
> As in the case of tlb_flush(), also tlb_flush_by_mmuidx has to query the
> TLB flush if it targets another VCPU. To accomplish this, a new async
> work has been added, together with a new TLBFlushByMMUIdxParams. A
> bitmap is used to track the MMU indexes to flush.
>
> This patch applies to the multi_tcg_v8 branch.

What's the API for a target CPU emulation to say "and now I must
wait for the TLB op to finish" before completing this guest
instruction?

thanks
-- PMM
On 29/02/2016 14:21, Peter Maydell wrote:
> On 29 February 2016 at 13:16, Alvise Rigo <a.rigo@virtualopensystems.com> wrote:
>> As in the case of tlb_flush(), also tlb_flush_by_mmuidx has to query the
>> TLB flush if it targets another VCPU. To accomplish this, a new async
>> work has been added, together with a new TLBFlushByMMUIdxParams. A
>> bitmap is used to track the MMU indexes to flush.
>>
>> This patch applies to the multi_tcg_v8 branch.
> What's the API for a target CPU emulation to say "and now I must
> wait for the TLB op to finish" before completing this guest
> instruction?

My proposal has been for a while for DMB to put the CPU in a halted
state (remote TLB callbacks then can decrement a counter and signal
cpu_halt_cond when it's zero), but no one has implemented this.

Paolo
On 29 February 2016 at 13:50, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 29/02/2016 14:21, Peter Maydell wrote:
>> On 29 February 2016 at 13:16, Alvise Rigo <a.rigo@virtualopensystems.com> wrote:
>>> As in the case of tlb_flush(), also tlb_flush_by_mmuidx has to query the
>>> TLB flush if it targets another VCPU. To accomplish this, a new async
>>> work has been added, together with a new TLBFlushByMMUIdxParams. A
>>> bitmap is used to track the MMU indexes to flush.
>>>
>>> This patch applies to the multi_tcg_v8 branch.
>> What's the API for a target CPU emulation to say "and now I must
>> wait for the TLB op to finish" before completing this guest
>> instruction?
>
> My proposal has been for a while for DMB to put the CPU in a halted
> state (remote TLB callbacks then can decrement a counter and signal
> cpu_halt_cond when it's zero), but no one has implemented this.

Yeah, that's the other approach -- really split the things that can
be async and do real "wait for completion" at points which must
synchronize. (Needs a little care since DMB is not the only such point.)
An initial implementation that does an immediate wait-for-completion
is probably simpler to review though, and add the real asynchrony
later. And either way you need an API for the target to wait for
completion.

thanks
-- PMM
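Paolo's counter-plus-condition-variable scheme can be sketched with plain pthreads. This is only an illustration of the handshake he describes, not QEMU code: every name below (HaltState, arm_flush_counter, flush_done, wait_for_flushes) is invented for the example, and only cpu_halt_cond echoes an identifier mentioned in the mail.

```c
#include <pthread.h>

/* Sketch of the proposed "halted state" handshake: the issuing VCPU
 * arms a counter with the number of outstanding remote flushes, each
 * remote flush callback decrements it, and the last one signals the
 * halt condition variable.  All names are illustrative. */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cpu_halt_cond;   /* signalled when counter hits 0 */
    int pending_flushes;            /* remote flushes still outstanding */
} HaltState;

/* Issuing VCPU: record how many remote VCPUs must complete a flush. */
void arm_flush_counter(HaltState *hs, int n_remote)
{
    pthread_mutex_lock(&hs->lock);
    hs->pending_flushes = n_remote;
    pthread_mutex_unlock(&hs->lock);
}

/* Remote VCPU: called from its flush callback once the flush is done. */
void flush_done(HaltState *hs)
{
    pthread_mutex_lock(&hs->lock);
    if (--hs->pending_flushes == 0) {
        pthread_cond_signal(&hs->cpu_halt_cond);
    }
    pthread_mutex_unlock(&hs->lock);
}

/* Issuing VCPU at a synchronization point (e.g. DMB): stay "halted"
 * until every remote flush has been acknowledged. */
void wait_for_flushes(HaltState *hs)
{
    pthread_mutex_lock(&hs->lock);
    while (hs->pending_flushes > 0) {
        pthread_cond_wait(&hs->cpu_halt_cond, &hs->lock);
    }
    pthread_mutex_unlock(&hs->lock);
}
```

The `while` loop around `pthread_cond_wait` is the standard guard against spurious wakeups; in QEMU the wait would additionally have to let the VCPU thread keep running its service loop, which is the point Paolo raises below about deadlocks.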
On Mon, Feb 29, 2016 at 2:55 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 29 February 2016 at 13:50, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 29/02/2016 14:21, Peter Maydell wrote:
>>> On 29 February 2016 at 13:16, Alvise Rigo <a.rigo@virtualopensystems.com> wrote:
>>>> As in the case of tlb_flush(), also tlb_flush_by_mmuidx has to query the
>>>> TLB flush if it targets another VCPU. To accomplish this, a new async
>>>> work has been added, together with a new TLBFlushByMMUIdxParams. A
>>>> bitmap is used to track the MMU indexes to flush.
>>>>
>>>> This patch applies to the multi_tcg_v8 branch.
>>> What's the API for a target CPU emulation to say "and now I must
>>> wait for the TLB op to finish" before completing this guest
>>> instruction?
>>
>> My proposal has been for a while for DMB to put the CPU in a halted
>> state (remote TLB callbacks then can decrement a counter and signal
>> cpu_halt_cond when it's zero), but no one has implemented this.
>
> Yeah, that's the other approach -- really split the things that can
> be async and do real "wait for completion" at points which must
> synchronize. (Needs a little care since DMB is not the only such point.)
> An initial implementation that does an immediate wait-for-completion
> is probably simpler to review though, and add the real asynchrony
> later. And either way you need an API for the target to wait for
> completion.

OK, so basically being sure that the target CPU performs the flush
before executing the next TB is not enough. We need a sort of feedback
that the flush has been done before emulating the next guest
instruction. Did I get it right?

Thank you,
alvise

> thanks
> -- PMM
On 29/02/2016 15:02, alvise rigo wrote:
>> Yeah, that's the other approach -- really split the things that can
>> be async and do real "wait for completion" at points which must
>> synchronize. (Needs a little care since DMB is not the only such point.)
>> An initial implementation that does an immediate wait-for-completion
>> is probably simpler to review though, and add the real asynchrony
>> later. And either way you need an API for the target to wait for
>> completion.
> OK, so basically being sure that the target CPU performs the flush
> before executing the next TB is not enough. We need a sort of feedback
> that the flush has been done before emulating the next guest
> instruction. Did I get it right?

That risks getting deadlocks if CPU A asks B to flush the TLB and vice
versa. Using a halted state means that the VCPU thread goes through the
cpus.c loop and can for example service other CPUs' TLB flush requests.

Paolo
I see the risk. I will come back with something and let you know.

Thank you,
alvise

On Mon, Feb 29, 2016 at 3:06 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 29/02/2016 15:02, alvise rigo wrote:
>> > Yeah, that's the other approach -- really split the things that can
>> > be async and do real "wait for completion" at points which must
>> > synchronize. (Needs a little care since DMB is not the only such point.)
>> > An initial implementation that does an immediate wait-for-completion
>> > is probably simpler to review though, and add the real asynchrony
>> > later. And either way you need an API for the target to wait for
>> > completion.
>> OK, so basically being sure that the target CPU performs the flush
>> before executing the next TB is not enough. We need a sort of feedback
>> that the flush has been done before emulating the next guest
>> instruction. Did I get it right?
>
> That risks getting deadlocks if CPU A asks B to flush the TLB and vice
> versa. Using a halted state means that the VCPU thread goes through the
> cpus.c loop and can for example service other CPUs' TLB flush requests.
>
> Paolo
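The deadlock Paolo warns about, and why the halted state avoids it, can be modeled in miniature with two threads: each asks the other to flush and then waits, but keeps servicing its own incoming requests while waiting, which is the role the cpus.c loop plays for a halted VCPU. All names here are invented for the sketch. If each thread instead blocked blindly until its own request was acknowledged, the two waits would form a cycle and neither request would ever be serviced.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy model, not QEMU code: two "VCPUs" that each request a TLB flush
 * from the other.  The waiting thread keeps polling its own "inbox"
 * (standing in for the cpus.c service loop), so the cross requests
 * cannot deadlock. */

typedef struct {
    atomic_int inbox;   /* peer asked us to flush our TLB */
    atomic_int acked;   /* our request to the peer was serviced */
} ToyVCPU;

ToyVCPU vcpus[2];

void *vcpu_thread(void *arg)
{
    int me = (int)(intptr_t)arg;
    int peer = 1 - me;

    atomic_store(&vcpus[peer].inbox, 1);         /* ask peer to flush */

    /* "Halted" wait: service our own inbox while we wait for the ack.
     * A plain blocking wait here would deadlock when both threads
     * request at once. */
    while (!atomic_load(&vcpus[me].acked)) {
        if (atomic_exchange(&vcpus[me].inbox, 0)) {
            atomic_store(&vcpus[peer].acked, 1); /* flush done, ack */
        }
    }
    /* Drain once more so the peer is never left waiting on a request
     * that arrived just as our own ack came in. */
    if (atomic_exchange(&vcpus[me].inbox, 0)) {
        atomic_store(&vcpus[peer].acked, 1);
    }
    return NULL;
}
```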
A small update on this. I have a working implementation of the "halted
state" mechanism for waiting for all the pending flushes to be completed.
However, the way I'm going back to the cpus.c loop (the while(1) in
qemu_tcg_cpu_thread_fn) is a bit convoluted. In the case of the TLB ops
that always end the TB, a simple cpu_exit() allows me to go back to the
main loop. I think in this case we could also use cpu_loop_exit(), though
that makes the code a bit more complicated since the PC would require
some adjustments.

I wanted then to apply the same "halted state" to the LoadLink helper,
since this one might also cause some flush requests. In this case, we
cannot just call cpu_loop_exit(), in that the guest code would miss the
returned value. Forcing the LDREX instruction to also end the TB through
an empty 'is_jmp' condition did the trick, allowing once again the use of
cpu_exit(). Is there a better solution?

Thank you,
alvise

On Mon, Feb 29, 2016 at 3:18 PM, alvise rigo <a.rigo@virtualopensystems.com> wrote:
> I see the risk. I will come back with something and let you know.
>
> Thank you,
> alvise
>
> On Mon, Feb 29, 2016 at 3:06 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 29/02/2016 15:02, alvise rigo wrote:
> >> > Yeah, that's the other approach -- really split the things that can
> >> > be async and do real "wait for completion" at points which must
> >> > synchronize. (Needs a little care since DMB is not the only such point.)
> >> > An initial implementation that does an immediate wait-for-completion
> >> > is probably simpler to review though, and add the real asynchrony
> >> > later. And either way you need an API for the target to wait for
> >> > completion.
> >> OK, so basically being sure that the target CPU performs the flush
> >> before executing the next TB is not enough. We need a sort of feedback
> >> that the flush has been done before emulating the next guest
> >> instruction. Did I get it right?
> >
> > That risks getting deadlocks if CPU A asks B to flush the TLB and vice
> > versa. Using a halted state means that the VCPU thread goes through the
> > cpus.c loop and can for example service other CPUs' TLB flush requests.
> >
> > Paolo
alvise rigo <a.rigo@virtualopensystems.com> writes:

> A small update on this. I have a working implementation of the "halted
> state" mechanism for waiting all the pending flushes to be completed.
> However, the way I'm going back to the cpus.c loop (the while(1) in
> qemu_tcg_cpu_thread_fn) is a bit convoluted. In the case of the TLB ops
> that always end the TB, a simple cpu_exit() allows me to go back to the
> main loop. I think in this case we can also use the cpu_loop_exit(), though
> making the code a bit more complicated since the PC would require some
> adjustments.
>
> I wanted then to apply the same "halted state" to the LoadLink helper,
> since also this one might cause some flush requests. In this case, we can
> not just call cpu_loop_exit() in that the guest code would miss the
> returned value. Forcing the LDREX instruction to also end the TB through an
> empty 'is_jmp' condition did the trick allowing once again to use
> cpu_exit(). Is there another better solution?

Have you looked at Emilio's tree, where he replaced the async_safe_work
mechanism with a mechanism to do the work in the vCPU run loop but halt
all other vCPUs first?

> Thank you,
> alvise
>
> On Mon, Feb 29, 2016 at 3:18 PM, alvise rigo <a.rigo@virtualopensystems.com> wrote:
>
>> I see the risk. I will come back with something and let you know.
>>
>> Thank you,
>> alvise
>>
>> On Mon, Feb 29, 2016 at 3:06 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> >
>> > On 29/02/2016 15:02, alvise rigo wrote:
>> >> > Yeah, that's the other approach -- really split the things that can
>> >> > be async and do real "wait for completion" at points which must
>> >> > synchronize. (Needs a little care since DMB is not the only such point.)
>> >> > An initial implementation that does an immediate wait-for-completion
>> >> > is probably simpler to review though, and add the real asynchrony
>> >> > later. And either way you need an API for the target to wait for
>> >> > completion.
>> >> OK, so basically being sure that the target CPU performs the flush
>> >> before executing the next TB is not enough. We need a sort of feedback
>> >> that the flush has been done before emulating the next guest
>> >> instruction. Did I get it right?
>> >
>> > That risks getting deadlocks if CPU A asks B to flush the TLB and vice
>> > versa. Using a halted state means that the VCPU thread goes through the
>> > cpus.c loop and can for example service other CPUs' TLB flush requests.
>> >
>> > Paolo

--
Alex Bennée
On 04/03/2016 15:28, alvise rigo wrote:
> A small update on this. I have a working implementation of the "halted
> state" mechanism for waiting all the pending flushes to be completed.
> However, the way I'm going back to the cpus.c loop (the while(1) in
> qemu_tcg_cpu_thread_fn) is a bit convoluted. In the case of the TLB ops
> that always end the TB, a simple cpu_exit() allows me to go back to the
> main loop. I think in this case we can also use the cpu_loop_exit(),
> though making the code a bit more complicated since the PC would require
> some adjustments.

I think in both cases the function to use is cpu_loop_exit_restore. It
will restart execution of the current instruction, so it should be fine
as long as you don't call it unconditionally.

If you're not calling it directly from the helper, you need to save
GETPC() in the helper and propagate it down to the call site. Then the
call site can use it as the last argument. For an example see
helper_ljmp_protected's call to switch_tss_ra in target-i386/seg_helper.c.

> I wanted then to apply the same "halted state" to the LoadLink helper,
> since also this one might cause some flush requests.

Interesting, where is this documented in the ARM ARM?

> In this case, we
> can not just call cpu_loop_exit() in that the guest code would miss the
> returned value. Forcing the LDREX instruction to also end the TB through
> an empty 'is_jmp' condition did the trick allowing once again to use
> cpu_exit(). Is there another better solution?

Perhaps cpu_loop_exit_restore()?

Paolo
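The restart semantics Paolo describes can be modeled in miniature with setjmp/longjmp: the helper unwinds back to the run loop without advancing the PC, so the same "instruction" is retried on the next iteration. This is a toy, not QEMU's implementation; the real mechanism is cpu_loop_exit_restore() given a return address saved with GETPC(), as described above. All names below are invented.

```c
#include <setjmp.h>

/* Toy model of "exit the cpu loop and restart the current insn".
 * State is kept in globals so it survives the longjmp. */

jmp_buf cpu_loop_buf;
int pc = 0;            /* index of the current "guest" instruction */
int flush_pending = 1; /* first attempt at the TLB op must wait */
int retries = 0;

/* Helper that cannot complete on its first try (e.g. it must wait for
 * remote TLB flushes): it unwinds to the loop *without* advancing pc,
 * so the instruction is re-executed, like cpu_loop_exit_restore(). */
void helper_wait_for_flush(void)
{
    if (flush_pending) {
        flush_pending = 0;  /* pretend the remote flush completes */
        retries++;
        longjmp(cpu_loop_buf, 1);
    }
}

int run(void)
{
    setjmp(cpu_loop_buf);   /* re-entered on every "loop exit" */
    while (pc < 3) {
        if (pc == 1) {
            helper_wait_for_flush(); /* may longjmp; pc stays 1 */
        }
        pc++;               /* advance only once the insn completed */
    }
    return retries;
}
```

The key property, and the reason Paolo warns against calling it unconditionally, is visible here: if the helper always longjmp'd, the same instruction would retry forever.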
Hi Paolo,

On Mon, Mar 7, 2016 at 10:18 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 04/03/2016 15:28, alvise rigo wrote:
>> A small update on this. I have a working implementation of the "halted
>> state" mechanism for waiting all the pending flushes to be completed.
>> However, the way I'm going back to the cpus.c loop (the while(1) in
>> qemu_tcg_cpu_thread_fn) is a bit convoluted. In the case of the TLB ops
>> that always end the TB, a simple cpu_exit() allows me to go back to the
>> main loop. I think in this case we can also use the cpu_loop_exit(),
>> though making the code a bit more complicated since the PC would require
>> some adjustments.
>
> I think in both cases the function to use is cpu_loop_exit_restore. It
> will restart execution of the current instruction so it should be fine
> as long as you don't call it unconditionally.

Indeed, cpu_loop_exit_restore() works just fine for those helpers that
do not return any value, thank you.

> If you're not calling it directly from the helper, you need to save
> GETPC() in the helper and propagate it down to the call site. Then the
> call site can use it as the last argument. For an example see
> helper_ljmp_protected's call to switch_tss_ra in target-i386/seg_helper.c.
>
>> I wanted then to apply the same "halted state" to the LoadLink helper,
>> since also this one might cause some flush requests.
>
> Interesting, where is this documented in the ARM ARM?

I'm referring to the usual flush requests that a LL(x) operation might
issue in order to have all the VCPUs agreeing on "x is an exclusive
address". Adding the halted state, we ensure that the calling VCPU
resumes its execution only after all the other VCPUs have set the
TLB_EXCL flag (this should also fix the race condition you were worried
about).

>> In this case, we
>> can not just call cpu_loop_exit() in that the guest code would miss the
>> returned value. Forcing the LDREX instruction to also end the TB through
>> an empty 'is_jmp' condition did the trick allowing once again to use
>> cpu_exit(). Is there another better solution?
>
> Perhaps cpu_loop_exit_restore()?

For some reason this does not work to exit from helper_ldlink_name in
softmmu_llsc_template.h (the method returns a "WORD_TYPE"). The
execution in the guest is brought to an infinite loop, most likely
because of a deadlock due to an improper emulation of LDREX and STREX.
In any case the cpu_exit() solution still works, with the downside of a
slightly bigger overhead in exiting/entering the TB.

Thank you,
alvise

> Paolo
diff --git a/cputlb.c b/cputlb.c
index 29252d1..1eeeccb 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -103,9 +103,11 @@ void tlb_flush(CPUState *cpu, int flush_global)
     }
 }
 
-static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
+/* Flush tlb_table[] and tlb_v_table[] of @cpu at MMU indexes given by @bitmap.
+ * Flush also tb_jmp_cache. */
+static inline void tlb_tables_flush_bitmap(CPUState *cpu, unsigned long *bitmap)
 {
-    CPUArchState *env = cpu->env_ptr;
+    int mmu_idx;
 
 #if defined(DEBUG_TLB)
     printf("tlb_flush_by_mmuidx:");
@@ -114,6 +116,41 @@ static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
        links while we are modifying them */
     cpu->current_tb = NULL;
 
+    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        if (test_bit(mmu_idx, bitmap)) {
+            CPUArchState *env = cpu->env_ptr;
+#if defined(DEBUG_TLB)
+            printf(" %d", mmu_idx);
+#endif
+            memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0]));
+            memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
+        }
+    }
+    memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
+
+#if defined(DEBUG_TLB)
+    printf("\n");
+#endif
+}
+
+struct TLBFlushByMMUIdxParams {
+    CPUState *cpu;
+    DECLARE_BITMAP(idx_to_flush, NB_MMU_MODES);
+};
+
+static void tlb_flush_by_mmuidx_async_work(void *opaque)
+{
+    struct TLBFlushByMMUIdxParams *params = opaque;
+
+    tlb_tables_flush_bitmap(params->cpu, params->idx_to_flush);
+
+    g_free(params);
+}
+
+static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
+{
+    DECLARE_BITMAP(idxmap, NB_MMU_MODES) = { 0 };
+
     for (;;) {
         int mmu_idx = va_arg(argp, int);
 
@@ -121,19 +158,23 @@ static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
             break;
         }
 
-#if defined(DEBUG_TLB)
-        printf(" %d", mmu_idx);
-#endif
-
-        memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0]));
-        memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
+        set_bit(mmu_idx, idxmap);
     }
 
-#if defined(DEBUG_TLB)
-    printf("\n");
-#endif
+    if (!qemu_cpu_is_self(cpu)) {
+        /* We do not set the pending_tlb_flush bit, only a global flush
+         * does that. */
+        if (!atomic_read(&cpu->pending_tlb_flush)) {
+            struct TLBFlushByMMUIdxParams *params;
 
-    memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
+            params = g_malloc(sizeof(struct TLBFlushByMMUIdxParams));
+            params->cpu = cpu;
+            memcpy(params->idx_to_flush, idxmap, sizeof(idxmap));
+            async_run_on_cpu(cpu, tlb_flush_by_mmuidx_async_work, params);
+        }
+    } else {
+        tlb_tables_flush_bitmap(cpu, idxmap);
+    }
 }
 
 void tlb_flush_by_mmuidx(CPUState *cpu, ...)
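The patch's core bookkeeping, walking the terminator-ended va_list and recording each MMU index in a bitmap before deciding between a local flush and queued async work, can be reproduced standalone. In this sketch a plain unsigned long stands in for QEMU's DECLARE_BITMAP() and set_bit() (enough for up to 64 MMU modes), and the function names are invented; only the collection step is modeled, with a negative value as the terminator as in the patch.

```c
#include <stdarg.h>

/* Walk a negative-terminated va_list of MMU indexes and record each
 * one as a bit, mirroring what v_tlb_flush_by_mmuidx does with
 * DECLARE_BITMAP() and set_bit(). */
unsigned long collect_mmuidx_bitmap(va_list argp)
{
    unsigned long idxmap = 0;

    for (;;) {
        int mmu_idx = va_arg(argp, int);

        if (mmu_idx < 0) {              /* terminator */
            break;
        }
        idxmap |= 1UL << mmu_idx;       /* set_bit(mmu_idx, idxmap) */
    }
    return idxmap;
}

/* Variadic front end, shaped like tlb_flush_by_mmuidx(cpu, ...). */
unsigned long collect_mmuidx(int first, ...)
{
    va_list argp;
    unsigned long idxmap;

    if (first < 0) {
        return 0;                       /* empty list: just the terminator */
    }
    idxmap = 1UL << first;
    va_start(argp, first);
    idxmap |= collect_mmuidx_bitmap(argp);
    va_end(argp);
    return idxmap;
}
```

Collecting first and then acting once is what lets the patch send a single async work item per target VCPU instead of one per MMU index.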
As in the case of tlb_flush(), also tlb_flush_by_mmuidx has to queue the
TLB flush if it targets another VCPU. To accomplish this, a new async
work has been added, together with a new TLBFlushByMMUIdxParams. A
bitmap is used to track the MMU indexes to flush.

This patch applies to the multi_tcg_v8 branch.

Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 53 insertions(+), 12 deletions(-)