diff mbox series

[RFC] Implement qemu_thread_yield for posix, use it in mttcg to handle EXCP_YIELD

Message ID 20190717054655.14104-1-npiggin@gmail.com (mailing list archive)
State New, archived
Headers show
Series [RFC] Implement qemu_thread_yield for posix, use it in mttcg to handle EXCP_YIELD | expand

Commit Message

Nicholas Piggin July 17, 2019, 5:46 a.m. UTC
This is a bit of proof of concept in case mttcg becomes more important
yield could be handled like this. You can have by accident or deliberately
force vCPUs onto the same physical CPU and cause inversion issues when the
lock holder was preempted by the waiter. This is lightly tested but not
to the point of measuring performance difference.

I really consider the previous confer/prod patches more important just to
provide a more complete guest environment and better test coverage, than
performance, but maybe someone wants to persue this.

Thanks,
Nick
---
 cpus.c                   |  6 ++++++
 hw/ppc/spapr_hcall.c     | 14 +++++++-------
 include/qemu/thread.h    |  1 +
 util/qemu-thread-posix.c |  5 +++++
 util/qemu-thread-win32.c |  4 ++++
 5 files changed, 23 insertions(+), 7 deletions(-)

Comments

David Gibson July 17, 2019, 11:48 a.m. UTC | #1
On Wed, Jul 17, 2019 at 03:46:55PM +1000, Nicholas Piggin wrote:
> This is a bit of proof of concept in case mttcg becomes more important
> yield could be handled like this. You can have by accident or deliberately
> force vCPUs onto the same physical CPU and cause inversion issues when the
> lock holder was preempted by the waiter. This is lightly tested but not
> to the point of measuring performance difference.
> 
> I really consider the previous confer/prod patches more important just to
> provide a more complete guest environment and better test coverage, than
> performance, but maybe someone wants to persue this.
> 
> Thanks,
> Nick
> ---
>  cpus.c                   |  6 ++++++
>  hw/ppc/spapr_hcall.c     | 14 +++++++-------
>  include/qemu/thread.h    |  1 +
>  util/qemu-thread-posix.c |  5 +++++
>  util/qemu-thread-win32.c |  4 ++++
>  5 files changed, 23 insertions(+), 7 deletions(-)
> 
> diff --git a/cpus.c b/cpus.c
> index 927a00aa90..f036e062d9 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -1760,6 +1760,12 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>                  qemu_mutex_unlock_iothread();
>                  cpu_exec_step_atomic(cpu);
>                  qemu_mutex_lock_iothread();
> +                break;
> +            case EXCP_YIELD:
> +                qemu_mutex_unlock_iothread();
> +                qemu_thread_yield();
> +                qemu_mutex_lock_iothread();
> +                break;
>              default:
>                  /* Ignore everything else? */
>                  break;
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 57c1ee0fe1..9c24a64dfe 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1162,13 +1162,13 @@ static target_ulong h_confer(PowerPCCPU *cpu, SpaprMachineState *spapr,
>              return H_SUCCESS;
>          }
>  
> -        /*
> -         * The targeted confer does not do anything special beyond yielding
> -         * the current vCPU, but even this should be better than nothing.
> -         * At least for single-threaded tcg, it gives the target a chance to
> -         * run before we run again. Multi-threaded tcg does not really do
> -         * anything with EXCP_YIELD yet.
> -         */
> +       /*
> +        * The targeted confer does not do anything special beyond yielding
> +        * the current vCPU, but even this should be better than nothing.
> +        * For single-threaded tcg, it gives the target a chance to run
> +        * before we run again, multi-threaded tcg will yield the CPU to
> +        * another vCPU.
> +        */

Uh.. this looks like a whitespace fixup leaked in from your other patches.

>      }
>  
>      cs->exception_index = EXCP_YIELD;
> diff --git a/include/qemu/thread.h b/include/qemu/thread.h
> index 55d83a907c..8525b0a70a 100644
> --- a/include/qemu/thread.h
> +++ b/include/qemu/thread.h
> @@ -160,6 +160,7 @@ void qemu_thread_get_self(QemuThread *thread);
>  bool qemu_thread_is_self(QemuThread *thread);
>  void qemu_thread_exit(void *retval);
>  void qemu_thread_naming(bool enable);
> +void qemu_thread_yield(void);
>  
>  struct Notifier;
>  /**
> diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
> index 1bf5e65dea..91b12a1082 100644
> --- a/util/qemu-thread-posix.c
> +++ b/util/qemu-thread-posix.c
> @@ -573,3 +573,8 @@ void *qemu_thread_join(QemuThread *thread)
>      }
>      return ret;
>  }
> +
> +void qemu_thread_yield(void)
> +{
> +    pthread_yield();
> +}
> diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
> index 572f88535d..72fe406bef 100644
> --- a/util/qemu-thread-win32.c
> +++ b/util/qemu-thread-win32.c
> @@ -442,3 +442,7 @@ bool qemu_thread_is_self(QemuThread *thread)
>  {
>      return GetCurrentThreadId() == thread->tid;
>  }
> +
> +void qemu_thread_yield(void)
> +{
> +}
Alex Bennée Dec. 20, 2019, 1:11 p.m. UTC | #2
Nicholas Piggin <npiggin@gmail.com> writes:

> This is a bit of proof of concept in case mttcg becomes more important
> yield could be handled like this. You can have by accident or deliberately
> force vCPUs onto the same physical CPU and cause inversion issues when the
> lock holder was preempted by the waiter. This is lightly tested but not
> to the point of measuring performance difference.

Sorry I'm so late replying.

Really this comes down to what EXCP_YIELD semantics are meant to mean.
For ARM it's a hint operation because we also have WFE which should halt
until there is some sort of change of state. In those cases exiting the
main-loop and sitting in wait_for_io should be the correct response. If
a vCPU is suspended waiting on the halt condition doesn't it have the
same effect?

>
> I really consider the previous confer/prod patches more important just to
> provide a more complete guest environment and better test coverage, than
> performance, but maybe someone wants to persue this.
>
> Thanks,
> Nick
> ---
>  cpus.c                   |  6 ++++++
>  hw/ppc/spapr_hcall.c     | 14 +++++++-------
>  include/qemu/thread.h    |  1 +
>  util/qemu-thread-posix.c |  5 +++++
>  util/qemu-thread-win32.c |  4 ++++
>  5 files changed, 23 insertions(+), 7 deletions(-)
>
> diff --git a/cpus.c b/cpus.c
> index 927a00aa90..f036e062d9 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -1760,6 +1760,12 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>                  qemu_mutex_unlock_iothread();
>                  cpu_exec_step_atomic(cpu);
>                  qemu_mutex_lock_iothread();
> +                break;
> +            case EXCP_YIELD:
> +                qemu_mutex_unlock_iothread();
> +                qemu_thread_yield();
> +                qemu_mutex_lock_iothread();
> +                break;
>              default:
>                  /* Ignore everything else? */
>                  break;
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 57c1ee0fe1..9c24a64dfe 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1162,13 +1162,13 @@ static target_ulong h_confer(PowerPCCPU *cpu, SpaprMachineState *spapr,
>              return H_SUCCESS;
>          }
>  
> -        /*
> -         * The targeted confer does not do anything special beyond yielding
> -         * the current vCPU, but even this should be better than nothing.
> -         * At least for single-threaded tcg, it gives the target a chance to
> -         * run before we run again. Multi-threaded tcg does not really do
> -         * anything with EXCP_YIELD yet.
> -         */
> +       /*
> +        * The targeted confer does not do anything special beyond yielding
> +        * the current vCPU, but even this should be better than nothing.
> +        * For single-threaded tcg, it gives the target a chance to run
> +        * before we run again, multi-threaded tcg will yield the CPU to
> +        * another vCPU.
> +        */
>      }
>  
>      cs->exception_index = EXCP_YIELD;
> diff --git a/include/qemu/thread.h b/include/qemu/thread.h
> index 55d83a907c..8525b0a70a 100644
> --- a/include/qemu/thread.h
> +++ b/include/qemu/thread.h
> @@ -160,6 +160,7 @@ void qemu_thread_get_self(QemuThread *thread);
>  bool qemu_thread_is_self(QemuThread *thread);
>  void qemu_thread_exit(void *retval);
>  void qemu_thread_naming(bool enable);
> +void qemu_thread_yield(void);
>  
>  struct Notifier;
>  /**
> diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
> index 1bf5e65dea..91b12a1082 100644
> --- a/util/qemu-thread-posix.c
> +++ b/util/qemu-thread-posix.c
> @@ -573,3 +573,8 @@ void *qemu_thread_join(QemuThread *thread)
>      }
>      return ret;
>  }
> +
> +void qemu_thread_yield(void)
> +{
> +    pthread_yield();
> +}
> diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
> index 572f88535d..72fe406bef 100644
> --- a/util/qemu-thread-win32.c
> +++ b/util/qemu-thread-win32.c
> @@ -442,3 +442,7 @@ bool qemu_thread_is_self(QemuThread *thread)
>  {
>      return GetCurrentThreadId() == thread->tid;
>  }
> +
> +void qemu_thread_yield(void)
> +{
> +}
Nicholas Piggin Jan. 21, 2020, 11:20 a.m. UTC | #3
Alex Bennée's on December 20, 2019 11:11 pm:
> 
> Nicholas Piggin <npiggin@gmail.com> writes:
> 
>> This is a bit of proof of concept in case mttcg becomes more important
>> yield could be handled like this. You can have by accident or deliberately
>> force vCPUs onto the same physical CPU and cause inversion issues when the
>> lock holder was preempted by the waiter. This is lightly tested but not
>> to the point of measuring performance difference.
> 
> Sorry I'm so late replying.

That's fine if you'll also forgive me :)

> Really this comes down to what EXCP_YIELD semantics are meant to mean.
> For ARM it's a hint operation because we also have WFE which should halt
> until there is some sort of change of state. In those cases exiting the
> main-loop and sitting in wait_for_io should be the correct response. If
> a vCPU is suspended waiting on the halt condition doesn't it have the
> same effect?

For powerpc H_CONFER, the vCPU does not want to wait for ever, but just
give up a some time slice on the physical CPU and allow other vCPUs to
run. But it's not necessary that one does run (if they are all sleeping,
the hypervisor must prevent deadlock). How would you wait on such a
conditon?

Thanks,
Nick
Alex Bennée Jan. 21, 2020, 2:37 p.m. UTC | #4
Nicholas Piggin <npiggin@gmail.com> writes:

> Alex Bennée's on December 20, 2019 11:11 pm:
>>
>> Nicholas Piggin <npiggin@gmail.com> writes:
>>
>>> This is a bit of proof of concept in case mttcg becomes more important
>>> yield could be handled like this. You can have by accident or deliberately
>>> force vCPUs onto the same physical CPU and cause inversion issues when the
>>> lock holder was preempted by the waiter. This is lightly tested but not
>>> to the point of measuring performance difference.
>>
>> Sorry I'm so late replying.
>
> That's fine if you'll also forgive me :)
>
>> Really this comes down to what EXCP_YIELD semantics are meant to mean.
>> For ARM it's a hint operation because we also have WFE which should halt
>> until there is some sort of change of state. In those cases exiting the
>> main-loop and sitting in wait_for_io should be the correct response. If
>> a vCPU is suspended waiting on the halt condition doesn't it have the
>> same effect?
>
> For powerpc H_CONFER, the vCPU does not want to wait for ever, but just
> give up a some time slice on the physical CPU and allow other vCPUs to
> run. But it's not necessary that one does run (if they are all sleeping,
> the hypervisor must prevent deadlock). How would you wait on such a
> conditon?

Isn't H_CONFER a hypercall rather than instruction though? In QEMU's TCG
emulation case I would expect it just to exit to the (guest) hypervisor
which then schedules the next (guest) vCPU. It shouldn't be something
QEMU has to deal with.

If you are running QEMU as a KVM monitor this is still outside of it's
scope as all the scheduling shenanigans are dealt with inside the
kernel.

From QEMU's TCG point of view we want to concern ourselves with what the
real hardware would do - which I think in this case is drop to the
hypervisor and let it sort it out.

--
Alex Bennée
David Gibson Jan. 22, 2020, 3:26 a.m. UTC | #5
On Tue, Jan 21, 2020 at 02:37:59PM +0000, Alex Bennée wrote:
> 
> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > Alex Bennée's on December 20, 2019 11:11 pm:
> >>
> >> Nicholas Piggin <npiggin@gmail.com> writes:
> >>
> >>> This is a bit of proof of concept in case mttcg becomes more important
> >>> yield could be handled like this. You can have by accident or deliberately
> >>> force vCPUs onto the same physical CPU and cause inversion issues when the
> >>> lock holder was preempted by the waiter. This is lightly tested but not
> >>> to the point of measuring performance difference.
> >>
> >> Sorry I'm so late replying.
> >
> > That's fine if you'll also forgive me :)
> >
> >> Really this comes down to what EXCP_YIELD semantics are meant to mean.
> >> For ARM it's a hint operation because we also have WFE which should halt
> >> until there is some sort of change of state. In those cases exiting the
> >> main-loop and sitting in wait_for_io should be the correct response. If
> >> a vCPU is suspended waiting on the halt condition doesn't it have the
> >> same effect?
> >
> > For powerpc H_CONFER, the vCPU does not want to wait for ever, but just
> > give up a some time slice on the physical CPU and allow other vCPUs to
> > run. But it's not necessary that one does run (if they are all sleeping,
> > the hypervisor must prevent deadlock). How would you wait on such a
> > conditon?
> 
> Isn't H_CONFER a hypercall rather than instruction though? In QEMU's TCG
> emulation case I would expect it just to exit to the (guest) hypervisor
> which then schedules the next (guest) vCPU. It shouldn't be something
> QEMU has to deal with.

That's true if you're emulating a whole system complete with
hypervisor under TCG, which is what the "pnv" machine type does.

However, a more common use of qemu is the "pseries" machine type,
which emulates only a guest (in the cpu architectural sense) with qemu
taking the place of the hypervisor as well as emulating the cpus.  In
that case the H_CONFER hypercall goes to qemu.

> If you are running QEMU as a KVM monitor this is still outside of it's
> scope as all the scheduling shenanigans are dealt with inside the
> kernel.
> 
> From QEMU's TCG point of view we want to concern ourselves with what the
> real hardware would do - which I think in this case is drop to the
> hypervisor and let it sort it out.

Right, but with the "pseries" machine type qemu *is* the hypervisor.
Richard Henderson Jan. 22, 2020, 6:01 p.m. UTC | #6
On 1/21/20 5:26 PM, David Gibson wrote:
> However, a more common use of qemu is the "pseries" machine type,
> which emulates only a guest (in the cpu architectural sense) with qemu
> taking the place of the hypervisor as well as emulating the cpus.  In
> that case the H_CONFER hypercall goes to qemu.
> 
>> If you are running QEMU as a KVM monitor this is still outside of it's
>> scope as all the scheduling shenanigans are dealt with inside the
>> kernel.
>>
>> From QEMU's TCG point of view we want to concern ourselves with what the
>> real hardware would do - which I think in this case is drop to the
>> hypervisor and let it sort it out.
> 
> Right, but with the "pseries" machine type qemu *is* the hypervisor.

In which case this behaviour doesn't seem implausible.

I will note that "pthread_yield" isn't standardized; "sched_yield" is the one
in POSIX.  Though that says nothing about how that syscall might affect a
hypothetical many-to-one pthread implementation.  You could, I suppose, have a
configure test for pthread_yield.

Also, the win32 implementation would be SwitchToThread():

https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-switchtothread

It looks like one need do nothing for the single-threaded implementation,
qemu_tcg_rr_cpu_thread_fn, as any return to the main loop will select the next
round-robin cpu.  But a note to say that's been tested would be nice.


r~
diff mbox series

Patch

diff --git a/cpus.c b/cpus.c
index 927a00aa90..f036e062d9 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1760,6 +1760,12 @@  static void *qemu_tcg_cpu_thread_fn(void *arg)
                 qemu_mutex_unlock_iothread();
                 cpu_exec_step_atomic(cpu);
                 qemu_mutex_lock_iothread();
+                break;
+            case EXCP_YIELD:
+                qemu_mutex_unlock_iothread();
+                qemu_thread_yield();
+                qemu_mutex_lock_iothread();
+                break;
             default:
                 /* Ignore everything else? */
                 break;
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 57c1ee0fe1..9c24a64dfe 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1162,13 +1162,13 @@  static target_ulong h_confer(PowerPCCPU *cpu, SpaprMachineState *spapr,
             return H_SUCCESS;
         }
 
-        /*
-         * The targeted confer does not do anything special beyond yielding
-         * the current vCPU, but even this should be better than nothing.
-         * At least for single-threaded tcg, it gives the target a chance to
-         * run before we run again. Multi-threaded tcg does not really do
-         * anything with EXCP_YIELD yet.
-         */
+       /*
+        * The targeted confer does not do anything special beyond yielding
+        * the current vCPU, but even this should be better than nothing.
+        * For single-threaded tcg, it gives the target a chance to run
+        * before we run again, multi-threaded tcg will yield the CPU to
+        * another vCPU.
+        */
     }
 
     cs->exception_index = EXCP_YIELD;
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index 55d83a907c..8525b0a70a 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -160,6 +160,7 @@  void qemu_thread_get_self(QemuThread *thread);
 bool qemu_thread_is_self(QemuThread *thread);
 void qemu_thread_exit(void *retval);
 void qemu_thread_naming(bool enable);
+void qemu_thread_yield(void);
 
 struct Notifier;
 /**
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index 1bf5e65dea..91b12a1082 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -573,3 +573,8 @@  void *qemu_thread_join(QemuThread *thread)
     }
     return ret;
 }
+
+void qemu_thread_yield(void)
+{
+    pthread_yield();
+}
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index 572f88535d..72fe406bef 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -442,3 +442,7 @@  bool qemu_thread_is_self(QemuThread *thread)
 {
     return GetCurrentThreadId() == thread->tid;
 }
+
+void qemu_thread_yield(void)
+{
+}