
[v2,2/2] credit: Don't steal vcpus which have yielded

Message ID 20230921122352.2307574-2-george.dunlap@cloud.com (mailing list archive)
State New, archived
Series [v2,1/2] credit: Limit load balancing to once per millisecond

Commit Message

George Dunlap Sept. 21, 2023, 12:23 p.m. UTC
On large systems with many vcpus yielding due to spinlock priority
inversion, it's not uncommon for a vcpu to yield its timeslice, only
to be immediately stolen by another pcpu looking for higher-priority
work.

To prevent this:

* Keep the YIELD flag until a vcpu is removed from a runqueue

* When looking for work to steal, skip vcpus which have yielded

NB that this does mean that sometimes a VM is inserted into an empty
runqueue; handle that case.

Signed-off-by: George Dunlap <george.dunlap@cloud.com>
---
Changes since v1:
- Moved a comment tweak to the right patch

CC: Dario Faggioli <dfaggioli@suse.com>
---
 xen/common/sched/credit.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)
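
To make the steal-side rule concrete, here is a minimal standalone C model
of the behaviour this patch introduces. It is illustrative only and NOT the
Xen code: struct model_unit and pick_steal_candidate() are made-up names
standing in for struct csched_unit and the scan loop in csched_runq_steal().

/*
 * Minimal standalone model of the steal-side rule (illustrative only,
 * NOT the Xen code).
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct model_unit {
    const char *name;
    int pri;        /* higher value = higher priority */
    bool yielded;   /* stands in for CSCHED_FLAG_UNIT_YIELD */
};

/*
 * Scan a peer runqueue (sorted highest priority first) and return the
 * first unit worth stealing: it must beat the local priority and must
 * not have yielded.
 */
static const struct model_unit *
pick_steal_candidate(const struct model_unit *runq, size_t len, int local_pri)
{
    for ( size_t i = 0; i < len; i++ )
    {
        if ( runq[i].pri <= local_pri )
            break;              /* rest of the queue is no better than us */
        if ( runq[i].yielded )
            continue;           /* don't steal a unit which has yielded */
        return &runq[i];
    }

    return NULL;
}

int main(void)
{
    const struct model_unit peer_runq[] = {
        { "v1", 10, true  },    /* yielded: previously a prime steal target */
        { "v2", 10, false },
        { "v3",  0, false },
    };
    const struct model_unit *pick =
        pick_steal_candidate(peer_runq, 3, 5 /* local priority */);

    printf("steal %s\n", pick ? pick->name : "nothing");

    return 0;
}

Compiled standalone, this prints "steal v2": the yielded unit v1 is passed
over even though its priority would otherwise make it worth stealing.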

Comments

George Dunlap Sept. 21, 2023, 2:14 p.m. UTC | #1
On Thu, Sep 21, 2023 at 1:23 PM George Dunlap <george.dunlap@cloud.com> wrote:
>
> On large systems with many vcpus yielding due to spinlock priority
> inversion, it's not uncommon for a vcpu to yield its timeslice, only
> to be immediately stolen by another pcpu looking for higher-priority
> work.
>
> To prevent this:
>
> * Keep the YIELD flag until a vcpu is removed from a runqueue
>
> * When looking for work to steal, skip vcpus which have yielded
>
> NB that this does mean that sometimes a VM is inserted into an empty
> runqueue; handle that case.
>
> Signed-off-by: George Dunlap <george.dunlap@cloud.com>

Marcus,

Just wanted to verify my interpretation of the testing you did of this
patch several months ago:

1. On the problematic workload, mean execution time for the task under
heavy load was around 12 seconds
2. With only patch 2 of this series (0004 in your tests), mean
execution time under heavy load was around 5 seconds
3. With only patch 1 of this series (0003 in your tests), mean
execution time under heavy load was around 3 seconds
4. With both patch 1 and patch 2 of this series (0003+0004 in your
tests), mean execution time under heavy load was also around 3 seconds

So both patches independently exhibit an improvement; but the combined
effect is about the same as the first patch.

Assuming those results are accurate, I would argue that we should take
both patches.  Does anyone want to argue we should only take the first
one?

 -George
Henry Wang Sept. 22, 2023, 1:30 a.m. UTC | #2
Hi George,

> On Sep 21, 2023, at 20:23, George Dunlap <george.dunlap@cloud.com> wrote:
> 
> On large systems with many vcpus yielding due to spinlock priority
> inversion, it's not uncommon for a vcpu to yield its timeslice, only
> to be immediately stolen by another pcpu looking for higher-priority
> work.
> 
> To prevent this:
> 
> * Keep the YIELD flag until a vcpu is removed from a runqueue
> 
> * When looking for work to steal, skip vcpus which have yielded
> 
> NB that this does mean that sometimes a VM is inserted into an empty
> runqueue; handle that case.
> 
> Signed-off-by: George Dunlap <george.dunlap@cloud.com>

Release-acked-by: Henry Wang <Henry.Wang@arm.com>

Kind regards,
Henry
Jürgen Groß Sept. 22, 2023, 7:49 a.m. UTC | #3
On 21.09.23 14:23, George Dunlap wrote:
> On large systems with many vcpus yielding due to spinlock priority
> inversion, it's not uncommon for a vcpu to yield its timeslice, only
> to be immediately stolen by another pcpu looking for higher-priority
> work.
> 
> To prevent this:
> 
> * Keep the YIELD flag until a vcpu is removed from a runqueue
> 
> * When looking for work to steal, skip vcpus which have yielded
> 
> NB that this does mean that sometimes a VM is inserted into an empty
> runqueue; handle that case.
> 
> Signed-off-by: George Dunlap <george.dunlap@cloud.com>
> ---
> Changes since v1:
> - Moved a comment tweak to the right patch
> 
> CC: Dario Faggioli <dfaggioli@suse.com>
> ---
>   xen/common/sched/credit.c | 25 ++++++++++++++-----------
>   1 file changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/xen/common/sched/credit.c b/xen/common/sched/credit.c
> index 5c06f596d2..38a6f6fa6d 100644
> --- a/xen/common/sched/credit.c
> +++ b/xen/common/sched/credit.c
> @@ -298,14 +298,10 @@ __runq_insert(struct csched_unit *svc)
>        * runnable unit if we can.  The next runq_sort will bring it forward
>        * within 30ms if the queue too long. */
>       if ( test_bit(CSCHED_FLAG_UNIT_YIELD, &svc->flags)
> -         && __runq_elem(iter)->pri > CSCHED_PRI_IDLE )
> -    {
> +         && __runq_elem(iter)->pri > CSCHED_PRI_IDLE
> +         && iter->next != runq)

Style

>           iter=iter->next;
>   
> -        /* Some sanity checks */
> -        BUG_ON(iter == runq);
> -    }
> -
>       list_add_tail(&svc->runq_elem, iter);
>   }
>   
> @@ -321,6 +317,11 @@ __runq_remove(struct csched_unit *svc)
>   {
>       BUG_ON( !__unit_on_runq(svc) );
>       list_del_init(&svc->runq_elem);
> +
> +    /*
> +     * Clear YIELD flag when scheduling back in
> +     */
> +    clear_bit(CSCHED_FLAG_UNIT_YIELD, &svc->flags);
>   }
>   
>   static inline void
> @@ -1637,6 +1638,13 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>           if ( speer->pri <= pri )
>               break;
>   
> +        /*
> +         * Don't steal a UNIT which has yielded; it's waiting for a
> +         * reason
> +         */
> +        if (test_bit(CSCHED_FLAG_UNIT_YIELD, &speer->flags))

Style

> +            continue;
> +
>           /* Is this UNIT runnable on our PCPU? */
>           unit = speer->unit;
>           BUG_ON( is_idle_unit(unit) );
> @@ -1954,11 +1962,6 @@ static void cf_check csched_schedule(
>           dec_nr_runnable(sched_cpu);
>       }
>   
> -    /*
> -     * Clear YIELD flag before scheduling out
> -     */
> -    clear_bit(CSCHED_FLAG_UNIT_YIELD, &scurr->flags);
> -
>       do {
>           snext = __runq_elem(runq->next);
>   

With the style issues fixed:

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen
Marcus Granado Oct. 10, 2023, 4:58 p.m. UTC | #4
On Thu, 21 Sept 2023 at 15:14, George Dunlap <george.dunlap@cloud.com> wrote:
>
> On Thu, Sep 21, 2023 at 1:23 PM George Dunlap <george.dunlap@cloud.com> wrote:
> >
> > On large systems with many vcpus yielding due to spinlock priority
> > inversion, it's not uncommon for a vcpu to yield its timeslice, only
> > to be immediately stolen by another pcpu looking for higher-priority
> > work.
> >
> > To prevent this:
> >
> > * Keep the YIELD flag until a vcpu is removed from a runqueue
> >
> > * When looking for work to steal, skip vcpus which have yielded
> >
>
> Marcus,
>
> Just wanted to verify my interpretation of the testing you did of this
> patch several months ago:
>
> 1. On the problematic workload, mean execution time for the task under
> heavy load was around 12 seconds
> 2. With only patch 2 of this series (0004 in your tests), mean
> execution time under heavy load was around 5 seconds
> 3. With only patch 1 of this series (0003 in your tests), mean
> execution time under heavy load was around 3 seconds
> 4. With both patch 1 and patch 2 of this series (0003+0004 in your
> tests), mean execution time under heavy load was also around 3 seconds
>
> So both patches independently exhibit an improvement; but the combined
> effect is about the same as the first patch.
>
Yes, thanks for the summary, that's pretty much it.
Marcus

Patch

diff --git a/xen/common/sched/credit.c b/xen/common/sched/credit.c
index 5c06f596d2..38a6f6fa6d 100644
--- a/xen/common/sched/credit.c
+++ b/xen/common/sched/credit.c
@@ -298,14 +298,10 @@  __runq_insert(struct csched_unit *svc)
      * runnable unit if we can.  The next runq_sort will bring it forward
      * within 30ms if the queue too long. */
     if ( test_bit(CSCHED_FLAG_UNIT_YIELD, &svc->flags)
-         && __runq_elem(iter)->pri > CSCHED_PRI_IDLE )
-    {
+         && __runq_elem(iter)->pri > CSCHED_PRI_IDLE
+         && iter->next != runq)
         iter=iter->next;
 
-        /* Some sanity checks */
-        BUG_ON(iter == runq);
-    }
-
     list_add_tail(&svc->runq_elem, iter);
 }
 
@@ -321,6 +317,11 @@  __runq_remove(struct csched_unit *svc)
 {
     BUG_ON( !__unit_on_runq(svc) );
     list_del_init(&svc->runq_elem);
+
+    /*
+     * Clear YIELD flag when scheduling back in
+     */
+    clear_bit(CSCHED_FLAG_UNIT_YIELD, &svc->flags);
 }
 
 static inline void
@@ -1637,6 +1638,13 @@  csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
         if ( speer->pri <= pri )
             break;
 
+        /*
+         * Don't steal a UNIT which has yielded; it's waiting for a
+         * reason
+         */
+        if (test_bit(CSCHED_FLAG_UNIT_YIELD, &speer->flags))
+            continue;
+
         /* Is this UNIT runnable on our PCPU? */
         unit = speer->unit;
         BUG_ON( is_idle_unit(unit) );
@@ -1954,11 +1962,6 @@  static void cf_check csched_schedule(
         dec_nr_runnable(sched_cpu);
     }
 
-    /*
-     * Clear YIELD flag before scheduling out
-     */
-    clear_bit(CSCHED_FLAG_UNIT_YIELD, &scurr->flags);
-
     do {
         snext = __runq_elem(runq->next);
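
On the insert side, the reason the old BUG_ON can no longer stand is that
the YIELD flag now survives until the unit actually leaves the runqueue, so
__runq_insert() can see a yielding unit headed for a runqueue with nothing
after the chosen position, including a completely empty runqueue. The toy
sketch below shows the guarded insert under the assumption of a simplified
circular-list API; it is illustrative only and NOT the Xen code.

/*
 * Toy model of the guarded insert (illustrative only, NOT the Xen code):
 * struct node stands in for Xen's struct list_head embedded in a
 * struct csched_unit, and runq_insert() for the simplified shape of
 * __runq_insert() after this patch.
 */
#include <stdbool.h>
#include <stdio.h>

struct node {
    struct node *next, *prev;
    int pri;
    bool yielded;
};

static void list_init(struct node *head)
{
    head->next = head->prev = head;
}

/* Insert 'new' just before 'pos' in the circular list. */
static void list_add_tail(struct node *new, struct node *pos)
{
    new->prev = pos->prev;
    new->next = pos;
    pos->prev->next = new;
    pos->prev = new;
}

/*
 * Walk to the first entry with lower priority; if the inserted unit
 * yielded, step past one more entry, but only when such an entry exists
 * ("iter != runq" stands in for the pri > CSCHED_PRI_IDLE check and
 * "iter->next != runq" for the new guard replacing the BUG_ON).
 */
static void runq_insert(struct node *runq, struct node *svc)
{
    struct node *iter;

    for ( iter = runq->next; iter != runq; iter = iter->next )
        if ( svc->pri > iter->pri )
            break;

    if ( svc->yielded && iter != runq && iter->next != runq )
        iter = iter->next;

    list_add_tail(svc, iter);
}

int main(void)
{
    struct node runq, v = { .pri = 5, .yielded = true };

    list_init(&runq);
    runq_insert(&runq, &v);     /* empty runqueue: must not step off the head */

    printf("runq %s empty\n", runq.next == &runq ? "is" : "is not");

    return 0;
}

Built standalone, this prints "runq is not empty", i.e. inserting a yielded
unit into an empty queue now completes without walking onto the list head.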