From patchwork Mon May  6 06:56:41 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?SsO8cmdlbiBHcm/Dnw==?= <jgross@suse.com>
X-Patchwork-Id: 10930547
Return-Path: <xen-devel-bounces@lists.xenproject.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 53DEB1390
	for <patchwork-xen-devel@patchwork.kernel.org>;
 Mon,  6 May 2019 06:58:52 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 42A0028590
	for <patchwork-xen-devel@patchwork.kernel.org>;
 Mon,  6 May 2019 06:58:52 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 367F228627; Mon,  6 May 2019 06:58:52 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1
Received: from lists.xenproject.org (lists.xenproject.org [192.237.175.120])
	(using TLSv1.2 with cipher AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 4948628590
	for <patchwork-xen-devel@patchwork.kernel.org>;
 Mon,  6 May 2019 06:58:51 +0000 (UTC)
Received: from localhost ([127.0.0.1] helo=lists.xenproject.org)
	by lists.xenproject.org with esmtp (Exim 4.89)
	(envelope-from <xen-devel-bounces@lists.xenproject.org>)
	id 1hNXZ4-00030Z-Pt; Mon, 06 May 2019 06:57:30 +0000
Received: from all-amaz-eas1.inumbo.com ([34.197.232.57]
 helo=us1-amaz-eas2.inumbo.com)
 by lists.xenproject.org with esmtp (Exim 4.89)
 (envelope-from <SRS0=XZcu=TG=suse.com=jgross@srs-us1.protection.inumbo.net>)
 id 1hNXYm-0002Ia-Q5
 for xen-devel@lists.xenproject.org; Mon, 06 May 2019 06:57:12 +0000
X-Inumbo-ID: 27512e46-6fcc-11e9-a316-8fe7f57547fe
Received: from mx1.suse.de (unknown [195.135.220.15])
 by us1-amaz-eas2.inumbo.com (Halon) with ESMTPS
 id 27512e46-6fcc-11e9-a316-8fe7f57547fe;
 Mon, 06 May 2019 06:57:04 +0000 (UTC)
X-Virus-Scanned: by amavisd-new at test-mx.suse.de
Received: from relay2.suse.de (unknown [195.135.220.254])
 by mx1.suse.de (Postfix) with ESMTP id 0C55EAD03;
 Mon,  6 May 2019 06:57:00 +0000 (UTC)
From: Juergen Gross <jgross@suse.com>
To: xen-devel@lists.xenproject.org
Date: Mon,  6 May 2019 08:56:41 +0200
Message-Id: <20190506065644.7415-43-jgross@suse.com>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20190506065644.7415-1-jgross@suse.com>
References: <20190506065644.7415-1-jgross@suse.com>
Subject: [Xen-devel] [PATCH RFC V2 42/45] xen/sched: add fall back to idle
 vcpu when scheduling item
X-BeenThere: xen-devel@lists.xenproject.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Xen developer discussion <xen-devel.lists.xenproject.org>
List-Unsubscribe: <https://lists.xenproject.org/mailman/options/xen-devel>,
 <mailto:xen-devel-request@lists.xenproject.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xenproject.org>
List-Help: <mailto:xen-devel-request@lists.xenproject.org?subject=help>
List-Subscribe: <https://lists.xenproject.org/mailman/listinfo/xen-devel>,
 <mailto:xen-devel-request@lists.xenproject.org?subject=subscribe>
Cc: Juergen Gross <jgross@suse.com>,
 Stefano Stabellini <sstabellini@kernel.org>, Wei Liu <wei.liu2@citrix.com>,
 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
 George Dunlap <George.Dunlap@eu.citrix.com>,
 Andrew Cooper <andrew.cooper3@citrix.com>,
 Ian Jackson <ian.jackson@eu.citrix.com>, Tim Deegan <tim@xen.org>,
 Julien Grall <julien.grall@arm.com>, Jan Beulich <jbeulich@suse.com>,
 Dario Faggioli <dfaggioli@suse.com>,
 =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>
MIME-Version: 1.0
Errors-To: xen-devel-bounces@lists.xenproject.org
Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org>
X-Virus-Scanned: ClamAV using ClamSMTP

When scheduling an item with multiple vcpus there is no guarantee all
vcpus are available (e.g. above maxvcpus or vcpu offline). Fall back to
idle vcpu of the current cpu in that case. This requires to store the
correct schedule_item pointer in the idle vcpu as long as it used as
fallback vcpu.

In order to modify the runstates of the correct vcpus when switching
schedule items merge sched_item_runstate_change() into
sched_switch_items() and loop over the affected physical cpus instead
of the item's vcpus. This in turn requires an access function to the
current variable of other cpus.

Today context_saved() is called in case previous and next vcpus differ
when doing a context switch. With an idle vcpu being capable to be a
substitute for an offline vcpu this is problematic when switching to
an idle scheduling item. An idle previous vcpu leaves us in doubt which
schedule item was active previously, so save the previous item pointer
in the per-schedule resource area and use its value being non-NULL as
a hint whether context_saved() should be called.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
RFC V2: new patch (Andrew Cooper)
---
 xen/arch/x86/domain.c         |  21 ++++++
 xen/common/schedule.c         | 153 +++++++++++++++++++++++++++---------------
 xen/include/asm-arm/current.h |   1 +
 xen/include/asm-x86/current.h |   7 +-
 xen/include/asm-x86/smp.h     |   3 +
 xen/include/xen/sched-if.h    |   4 +-
 6 files changed, 134 insertions(+), 55 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index d04e704116..f3dbca5dba 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -154,6 +154,24 @@ static void idle_loop(void)
     }
 }
 
+/*
+ * Idle loop for siblings of active schedule items.
+ * We don't do any standard idle work like tasklets, page scrubbing or
+ * livepatching.
+ * Use default_idle() in order to simulate v->is_urgent.
+ */
+static void guest_idle_loop(void)
+{
+    unsigned int cpu = smp_processor_id();
+
+    for ( ; ; )
+    {
+        if ( !softirq_pending(cpu) )
+            default_idle();
+        do_softirq();
+    }
+}
+
 void startup_cpu_idle_loop(void)
 {
     struct vcpu *v = current;
@@ -167,6 +185,9 @@ void startup_cpu_idle_loop(void)
 
 static void noreturn continue_idle_domain(struct vcpu *v)
 {
+    if ( !is_idle_item(v->sched_item) )
+        reset_stack_and_jump(guest_idle_loop);
+
     reset_stack_and_jump(idle_loop);
 }
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 0de199ccc9..788ecc9e81 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -82,7 +82,18 @@ static struct scheduler __read_mostly ops;
 static inline struct vcpu *sched_item2vcpu_cpu(struct sched_item *item,
                                                unsigned int cpu)
 {
-    return item->domain->vcpu[item->item_id + per_cpu(sched_res_idx, cpu)];
+    unsigned int idx = item->item_id + per_cpu(sched_res_idx, cpu);
+    const struct domain *d = item->domain;
+    struct vcpu *v;
+
+    if ( idx < d->max_vcpus && d->vcpu[idx] )
+    {
+        v = d->vcpu[idx];
+        if ( v->new_state == RUNSTATE_running )
+            return v;
+    }
+
+    return idle_vcpu[cpu];
 }
 
 static inline struct scheduler *dom_scheduler(const struct domain *d)
@@ -196,8 +207,11 @@ static inline void vcpu_runstate_change(
 
     trace_runstate_change(v, new_state);
 
-    item->runstate_cnt[v->runstate.state]--;
-    item->runstate_cnt[new_state]++;
+    if ( !is_idle_vcpu(v) )
+    {
+        item->runstate_cnt[v->runstate.state]--;
+        item->runstate_cnt[new_state]++;
+    }
 
     delta = new_entry_time - v->runstate.state_entry_time;
     if ( delta > 0 )
@@ -209,21 +223,6 @@ static inline void vcpu_runstate_change(
     v->runstate.state = new_state;
 }
 
-static inline void sched_item_runstate_change(struct sched_item *item,
-    bool running, s_time_t new_entry_time)
-{
-    struct vcpu *v;
-
-    for_each_sched_item_vcpu( item, v )
-        if ( running )
-            vcpu_runstate_change(v, v->new_state, new_entry_time);
-        else
-            vcpu_runstate_change(v,
-                ((v->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
-                 (vcpu_runnable(v) ? RUNSTATE_runnable : RUNSTATE_offline)),
-                new_entry_time);
-}
-
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate)
 {
     spinlock_t *lock = likely(v == current)
@@ -456,6 +455,7 @@ int sched_init_vcpu(struct vcpu *v)
     if ( is_idle_domain(d) )
     {
         per_cpu(sched_res, v->processor)->curr = item;
+        per_cpu(sched_res, v->processor)->sched_item_idle = item;
         v->is_running = 1;
         item->is_running = 1;
         item->state_entry_time = NOW();
@@ -1631,33 +1631,67 @@ static void sched_switch_items(struct sched_resource *sd,
                                struct sched_item *next, struct sched_item *prev,
                                s_time_t now)
 {
-    sd->curr = next;
-
-    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->item_id,
-             now - prev->state_entry_time);
-    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->item_id,
-             (next->vcpu->runstate.state == RUNSTATE_runnable) ?
-             (now - next->state_entry_time) : 0, prev->next_time);
+    int cpu;
 
     ASSERT(item_running(prev));
 
-    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->item_id,
-             next->domain->domain_id, next->item_id);
+    if ( prev != next )
+    {
+        sd->curr = next;
+        sd->prev = prev;
 
-    sched_item_runstate_change(prev, false, now);
-    prev->last_run_time = now;
+        TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id,
+                 prev->item_id, now - prev->state_entry_time);
+        TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id,
+                 next->item_id,
+                 (next->vcpu->runstate.state == RUNSTATE_runnable) ?
+                 (now - next->state_entry_time) : 0, prev->next_time);
+        TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->item_id,
+                 next->domain->domain_id, next->item_id);
 
-    ASSERT(!item_running(next));
-    sched_item_runstate_change(next, true, now);
+        prev->last_run_time = now;
 
-    /*
-     * NB. Don't add any trace records from here until the actual context
-     * switch, else lost_records resume will not work properly.
-     */
+        ASSERT(!item_running(next));
+
+        /*
+         * NB. Don't add any trace records from here until the actual context
+         * switch, else lost_records resume will not work properly.
+         */
+
+        ASSERT(!next->is_running);
+        next->is_running = 1;
 
-    ASSERT(!next->is_running);
-    next->vcpu->is_running = 1;
-    next->is_running = 1;
+        if ( is_idle_item(prev) )
+        {
+            prev->runstate_cnt[RUNSTATE_running] = 0;
+            prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
+        }
+        if ( is_idle_item(next) )
+        {
+            next->runstate_cnt[RUNSTATE_running] = sched_granularity;
+            next->runstate_cnt[RUNSTATE_runnable] = 0;
+        }
+    }
+
+    for_each_cpu( cpu, sd->cpus )
+    {
+        struct vcpu *vprev = get_cpu_current(cpu);
+        struct vcpu *vnext = sched_item2vcpu_cpu(next, cpu);
+
+        if ( vprev != vnext || vprev->runstate.state != vnext->new_state )
+        {
+            vcpu_runstate_change(vprev,
+                ((vprev->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
+                 (vcpu_runnable(vprev) ? RUNSTATE_runnable : RUNSTATE_offline)),
+                now);
+            vcpu_runstate_change(vnext, vnext->new_state, now);
+        }
+
+        vnext->is_running = 1;
+
+        if ( is_idle_vcpu(vnext) )
+            vnext->sched_item = next;
+    }
 }
 
 static bool sched_tasklet_check(void)
@@ -1706,25 +1740,25 @@ static struct sched_item *do_schedule(struct sched_item *prev, s_time_t now)
     if ( prev->next_time >= 0 ) /* -ve means no limit */
         set_timer(&sd->s_timer, now + prev->next_time);
 
-    if ( likely(prev != next) )
-        sched_switch_items(sd, next, prev, now);
+    sched_switch_items(sd, next, prev, now);
 
     return next;
 }
 
-static void context_saved(struct vcpu *prev)
+static void context_saved(struct sched_item *item)
 {
-    struct sched_item *item = prev->sched_item;
-
     item->is_running = 0;
     item->state_entry_time = NOW();
+    this_cpu(sched_res)->prev = NULL;
 
     /* Check for migration request /after/ clearing running flag. */
     smp_mb();
 
-    sched_context_saved(vcpu_scheduler(prev), item);
+    sched_context_saved(vcpu_scheduler(item->vcpu), item);
 
-    sched_item_migrate_finish(item);
+    /* Idle never migrates and idle vcpus might belong to other items. */
+    if ( !is_idle_item(item) )
+        sched_item_migrate_finish(item);
 }
 
 /*
@@ -1741,11 +1775,13 @@ static void context_saved(struct vcpu *prev)
 void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
 {
     struct sched_item *next = vnext->sched_item;
+    struct sched_resource *sd = this_cpu(sched_res);
 
     /* Clear running flag /after/ writing context to memory. */
     smp_wmb();
 
-    vprev->is_running = 0;
+    if ( vprev != vnext )
+        vprev->is_running = 0;
 
     if ( atomic_read(&next->rendezvous_out_cnt) )
     {
@@ -1754,20 +1790,23 @@ void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
         /* Call context_saved() before releasing other waiters. */
         if ( cnt == 1 )
         {
-            if ( vprev != vnext )
-                context_saved(vprev);
+            if ( sd->prev )
+                context_saved(sd->prev);
             atomic_set(&next->rendezvous_out_cnt, 0);
         }
         else
             while ( atomic_read(&next->rendezvous_out_cnt) )
                 cpu_relax();
     }
-    else if ( vprev != vnext && sched_granularity == 1 )
-        context_saved(vprev);
+    else if ( sd->prev )
+        context_saved(sd->prev);
+
+    if ( is_idle_vcpu(vprev) && vprev != vnext )
+        vprev->sched_item = sd->sched_item_idle;
 }
 
 static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
-                                 s_time_t now)
+                                 bool reset_idle_item, s_time_t now)
 {
     if ( unlikely(vprev == vnext) )
     {
@@ -1776,6 +1815,10 @@ static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
                  now - vprev->runstate.state_entry_time,
                  vprev->sched_item->next_time);
         sched_context_switched(vprev, vnext);
+
+        if ( reset_idle_item )
+            vnext->sched_item = this_cpu(sched_res)->sched_item_idle;
+
         trace_continue_running(vnext);
         return continue_running(vprev);
     }
@@ -1851,7 +1894,8 @@ static void sched_slave(void)
 
     pcpu_schedule_unlock_irq(lock, cpu);
 
-    sched_context_switch(vprev, sched_item2vcpu_cpu(next, cpu), now);
+    sched_context_switch(vprev, sched_item2vcpu_cpu(next, cpu),
+                         is_idle_item(next) && !is_idle_item(prev), now);
 }
 
 /*
@@ -1911,7 +1955,8 @@ static void schedule(void)
     pcpu_schedule_unlock_irq(lock, cpu);
 
     vnext = sched_item2vcpu_cpu(next, cpu);
-    sched_context_switch(vprev, vnext, now);
+    sched_context_switch(vprev, vnext,
+                         !is_idle_item(prev) && is_idle_item(next), now);
 }
 
 /* The scheduler timer: force a run through the scheduler */
@@ -1993,6 +2038,7 @@ static int cpu_schedule_up(unsigned int cpu)
         return -ENOMEM;
 
     sd->curr = idle_vcpu[cpu]->sched_item;
+    sd->sched_item_idle = idle_vcpu[cpu]->sched_item;
 
     /*
      * We don't want to risk calling xfree() on an sd->sched_priv
@@ -2170,6 +2216,7 @@ void __init scheduler_init(void)
     if ( vcpu_create(idle_domain, 0) == NULL )
         BUG();
     this_cpu(sched_res)->curr = idle_vcpu[0]->sched_item;
+    this_cpu(sched_res)->sched_item_idle = idle_vcpu[0]->sched_item;
     this_cpu(sched_res)->sched_priv = sched_alloc_pdata(&ops, 0);
     BUG_ON(IS_ERR(this_cpu(sched_res)->sched_priv));
     scheduler_percpu_init(0);
diff --git a/xen/include/asm-arm/current.h b/xen/include/asm-arm/current.h
index c4af66fbb9..a7602eef8c 100644
--- a/xen/include/asm-arm/current.h
+++ b/xen/include/asm-arm/current.h
@@ -18,6 +18,7 @@ DECLARE_PER_CPU(struct vcpu *, curr_vcpu);
 
 #define current            (this_cpu(curr_vcpu))
 #define set_current(vcpu)  do { current = (vcpu); } while (0)
+#define get_cpu_current(cpu)  (per_cpu(curr_vcpu, cpu))
 
 /* Per-VCPU state that lives at the top of the stack */
 struct cpu_info {
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 5bd64b2271..cb5b6f1176 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -76,6 +76,11 @@ struct cpu_info {
     /* get_stack_bottom() must be 16-byte aligned */
 };
 
+static inline struct cpu_info *get_cpu_info_from_stack(unsigned long sp)
+{
+    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
+}
+
 static inline struct cpu_info *get_cpu_info(void)
 {
 #ifdef __clang__
@@ -86,7 +91,7 @@ static inline struct cpu_info *get_cpu_info(void)
     register unsigned long sp asm("rsp");
 #endif
 
-    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
+    return get_cpu_info_from_stack(sp);
 }
 
 #define get_current()         (get_cpu_info()->current_vcpu)
diff --git a/xen/include/asm-x86/smp.h b/xen/include/asm-x86/smp.h
index 9f533f9072..51a31ab00a 100644
--- a/xen/include/asm-x86/smp.h
+++ b/xen/include/asm-x86/smp.h
@@ -76,6 +76,9 @@ void set_nr_sockets(void);
 /* Representing HT and core siblings in each socket. */
 extern cpumask_t **socket_cpumask;
 
+#define get_cpu_current(cpu) \
+    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
+
 #endif /* !__ASSEMBLY__ */
 
 #endif
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index b3921f3a41..8981d41629 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -39,6 +39,8 @@ struct sched_resource {
     spinlock_t         *schedule_lock,
                        _lock;
     struct sched_item  *curr;           /* current task                    */
+    struct sched_item  *sched_item_idle;
+    struct sched_item  *prev;           /* previous task                   */
     void               *sched_priv;
     struct timer        s_timer;        /* scheduling timer                */
     atomic_t            urgent_count;   /* how many urgent vcpus           */
@@ -152,7 +154,7 @@ static inline void sched_clear_pause_flags_atomic(struct sched_item *item,
 
 static inline struct sched_item *sched_idle_item(unsigned int cpu)
 {
-    return idle_vcpu[cpu]->sched_item;
+    return per_cpu(sched_res, cpu)->sched_item_idle;
 }
 
 static inline unsigned int sched_get_resource_cpu(unsigned int cpu)