From patchwork Fri Mar 29 15:09:24 2019
X-Patchwork-Submitter: Juergen Gross
X-Patchwork-Id: 10877267
From: Juergen Gross
To: xen-devel@lists.xenproject.org
Date: Fri, 29 Mar 2019 16:09:24 +0100
Message-Id: <20190329150934.17694-40-jgross@suse.com>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20190329150934.17694-1-jgross@suse.com>
References: <20190329150934.17694-1-jgross@suse.com>
Subject: [Xen-devel] [PATCH RFC 39/49] xen/sched: add code to sync scheduling
 of all vcpus of a sched item
Cc: Juergen Gross, Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
 George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan, Julien Grall,
 Jan Beulich, Dario Faggioli, Roger Pau Monné

When switching sched items, synchronize all vcpus of the new item so they
are scheduled at the same time. A new variable sched_granularity holds the
number of vcpus per schedule item.

As tasklets require the idle item to be scheduled, the
tasklet_work_scheduled parameter of do_schedule() must be set to true
whenever any cpu covered by the current schedule() call has pending
tasklet work.

For joining the other vcpus of a schedule item, add a new softirq,
SCHED_SLAVE_SOFTIRQ, as a way to initiate a context switch without calling
the generic schedule() function to select the vcpu to switch to, since we
already know which vcpu we want to run. This has the additional advantage
of not losing any concurrent SCHEDULE_SOFTIRQ events.
Signed-off-by: Juergen Gross
---
 xen/arch/x86/domain.c      |  37 +++++-
 xen/common/schedule.c      | 275 ++++++++++++++++++++++++++++++++-------------
 xen/common/softirq.c       |   6 +-
 xen/include/xen/sched-if.h |   7 ++
 xen/include/xen/softirq.h  |   1 +
 5 files changed, 247 insertions(+), 79 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 53b8fa1c9d..7daba4fb91 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1709,12 +1709,45 @@ static void __context_switch(void)
     per_cpu(curr_vcpu, cpu) = n;
 }
 
+/*
+ * Rendezvous on end of context switch.
+ * As no lock is protecting this rendezvous function we need to use atomic
+ * access functions on the counter.
+ * The counter will be 0 in case no rendezvous is needed. For the rendezvous
+ * case it is initialised to the number of cpus to rendezvous plus 1. Each
+ * member entering decrements the counter. The last one will decrement it to
+ * 1 and perform the final needed action in that case (a call of
+ * context_saved() if prev was specified), then set the counter to zero.
+ * The other members will wait until the counter becomes zero before they
+ * proceed.
+ */
+static void context_wait_rendezvous_out(struct sched_item *item,
+                                        struct vcpu *prev)
+{
+    if ( atomic_read(&item->rendezvous_out_cnt) )
+    {
+        int cnt = atomic_dec_return(&item->rendezvous_out_cnt);
+
+        /* Call context_saved() before releasing other waiters. */
+        if ( cnt == 1 )
+        {
+            if ( prev )
+                context_saved(prev);
+            atomic_set(&item->rendezvous_out_cnt, 0);
+        }
+        else
+            while ( atomic_read(&item->rendezvous_out_cnt) )
+                cpu_relax();
+    }
+    else if ( prev )
+        context_saved(prev);
+}
+
 void context_switch(struct vcpu *prev, struct vcpu *next)
 {
     unsigned int cpu = smp_processor_id();
     const struct domain *prevd = prev->domain, *nextd = next->domain;
     unsigned int dirty_cpu = next->dirty_cpu;
+    struct sched_item *item = next->sched_item;
 
     ASSERT(local_irq_is_enabled());
 
@@ -1787,7 +1820,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
         }
     }
 
-    context_saved(prev);
+    context_wait_rendezvous_out(item, prev);
 
     if ( prev != next )
     {
@@ -1812,6 +1845,8 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
 
 void continue_running(struct vcpu *same)
 {
+    context_wait_rendezvous_out(same->sched_item, NULL);
+
     /* See the comment above. */
     same->domain->arch.ctxt_switch->tail(same);
     BUG();
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 082225d173..d3474e6565 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -54,6 +54,10 @@ boolean_param("sched_smt_power_savings", sched_smt_power_savings);
  */
 int sched_ratelimit_us = SCHED_DEFAULT_RATELIMIT_US;
 integer_param("sched_ratelimit_us", sched_ratelimit_us);
+
+/* Number of vcpus per struct sched_item. */
+static unsigned int sched_granularity = 1;
+
 /* Various timer handlers. */
 static void s_timer_fn(void *unused);
 static void vcpu_periodic_timer_fn(void *data);
@@ -1600,116 +1604,235 @@ static void vcpu_periodic_timer_work(struct vcpu *v)
     set_timer(&v->periodic_timer, periodic_next_event);
 }
 
-/*
- * The main function
- * - deschedule the current domain (scheduler independent).
- * - pick a new domain (scheduler dependent).
- */
-static void schedule(void)
+static void sched_switch_items(struct sched_resource *sd,
+                               struct sched_item *next, struct sched_item *prev,
+                               s_time_t now)
 {
-    struct sched_item *prev = current->sched_item, *next = NULL;
-    s_time_t now;
-    struct scheduler *sched;
-    unsigned long *tasklet_work = &this_cpu(tasklet_work_to_do);
-    bool tasklet_work_scheduled = false;
-    struct sched_resource *sd;
-    spinlock_t *lock;
-    int cpu = smp_processor_id();
+    sd->curr = next;
 
-    ASSERT_NOT_IN_ATOMIC();
+    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->item_id,
+             now - prev->state_entry_time);
+    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->item_id,
+             (next->vcpu->runstate.state == RUNSTATE_runnable) ?
+             (now - next->state_entry_time) : 0, prev->next_time);
 
-    SCHED_STAT_CRANK(sched_run);
+    ASSERT(prev->vcpu->runstate.state == RUNSTATE_running);
 
-    sd = this_cpu(sched_res);
+    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->item_id,
+             next->domain->domain_id, next->item_id);
+
+    sched_item_runstate_change(prev, false, now);
+    prev->last_run_time = now;
+
+    ASSERT(next->vcpu->runstate.state != RUNSTATE_running);
+    sched_item_runstate_change(next, true, now);
+
+    /*
+     * NB. Don't add any trace records from here until the actual context
+     * switch, else lost_records resume will not work properly.
+     */
+
+    ASSERT(!next->is_running);
+    next->is_running = 1;
+}
+
+static bool sched_tasklet_check(void)
+{
+    unsigned long *tasklet_work;
+    bool tasklet_work_scheduled = false;
+    const cpumask_t *mask = this_cpu(sched_res)->cpus;
+    int cpu;
 
-    /* Update tasklet scheduling status. */
-    switch ( *tasklet_work )
+    for_each_cpu ( cpu, mask )
     {
-    case TASKLET_enqueued:
-        set_bit(_TASKLET_scheduled, tasklet_work);
-        /* fallthrough */
-    case TASKLET_enqueued|TASKLET_scheduled:
-        tasklet_work_scheduled = true;
-        break;
-    case TASKLET_scheduled:
-        clear_bit(_TASKLET_scheduled, tasklet_work);
-    case 0:
-        /*tasklet_work_scheduled = false;*/
-        break;
-    default:
-        BUG();
-    }
+        tasklet_work = &per_cpu(tasklet_work_to_do, cpu);
 
-    lock = pcpu_schedule_lock_irq(cpu);
+        switch ( *tasklet_work )
+        {
+        case TASKLET_enqueued:
+            set_bit(_TASKLET_scheduled, tasklet_work);
+            /* fallthrough */
+        case TASKLET_enqueued|TASKLET_scheduled:
+            tasklet_work_scheduled = true;
+            break;
+        case TASKLET_scheduled:
+            clear_bit(_TASKLET_scheduled, tasklet_work);
+        case 0:
+            /*tasklet_work_scheduled = false;*/
+            break;
+        default:
+            BUG();
+        }
+    }
 
-    now = NOW();
+    return tasklet_work_scheduled;
+}
 
-    stop_timer(&sd->s_timer);
+static struct sched_item *do_schedule(struct sched_item *prev, s_time_t now)
+{
+    struct scheduler *sched = this_cpu(scheduler);
+    struct sched_resource *sd = this_cpu(sched_res);
+    struct sched_item *next;
 
     /* get policy-specific decision on scheduling... */
-    sched = this_cpu(scheduler);
-    sched->do_schedule(sched, prev, now, tasklet_work_scheduled);
+    sched->do_schedule(sched, prev, now, sched_tasklet_check());
 
     next = prev->next_task;
 
-    sd->curr = next;
-
     if ( prev->next_time >= 0 ) /* -ve means no limit */
         set_timer(&sd->s_timer, now + prev->next_time);
 
-    if ( unlikely(prev == next) )
+    if ( likely(prev != next) )
+        sched_switch_items(sd, next, prev, now);
+
+    return next;
+}
+
+/*
+ * Rendezvous before taking a scheduling decision.
+ * Called with the schedule lock held, so all accesses to the rendezvous
+ * counter can be normal ones (no atomic accesses needed).
+ * The counter is initialised to the number of cpus needing to rendezvous.
+ * Each cpu entering will decrement the counter. Once the counter reaches
+ * zero, do_schedule() is called and the rendezvous counter for leaving
+ * context_switch() is set. All other members will wait until the counter
+ * reaches zero, dropping the schedule lock in between.
+ */
+static struct sched_item *sched_wait_rendezvous_in(struct sched_item *prev,
+                                                   spinlock_t *lock, int cpu,
+                                                   s_time_t now)
+{
+    struct sched_item *next;
+
+    if ( !--prev->rendezvous_in_cnt )
+    {
+        next = do_schedule(prev, now);
+        atomic_set(&next->rendezvous_out_cnt, sched_granularity + 1);
+        return next;
+    }
+
+    while ( prev->rendezvous_in_cnt )
     {
         pcpu_schedule_unlock_irq(lock, cpu);
+        cpu_relax();
+        pcpu_schedule_lock_irq(cpu);
+    }
+
+    return prev->next_task;
+}
+
+static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
+                                 s_time_t now)
+{
+    if ( unlikely(vprev == vnext) )
+    {
         TRACE_4D(TRC_SCHED_SWITCH_INFCONT,
-                 next->domain->domain_id, next->item_id,
-                 now - prev->state_entry_time,
-                 prev->next_time);
-        trace_continue_running(next->vcpu);
-        return continue_running(prev->vcpu);
+                 vnext->domain->domain_id, vnext->sched_item->item_id,
+                 now - vprev->runstate.state_entry_time,
+                 vprev->sched_item->next_time);
+        trace_continue_running(vnext);
+        return continue_running(vprev);
     }
 
-    TRACE_3D(TRC_SCHED_SWITCH_INFPREV,
-             prev->domain->domain_id, prev->item_id,
-             now - prev->state_entry_time);
-    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT,
-             next->domain->domain_id, next->item_id,
-             (next->vcpu->runstate.state == RUNSTATE_runnable) ?
-             (now - next->state_entry_time) : 0,
-             prev->next_time);
+    SCHED_STAT_CRANK(sched_ctx);
 
-    ASSERT(prev->vcpu->runstate.state == RUNSTATE_running);
+    stop_timer(&vprev->periodic_timer);
 
-    TRACE_4D(TRC_SCHED_SWITCH,
-             prev->domain->domain_id, prev->item_id,
-             next->domain->domain_id, next->item_id);
+    if ( vnext->sched_item->migrated )
+        vcpu_move_irqs(vnext);
 
-    sched_item_runstate_change(prev, false, now);
-    prev->last_run_time = now;
+    vcpu_periodic_timer_work(vnext);
 
-    ASSERT(next->vcpu->runstate.state != RUNSTATE_running);
-    sched_item_runstate_change(next, true, now);
+    context_switch(vprev, vnext);
+}
 
-    /*
-     * NB. Don't add any trace records from here until the actual context
-     * switch, else lost_records resume will not work properly.
-     */
+static void sched_slave(void)
+{
+    struct vcpu *vprev = current;
+    struct sched_item *prev = vprev->sched_item, *next;
+    s_time_t now;
+    spinlock_t *lock;
+    int cpu = smp_processor_id();
 
-    ASSERT(!next->is_running);
-    next->is_running = 1;
-    next->state_entry_time = now;
+    ASSERT_NOT_IN_ATOMIC();
+
+    lock = pcpu_schedule_lock_irq(cpu);
+
+    now = NOW();
+
+    if ( !prev->rendezvous_in_cnt )
+    {
+        pcpu_schedule_unlock_irq(lock, cpu);
+        return;
+    }
+
+    stop_timer(&this_cpu(sched_res)->s_timer);
+
+    next = sched_wait_rendezvous_in(prev, lock, cpu, now);
 
     pcpu_schedule_unlock_irq(lock, cpu);
 
-    SCHED_STAT_CRANK(sched_ctx);
+    sched_context_switch(vprev, next->vcpu, now);
+}
 
-    stop_timer(&prev->vcpu->periodic_timer);
+/*
+ * The main function
+ * - deschedule the current domain (scheduler independent).
+ * - pick a new domain (scheduler dependent).
+ */
+static void schedule(void)
+{
+    struct vcpu *vnext, *vprev = current;
+    struct sched_item *prev = vprev->sched_item, *next = NULL;
+    s_time_t now;
+    struct sched_resource *sd;
+    spinlock_t *lock;
+    int cpu = smp_processor_id();
+
+    ASSERT_NOT_IN_ATOMIC();
 
-    if ( next->migrated )
-        vcpu_move_irqs(next->vcpu);
+    SCHED_STAT_CRANK(sched_run);
 
-    vcpu_periodic_timer_work(next->vcpu);
+    sd = this_cpu(sched_res);
+
+    lock = pcpu_schedule_lock_irq(cpu);
+
+    if ( prev->rendezvous_in_cnt )
+    {
+        /*
+         * We have a race: sched_slave() should be called, so raise a softirq
+         * in order to re-enter schedule() later and call sched_slave() now.
+         */
+        pcpu_schedule_unlock_irq(lock, cpu);
+
+        raise_softirq(SCHEDULE_SOFTIRQ);
+        return sched_slave();
+    }
+
+    now = NOW();
+
+    stop_timer(&sd->s_timer);
+
+    if ( sched_granularity > 1 )
+    {
+        cpumask_t mask;
+
+        prev->rendezvous_in_cnt = sched_granularity;
+        cpumask_andnot(&mask, sd->cpus, cpumask_of(cpu));
+        cpumask_raise_softirq(&mask, SCHED_SLAVE_SOFTIRQ);
+        next = sched_wait_rendezvous_in(prev, lock, cpu, now);
+    }
+    else
+    {
+        prev->rendezvous_in_cnt = 0;
+        next = do_schedule(prev, now);
+        atomic_set(&next->rendezvous_out_cnt, 0);
+    }
+
+    pcpu_schedule_unlock_irq(lock, cpu);
 
-    context_switch(prev->vcpu, next->vcpu);
+    vnext = next->vcpu;
+    sched_context_switch(vprev, vnext, now);
 }
 
 void context_saved(struct vcpu *prev)
@@ -1767,6 +1890,7 @@ static int cpu_schedule_up(unsigned int cpu)
     if ( sd == NULL )
         return -ENOMEM;
     sd->processor = cpu;
+    sd->cpus = cpumask_of(cpu);
     per_cpu(sched_res, cpu) = sd;
 
     per_cpu(scheduler, cpu) = &ops;
@@ -1926,6 +2050,7 @@ void __init scheduler_init(void)
     int i;
 
     open_softirq(SCHEDULE_SOFTIRQ, schedule);
+    open_softirq(SCHED_SLAVE_SOFTIRQ, sched_slave);
 
     for ( i = 0; i < NUM_SCHEDULERS; i++)
     {
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index 83c3c09bd5..2d66193203 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -33,8 +33,8 @@ static void __do_softirq(unsigned long ignore_mask)
     for ( ; ; )
     {
         /*
-         * Initialise @cpu on every iteration: SCHEDULE_SOFTIRQ may move
-         * us to another processor.
+         * Initialise @cpu on every iteration: SCHEDULE_SOFTIRQ or
+         * SCHED_SLAVE_SOFTIRQ may move us to another processor.
         */
        cpu = smp_processor_id();
 
@@ -55,7 +55,7 @@ void process_pending_softirqs(void)
 {
     ASSERT(!in_irq() && local_irq_is_enabled());
     /* Do not enter scheduler as it can preempt the calling context. */
-    __do_softirq(1ul<<SCHEDULE_SOFTIRQ);
+    __do_softirq((1ul<<SCHEDULE_SOFTIRQ) | (1ul<<SCHED_SLAVE_SOFTIRQ));
 }
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -86,6 +87,12 @@ struct sched_item {
     /* Next item to run. */
     struct sched_item *next_task;
     s_time_t next_time;
+
+    /* Number of vcpus not yet joined for context switch. */
+    unsigned int rendezvous_in_cnt;
+
+    /* Number of vcpus not yet finished with context switch. */
+    atomic_t rendezvous_out_cnt;
 };
 
 #define for_each_sched_item(d, e) \
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index c327c9b6cd..d7273b389b 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -4,6 +4,7 @@
 /* Low-latency softirqs come first in the following list. */
 enum {
     TIMER_SOFTIRQ = 0,
+    SCHED_SLAVE_SOFTIRQ,
     SCHEDULE_SOFTIRQ,
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,
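For readers reviewing the rendezvous protocol, a minimal user-space sketch of the two-counter scheme follows. This is not Xen code: the names run_rendezvous and cpu_thread are invented for illustration, a pthread mutex stands in for the per-cpu schedule lock, and sched_yield() stands in for cpu_relax(). Only the counter protocol mirrors the patch: the in-counter is decremented under the lock and the last cpu in takes the scheduling decision; the out-counter is armed to n + 1 and drained atomically, with the last cpu out doing the final action before zeroing it.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

struct item {
    pthread_mutex_t lock;            /* stands in for the schedule lock */
    unsigned int rendezvous_in_cnt;  /* cpus not yet joined */
    atomic_uint rendezvous_out_cnt;  /* cpus not yet finished */
    int decisions;                   /* how many threads "did_schedule" */
};

struct arg { struct item *it; unsigned int nthreads; };

static void *cpu_thread(void *p)
{
    struct arg *a = p;
    struct item *it = a->it;

    /*
     * Rendezvous in: the last thread to arrive takes the "scheduling
     * decision" and arms the out-counter for everyone (n + 1).
     */
    pthread_mutex_lock(&it->lock);
    if ( --it->rendezvous_in_cnt == 0 )
    {
        it->decisions++;                       /* do_schedule() stand-in */
        atomic_store(&it->rendezvous_out_cnt, a->nthreads + 1);
    }
    else
        while ( it->rendezvous_in_cnt )
        {
            pthread_mutex_unlock(&it->lock);   /* drop the lock while waiting */
            sched_yield();
            pthread_mutex_lock(&it->lock);
        }
    pthread_mutex_unlock(&it->lock);

    /*
     * Rendezvous out: the thread that decrements the counter to 1 is the
     * last one; it performs the final action (context_saved() in the patch)
     * and releases the rest by zeroing the counter.
     */
    if ( atomic_fetch_sub(&it->rendezvous_out_cnt, 1) == 2 )
        atomic_store(&it->rendezvous_out_cnt, 0);
    else
        while ( atomic_load(&it->rendezvous_out_cnt) )
            sched_yield();

    return NULL;
}

/*
 * Run one rendezvous with nthreads participants; returns the number of
 * scheduling decisions taken, which must always be exactly 1.
 */
int run_rendezvous(unsigned int nthreads)
{
    struct item it = { PTHREAD_MUTEX_INITIALIZER, nthreads, 0, 0 };
    struct arg a = { &it, nthreads };
    pthread_t t[16];

    if ( nthreads == 0 || nthreads > 16 )
        return -1;

    for ( unsigned int i = 0; i < nthreads; i++ )
        pthread_create(&t[i], NULL, cpu_thread, &a);
    for ( unsigned int i = 0; i < nthreads; i++ )
        pthread_join(t[i], NULL);

    return it.decisions;
}
```

Whichever thread decrements the in-counter to zero takes the decision exactly once, and the out-counter guarantees no participant races ahead before the final bookkeeping has run, matching the sched_wait_rendezvous_in() / context_wait_rendezvous_out() pairing in the patch.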