[RFC,39/49] xen/sched: add code to sync scheduling of all vcpus of a sched item

Message ID	20190329150934.17694-40-jgross@suse.com (mailing list archive)
State	Superseded
Headers	Return-Path: <xen-devel-bounces@lists.xenproject.org>
	From: Juergen Gross <jgross@suse.com>
	To: xen-devel@lists.xenproject.org
	Date: Fri, 29 Mar 2019 16:09:24 +0100
	Message-Id: <20190329150934.17694-40-jgross@suse.com>
	In-Reply-To: <20190329150934.17694-1-jgross@suse.com>
	References: <20190329150934.17694-1-jgross@suse.com>
	Subject: [Xen-devel] [PATCH RFC 39/49] xen/sched: add code to sync scheduling of all vcpus of a sched item
	Precedence: list
	Cc: Juergen Gross <jgross@suse.com>, Stefano Stabellini <sstabellini@kernel.org>, Wei Liu <wei.liu2@citrix.com>, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>, George Dunlap <George.Dunlap@eu.citrix.com>, Andrew Cooper <andrew.cooper3@citrix.com>, Ian Jackson <ian.jackson@eu.citrix.com>, Tim Deegan <tim@xen.org>, Julien Grall <julien.grall@arm.com>, Jan Beulich <jbeulich@suse.com>, Dario Faggioli <dfaggioli@suse.com>, Roger Pau Monné <roger.pau@citrix.com>
	MIME-Version: 1.0
	Content-Type: text/plain; charset="utf-8"
	Content-Transfer-Encoding: base64
	Errors-To: xen-devel-bounces@lists.xenproject.org
	Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org>
Series	xen: add core scheduling support

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 53b8fa1c9d..7daba4fb91 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1709,12 +1709,45 @@ static void __context_switch(void)
 
     per_cpu(curr_vcpu, cpu) = n;
 }
+/*
+ * Rendezvous on end of context switch.
+ * As no lock is protecting this rendezvous function we need to use atomic
+ * access functions on the counter.
+ * The counter will be 0 in case no rendezvous is needed. For the rendezvous
+ * case it is initialised to the number of cpus to rendezvous plus 1. Each
+ * member entering decrements the counter. The last one will decrement it to
+ * 1 and perform the final needed action in that case (call of context_saved()
+ * if prev was specified, and then set the counter to zero. The other members
+ * will wait until the counter becomes zero until they proceed.
+ */
+static void context_wait_rendezvous_out(struct sched_item *item,
+                                        struct vcpu *prev)
+{
+    if ( atomic_read(&item->rendezvous_out_cnt) )
+    {
+        int cnt = atomic_dec_return(&item->rendezvous_out_cnt);
+
+        /* Call context_saved() before releasing other waiters. */
+        if ( cnt == 1 )
+        {
+            if ( prev )
+                context_saved(prev);
+            atomic_set(&item->rendezvous_out_cnt, 0);
+        }
+        else
+            while ( atomic_read(&item->rendezvous_out_cnt) )
+                cpu_relax();
+    }
+    else if ( prev )
+        context_saved(prev);
+}
 
 void context_switch(struct vcpu *prev, struct vcpu *next)
 {
     unsigned int cpu = smp_processor_id();
     const struct domain *prevd = prev->domain, *nextd = next->domain;
     unsigned int dirty_cpu = next->dirty_cpu;
+    struct sched_item *item = next->sched_item;
 
     ASSERT(local_irq_is_enabled());
 
@@ -1787,7 +1820,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
         }
     }
 
-    context_saved(prev);
+    context_wait_rendezvous_out(item, prev);
 
     if ( prev != next )
     {
@@ -1812,6 +1845,8 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
 
 void continue_running(struct vcpu *same)
 {
+    context_wait_rendezvous_out(same->sched_item, NULL);
+
     /* See the comment above. */
     same->domain->arch.ctxt_switch->tail(same);
     BUG();
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 082225d173..d3474e6565 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -54,6 +54,10 @@ boolean_param("sched_smt_power_savings", sched_smt_power_savings);
  * */
 int sched_ratelimit_us = SCHED_DEFAULT_RATELIMIT_US;
 integer_param("sched_ratelimit_us", sched_ratelimit_us);
+
+/* Number of vcpus per struct sched_item. */
+static unsigned int sched_granularity = 1;
+
 /* Various timer handlers. */
 static void s_timer_fn(void *unused);
 static void vcpu_periodic_timer_fn(void *data);
@@ -1600,116 +1604,235 @@ static void vcpu_periodic_timer_work(struct vcpu *v)
     set_timer(&v->periodic_timer, periodic_next_event);
 }
 
-/*
- * The main function
- * - deschedule the current domain (scheduler independent).
- * - pick a new domain (scheduler dependent).
- */
-static void schedule(void)
+static void sched_switch_items(struct sched_resource *sd,
+                               struct sched_item *next, struct sched_item *prev,
+                               s_time_t now)
 {
-    struct sched_item *prev = current->sched_item, *next = NULL;
-    s_time_t now;
-    struct scheduler *sched;
-    unsigned long *tasklet_work = &this_cpu(tasklet_work_to_do);
-    bool tasklet_work_scheduled = false;
-    struct sched_resource *sd;
-    spinlock_t *lock;
-    int cpu = smp_processor_id();
+    sd->curr = next;
 
-    ASSERT_NOT_IN_ATOMIC();
+    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->item_id,
+             now - prev->state_entry_time);
+    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->item_id,
+             (next->vcpu->runstate.state == RUNSTATE_runnable) ?
+             (now - next->state_entry_time) : 0, prev->next_time);
 
-    SCHED_STAT_CRANK(sched_run);
+    ASSERT(prev->vcpu->runstate.state == RUNSTATE_running);
 
-    sd = this_cpu(sched_res);
+    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->item_id,
+             next->domain->domain_id, next->item_id);
+
+    sched_item_runstate_change(prev, false, now);
+    prev->last_run_time = now;
+
+    ASSERT(next->vcpu->runstate.state != RUNSTATE_running);
+    sched_item_runstate_change(next, true, now);
+
+    /*
+     * NB. Don't add any trace records from here until the actual context
+     * switch, else lost_records resume will not work properly.
+     */
+
+    ASSERT(!next->is_running);
+    next->is_running = 1;
+}
+
+static bool sched_tasklet_check(void)
+{
+    unsigned long *tasklet_work;
+    bool tasklet_work_scheduled = false;
+    const cpumask_t *mask = this_cpu(sched_res)->cpus;
+    int cpu;
 
-    /* Update tasklet scheduling status. */
-    switch ( *tasklet_work )
+    for_each_cpu ( cpu, mask )
     {
-    case TASKLET_enqueued:
-        set_bit(_TASKLET_scheduled, tasklet_work);
-        /* fallthrough */
-    case TASKLET_enqueued|TASKLET_scheduled:
-        tasklet_work_scheduled = true;
-        break;
-    case TASKLET_scheduled:
-        clear_bit(_TASKLET_scheduled, tasklet_work);
-    case 0:
-        /*tasklet_work_scheduled = false;*/
-        break;
-    default:
-        BUG();
-    }
+        tasklet_work = &per_cpu(tasklet_work_to_do, cpu);
 
-    lock = pcpu_schedule_lock_irq(cpu);
+        switch ( *tasklet_work )
+        {
+        case TASKLET_enqueued:
+            set_bit(_TASKLET_scheduled, tasklet_work);
+            /* fallthrough */
+        case TASKLET_enqueued|TASKLET_scheduled:
+            tasklet_work_scheduled = true;
+            break;
+        case TASKLET_scheduled:
+            clear_bit(_TASKLET_scheduled, tasklet_work);
+        case 0:
+            /*tasklet_work_scheduled = false;*/
+            break;
+        default:
+            BUG();
+        }
+    }
 
-    now = NOW();
+    return tasklet_work_scheduled;
+}
 
-    stop_timer(&sd->s_timer);
+static struct sched_item *do_schedule(struct sched_item *prev, s_time_t now)
+{
+    struct scheduler *sched = this_cpu(scheduler);
+    struct sched_resource *sd = this_cpu(sched_res);
+    struct sched_item *next;
 
     /* get policy-specific decision on scheduling... */
-    sched = this_cpu(scheduler);
-    sched->do_schedule(sched, prev, now, tasklet_work_scheduled);
+    sched->do_schedule(sched, prev, now, sched_tasklet_check());
 
     next = prev->next_task;
 
-    sd->curr = next;
-
     if ( prev->next_time >= 0 ) /* -ve means no limit */
         set_timer(&sd->s_timer, now + prev->next_time);
 
-    if ( unlikely(prev == next) )
+    if ( likely(prev != next) )
+        sched_switch_items(sd, next, prev, now);
+
+    return next;
+}
+
+/*
+ * Rendezvous before taking a scheduling decision.
+ * Called with schedule lock held, so all accesses to the rendezvous counter
+ * can be normal ones (no atomic accesses needed).
+ * The counter is initialized to the number of cpus to rendezvous initially.
+ * Each cpu entering will decrement the counter. In case the counter becomes
+ * zero do_schedule() is called and the rendezvous counter for leaving
+ * context_switch() is set. All other members will wait until the counter is
+ * becoming zero, dropping the schedule lock in between.
+ */
+static struct sched_item *sched_wait_rendezvous_in(struct sched_item *prev,
+                                                   spinlock_t *lock, int cpu,
+                                                   s_time_t now)
+{
+    struct sched_item *next;
+
+    if ( !--prev->rendezvous_in_cnt )
+    {
+        next = do_schedule(prev, now);
+        atomic_set(&next->rendezvous_out_cnt, sched_granularity + 1);
+        return next;
+    }
+
+    while ( prev->rendezvous_in_cnt )
     {
         pcpu_schedule_unlock_irq(lock, cpu);
+        cpu_relax();
+        pcpu_schedule_lock_irq(cpu);
+    }
+
+    return prev->next_task;
+}
+
+static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
+                                 s_time_t now)
+{
+    if ( unlikely(vprev == vnext) )
+    {
         TRACE_4D(TRC_SCHED_SWITCH_INFCONT,
-                 next->domain->domain_id, next->item_id,
-                 now - prev->state_entry_time,
-                 prev->next_time);
-        trace_continue_running(next->vcpu);
-        return continue_running(prev->vcpu);
+                 vnext->domain->domain_id, vnext->sched_item->item_id,
+                 now - vprev->runstate.state_entry_time,
+                 vprev->sched_item->next_time);
+        trace_continue_running(vnext);
+        return continue_running(vprev);
     }
 
-    TRACE_3D(TRC_SCHED_SWITCH_INFPREV,
-             prev->domain->domain_id, prev->item_id,
-             now - prev->state_entry_time);
-    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT,
-             next->domain->domain_id, next->item_id,
-             (next->vcpu->runstate.state == RUNSTATE_runnable) ?
-             (now - next->state_entry_time) : 0,
-             prev->next_time);
+    SCHED_STAT_CRANK(sched_ctx);
 
-    ASSERT(prev->vcpu->runstate.state == RUNSTATE_running);
+    stop_timer(&vprev->periodic_timer);
 
-    TRACE_4D(TRC_SCHED_SWITCH,
-             prev->domain->domain_id, prev->item_id,
-             next->domain->domain_id, next->item_id);
+    if ( vnext->sched_item->migrated )
+        vcpu_move_irqs(vnext);
 
-    sched_item_runstate_change(prev, false, now);
-    prev->last_run_time = now;
+    vcpu_periodic_timer_work(vnext);
 
-    ASSERT(next->vcpu->runstate.state != RUNSTATE_running);
-    sched_item_runstate_change(next, true, now);
+    context_switch(vprev, vnext);
+}
 
-    /*
-     * NB. Don't add any trace records from here until the actual context
-     * switch, else lost_records resume will not work properly.
-     */
+static void sched_slave(void)
+{
+    struct vcpu *vprev = current;
+    struct sched_item *prev = vprev->sched_item, *next;
+    s_time_t now;
+    spinlock_t *lock;
+    int cpu = smp_processor_id();
 
-    ASSERT(!next->is_running);
-    next->is_running = 1;
-    next->state_entry_time = now;
+    ASSERT_NOT_IN_ATOMIC();
+
+    lock = pcpu_schedule_lock_irq(cpu);
+
+    now = NOW();
+
+    if ( !prev->rendezvous_in_cnt )
+    {
+        pcpu_schedule_unlock_irq(lock, cpu);
+        return;
+    }
+
+    stop_timer(&this_cpu(sched_res)->s_timer);
+
+    next = sched_wait_rendezvous_in(prev, lock, cpu, now);
 
     pcpu_schedule_unlock_irq(lock, cpu);
 
-    SCHED_STAT_CRANK(sched_ctx);
+    sched_context_switch(vprev, next->vcpu, now);
+}
 
-    stop_timer(&prev->vcpu->periodic_timer);
+/*
+ * The main function
+ * - deschedule the current domain (scheduler independent).
+ * - pick a new domain (scheduler dependent).
+ */
+static void schedule(void)
+{
+    struct vcpu *vnext, *vprev = current;
+    struct sched_item *prev = vprev->sched_item, *next = NULL;
+    s_time_t now;
+    struct sched_resource *sd;
+    spinlock_t *lock;
+    int cpu = smp_processor_id();
+
+    ASSERT_NOT_IN_ATOMIC();
 
-    if ( next->migrated )
-        vcpu_move_irqs(next->vcpu);
+    SCHED_STAT_CRANK(sched_run);
 
-    vcpu_periodic_timer_work(next->vcpu);
+    sd = this_cpu(sched_res);
+
+    lock = pcpu_schedule_lock_irq(cpu);
+
+    if ( prev->rendezvous_in_cnt )
+    {
+        /*
+         * We have a race: sched_slave() should be called, so raise a softirq
+         * in order to re-enter schedule() later and call sched_slave() now.
+         */
+        pcpu_schedule_unlock_irq(lock, cpu);
+
+        raise_softirq(SCHEDULE_SOFTIRQ);
+        return sched_slave();
+    }
+
+    now = NOW();
+
+    stop_timer(&sd->s_timer);
+
+    if ( sched_granularity > 1 )
+    {
+        cpumask_t mask;
+
+        prev->rendezvous_in_cnt = sched_granularity;
+        cpumask_andnot(&mask, sd->cpus, cpumask_of(cpu));
+        cpumask_raise_softirq(&mask, SCHED_SLAVE_SOFTIRQ);
+        next = sched_wait_rendezvous_in(prev, lock, cpu, now);
+    }
+    else
+    {
+        prev->rendezvous_in_cnt = 0;
+        next = do_schedule(prev, now);
+        atomic_set(&next->rendezvous_out_cnt, 0);
+    }
+
+    pcpu_schedule_unlock_irq(lock, cpu);
 
-    context_switch(prev->vcpu, next->vcpu);
+    vnext = next->vcpu;
+    sched_context_switch(vprev, vnext, now);
 }
 
 void context_saved(struct vcpu *prev)
@@ -1767,6 +1890,7 @@ static int cpu_schedule_up(unsigned int cpu)
     if ( sd == NULL )
         return -ENOMEM;
     sd->processor = cpu;
+    sd->cpus = cpumask_of(cpu);
     per_cpu(sched_res, cpu) = sd;
 
     per_cpu(scheduler, cpu) = &ops;
@@ -1926,6 +2050,7 @@ void __init scheduler_init(void)
     int i;
 
     open_softirq(SCHEDULE_SOFTIRQ, schedule);
+    open_softirq(SCHED_SLAVE_SOFTIRQ, sched_slave);
 
     for ( i = 0; i < NUM_SCHEDULERS; i++)
     {
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index 83c3c09bd5..2d66193203 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -33,8 +33,8 @@ static void __do_softirq(unsigned long ignore_mask)
     for ( ; ; )
     {
         /*
-         * Initialise @cpu on every iteration: SCHEDULE_SOFTIRQ may move
-         * us to another processor.
+         * Initialise @cpu on every iteration: SCHEDULE_SOFTIRQ or
+         * SCHED_SLAVE_SOFTIRQ may move us to another processor.
          */
         cpu = smp_processor_id();
 
@@ -55,7 +55,7 @@ void process_pending_softirqs(void)
 {
     ASSERT(!in_irq() && local_irq_is_enabled());
     /* Do not enter scheduler as it can preempt the calling context. */
-    __do_softirq(1ul<<SCHEDULE_SOFTIRQ);
+    __do_softirq((1ul << SCHEDULE_SOFTIRQ) | (1ul << SCHED_SLAVE_SOFTIRQ));
 }
 
 void do_softirq(void)
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index e2bc8f7284..9688d174e4 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -41,6 +41,7 @@ struct sched_resource {
     struct timer s_timer; /* scheduling timer */
     atomic_t urgent_count; /* how many urgent vcpus */
     unsigned processor;
+    const cpumask_t *cpus; /* cpus covered by this struct */
 };
 
 #define curr_on_cpu(c) (per_cpu(sched_res, c)->curr)
@@ -86,6 +87,12 @@ struct sched_item {
     /* Next item to run. */
     struct sched_item *next_task;
     s_time_t next_time;
+
+    /* Number of vcpus not yet joined for context switch. */
+    unsigned int rendezvous_in_cnt;
+
+    /* Number of vcpus not yet finished with context switch. */
+    atomic_t rendezvous_out_cnt;
 };
 
 #define for_each_sched_item(d, e) \
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index c327c9b6cd..d7273b389b 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -4,6 +4,7 @@
 /* Low-latency softirqs come first in the following list. */
 enum {
     TIMER_SOFTIRQ = 0,
+    SCHED_SLAVE_SOFTIRQ,
     SCHEDULE_SOFTIRQ,
     NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
     RCU_SOFTIRQ,
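
The comment above context_wait_rendezvous_out() in the patch describes a lock-free exit rendezvous. The counter is armed with the number of cpus plus 1, each cpu decrements it, the cpu that brings it down to 1 performs the final action and resets it to 0, and everyone else waits for 0. The following is only a rough user-space model of that counting scheme, not the Xen code. It assumes C11 atomics as stand-ins for atomic_read(), atomic_dec_return() and atomic_set(), and it steps the simulated cpus round-robin in a single thread instead of spinning on real cpus. The names item_model, rendezvous_out_step(), simulate_rendezvous_out() and MAX_CPUS are made up for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_CPUS 8 /* hypothetical upper bound for this model only */

/* Hypothetical stand-in for the counter kept in struct sched_item. */
struct item_model {
    atomic_int rendezvous_out_cnt; /* 0 means: no rendezvous pending */
    int final_actions;             /* how often the "last one in" work ran */
};

enum cpu_state { CPU_ENTERING = 0, CPU_WAITING, CPU_DONE };

/* One non-blocking step of a single simulated cpu. */
static enum cpu_state rendezvous_out_step(struct item_model *it,
                                          enum cpu_state st)
{
    if (st == CPU_ENTERING) {
        if (atomic_load(&it->rendezvous_out_cnt) == 0)
            return CPU_DONE; /* nothing to synchronise */

        /* atomic_fetch_sub() returns the old value, so subtract 1 again. */
        int cnt = atomic_fetch_sub(&it->rendezvous_out_cnt, 1) - 1;

        if (cnt == 1) {
            it->final_actions++;                      /* models context_saved(prev) */
            atomic_store(&it->rendezvous_out_cnt, 0); /* release the waiters */
            return CPU_DONE;
        }
        return CPU_WAITING;
    }
    if (st == CPU_WAITING && atomic_load(&it->rendezvous_out_cnt) == 0)
        return CPU_DONE;
    return st;
}

/*
 * Arm the counter with members + 1 and step all cpus round-robin until
 * every one has left. Returns the number of final actions performed and
 * stores the peak number of simultaneously waiting cpus in *max_waiting.
 */
static int simulate_rendezvous_out(int members, int *max_waiting)
{
    struct item_model it = { .final_actions = 0 };
    enum cpu_state st[MAX_CPUS] = { CPU_ENTERING };
    int done = 0;

    assert(members >= 1 && members <= MAX_CPUS);
    atomic_init(&it.rendezvous_out_cnt, members + 1);
    *max_waiting = 0;

    while (done < members) {
        int waiting = 0;

        done = 0;
        for (int i = 0; i < members; i++) {
            st[i] = rendezvous_out_step(&it, st[i]);
            waiting += (st[i] == CPU_WAITING);
            done += (st[i] == CPU_DONE);
        }
        if (waiting > *max_waiting)
            *max_waiting = waiting;
    }
    return it.final_actions;
}
```

Calling simulate_rendezvous_out(4, &w) in this model runs the final action exactly once while the other three cpus wait. Arming with members + 1 keeps the value 0 reserved for "no rendezvous pending or already released", so a plain decrement can never release the waiters before the final action has run.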
