From patchwork Fri Aug 9 14:58:20 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jürgen Groß
X-Patchwork-Id: 11086695
From: Juergen Gross
To: xen-devel@lists.xenproject.org
Date: Fri, 9 Aug 2019 16:58:20 +0200
Message-Id: <20190809145833.1020-36-jgross@suse.com>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20190809145833.1020-1-jgross@suse.com>
References: <20190809145833.1020-1-jgross@suse.com>
Subject: [Xen-devel] [PATCH v2 35/48] xen/sched: add fall back to idle vcpu when scheduling unit
List-Id: Xen developer discussion
Cc: Juergen Gross, Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
 George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan, Julien Grall,
 Jan Beulich, Dario Faggioli, Volodymyr Babchuk, Roger Pau Monné

When scheduling a unit with multiple vcpus there is no guarantee that
all vcpus are available (e.g. above maxvcpus or vcpu offline). Fall back
to the idle vcpu of the current cpu in that case. This requires storing
the correct sched_unit pointer in the idle vcpu as long as it is used as
a fallback vcpu.

In order to modify the runstates of the correct vcpus when switching
schedule units, merge sched_unit_runstate_change() into
sched_switch_units() and loop over the affected physical cpus instead of
the unit's vcpus. This in turn requires an access function to the
current variable of other cpus.

Today context_saved() is called when the previous and next vcpus differ
during a context switch. With an idle vcpu able to substitute for an
offline vcpu, this is problematic when switching to an idle scheduling
unit.
An idle previous vcpu leaves it unclear which schedule unit was active
previously, so save the previous unit pointer in the per-schedule
resource area and treat a non-NULL value as a hint that context_saved()
should be called.

When running an idle vcpu in a non-idle scheduling unit, use a specific
guest idle loop that performs neither tasklets nor livepatching, in
order to avoid populating the cpu caches with memory used by other
domains (as far as possible). Softirqs are considered to be safe.

In order to avoid livepatching when going to guest idle, another variant
of reset_stack_and_jump() that does not call check_for_livepatch_work is
needed.

Signed-off-by: Juergen Gross
Acked-by: Julien Grall
---
RFC V2:
- new patch (Andrew Cooper)

V1:
- use urgent_count to select correct idle routine (Jan Beulich)

V2:
- set vcpu->is_running in context_saved()
- introduce reset_stack_and_jump_nolp() (Jan Beulich)
- re-add scrubbing (Jan Beulich, Andrew Cooper)
- get_cpu_current() _NOT_ moved to include/asm-x86/current.h as the
  needed reference of stack_base[] results in a #include hell
---
 xen/arch/x86/domain.c         |  23 ++++++
 xen/common/schedule.c         | 169 ++++++++++++++++++++++++++++++------------
 xen/include/asm-arm/current.h |   1 +
 xen/include/asm-x86/current.h |  19 ++++-
 xen/include/asm-x86/smp.h     |   3 +
 xen/include/xen/sched-if.h    |   4 +-
 xen/include/xen/sched.h       |   1 +
 7 files changed, 166 insertions(+), 54 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index c45bec8864..7b24d8fa48 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -159,6 +159,25 @@ static void idle_loop(void)
     }
 }
 
+/*
+ * Idle loop for siblings in active schedule units.
+ * We don't do any standard idle work like tasklets or livepatching.
+ */
+static void guest_idle_loop(void)
+{
+    unsigned int cpu = smp_processor_id();
+
+    for ( ; ; )
+    {
+        ASSERT(!cpu_is_offline(cpu));
+
+        if ( !softirq_pending(cpu) && !scrub_free_pages() &&
+             !softirq_pending(cpu))
+            sched_guest_idle(pm_idle, cpu);
+        do_softirq();
+    }
+}
+
 void startup_cpu_idle_loop(void)
 {
     struct vcpu *v = current;
@@ -172,6 +191,10 @@ void startup_cpu_idle_loop(void)
 
 static void noreturn continue_idle_domain(struct vcpu *v)
 {
+    /* Idle vcpus might be attached to non-idle units! */
+    if ( !is_idle_domain(v->sched_unit->domain) )
+        reset_stack_and_jump_nolp(guest_idle_loop);
+
     reset_stack_and_jump(idle_loop);
 }
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 13d3392640..5cd7d2d857 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -133,10 +133,21 @@ static struct scheduler sched_idle_ops = {
     .switch_sched = sched_idle_switch_sched,
 };
 
+static inline struct vcpu *unit2vcpu_cpu(struct sched_unit *unit,
+                                         unsigned int cpu)
+{
+    unsigned int idx = unit->unit_id + per_cpu(sched_res_idx, cpu);
+    const struct domain *d = unit->domain;
+
+    return (idx < d->max_vcpus) ? d->vcpu[idx] : NULL;
+}
+
 static inline struct vcpu *sched_unit2vcpu_cpu(struct sched_unit *unit,
                                                unsigned int cpu)
 {
-    return unit->domain->vcpu[unit->unit_id + per_cpu(sched_res_idx, cpu)];
+    struct vcpu *v = unit2vcpu_cpu(unit, cpu);
+
+    return (v && v->new_state == RUNSTATE_running) ? v : idle_vcpu[cpu];
 }
 
 static inline struct scheduler *dom_scheduler(const struct domain *d)
@@ -256,8 +267,11 @@ static inline void vcpu_runstate_change(
 
     trace_runstate_change(v, new_state);
 
-    unit->runstate_cnt[v->runstate.state]--;
-    unit->runstate_cnt[new_state]++;
+    if ( !is_idle_vcpu(v) )
+    {
+        unit->runstate_cnt[v->runstate.state]--;
+        unit->runstate_cnt[new_state]++;
+    }
 
     delta = new_entry_time - v->runstate.state_entry_time;
     if ( delta > 0 )
@@ -269,19 +283,11 @@ static inline void vcpu_runstate_change(
     v->runstate.state = new_state;
 }
 
-static inline void sched_unit_runstate_change(struct sched_unit *unit,
-    bool running, s_time_t new_entry_time)
+void sched_guest_idle(void (*idle) (void), unsigned int cpu)
 {
-    struct vcpu *v;
-
-    for_each_sched_unit_vcpu ( unit, v )
-        if ( running )
-            vcpu_runstate_change(v, v->new_state, new_entry_time);
-        else
-            vcpu_runstate_change(v,
-                ((v->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
-                 (vcpu_runnable(v) ? RUNSTATE_runnable : RUNSTATE_offline)),
-                new_entry_time);
+    atomic_inc(&get_sched_res(cpu)->urgent_count);
+    idle();
+    atomic_dec(&get_sched_res(cpu)->urgent_count);
 }
 
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate)
@@ -519,6 +525,7 @@ int sched_init_vcpu(struct vcpu *v)
     if ( is_idle_domain(d) )
     {
         get_sched_res(v->processor)->curr = unit;
+        get_sched_res(v->processor)->sched_unit_idle = unit;
         v->is_running = 1;
         unit->is_running = 1;
         unit->state_entry_time = NOW();
@@ -1748,32 +1755,65 @@ static void sched_switch_units(struct sched_resource *sd,
                                struct sched_unit *next, struct sched_unit *prev,
                                s_time_t now)
 {
-    sd->curr = next;
-
-    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->unit_id,
-             now - prev->state_entry_time);
-    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->unit_id,
-             (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
-             (now - next->state_entry_time) : 0, prev->next_time);
+    int cpu;
 
     ASSERT(unit_running(prev));
 
-    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
-             next->domain->domain_id, next->unit_id);
+    if ( prev != next )
+    {
+        sd->curr = next;
+        sd->prev = prev;
 
-    sched_unit_runstate_change(prev, false, now);
+        TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id,
+                 prev->unit_id, now - prev->state_entry_time);
+        TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id,
+                 next->unit_id,
+                 (next->vcpu_list->runstate.state == RUNSTATE_runnable) ?
+                 (now - next->state_entry_time) : 0, prev->next_time);
+        TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->unit_id,
+                 next->domain->domain_id, next->unit_id);
 
-    ASSERT(!unit_running(next));
-    sched_unit_runstate_change(next, true, now);
+        ASSERT(!unit_running(next));
 
-    /*
-     * NB. Don't add any trace records from here until the actual context
-     * switch, else lost_records resume will not work properly.
-     */
+        /*
+         * NB. Don't add any trace records from here until the actual context
+         * switch, else lost_records resume will not work properly.
+         */
 
-    ASSERT(!next->is_running);
-    next->vcpu_list->is_running = 1;
-    next->is_running = 1;
+        ASSERT(!next->is_running);
+        next->is_running = 1;
+
+        if ( is_idle_unit(prev) )
+        {
+            prev->runstate_cnt[RUNSTATE_running] = 0;
+            prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
+        }
+        if ( is_idle_unit(next) )
+        {
+            next->runstate_cnt[RUNSTATE_running] = sched_granularity;
+            next->runstate_cnt[RUNSTATE_runnable] = 0;
+        }
+    }
+
+    for_each_cpu ( cpu, sd->cpus )
+    {
+        struct vcpu *vprev = get_cpu_current(cpu);
+        struct vcpu *vnext = sched_unit2vcpu_cpu(next, cpu);
+
+        if ( vprev != vnext || vprev->runstate.state != vnext->new_state )
+        {
+            vcpu_runstate_change(vprev,
+                ((vprev->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
+                 (vcpu_runnable(vprev) ? RUNSTATE_runnable : RUNSTATE_offline)),
+                now);
+            vcpu_runstate_change(vnext, vnext->new_state, now);
+        }
+
+        vnext->is_running = 1;
+
+        if ( is_idle_vcpu(vnext) )
+            vnext->sched_unit = next;
+    }
 }
 
 static bool sched_tasklet_check_cpu(unsigned int cpu)
@@ -1829,29 +1869,48 @@ static struct sched_unit *do_schedule(struct sched_unit *prev, s_time_t now,
     if ( prev->next_time >= 0 ) /* -ve means no limit */
         set_timer(&sd->s_timer, now + prev->next_time);
 
-    if ( likely(prev != next) )
-        sched_switch_units(sd, next, prev, now);
+    sched_switch_units(sd, next, prev, now);
 
     return next;
 }
 
-static void context_saved(struct vcpu *prev)
+static void context_saved(struct sched_resource *sd, struct vcpu *vprev,
+                          struct vcpu *vnext)
 {
-    struct sched_unit *unit = prev->sched_unit;
+    struct sched_unit *unit = sd->prev;
+    int cpu;
 
     /* Clear running flag /after/ writing context to memory. */
     smp_wmb();
 
-    prev->is_running = 0;
+    if ( !sd->prev )
+    {
+        if ( vprev != vnext )
+            vprev->is_running = 0;
+        return;
+    }
+
+    for_each_cpu ( cpu, sd->cpus )
+    {
+        struct vcpu *v = unit2vcpu_cpu(unit, cpu);
+
+        if ( !v || !v->is_running )
+            v = idle_vcpu[cpu];
+        if ( v != vnext )
+            v->is_running = 0;
+    }
     unit->is_running = 0;
     unit->state_entry_time = NOW();
+    sd->prev = NULL;
 
     /* Check for migration request /after/ clearing running flag. */
     smp_mb();
 
-    sched_context_saved(vcpu_scheduler(prev), unit);
+    sched_context_saved(unit_scheduler(unit), unit);
 
-    sched_unit_migrate_finish(unit);
+    /* Idle never migrates and idle vcpus might belong to other units. */
+    if ( !is_idle_unit(unit) )
+        sched_unit_migrate_finish(unit);
 }
 
 /*
@@ -1868,6 +1927,7 @@ static void context_saved(struct vcpu *prev)
 void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
 {
     struct sched_unit *next = vnext->sched_unit;
+    struct sched_resource *sd = get_sched_res(smp_processor_id());
 
     if ( atomic_read(&next->rendezvous_out_cnt) )
     {
@@ -1876,20 +1936,22 @@ void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
         /* Call context_saved() before releasing other waiters. */
         if ( cnt == 1 )
         {
-            if ( vprev != vnext )
-                context_saved(vprev);
+            context_saved(sd, vprev, vnext);
             atomic_set(&next->rendezvous_out_cnt, 0);
         }
         else
             while ( atomic_read(&next->rendezvous_out_cnt) )
                 cpu_relax();
     }
-    else if ( vprev != vnext && sched_granularity == 1 )
-        context_saved(vprev);
+    else
+        context_saved(sd, vprev, vnext);
+
+    if ( is_idle_vcpu(vprev) && vprev != vnext )
+        vprev->sched_unit = sd->sched_unit_idle;
 }
 
 static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
-                                 s_time_t now)
+                                 bool reset_idle_unit, s_time_t now)
 {
     if ( unlikely(vprev == vnext) )
     {
@@ -1898,6 +1960,11 @@ static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
                  now - vprev->runstate.state_entry_time,
                  vprev->sched_unit->next_time);
         sched_context_switched(vprev, vnext);
+
+        if ( reset_idle_unit )
+            vnext->sched_unit =
+                get_sched_res(smp_processor_id())->sched_unit_idle;
+
         trace_continue_running(vnext);
         return continue_running(vprev);
     }
@@ -1956,7 +2023,7 @@ static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
             pcpu_schedule_unlock_irq(*lock, cpu);
 
             raise_softirq(SCHED_SLAVE_SOFTIRQ);
-            sched_context_switch(vprev, vprev, now);
+            sched_context_switch(vprev, vprev, false, now);
         }
 
         pcpu_schedule_unlock_irq(*lock, cpu);
@@ -1995,7 +2062,8 @@ static void sched_slave(void)
 
     pcpu_schedule_unlock_irq(lock, cpu);
 
-    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu), now);
+    sched_context_switch(vprev, sched_unit2vcpu_cpu(next, cpu),
+                         is_idle_unit(next) && !is_idle_unit(prev), now);
 }
 
 /*
@@ -2055,7 +2123,8 @@ static void schedule(void)
     pcpu_schedule_unlock_irq(lock, cpu);
 
     vnext = sched_unit2vcpu_cpu(next, cpu);
-    sched_context_switch(vprev, vnext, now);
+    sched_context_switch(vprev, vnext,
+                         !is_idle_unit(prev) && is_idle_unit(next), now);
 }
 
 /* The scheduler timer: force a run through the scheduler */
@@ -2126,6 +2195,7 @@ static int cpu_schedule_up(unsigned int cpu)
      */
 
     sd->curr = idle_vcpu[cpu]->sched_unit;
+    sd->sched_unit_idle = idle_vcpu[cpu]->sched_unit;
 
     sd->sched_priv = NULL;
 
@@ -2295,6 +2365,7 @@ void __init scheduler_init(void)
     if ( vcpu_create(idle_domain, 0) == NULL )
         BUG();
     get_sched_res(0)->curr = idle_vcpu[0]->sched_unit;
+    get_sched_res(0)->sched_unit_idle = idle_vcpu[0]->sched_unit;
 }
 
 /*
diff --git a/xen/include/asm-arm/current.h b/xen/include/asm-arm/current.h
index 1653e89d30..88beb4645a 100644
--- a/xen/include/asm-arm/current.h
+++ b/xen/include/asm-arm/current.h
@@ -18,6 +18,7 @@ DECLARE_PER_CPU(struct vcpu *, curr_vcpu);
 
 #define current (this_cpu(curr_vcpu))
 #define set_current(vcpu) do { current = (vcpu); } while (0)
+#define get_cpu_current(cpu) (per_cpu(curr_vcpu, cpu))
 
 /* Per-VCPU state that lives at the top of the stack */
 struct cpu_info {
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index f3508c3c08..0b47485337 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -77,6 +77,11 @@ struct cpu_info {
     /* get_stack_bottom() must be 16-byte aligned */
 };
 
+static inline struct cpu_info *get_cpu_info_from_stack(unsigned long sp)
+{
+    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
+}
+
 static inline struct cpu_info *get_cpu_info(void)
 {
 #ifdef __clang__
@@ -87,7 +92,7 @@ static inline struct cpu_info *get_cpu_info(void)
     register unsigned long sp asm("rsp");
 #endif
 
-    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
+    return get_cpu_info_from_stack(sp);
 }
 
 #define get_current() (get_cpu_info()->current_vcpu)
@@ -124,16 +129,22 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
 # define CHECK_FOR_LIVEPATCH_WORK ""
 #endif
 
-#define reset_stack_and_jump(__fn)                                      \
+#define switch_stack_and_jump(fn, instr)                                \
     ({                                                                  \
         __asm__ __volatile__ (                                          \
             "mov %0,%%"__OP"sp;"                                        \
-            CHECK_FOR_LIVEPATCH_WORK                                    \
+            instr                                                       \
             "jmp %c1"                                                   \
-            : : "r" (guest_cpu_user_regs()), "i" (__fn) : "memory" );   \
+            : : "r" (guest_cpu_user_regs()), "i" (fn) : "memory" );     \
         unreachable();                                                  \
     })
 
+#define reset_stack_and_jump(fn)                                        \
+    switch_stack_and_jump(fn, CHECK_FOR_LIVEPATCH_WORK)
+
+#define reset_stack_and_jump_nolp(fn)                                   \
+    switch_stack_and_jump(fn, "")
+
 /*
  * Which VCPU's state is currently running on each CPU?
  * This is not necesasrily the same as 'current' as a CPU may be
diff --git a/xen/include/asm-x86/smp.h b/xen/include/asm-x86/smp.h
index 9f533f9072..51a31ab00a 100644
--- a/xen/include/asm-x86/smp.h
+++ b/xen/include/asm-x86/smp.h
@@ -76,6 +76,9 @@ void set_nr_sockets(void);
 /* Representing HT and core siblings in each socket. */
 extern cpumask_t **socket_cpumask;
 
+#define get_cpu_current(cpu) \
+    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
+
 #endif /* !__ASSEMBLY__ */
 
 #endif
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index ae46b5395f..3ac7757c0d 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -39,6 +39,8 @@ struct sched_resource {
     spinlock_t         *schedule_lock,
                        _lock;
     struct sched_unit  *curr;           /* current task */
+    struct sched_unit  *sched_unit_idle;
+    struct sched_unit  *prev;           /* previous task */
     void               *sched_priv;
     struct timer        s_timer;        /* scheduling timer */
     atomic_t            urgent_count;   /* how many urgent vcpus */
@@ -179,7 +181,7 @@ static inline void sched_clear_pause_flags_atomic(struct sched_unit *unit,
 
 static inline struct sched_unit *sched_idle_unit(unsigned int cpu)
 {
-    return idle_vcpu[cpu]->sched_unit;
+    return get_sched_res(cpu)->sched_unit_idle;
 }
 
 static inline unsigned int sched_get_resource_cpu(unsigned int cpu)
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 7585bd81a2..ed0535946f 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -922,6 +922,7 @@ void restore_vcpu_affinity(struct domain *d);
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate);
 uint64_t get_cpu_idle_time(unsigned int cpu);
 bool sched_has_urgent_vcpu(void);
+void sched_guest_idle(void (*idle) (void), unsigned int cpu);
 
 /*
  * Used by idle loop to decide whether there is work to do:
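
Not part of the patch: a minimal standalone sketch of the fallback rule
from the commit message, for readers who want to see the lookup in
isolation. It only models the two helpers unit2vcpu_cpu() and
sched_unit2vcpu_cpu(). Everything prefixed model_/MODEL_ is hypothetical:
the structs are cut-down stand-ins for the Xen ones, the per_cpu()
accessors are replaced by plain arrays, and the array limits are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative limits, not taken from Xen. */
#define MODEL_NR_CPUS   4
#define MODEL_MAX_VCPUS 8

enum model_runstate {
    MODEL_RUNSTATE_running,
    MODEL_RUNSTATE_runnable,
    MODEL_RUNSTATE_blocked,
    MODEL_RUNSTATE_offline,
};

/* Cut-down stand-ins for struct vcpu, struct domain and struct sched_unit. */
struct model_vcpu {
    enum model_runstate new_state;
};

struct model_domain {
    unsigned int max_vcpus;
    struct model_vcpu *vcpu[MODEL_MAX_VCPUS];
};

struct model_unit {
    struct model_domain *domain;
    unsigned int unit_id;              /* id of the unit's first vcpu */
};

/* Stand-ins for idle_vcpu[] and per_cpu(sched_res_idx, cpu). */
static struct model_vcpu model_idle_vcpu[MODEL_NR_CPUS];
static unsigned int model_sched_res_idx[MODEL_NR_CPUS];

/* Raw lookup: NULL when the slot lies beyond the domain's max_vcpus. */
static struct model_vcpu *model_unit2vcpu_cpu(const struct model_unit *unit,
                                              unsigned int cpu)
{
    unsigned int idx;

    assert(cpu < MODEL_NR_CPUS);
    idx = unit->unit_id + model_sched_res_idx[cpu];

    return (idx < unit->domain->max_vcpus) ? unit->domain->vcpu[idx] : NULL;
}

/* Lookup with fallback: any vcpu that is not about to run is replaced
 * by the idle vcpu of the cpu in question. */
static struct model_vcpu *model_sched_unit2vcpu_cpu(
    const struct model_unit *unit, unsigned int cpu)
{
    struct model_vcpu *v = model_unit2vcpu_cpu(unit, cpu);

    return (v && v->new_state == MODEL_RUNSTATE_running)
           ? v : &model_idle_vcpu[cpu];
}
```

With a two-cpu scheduling resource and a domain whose second vcpu is
offline, a lookup for the second cpu yields that cpu's idle vcpu rather
than the offline vcpu. The same happens for a slot at or above max_vcpus.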