Message ID | 20190506065644.7415-43-jgross@suse.com (mailing list archive)
---|---
State | Superseded
Series | xen: add core scheduling support
>>> On 06.05.19 at 08:56, <jgross@suse.com> wrote:
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -154,6 +154,24 @@ static void idle_loop(void)
>  }
>  }
>
> +/*
> + * Idle loop for siblings of active schedule items.
> + * We don't do any standard idle work like tasklets, page scrubbing or
> + * livepatching.
> + * Use default_idle() in order to simulate v->is_urgent.

I guess I'm missing a part of the description which explains all this:
What's wrong with doing scrubbing work, for example? Why is doing
tasklet work not okay, but softirqs are? What is the deal with
v->is_urgent, i.e. what justifies not entering a decent power saving
mode here on Intel, but doing so on AMD?

> --- a/xen/include/asm-x86/smp.h
> +++ b/xen/include/asm-x86/smp.h
> @@ -76,6 +76,9 @@ void set_nr_sockets(void);
>  /* Representing HT and core siblings in each socket. */
>  extern cpumask_t **socket_cpumask;
>
> +#define get_cpu_current(cpu) \
> +    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)

Yet another, slightly different notion of "current". If "current"
itself is not suitable (I can't immediately see why that would be, but
I also didn't look at all the scheduler specific changes earlier in
this series), why isn't per_cpu(curr_vcpu, cpu) either?

Jan
On 16/05/2019 15:05, Jan Beulich wrote:
>>>> On 06.05.19 at 08:56, <jgross@suse.com> wrote:
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -154,6 +154,24 @@ static void idle_loop(void)
>>  }
>>  }
>>
>> +/*
>> + * Idle loop for siblings of active schedule items.
>> + * We don't do any standard idle work like tasklets, page scrubbing or
>> + * livepatching.
>> + * Use default_idle() in order to simulate v->is_urgent.
>
> I guess I'm missing a part of the description which explains all this:
> What's wrong with doing scrubbing work, for example? Why is
> doing tasklet work not okay, but softirqs are? What is the deal with
> v->is_urgent, i.e. what justifies not entering a decent power
> saving mode here on Intel, but doing so on AMD?

One of the reasons for using core scheduling is to avoid running vcpus
of different domains on the same core in order to minimize the chances
for side channel attacks on data of other domains. Not allowing
scrubbing or tasklets here is meant to avoid accessing data of other
domains.

With core scheduling we can be sure the other thread is active
(otherwise we would schedule the idle item), so hoping to save power by
using mwait is moot.

>> --- a/xen/include/asm-x86/smp.h
>> +++ b/xen/include/asm-x86/smp.h
>> @@ -76,6 +76,9 @@ void set_nr_sockets(void);
>>  /* Representing HT and core siblings in each socket. */
>>  extern cpumask_t **socket_cpumask;
>>
>> +#define get_cpu_current(cpu) \
>> +    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
>
> Yet another, slightly different notion of "current". If "current"
> itself is not suitable (I can't immediately see why that would be,
> but I also didn't look at all the scheduler specific changes earlier
> in this series), why isn't per_cpu(curr_vcpu, cpu) either?

current is always the vcpu running on the current physical cpu.
curr_vcpu is the vcpu which was running in guest mode last (this avoids
the need to save/restore context in case a vcpu is blocked for a short
time without another guest vcpu running on the physical cpu in
between), so with current being idle the two can differ.

Here I need the "current" of another physical cpu, which is not easily
available.

Juergen
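The trick behind get_cpu_info_from_stack() can be sketched in plain C: since every per-cpu stack is STACK_SIZE-aligned and the cpu_info block lives at its very top, any pointer into the stack can be rounded up to the stack's end to locate it. STACK_SIZE and the struct layout below are simplified assumptions for illustration, not Xen's actual definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define STACK_SIZE (1UL << 15)   /* assumption: power-of-two per-cpu stack size */

struct cpu_info {
    unsigned long current_vcpu;  /* stand-in for Xen's struct vcpu * member */
};

/*
 * Round any in-stack address up to the end of the (aligned) stack, then
 * step back one struct: no per-cpu variable access needed, which is why
 * this also works for *another* cpu's stack given its stack base.
 */
static struct cpu_info *get_cpu_info_from_stack(unsigned long sp)
{
    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
}
```

Every sp inside one stack maps to the same cpu_info, which is what lets the macro derive a remote cpu's "current" from stack_base[cpu] alone.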
>>> On 16.05.19 at 15:51, <jgross@suse.com> wrote:
> On 16/05/2019 15:05, Jan Beulich wrote:
>>>>> On 06.05.19 at 08:56, <jgross@suse.com> wrote:
>>> --- a/xen/arch/x86/domain.c
>>> +++ b/xen/arch/x86/domain.c
>>> @@ -154,6 +154,24 @@ static void idle_loop(void)
>>>  }
>>>  }
>>>
>>> +/*
>>> + * Idle loop for siblings of active schedule items.
>>> + * We don't do any standard idle work like tasklets, page scrubbing or
>>> + * livepatching.
>>> + * Use default_idle() in order to simulate v->is_urgent.
>>
>> I guess I'm missing a part of the description which explains all this:
>> What's wrong with doing scrubbing work, for example? Why is
>> doing tasklet work not okay, but softirqs are? What is the deal with
>> v->is_urgent, i.e. what justifies not entering a decent power
>> saving mode here on Intel, but doing so on AMD?
>
> One of the reasons for using core scheduling is to avoid running vcpus
> of different domains on the same core in order to minimize the chances
> for side channel attacks to data of other domains. Not allowing
> scrubbing or tasklets here is due to avoid accessing data of other
> domains.

So how is running softirqs okay then? And how is scrubbing accessing
other domains' data?

> As with core scheduling we can be sure the other thread is active
> (otherwise we would schedule the idle item) and hoping for saving power
> by using mwait is moot.

Saving power may be indirect, by the CPU re-arranging resource
assignment between threads when one goes idle. I have no idea whether
they do this when entering C1, or only when entering deeper C states.

And anyway - I'm still none the wiser as to the v->is_urgent
relationship.

>>> --- a/xen/include/asm-x86/smp.h
>>> +++ b/xen/include/asm-x86/smp.h
>>> @@ -76,6 +76,9 @@ void set_nr_sockets(void);
>>>  /* Representing HT and core siblings in each socket. */
>>>  extern cpumask_t **socket_cpumask;
>>>
>>> +#define get_cpu_current(cpu) \
>>> +    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
>>
>> Yet another, slightly different notion of "current". If "current"
>> itself is not suitable (I can't immediately see why that would be,
>> but I also didn't look at all the scheduler specific changes earlier
>> in this series), why isn't per_cpu(curr_vcpu, cpu) either?
>
> current is always the vcpu running on the current physical cpu.
> curr_vcpu is the vcpu which was the one running in guest mode last
> (this avoids the need to save/restore context in case a vcpu is
> blocked for a short time without another guest vcpu running on the
> physical cpu in between), so with current being idle the two can
> differ.
>
> Here I need "current" from another physical cpu which is not easily
> available.

Oh, right - I should have been able to spot this.

Jan
On 16/05/2019 16:41, Jan Beulich wrote:
>>>> On 16.05.19 at 15:51, <jgross@suse.com> wrote:
>> On 16/05/2019 15:05, Jan Beulich wrote:
>>>>>> On 06.05.19 at 08:56, <jgross@suse.com> wrote:
>>>> --- a/xen/arch/x86/domain.c
>>>> +++ b/xen/arch/x86/domain.c
>>>> @@ -154,6 +154,24 @@ static void idle_loop(void)
>>>>  }
>>>>  }
>>>>
>>>> +/*
>>>> + * Idle loop for siblings of active schedule items.
>>>> + * We don't do any standard idle work like tasklets, page scrubbing or
>>>> + * livepatching.
>>>> + * Use default_idle() in order to simulate v->is_urgent.
>>>
>>> I guess I'm missing a part of the description which explains all this:
>>> What's wrong with doing scrubbing work, for example? Why is
>>> doing tasklet work not okay, but softirqs are? What is the deal with
>>> v->is_urgent, i.e. what justifies not entering a decent power
>>> saving mode here on Intel, but doing so on AMD?
>>
>> One of the reasons for using core scheduling is to avoid running vcpus
>> of different domains on the same core in order to minimize the chances
>> for side channel attacks to data of other domains. Not allowing
>> scrubbing or tasklets here is due to avoid accessing data of other
>> domains.
>
> So how is running softirqs okay then? And how is scrubbing accessing
> other domains' data?

Right now I'm not sure whether it is a good idea to block any softirqs.
We definitely need to process scheduling requests, and I believe RCU
and tasklets, too. The tlbflush one should be uncritical, so the timer
one is the remaining one which might be questionable. This can be
fine-tuned later IMO, e.g. by defining a softirq mask of critical
softirqs to block, and eventually splitting up e.g. the timer and
tasklet softirqs into critical and uncritical ones.

Scrubbing will probably pull the cache lines of the dirty pages into
the L1 cache of the cpu. To me this sounds problematic. In case
scrubbing is fine to do, as there is no risk associated with it, I'm
happy to add it back in.

>> As with core scheduling we can be sure the other thread is active
>> (otherwise we would schedule the idle item) and hoping for saving power
>> by using mwait is moot.
>
> Saving power may be indirect, by the CPU re-arranging
> resource assignment between threads when one goes idle.
> I have no idea whether they do this when entering C1, or
> only when entering deeper C states.

SDM Vol. 3 chapter 8.10.1 "HLT instruction":

"Here shared resources that were being used by the halted logical
processor become available to active logical processors, allowing them
to execute at greater efficiency."

> And anyway - I'm still none the wiser as to the v->is_urgent
> relationship.

With v->is_urgent set, today's idle loop will drop into default_idle().
I can remove this sentence in case it is just confusing.

Juergen
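The fine-tuning Juergen sketches above — servicing only softirqs that are safe on a sibling-idle cpu — boils down to masking the pending set. The softirq indices below are illustrative stand-ins, not Xen's real numbering, and which softirqs end up blocked is exactly the open question of the thread; here only the timer one (the "questionable" one) is filtered out:

```c
#include <assert.h>

/* Illustrative softirq indices; Xen's actual numbering may differ. */
enum {
    TIMER_SOFTIRQ,
    RCU_SOFTIRQ,
    SCHED_SLAVE_SOFTIRQ,
    SCHEDULE_SOFTIRQ,
    NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
    TASKLET_SOFTIRQ,
    NR_SOFTIRQS
};

#define SOFTIRQ_BIT(n) (1u << (n))

/* Softirqs a sibling-idle cpu would NOT service (assumption: only the
 * timer softirq, per the discussion; scheduling, RCU, tasklets and
 * tlbflush all remain serviceable). */
#define BLOCKED_SOFTIRQS SOFTIRQ_BIT(TIMER_SOFTIRQ)

/* Reduce the raw pending mask to the softirqs we are willing to run. */
static unsigned int filter_pending(unsigned int pending)
{
    return pending & ~BLOCKED_SOFTIRQS;
}
```

Splitting a softirq into critical and uncritical halves, as suggested, would then just mean introducing two indices where there is one today and placing only one of them in the blocked mask.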
>>> On 17.05.19 at 07:13, <jgross@suse.com> wrote:
> On 16/05/2019 16:41, Jan Beulich wrote:
>>>>> On 16.05.19 at 15:51, <jgross@suse.com> wrote:
>>> On 16/05/2019 15:05, Jan Beulich wrote:
>>>>>>> On 06.05.19 at 08:56, <jgross@suse.com> wrote:
>>>>> --- a/xen/arch/x86/domain.c
>>>>> +++ b/xen/arch/x86/domain.c
>>>>> @@ -154,6 +154,24 @@ static void idle_loop(void)
>>>>>  }
>>>>>  }
>>>>>
>>>>> +/*
>>>>> + * Idle loop for siblings of active schedule items.
>>>>> + * We don't do any standard idle work like tasklets, page scrubbing or
>>>>> + * livepatching.
>>>>> + * Use default_idle() in order to simulate v->is_urgent.
>>>>
>>>> I guess I'm missing a part of the description which explains all this:
>>>> What's wrong with doing scrubbing work, for example? Why is
>>>> doing tasklet work not okay, but softirqs are? What is the deal with
>>>> v->is_urgent, i.e. what justifies not entering a decent power
>>>> saving mode here on Intel, but doing so on AMD?
>>>
>>> One of the reasons for using core scheduling is to avoid running vcpus
>>> of different domains on the same core in order to minimize the chances
>>> for side channel attacks to data of other domains. Not allowing
>>> scrubbing or tasklets here is due to avoid accessing data of other
>>> domains.
>>
>> So how is running softirqs okay then? And how is scrubbing accessing
>> other domains' data?
>
> Right now I'm not sure whether it is a good idea to block any softirqs.
> We definitely need to process scheduling requests and I believe RCU and
> tasklets, too. The tlbflush one should be uncritical, so timers is the
> remaining one which might be questionable. This can be fine-tuned later
> IMO e.g. by defining a softirq mask of critical softirqs to block and
> eventually splitting up e.g. timer and tasklet softirqs into critical
> and uncritical ones.

Well, okay, but please add an abridged version of this to the patch
description then.

> Scrubbing will probably pull the cache lines of the dirty pages into
> the L1 cache of the cpu. For me this sounds problematic. In case we
> are fine to do scrubbing as there is no risk associated I'm fine to add
> it back in.

Well, of course there's going to be a brief period of time where a
cache line will be present in CPU internal buffers (it's not just the
cache after all, as we've learned with XSA-297). So I can certainly buy
that when using core granularity you don't want to scrub on the other
thread. But what about the socket granularity case? Scrubbing on fully
idle cores should still be fine, I would think.

>>> As with core scheduling we can be sure the other thread is active
>>> (otherwise we would schedule the idle item) and hoping for saving power
>>> by using mwait is moot.
>>
>> Saving power may be indirect, by the CPU re-arranging
>> resource assignment between threads when one goes idle.
>> I have no idea whether they do this when entering C1, or
>> only when entering deeper C states.
>
> SDM Vol. 3 chapter 8.10.1 "HLT instruction":
>
> "Here shared resources that were being used by the halted logical
> processor become available to active logical processors, allowing them
> to execute at greater efficiency."

To be honest, this is too broad/generic a statement to fully trust it,
judging from other areas of the SDM. And then, as per above, what about
the socket granularity case? Putting entirely idle cores to sleep is
surely worthwhile?

>> And anyway - I'm still none the wiser as to the v->is_urgent
>> relationship.
>
> With v->is_urgent set today's idle loop will drop into default_idle().
> I can remove this sentence in case it is just confusing.

I'd prefer if the connection would become more obvious. One needs to go
from ->is_urgent via ->urgent_count to sched_has_urgent_vcpu() to find
where the described behavior really lives.

What's worse though: This won't work as intended on AMD at all. I don't
think it's correct to fall back to default_idle() in this case. Instead
sched_has_urgent_vcpu() returning true should amount to the same effect
as max_cstate being set to 1. There's
(a) no reason not to use MWAIT on Intel CPUs in this case, if MWAIT can
    enter C1, and
(b) a strong need to use MWAIT on (at least) AMD Fam17, or else it
    won't be C1 that gets entered.
I'll see about making a patch in due course.

Jan
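The semantics Jan proposes — an urgent vCPU capping the C state at 1 rather than forcing a plain HLT — can be modeled as selecting from a C-state table under a limit. The descriptor layout and names below are made up for illustration; the point is only that an MWAIT-entered C1 stays eligible under the cap:

```c
#include <assert.h>
#include <stddef.h>

/* Made-up C-state descriptor; only the fields needed for the sketch. */
struct cstate {
    unsigned int type;   /* ACPI C-state number: 1, 2, 3, ... */
    int uses_mwait;      /* entered via MWAIT rather than HLT */
};

/*
 * Pick the deepest state whose type does not exceed the limit. With an
 * urgent vCPU present the limit is 1, yet an MWAIT-entered C1 remains
 * eligible (point (a)); on CPUs where HLT enters something deeper than
 * C1, the MWAIT entry is what actually guarantees C1 (point (b)).
 */
static const struct cstate *pick_cstate(const struct cstate *tbl, size_t n,
                                        unsigned int limit)
{
    const struct cstate *best = NULL;
    size_t i;

    for ( i = 0; i < n; i++ )
        if ( tbl[i].type <= limit )
            best = &tbl[i];
    return best;
}
```

With this shape, "urgent" and "max_cstate=1" naturally become the same mechanism: both just lower the limit passed in.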
On 17/05/2019 08:57, Jan Beulich wrote:
>>>> On 17.05.19 at 07:13, <jgross@suse.com> wrote:
>> On 16/05/2019 16:41, Jan Beulich wrote:
>>>>>> On 16.05.19 at 15:51, <jgross@suse.com> wrote:
>>>> On 16/05/2019 15:05, Jan Beulich wrote:
>>>>>>>> On 06.05.19 at 08:56, <jgross@suse.com> wrote:
>>>>>> --- a/xen/arch/x86/domain.c
>>>>>> +++ b/xen/arch/x86/domain.c
>>>>>> @@ -154,6 +154,24 @@ static void idle_loop(void)
>>>>>>  }
>>>>>>  }
>>>>>>
>>>>>> +/*
>>>>>> + * Idle loop for siblings of active schedule items.
>>>>>> + * We don't do any standard idle work like tasklets, page scrubbing or
>>>>>> + * livepatching.
>>>>>> + * Use default_idle() in order to simulate v->is_urgent.
>>>>>
>>>>> I guess I'm missing a part of the description which explains all this:
>>>>> What's wrong with doing scrubbing work, for example? Why is
>>>>> doing tasklet work not okay, but softirqs are? What is the deal with
>>>>> v->is_urgent, i.e. what justifies not entering a decent power
>>>>> saving mode here on Intel, but doing so on AMD?
>>>>
>>>> One of the reasons for using core scheduling is to avoid running vcpus
>>>> of different domains on the same core in order to minimize the chances
>>>> for side channel attacks to data of other domains. Not allowing
>>>> scrubbing or tasklets here is due to avoid accessing data of other
>>>> domains.
>>>
>>> So how is running softirqs okay then? And how is scrubbing accessing
>>> other domains' data?
>>
>> Right now I'm not sure whether it is a good idea to block any softirqs.
>> We definitely need to process scheduling requests and I believe RCU and
>> tasklets, too. The tlbflush one should be uncritical, so timers is the
>> remaining one which might be questionable. This can be fine-tuned later
>> IMO e.g. by defining a softirq mask of critical softirqs to block and
>> eventually splitting up e.g. timer and tasklet softirqs into critical
>> and uncritical ones.
>
> Well, okay, but please add an abridged version of this to the patch
> description then.

Okay.

>> Scrubbing will probably pull the cache lines of the dirty pages into
>> the L1 cache of the cpu. For me this sounds problematic. In case we
>> are fine to do scrubbing as there is no risk associated I'm fine to add
>> it back in.
>
> Well, of course there's going to be a brief period of time where
> a cache line will be present in CPU internal buffers (it's not just the
> cache after all, as we've learned with XSA-297). So I can certainly
> buy that when using core granularity you don't want to scrub on
> the other thread. But what about the socket granularity case?
> Scrubbing on fully idle cores should still be fine, I would think.

I think this would depend on the reason for selecting socket
scheduling. I'd at least want to have a way to select that, as I could
think of e.g. L3-cache side channel attacks, too. So maybe I could add
a patch on top adding a sub-option to the sched-gran parameter which
will allow (or disallow?) scrubbing on idle cores or threads.

>>>> As with core scheduling we can be sure the other thread is active
>>>> (otherwise we would schedule the idle item) and hoping for saving power
>>>> by using mwait is moot.
>>>
>>> Saving power may be indirect, by the CPU re-arranging
>>> resource assignment between threads when one goes idle.
>>> I have no idea whether they do this when entering C1, or
>>> only when entering deeper C states.
>>
>> SDM Vol. 3 chapter 8.10.1 "HLT instruction":
>>
>> "Here shared resources that were being used by the halted logical
>> processor become available to active logical processors, allowing them
>> to execute at greater efficiency."
>
> To be honest, this is too broad/generic a statement to fully
> trust it, judging from other areas of the SDM. And then, as
> per above, what about the socket granularity case? Putting
> entirely idle cores to sleep is surely worthwhile?

Yes, I assume it is. OTOH this might affect context switches badly, as
the reaction time of the coordinated switch will rise. Maybe a good
reason for another sub-option?

>>> And anyway - I'm still none the wiser as to the v->is_urgent
>>> relationship.
>>
>> With v->is_urgent set today's idle loop will drop into default_idle().
>> I can remove this sentence in case it is just confusing.
>
> I'd prefer if the connection would become more obvious. One
> needs to go from ->is_urgent via ->urgent_count to
> sched_has_urgent_vcpu() to find where the described
> behavior really lives.
>
> What's worse though: This won't work as intended on AMD
> at all. I don't think it's correct to fall back to default_idle() in
> this case. Instead sched_has_urgent_vcpu() returning true
> should amount to the same effect as max_cstate being set
> to 1. There's
> (a) no reason not to use MWAIT on Intel CPUs in this case,
> if MWAIT can enter C1, and
> (b) a strong need to use MWAIT on (at least) AMD Fam17,
> or else it won't be C1 that gets entered.
> I'll see about making a patch in due course.

Thanks. Would you mind doing it in a way that lets the caller specify
max_cstate? This would remove the need to call sched_has_urgent_vcpu()
deep down in the idle handling, and I could re-use it for my purpose.

Juergen
>>> On 17.05.19 at 09:48, <jgross@suse.com> wrote:
> On 17/05/2019 08:57, Jan Beulich wrote:
>>>>> On 17.05.19 at 07:13, <jgross@suse.com> wrote:
>>> On 16/05/2019 16:41, Jan Beulich wrote:
>>>>>>> On 16.05.19 at 15:51, <jgross@suse.com> wrote:
>>>>> As with core scheduling we can be sure the other thread is active
>>>>> (otherwise we would schedule the idle item) and hoping for saving power
>>>>> by using mwait is moot.
>>>>
>>>> Saving power may be indirect, by the CPU re-arranging
>>>> resource assignment between threads when one goes idle.
>>>> I have no idea whether they do this when entering C1, or
>>>> only when entering deeper C states.
>>>
>>> SDM Vol. 3 chapter 8.10.1 "HLT instruction":
>>>
>>> "Here shared resources that were being used by the halted logical
>>> processor become available to active logical processors, allowing them
>>> to execute at greater efficiency."
>>
>> To be honest, this is too broad/generic a statement to fully
>> trust it, judging from other areas of the SDM. And then, as
>> per above, what about the socket granularity case? Putting
>> entirely idle cores to sleep is surely worthwhile?
>
> Yes, I assume it is. OTOH this might affect context switches badly
> as the reaction time for the coordinated switch will rise. Maybe a
> good reason for another sub-option?

While I agree that fine grained control is useful, I'm seeing an
increasing risk of there ending up being too many controls to actually
be certain in the end that all possible combinations work correctly.

>>>> And anyway - I'm still none the wiser as to the v->is_urgent
>>>> relationship.
>>>
>>> With v->is_urgent set today's idle loop will drop into default_idle().
>>> I can remove this sentence in case it is just confusing.
>>
>> I'd prefer if the connection would become more obvious. One
>> needs to go from ->is_urgent via ->urgent_count to
>> sched_has_urgent_vcpu() to find where the described
>> behavior really lives.
>>
>> What's worse though: This won't work as intended on AMD
>> at all. I don't think it's correct to fall back to default_idle() in
>> this case. Instead sched_has_urgent_vcpu() returning true
>> should amount to the same effect as max_cstate being set
>> to 1. There's
>> (a) no reason not to use MWAIT on Intel CPUs in this case,
>> if MWAIT can enter C1, and
>> (b) a strong need to use MWAIT on (at least) AMD Fam17,
>> or else it won't be C1 that gets entered.
>> I'll see about making a patch in due course.
>
> Thanks. Would you mind doing it in a way that the caller can specify
> max_cstate? This would remove the need to call sched_has_urgent_vcpu()
> deep down the idle handling and I could re-use it for my purpose.

Hmm, to be honest I'm not fancying giving a parameter to
default_idle(), pm_idle(), and friends. Conceptually it is not the
business of the callers to control the C states to be used.

What about the opposite: You simply mark the idle (v)CPUs in question
as "urgent", thus achieving the intended effect as well.

Jan
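Jan's counter-proposal — mark the sibling-idle vCPU "urgent" so the existing sched_has_urgent_vcpu() path caps the C state — amounts to bumping a per-cpu counter. A minimal model of that accounting (field and function names follow the thread's references to Xen, but the code is a simplified single-cpu sketch):

```c
#include <assert.h>

/* Simplified model of a vCPU's urgency flag. */
struct vcpu_model {
    int is_urgent;
};

static int urgent_count;   /* per-cpu in Xen; a single counter here */

/* Keep the counter in sync with the flag; idempotent on repeat calls. */
static void vcpu_urgent_count_update(struct vcpu_model *v, int urgent)
{
    if ( urgent && !v->is_urgent )
    {
        v->is_urgent = 1;
        urgent_count++;
    }
    else if ( !urgent && v->is_urgent )
    {
        v->is_urgent = 0;
        urgent_count--;
    }
}

/* The idle path consults this to limit how deep a C state to enter. */
static int sched_has_urgent_vcpu(void)
{
    return urgent_count > 0;
}
```

Marking the sibling-idle vCPU urgent on schedule-in and clearing it on schedule-out would thus reuse the existing mechanism without threading a max_cstate parameter through default_idle(), pm_idle(), and friends.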
On 17/05/2019 10:22, Jan Beulich wrote:
>>>> On 17.05.19 at 09:48, <jgross@suse.com> wrote:
>> On 17/05/2019 08:57, Jan Beulich wrote:
>>>>>> On 17.05.19 at 07:13, <jgross@suse.com> wrote:
>>>> On 16/05/2019 16:41, Jan Beulich wrote:
>>>>>>>> On 16.05.19 at 15:51, <jgross@suse.com> wrote:
>>>>>> As with core scheduling we can be sure the other thread is active
>>>>>> (otherwise we would schedule the idle item) and hoping for saving power
>>>>>> by using mwait is moot.
>>>>>
>>>>> Saving power may be indirect, by the CPU re-arranging
>>>>> resource assignment between threads when one goes idle.
>>>>> I have no idea whether they do this when entering C1, or
>>>>> only when entering deeper C states.
>>>>
>>>> SDM Vol. 3 chapter 8.10.1 "HLT instruction":
>>>>
>>>> "Here shared resources that were being used by the halted logical
>>>> processor become available to active logical processors, allowing them
>>>> to execute at greater efficiency."
>>>
>>> To be honest, this is too broad/generic a statement to fully
>>> trust it, judging from other areas of the SDM. And then, as
>>> per above, what about the socket granularity case? Putting
>>> entirely idle cores to sleep is surely worthwhile?
>>
>> Yes, I assume it is. OTOH this might affect context switches badly
>> as the reaction time for the coordinated switch will rise. Maybe a
>> good reason for another sub-option?
>
> While I agree that fine grained control is useful, I'm seeing an
> increasing risk of there going to be too many controls to actually
> be certain in the end that all possible combinations work
> correctly.

Okay, I think I'll leave it as is for the moment and do some
performance tests later. Depending on the results we can then decide
how to proceed.

>>>>> And anyway - I'm still none the wiser as to the v->is_urgent
>>>>> relationship.
>>>>
>>>> With v->is_urgent set today's idle loop will drop into default_idle().
>>>> I can remove this sentence in case it is just confusing.
>>>
>>> I'd prefer if the connection would become more obvious. One
>>> needs to go from ->is_urgent via ->urgent_count to
>>> sched_has_urgent_vcpu() to find where the described
>>> behavior really lives.
>>>
>>> What's worse though: This won't work as intended on AMD
>>> at all. I don't think it's correct to fall back to default_idle() in
>>> this case. Instead sched_has_urgent_vcpu() returning true
>>> should amount to the same effect as max_cstate being set
>>> to 1. There's
>>> (a) no reason not to use MWAIT on Intel CPUs in this case,
>>> if MWAIT can enter C1, and
>>> (b) a strong need to use MWAIT on (at least) AMD Fam17,
>>> or else it won't be C1 that gets entered.
>>> I'll see about making a patch in due course.
>>
>> Thanks. Would you mind doing it in a way that the caller can specify
>> max_cstate? This would remove the need to call sched_has_urgent_vcpu()
>> deep down the idle handling and I could re-use it for my purpose.
>
> Hmm, to be honest I'm not fancying giving a parameter to
> default_idle(), pm_idle(), and friends. Conceptually it is not
> the business of the callers to control the C states to be used.
>
> What about the opposite: You simply mark idle (v)CPUs in
> question as "urgent", thus achieving the intended effect as
> well.

I can do that, of course. IMO letting the caller specify how it wants
idle to behave is cleaner than burying such an implicit dependency deep
down in the code, but you are the maintainer of that part.

Juergen
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index d04e704116..f3dbca5dba 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -154,6 +154,24 @@ static void idle_loop(void)
     }
 }
 
+/*
+ * Idle loop for siblings of active schedule items.
+ * We don't do any standard idle work like tasklets, page scrubbing or
+ * livepatching.
+ * Use default_idle() in order to simulate v->is_urgent.
+ */
+static void guest_idle_loop(void)
+{
+    unsigned int cpu = smp_processor_id();
+
+    for ( ; ; )
+    {
+        if ( !softirq_pending(cpu) )
+            default_idle();
+        do_softirq();
+    }
+}
+
 void startup_cpu_idle_loop(void)
 {
     struct vcpu *v = current;
@@ -167,6 +185,9 @@ void startup_cpu_idle_loop(void)
 
 static void noreturn continue_idle_domain(struct vcpu *v)
 {
+    if ( !is_idle_item(v->sched_item) )
+        reset_stack_and_jump(guest_idle_loop);
+
     reset_stack_and_jump(idle_loop);
 }
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 0de199ccc9..788ecc9e81 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -82,7 +82,18 @@ static struct scheduler __read_mostly ops;
 static inline struct vcpu *sched_item2vcpu_cpu(struct sched_item *item,
                                                unsigned int cpu)
 {
-    return item->domain->vcpu[item->item_id + per_cpu(sched_res_idx, cpu)];
+    unsigned int idx = item->item_id + per_cpu(sched_res_idx, cpu);
+    const struct domain *d = item->domain;
+    struct vcpu *v;
+
+    if ( idx < d->max_vcpus && d->vcpu[idx] )
+    {
+        v = d->vcpu[idx];
+        if ( v->new_state == RUNSTATE_running )
+            return v;
+    }
+
+    return idle_vcpu[cpu];
 }
 
 static inline struct scheduler *dom_scheduler(const struct domain *d)
@@ -196,8 +207,11 @@ static inline void vcpu_runstate_change(
 
     trace_runstate_change(v, new_state);
 
-    item->runstate_cnt[v->runstate.state]--;
-    item->runstate_cnt[new_state]++;
+    if ( !is_idle_vcpu(v) )
+    {
+        item->runstate_cnt[v->runstate.state]--;
+        item->runstate_cnt[new_state]++;
+    }
 
     delta = new_entry_time - v->runstate.state_entry_time;
     if ( delta > 0 )
@@ -209,21 +223,6 @@ static inline void vcpu_runstate_change(
     v->runstate.state = new_state;
 }
 
-static inline void sched_item_runstate_change(struct sched_item *item,
-    bool running, s_time_t new_entry_time)
-{
-    struct vcpu *v;
-
-    for_each_sched_item_vcpu( item, v )
-        if ( running )
-            vcpu_runstate_change(v, v->new_state, new_entry_time);
-        else
-            vcpu_runstate_change(v,
-                ((v->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
-                 (vcpu_runnable(v) ? RUNSTATE_runnable : RUNSTATE_offline)),
-                new_entry_time);
-}
-
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate)
 {
     spinlock_t *lock = likely(v == current)
@@ -456,6 +455,7 @@ int sched_init_vcpu(struct vcpu *v)
     if ( is_idle_domain(d) )
     {
         per_cpu(sched_res, v->processor)->curr = item;
+        per_cpu(sched_res, v->processor)->sched_item_idle = item;
         v->is_running = 1;
         item->is_running = 1;
         item->state_entry_time = NOW();
@@ -1631,33 +1631,67 @@ static void sched_switch_items(struct sched_resource *sd,
                                struct sched_item *next,
                                struct sched_item *prev, s_time_t now)
 {
-    sd->curr = next;
-
-    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->item_id,
-             now - prev->state_entry_time);
-    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->item_id,
-             (next->vcpu->runstate.state == RUNSTATE_runnable) ?
-             (now - next->state_entry_time) : 0, prev->next_time);
+    int cpu;
 
     ASSERT(item_running(prev));
 
-    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->item_id,
-             next->domain->domain_id, next->item_id);
+    if ( prev != next )
+    {
+        sd->curr = next;
+        sd->prev = prev;
 
-    sched_item_runstate_change(prev, false, now);
-    prev->last_run_time = now;
+        TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id,
+                 prev->item_id, now - prev->state_entry_time);
+        TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id,
+                 next->item_id,
+                 (next->vcpu->runstate.state == RUNSTATE_runnable) ?
+                 (now - next->state_entry_time) : 0, prev->next_time);
+        TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->item_id,
+                 next->domain->domain_id, next->item_id);
 
-    ASSERT(!item_running(next));
-    sched_item_runstate_change(next, true, now);
+        prev->last_run_time = now;
 
-    /*
-     * NB. Don't add any trace records from here until the actual context
-     * switch, else lost_records resume will not work properly.
-     */
+        ASSERT(!item_running(next));
+
+        /*
+         * NB. Don't add any trace records from here until the actual context
+         * switch, else lost_records resume will not work properly.
+         */
+
+        ASSERT(!next->is_running);
+        next->is_running = 1;
 
-    ASSERT(!next->is_running);
-    next->vcpu->is_running = 1;
-    next->is_running = 1;
+        if ( is_idle_item(prev) )
+        {
+            prev->runstate_cnt[RUNSTATE_running] = 0;
+            prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
+        }
+        if ( is_idle_item(next) )
+        {
+            next->runstate_cnt[RUNSTATE_running] = sched_granularity;
+            next->runstate_cnt[RUNSTATE_runnable] = 0;
+        }
+    }
+
+    for_each_cpu( cpu, sd->cpus )
+    {
+        struct vcpu *vprev = get_cpu_current(cpu);
+        struct vcpu *vnext = sched_item2vcpu_cpu(next, cpu);
+
+        if ( vprev != vnext || vprev->runstate.state != vnext->new_state )
+        {
+            vcpu_runstate_change(vprev,
+                ((vprev->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
+                 (vcpu_runnable(vprev) ? RUNSTATE_runnable : RUNSTATE_offline)),
+                now);
+            vcpu_runstate_change(vnext, vnext->new_state, now);
+        }
+
+        vnext->is_running = 1;
+
+        if ( is_idle_vcpu(vnext) )
+            vnext->sched_item = next;
+    }
 }
 
 static bool sched_tasklet_check(void)
@@ -1706,25 +1740,25 @@ static struct sched_item *do_schedule(struct sched_item *prev, s_time_t now)
     if ( prev->next_time >= 0 ) /* -ve means no limit */
         set_timer(&sd->s_timer, now + prev->next_time);
 
-    if ( likely(prev != next) )
-        sched_switch_items(sd, next, prev, now);
+    sched_switch_items(sd, next, prev, now);
 
     return next;
 }
 
-static void context_saved(struct vcpu *prev)
+static void context_saved(struct sched_item *item)
 {
-    struct sched_item *item = prev->sched_item;
-
     item->is_running = 0;
     item->state_entry_time = NOW();
+    this_cpu(sched_res)->prev = NULL;
 
     /* Check for migration request /after/ clearing running flag. */
     smp_mb();
 
-    sched_context_saved(vcpu_scheduler(prev), item);
+    sched_context_saved(vcpu_scheduler(item->vcpu), item);
 
-    sched_item_migrate_finish(item);
+    /* Idle never migrates and idle vcpus might belong to other items. */
+    if ( !is_idle_item(item) )
+        sched_item_migrate_finish(item);
 }
 
 /*
@@ -1741,11 +1775,13 @@ static void context_saved(struct vcpu *prev)
 void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
 {
     struct sched_item *next = vnext->sched_item;
+    struct sched_resource *sd = this_cpu(sched_res);
 
     /* Clear running flag /after/ writing context to memory. */
     smp_wmb();
 
-    vprev->is_running = 0;
+    if ( vprev != vnext )
+        vprev->is_running = 0;
 
     if ( atomic_read(&next->rendezvous_out_cnt) )
     {
@@ -1754,20 +1790,23 @@ void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
         /* Call context_saved() before releasing other waiters. */
         if ( cnt == 1 )
         {
-            if ( vprev != vnext )
-                context_saved(vprev);
+            if ( sd->prev )
+                context_saved(sd->prev);
             atomic_set(&next->rendezvous_out_cnt, 0);
         }
         else
             while ( atomic_read(&next->rendezvous_out_cnt) )
                 cpu_relax();
     }
-    else if ( vprev != vnext && sched_granularity == 1 )
-        context_saved(vprev);
+    else if ( sd->prev )
+        context_saved(sd->prev);
+
+    if ( is_idle_vcpu(vprev) && vprev != vnext )
+        vprev->sched_item = sd->sched_item_idle;
 }
 
 static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
-                                 s_time_t now)
+                                 bool reset_idle_item, s_time_t now)
 {
     if ( unlikely(vprev == vnext) )
     {
@@ -1776,6 +1815,10 @@ static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
                  now - vprev->runstate.state_entry_time,
                  vprev->sched_item->next_time);
         sched_context_switched(vprev, vnext);
+
+        if ( reset_idle_item )
+            vnext->sched_item = this_cpu(sched_res)->sched_item_idle;
+
         trace_continue_running(vnext);
         return continue_running(vprev);
     }
@@ -1851,7 +1894,8 @@ static void sched_slave(void)
 
     pcpu_schedule_unlock_irq(lock, cpu);
 
-    sched_context_switch(vprev, sched_item2vcpu_cpu(next, cpu), now);
+    sched_context_switch(vprev, sched_item2vcpu_cpu(next, cpu),
+                         is_idle_item(next) && !is_idle_item(prev), now);
 }
 
 /*
@@ -1911,7 +1955,8 @@ static void schedule(void)
 
     pcpu_schedule_unlock_irq(lock, cpu);
 
     vnext = sched_item2vcpu_cpu(next, cpu);
-    sched_context_switch(vprev, vnext, now);
+    sched_context_switch(vprev, vnext,
+                         !is_idle_item(prev) && is_idle_item(next), now);
 }
 
 /* The scheduler timer: force a run through the scheduler */
@@ -1993,6 +2038,7 @@ static int cpu_schedule_up(unsigned int cpu)
         return -ENOMEM;
 
     sd->curr = idle_vcpu[cpu]->sched_item;
+    sd->sched_item_idle = idle_vcpu[cpu]->sched_item;
 
     /*
      * We don't want to risk calling xfree() on an sd->sched_priv
@@ -2170,6 +2216,7 @@ void __init scheduler_init(void)
     if ( vcpu_create(idle_domain, 0) == NULL )
         BUG();
     this_cpu(sched_res)->curr = idle_vcpu[0]->sched_item;
+    this_cpu(sched_res)->sched_item_idle = idle_vcpu[0]->sched_item;
     this_cpu(sched_res)->sched_priv = sched_alloc_pdata(&ops, 0);
     BUG_ON(IS_ERR(this_cpu(sched_res)->sched_priv));
     scheduler_percpu_init(0);
diff --git a/xen/include/asm-arm/current.h b/xen/include/asm-arm/current.h
index c4af66fbb9..a7602eef8c 100644
--- a/xen/include/asm-arm/current.h
+++ b/xen/include/asm-arm/current.h
@@ -18,6 +18,7 @@ DECLARE_PER_CPU(struct vcpu *, curr_vcpu);
 
 #define current            (this_cpu(curr_vcpu))
 #define set_current(vcpu)  do { current = (vcpu); } while (0)
+#define get_cpu_current(cpu)  (per_cpu(curr_vcpu, cpu))
 
 /* Per-VCPU state that lives at the top of the stack */
 struct cpu_info {
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 5bd64b2271..cb5b6f1176 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -76,6 +76,11 @@ struct cpu_info {
     /* get_stack_bottom() must be 16-byte aligned */
 };
 
+static inline struct cpu_info *get_cpu_info_from_stack(unsigned long sp)
+{
+    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
+}
+
 static inline struct cpu_info *get_cpu_info(void)
 {
 #ifdef __clang__
@@ -86,7 +91,7 @@ static inline struct cpu_info *get_cpu_info(void)
     register unsigned long sp asm("rsp");
 #endif
 
-    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
+    return get_cpu_info_from_stack(sp);
 }
 
 #define get_current()         (get_cpu_info()->current_vcpu)
diff --git a/xen/include/asm-x86/smp.h b/xen/include/asm-x86/smp.h
index 9f533f9072..51a31ab00a 100644
--- a/xen/include/asm-x86/smp.h
+++ b/xen/include/asm-x86/smp.h
@@ -76,6 +76,9 @@ void set_nr_sockets(void);
 
 /* Representing HT and core siblings in each socket.
*/ extern cpumask_t **socket_cpumask; +#define get_cpu_current(cpu) \ + (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu) + #endif /* !__ASSEMBLY__ */ #endif diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h index b3921f3a41..8981d41629 100644 --- a/xen/include/xen/sched-if.h +++ b/xen/include/xen/sched-if.h @@ -39,6 +39,8 @@ struct sched_resource { spinlock_t *schedule_lock, _lock; struct sched_item *curr; /* current task */ + struct sched_item *sched_item_idle; + struct sched_item *prev; /* previous task */ void *sched_priv; struct timer s_timer; /* scheduling timer */ atomic_t urgent_count; /* how many urgent vcpus */ @@ -152,7 +154,7 @@ static inline void sched_clear_pause_flags_atomic(struct sched_item *item, static inline struct sched_item *sched_idle_item(unsigned int cpu) { - return idle_vcpu[cpu]->sched_item; + return per_cpu(sched_res, cpu)->sched_item_idle; } static inline unsigned int sched_get_resource_cpu(unsigned int cpu)
When scheduling an item with multiple vcpus there is no guarantee all
vcpus are available (e.g. above maxvcpus or vcpu offline). Fall back to
the idle vcpu of the current cpu in that case. This requires storing
the correct schedule_item pointer in the idle vcpu for as long as it is
used as a fallback vcpu.

In order to modify the runstates of the correct vcpus when switching
schedule items, merge sched_item_runstate_change() into
sched_switch_items() and loop over the affected physical cpus instead
of the item's vcpus. This in turn requires an access function for the
current variable of other cpus.

Today context_saved() is called when previous and next vcpus differ
during a context switch. With an idle vcpu able to substitute for an
offline vcpu this is problematic when switching to an idle scheduling
item: an idle previous vcpu leaves us in doubt which schedule item was
active previously, so save the previous item pointer in the
per-schedule resource area and use its being non-NULL as a hint whether
context_saved() should be called.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
RFC V2: new patch (Andrew Cooper)
---
 xen/arch/x86/domain.c         |  21 ++++++
 xen/common/schedule.c         | 153 +++++++++++++++++++++++++++---------------
 xen/include/asm-arm/current.h |   1 +
 xen/include/asm-x86/current.h |   7 +-
 xen/include/asm-x86/smp.h     |   3 +
 xen/include/xen/sched-if.h    |   4 +-
 6 files changed, 134 insertions(+), 55 deletions(-)