Message ID | 20160318190538.8117.96025.stgit@Solace.station (mailing list archive) |
---|---|
State | New, archived |
On 18/03/16 20:05, Dario Faggioli wrote:
> In fact, credit2 uses CPU topology to decide how to arrange
> its internal runqueues. Before this change, only 'one runqueue
> per socket' was allowed. However, experiments have shown that,
> for instance, having one runqueue per physical core improves
> performance, especially in case hyperthreading is available.
>
> In general, it makes sense to allow users to pick one runqueue
> arrangement at boot time, so that:
>  - more experiments can be easily performed to even better
>    assess and improve performance;
>  - one can select the best configuration for his specific
>    use case and/or hardware.
>
> This patch enables the above.
>
> Note that, for correctly arranging runqueues to be per-core,
> just checking cpu_to_core() on the host CPUs is not enough.
> In fact, cores (and hyperthreads) on different sockets, can
> have the same core (and thread) IDs! We, therefore, need to
> check whether the full topology of two CPUs matches, for
> them to be put in the same runqueue.
>
> Note also that the default (although not functional) for
> credit2, since now, has been per-socket runqueue. This patch
> leaves things that way, to avoid mixing policy and technical
> changes.
>
> Finally, it would be a nice feature to be able to select
> a particular runqueue arrangement, even when creating a
> Credit2 cpupool. This is left as future work.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Uma Sharma <uma.sharma523@gmail.com>
> Cc: Juergen Gross <jgross@suse.com>
> ---
> Changes from v1:
>  * added 'node' and 'global' runqueue arrangements, as
>    suggested during review;
> ---
>  docs/misc/xen-command-line.markdown | 19 +++++++++
>  xen/common/sched_credit2.c          | 76 +++++++++++++++++++++++++++++++++--
>  2 files changed, 90 insertions(+), 5 deletions(-)
>
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index ca77e3b..0047f94 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -469,6 +469,25 @@ combination with the `low_crashinfo` command line option.
>  ### credit2\_load\_window\_shift
>  > `= <integer>`
>
> +### credit2\_runqueue
> +> `= core | socket | node | all`
> +
> +> Default: `socket`
> +
> +Specify how host CPUs are arranged in runqueues. Runqueues are kept
> +balanced with respect to the load generated by the vCPUs running on
> +them. Smaller runqueues (as in with `core`) means more accurate load
> +balancing (for instance, it will deal better with hyperthreading),
> +but also more overhead.
> +
> +Available alternatives, with their meaning, are:
> +* `core`: one runqueue per each physical core of the host;
> +* `socket`: one runqueue per each physical socket (which often,
> +  but not always, matches a NUMA node) of the host;
> +* `node`: one runqueue per each NUMA node of the host;
> +* `all`: just one runqueue shared by all the logical pCPUs of
> +  the host
> +
>  ### dbgp
>  > `= ehci[ <integer> | @pci<bus>:<slot>.<func> ]`
>
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 456b9ea..c242dc4 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -81,10 +81,6 @@
>   * Credits are "reset" when the next vcpu in the runqueue is less than
>   * or equal to zero. At that point, everyone's credits are "clipped"
>   * to a small value, and a fixed credit is added to everyone.
> - *
> - * The plan is for all cores that share an L2 will share the same
> - * runqueue. At the moment, there is one global runqueue for all
> - * cores.
>   */
>
>  /*
> @@ -193,6 +189,55 @@ static int __read_mostly opt_overload_balance_tolerance = -3;
>  integer_param("credit2_balance_over", opt_overload_balance_tolerance);
>
>  /*
> + * Runqueue organization.
> + *
> + * The various cpus are to be assigned each one to a runqueue, and we
> + * want that to happen basing on topology. At the moment, it is possible
> + * to choose to arrange runqueues to be:
> + *
> + * - per-core: meaning that there will be one runqueue per each physical
> + *             core of the host. This will happen if the opt_runqueue
> + *             parameter is set to 'core';
> + *
> + * - per-node: meaning that there will be one runqueue per each physical
> + *             NUMA node of the host. This will happen if the opt_runqueue
> + *             parameter is set to 'node';
> + *
> + * - per-socket: meaning that there will be one runqueue per each physical
> + *               socket (AKA package, which often, but not always, also
> + *               matches a NUMA node) of the host; This will happen if
> + *               the opt_runqueue parameter is set to 'socket';
> + *
> + * - global: meaning that there will be only one runqueue to which all the
> + *           (logical) processors of the host belongs. This will happen if
> + *           the opt_runqueue parameter is set to 'all'.
> + *
> + * Depending on the value of opt_runqueue, therefore, cpus that are part of
> + * either the same physical core, or of the same physical socket, will be
> + * put together to form runqueues.
> + */
> +#define OPT_RUNQUEUE_CORE   1
> +#define OPT_RUNQUEUE_SOCKET 2
> +#define OPT_RUNQUEUE_NODE   3
> +#define OPT_RUNQUEUE_ALL    4
> +static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET;
> +
> +static void parse_credit2_runqueue(const char *s)
> +{
> +    if ( !strncmp(s, "core", 4) && !s[4] )
> +        opt_runqueue = OPT_RUNQUEUE_CORE;
> +    else if ( !strncmp(s, "socket", 6) && !s[6] )
> +        opt_runqueue = OPT_RUNQUEUE_SOCKET;
> +    else if ( !strncmp(s, "node", 4) && !s[4] )
> +        opt_runqueue = OPT_RUNQUEUE_NODE;
> +    else if ( !strncmp(s, "all", 6) && !s[6] )

The length is wrong: it should be 3 instead of 6 here.

Which raises the question: why don't you use strcmp() here? I don't
see any advantage in using strncmp() in this case, especially as you've
just shown that it is more error-prone.

> +        opt_runqueue = OPT_RUNQUEUE_ALL;
> +    else
> +        printk("WARNING, unrecognized value of credit2_runqueue option!\n");
> +}
> +custom_param("credit2_runqueue", parse_credit2_runqueue);
> +
> +/*
>   * Per-runqueue data
>   */
>  struct csched2_runqueue_data {
> @@ -1971,6 +2016,22 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
>      cpumask_clear_cpu(rqi, &prv->active_queues);
>  }
>
> +static inline bool_t same_node(unsigned int cpua, unsigned int cpub)
> +{
> +    return cpu_to_node(cpua) == cpu_to_node(cpub);
> +}
> +
> +static inline bool_t same_socket(unsigned int cpua, unsigned int cpub)
> +{
> +    return cpu_to_socket(cpua) == cpu_to_socket(cpub);
> +}
> +
> +static inline bool_t same_core(unsigned int cpua, unsigned int cpub)
> +{
> +    return same_socket(cpua, cpub) &&
> +           cpu_to_core(cpua) == cpu_to_core(cpub);
> +}
> +
>  static unsigned int
>  cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>  {
> @@ -2003,7 +2064,10 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>          BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
>                 cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
>
> -        if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) )
> +        if ( opt_runqueue == OPT_RUNQUEUE_ALL ||
> +             (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) ||
> +             (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) ||
> +             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) )
>              break;
>      }
>
> @@ -2157,6 +2221,8 @@ csched2_init(struct scheduler *ops)
>      printk(" load_window_shift: %d\n", opt_load_window_shift);
>      printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
>      printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
> +    printk(" runqueues arrangement: per-%s\n",
> +           opt_runqueue == OPT_RUNQUEUE_CORE ? "core" : "socket");

node? all?

>
>      if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN )
>      {

Juergen
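Both review comments (the wrong strncmp() length for "all", and the printk() that only reports "core" or "socket") stem from spelling out the option names more than once. One possible way to avoid that is to drive both parsing and printing from a single name table. The following is only a minimal userspace sketch of that idea, not code from this patch or from any later revision of it: the runqueue_opts table and the runqueue_opt_parse() and runqueue_opt_name() helpers are hypothetical names, and an in-tree version would still be hooked up through custom_param() and printk().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same values as the OPT_RUNQUEUE_* defines in the patch. */
#define OPT_RUNQUEUE_CORE   1
#define OPT_RUNQUEUE_SOCKET 2
#define OPT_RUNQUEUE_NODE   3
#define OPT_RUNQUEUE_ALL    4

/* Hypothetical table: each accepted name is written exactly once. */
static const struct {
    const char *name;
    int value;
} runqueue_opts[] = {
    { "core",   OPT_RUNQUEUE_CORE   },
    { "socket", OPT_RUNQUEUE_SOCKET },
    { "node",   OPT_RUNQUEUE_NODE   },
    { "all",    OPT_RUNQUEUE_ALL    },
};

#define NR_RUNQUEUE_OPTS (sizeof(runqueue_opts) / sizeof(runqueue_opts[0]))

/*
 * Return the OPT_RUNQUEUE_* value matching s, or 0 if s is not a
 * recognized name. strcmp() compares whole strings, so there is no
 * length argument to get wrong.
 */
static int runqueue_opt_parse(const char *s)
{
    size_t i;

    assert(s != NULL);

    for ( i = 0; i < NR_RUNQUEUE_OPTS; i++ )
        if ( !strcmp(s, runqueue_opts[i].name) )
            return runqueue_opts[i].value;

    return 0;
}

/* Return the name for an OPT_RUNQUEUE_* value, or "unknown". */
static const char *runqueue_opt_name(int value)
{
    size_t i;

    for ( i = 0; i < NR_RUNQUEUE_OPTS; i++ )
        if ( runqueue_opts[i].value == value )
            return runqueue_opts[i].name;

    return "unknown";
}
```

With a table like this, the startup message could print runqueue_opt_name(opt_runqueue) and would then cover 'node' and 'all' without a longer conditional expression.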
On 18/03/16 19:05, Dario Faggioli wrote:
> In fact, credit2 uses CPU topology to decide how to arrange
> its internal runqueues. Before this change, only 'one runqueue
> per socket' was allowed. However, experiments have shown that,
> for instance, having one runqueue per physical core improves
> performance, especially in case hyperthreading is available.
>
> In general, it makes sense to allow users to pick one runqueue
> arrangement at boot time, so that:
>  - more experiments can be easily performed to even better
>    assess and improve performance;
>  - one can select the best configuration for his specific
>    use case and/or hardware.
>
> This patch enables the above.
>
> Note that, for correctly arranging runqueues to be per-core,
> just checking cpu_to_core() on the host CPUs is not enough.
> In fact, cores (and hyperthreads) on different sockets, can
> have the same core (and thread) IDs! We, therefore, need to
> check whether the full topology of two CPUs matches, for
> them to be put in the same runqueue.
>
> Note also that the default (although not functional) for
> credit2, since now, has been per-socket runqueue. This patch
> leaves things that way, to avoid mixing policy and technical
> changes.
>
> Finally, it would be a nice feature to be able to select
> a particular runqueue arrangement, even when creating a
> Credit2 cpupool. This is left as future work.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>

Looks good, apart from the two errors Juergen pointed out.

Thanks,
 -George
diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index ca77e3b..0047f94 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -469,6 +469,25 @@ combination with the `low_crashinfo` command line option.
 ### credit2\_load\_window\_shift
 > `= <integer>`

+### credit2\_runqueue
+> `= core | socket | node | all`
+
+> Default: `socket`
+
+Specify how host CPUs are arranged in runqueues. Runqueues are kept
+balanced with respect to the load generated by the vCPUs running on
+them. Smaller runqueues (as in with `core`) means more accurate load
+balancing (for instance, it will deal better with hyperthreading),
+but also more overhead.
+
+Available alternatives, with their meaning, are:
+* `core`: one runqueue per each physical core of the host;
+* `socket`: one runqueue per each physical socket (which often,
+  but not always, matches a NUMA node) of the host;
+* `node`: one runqueue per each NUMA node of the host;
+* `all`: just one runqueue shared by all the logical pCPUs of
+  the host
+
 ### dbgp
 > `= ehci[ <integer> | @pci<bus>:<slot>.<func> ]`

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 456b9ea..c242dc4 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -81,10 +81,6 @@
  * Credits are "reset" when the next vcpu in the runqueue is less than
  * or equal to zero. At that point, everyone's credits are "clipped"
  * to a small value, and a fixed credit is added to everyone.
- *
- * The plan is for all cores that share an L2 will share the same
- * runqueue. At the moment, there is one global runqueue for all
- * cores.
  */

 /*
@@ -193,6 +189,55 @@ static int __read_mostly opt_overload_balance_tolerance = -3;
 integer_param("credit2_balance_over", opt_overload_balance_tolerance);

 /*
+ * Runqueue organization.
+ *
+ * The various cpus are to be assigned each one to a runqueue, and we
+ * want that to happen basing on topology. At the moment, it is possible
+ * to choose to arrange runqueues to be:
+ *
+ * - per-core: meaning that there will be one runqueue per each physical
+ *             core of the host. This will happen if the opt_runqueue
+ *             parameter is set to 'core';
+ *
+ * - per-node: meaning that there will be one runqueue per each physical
+ *             NUMA node of the host. This will happen if the opt_runqueue
+ *             parameter is set to 'node';
+ *
+ * - per-socket: meaning that there will be one runqueue per each physical
+ *               socket (AKA package, which often, but not always, also
+ *               matches a NUMA node) of the host; This will happen if
+ *               the opt_runqueue parameter is set to 'socket';
+ *
+ * - global: meaning that there will be only one runqueue to which all the
+ *           (logical) processors of the host belongs. This will happen if
+ *           the opt_runqueue parameter is set to 'all'.
+ *
+ * Depending on the value of opt_runqueue, therefore, cpus that are part of
+ * either the same physical core, or of the same physical socket, will be
+ * put together to form runqueues.
+ */
+#define OPT_RUNQUEUE_CORE   1
+#define OPT_RUNQUEUE_SOCKET 2
+#define OPT_RUNQUEUE_NODE   3
+#define OPT_RUNQUEUE_ALL    4
+static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET;
+
+static void parse_credit2_runqueue(const char *s)
+{
+    if ( !strncmp(s, "core", 4) && !s[4] )
+        opt_runqueue = OPT_RUNQUEUE_CORE;
+    else if ( !strncmp(s, "socket", 6) && !s[6] )
+        opt_runqueue = OPT_RUNQUEUE_SOCKET;
+    else if ( !strncmp(s, "node", 4) && !s[4] )
+        opt_runqueue = OPT_RUNQUEUE_NODE;
+    else if ( !strncmp(s, "all", 6) && !s[6] )
+        opt_runqueue = OPT_RUNQUEUE_ALL;
+    else
+        printk("WARNING, unrecognized value of credit2_runqueue option!\n");
+}
+custom_param("credit2_runqueue", parse_credit2_runqueue);
+
+/*
  * Per-runqueue data
  */
 struct csched2_runqueue_data {
@@ -1971,6 +2016,22 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
     cpumask_clear_cpu(rqi, &prv->active_queues);
 }

+static inline bool_t same_node(unsigned int cpua, unsigned int cpub)
+{
+    return cpu_to_node(cpua) == cpu_to_node(cpub);
+}
+
+static inline bool_t same_socket(unsigned int cpua, unsigned int cpub)
+{
+    return cpu_to_socket(cpua) == cpu_to_socket(cpub);
+}
+
+static inline bool_t same_core(unsigned int cpua, unsigned int cpub)
+{
+    return same_socket(cpua, cpub) &&
+           cpu_to_core(cpua) == cpu_to_core(cpub);
+}
+
 static unsigned int
 cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
 {
@@ -2003,7 +2064,10 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
         BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
                cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);

-        if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) )
+        if ( opt_runqueue == OPT_RUNQUEUE_ALL ||
+             (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) ||
+             (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) ||
+             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) )
             break;
     }

@@ -2157,6 +2221,8 @@ csched2_init(struct scheduler *ops)
     printk(" load_window_shift: %d\n", opt_load_window_shift);
     printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
     printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
+    printk(" runqueues arrangement: per-%s\n",
+           opt_runqueue == OPT_RUNQUEUE_CORE ? "core" : "socket");

     if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN )
     {
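The commit message notes that core IDs repeat across sockets, which is why same_core() in the patch also requires same_socket(). A small userspace sketch can make the difference visible. It assumes a made-up host with 2 sockets (one NUMA node each), 2 cores per socket and 2 threads per core; the topo[] table, core_id_only() and count_runqueues() are illustrative names and not Xen code, and the grouping loop only mimics the "join the first matching runqueue" idea of cpu_to_runqueue().

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical topology; note that core IDs 0 and 1 appear on both sockets. */
struct cpu_topo {
    unsigned int node, socket, core;
};

static const struct cpu_topo topo[NR_CPUS] = {
    { 0, 0, 0 }, { 0, 0, 0 }, { 0, 0, 1 }, { 0, 0, 1 },
    { 1, 1, 0 }, { 1, 1, 0 }, { 1, 1, 1 }, { 1, 1, 1 },
};

static bool same_node(unsigned int a, unsigned int b)
{
    assert(a < NR_CPUS && b < NR_CPUS);
    return topo[a].node == topo[b].node;
}

static bool same_socket(unsigned int a, unsigned int b)
{
    assert(a < NR_CPUS && b < NR_CPUS);
    return topo[a].socket == topo[b].socket;
}

/* Full check, as in the patch: same socket AND same core ID. */
static bool same_core(unsigned int a, unsigned int b)
{
    return same_socket(a, b) && topo[a].core == topo[b].core;
}

/* Naive check that only compares core IDs (what the patch avoids). */
static bool core_id_only(unsigned int a, unsigned int b)
{
    assert(a < NR_CPUS && b < NR_CPUS);
    return topo[a].core == topo[b].core;
}

/*
 * Count the runqueues that result when each CPU joins the first existing
 * runqueue whose representative CPU matches it under 'same', or starts a
 * new runqueue when none matches.
 */
static unsigned int count_runqueues(bool (*same)(unsigned int, unsigned int))
{
    unsigned int rep[NR_CPUS]; /* one representative CPU per runqueue */
    unsigned int nr = 0, cpu, i;

    for ( cpu = 0; cpu < NR_CPUS; cpu++ )
    {
        for ( i = 0; i < nr; i++ )
            if ( same(rep[i], cpu) )
                break;
        if ( i == nr )
            rep[nr++] = cpu;
    }

    return nr;
}
```

On this assumed topology, count_runqueues(same_core) yields one runqueue per physical core, whereas count_runqueues(core_id_only) merges cores from different sockets into the same runqueue.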