| Message ID | 20240501151312.635565-10-tj@kernel.org (mailing list archive) |
|---|---|
| State | Not Applicable |
| Series | [01/39] cgroup: Implement cgroup_show_cftypes() |

| Context | Check | Description |
|---|---|---|
| netdev/tree_selection | success | Not a local patch, async |
On Wed, May 01, 2024 at 05:09:44AM -1000, Tejun Heo wrote:
> ->rq_{on|off}line are called either during CPU hotplug or cpuset partition
> updates. A planned BPF extensible sched_class wants to tell the BPF
> scheduler progs about CPU hotplug events in a way that's synchronized with
> rq state changes.
>
> As the BPF scheduler progs aren't necessarily affected by cpuset partition
> updates, we need a way to distinguish the two types of events. Let's add an
> argument to tell them apart.

That would be a bug. Must not be able to ignore partitions.
Hello, Peter.

On Mon, Jun 24, 2024 at 01:32:12PM +0200, Peter Zijlstra wrote:
> On Wed, May 01, 2024 at 05:09:44AM -1000, Tejun Heo wrote:
> > ->rq_{on|off}line are called either during CPU hotplug or cpuset partition
> > updates. A planned BPF extensible sched_class wants to tell the BPF
> > scheduler progs about CPU hotplug events in a way that's synchronized with
> > rq state changes.
> >
> > As the BPF scheduler progs aren't necessarily affected by cpuset partition
> > updates, we need a way to distinguish the two types of events. Let's add an
> > argument to tell them apart.
>
> That would be a bug. Must not be able to ignore partitions.

So, first of all, this implementation was brittle in assuming that CPU
hotplug events would always be called first, and it broke after recent
cpuset changes. In v7, it's replaced by hooks in sched_cpu_[de]activate(),
which has the extra benefit of allowing the BPF hotplug methods to be
sleepable.

Taking a step back to the sched domains: they don't translate well to
sched_ext schedulers, where task-to-CPU associations are often more dynamic
(e.g. multiple CPUs sharing a task queue) and load balancing operations can
be implemented quite differently from CFS. The benefits of exposing sched
domains directly to the BPF schedulers are unclear, as most of the relevant
information can already be obtained from userspace.

The cgroup support side isn't fully developed yet (e.g. cpu.weight is
available but I haven't added cpu.max yet) and plans can always change, but
I was thinking of taking a similar approach to cpu.weight for cpuset's
isolation features - i.e. give the BPF scheduler a way to access the user's
configuration and let it implement whatever it wants to implement.

Thanks.
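For illustration only, a minimal sketch of the v7 direction described above.
All names (struct scx_ops, scx_cpu_hotplug(), the op names) are assumptions,
not taken from the posted series; the point is that the notification runs
from the hotplug path rather than from ->rq_{on,off}line(), so it is
sleepable and only fires for real hotplug events:

```c
/*
 * Hypothetical sketch: forward CPU hotplug events to the loaded BPF
 * scheduler from sched_cpu_[de]activate() instead of the rq_online /
 * rq_offline class callbacks.  Runs in preemptible context, so the BPF
 * handlers may sleep.  Names below are assumed for illustration.
 */
struct scx_ops {
	void (*cpu_online)(s32 cpu);	/* assumed op names */
	void (*cpu_offline)(s32 cpu);
};

static struct scx_ops *scx_cur_ops;	/* set while a BPF scheduler is loaded */

static void scx_cpu_hotplug(unsigned int cpu, bool online)
{
	struct scx_ops *ops = READ_ONCE(scx_cur_ops);

	if (!ops)
		return;

	if (online && ops->cpu_online)
		ops->cpu_online(cpu);
	else if (!online && ops->cpu_offline)
		ops->cpu_offline(cpu);
}

/* called from sched_cpu_activate(cpu):   scx_cpu_hotplug(cpu, true);  */
/* called from sched_cpu_deactivate(cpu): scx_cpu_hotplug(cpu, false); */
```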
On Mon, Jun 24, 2024 at 11:18:06AM -1000, Tejun Heo wrote:
> Hello, Peter.
>
> On Mon, Jun 24, 2024 at 01:32:12PM +0200, Peter Zijlstra wrote:
> > On Wed, May 01, 2024 at 05:09:44AM -1000, Tejun Heo wrote:
> > > ->rq_{on|off}line are called either during CPU hotplug or cpuset partition
> > > updates. A planned BPF extensible sched_class wants to tell the BPF
> > > scheduler progs about CPU hotplug events in a way that's synchronized with
> > > rq state changes.
> > >
> > > As the BPF scheduler progs aren't necessarily affected by cpuset partition
> > > updates, we need a way to distinguish the two types of events. Let's add an
> > > argument to tell them apart.
> >
> > That would be a bug. Must not be able to ignore partitions.
>
> So, first of all, this implementation was brittle in assuming that CPU
> hotplug events would always be called first, and it broke after recent
> cpuset changes. In v7, it's replaced by hooks in sched_cpu_[de]activate(),
> which has the extra benefit of allowing the BPF hotplug methods to be
> sleepable.

Urgh, I suppose I should go stare at v7 then.

> Taking a step back to the sched domains: they don't translate well to
> sched_ext schedulers, where task-to-CPU associations are often more dynamic
> (e.g. multiple CPUs sharing a task queue) and load balancing operations can
> be implemented quite differently from CFS. The benefits of exposing sched
> domains directly to the BPF schedulers are unclear, as most of the relevant
> information can already be obtained from userspace.

Either which way around you want to turn it, you must not violate
partitions. If a bpf thing isn't capable of handling partitions, you
must refuse loading it when a partition exists and equally disallow
creation of partitions when it does load.

For partitions specifically, you only need the root_domain, not the full
sched_domain trees.

I'm aware you have these shared runqueues, but you don't *have* to do
that. Esp. so if the user explicitly requested partitions.
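As a rough illustration of the guard Peter describes (refusing a
partition-unaware BPF scheduler while partitions exist), a minimal sketch;
the helper and entry-point names are assumptions, not existing kernel API,
and a real version would need proper locking around the CPU walk:

```c
/*
 * Hypothetical sketch: detect whether more than one root_domain (i.e. a
 * cpuset partition) currently exists and refuse to enable a BPF scheduler
 * that cannot handle partitions.  scx_enable() is an assumed name; the walk
 * would need e.g. cpus_read_lock() held in a real implementation.
 */
static bool sched_has_multiple_partitions(void)
{
	struct root_domain *rd = NULL;
	int cpu;

	for_each_online_cpu(cpu) {
		struct rq *rq = cpu_rq(cpu);

		if (!rd)
			rd = rq->rd;
		else if (rq->rd != rd)
			return true;	/* CPUs belong to different partitions */
	}
	return false;
}

static int scx_enable(void)	/* assumed entry point */
{
	if (sched_has_multiple_partitions())
		return -EBUSY;	/* partitions exist: refuse to load */

	/* proceed with attaching the BPF scheduler */
	return 0;
}
```

The mirror-image check (refusing partition creation while such a scheduler
is loaded) would sit in the cpuset partition-update path.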
Hello,

On Tue, Jun 25, 2024 at 10:29:26AM +0200, Peter Zijlstra wrote:

...

> > Taking a step back to the sched domains: they don't translate well to
> > sched_ext schedulers, where task-to-CPU associations are often more dynamic
> > (e.g. multiple CPUs sharing a task queue) and load balancing operations can
> > be implemented quite differently from CFS. The benefits of exposing sched
> > domains directly to the BPF schedulers are unclear, as most of the relevant
> > information can already be obtained from userspace.
>
> Either which way around you want to turn it, you must not violate
> partitions. If a bpf thing isn't capable of handling partitions, you
> must refuse loading it when a partition exists and equally disallow
> creation of partitions when it does load.
>
> For partitions specifically, you only need the root_domain, not the full
> sched_domain trees.
>
> I'm aware you have these shared runqueues, but you don't *have* to do
> that. Esp. so if the user explicitly requested partitions.

As a quick workaround, I can just disallow / eject the BPF scheduler when
partitioning is configured. However, I think I'm still missing something and
would appreciate it if you could fill me in.

Abiding by the core scheduling configuration is critical because it has
direct user-visible and security implications, and it can be tested from
userspace - are two threads which shouldn't be on the same core on the same
core or not? So, the violation condition is pretty clear.

However, I'm not sure how partitioning is similar. My understanding is that
it works as a barrier for the load balancer: LB on this side can't look
there and LB on that side can't look here. But isn't the impact purely a
performance / isolation difference? IOW, let's say you load a BPF scheduler
which consumes the partition information but doesn't do anything differently
based on it. cpumasks are still enforced the same, and I can't think of
anything userspace would be able to test to tell whether partitioning is
working or not. So the only difference partitions make is on performance.

While it would make sense to communicate partitions to the BPF scheduler,
would it make sense to reject a BPF scheduler based on it? i.e. assuming
the feature is implemented, what would distinguish one BPF scheduler which
handles partitions specially from another which doesn't care?

Thanks.
On Tue, Jun 25, 2024 at 01:41:01PM -1000, Tejun Heo wrote:
> Hello,
>
> On Tue, Jun 25, 2024 at 10:29:26AM +0200, Peter Zijlstra wrote:
>
> ...
>
> > > Taking a step back to the sched domains: they don't translate well to
> > > sched_ext schedulers, where task-to-CPU associations are often more dynamic
> > > (e.g. multiple CPUs sharing a task queue) and load balancing operations can
> > > be implemented quite differently from CFS. The benefits of exposing sched
> > > domains directly to the BPF schedulers are unclear, as most of the relevant
> > > information can already be obtained from userspace.
> >
> > Either which way around you want to turn it, you must not violate
> > partitions. If a bpf thing isn't capable of handling partitions, you
> > must refuse loading it when a partition exists and equally disallow
> > creation of partitions when it does load.
> >
> > For partitions specifically, you only need the root_domain, not the full
> > sched_domain trees.
> >
> > I'm aware you have these shared runqueues, but you don't *have* to do
> > that. Esp. so if the user explicitly requested partitions.
>
> As a quick workaround, I can just disallow / eject the BPF scheduler when
> partitioning is configured. However, I think I'm still missing something
> and would appreciate it if you could fill me in.
>
> Abiding by the core scheduling configuration is critical because it has
> direct user-visible and security implications, and it can be tested from
> userspace - are two threads which shouldn't be on the same core on the same
> core or not? So, the violation condition is pretty clear.
>
> However, I'm not sure how partitioning is similar.

I'm not sure what you mean. It's like violating the cpumask, probably not a
big deal, but against the express wishes of the user.

> My understanding is that it works as a barrier for the load balancer: LB
> on this side can't look there and LB on that side can't look here. But
> isn't the impact purely a performance / isolation difference?

Yes. But this isolation is very important to some people.

> IOW, let's say you load a BPF scheduler which consumes the partition
> information but doesn't do anything differently based on it. cpumasks are
> still enforced the same, and I can't think of anything userspace would be
> able to test to tell whether partitioning is working or not.

So barring a few caveats it really boils down to a task staying in the
partition it's part of. If you ever see it leave, you know you've got a
problem.

Now, there's a bunch of ways to actually create partitions:

 - cpuset
 - cpuset-v2
 - isolcpus boot crap

And they're all subtly different IIRC, but the cpuset ones are the simplest
since the task is part of a cgroup, the cgroup cpumask is imposed on it, and
things should be fairly straightforward.

The isolcpus thing creates a pile of single-CPU partitions and people have
to manually set cpu affinity, and here we have some hysterical behaviour
that I would love to change but have not yet dared to -- because I know
there are people doing dodgy things, because they've been sending 'bug'
reports.

Specifically, it is possible to set a cpumask that spans multiple
partitions :-( Traditionally the behaviour was that the task would be placed
on the lowest CPU number; the current behaviour is that the task is placed
randomly on any CPU in the given mask.

It is my opinion that both behaviours are correct since, after all, we don't
violate the given constraint, the user-provided mask. If that's not what you
wanted, you should be setting something else, etc.
I've proposed rejecting a cpumask that spans partitions -- I've not yet done
this, because clearly people are doing it, however misguided. But perhaps we
should just bite the bullet and cause pain -- dunno.

Anyway, tl;dr, you can have a cpumask wider than a partition and people
still not wanting migrations to happen.

> So the only difference partitions make is on performance.

People explicitly did not want migrations there -- otherwise they would not
have gone to the trouble of setting up the partitions in the first place.

> While it would make sense to communicate partitions to the BPF scheduler,
> would it make sense to reject a BPF scheduler based on it? i.e. assuming
> the feature is implemented, what would distinguish one BPF scheduler which
> handles partitions specially from another which doesn't care?

Correctness? Anyway, can't you handle this in the kernel part? Simply never
allow a shared runqueue to cross a root_domain's mask and put some WARNs in
to ensure constraints are respected, etc. It should be fairly simple to
check that prev_cpu and new_cpu have the same root_domain, for instance.
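A minimal sketch of the check Peter suggests here; the helper name and the
fall-back behaviour are assumptions, not part of the posted series:

```c
/*
 * Hypothetical sketch: warn, and fall back to the previous CPU, if a BPF
 * scheduler tries to dispatch a task to a CPU in a different root_domain
 * (i.e. across a cpuset partition).  scx_partition_safe_cpu() is an assumed
 * name; callers would need the usual rq/RCU protection for rq->rd.
 */
static int scx_partition_safe_cpu(int prev_cpu, int new_cpu)
{
	/* same partition: the requested CPU is fine */
	if (cpu_rq(prev_cpu)->rd == cpu_rq(new_cpu)->rd)
		return new_cpu;

	/* crossing a partition boundary: warn once and stay put */
	WARN_ON_ONCE(1);
	return prev_cpu;
}
```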
Hello,

On Wed, Jun 26, 2024 at 10:23:42AM +0200, Peter Zijlstra wrote:

...

> - cpuset
> - cpuset-v2
> - isolcpus boot crap
>
> And they're all subtly different IIRC, but the cpuset ones are the simplest
> since the task is part of a cgroup, the cgroup cpumask is imposed on it,
> and things should be fairly straightforward.
>
> The isolcpus thing creates a pile of single-CPU partitions and people have
> to manually set cpu affinity, and here we have some hysterical behaviour
> that I would love to change but have not yet dared to -- because I know
> there are people doing dodgy things, because they've been sending 'bug'
> reports.
>
> Specifically, it is possible to set a cpumask that spans multiple
> partitions :-( Traditionally the behaviour was that the task would be
> placed on the lowest CPU number; the current behaviour is that the task is
> placed randomly on any CPU in the given mask.

This is what I was missing. I was only thinking of the cpuset case, and as
cpuset partitions are always reflected in the task cpumasks, there isn't a
whole lot to do.

...

> > While it would make sense to communicate partitions to the BPF scheduler,
> > would it make sense to reject a BPF scheduler based on it? i.e. assuming
> > the feature is implemented, what would distinguish one BPF scheduler which
> > handles partitions specially from another which doesn't care?
>
> Correctness? Anyway, can't you handle this in the kernel part? Simply never
> allow a shared runqueue to cross a root_domain's mask and put some WARNs in
> to ensure constraints are respected, etc. It should be fairly simple to
> check that prev_cpu and new_cpu have the same root_domain, for instance.

Yeah, I'll plug it. It may end up being just rejecting / ejecting the BPF
scheduler when such conditions are detected. The BPF scheduler doesn't have
to use the built-in DSQs and can decide to dispatch to any CPU from its BPF
queues (however those may be implemented; they can also be in userspace), so
it's a bit tricky to enforce correctness dynamically after the fact. I'll
think more on it.

Thanks.
```diff
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e48af9fbbd71..90b505fbb488 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9593,7 +9593,7 @@ static inline void balance_hotplug_wait(void)
 
 #endif /* CONFIG_HOTPLUG_CPU */
 
-void set_rq_online(struct rq *rq)
+void set_rq_online(struct rq *rq, enum rq_onoff_reason reason)
 {
 	if (!rq->online) {
 		const struct sched_class *class;
@@ -9603,12 +9603,12 @@ void set_rq_online(struct rq *rq)
 
 		for_each_class(class) {
 			if (class->rq_online)
-				class->rq_online(rq);
+				class->rq_online(rq, reason);
 		}
 	}
 }
 
-void set_rq_offline(struct rq *rq)
+void set_rq_offline(struct rq *rq, enum rq_onoff_reason reason)
 {
 	if (rq->online) {
 		const struct sched_class *class;
@@ -9616,7 +9616,7 @@ void set_rq_offline(struct rq *rq)
 		update_rq_clock(rq);
 		for_each_class(class) {
 			if (class->rq_offline)
-				class->rq_offline(rq);
+				class->rq_offline(rq, reason);
 		}
 
 		cpumask_clear_cpu(rq->cpu, rq->rd->online);
@@ -9712,7 +9712,7 @@ int sched_cpu_activate(unsigned int cpu)
 	rq_lock_irqsave(rq, &rf);
 	if (rq->rd) {
 		BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
-		set_rq_online(rq);
+		set_rq_online(rq, RQ_ONOFF_HOTPLUG);
 	}
 	rq_unlock_irqrestore(rq, &rf);
 
@@ -9756,7 +9756,7 @@ int sched_cpu_deactivate(unsigned int cpu)
 	rq_lock_irqsave(rq, &rf);
 	if (rq->rd) {
 		BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
-		set_rq_offline(rq);
+		set_rq_offline(rq, RQ_ONOFF_HOTPLUG);
 	}
 	rq_unlock_irqrestore(rq, &rf);
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a04a436af8cc..010d1dc5f918 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2607,7 +2607,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 }
 
 /* Assumes rq->lock is held */
-static void rq_online_dl(struct rq *rq)
+static void rq_online_dl(struct rq *rq, enum rq_onoff_reason reason)
 {
 	if (rq->dl.overloaded)
 		dl_set_overload(rq);
@@ -2618,7 +2618,7 @@ static void rq_online_dl(struct rq *rq)
 }
 
 /* Assumes rq->lock is held */
-static void rq_offline_dl(struct rq *rq)
+static void rq_offline_dl(struct rq *rq, enum rq_onoff_reason reason)
 {
 	if (rq->dl.overloaded)
 		dl_clear_overload(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d7cffee1a4e..8032256d3972 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12446,14 +12446,14 @@ void trigger_load_balance(struct rq *rq)
 	nohz_balancer_kick(rq);
 }
 
-static void rq_online_fair(struct rq *rq)
+static void rq_online_fair(struct rq *rq, enum rq_onoff_reason reason)
 {
 	update_sysctl();
 
 	update_runtime_enabled(rq);
 }
 
-static void rq_offline_fair(struct rq *rq)
+static void rq_offline_fair(struct rq *rq, enum rq_onoff_reason reason)
 {
 	update_sysctl();
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3261b067b67e..8620474d117d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2426,7 +2426,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 }
 
 /* Assumes rq->lock is held */
-static void rq_online_rt(struct rq *rq)
+static void rq_online_rt(struct rq *rq, enum rq_onoff_reason reason)
 {
 	if (rq->rt.overloaded)
 		rt_set_overload(rq);
@@ -2437,7 +2437,7 @@ static void rq_online_rt(struct rq *rq)
 }
 
 /* Assumes rq->lock is held */
-static void rq_offline_rt(struct rq *rq)
+static void rq_offline_rt(struct rq *rq, enum rq_onoff_reason reason)
 {
 	if (rq->rt.overloaded)
 		rt_clear_overload(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0b6a34ba2457..bcc8056acadb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2271,6 +2271,11 @@ extern const u32 sched_prio_to_wmult[40];
 
 #define RETRY_TASK		((void *)-1UL)
 
+enum rq_onoff_reason {
+	RQ_ONOFF_HOTPLUG,		/* CPU is going on/offline */
+	RQ_ONOFF_TOPOLOGY,		/* sched domain topology update */
+};
+
 struct affinity_context {
 	const struct cpumask *new_mask;
 	struct cpumask *user_mask;
@@ -2309,8 +2314,8 @@ struct sched_class {
 	void (*set_cpus_allowed)(struct task_struct *p,
 				 struct affinity_context *ctx);
 
-	void (*rq_online)(struct rq *rq);
-	void (*rq_offline)(struct rq *rq);
+	void (*rq_online)(struct rq *rq, enum rq_onoff_reason reason);
+	void (*rq_offline)(struct rq *rq, enum rq_onoff_reason reason);
 
 	struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq);
 #endif
@@ -2853,8 +2858,8 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
 	raw_spin_rq_unlock(rq1);
 }
 
-extern void set_rq_online (struct rq *rq);
-extern void set_rq_offline(struct rq *rq);
+extern void set_rq_online (struct rq *rq, enum rq_onoff_reason reason);
+extern void set_rq_offline(struct rq *rq, enum rq_onoff_reason reason);
 extern bool sched_smp_initialized;
 
 #else /* CONFIG_SMP */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 99ea5986038c..12501543c56d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -497,7 +497,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	old_rd = rq->rd;
 
 	if (cpumask_test_cpu(rq->cpu, old_rd->online))
-		set_rq_offline(rq);
+		set_rq_offline(rq, RQ_ONOFF_TOPOLOGY);
 
 	cpumask_clear_cpu(rq->cpu, old_rd->span);
 
@@ -515,7 +515,7 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 
 	cpumask_set_cpu(rq->cpu, rd->span);
 	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
-		set_rq_online(rq);
+		set_rq_online(rq, RQ_ONOFF_TOPOLOGY);
 
 	rq_unlock_irqrestore(rq, &rf);
 
```
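For context, a sketch of the kind of consumer this new argument was meant to
enable: a hypothetical sched_ext ->rq_online() callback that reacts only to
genuine hotplug events. scx_notify_cpu_online() is an assumed helper, and,
as the thread above notes, v7 dropped this approach in favour of hooks in
sched_cpu_[de]activate():

```c
/*
 * Hypothetical sketch, not part of the posted patch: a sched_ext
 * ->rq_online() implementation that uses the reason argument to skip
 * cpuset/topology rebuilds and only forward real hotplug events to the
 * BPF scheduler.  scx_notify_cpu_online() is an assumed helper name.
 */
static void rq_online_scx(struct rq *rq, enum rq_onoff_reason reason)
{
	if (reason != RQ_ONOFF_HOTPLUG)
		return;		/* RQ_ONOFF_TOPOLOGY: partition/domain rebuild */

	/* tell the BPF scheduler that this CPU is now usable */
	scx_notify_cpu_online(cpu_of(rq));
}
```

This selective handling is precisely what Peter objects to in the thread: a
scheduler must not be able to ignore partition updates.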