Message ID | Z2tNK2oFDX1OPp8C@slm.duckdns.org (mailing list archive) |
---|---|
State | Not Applicable |
Series | [sched_ext/for-6.13-fixes] sched_ext: Fix dsq_local_on selftest |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
On Tue, Dec 24, 2024 at 02:09:15PM -1000, Tejun Heo wrote:
> The dsp_local_on selftest expects the scheduler to fail by trying to
> schedule an e.g. CPU-affine task to the wrong CPU. However, this isn't
> guaranteed to happen in the 1 second window that the test is running.
> Besides, it's odd to have this particular exception path tested when there
> are no other tests that verify that the interface is working at all - e.g.
> the test would pass if dsp_local_on interface is completely broken and fails
> on any attempt.
>
> Flip the test so that it verifies that the feature works. While at it, fix a
> typo in the info message.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Ihor Solodrai <ihor.solodrai@pm.me>
> Link: http://lkml.kernel.org/r/Z1n9v7Z6iNJ-wKmq@slm.duckdns.org

Applied to sched_ext/for-6.13-fixes.

Thanks.
On Tuesday, December 24th, 2024 at 4:09 PM, Tejun Heo <tj@kernel.org> wrote:

> The dsp_local_on selftest expects the scheduler to fail by trying to
> schedule an e.g. CPU-affine task to the wrong CPU. However, this isn't
> guaranteed to happen in the 1 second window that the test is running.
> Besides, it's odd to have this particular exception path tested when there
> are no other tests that verify that the interface is working at all - e.g.
> the test would pass if dsp_local_on interface is completely broken and fails
> on any attempt.
>
> Flip the test so that it verifies that the feature works. While at it, fix a
> typo in the info message.
>
> Signed-off-by: Tejun Heo tj@kernel.org
>
> Reported-by: Ihor Solodrai ihor.solodrai@pm.me
>
> Link: http://lkml.kernel.org/r/Z1n9v7Z6iNJ-wKmq@slm.duckdns.org
> ---
> tools/testing/selftests/sched_ext/dsp_local_on.bpf.c | 5 ++++-
> tools/testing/selftests/sched_ext/dsp_local_on.c | 5 +++--
> 2 files changed, 7 insertions(+), 3 deletions(-)

Hi Tejun.

I've tried running sched_ext selftests on BPF CI today, applying a set
of patches from sched_ext/for-6.13-fixes, including this one. You can
see the list of patches I added here:
https://github.com/kernel-patches/vmtest/pull/332/files

With that, dsq_local_on has failed on x86_64 (llvm-18), although it
passed with other configurations:
https://github.com/kernel-patches/vmtest/actions/runs/12798804552/job/35683769806

Here is a piece of log that appears to be relevant:

2025-01-15T23:28:55.8238375Z [ 5.334631] sched_ext: BPF scheduler "dsp_local_on" disabled (runtime error)
2025-01-15T23:28:55.8243034Z [ 5.335420] sched_ext: dsp_local_on: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for kworker/u8:1[33]
2025-01-15T23:28:55.8246187Z [ 5.336139] dispatch_to_local_dsq+0x13e/0x1f0
2025-01-15T23:28:55.8249296Z [ 5.336474] flush_dispatch_buf+0x13d/0x170
2025-01-15T23:28:55.8252083Z [ 5.336793] balance_scx+0x225/0x3e0
2025-01-15T23:28:55.8254695Z [ 5.337065] __schedule+0x406/0xe80
2025-01-15T23:28:55.8257121Z [ 5.337330] schedule+0x41/0xb0
2025-01-15T23:28:55.8260146Z [ 5.337574] schedule_timeout+0xe5/0x160
2025-01-15T23:28:55.8263080Z [ 5.337875] rcu_tasks_kthread+0xb1/0xc0
2025-01-15T23:28:55.8265477Z [ 5.338169] kthread+0xfa/0x120
2025-01-15T23:28:55.8268202Z [ 5.338410] ret_from_fork+0x37/0x50
2025-01-15T23:28:55.8271272Z [ 5.338690] ret_from_fork_asm+0x1a/0x30
2025-01-15T23:28:56.7349562Z ERR: dsp_local_on.c:39
2025-01-15T23:28:56.7350182Z Expected skel->data->uei.kind == EXIT_KIND(SCX_EXIT_UNREG) (1024 == 64)

Could you please take a look?

Thank you.

> [...]
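For readers decoding the last line of that log: the flipped test expects the scheduler to exit with the "unregistered" kind, but it observed the "runtime error" kind instead. The following is a minimal userspace sketch of that comparison, not the selftest code itself. The numeric values 64 and 1024 are taken from the `(1024 == 64)` output above; the enum and helper names are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/*
 * Illustrative model only. The two values are inferred from the
 * "(1024 == 64)" line in the CI log; the names are hypothetical.
 */
enum model_exit_kind {
	MODEL_EXIT_UNREG = 64,   /* scheduler was unregistered normally */
	MODEL_EXIT_ERROR = 1024, /* scheduler was disabled by a runtime error */
};

/* Returns 1 when the flipped test's expectation holds, 0 otherwise. */
static int model_flipped_test_passes(uint64_t observed_kind)
{
	return observed_kind == MODEL_EXIT_UNREG;
}

/* Maps an observed exit kind to a short description for log reading. */
static const char *model_describe_kind(uint64_t kind)
{
	switch (kind) {
	case MODEL_EXIT_UNREG:
		return "unregistered";
	case MODEL_EXIT_ERROR:
		return "runtime error";
	default:
		return "other";
	}
}
```

Under this model, `model_flipped_test_passes(1024)` is false, which matches the `ERR: dsp_local_on.c:39` report: the scheduler was ejected before the test could tear it down normally.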
Hello, sorry about the delay.

On Wed, Jan 15, 2025 at 11:50:37PM +0000, Ihor Solodrai wrote:
...
> 2025-01-15T23:28:55.8238375Z [ 5.334631] sched_ext: BPF scheduler "dsp_local_on" disabled (runtime error)
> 2025-01-15T23:28:55.8243034Z [ 5.335420] sched_ext: dsp_local_on: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for kworker/u8:1[33]

That's a head scratcher. It's a single node 2 cpu instance and all unbound
kworkers should be allowed on all CPUs. I'll update the test to test the
actual cpumask but can you see whether this failure is consistent or flaky?

Thanks.
On Tuesday, January 21st, 2025 at 5:40 PM, Tejun Heo <tj@kernel.org> wrote:

> Hello, sorry about the delay.
>
> On Wed, Jan 15, 2025 at 11:50:37PM +0000, Ihor Solodrai wrote:
> ...
>
> > 2025-01-15T23:28:55.8238375Z [ 5.334631] sched_ext: BPF scheduler "dsp_local_on" disabled (runtime error)
> > 2025-01-15T23:28:55.8243034Z [ 5.335420] sched_ext: dsp_local_on: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for kworker/u8:1[33]
>
> That's a head scratcher. It's a single node 2 cpu instance and all unbound
> kworkers should be allowed on all CPUs. I'll update the test to test the
> actual cpumask but can you see whether this failure is consistent or flaky?

I re-ran all the jobs, and all sched_ext jobs have failed (3/3).
Previous time only 1 of 3 runs failed.

https://github.com/kernel-patches/vmtest/actions/runs/12798804552/job/36016405680

> Thanks.
>
> --
> tejun
On Wed, Jan 22, 2025 at 07:10:00PM +0000, Ihor Solodrai wrote:
>
> On Tuesday, January 21st, 2025 at 5:40 PM, Tejun Heo <tj@kernel.org> wrote:
>
> > Hello, sorry about the delay.
> >
> > On Wed, Jan 15, 2025 at 11:50:37PM +0000, Ihor Solodrai wrote:
> > ...
> >
> > > 2025-01-15T23:28:55.8238375Z [ 5.334631] sched_ext: BPF scheduler "dsp_local_on" disabled (runtime error)
> > > 2025-01-15T23:28:55.8243034Z [ 5.335420] sched_ext: dsp_local_on: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for kworker/u8:1[33]
> >
> > That's a head scratcher. It's a single node 2 cpu instance and all unbound
> > kworkers should be allowed on all CPUs. I'll update the test to test the
> > actual cpumask but can you see whether this failure is consistent or flaky?
>
> I re-ran all the jobs, and all sched_ext jobs have failed (3/3).
> Previous time only 1 of 3 runs failed.
>
> https://github.com/kernel-patches/vmtest/actions/runs/12798804552/job/36016405680

Oh I see what happens, SCX_DSQ_LOCAL_ON is (incorrectly) resolved to 0.

More exactly, none of the enum values are being resolved correctly, likely
due to the CO-RE enum refactoring. There’s probably something broken in
tools/testing/selftests/sched_ext/Makefile, I’ll take a look.

Thanks,
-Andrea
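To illustrate why an enum that resolves to 0 breaks the dispatch, here is a small userspace sketch of how a DSQ id is composed with `SCX_DSQ_LOCAL_ON | target`. It is a model, not kernel code: the `MODEL_` constants are an assumption about the built-in DSQ flag layout (two high flag bits plus a CPU number in the low 32 bits) and should be checked against the kernel headers; the helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Assumed layout, for illustration only: a "built-in" flag in bit 63,
 * a "local on" flag in bit 62, and the CPU number in the low 32 bits.
 */
#define MODEL_DSQ_FLAG_BUILTIN   (1ULL << 63)
#define MODEL_DSQ_FLAG_LOCAL_ON  (1ULL << 62)
#define MODEL_DSQ_LOCAL_ON       (MODEL_DSQ_FLAG_BUILTIN | MODEL_DSQ_FLAG_LOCAL_ON)
#define MODEL_DSQ_LOCAL_CPU_MASK 0xffffffffULL

/* Composes a DSQ id the same way the selftest does: flag value OR cpu. */
static uint64_t model_make_dsq_id(uint64_t local_on_value, uint32_t cpu)
{
	return local_on_value | cpu;
}

/* Returns 1 if the id carries both flag bits of the "local on" encoding. */
static int model_is_local_on(uint64_t dsq_id)
{
	return (dsq_id & MODEL_DSQ_LOCAL_ON) == MODEL_DSQ_LOCAL_ON;
}

/* Extracts the CPU number from a "local on" DSQ id. */
static uint32_t model_local_on_cpu(uint64_t dsq_id)
{
	return (uint32_t)(dsq_id & MODEL_DSQ_LOCAL_CPU_MASK);
}
```

With the flag value correctly resolved, `model_make_dsq_id(MODEL_DSQ_LOCAL_ON, 1)` is recognised as "the local DSQ of CPU 1". With the flag value resolved to 0, the same call yields the plain id 1, which no longer names a local DSQ at all, so every dispatch in the test targets something other than what was intended.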
On Thu, Jan 23, 2025 at 10:40:52AM +0100, Andrea Righi wrote:
> On Wed, Jan 22, 2025 at 07:10:00PM +0000, Ihor Solodrai wrote:
> >
> > On Tuesday, January 21st, 2025 at 5:40 PM, Tejun Heo <tj@kernel.org> wrote:
> >
> > > Hello, sorry about the delay.
> > >
> > > On Wed, Jan 15, 2025 at 11:50:37PM +0000, Ihor Solodrai wrote:
> > > ...
> > >
> > > > 2025-01-15T23:28:55.8238375Z [ 5.334631] sched_ext: BPF scheduler "dsp_local_on" disabled (runtime error)
> > > > 2025-01-15T23:28:55.8243034Z [ 5.335420] sched_ext: dsp_local_on: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for kworker/u8:1[33]
> > >
> > > That's a head scratcher. It's a single node 2 cpu instance and all unbound
> > > kworkers should be allowed on all CPUs. I'll update the test to test the
> > > actual cpumask but can you see whether this failure is consistent or flaky?
> >
> > I re-ran all the jobs, and all sched_ext jobs have failed (3/3).
> > Previous time only 1 of 3 runs failed.
> >
> > https://github.com/kernel-patches/vmtest/actions/runs/12798804552/job/36016405680
>
> Oh I see what happens, SCX_DSQ_LOCAL_ON is (incorrectly) resolved to 0.
>
> More exactly, none of the enum values are being resolved correctly, likely
> due to the CO-RE enum refactoring. There’s probably something broken in
> tools/testing/selftests/sched_ext/Makefile, I’ll take a look.

Yeah, we need to add SCX_ENUM_INIT() to each test. Will do that once the
pending pull request is merged. The original report is a separate issue tho.
I'm still a bit baffled by it.

Thanks.
On Thu, Jan 23, 2025 at 06:57:58AM -1000, Tejun Heo wrote:
> On Thu, Jan 23, 2025 at 10:40:52AM +0100, Andrea Righi wrote:
> > On Wed, Jan 22, 2025 at 07:10:00PM +0000, Ihor Solodrai wrote:
> > >
> > > On Tuesday, January 21st, 2025 at 5:40 PM, Tejun Heo <tj@kernel.org> wrote:
> > >
> > > > Hello, sorry about the delay.
> > > >
> > > > On Wed, Jan 15, 2025 at 11:50:37PM +0000, Ihor Solodrai wrote:
> > > > ...
> > > >
> > > > > 2025-01-15T23:28:55.8238375Z [ 5.334631] sched_ext: BPF scheduler "dsp_local_on" disabled (runtime error)
> > > > > 2025-01-15T23:28:55.8243034Z [ 5.335420] sched_ext: dsp_local_on: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for kworker/u8:1[33]
> > > >
> > > > That's a head scratcher. It's a single node 2 cpu instance and all unbound
> > > > kworkers should be allowed on all CPUs. I'll update the test to test the
> > > > actual cpumask but can you see whether this failure is consistent or flaky?
> > >
> > > I re-ran all the jobs, and all sched_ext jobs have failed (3/3).
> > > Previous time only 1 of 3 runs failed.
> > >
> > > https://github.com/kernel-patches/vmtest/actions/runs/12798804552/job/36016405680
> >
> > Oh I see what happens, SCX_DSQ_LOCAL_ON is (incorrectly) resolved to 0.
> >
> > More exactly, none of the enum values are being resolved correctly, likely
> > due to the CO-RE enum refactoring. There’s probably something broken in
> > tools/testing/selftests/sched_ext/Makefile, I’ll take a look.
>
> Yeah, we need to add SCX_ENUM_INIT() to each test. Will do that once the
> pending pull request is merged. The original report is a separate issue tho.
> I'm still a bit baffled by it.

For the enum part: https://lore.kernel.org/all/20250123124606.242115-1-arighi@nvidia.com/

And yeah, I missed that the original bug report was about the unbound
kworker not allowed to be dispatched on cpu 1. Weird... I'm wondering if we
need to do the cpumask_cnt / scx_bpf_dispatch_cancel() game, like we did with
scx_rustland to handle concurrent affinity changes, but in this case the
kworker shouldn't have its affinity changed...

-Andrea
On Thu, Jan 23, 2025 at 07:45:08PM +0100, Andrea Righi wrote:
> On Thu, Jan 23, 2025 at 06:57:58AM -1000, Tejun Heo wrote:
> > On Thu, Jan 23, 2025 at 10:40:52AM +0100, Andrea Righi wrote:
> > > On Wed, Jan 22, 2025 at 07:10:00PM +0000, Ihor Solodrai wrote:
> > > >
> > > > On Tuesday, January 21st, 2025 at 5:40 PM, Tejun Heo <tj@kernel.org> wrote:
> > > >
> > > > > Hello, sorry about the delay.
> > > > >
> > > > > On Wed, Jan 15, 2025 at 11:50:37PM +0000, Ihor Solodrai wrote:
> > > > > ...
> > > > >
> > > > > > 2025-01-15T23:28:55.8238375Z [ 5.334631] sched_ext: BPF scheduler "dsp_local_on" disabled (runtime error)
> > > > > > 2025-01-15T23:28:55.8243034Z [ 5.335420] sched_ext: dsp_local_on: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for kworker/u8:1[33]
> > > > >
> > > > > That's a head scratcher. It's a single node 2 cpu instance and all unbound
> > > > > kworkers should be allowed on all CPUs. I'll update the test to test the
> > > > > actual cpumask but can you see whether this failure is consistent or flaky?
> > > >
> > > > I re-ran all the jobs, and all sched_ext jobs have failed (3/3).
> > > > Previous time only 1 of 3 runs failed.
> > > >
> > > > https://github.com/kernel-patches/vmtest/actions/runs/12798804552/job/36016405680
> > >
> > > Oh I see what happens, SCX_DSQ_LOCAL_ON is (incorrectly) resolved to 0.
> > >
> > > More exactly, none of the enum values are being resolved correctly, likely
> > > due to the CO-RE enum refactoring. There’s probably something broken in
> > > tools/testing/selftests/sched_ext/Makefile, I’ll take a look.
> >
> > Yeah, we need to add SCX_ENUM_INIT() to each test. Will do that once the
> > pending pull request is merged. The original report is a separate issue tho.
> > I'm still a bit baffled by it.
>
> For the enum part: https://lore.kernel.org/all/20250123124606.242115-1-arighi@nvidia.com/
>
> And yeah, I missed that the original bug report was about the unbound
> kworker not allowed to be dispatched on cpu 1. Weird... I'm wondering if we
> need to do the cpumask_cnt / scx_bpf_dispatch_cancel() game, like we did with
> scx_rustland to handle concurrent affinity changes, but in this case the
> kworker shouldn't have its affinity changed...

Thinking more about this, scx_bpf_task_cpu(p) returns the last known CPU
where the task p was running, but it doesn't necessarily give a CPU where
the task can run at any time. In general it's probably a safer choice to
rely on p->cpus_ptr, maybe doing bpf_cpumask_any_distribute(p->cpus_ptr)
for this test case.

However, I still don't see why the unbound kworker couldn't be dispatched
on cpu 1 in this particular case...

-Andrea
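The idea in the last message, choosing the target from the task's allowed CPUs rather than from the CPU it last ran on, can be sketched in plain userspace C. This is only a model of the selection logic, not BPF code: the allowed set is represented as a 64-bit mask instead of `p->cpus_ptr`, the random value is passed in by the caller, and `model_pick_allowed_cpu()` is a hypothetical helper. In an actual BPF scheduler the suggestion above is to call `bpf_cpumask_any_distribute(p->cpus_ptr)` instead.

```c
#include <stdint.h>

/*
 * Picks one CPU out of the set bits in 'allowed', considering only
 * CPUs 0..nr_cpus-1 (at most 64). 'rnd' selects which allowed CPU is
 * returned. Returns -1 if no CPU in range is allowed.
 */
static int model_pick_allowed_cpu(uint64_t allowed, int nr_cpus, uint32_t rnd)
{
	int cpu, count = 0, nth;

	/* First pass: count how many CPUs the task may run on. */
	for (cpu = 0; cpu < nr_cpus && cpu < 64; cpu++)
		if (allowed & (1ULL << cpu))
			count++;

	if (count == 0)
		return -1;

	/* Second pass: return the nth allowed CPU. */
	nth = (int)(rnd % (uint32_t)count);
	for (cpu = 0; cpu < nr_cpus && cpu < 64; cpu++) {
		if (!(allowed & (1ULL << cpu)))
			continue;
		if (nth-- == 0)
			return cpu;
	}

	return -1;
}
```

Whatever value `rnd` takes, the result is always a CPU whose bit is set in `allowed`, which is the property the dispatch path checks. A task restricted to CPU 0 on a 2-CPU machine can therefore never be sent to CPU 1 by this selection.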
diff --git a/tools/testing/selftests/sched_ext/dsp_local_on.bpf.c b/tools/testing/selftests/sched_ext/dsp_local_on.bpf.c
index 6325bf76f47e..fbda6bf54671 100644
--- a/tools/testing/selftests/sched_ext/dsp_local_on.bpf.c
+++ b/tools/testing/selftests/sched_ext/dsp_local_on.bpf.c
@@ -43,7 +43,10 @@ void BPF_STRUCT_OPS(dsp_local_on_dispatch, s32 cpu, struct task_struct *prev)
 	if (!p)
 		return;
 
-	target = bpf_get_prandom_u32() % nr_cpus;
+	if (p->nr_cpus_allowed == nr_cpus)
+		target = bpf_get_prandom_u32() % nr_cpus;
+	else
+		target = scx_bpf_task_cpu(p);
 
 	scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | target, SCX_SLICE_DFL, 0);
 	bpf_task_release(p);
diff --git a/tools/testing/selftests/sched_ext/dsp_local_on.c b/tools/testing/selftests/sched_ext/dsp_local_on.c
index 472851b56854..0ff27e57fe43 100644
--- a/tools/testing/selftests/sched_ext/dsp_local_on.c
+++ b/tools/testing/selftests/sched_ext/dsp_local_on.c
@@ -34,9 +34,10 @@ static enum scx_test_status run(void *ctx)
 	/* Just sleeping is fine, plenty of scheduling events happening */
 	sleep(1);
 
-	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
 	bpf_link__destroy(link);
 
+	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG));
+
 	return SCX_TEST_PASS;
 }
 
@@ -50,7 +51,7 @@ static void cleanup(void *ctx)
 struct scx_test dsp_local_on = {
 	.name = "dsp_local_on",
 	.description = "Verify we can directly dispatch tasks to a local DSQs "
-		       "from osp.dispatch()",
+		       "from ops.dispatch()",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
The dsp_local_on selftest expects the scheduler to fail by trying to
schedule an e.g. CPU-affine task to the wrong CPU. However, this isn't
guaranteed to happen in the 1 second window that the test is running.
Besides, it's odd to have this particular exception path tested when there
are no other tests that verify that the interface is working at all - e.g.
the test would pass if dsp_local_on interface is completely broken and fails
on any attempt.

Flip the test so that it verifies that the feature works. While at it, fix a
typo in the info message.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ihor Solodrai <ihor.solodrai@pm.me>
Link: http://lkml.kernel.org/r/Z1n9v7Z6iNJ-wKmq@slm.duckdns.org
---
 tools/testing/selftests/sched_ext/dsp_local_on.bpf.c | 5 ++++-
 tools/testing/selftests/sched_ext/dsp_local_on.c | 5 +++--
 2 files changed, 7 insertions(+), 3 deletions(-)