From patchwork Fri Jan 31 22:58:37 2025

Message-ID: <20250131225837.972218232@goodmis.org>
Date: Fri, 31 Jan 2025 17:58:37 -0500
From: Steven Rostedt
To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Cc: Thomas Gleixner, Peter Zijlstra, Ankur Arora, Linus Torvalds,
    linux-mm@kvack.org, x86@kernel.org, akpm@linux-foundation.org,
    luto@kernel.org, bp@alien8.de, dave.hansen@linux.intel.com,
    hpa@zytor.com, juri.lelli@redhat.com, vincent.guittot@linaro.org,
    willy@infradead.org, mgorman@suse.de, jon.grimm@amd.com, bharata@amd.com,
    raghavendra.kt@amd.com, boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
    jgross@suse.com, andrew.cooper3@citrix.com, Joel Fernandes,
    Vineeth Pillai, Suleiman Souhlal, Ingo Molnar, Mathieu Desnoyers,
    Clark Williams, bigeasy@linutronix.de, daniel.wagner@suse.com,
    joseph.salisbury@oracle.com, broonie@gmail.com
Subject: [RFC][PATCH 0/2] sched: Extended Scheduler Time Slice revisited

Extended scheduler time slice

Wow, it's been over a year since I posted my original POC of this patch
series[1]. But that was when PREEMPT_LAZY (NEED_RESCHED_LAZY) was just being
proposed and this patch set depended on it. Now that NEED_RESCHED_LAZY is part
of mainline, it's time to revisit this proposal.

Quick recap: PREEMPT_LAZY can be used to dynamically set the preemption model
of the kernel. To emulate the old server model, when the timer tick goes off
and wants to schedule out the currently running SCHED_OTHER task, it sets
NEED_RESCHED_LAZY. Instead of scheduling immediately, the kernel schedules
only when the current task is exiting to user space. If the task runs in the
kernel for over a tick, NEED_RESCHED is set, which forces the task to schedule
as soon as it is out of any kernel critical section.
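To make that recap a bit more concrete, here is a conceptual sketch of the
lazy preemption logic. This is not the actual kernel code: the task structure,
flags and helpers below are invented purely to mirror the description above.

    #include <stdbool.h>

    /* Invented stand-in for the scheduler's view of a task. */
    struct task {
        bool sched_other;
        bool need_resched;
        bool need_resched_lazy;
    };

    /* Hypothetical helpers, declared only to keep the sketch readable. */
    bool slice_expired(struct task *t);
    void schedule(void);

    /* Called from the timer tick. */
    void tick_sketch(struct task *t)
    {
        if (t->sched_other && slice_expired(t)) {
            if (t->need_resched_lazy)
                /* A full tick has passed and it is still running: force it. */
                t->need_resched = true;
            else
                /* Just ask: reschedule on the next return to user space. */
                t->need_resched_lazy = true;
        }
    }

    /* Called when the task exits the kernel back to user space. */
    void exit_to_user_sketch(struct task *t)
    {
        if (t->need_resched || t->need_resched_lazy)
            schedule();
    }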
I wanted to give this feature to user space as well.

This gives user space a way to inform the kernel that it too is in a critical
section (perhaps implementing user space spin locks), so that if it happens to
run out of its quota and the scheduler wants to schedule it out for another
SCHED_OTHER task, it can get a little more time to release its locks.

The patches use rseq to map a new "cr_counter" field to pass this information
to the kernel. Bit 0 is reserved for the kernel, and the other 31 bits tell
the kernel that the task is in a critical section. If any of those 31 bits is
set, the kernel will try to extend the task's time slice if it deems fit to do
so (not guaranteed, of course). The 31 bits allow the task to implement a
counter.

Note that this rseq memory is per thread, so it does not need to worry about
racing with other threads. But it does need to worry about racing with the
kernel, so incrementing or decrementing this value should be done with a
single, locally atomic instruction (a simple addl or subl on x86, not a
separate read/modify/write sequence).

The counter works like this: user space adds 2 to cr_counter (skipping over
bit 0) when it enters a critical section and subtracts 2 when it leaves. If
the counter is then 1, it knows that the kernel extended its time slice and it
should immediately schedule by making a system call (yield() is the obvious
choice, but any system call will work). If it does not, the kernel will force
a schedule on the next scheduler tick. A sketch of the user-space side of this
protocol is shown below.
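Here is a minimal user-space sketch of that protocol. It is illustrative only:
how the pointer into this thread's registered struct rseq is obtained is not
shown, and the extend_start()/extend_end() names are made up; cr_counter
refers to the new field described above.

    #include <sched.h>

    /* Points at this thread's rseq cr_counter field (registration not shown). */
    static volatile unsigned int *cr_counter;

    static inline void extend_start(void)
    {
        /*
         * Bit 0 belongs to the kernel, so critical sections are counted
         * in units of 2.  This should end up as a single add instruction
         * (addl on x86), not a separate load/modify/store, to avoid
         * racing with the kernel.
         */
        *cr_counter += 2;
    }

    static inline void extend_end(void)
    {
        *cr_counter -= 2;

        /*
         * If only bit 0 is left set, the kernel granted an extension:
         * pay it back immediately with a system call.
         */
        if (*cr_counter == 1)
            sched_yield();
    }

Counting in steps of 2 is what lets the low bit double as the kernel's
"extension granted" flag while the upper bits still form a nesting counter.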
The first patch implements this and gives the task one more tick, just as the
kernel does, before it forces a schedule. But this means user space can ask
for anywhere from a full millisecond up to 10 milliseconds, depending on
CONFIG_HZ.

The second patch is based on Peter Zijlstra's patch[2], which injects a 50us
scheduler tick when the task's slice is extended. This way the most a task
will get is 50us extra, which hopefully does not hurt other tasks' latency.
Note that, because of the way EEVDF works, the task gets penalized by losing
out on eligibility, and even though it gets a little more time now, it may be
scheduled less often later.

I removed the POC tag as I no longer consider this a simple proof of concept.
It is still an RFC, though, and the patches contain trace_printk() calls to
verify that it is indeed working, as well as to analyze the results of my
tests. Those trace_printk()s will be removed if this gets accepted.

I wrote a program[3] to micro-benchmark this. The program creates one thread
per CPU; each of these threads grabs a user space spin lock, runs in a loop
for around 30us, and releases the spin lock. It then sleeps for
"100 + cpu * 27" microseconds before waking up and trying again, which
staggers the different threads. Each of these threads is pinned to its
corresponding CPU. (A simplified sketch of this loop follows the parameter
list below.)

Five more threads are created on each CPU that do the following:

    while (!data->done) {
        for (i = 0; i < 100; i++)
            wmb();
        do_sleep(10);
        rmb();
    }

where do_sleep(10) sleeps for 10us. This causes a lot of scheduling of
SCHED_OTHER tasks.

The program keeps track of:

 - The total number of loops iterated by the threads that take the lock
 - The total time it waited to get a lock (and the average wait time)
 - The number of times the lock was contended
 - The max time it waited to get a lock
 - The max time it held the lock (and the average hold time)
 - The number of times its time slice was extended

It takes two parameters:

 -d  disable using rseq to tell the kernel to extend the time slice
 -w  keep the extension set even while waiting for the lock

Note that -w is meaningless with -d, and is really only included as an
academic exercise, as waiting for a critical section isn't necessarily a
critical section itself.

It runs for 5 seconds and then stops, so all numbers are for a 5 second
duration.
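As referenced above, here is a rough, simplified sketch of one lock-taking
thread, reusing the hypothetical extend_start()/extend_end() helpers from the
earlier sketch. The real program is at [3] and differs in detail; the lock and
timing helpers here are stand-ins.

    #include <stddef.h>

    /* Hypothetical helpers standing in for the benchmark's own code. */
    void grab_lock(void *lock);
    void release_lock(void *lock);
    void busy_loop_us(int us);
    void sleep_us(int us);
    void extend_start(void);
    void extend_end(void);

    struct thread_data {
        int           cpu;
        volatile int  done;
        void         *lock;
    };

    static void *lock_thread(void *arg)
    {
        struct thread_data *data = arg;

        while (!data->done) {
            grab_lock(data->lock);      /* user space spin lock */
            extend_start();             /* mark the critical section */
            busy_loop_us(30);           /* hold the lock for ~30us */
            release_lock(data->lock);
            extend_end();               /* yields if the kernel extended us */

            /* stagger the threads */
            sleep_us(100 + data->cpu * 27);
        }
        return NULL;
    }

With -w, the extend_start() call would simply move up above grab_lock(), so
that the wait for the lock is covered as well.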
Without rseq enabled, we had:

  for i in `seq 10` ; do ./extend-sched -d ; done

Finish up
Ran for 105278 times
  Total wait time: 7.657703 (avg: 0.000072)
  Total contention: 88661
  Total extended: 0
  max wait: 1410  max: 328 (avg: 43)
Finish up
Ran for 106703 times
  Total wait time: 7.371958 (avg: 0.000069)
  Total contention: 89252
  Total extended: 0
  max wait: 1822  max: 410 (avg: 42)
Finish up
Ran for 106679 times
  Total wait time: 7.344924 (avg: 0.000068)
  Total contention: 89003
  Total extended: 0
  max wait: 1499  max: 338 (avg: 42)
Finish up
Ran for 106512 times
  Total wait time: 7.398154 (avg: 0.000069)
  Total contention: 89323
  Total extended: 0
  max wait: 1231  max: 334 (avg: 42)
Finish up
Ran for 106686 times
  Total wait time: 7.369875 (avg: 0.000069)
  Total contention: 89141
  Total extended: 0
  max wait: 1606  max: 448 (avg: 42)
Finish up
Ran for 106291 times
  Total wait time: 7.464811 (avg: 0.000070)
  Total contention: 89244
  Total extended: 0
  max wait: 1727  max: 373 (avg: 42)
Finish up
Ran for 106230 times
  Total wait time: 7.467716 (avg: 0.000070)
  Total contention: 88950
  Total extended: 0
  max wait: 4084  max: 377 (avg: 42)
Finish up
Ran for 106699 times
  Total wait time: 7.369399 (avg: 0.000069)
  Total contention: 89085
  Total extended: 0
  max wait: 1415  max: 348 (avg: 42)
Finish up
Ran for 106648 times
  Total wait time: 7.352611 (avg: 0.000068)
  Total contention: 89202
  Total extended: 0
  max wait: 1177  max: 377 (avg: 42)
Finish up
Ran for 106532 times
  Total wait time: 7.363098 (avg: 0.000069)
  Total contention: 89009
  Total extended: 0
  max wait: 1454  max: 429 (avg: 42)

Now with a 50us slice extension with rseq:

  for i in `seq 10` ; do ./extend-sched ; done

Finish up
Ran for 121185 times
  Total wait time: 3.450114 (avg: 0.000028)
  Total contention: 84405
  Total extended: 19879
  max wait: 652  max: 174 (avg: 32)
Finish up
Ran for 120842 times
  Total wait time: 3.474066 (avg: 0.000028)
  Total contention: 84338
  Total extended: 20450
  max wait: 487  max: 181 (avg: 32)
Finish up
Ran for 120814 times
  Total wait time: 3.473712 (avg: 0.000028)
  Total contention: 83938
  Total extended: 20418
  max wait: 631  max: 185 (avg: 32)
Finish up
Ran for 120918 times
  Total wait time: 3.442310 (avg: 0.000028)
  Total contention: 83921
  Total extended: 20246
  max wait: 511  max: 172 (avg: 32)
Finish up
Ran for 120685 times
  Total wait time: 3.426023 (avg: 0.000028)
  Total contention: 83327
  Total extended: 20504
  max wait: 488  max: 161 (avg: 32)
Finish up
Ran for 120873 times
  Total wait time: 3.477329 (avg: 0.000028)
  Total contention: 84139
  Total extended: 20808
  max wait: 551  max: 172 (avg: 32)
Finish up
Ran for 120667 times
  Total wait time: 3.491623 (avg: 0.000028)
  Total contention: 84004
  Total extended: 20585
  max wait: 554  max: 170 (avg: 32)
Finish up
Ran for 121595 times
  Total wait time: 3.446635 (avg: 0.000028)
  Total contention: 84568
  Total extended: 20258
  max wait: 543  max: 166 (avg: 32)
Finish up
Ran for 121729 times
  Total wait time: 3.437635 (avg: 0.000028)
  Total contention: 84825
  Total extended: 20143
  max wait: 497  max: 165 (avg: 32)
Finish up
Ran for 121545 times
  Total wait time: 3.452991 (avg: 0.000028)
  Total contention: 84583
  Total extended: 20186
  max wait: 578  max: 161 (avg: 32)

The averages of the 10 runs:

No extensions:
  avg iterations:  106426
  avg total wait:  7.416025 seconds
  avg avg wait:    0.000069 seconds
  contention:      89087
  max wait:        1742 us
  max:             376.2 us
  avg max:         42 us

With rseq extension:
  avg iterations:  121085
  avg total wait:  3.457244 seconds
  avg avg wait:    0.000028 seconds
  contention:      84205
  max wait:        549 us
  max:             171 us
  avg max:         32 us
  extended:        20347

This shows that with a 50us extra time slice:

  It was able to run 14659 more iterations (+13.7%)
  It waited a total of 3.958781 seconds less (-53.3%)
  The average wait time was 41us less (-59.4%)
  It had 4882 fewer contentions (-5.4%)
  It had 1193us less max wait time (-68.5%)
  It held the lock for 205.2us less (-54.5%)
  And the average time it held the lock was 10us less (-23.8%)

After running the extended version, I looked at ftrace to see if it hit the
50us max:

  # trace-cmd show | grep force
   extend-sched-29816   [000] dBH..  76942.819849: rseq_delay_resched_tick: timeout -- force resched
   extend-sched-29865   [000] dbh..  76944.151878: rseq_delay_resched_tick: timeout -- force resched
   extend-sched-29865   [000] dBh..  76945.266837: rseq_delay_resched_tick: timeout -- force resched
   extend-sched-29865   [000] dBh..  76946.182833: rseq_delay_resched_tick: timeout -- force resched

It did so 4 times.

For kicks, here's the run with '-w', where the rseq extension stays set while
a thread spins waiting for a lock. I would not recommend this even if it does
help. Why extend when it is safe to preempt?

  for i in `seq 10` ; do ./extend-sched -w ; done

Finish up
Ran for 120111 times
  Total wait time: 3.389300 (avg: 0.000028)
  Total contention: 83406
  Total extended: 23539
  max wait: 438  max: 176 (avg: 32)
Finish up
Ran for 120241 times
  Total wait time: 3.377985 (avg: 0.000028)
  Total contention: 83586
  Total extended: 23458
  max wait: 453  max: 246 (avg: 32)
Finish up
Ran for 120140 times
  Total wait time: 3.391172 (avg: 0.000028)
  Total contention: 83234
  Total extended: 23571
  max wait: 446  max: 195 (avg: 32)
Finish up
Ran for 120100 times
  Total wait time: 3.366652 (avg: 0.000028)
  Total contention: 83088
  Total extended: 23256
  max wait: 2710  max: 2592 (avg: 32)
Finish up
Ran for 120373 times
  Total wait time: 3.372495 (avg: 0.000028)
  Total contention: 83405
  Total extended: 23657
  max wait: 460  max: 164 (avg: 32)
Finish up
Ran for 120332 times
  Total wait time: 3.389414 (avg: 0.000028)
  Total contention: 83752
  Total extended: 23487
  max wait: 498  max: 223 (avg: 32)
Finish up
Ran for 120411 times
  Total wait time: 3.357409 (avg: 0.000027)
  Total contention: 83371
  Total extended: 23349
  max wait: 423  max: 175 (avg: 32)
Finish up
Ran for 120258 times
  Total wait time: 3.376960 (avg: 0.000028)
  Total contention: 83595
  Total extended: 23454
  max wait: 385  max: 164 (avg: 32)
Finish up
Ran for 120407 times
  Total wait time: 3.366934 (avg: 0.000027)
  Total contention: 83649
  Total extended: 23351
  max wait: 446  max: 164 (avg: 32)
Finish up
Ran for 120397 times
  Total wait time: 3.395540 (avg: 0.000028)
  Total contention: 83859
  Total extended: 23513
  max wait: 469  max: 172 (avg: 32)

Again, I would not recommend this. After running it, I looked at the trace
again to see if it hit the 50us max (I did reset the buffer before running),
and I had this:

  # trace-cmd show | grep force | wc -l
  19697

[1] https://lore.kernel.org/all/20231025054219.1acaa3dd@gandalf.local.home/
[2] https://lore.kernel.org/all/20231030132949.GA38123@noisy.programming.kicks-ass.net/
[3] https://rostedt.org/code/extend-sched.c
Steven Rostedt (2):
      sched: Shorten time that tasks can extend their time slice for
      sched: Extended scheduler time slice

----
 include/linux/entry-common.h |  2 +
 include/linux/sched.h        | 19 +++++++++
 include/uapi/linux/rseq.h    | 24 +++++++++++
 kernel/entry/common.c        | 16 +++++++-
 kernel/rseq.c                | 96 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          | 16 ++++++++
 kernel/sched/syscalls.c      |  6 +++
 7 files changed, 177 insertions(+), 2 deletions(-)