Message ID | 20231025235413.597287e1@gandalf.local.home
---|---
State | New
Series | [POC,RFC,v2] sched: Extended Scheduler Time Slice
On Wed, Oct 25, 2023 at 11:54:13PM -0400, Steven Rostedt wrote:

> static void extend(void)
> {
> 	rseq_map.cr_flags = 1;
> }
>
> static void unextend(void)
> {
> 	unsigned long prev;
>
> 	prev = xchg(&rseq_map.cr_flags, 0);

So you complain about overhead and then you add one of the most expensive
ops possible here? xchg has an implicit LOCK prefix and you really don't
need LOCK prefix here.

> 	if (prev & 2) {
> 		tracefs_printf(NULL, "Yield!\n");
> 		sched_yield();
> 	}
> }
On Thu, 26 Oct 2023 12:59:44 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Oct 25, 2023 at 11:54:13PM -0400, Steven Rostedt wrote:
>
> > static void extend(void)
> > {
> > 	rseq_map.cr_flags = 1;
> > }
> >
> > static void unextend(void)
> > {
> > 	unsigned long prev;
> >
> > 	prev = xchg(&rseq_map.cr_flags, 0);
>
> So you complain about overhead and then you add one of the most
> expensive ops possible here? xchg has an implicit LOCK prefix and you
> really don't need LOCK prefix here.

Peter, this is the user space side, where I cut and pasted the code from
the file I attached.

That has:

static inline unsigned long
xchg(volatile unsigned *ptr, unsigned new)
{
	unsigned ret = new;

	asm volatile("xchg %b0,%1"
		     : "+r"(ret), "+m"(*(ptr))
		     : : "memory");
	return ret;
}

-- Steve

> > 	if (prev & 2) {
> > 		tracefs_printf(NULL, "Yield!\n");
> > 		sched_yield();
> > 	}
> > }
On 2023-10-26 07:14, Steven Rostedt wrote:
> On Thu, 26 Oct 2023 12:59:44 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Wed, Oct 25, 2023 at 11:54:13PM -0400, Steven Rostedt wrote:
>>
>>> static void extend(void)
>>> {
>>> 	rseq_map.cr_flags = 1;
>>> }
>>>
>>> static void unextend(void)
>>> {
>>> 	unsigned long prev;
>>>
>>> 	prev = xchg(&rseq_map.cr_flags, 0);
>>
>> So you complain about overhead and then you add one of the most
>> expensive ops possible here? xchg has an implicit LOCK prefix and you
>> really don't need LOCK prefix here.
>
> Peter, this is the user space side, where I cut and pasted the code from
> the file I attached.
>
> That has:
>
> static inline unsigned long
> xchg(volatile unsigned *ptr, unsigned new)
> {
> 	unsigned ret = new;
>
> 	asm volatile("xchg %b0,%1"

which has an implicit lock prefix (xchg with a memory operand is a
special-case):

Quoting Intel manual:

"If a memory operand is referenced, the processor’s locking protocol is
automatically implemented for the duration of the exchange operation,
regardless of the presence or absence of the LOCK prefix or of the value
of the IOPL. (See the LOCK prefix description in this chapter for more
information on the locking protocol.)"

Thanks,

Mathieu

> 	: "+r"(ret), "+m"(*(ptr))
> 	: : "memory");
> 	return ret;
> }
>
> -- Steve
>
>
>>
>>> 	if (prev & 2) {
>>> 		tracefs_printf(NULL, "Yield!\n");
>>> 		sched_yield();
>>> 	}
>>> }
>
On Thu, 26 Oct 2023 at 08:36, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> > 	asm volatile("xchg %b0,%1"
>
> which has an implicit lock prefix (xchg with a memory operand is a
> special-case):

Yeah, this is why we do "percpu_xchg()" - which does not want locked
semantics - as a "cmpxchg" loop.

Steven, check out

  arch/x86/include/asm/percpu.h

for a rough implementation of a 'xchg()' without SMP coherency, just
cpu-local one (ie atomic wrt being preempted by the kernel, but not
atomic wrt other CPU's accessing the same variable concurrently)

Linus
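A minimal sketch of the kind of cpu-local exchange Linus is describing,
assuming x86 and a made-up helper name: a cmpxchg loop without the LOCK
prefix, so the operation is atomic with respect to being preempted on this
CPU, but not coherent against other CPUs writing the same location. The
authoritative reference is the percpu_*_op machinery in
arch/x86/include/asm/percpu.h; this is only an illustration of the idea.

/*
 * Sketch only (illustrative name): exchange a value using cmpxchg
 * *without* a LOCK prefix.  The single cmpxchg instruction cannot be
 * torn by preemption, so this is safe for data that only the current
 * thread (and the kernel, on its behalf) touches -- but it is NOT
 * coherent against other CPUs writing the same memory.
 */
static inline unsigned local_xchg(volatile unsigned *ptr, unsigned newval)
{
	unsigned old = *ptr;
	unsigned prev;

	do {
		prev = old;
		asm volatile("cmpxchgl %2,%1"
			     : "=a" (old), "+m" (*ptr)
			     : "r" (newval), "0" (prev)
			     : "memory");
	} while (old != prev);

	return old;
}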
On 2023-10-26 14:50, Linus Torvalds wrote:
> On Thu, 26 Oct 2023 at 08:36, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>>> 	asm volatile("xchg %b0,%1"
>>
>> which has an implicit lock prefix (xchg with a memory operand is a
>> special-case):
>
> Yeah, this is why we do "percpu_xchg()" - which does not want locked
> semantics - as a "cmpxchg" loop.
>
> Steven, check out
>
>   arch/x86/include/asm/percpu.h
>
> for a rough implementation of a 'xchg()' without SMP coherency, just
> cpu-local one (ie atomic wrt being preempted by the kernel, but not
> atomic wrt other CPU's accessing the same variable concurrently)

Actually Steven does not need a xchg to test-and-set a single bit which
is only accessed concurrently between kernel and userspace from the same
thread. Either "bts" or "andb" should work fine.

Thanks,

Mathieu

>
> Linus
On Thu, 26 Oct 2023 14:36:36 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > static inline unsigned long
> > xchg(volatile unsigned *ptr, unsigned new)
> > {
> > 	unsigned ret = new;
> >
> > 	asm volatile("xchg %b0,%1"
>
> which has an implicit lock prefix (xchg with a memory operand is a
> special-case):
>
> Quoting Intel manual:
>
> "If a memory operand is referenced, the processor’s locking protocol is
> automatically implemented for the duration of the exchange operation,
> regardless of the presence or absence of the LOCK prefix or of the value
> of the IOPL. (See the LOCK prefix description in this chapter for more
> information on the locking protocol.)"
>

Bah. I learn something new every day :-p  I thought xchg required
specifying lock too. This means that cmpxchg is actually faster than
xchg! It's been a long time since I've written assembly.

What makes this interesting, though, is that when I ran my tests
originally with my change disabled, the xchg() code never executed, and
it still showed a significant improvement. That is, even with the added
xchg(), it still improved much more. But that's probably because the
xchg() happens after releasing the lock, and after that it goes into a
loop waiting for another thread to make the update, which requires a
lock cmpxchg(). Hence, the xchg() slowdown wasn't in a critical path of
the test.

Anyway, I changed the code to use:

static inline unsigned clrbit(volatile unsigned *ptr)
{
	unsigned ret;

	asm volatile("andb %b1,%0"
		     : "+m" (*(volatile char *)ptr)
		     : "iq" (0x2)
		     : "memory");

	ret = *ptr;
	*ptr = 0;

	return ret;
}

I just used andb as that's a locally atomic operation. I could also use
bts (as Mathieu suggested to me on IRC). It doesn't matter, as this is
just example code and is not in production. How this is implemented is
not an important part of the algorithm. Just that it can atomically clear
the bit it sets without a race with the kernel setting the bit it sets. I
could modify the code to put these bits in separate bytes as well. That's
just an implementation detail we can work on later.

I updated my test to have:

static bool no_rseq;

static void init_extend_map(void)
{
	int ret;

	if (no_rseq)
		return;

	ret = rseq(&rseq_map, sizeof(rseq_map), 0, 0);
	perror("rseq");
	printf("ret = %d (%zd) %p\n", ret, sizeof(rseq_map), &rseq_map);
}

static inline unsigned clrbit(volatile unsigned *ptr)
{
	unsigned ret;

	asm volatile("andb %b1,%0"
		     : "+m" (*(volatile char *)ptr)
		     : "iq" (0x2)
		     : "memory");

	ret = *ptr;
	*ptr = 0;

	return ret;
}

static void extend(void)
{
	if (no_rseq)
		return;

	rseq_map.cr_flags = 1;
}

static void unextend(void)
{
	unsigned prev;

	if (no_rseq)
		return;

	prev = clrbit(&rseq_map.cr_flags);
	if (prev & 2) {
		tracefs_printf(NULL, "Yield!\n");
		sched_yield();
	}
}

where the tests with it disabled do not run the extend() or unextend()
code (as it makes sense to only perform that code with this feature, as
that would be part of the overhead to implement it).

Here's the results:

With no_rseq = true, I did 5 runs, which were basically all the same; the
best run was:

 Ran for 4260797 times
 Total wait time: 33.905306

And with no_rseq = false, most runs were like:

 Ran for 5378845 times
 Total wait time: 32.253018

But the worst run was:

 Ran for 4991411 times
 Total wait time: 31.745662

But with that lower "wait time" it probably was preempted by something
else during that run, as the lower the "wait time" the better the result.

-- Steve
On Thu, 26 Oct 2023 14:59:13 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > Steven, check out
> >
> >   arch/x86/include/asm/percpu.h
> >
> > for a rough implementation of a 'xchg()' without SMP coherency, just
> > cpu-local one (ie atomic wrt being preempted by the kernel, but not
> > atomic wrt other CPU's accessing the same variable concurrently)
>
> Actually Steven does not need a xchg to test-and-set a single bit which
> is only accessed concurrently between kernel and userspace from the same
> thread. Either "bts" or "andb" should work fine.

Thanks Linus and Mathieu!

Yeah, I just changed it to andb and that works just as well (or better).
When I first made the change, it became substantially worse. But then I
realized it was because I had "addb" and not "andb" :-D

-- Steve
On Thu, 26 Oct 2023 14:59:13 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > for a rough implementation of a 'xchg()' without SMP coherency, just
> > cpu-local one (ie atomic wrt being preempted by the kernel, but not
> > atomic wrt other CPU's accessing the same variable concurrently)
>
> Actually Steven does not need a xchg to test-and-set a single bit which
> is only accessed concurrently between kernel and userspace from the same
> thread. Either "bts" or "andb" should work fine.

Hmm, how would bts work? Doesn't that just set a bit? I need to clear one
bit while seeing if another bit is set. I could also use subl, as that
would also atomically clear the bit.

-- Steve
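For background on why the single-bit instructions do not suffice on their
own: bts/btr report, in the carry flag, only the state of the one bit they
set or clear, so neither alone can clear bit 0 while also answering "was
the kernel's bit 1 set?". The andb approach used above works because the
single non-LOCK read-modify-write clears bit 0 while leaving bit 1 in
memory, and once bit 0 is clear the kernel will no longer set bit 1, so it
can be read afterwards without a race. A rough sketch, with an
illustrative helper name:

/*
 * Rough sketch (illustrative name): clear the user's "in critical
 * section" bit (bit 0) without disturbing the kernel's "please yield"
 * bit (bit 1), then test bit 1.  The andb is a single non-LOCK
 * read-modify-write, so preemption cannot tear it, and after bit 0 is
 * cleared the kernel will not set bit 1 anymore, making the following
 * read race-free.  Clearing bit 1 afterwards is left to the caller.
 */
static inline int clear_cr_and_test_yield(volatile unsigned *ptr)
{
	asm volatile("andb %b1,%0"
		     : "+m" (*(volatile char *)ptr)
		     : "iq" (~0x1)
		     : "memory");
	return (*ptr & 0x2) != 0;
}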
On Thu, 26 Oct 2023 15:20:22 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Anyway, I changed the code to use:
>
> static inline unsigned clrbit(volatile unsigned *ptr)
> {
> 	unsigned ret;
>
> 	asm volatile("andb %b1,%0"
> 		     : "+m" (*(volatile char *)ptr)
> 		     : "iq" (0x2)
> 		     : "memory");
>
> 	ret = *ptr;
> 	*ptr = 0;
>
> 	return ret;
> }

Mathieu also told me that glibc's rseq has some extra padding at the end,
that happens to be big enough to hold this feature. That means you can run
the code without adding:

  GLIBC_TUNABLES=glibc.pthread.rseq=0

Attached is the updated test program.

-- Steve
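A side note for anyone reproducing this: when glibc (2.35 or later)
registers rseq itself, the thread's rseq area is located relative to the
thread pointer through the __rseq_offset and __rseq_size symbols from
<sys/rseq.h> — the same lookup the PostgreSQL patch later in this thread
performs. A small standalone sketch that only locates the area (the
experimental cr_flags field from this series is not part of the upstream
structure and is not touched here):

#include <stdio.h>
#include <sys/rseq.h>	/* glibc >= 2.35: struct rseq, __rseq_offset, __rseq_size */

/* Return the rseq area glibc registered for the current thread, or NULL. */
static struct rseq *thread_rseq_area(void)
{
	if (!__rseq_size)	/* rseq registration disabled or unsupported */
		return NULL;
	return (struct rseq *)((char *)__builtin_thread_pointer() + __rseq_offset);
}

int main(void)
{
	struct rseq *rs = thread_rseq_area();

	if (!rs) {
		puts("glibc did not register rseq for this thread");
		return 1;
	}
	printf("rseq area %p, registered size %u, current cpu %u\n",
	       (void *)rs, __rseq_size, rs->cpu_id);
	return 0;
}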
On 2023-10-27 12:24, Steven Rostedt wrote:
> On Fri, 27 Oct 2023 12:09:56 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>>> I need to clear one bit while seeing if another bit is set. I could also
>>> use subl, as that would also atomically clear the bit.
>>
>> Ah ok, I did not get that you needed to test for a different bit than
>> the one you clear.
>
> Yeah, maybe I'm not articulating the implementation as well.
>
> bit 0: Set by user space to tell the kernel it's in a critical section
>
> bit 1: Set by kernel that it gave user space extend time slice
>
> Bit 1 will only be set by the kernel if bit 0 is set.
>
> When entering a critical section, user space will set bit 0. When it leaves
> the critical section, it needs to clear bit 0, but also needs to handle the
> race condition from where it clears the bit and where the kernel could
> preempt it and set bit 1. Thus it needs an atomic operation to clear bit 0
> without affecting bit 1. Once bit 0 is cleared, it does not need to worry
> about bit 1 being set after that as the kernel will only set bit 1 if it
> sees that bit 0 was set. After user space clears bit 0, it must check bit 1
> to see if it should now schedule. And it's also up to user space to clear
> bit 1, but it can do that at any time with bit 0 cleared.
>
> extend() {
> 	cr_flags = 1;
> }
>
> unextend() {
> 	cr_flags &= ~1;	/* Must be atomic */
> 	if (cr_flags & 2) {
> 		cr_flags = 0; /* May not be necessary as it gets cleared by extend() */
> 		sched_yield();
> 	}
> }
>
> Does that make more sense?

Not really.

Please see my other email about the need for a reference count here, for
nested locks use-cases.

By "atomic" operation I suspect you only mean "single instruction" which can
alter the state of the field and keep its prior content in a register, not a
lock-prefixed atomic operation, right ?

The only reason why you have this asm trickiness is because both states
are placed into different bits from the same word, which is just an
optimization. You could achieve the same much more simply by splitting
this state in two different words, e.g.:

extend() {
	WRITE_ONCE(__rseq_abi->cr_nest, __rseq_abi->cr_nest + 1);
	barrier()
}

unextend() {
	barrier()
	WRITE_ONCE(__rseq_abi->cr_nest, __rseq_abi->cr_nest - 1);
	if (READ_ONCE(__rseq_abi->must_yield)) {
		WRITE_ONCE(__rseq_abi->must_yield, 0);
		sched_yield();
	}
}

Or am I missing something ?

Thanks,

Mathieu
On Fri, 27 Oct 2023 12:35:56 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > Does that make more sense?
>
> Not really.
>
> Please see my other email about the need for a reference count here, for
> nested locks use-cases.

Note, my original implementation of nested locking was done completely in
user space.

	int __thread lock_cnt;

	extend() {
		if (lock_cnt++)
			return;
		...
	}

	unextend() {
		if (--lock_cnt)
			return;
		...
	}

>
> By "atomic" operation I suspect you only mean "single instruction" which can
> alter the state of the field and keep its prior content in a register, not a
> lock-prefixed atomic operation, right ?

Correct. Just a per-CPU atomic. Hence an "andb" instruction, or "subl", or
whatever.

>
> The only reason why you have this asm trickiness is because both states
> are placed into different bits from the same word, which is just an
> optimization. You could achieve the same much more simply by splitting
> this state in two different words, e.g.:
>
> extend() {
> 	WRITE_ONCE(__rseq_abi->cr_nest, __rseq_abi->cr_nest + 1);
> 	barrier()
> }
>
> unextend() {
> 	barrier()
> 	WRITE_ONCE(__rseq_abi->cr_nest, __rseq_abi->cr_nest - 1);
> 	if (READ_ONCE(__rseq_abi->must_yield)) {
> 		WRITE_ONCE(__rseq_abi->must_yield, 0);
> 		sched_yield();
> 	}
> }
>
> Or am I missing something ?

I mentioned placing this in different bytes (although I meant words), but
yes, if we make them separate it would make it easier. I was being frugal
about memory, though: since this is just two bits (or a counter plus an
extra bit), I didn't want to waste two words for what can be done with
one.

But this is still an implementation detail, and this code is still very
much in flux, so I'm not as worried about those details yet.

-- Steve
As this change showed a really good improvement in my micro-benchmark, I
decided to give it a try on something that's a bit more of a real world
benchmark: PostgreSQL. Hence, I'm also Cc'ing the postgres maintainers in
case they are interested. To bring them up to speed, here's what's been
going on:

A new feature is being discussed to add NEED_RESCHED_LAZY that will allow
us (the kernel developers) to remove CONFIG_PREEMPT_NONE and
CONFIG_PREEMPT_VOLUNTARY, and move that to user space runtime switches.
The idea is that when the scheduler tick goes off and it's time to
schedule out a SCHED_OTHER task, user space can have the option of not
doing that in the kernel and waiting until it enters back into user space
before scheduling (simulating PREEMPT_NONE). The scheduler tick will set
NEED_RESCHED_LAZY (instead of NEED_RESCHED, which always schedules when
possible), and that bit will only be looked at when exiting back into user
space, where it will perform the schedule if it is set.

My idea is to extend this into user space as well. Using the restartable
sequences infrastructure (https://lwn.net/Articles/697979/) that maps
memory between kernel and user space for threads, I'll use two bits (or
one bit and a counter, but that's for later; I'll just discuss the current
implementation).

bit 0: set by user space to tell the kernel that it's in a critical
       section.

bit 1: set by the kernel telling user space that it granted it a bit more
       time and that it should call back into the kernel with any system
       call (sched_yield() or gettid()) when it is out of its critical
       section.

Bit 1 will never be set if bit 0 is not set. (Note, there's talk about
making bit 0 the one set by the kernel, or using a different word entirely
to allow the rest of the bits to be used as a counter for nested critical
sections.)

Now when returning back to user space, if the critical section bit (or
counter) is set, then it will not call schedule when NEED_RESCHED_LAZY is
set. Note that it will still always call schedule on NEED_RESCHED. This
gives user space one more tick (1 ms with a 1000 HZ kernel config, up to
4 ms with a 250 HZ kernel config). When user space is done with its
critical section, it should check the bit that can be set by the kernel to
see if it should then schedule. If the user side bit is not cleared after
a tick, then the kernel will set NEED_RESCHED, which will force a schedule
no matter where user space happens to be. Note, this could also hurt that
task in that the scheduler will take away the eligibility of that task to
balance out the amount of extra time the task ran for, not to mention the
forced schedule could now land in a critical section.

Back in 2014 at the Linux Collaboration Summit in Napa Valley I had a nice
conversation with Robert Haas about user space spin locks. He told me how
they are used in PostgreSQL, where futexes did not meet their needs. This
conversation kicked off the idea about implementing user space adaptive
spin locks (which, last year in Belgium, I asked André Almeida to
implement - https://lwn.net/Articles/931789/). Even though user space
adaptive spinning would greatly help out the contention of a lock, there's
still the issue of a lock owner being preempted, which would cause all
those waiters to also go into the kernel and delay access to the critical
sections.
In the real time kernel, this was solved by NEED_RESCHED_LAZY:

  https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/preempt-lazy-support.patch?h=v5.9-rc2-rt1-patches

Now Thomas has proposed using a similar solution to solve the
PREEMPT_NONE / VOLUNTARY issue:

  https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/

which now has a prototype in the rt-devel tree:

  https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches

I applied that to v6.6-rc4 (my current working branch), along with the
solution I explained above (POC with debugging still in it):

  https://lore.kernel.org/all/20231025235413.597287e1@gandalf.local.home/

Then I downloaded the latest postgres from:

  https://github.com/postgres/postgres.git

and built sha: 26f988212eada9c586223cbbf876c7eb455044d9

After installing it, I rebooted the machine with the updated kernel
(requires CONFIG_PREEMPT_AUTO being set), and ran the unmodified version
of postgres with the pgbench test:

  pgbench -c 100 -T 300 -j 8 -S -n

I ran it 16 times and looked at the transactions per second counter (tps).
Note, I threw out the first run as it had horrible numbers, probably due
to everything being in cold cache (memory and file system). Then I applied
the below patch, did a make clean, make install, rebooted the box again
and ran the test another 16 times (again, the first run was horrible).

Here are the results of the tests. I only used the 15 runs after the first
run for comparisons.

Without the patched postgres executable:

First run:
tps = 72573.188203 (without initial connection time)

15 runs:
tps = 74315.731978 (without initial connection time)
tps = 74448.130108 (without initial connection time)
tps = 74662.246100 (without initial connection time)
tps = 73124.961311 (without initial connection time)
tps = 74653.611878 (without initial connection time)
tps = 74765.296134 (without initial connection time)
tps = 74497.066104 (without initial connection time)
tps = 74541.664031 (without initial connection time)
tps = 74595.032066 (without initial connection time)
tps = 74545.876793 (without initial connection time)
tps = 74762.560651 (without initial connection time)
tps = 74528.657018 (without initial connection time)
tps = 74814.700753 (without initial connection time)
tps = 74687.980967 (without initial connection time)
tps = 74973.185122 (without initial connection time)

With the patched postgres executable:

First run:
tps = 73562.005970 (without initial connection time)

15 runs:
tps = 74560.101322 (without initial connection time)
tps = 74711.177071 (without initial connection time)
tps = 74551.093281 (without initial connection time)
tps = 74559.452628 (without initial connection time)
tps = 74737.604361 (without initial connection time)
tps = 74521.606019 (without initial connection time)
tps = 74870.859166 (without initial connection time)
tps = 74545.423471 (without initial connection time)
tps = 74805.939815 (without initial connection time)
tps = 74665.240730 (without initial connection time)
tps = 74701.479550 (without initial connection time)
tps = 74897.154079 (without initial connection time)
tps = 74879.687067 (without initial connection time)
tps = 74792.563116 (without initial connection time)
tps = 74852.101317 (without initial connection time)

Without the patch:  Average: 74527.7800  Std Dev: 420.6304
With the patch:     Average: 74710.0988  Std Dev: 136.7250

Notes about my setup.
I ran this on one of my older test boxes (pretty much the last of the bare metal machines I test on, as now I do most on VMs, but did not want to run these tests on VMs). It's a 4 core (2 hyper threaded) total of 8 CPUs: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz 32G of RAM. Postgres has several types of locks. I first applied the extend()/unextend() to only the raw spin locks but that didn't make much difference. When I traced it, I found that the time slice seldom landed when one of these spin locks were held. 10 or so during a 5 minute run (I added writes to the kernel tracing buffer via trace_marker to know when the locks were held, and also recorded sched_switch to see if it was ever preempted). So I expanded the usage to the "light weight locks" (lwlocks), which are similar to the spin locks but does some complex backing off. Basically a heuristic spin. Anyway, after adding the logic to these locks, it definitely made a difference. Not a huge one, but it was noticeable beyond the noise. I can imagine that if this was implemented on a machine with many more CPUs than 8, it would make an even bigger difference. I also had to rerun my tests because I left some kernel config options enabled that affected performance. I didn't want that to skew the results. But the results were similar, except that with the slower kernel, the worse performance with the patch was better than the best performance without it. Not by much, but still. After removing the slowdown, that was no longer the case. But I noticed that with the patch, the standard deviation was much smaller than without the patch. I'm guessing that without the patch it depends on how the scheduler interacts with the locking much more and a good run without the patch was just the test being "lucky" that it didn't get preempted as much in a critical section. I personally think the smaller standard deviation is a win as it makes the database run with a more deterministic behavior. Anyway, I'd really like to know what others think about this, and perhaps they can run this on their own testing infrastructure. All the code is available to reproduce. If you want to reproduce it like I did. Checkout the latest Linus tree (or do what I have which is v6.6-rc4), apply the above mentioned PREEMPT_AUTO.patch, then apply my kernel patch. Select CONFIG_PREEMPT_AUTO and build the kernel. Don't worry about that big banner that is on the kernel console at boot up telling you that you are running a debug kernel. You are running one, and it's because I still have a couple of trace_printk()s in it. Run your postgres tests with and without the patch and let me know if there's a difference. Only if you think it's worth it. Let me know if you don't ;-) -- Steve [ The biggest part of the below change is adding in the standard rseq_abi.h header ] diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index 315a78cda9..652d3a5560 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -89,11 +89,12 @@ #include "storage/spin.h" #include "utils/memutils.h" +#include "storage/rseq-abi.h" + #ifdef LWLOCK_STATS #include "utils/hsearch.h" #endif - /* We use the ShmemLock spinlock to protect LWLockCounter */ extern slock_t *ShmemLock; @@ -841,6 +842,8 @@ LWLockAttemptLock(LWLock *lock, LWLockMode mode) desired_state += LW_VAL_SHARED; } + if (lock_free) + extend(); /* * Attempt to swap in the state we are expecting. If we didn't see * lock to be free, that's just the old value. 
If we saw it as free, @@ -863,9 +866,14 @@ LWLockAttemptLock(LWLock *lock, LWLockMode mode) #endif return false; } - else + else { + if (lock_free) + unextend(); return true; /* somebody else has the lock */ + } } + if (lock_free) + unextend(); } pg_unreachable(); } @@ -1868,6 +1876,7 @@ LWLockRelease(LWLock *lock) LWLockWakeup(lock); } + unextend(); /* * Now okay to allow cancel/die interrupts. */ diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c index 327ac64f7c..c22310cfe3 100644 --- a/src/backend/storage/lmgr/s_lock.c +++ b/src/backend/storage/lmgr/s_lock.c @@ -55,6 +55,8 @@ #include "storage/s_lock.h" #include "utils/wait_event.h" +#include "storage/rseq-abi.h" + #define MIN_SPINS_PER_DELAY 10 #define MAX_SPINS_PER_DELAY 1000 #define NUM_DELAYS 1000 @@ -66,7 +68,6 @@ slock_t dummy_spinlock; static int spins_per_delay = DEFAULT_SPINS_PER_DELAY; - /* * s_lock_stuck() - complain about a stuck spinlock */ @@ -94,6 +95,8 @@ s_lock(volatile slock_t *lock, const char *file, int line, const char *func) { SpinDelayStatus delayStatus; + unextend(); + init_spin_delay(&delayStatus, file, line, func); while (TAS_SPIN(lock)) @@ -102,6 +105,7 @@ s_lock(volatile slock_t *lock, const char *file, int line, const char *func) } finish_spin_delay(&delayStatus); + extend(); return delayStatus.delays; } diff --git a/src/include/storage/rseq-abi.h b/src/include/storage/rseq-abi.h new file mode 100644 index 0000000000..b858cf1d6f --- /dev/null +++ b/src/include/storage/rseq-abi.h @@ -0,0 +1,174 @@ +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ +#ifndef _RSEQ_ABI_H +#define _RSEQ_ABI_H + +/* + * rseq-abi.h + * + * Restartable sequences system call API + * + * Copyright (c) 2015-2022 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> + */ + +#include <ctype.h> +#include <asm/types.h> + +enum rseq_abi_cpu_id_state { + RSEQ_ABI_CPU_ID_UNINITIALIZED = -1, + RSEQ_ABI_CPU_ID_REGISTRATION_FAILED = -2, +}; + +enum rseq_abi_flags { + RSEQ_ABI_FLAG_UNREGISTER = (1 << 0), +}; + +enum rseq_abi_cs_flags_bit { + RSEQ_ABI_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0, + RSEQ_ABI_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1, + RSEQ_ABI_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2, +}; + +enum rseq_abi_cs_flags { + RSEQ_ABI_CS_FLAG_NO_RESTART_ON_PREEMPT = + (1U << RSEQ_ABI_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT), + RSEQ_ABI_CS_FLAG_NO_RESTART_ON_SIGNAL = + (1U << RSEQ_ABI_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT), + RSEQ_ABI_CS_FLAG_NO_RESTART_ON_MIGRATE = + (1U << RSEQ_ABI_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), +}; + +/* + * struct rseq_abi_cs is aligned on 4 * 8 bytes to ensure it is always + * contained within a single cache-line. It is usually declared as + * link-time constant data. + */ +struct rseq_abi_cs { + /* Version of this structure. */ + __u32 version; + /* enum rseq_abi_cs_flags */ + __u32 flags; + __u64 start_ip; + /* Offset from start_ip. */ + __u64 post_commit_offset; + __u64 abort_ip; +} __attribute__((aligned(4 * sizeof(__u64)))); + +/* + * struct rseq_abi is aligned on 4 * 8 bytes to ensure it is always + * contained within a single cache-line. + * + * A single struct rseq_abi per thread is allowed. + */ +struct rseq_abi { + /* + * Restartable sequences cpu_id_start field. Updated by the + * kernel. Read by user-space with single-copy atomicity + * semantics. This field should only be read by the thread which + * registered this data structure. Aligned on 32-bit. Always + * contains a value in the range of possible CPUs, although the + * value may not be the actual current CPU (e.g. 
if rseq is not + * initialized). This CPU number value should always be compared + * against the value of the cpu_id field before performing a rseq + * commit or returning a value read from a data structure indexed + * using the cpu_id_start value. + */ + __u32 cpu_id_start; + /* + * Restartable sequences cpu_id field. Updated by the kernel. + * Read by user-space with single-copy atomicity semantics. This + * field should only be read by the thread which registered this + * data structure. Aligned on 32-bit. Values + * RSEQ_CPU_ID_UNINITIALIZED and RSEQ_CPU_ID_REGISTRATION_FAILED + * have a special semantic: the former means "rseq uninitialized", + * and latter means "rseq initialization failed". This value is + * meant to be read within rseq critical sections and compared + * with the cpu_id_start value previously read, before performing + * the commit instruction, or read and compared with the + * cpu_id_start value before returning a value loaded from a data + * structure indexed using the cpu_id_start value. + */ + __u32 cpu_id; + /* + * Restartable sequences rseq_cs field. + * + * Contains NULL when no critical section is active for the current + * thread, or holds a pointer to the currently active struct rseq_cs. + * + * Updated by user-space, which sets the address of the currently + * active rseq_cs at the beginning of assembly instruction sequence + * block, and set to NULL by the kernel when it restarts an assembly + * instruction sequence block, as well as when the kernel detects that + * it is preempting or delivering a signal outside of the range + * targeted by the rseq_cs. Also needs to be set to NULL by user-space + * before reclaiming memory that contains the targeted struct rseq_cs. + * + * Read and set by the kernel. Set by user-space with single-copy + * atomicity semantics. This field should only be updated by the + * thread which registered this data structure. Aligned on 64-bit. + */ + union { + __u64 ptr64; + + /* + * The "arch" field provides architecture accessor for + * the ptr field based on architecture pointer size and + * endianness. + */ + struct { +#ifdef __LP64__ + __u64 ptr; +#elif defined(__BYTE_ORDER) ? (__BYTE_ORDER == __BIG_ENDIAN) : defined(__BIG_ENDIAN) + __u32 padding; /* Initialized to zero. */ + __u32 ptr; +#else + __u32 ptr; + __u32 padding; /* Initialized to zero. */ +#endif + } arch; + } rseq_cs; + + /* + * Restartable sequences flags field. + * + * This field should only be updated by the thread which + * registered this data structure. Read by the kernel. + * Mainly used for single-stepping through rseq critical sections + * with debuggers. + * + * - RSEQ_ABI_CS_FLAG_NO_RESTART_ON_PREEMPT + * Inhibit instruction sequence block restart on preemption + * for this thread. + * - RSEQ_ABI_CS_FLAG_NO_RESTART_ON_SIGNAL + * Inhibit instruction sequence block restart on signal + * delivery for this thread. + * - RSEQ_ABI_CS_FLAG_NO_RESTART_ON_MIGRATE + * Inhibit instruction sequence block restart on migration for + * this thread. + */ + __u32 flags; + + /* + * Restartable sequences node_id field. Updated by the kernel. Read by + * user-space with single-copy atomicity semantics. This field should + * only be read by the thread which registered this data structure. + * Aligned on 32-bit. Contains the current NUMA node ID. + */ + __u32 node_id; + + /* + * Restartable sequences mm_cid field. Updated by the kernel. Read by + * user-space with single-copy atomicity semantics. 
This field should + * only be read by the thread which registered this data structure. + * Aligned on 32-bit. Contains the current thread's concurrency ID + * (allocated uniquely within a memory map). + */ + __u32 mm_cid; + + __u32 cr_flags; + /* + * Flexible array member at end of structure, after last feature field. + */ + char end[]; +} __attribute__((aligned(4 * sizeof(__u64)))); + +#endif /* _RSEQ_ABI_H */ diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h index c9fa84cc43..999a056c6f 100644 --- a/src/include/storage/s_lock.h +++ b/src/include/storage/s_lock.h @@ -96,6 +96,10 @@ #ifndef S_LOCK_H #define S_LOCK_H +#define __GNU_SOURCE +#include <unistd.h> +#include "storage/rseq-abi.h" + #ifdef FRONTEND #error "s_lock.h may not be included from frontend code" #endif @@ -131,6 +135,53 @@ *---------- */ +extern ptrdiff_t __rseq_offset; +extern unsigned int __rseq_size; + +static inline unsigned clear_extend(volatile unsigned *ptr) +{ + unsigned ret; + + asm volatile("andb %b1,%0" + : "+m" (*(volatile char *)ptr) + : "iq" (~0x1) + : "memory"); + + ret = *ptr; + *ptr = 0; + + return ret; +} + +static inline void extend(void) +{ + struct rseq_abi *rseq_ptr; + + if (!__rseq_size) + return; + + rseq_ptr = (void *)((unsigned long)__builtin_thread_pointer() + __rseq_offset); + rseq_ptr->cr_flags = 1; +} + +static inline void unextend(void) +{ + struct rseq_abi *rseq_ptr; + unsigned prev; + + if (!__rseq_size) + return; + + rseq_ptr = (void *)((unsigned long)__builtin_thread_pointer() + __rseq_offset); + + prev = clear_extend(&rseq_ptr->cr_flags); + if (prev & 2) { + gettid(); + } +} + +#define __S_UNLOCK(lock) do { S_UNLOCK(lock); unextend(); } while (0) +#define __S_LOCK(lock) do { extend(); S_LOCK(lock); } while (0) #ifdef __i386__ /* 32-bit i386 */ #define HAS_TEST_AND_SET diff --git a/src/include/storage/spin.h b/src/include/storage/spin.h index 5d809cc980..4230025748 100644 --- a/src/include/storage/spin.h +++ b/src/include/storage/spin.h @@ -59,9 +59,9 @@ #define SpinLockInit(lock) S_INIT_LOCK(lock) -#define SpinLockAcquire(lock) S_LOCK(lock) +#define SpinLockAcquire(lock) __S_LOCK(lock) -#define SpinLockRelease(lock) S_UNLOCK(lock) +#define SpinLockRelease(lock) __S_UNLOCK(lock) #define SpinLockFree(lock) S_LOCK_FREE(lock)
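To make the user-space half of the patch above easier to follow in one
place, here is a self-contained sketch of the extend()/unextend() handshake
it implements. It assumes glibc 2.35 or later (for __rseq_offset and
__rseq_size) and a kernel carrying this series' experimental cr_flags
field, which sits right after mm_cid (offset 28) in the POC layout; the
helper names, the hard-coded offset, and the error handling are
illustrative, not part of any ABI:

#include <sched.h>
#include <sys/rseq.h>	/* glibc >= 2.35: __rseq_offset, __rseq_size */

#define CR_IN_CRITICAL_SECTION	0x1	/* set by user space */
#define CR_KERNEL_REQUEST_SCHED	0x2	/* set by the kernel */

/* Offset of the experimental cr_flags field in this series' POC layout. */
#define CR_FLAGS_OFFSET		28

static inline volatile unsigned *cr_flags_ptr(void)
{
	char *area = (char *)__builtin_thread_pointer() + __rseq_offset;

	return (volatile unsigned *)(area + CR_FLAGS_OFFSET);
}

static inline void extend(void)
{
	if (!__rseq_size)
		return;
	*cr_flags_ptr() = CR_IN_CRITICAL_SECTION;
}

static inline void unextend(void)
{
	volatile unsigned *flags;
	unsigned prev;

	if (!__rseq_size)
		return;

	flags = cr_flags_ptr();

	/* Single-instruction clear of bit 0; bit 1 survives in memory. */
	asm volatile("andb %b1,%0"
		     : "+m" (*(volatile char *)flags)
		     : "iq" (~CR_IN_CRITICAL_SECTION)
		     : "memory");
	prev = *flags;
	*flags = 0;

	if (prev & CR_KERNEL_REQUEST_SCHED)
		sched_yield();	/* any syscall lets the kernel reschedule */
}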
On 2023-10-27 12:49, Steven Rostedt wrote:
> On Fri, 27 Oct 2023 12:35:56 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>>> Does that make more sense?
>>
>> Not really.
>>
>> Please see my other email about the need for a reference count here, for
>> nested locks use-cases.
>
> Note, my original implementation of nested locking was done completely in
> user space.
>
> 	int __thread lock_cnt;
>
> 	extend() {
> 		if (lock_cnt++)
> 			return;
> 		...
> 	}
>
> 	unextend() {
> 		if (--lock_cnt)
> 			return;
> 		...
> 	}

This only works if "your" lock implementation is the only user of this
RSEQ feature within a process. RSEQ requires that multiple libraries can
share the facilities. Therefore, the rseq field should include the
nesting counter as part of the RSEQ ABI so various userspace libraries
can use it collaboratively.

Thanks,

Mathieu
On Mon, 30 Oct 2023 08:56:50 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> This only works if "your" lock implementation is the only user of this
> RSEQ feature within a process. RSEQ requires that multiple libraries can
> share the facilities. Therefore, the rseq field should include the
> nesting counter as part of the RSEQ ABI so various userspace libraries
> can use it collaboratively.
>

Then I would suggest allowing bit 31 to be an "on/off" switch, and the
lower bits to be a counter. When I first tried this with postgres, there
was one lwlock that looked to be held for the majority of the run, so I
got rid of the nesting. But I think a mixture of both would work, where
you can have a nesting counter and an on/off switch, with the caveat that
if you use it and enable it, another library may disable it.

-- Steve
On 2023-10-30 09:45, Steven Rostedt wrote:
> On Mon, 30 Oct 2023 08:56:50 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>
>> This only works if "your" lock implementation is the only user of this
>> RSEQ feature within a process. RSEQ requires that multiple libraries can
>> share the facilities. Therefore, the rseq field should include the
>> nesting counter as part of the RSEQ ABI so various userspace libraries
>> can use it collaboratively.
>>
>
> Then I would suggest allowing bit 31 be an "on/off" switch, and the lower
> bits to be a counter. When I first tried this with postgres, there was one
> lwlock that looked to be held for the majority of the run, so I got rid of
> the nesting. But I think a mixture of both would work, where you can have a
> nesting counter and an on/off switch with the caveat that if you use it and
> enable it, another library may disable it.

If you have the nesting counter, why do you need the explicit on/off
switch ?

Thanks,

Mathieu
On Mon, 30 Oct 2023 14:05:05 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> If you have the nesting counter, why do you need the explicit on/off
> switch ?

Because I gave up when I found that one of the lwlocks seemed to take a
long time (pretty much the entire test), or I couldn't find how it was
unlocked (the code isn't trivial).

So I just made every unlock disable the extended time slot. I need to go
back and enable both a counter and an on/off switch, as I now realize that
the spin locks (called within the lwlock) will disable the extend time
before the lwlock is released. This should work if I have the spinlocks
inc and dec (they are straightforward and all locks are associated with an
easily found unlock), and have the lwlock use bit 31 as an on/off switch.

Anyway, I would let user space decide what it wants to do: give it 31 bits
to say "I'm extended" and let user space come up with how it handles those
31 bits.

-- Steve
On 2023-10-30 14:19, Steven Rostedt wrote:
> On Mon, 30 Oct 2023 14:05:05 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> If you have the nesting counter, why do you need the explicit on/off
>> switch ?
>
> Because I gave up when I found that one of the lwlocks seemed to take a long
> time (pretty much the entire test) or I couldn't find how it was unlocked
> (the code isn't trivial).
>
> So I just made every unlock disable the extended time slot. I need to go
> back and enable both a counter and an on/off as I now realize that the spin
> locks (called within the lwlock) will disable the extend time before the
> lwlock is released. This should work if I have the spinlocks inc and dec
> (they are straight forward and all locks are associated with an easily
> found unlock), and have the lwlock use bit 31 as an on/off switch.

This extra on/off switch appears to be working around userspace issues.

> Anyway, I would let user space decide what it wants to do, and giving it 31
> bits to say "I'm extended" and let user space come up with how it handles
> those 31 bits.

If this makes it into the RSEQ uapi, RSEQ should state how userspace
should collaborate wrt those bits (e.g. nesting counter protocol), even
though it's not a kernel ABI per se. Otherwise we'll just push this to
libc to specify this, which is odd.

Thanks,

Mathieu
On Mon, 30 Oct 2023 14:27:10 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > So I just made every unlock disable the extended time slot. I need to go
> > back and enable both a counter and an on/off as I now realize that the spin
> > locks (called within the lwlock) will disable the extend time before the
> > lwlock is released. This should work if I have the spinlocks inc and dec
> > (they are straight forward and all locks are associated with an easily
> > found unlock), and have the lwlock use bit 31 as an on/off switch.
>
> This extra on/off switch appears to be working around userspace issues.

Yep! But that doesn't mean there's not a legitimate use case for it. I
don't want to limit the feature for that. It's unlikely bit 31 would ever
be hit by a counter anyway, so it could be used as an on/off switch the
same way the NEED_RESCHED bit is used as an on/off switch for
preempt_count in the kernel.

>
> > Anyway, I would let user space decide what it wants to do, and giving it 31
> > bits to say "I'm extended" and let user space come up with how it handles
> > those 31 bits.
>
> If this makes it into the RSEQ uapi, RSEQ should state how userspace
> should collaborate wrt those bits (e.g. nesting counter protocol), even
> though it's not a kernel ABI per se. Otherwise we'll just push this to
> libc to specify this, which is odd.

I agree that user space should have the usage specified. Hell, that bit
could just be used for testing purposes. I think having it reserved is
better than not specifying it and limiting its usage later.

-- Steve
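To make the counter-plus-switch idea concrete, here is a rough sketch of
user space treating the low 31 bits as a nesting counter and bit 31 as a
sticky on/off switch, with the kernel's yield request kept in a separate
word (as Mathieu suggested earlier) so user space never has to mix its
updates atomically with the kernel's. All of the field and helper names
below are made up for illustration; none of them exist in the current POC:

#include <sched.h>

#define CR_EXTEND_FORCE		(1U << 31)	/* sticky on/off switch */
#define CR_EXTEND_COUNT_MASK	(~CR_EXTEND_FORCE)

/*
 * Stand-ins for fields that would live in the per-thread rseq area:
 * cr_extend is written only by user space and read by the kernel, so
 * plain updates are enough; cr_yield is written by the kernel and
 * cleared by user space.
 */
static __thread volatile unsigned cr_extend;
static __thread volatile unsigned cr_yield;

static inline void maybe_yield(void)
{
	/* Only yield once nothing is asking for an extension anymore. */
	if (cr_extend == 0 && cr_yield) {
		cr_yield = 0;
		sched_yield();
	}
}

static inline void extend(void)		/* nestable: e.g. spinlock acquire */
{
	cr_extend++;
}

static inline void unextend(void)	/* spinlock release */
{
	cr_extend--;
	maybe_yield();
}

static inline void extend_force(void)	/* e.g. lwlock acquire */
{
	cr_extend |= CR_EXTEND_FORCE;
}

static inline void unextend_force(void)	/* lwlock release */
{
	cr_extend &= CR_EXTEND_COUNT_MASK;
	maybe_yield();
}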
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h index c233aae5eac9..bd3aa4085e7b 100644 --- a/include/uapi/linux/rseq.h +++ b/include/uapi/linux/rseq.h @@ -37,6 +37,18 @@ enum rseq_cs_flags { (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), }; +enum rseq_cr_flags_bit { + RSEQ_CR_FLAG_IN_CRITICAL_SECTION_BIT = 0, + RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED_BIT = 1, +}; + +enum rseq_cr_flags { + RSEQ_CR_FLAG_IN_CRITICAL_SECTION = + (1U << RSEQ_CR_FLAG_IN_CRITICAL_SECTION_BIT), + RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED = + (1U << RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED_BIT), +}; + /* * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always * contained within a single cache-line. It is usually declared as @@ -148,6 +160,8 @@ struct rseq { */ __u32 mm_cid; + __u32 cr_flags; + /* * Flexible array member at end of structure, after last feature field. */ diff --git a/kernel/entry/common.c b/kernel/entry/common.c index c1f706038637..d8b46b9e5fd7 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -143,21 +143,35 @@ void noinstr exit_to_user_mode(void) /* Workaround to allow gradual conversion of architecture code */ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { } +bool rseq_ignore_lazy_resched(void); static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work) { + unsigned long ignore_mask; + /* * Before returning to user space ensure that all pending work * items have been completed. */ while (ti_work & EXIT_TO_USER_MODE_WORK) { + ignore_mask = 0; local_irq_enable_exit_to_user(ti_work); - if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) + if (ti_work & _TIF_NEED_RESCHED) { schedule(); + } else if (ti_work & _TIF_NEED_RESCHED_LAZY) { + if (rseq_ignore_lazy_resched()) { + trace_printk("Extend!\n"); + /* Allow to leave with NEED_RESCHED_LAZY still set */ + ignore_mask |= _TIF_NEED_RESCHED_LAZY; + } else { + schedule(); + } + } + if (ti_work & _TIF_UPROBE) uprobe_notify_resume(regs); @@ -184,6 +198,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, tick_nohz_user_enter_prepare(); ti_work = read_thread_flags(); + ti_work &= ~ignore_mask; } /* Return the latest work state for arch_exit_to_user_mode() */ diff --git a/kernel/rseq.c b/kernel/rseq.c index 9de6e35fe679..fd9d18f60c04 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -339,6 +339,33 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) force_sigsegv(sig); } +bool rseq_ignore_lazy_resched(void) +{ + struct task_struct *t = current; + u32 flags; + + if (!t->rseq) + return false; + + /* Make sure the cr_flags exist */ + if (t->rseq_len <= offsetof(struct rseq, cr_flags)) + return false; + + if (copy_from_user(&flags, &t->rseq->cr_flags, sizeof(flags))) + return false; + + if (!(flags & RSEQ_CR_FLAG_IN_CRITICAL_SECTION)) + return false; + + flags |= RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED; + + /* If we fault writing, then do not give it an extended slice */ + if (copy_to_user(&t->rseq->cr_flags, &flags, sizeof(flags))) + return false; + + return true; +} + #ifdef CONFIG_DEBUG_RSEQ /* diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 700b140ac1bb..17ca22e80384 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -993,9 +993,10 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool resched_curr(rq); } else { /* Did the task ignore the lazy reschedule request? 
*/ - if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) + if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) { + trace_printk("Force resched?\n"); resched_curr(rq); - else + } else resched_curr_lazy(rq); } clear_buddies(cfs_rq, se);