[POC,RFC] sched: Extended Scheduler Time Slice

Message ID 20231025054219.1acaa3dd@gandalf.local.home (mailing list archive)
State New
Series [POC,RFC] sched: Extended Scheduler Time Slice

Commit Message

Steven Rostedt Oct. 25, 2023, 9:42 a.m. UTC
[
 This is basically a resend of this email, but as a separate patch and not
 part of a very long thread.
 https://lore.kernel.org/lkml/20231024214958.22fff0bc@gandalf.local.home/
]

This shows very good performance improvements for user-space-implemented
spin locks, and I'm sure it can be used for spin locks in VMs too. That
will come shortly.

I started with Thomas's PREEMPT_AUTO.patch from the rt-devel tree:

 https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches

So you need to select:

  CONFIG_PREEMPT_AUTO

The below is my proof of concept patch. It still has debugging in it, and
I'm sure the interface will need to be changed.

There's now a new file:  /sys/kernel/extend_sched

Attached is a program that tests it. It mmaps that file, with:

 struct extend_map {
	unsigned long		flags;
 };

 static __thread struct extend_map *extend_map;

That is, there's this structure for every thread. It's assigned with:

	fd = open("/sys/kernel/extend_sched", O_RDWR);
	extend_map = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

I don't actually like this interface, as it wastes a full page for just two
bits :-p  Perhaps it should be a new system call, where it just locks in
existing memory from the user application? The requirement is that each
thread needs its own bits to play with. It should not be shared with other
threads. It could be, as it will not mess up the kernel, but will mess up
the application.

Anyway, to tell the kernel to "extend" the time slice if possible because
it's in a critical section, we have:

 static void extend(void)
 {
	if (!extend_map)
		return;

	extend_map->flags = 1;
 }

And to say that it's done:

 static void unextend(void)
 {
	unsigned long prev;

	if (!extend_map)
		return;

	prev = xchg(&extend_map->flags, 0);
	if (prev & 2)
		sched_yield();
 }

So, bit 1 is for user space to tell the kernel "please extend me", and bit
two is for the kernel to tell user space "OK, I extended you, but call
sched_yield() when done".

This test program creates 1 + number-of-CPUs threads, which run in a loop
for 5 seconds. Each thread will grab a user space spin lock (not a futex,
but just shared memory). Before grabbing the lock it will call "extend()".
If it fails to grab the lock, it calls "unextend()" and spins on the lock
until it's free, at which point it tries again. After it gets the lock, it
updates a counter, releases the lock, and calls "unextend()" as well. It
then spins on the counter until it increments again, to allow another task
to get into the critical section. A rough sketch of this loop follows below.
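
[ For reference, a rough sketch of that per-thread loop. The attached
  extend-sched.c is the authoritative version; the lock, counter and done
  variables and the cpu_relax() helper below are just placeholders. ]

 static volatile int lock;
 static volatile unsigned long counter;
 static volatile int done;

 static inline void cpu_relax(void)
 {
	asm volatile("pause" ::: "memory");	/* x86 pause; placeholder */
 }

 static void *run_thread(void *data)
 {
	(void)data;

	while (!done) {
		extend();		/* "please extend my time slice" */
		while (__sync_lock_test_and_set(&lock, 1)) {
			unextend();	/* lost the race: drop the request */
			while (lock)
				cpu_relax();
			extend();	/* lock looks free, try again */
		}
		/* Very short critical section. */
		unsigned long seen = ++counter;
		__sync_lock_release(&lock);
		unextend();		/* may call sched_yield() if asked to */

		/* Wait for another task to get into the critical section. */
		while (counter == seen && !done)
			cpu_relax();
	}
	return NULL;
 }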

With the initialization of the extend_map disabled, so that the extend
code is not used, it ends with:

 Ran for 3908165 times
 Total wait time: 33.965654

I can give you stdev and all that too, but the above is pretty much the
same after several runs.

After enabling the extend code, it has:

 Ran for 4829340 times
 Total wait time: 32.635407

It was able to get into the critical section almost 1 million times more in
those 5 seconds! That's a 23% improvement!

The total wait time for getting into the critical section also dropped by
over a second (a 4% improvement).

I ran a traceeval tool on it (still work in progress, but I can post when
it's done), and with the following trace, and the writes to trace-marker
(tracefs_printf)

 trace-cmd record -e sched_switch ./extend-sched

It showed that without the extend, each task was preempted while holding
the lock around 200 times. With the extend, only one task was ever
preempted while holding the lock, and it only happened once!

Note, I tried replacing the user space spin lock with a futex, and it
dropped performance so much, both with and without the update, that the
benefit is in the noise.

Below is my patch (with debugging and on top of Thomas's PREEMPT_AUTO.patch):

Attached is the program I tested it with. It uses libtracefs to write to
the trace_marker file, and is built with:

  gcc -o extend-sched extend-sched.c `pkg-config --libs --cflags libtracefs` -lpthread

If you don't want to build it with libtracefs, you can just do:

 grep -v tracefs extend-sched.c > extend-sched-notracefs.c

And build that.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---

Comments

Steven Rostedt Oct. 25, 2023, 9:46 a.m. UTC | #1
On Wed, 25 Oct 2023 05:42:19 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> So, bit 1 is for user space to tell the kernel "please extend me", and bit
> two is for the kernel to tell user space "OK, I extended you, but call
> sched_yield() when done".

I'm waiting for someone to correct me and say it's actually "bit 0 and bit 1"

-- Steve
Peter Zijlstra Oct. 25, 2023, 10:29 a.m. UTC | #2
On Wed, Oct 25, 2023 at 05:42:19AM -0400, Steven Rostedt wrote:

> That is, there's this structure for every thread. It's assigned with:
> 
> 	fd = open("/sys/kernel/extend_sched", O_RDWR);
> 	extend_map = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> 
> I don't actually like this interface, as it wastes a full page for just two
> bits :-p  Perhaps it should be a new system call, where it just locks in
> existing memory from the user application? The requirement is that each
> thread needs its own bits to play with. It should not be shared with other
> threads. It could be, as it will not mess up the kernel, but will mess up
> the application.

What was wrong with using rseq?

> Anyway, to tell the kernel to "extend" the time slice if possible because
> it's in a critical section, we have:
> 
>  static void extend(void)
>  {
> 	if (!extend_map)
> 		return;
> 
> 	extend_map->flags = 1;
>  }
> 
> And to say that it's done:
> 
>  static void unextend(void)
>  {
> 	unsigned long prev;
> 
> 	if (!extend_map)
> 		return;
> 
> 	prev = xchg(&extend_map->flags, 0);
> 	if (prev & 2)
> 		sched_yield();
>  }
> 
> So, bit 1 is for user space to tell the kernel "please extend me", and bit
> two is for the kernel to tell user space "OK, I extended you, but call
> sched_yield() when done".

So what if it doesn't ? Can we kill it for not playing nice ?

[ aside from it being bit 0 and bit 1 as you yourself point out, it is
  also jarring you use a numeral for one and write out the other. ]

That said, I properly hate all these things, extending a slice doesn't
reliably work and we're always left with people demanding an ever longer
extension.

The *much* better heuristic is what the kernel uses, don't spin if the
lock holder isn't running.
Steven Rostedt Oct. 25, 2023, 12:54 p.m. UTC | #3
Peter!

[ After watching Thomas and Paul reply to each other, I figured this is the
  new LMKL greeting. ]


On Wed, 25 Oct 2023 12:29:52 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Oct 25, 2023 at 05:42:19AM -0400, Steven Rostedt wrote:
> 
> > That is, there's this structure for every thread. It's assigned with:
> > 
> > 	fd = open("/sys/kernel/extend_sched", O_RDWR);
> > 	extend_map = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> > 
> > I don't actually like this interface, as it wastes a full page for just two
> > bits :-p  Perhaps it should be a new system call, where it just locks in
> > existing memory from the user application? The requirement is that each
> > thread needs its own bits to play with. It should not be shared with other
> > threads. It could be, as it will not mess up the kernel, but will mess up
> > the application.  
> 
> What was wrong with using rseq?

I didn't want to overload that for something completely different. This is
not a "restartable sequence".

> 
> > Anyway, to tell the kernel to "extend" the time slice if possible because
> > it's in a critical section, we have:
> > 
> >  static void extend(void)
> >  {
> > 	if (!extend_map)
> > 		return;
> > 
> > 	extend_map->flags = 1;
> >  }
> > 
> > And to say that it's done:
> > 
> >  static void unextend(void)
> >  {
> > 	unsigned long prev;
> > 
> > 	if (!extend_map)
> > 		return;
> > 
> > 	prev = xchg(&extend_map->flags, 0);
> > 	if (prev & 2)
> > 		sched_yield();
> >  }
> > 
> > So, bit 1 is for user space to tell the kernel "please extend me", and bit
> > two is for the kernel to tell user space "OK, I extended you, but call
> > sched_yield() when done".  
> 
> So what if it doesn't ? Can we kill it for not playing nice ?

No, it's no different than a system call running for a long time. You could
set this bit and leave it there for as long as you want, and it should not
affect anything. If you look at what Thomas's PREEMPT_AUTO.patch does, it
sets NEED_RESCHED_LAZY at the tick. Without my patch, that does not cause a
schedule right away, but will schedule when going back into user space. My
patch will skip that schedule on the return to user space when only
NEED_RESCHED_LAZY is set and the task has asked for the extension.

With Thomas's patch, if a task is in the kernel for too long, on the next
tick (if I read his code correctly), if NEED_RESCHED_LAZY is still set,
it will then force the schedule. That is, you get two ticks instead of one.
I may have misread the code, but that's what it looks like it does in
update_deadline() in fair.c.

 https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches#n587

With my patch, the same thing happens but in user space. This does not give
any more power to any task. I don't expect nor want this to be a privileged
operation. It's no different than running a long system call. And EEVDF
should even keep it fair: if you use an extra tick, it will count against
your eligibility the next time around.

Note, NEED_RESCHED still schedules. If a RT or DL task were to wake up, it
will immediately preempt this task regardless of that bit being set.
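
[ For illustration only, a rough sketch of the return-to-user-space check
  described above. This is NOT the actual patch; the extend_flags_uaddr
  pointer and the EXTEND_* bit names are made up for the sketch. ]

 static bool exit_to_user_can_extend(struct task_struct *t)
 {
	unsigned long flags;

	/* A real reschedule (RT/DL wakeup, etc.) always wins. */
	if (test_tsk_thread_flag(t, TIF_NEED_RESCHED))
		return false;

	/* Only the lazy reschedule may be deferred. */
	if (!test_tsk_thread_flag(t, TIF_NEED_RESCHED_LAZY))
		return false;

	/* Read the mmap()ed per-thread flags word. */
	if (get_user(flags, t->extend_flags_uaddr))
		return false;

	/* Bit 0: user space asked "please extend me". */
	if (!(flags & EXTEND_REQUEST))
		return false;

	/* Bit 1: tell it "granted, but schedule when you are done". */
	put_user(flags | EXTEND_GRANTED, t->extend_flags_uaddr);
	return true;
 }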

> 
> [ aside from it being bit 0 and bit 1 as you yourself point out, it is
>   also jarring you use a numeral for one and write out the other. ]
> 
> That said, I properly hate all these things, extending a slice doesn't
> reliably work and we're always left with people demanding an ever longer
> extension.

We could possibly make it adjustable. I'm guessing that will happen anyway
with Thomas's patch. Anyway, my test shows that it makes a huge improvement
for user space implemented spin locks, which I modeled after how PostgreSQL
does its spin locks. That is, this is a real world use case. I plan to
implement this in PostgreSQL and see what improvements it makes in their
tests.

I also plan on testing VMs.

> 
> The *much* better heuristic is what the kernel uses, don't spin if the
> lock holder isn't running.

No it is not. That is a completely useless heuristic for this use case.
That's for waiters, and I would guess it would make no difference in my
test. The point of this patch is to keep the lock holder running, not the
waiter spinning. The reason for the improvement in my test is that the lock
was always held for a very short time, and when the time slice expired
while the task was holding the lock, it was able to get an extension,
release the lock, and then schedule.

Without my patch, you get several hundred of these:

    extend-sched-3773  [000]  9628.573272: print:                tracing_mark_write: Have lock!
    extend-sched-3773  [000]  9628.573278: sched_switch:         extend-sched:3773 [120] R ==> mysqld:1216 [120]
          mysqld-1216  [000]  9628.573286: sched_switch:         mysqld:1216 [120] S ==> extend-sched:3773 [120]
    extend-sched-3773  [000]  9628.573287: print:                tracing_mark_write: released lock!

[ Ironically, this example is preempted by mysqld ]

With my patch, there was only a single instance during the run.

When a lock holder schedules out, it greatly increases contention on that
lock. That's the entire reason Thomas implemented NEED_RESCHED_LAZY in the
first place. The aggressive preemption in PREEMPT_RT caused a lot more
contention on spin-locks-turned-mutexes. My patch does the exact same
thing for user space implemented spin locks, which also includes spin locks
in VM kernels. Adaptive spin locks (spin while the owner is running) helped
PREEMPT_RT for waiters, but that did nothing to help the lock holder being
preempted, which is why NEED_RESCHED_LAZY was still needed even when the
kernel already had adaptive spinners.

The reason I've been told over the last few decades of why people implement
100% user space spin locks is because the overhead of going into the kernel
is way too high. Sleeping is much worse (but that is where the adaptive
spinning comes in, which is a separate issue).

Allowing user space to say "hey, give me a few more microseconds and I'm
fine being preempted" is a very good heuristic. And a way for the kernel to
say, "hey, I gave it to you, you better go into the kernel when you can,
otherwise I'll preempt you no matter what!"

-- Steve
Peter Zijlstra Oct. 25, 2023, 1:55 p.m. UTC | #4
On Wed, Oct 25, 2023 at 08:54:34AM -0400, Steven Rostedt wrote:

> I didn't want to overload that for something completely different. This is
> not a "restartable sequence".

Your hack is arguably worse. At least rseq already exists and most
threads will already have it set up if you have a recent enough glibc.

> > So what if it doesn't ? Can we kill it for not playing nice ?
> 
> No, it's no different than a system call running for a long time. You could

Then why ask for it? What's the point. Also, did you define
sched_yield() semantics for OTHER to something useful? Because if you
didn't you just invoked UB :-) We could be setting your pets on fire.

> set this bit and leave it there for as long as you want, and it should not
> affect anything.

It would affect the worst case interference terms of the system at the
very least.

> If you look at what Thomas's PREEMPT_AUTO.patch

I know what it does, it also means your thing doesn't work the moment
you set things up to have the old full-preempt semantics back. It
doesn't work in the presence of RT/DL tasks, etc..

More importantly, it doesn't work for RT/DL tasks, so having the bit set
and not having OTHER policy is an error.

Do you want an interface that randomly doesn't work ?

> We could possibly make it adjustable. 

Tunables are not a good thing.

> The reason I've been told over the last few decades of why people implement
> 100% user space spin locks is because the overhead of going into the kernel
> is way too high.

Over the last few decades that has been a blatant falsehood. At some
point (right before the whole meltdown trainwreck) amluto had syscall
overhead down to less than 150 cycles.

Then of course meltdown happened and it all went to shit.

But even today (on good hardware or with mitigations=off):

gettid-1m:	179,650,423      cycles
xadd-1m:	 23,036,564      cycles

syscall is the cost of roughly 8 atomic ops. More expensive, sure. But
not insanely so. I've seen atomic ops go up to >1000 cycles if you
contend them hard enough.
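
[ For illustration, numbers of this sort can be gathered with a rough
  microbenchmark along these lines. This is not the actual test used for
  the numbers above; it assumes x86-64 and uses the TSC. ]

 #include <stdio.h>
 #include <unistd.h>
 #include <sys/syscall.h>
 #include <x86intrin.h>

 #define ITERS	1000000UL

 int main(void)
 {
	static unsigned long counter;
	unsigned long t0, t1;

	t0 = __rdtsc();
	for (unsigned long i = 0; i < ITERS; i++)
		syscall(SYS_gettid);			/* cheap syscall */
	t1 = __rdtsc();
	printf("gettid-1m: %lu cycles\n", t1 - t0);

	t0 = __rdtsc();
	for (unsigned long i = 0; i < ITERS; i++)
		__atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST); /* lock xadd */
	t1 = __rdtsc();
	printf("xadd-1m: %lu cycles\n", t1 - t0);

	return 0;
 }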
Steven Rostedt Oct. 25, 2023, 2:31 p.m. UTC | #5
On Wed, 25 Oct 2023 15:55:45 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Oct 25, 2023 at 08:54:34AM -0400, Steven Rostedt wrote:
> 
> > I didn't want to overload that for something completely different. This is
> > not a "restartable sequence".  
> 
> Your hack is arguably worse. At least rseq already exists and most
> threads will already have it set up if you have a recent enough glibc.

I don't expect that file to be the final solution. I can look at the rseq
code, but I really hate to overload that. I'm thinking perhaps another
system call, or what the hell, add another ioctl like feature to prctl()!
Actually, prctl() may be the proper place for this.

> 
> > > So what if it doesn't ? Can we kill it for not playing nice ?  
> > 
> > No, it's no different than a system call running for a long time. You could  
> 
> Then why ask for it? What's the point. Also, did you define
> sched_yield() semantics for OTHER to something useful? Because if you
> didn't you just invoked UB :-) We could be setting your pets on fire.

Actually, it works with *any* system call. Not just sched_yield(). I just
used that as it was the best one to annotate "the kernel asked me to
schedule, I'm going to schedule". If you noticed, I did not modify
sched_yield() in the patch. The NEED_RESCHED_LAZY is still set, and without
the extend bit set, on return back to user space it will schedule.

> 
> > set this bit and leave it there for as long as you want, and it should not
> > affect anything.  
> 
> It would affect the worst case interference terms of the system at the
> very least.

If you are worried about that, it can easily be configurable to be turned
off. Seriously, I highly doubt that this would be even measurable as
interference. I could be wrong, I haven't tested that. It's something we
can look at, but until it's considered a problem it should not be a show
blocker.

> 
> > If you look at what Thomas's PREEMPT_AUTO.patch  
> 
> I know what it does, it also means your thing doesn't work the moment
> you set things up to have the old full-preempt semantics back. It
> doesn't work in the presence of RT/DL tasks, etc..

Note, I am looking at ways to make this work with full preempt semantics.
This is still a POC, there's a lot of room for improvements here. From my
understanding, the potential of Thomas's patch is to get rid of the
build time configurable semantics of NONE, VOLUNTARY and PREEMPT (only
PREEMPT_RT will be different).

> 
> More importantly, it doesn't work for RT/DL tasks, so having the bit set
> and not having OTHER policy is an error.

It would basically be a nop.

> 
> Do you want an interface that randomly doesn't work ?

An RT task doesn't get preempted by ticks, so how would it randomly not
work? We could allow RR tasks to get a bit more time if it has this bit set
too. Or maybe allow DL to get a little more if there's not another DL task
needing to run.

But for now, this is only for SCHED_OTHER, as this is not usually a problem
for RT/DL tasks. The extend bit is only a hint for the kernel; there's no
guarantee that it will be available, or that the kernel will honor it. But
because there's a lot of code out there that implements user space spin
locks, this could be a huge win for them when implemented, without changing
much.

Remember, RT and DL are about deterministic behavior, SCHED_OTHER is about
performance. This is a performance patch, not a deterministic one.

> 
> > We could possibly make it adjustable.   
> 
> Tunables are not a good thing.
> 
> > The reason I've been told over the last few decades of why people implement
> > 100% user space spin locks is because the overhead of going into the kernel
> > is way too high.  
> 
> Over the last few decades that has been a blatant falsehood. At some
> point (right before the whole meltdown trainwreck) amluto had syscall
> overhead down to less than 150 cycles.

Well, as far as I know, the testing that PostgreSQL has done has never seen
that.

> 
> Then of course meltdown happened and it all went to shit.

True dat.

> 
> But even today (on good hardware or with mitigations=off):
> 
> gettid-1m:	179,650,423      cycles
> xadd-1m:	 23,036,564      cycles
> 
> syscall is the cost of roughly 8 atomic ops. More expensive, sure. But
> not insanely so. I've seen atomic ops go up to >1000 cycles if you
> contend them hard enough.
> 

This has been your argument for over a decade, and the real world has seen
it differently. Performance matters significantly for user applications,
and if system calls didn't have performance issues, I'm sure the
performance-centric applications would have used them.

This is because these critical sections take much less than 8 atomic ops.
And when you are executing these critical sections millions of times a
second, that adds up quickly.

-- Steve
Mathieu Desnoyers Oct. 25, 2023, 2:53 p.m. UTC | #6
On 2023-10-25 10:31, Steven Rostedt wrote:
> On Wed, 25 Oct 2023 15:55:45 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Wed, Oct 25, 2023 at 08:54:34AM -0400, Steven Rostedt wrote:
>>
>>> I didn't want to overload that for something completely different. This is
>>> not a "restartable sequence".
>>
>> Your hack is arguably worse. At least rseq already exists and most
>> threads will already have it set up if you have a recent enough glibc.
> 
> I don't expect that file to be the final solution. I can look at the rseq
> code, but I really hate to overload that. I'm thinking perhaps another
> system call, or what the hell, add another ioctl like feature to prctl()!
> Actually, prctl() may be the proper place for this.
> 

I don't have an informed opinion on whether the proposed heuristic is a 
good idea or not, but it should definitely be implemented as an 
extension to rseq as suggested by Peter. I've even made the whole rseq 
ABI extensible to accommodate those additional use-cases.

In the initial rounds of rseq implementation, I even called rseq "kTLS" 
because I expected it to be extended and eventually become an ABI that 
contains various per-thread fields which are shared between kernel and 
userspace.

So don't let the specific naming of the rseq system call stop you from 
extending it for other purposes when per-thread shared memory between 
kernel and userspace is needed. Setting up various per-thread areas like 
this on thread creation is not free: it requires additional system calls 
on thread creation. It really makes no sense to have more than one.

Thanks,

Mathieu
Steven Rostedt Oct. 25, 2023, 3:07 p.m. UTC | #7
On Wed, 25 Oct 2023 10:53:38 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> I don't have an informed opinion on whether the proposed heuristic is a 
> good idea or not, but it should definitely be implemented as an 

If you want to get an informed opinion, you can start here:  ;-)

  Thomas's first reply that had me think about this solution:

    https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/

  My reply that started it:

    https://lore.kernel.org/lkml/20231024103426.4074d319@gandalf.local.home/

> extension to rseq as suggested by Peter. I've even made the whole rseq 
> ABI extensible to accommodate those additional use-cases.
> 
> In the initial rounds of rseq implementation, I even called rseq "kTLS" 
> because I expected it to be extended and eventually become an ABI that 
> contains various per-thread fields which are shared between kernel and 
> userspace.
> 
> So don't let the specific naming of the rseq system call stop you from 
> extending it for other purposes when per-thread shared memory between 
> kernel and userspace is needed. Setting up various per-thread areas like 
> this on thread creation is not free: it requires additional system calls 
> on thread creation. It really makes no sense to have more than one.

Thanks for the feedback Mathieu. This may indeed be the interface I am
looking for.

-- Steve
Steven Rostedt Oct. 25, 2023, 3:12 p.m. UTC | #8
On Wed, 25 Oct 2023 05:42:19 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I ran a traceeval tool on it (still work in progress, but I can post when
> it's done), and with the following trace, and the writes to trace-marker
> (tracefs_printf)
> 
>  trace-cmd record -e sched_switch ./extend-sched
> 
> It showed that without the extend, each task was preempted while holding
> the lock around 200 times. With the extend, only one task was ever
> preempted while holding the lock, and it only happened once!

And looking at the trace, it was because it was preempted by an RT task.

    extend-sched-3349  [007]  3309.320747: print:                tracing_mark_write: Have lock!
    extend-sched-3349  [007]  3309.320751: sched_switch:         extend-sched:3349 [120] R ==> migration/7:68 [0]
     migration/7-68    [007]  3309.320754: sched_switch:         migration/7:68 [0] S ==> extend-sched:3349 [120]
    extend-sched-3349  [007]  3309.320756: print:                tracing_mark_write: released lock!

So we know that part of the code works. ;-)

-- Steve
Rasmus Villemoes Oct. 25, 2023, 3:34 p.m. UTC | #9
On 25/10/2023 11.42, Steven Rostedt wrote:

> So, bit 1 is for user space to tell the kernel "please extend me", and bit
> two is for the kernel to tell user space "OK, I extended you, but call
> sched_yield() when done".

I'm not qualified to have an opinion on this. But this sounds quite
similar to
https://lore.kernel.org/lkml/1395767870-28053-1-git-send-email-khalid.aziz@oracle.com/

  "A thread sends the scheduler this request by setting a flag in
a memory location it has shared with the kernel.  Kernel uses bytes in
the same memory location to let the thread know when its request for
amnesty from preemption has been granted. Thread should yield the
processor at the end of its critical section if it was granted amnesty
to play nice with other threads."

Rasmus
Mathieu Desnoyers Oct. 25, 2023, 3:42 p.m. UTC | #10
On 2023-10-25 10:31, Steven Rostedt wrote:
> On Wed, 25 Oct 2023 15:55:45 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:

[...]

After digging lore for context, here are some thoughts about the actual
proposal: AFAIU the intent here is to boost the scheduling slice for a
userspace thread running with a mutex held so it can complete faster,
and therefore reduce contention.

I suspect this is not completely unrelated to priority inheritance
futexes, except that one goal stated by Steven is to increase the
owner slice without requiring to call a system call on the fast-path.

Compared to PI futexes, I think Steven's proposal misses the part
where a thread waiting on a futex boosts the lock owner's priority
so it can complete faster. By making the lock owner selfishly claim
that it needs a larger scheduling slice, it opens the door to
scheduler disruption, and it's hard to come up with upper-bounds
that work for all cases.

Hopefully I'm not oversimplifying if I state that we have mainly two
actors to consider:

[A] the lock owner thread

[B] threads that block trying to acquire the lock

The fast-path here is [A]. [B] can go through a system call, I don't
think it matters at all.

So perhaps we can extend the rseq per-thread area with a field that
implements a "held locks" list that allows [A] to let the kernel know
that it is currently holding a set of locks (those can be chained when
locks are nested). It would be updated on lock/unlock with just a few
stores in userspace.

Those lock addresses could then be used as keys for private locks,
or transformed into inode/offset keys for shared-memory locks. Threads
[B] blocking trying to acquire the lock can call a system call which
would boost the lock owner's slice and/or priority for a given lock key.

When the scheduler preempts [A], it would check whether the rseq
per-thread area has a "held locks" field set and use this information
to find the slice/priority boost which are currently active for each
lock, and use this information to boost the task slice/priority
accordingly.

A scheme like this should allow lock priority inheritance without
requiring system calls on the userspace lock/unlock fast path.
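
[ For illustration, the userspace side of such a "held locks" rseq
  extension could look roughly like the sketch below. None of these
  struct or field names exist today; they are made up to show the idea
  of "a few stores on lock/unlock, no system call". ]

 #include <stdint.h>

 #define RSEQ_MAX_HELD_LOCKS	8	/* arbitrary nesting limit */

 struct rseq_held_locks {
	uint32_t nr_held;			/* number of valid entries */
	uint64_t lock_addr[RSEQ_MAX_HELD_LOCKS]; /* userspace lock addresses */
 };

 /* Fast path on lock acquisition: a couple of plain stores. */
 static inline void held_locks_push(struct rseq_held_locks *hl, void *lock)
 {
	hl->lock_addr[hl->nr_held] = (uint64_t)(uintptr_t)lock;
	__atomic_store_n(&hl->nr_held, hl->nr_held + 1, __ATOMIC_RELEASE);
 }

 /* Fast path on unlock. */
 static inline void held_locks_pop(struct rseq_held_locks *hl)
 {
	__atomic_store_n(&hl->nr_held, hl->nr_held - 1, __ATOMIC_RELEASE);
 }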

Thoughts ?

Thanks,

Mathieu
Mateusz Guzik Oct. 25, 2023, 4:24 p.m. UTC | #11
On Wed, Oct 25, 2023 at 11:42:34AM -0400, Mathieu Desnoyers wrote:
> On 2023-10-25 10:31, Steven Rostedt wrote:
> > On Wed, 25 Oct 2023 15:55:45 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> 
> [...]
> 
> After digging lore for context, here are some thoughts about the actual
> proposal: AFAIU the intent here is to boost the scheduling slice for a
> userspace thread running with a mutex held so it can complete faster,
> and therefore reduce contention.
> 
> I suspect this is not completely unrelated to priority inheritance
> futexes, except that one goal stated by Steven is to increase the
> owner slice without requiring to call a system call on the fast-path.
> 
> Compared to PI futexes, I think Steven's proposal misses the part
> where a thread waiting on a futex boosts the lock owner's priority
> so it can complete faster. By making the lock owner selfishly claim
> that it needs a larger scheduling slice, it opens the door to
> scheduler disruption, and it's hard to come up with upper-bounds
> that work for all cases.
> 
> Hopefully I'm not oversimplifying if I state that we have mainly two
> actors to consider:
> 
> [A] the lock owner thread
> 
> [B] threads that block trying to acquire the lock
> 
> The fast-path here is [A]. [B] can go through a system call, I don't
> think it matters at all.
> 
> So perhaps we can extend the rseq per-thread area with a field that
> implements a "held locks" list that allows [A] to let the kernel know
> that it is currently holding a set of locks (those can be chained when
> locks are nested). It would be updated on lock/unlock with just a few
> stores in userspace.
> 
> Those lock addresses could then be used as keys for private locks,
> or transformed into inode/offset keys for shared-memory locks. Threads
> [B] blocking trying to acquire the lock can call a system call which
> would boost the lock owner's slice and/or priority for a given lock key.
> 
> When the scheduler preempts [A], it would check whether the rseq
> per-thread area has a "held locks" field set and use this information
> to find the slice/priority boost which are currently active for each
> lock, and use this information to boost the task slice/priority
> accordingly.
> 
> A scheme like this should allow lock priority inheritance without
> requiring system calls on the userspace lock/unlock fast path.
> 

I think both this proposal and the original in this thread are opening a
can of worms and I don't think going down that road was properly
justified. A proper justification would demonstrate a big enough(tm)
improvement over a locking primitive with adaptive spinning.

It is well known that what mostly shafts performance of regular
userspace locking is all the nasty going off cpu to wait.

The original benchmark with slice extension disabled keeps using CPUs,
virtually guaranteeing these threads will keep getting preempted, some
of the time while holding the lock. Should that happen all other threads
which happened to get preempted actively waste time.

Adaptive spinning was already mentioned elsewhere in the thread and the
idea itself is at least 2 decades old. If anything I find it strange it
did not land years ago.

I find there is a preliminary patch by you which exports the state so
one can nicely spin without even going to the kernel:
https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/

To be clear, I think a locking primitive which can do adaptive spinning
*and* futexes *and* not get preempted while holding locks is the fastest
option. What is not clear to me if it is sufficiently faster than
adaptive spinning and futexes.

tl;dr perhaps someone(tm) could carry the above to a state where it can
be benchmarked vs the original patch
Steven Rostedt Oct. 25, 2023, 5:17 p.m. UTC | #12
On Wed, 25 Oct 2023 18:24:35 +0200
Mateusz Guzik <mjguzik@gmail.com> wrote:

> On Wed, Oct 25, 2023 at 11:42:34AM -0400, Mathieu Desnoyers wrote:
> > On 2023-10-25 10:31, Steven Rostedt wrote:  
> > > On Wed, 25 Oct 2023 15:55:45 +0200
> > > Peter Zijlstra <peterz@infradead.org> wrote:  
> > 
> > [...]
> > 
> > After digging lore for context, here are some thoughts about the actual
> > proposal: AFAIU the intent here is to boost the scheduling slice for a
> > userspace thread running with a mutex held so it can complete faster,
> > and therefore reduce contention.
> > 
> > I suspect this is not completely unrelated to priority inheritance
> > futexes, except that one goal stated by Steven is to increase the
> > owner slice without requiring to call a system call on the fast-path.

No, I wouldn't say it's the same as priority inheritance, which is to help
with determinism and not performance. PI adds overhead but removes
unbounded latency. On average, a non PI mutex is faster than PI mutex, but
can suffer from unbounded priority inversion.

For this code, I took off my RT hat, and put on my performance hat.

> > 
> > Compared to PI futexes, I think Steven's proposal misses the part
> > where a thread waiting on a futex boosts the lock owner's priority
> > so it can complete faster. By making the lock owner selfishly claim
> > that it needs a larger scheduling slice, it opens the door to
> > scheduler disruption, and it's hard to come up with upper-bounds
> > that work for all cases.

Currently, the extended time is only 1ms for a 1000 HZ kernel. It's no
different than holding a kernel spin lock for that much time. It doesn't
happen often, but I have in the past measured spin locks being held for
that long in a non PREEMPT_RT kernel.

And with the new EEVDF, extended time slices will be handled by the
eligibility algorithm, by keeping the task that extended itself from being
eligible for a longer period of time. This keeps things "fair".

> > 
> > Hopefully I'm not oversimplifying if I state that we have mainly two
> > actors to consider:
> > 
> > [A] the lock owner thread
> > 
> > [B] threads that block trying to acquire the lock
> > 
> > The fast-path here is [A]. [B] can go through a system call, I don't
> > think it matters at all.

No, B going into a system call can be just as devastating. Adaptive
spinning helps with that. The thing here is that if A gets preempted, there
will be a lot more B's getting stuck.

I implemented the test with futexes (where you go to sleep on contention)
and the performance dropped considerably, where the benefits of not having
A get preempted made no difference at all. Sure, adaptive spinning helps in
that case, but adaptive spinning would only make it as good as my spinning
in user space logic is without any changes.

> > 
> > So perhaps we can extend the rseq per-thread area with a field that
> > implements a "held locks" list that allows [A] to let the kernel know
> > that it is currently holding a set of locks (those can be chained when
> > locks are nested). It would be updated on lock/unlock with just a few
> > stores in userspace.

And I can see that being a total nightmare to maintain and keep from races
and trusting user space.

> > 
> > Those lock addresses could then be used as keys for private locks,
> > or transformed into inode/offset keys for shared-memory locks. Threads
> > [B] blocking trying to acquire the lock can call a system call which
> > would boost the lock owner's slice and/or priority for a given lock key.

Do you mean that this would be done in user space? Going into the kernel to
do any of this would make it already lost.

> > 
> > When the scheduler preempts [A], it would check whether the rseq
> > per-thread area has a "held locks" field set and use this information
> > to find the slice/priority boost which are currently active for each
> > lock, and use this information to boost the task slice/priority
> > accordingly.

Why do we care about locks here? Note, I'm looking at using this same
feature for VMs on interrupt handlers. The only thing user space needs to
tell the kernel is "It's not a good time to preempt me, but it will be
shortly".

> > 
> > A scheme like this should allow lock priority inheritance without
> > requiring system calls on the userspace lock/unlock fast path.

Priority inheritance doesn't make sense when everything is running.

> >   
> 
> I think both this proposal and the original in this thread are opening a
> can of worms and I don't think going down that road was properly
> justified. A proper justification would demonstrate a big enough(tm)
> improvement over a locking primitive with adaptive spinning.
> 
> It is well known that what mostly shafts performance of regular
> userspace locking is all the nasty going off cpu to wait.
> 
> The original benchmark with slice extension disabled keeps using CPUs,
> virtually guaranteeing these threads will keep getting preempted, some
> of the time while holding the lock. Should that happen all other threads
> which happened to get preempted actively waste time.
> 
> Adaptive spinning was already mentioned elsewhere in the thread and the
> idea itself is at least 2 decades old. If anything I find it strange it
> did not land years ago.

I tried pushing it a long time ago but I believe Peter didn't like the
logic touching the scheduler. Which it had to do, so I dropped it.

Anyway, as I stated previously, my experience here is based on the work I
had done with PREEMPT_RT. Let me give you a little history:

A long time ago, when an 8 CPU machine was considered "huge!", we got
priority inheritance spin-locks-turned-mutexes working nicely. But because
rw locks were exponentially complex to add PI to (believe me, I tried!), we
gave up and just turned them into a single mutex. This caused huge
contention on the mmap_sem (which was now a mutex and not a rwsem),
especially when running java (which for some unknown reason creates
hundreds of threads for "Hello world!").

When we booted a machine with 16 or more CPUs, it took forever to boot up
on PREEMPT_RT. The lock contention between all these spin-locks-turned-mutexes
and rwsems-turned-mutexes was crazy. PREEMPT_RT took exponentially longer
to boot than the vanilla kernel as the number of CPUs went up.

SUSE proposed a new feature called "adaptive spinning", where on contention
to one of these spin-locks-turned-mutexes, it would spin if the owner of
the lock was running, and otherwise go to sleep. This was a huge win, as we
found that the contention on these locks dropped significantly. So much so,
that the difference between PREEMPT_RT and vanilla only linearly degraded
as the number of CPUs increased.

The PREEMPT_RT folks continued happily along. But the performance of
PREEMPT_RT was still significantly behind that of the vanilla kernel.
Thomas realized that a large part of this performance gap was due to the
overly aggressive preemption that PREEMPT_RT would cause. What PREEMPT_RT
does is simply allow more places in the kernel to be preempted than
CONFIG_PREEMPT does. Specifically, while holding a spin_lock. That means
that when you preempt a task while it holds a spin_lock, that spin_lock is
much more likely to be contended, for the simple fact that it is now held
for a much longer time. Thomas realized that by introducing
NEED_RESCHED_LAZY, and allowing SCHED_OTHER scheduling to be delayed while
these spin locks are held, he could decrease the number of times preemption
happened while these locks are taken. That decreases the time the locks are
held, which in turn decreases the number of times they are contended.

IMPORTANT NOTE: The above was noticed *with* adaptive spin locks enabled!

That is, adaptive spinning did not solve the issue with tasks holding locks
being preempted.

This is why I'm saying that user space adaptive spin locks solve a
*different* problem. This patch solves the preemption-of-lock-holders
problem, which adaptive spinning DOES NOT ADDRESS.

> 
> I find there is a preliminary patch by you which exports the state so
> one can nicely spin without even going to the kernel:
> https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
> 
> To be clear, I think a locking primitive which can do adaptive spinning
> *and* futexes *and* not get preempted while holding locks is the fastest
> option. What is not clear to me if it is sufficiently faster than
> adaptive spinning and futexes.

And I'm stating it is, simply because it worked so well with PREEMPT_RT.

> 
> tl;dr perhaps someone(tm) could carry the above to a state where it can
> be benchmarked vs the original patch

And don't get me wrong. I *want* adaptive spinning in user space. I'm just
saying that this is solving a different issue.

-- Steve
Mathieu Desnoyers Oct. 25, 2023, 6:49 p.m. UTC | #13
On 2023-10-25 13:17, Steven Rostedt wrote:
> On Wed, 25 Oct 2023 18:24:35 +0200
> Mateusz Guzik <mjguzik@gmail.com> wrote:
> 
>> On Wed, Oct 25, 2023 at 11:42:34AM -0400, Mathieu Desnoyers wrote:
>>> On 2023-10-25 10:31, Steven Rostedt wrote:
>>>> On Wed, 25 Oct 2023 15:55:45 +0200
>>>> Peter Zijlstra <peterz@infradead.org> wrote:
>>>
[...]
> 
> No, I wouldn't say it's the same as priority inheritance, which is to help
> with determinism and not performance. PI adds overhead but removes
> unbounded latency. On average, a non PI mutex is faster than PI mutex, but
> can suffer from unbounded priority inversion.

AFAIU, this is because PI mutex as implemented by sys futex only boosts the
priority of the lock owner. In my proposal here the owner would be able to
borrow scheduler slices from the waiters as well.

[...]

>>>
>>> Hopefully I'm not oversimplifying if I state that we have mainly two
>>> actors to consider:
>>>
>>> [A] the lock owner thread
>>>
>>> [B] threads that block trying to acquire the lock
>>>
>>> The fast-path here is [A]. [B] can go through a system call, I don't
>>> think it matters at all.
> 
> No, B going into a system call can be just as devastating. Adaptive
> spinning helps with that. The thing here is that if A gets preempted, there
> will be a lot more B's getting stuck.

I would indeed combine this with an adaptive spinning scheme to allow waiters to
stay in userspace if contention is short. As you know, rseq can also help there:

https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/

Therefore, it's only the blocking case that would call into the kernel, which
should not be so devastating.

> 
> I implemented the test with futexes (where you go to sleep on contention)
> and the performance dropped considerably, where the benefits of not having
> A get preempted made no difference at all. Sure, adaptive spinning helps in
> that case, but adaptive spinning would only make it as good as my spinning
> in user space logic is without any changes.

I'm not sure what you are arguing here.

My overall idea would be to combine:

1) Adaptive spinning in userspace,
2) Priority inheritance,
3) Scheduler slices inheritance.

> 
>>>
>>> Those lock addresses could then be used as keys for private locks,
>>> or transformed into inode/offset keys for shared-memory locks. Threads
>>> [B] blocking trying to acquire the lock can call a system call which
>>> would boost the lock owner's slice and/or priority for a given lock key.
> 
> Do you mean that this would be done in user space? Going into the kernel to
> do any of this would make it already lost.

Going to the kernel only happens when threads need to block, which means
that the typical contended half-happy path should be busy-spinning in userspace
(adaptive spinning).

I see why blocking in a scenario where busy-spinning would be better is
inefficient, but I don't see how going to the kernel for the _blocking_
case is bad.

> 
>>>
>>> When the scheduler preempts [A], it would check whether the rseq
>>> per-thread area has a "held locks" field set and use this information
>>> to find the slice/priority boost which are currently active for each
>>> lock, and use this information to boost the task slice/priority
>>> accordingly.
> 
> Why do we care about locks here? Note, I'm looking at using this same
> feature for VMs on interrupt handlers. The only thing user space needs to
> tell the kernel is "It's not a good time to preempt me, but it will be
> shortly".
> 

Quoting https://lore.kernel.org/lkml/20231024103426.4074d319@gandalf.local.home/

"The goal is to prevent a thread / vCPU being preempted while holding a lock
or resource that other threads / vCPUs will want. That is, prevent
contention, as that's usually the biggest issue with performance in user
space and VMs."

We care about locks here because this is in fact your own problem statement.
If you want to consider the different problem of making VM interrupt handlers
go fast, then you should state it up front. Those two distinct problems may
or may not require entirely different solutions.

>>>
>>> A scheme like this should allow lock priority inheritance without
>>> requiring system calls on the userspace lock/unlock fast path.
> 
> Priority inheritance doesn't make sense when everything is running.

I should have also said that this scheme should allow the lock owner to
borrow scheduler time slices from waiters (in addition to PI).

[...]

Thanks,

Mathieu
Steven Rostedt Oct. 25, 2023, 7:19 p.m. UTC | #14
On Wed, 25 Oct 2023 14:49:44 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> > 
> > No, I wouldn't say it's the same as priority inheritance, which is to help
> > with determinism and not performance. PI adds overhead but removes
> > unbounded latency. On average, a non PI mutex is faster than PI mutex, but
> > can suffer from unbounded priority inversion.  
> 
> AFAIU, this is because PI mutex as implemented by sys futex only boosts the
> priority of the lock owner. In my proposal here the owner would be able to
> borrow scheduler slices from the waiters as well.

I would be worried that that could cause even more scheduling disruption.
Now we are taking time slices from other CPUs to run the current one?

> 
> [...]
> 
> >>>
> >>> Hopefully I'm not oversimplifying if I state that we have mainly two
> >>> actors to consider:
> >>>
> >>> [A] the lock owner thread
> >>>
> >>> [B] threads that block trying to acquire the lock
> >>>
> >>> The fast-path here is [A]. [B] can go through a system call, I don't
> >>> think it matters at all.  
> > 
> > No, B going into a system call can be just as devastating. Adaptive
> > spinning helps with that. The thing here is that if A gets preempted, there
> > will be a lot more B's getting stuck.  
> 
> I would indeed combine this with an adaptive spinning scheme to allow waiters to
> stay in userspace if contention is short. As you know, rseq can also help there:
> 
> https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
> 
> Therefore, it's only the blocking case that would call into the kernel, which
> should not be so devastating.

I think we are in agreement about the blocking case, which I believe you
mean "go to sleep and wait to be woken up when the resource is available".

> 
> > 
> > I implemented the test with futexes (where you go to sleep on contention)
> > and the performance dropped considerably, where the benefits of not having
> > A get preempted made no difference at all. Sure, adaptive spinning helps in
> > that case, but adaptive spinning would only make it as good as my spinning
> > in user space logic is without any changes.  
> 
> I'm not sure what you are arguing here.
> 
> My overall idea would be to combine:
> 
> 1) Adaptive spinning in userspace,
> 2) Priority inheritance,
> 3) Scheduler slices inheritance.

The PI and slice inheritance can become very complex. It sounds more like
proxy execution, which in theory sounds great but is very difficult to
implement in practice.

> 
> >   
> >>>
> >>> Those lock addresses could then be used as keys for private locks,
> >>> or transformed into inode/offset keys for shared-memory locks. Threads
> >>> [B] blocking trying to acquire the lock can call a system call which
> >>> would boost the lock owner's slice and/or priority for a given lock key.  
> > 
> > Do you mean that this would be done in user space? Going into the kernel to
> > do any of this would make it already lost.  
> 
> Going to the kernel only happens when threads need to block, which means
> that the typical contended half-happy path should be busy-spinning in userspace
> (adaptive spinning).
> 
> I see why blocking in a scenario where busy-spinning would be better is
> inefficient, but I don't see how going to the kernel for the _blocking_
> case is bad.

My focus in this patch is on very short duration critical sections. This
patch is not for the blocking use case at all. For that, I'm 100% on board
with adaptive spinning.

IOW, "blocking" is out of scope for this patch.

> 
> >   
> >>>
> >>> When the scheduler preempts [A], it would check whether the rseq
> >>> per-thread area has a "held locks" field set and use this information
> >>> to find the slice/priority boost which are currently active for each
> >>> lock, and use this information to boost the task slice/priority
> >>> accordingly.  
> > 
> > Why do we care about locks here? Note, I'm looking at using this same
> > feature for VMs on interrupt handlers. The only thing user space needs to
> > tell the kernel is "It's not a good time to preempt me, but it will be
> > shortly".
> >   
> 
> Quoting https://lore.kernel.org/lkml/20231024103426.4074d319@gandalf.local.home/
> 
> "The goal is to prevent a thread / vCPU being preempted while holding a lock
> or resource that other threads / vCPUs will want. That is, prevent
> contention, as that's usually the biggest issue with performance in user
> space and VMs."

I should have been more specific. Yes, locking is a major use case here,
and the one I brought up as it's the easiest to understand, but this can be
for any use case where there's a short critical section in which preemption
could be bad. I don't want to implement a policy that this is only for
locking.

> 
> We care about locks here because this is in fact your own problem statement.
> If you want to consider the different problem of making VM interrupt handlers
> go fast, then you should state it up front. Those two distinct problems may
> or may not require entirely different solutions.

We'll soon know as we'll be testing that too. Anyway, this patch is for the
"short critical section". It's not good for all locking. It's only good for
very short held locks.

Going back to PREEMPT_RT, I did tests where I implemented NEED_RESCHED_LAZY
on all kernel mutexes (it's currently only used for PREEMPT_RT spin locks
that turn into mutexes), and found that it made no difference. Maybe a
little bit, but not enough to push it. The reason is that long-held locks
are either not contended heavily, or, if they are, the fact that they are
held for a long time means keeping the holder running doesn't make as big a
difference as when it stops to let something else run. Of course, RT has
priority inheritance, so the PI isn't as much of an overhead compared to
the time the locks were held.

> 
> >>>
> >>> A scheme like this should allow lock priority inheritance without
> >>> requiring system calls on the userspace lock/unlock fast path.  
> > 
> > Priority inheritance doesn't make sense when everything is running.  
> 
> I should have also said that this scheme should allow the lock owner to
> borrow scheduler time slices from waiters (in addition to PI).

As I mentioned, that sounds like proxy execution, which is something that's
been a work in progress for some time now.

Anyway, this patch is a very simple solution that can help a large number
of use cases. Even if those use cases are very specific (highly contended
short duration critical sections).

-- Steve
Mathieu Desnoyers Oct. 25, 2023, 9:56 p.m. UTC | #15
On 2023-10-25 15:19, Steven Rostedt wrote:
> On Wed, 25 Oct 2023 14:49:44 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>>>
>>> No, I wouldn't say it's the same as priority inheritance, which is to help
>>> with determinism and not performance. PI adds overhead but removes
>>> unbounded latency. On average, a non PI mutex is faster than PI mutex, but
>>> can suffer from unbounded priority inversion.
>>
>> AFAIU, this is because PI mutex as implemented by sys futex only boosts the
>> priority of the lock owner. In my proposal here the owner would be able to
>> borrow scheduler slices from the waiters as well.
> 
> I would be worried that that could cause even more scheduling disruption.
> Now we are taking time slices from other CPUs to run the current one?
> 

AFAIU, as we look at the time slices of the waiting tasks, those have meaning
in the context of the runqueue they are currently attached to. We first consume
time slices from tasks sharing the same runqueue as the owner task. Else, I
guess we'd need to migrate the owner task to the runqueue where the time slice
is available so we can use it (but only if migration of the owner task is
allowed to that target cpu).

Thanks,

Mathieu
Ankur Arora Oct. 26, 2023, 5:03 a.m. UTC | #16
Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Oct 25, 2023 at 05:42:19AM -0400, Steven Rostedt wrote:
>
>> So, bit 1 is for user space to tell the kernel "please extend me", and bit
>> two is for the kernel to tell user space "OK, I extended you, but call
>> sched_yield() when done".
>
> So what if it doesn't ? Can we kill it for not playing nice ?
>
> [ aside from it being bit 0 and bit 1 as you yourself point out, it is
>   also jarring you use a numeral for one and write out the other. ]
>
> That said, I properly hate all these things, extending a slice doesn't
> reliably work and we're always left with people demanding an ever longer
> extension.
>
> The *much* better heuristic is what the kernel uses, don't spin if the
> lock holder isn't running.

Something like vcpu_is_preempted()? That helps deal with preemption
once it has happened, but an ahead of time signal would be much more
efficient.

And, guest lock holder preemption was a big enough issue that we drop
down to test-and-set spinlocks in some cases.


Thanks

--
ankur
Peter Zijlstra Oct. 26, 2023, 8:44 a.m. UTC | #17
On Wed, Oct 25, 2023 at 10:31:05AM -0400, Steven Rostedt wrote:

> > > > So what if it doesn't ? Can we kill it for not playing nice ?  
> > > 
> > > No, it's no different than a system call running for a long time. You could  
> > 
> > Then why ask for it? What's the point. Also, did you define
> > sched_yield() semantics for OTHER to something useful? Because if you
> > didn't you just invoked UB :-) We could be setting your pets on fire.
> 
> Actually, it works with *any* system call. Not just sched_yield(). I just
> used that as it was the best one to annotate "the kernel asked me to
> schedule, I'm going to schedule". If you noticed, I did not modify
> sched_yield() in the patch. The NEED_RESCHED_LAZY is still set, and without
> the extend bit set, on return back to user space it will schedule.

So I fundamentally *HATE* that you tie this whole thing to the
NEED_RESCHED_LAZY thing; that's 100% the wrong layer to be doing this
at.

It very much means you're creating an interface that won't work for a
significant number of setups -- those that use the FULL preempt setting.

> > > set this bit and leave it there for as long as you want, and it should not
> > > affect anything.  
> > 
> > It would affect the worst case interference terms of the system at the
> > very least.
> 
> If you are worried about that, it can easily be configurable to be turned
> off. Seriously, I highly doubt that this would be even measurable as
> interference. I could be wrong, I haven't tested that. It's something we
> can look at, but until it's considered a problem it should not be a show
> blocker.

If everybody sets the thing and leaves it on, you basically double the
worst case latency, no? And weren't you involved in a thread only last
week where the complaint was that Chrome was a pig^W^W^W latency was too
high?

> > > If you look at what Thomas's PREEMPT_AUTO.patch  
> > 
> > I know what it does, it also means your thing doesn't work the moment
> > you set things up to have the old full-preempt semantics back. It
> > doesn't work in the presence of RT/DL tasks, etc..
> 
> Note, I am looking at ways to make this work with full preempt semantics.

By not relying on the PREEMPT_AUTO stuff. If you noodle with the code
that actually sets preempt it should also work with preempt, but you're
working at the wrong layer.

Also see that old Oracle thread that got dug up.

> > More importantly, it doesn't work for RT/DL tasks, so having the bit set
> > and not having OTHER policy is an error.
> 
> It would basically be a nop.

Well yes, but that is not a nice interface is it, run your task as RT/DL
and suddenly it behaves differently.

> Remember, RT and DL are about deterministic behavior, SCHED_OTHER is about
> performance. This is a performance patch, not a deterministic one.

Yeah, but performance means something different depending on who and
when you talk to someone.

> > But even today (on good hardware or with mitigations=off):
> > 
> > gettid-1m:	179,650,423      cycles
> > xadd-1m:	 23,036,564      cycles
> > 
> > syscall is the cost of roughly 8 atomic ops. More expensive, sure. But
> > not insanely so. I've seen atomic ops go up to >1000 cycles if you
> > contend them hard enough.
> > 
> 
> This has been your argument for over a decade, and the real world has seen
> it differently. Performance matters significantly for user applications, and
> if system calls didn't have performance issues, I'm sure the performance
> centric applications would have used them.
> 
> This is because these critical sections run much less than 8 atomic ops. And
> when you are executing these critical sections millions of times a second,
> that adds up quickly.

But you wouldn't be doing syscalls on every section either. If syscalls
were free (0 cycles) and you could hand-wave any syscall you pleased,
how would you do this?

The typical futex like setup is you only syscall on contention, when
userspace is going to be spinning and wasting cycles anyhow. The current
problem is that futex_wait will instantly schedule-out / block, even if
the lock owner is currently one instruction away from releasing the lock.
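
To make that pattern concrete, a minimal sketch of such a lock might look
like the below -- the spin count, the missing FUTEX_PRIVATE_FLAG and the
x86-only pause are arbitrary simplifications, nothing prescribed here:

 #include <stdint.h>
 #include <unistd.h>
 #include <linux/futex.h>
 #include <sys/syscall.h>

 /* 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters */
 static void lock(uint32_t *f)
 {
	uint32_t zero = 0;

	/* fast path: uncontended acquire, no syscall at all */
	if (__atomic_compare_exchange_n(f, &zero, 1, 0,
					__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		return;

	/* adaptive part: spin a bit, the owner may be about to release */
	for (int i = 0; i < 100 && __atomic_load_n(f, __ATOMIC_RELAXED); i++)
		__builtin_ia32_pause();

	/* slow path: mark contended and sleep in the kernel */
	while (__atomic_exchange_n(f, 2, __ATOMIC_ACQUIRE) != 0)
		syscall(SYS_futex, f, FUTEX_WAIT, 2, NULL, NULL, 0);
 }

 static void unlock(uint32_t *f)
 {
	/* wake one waiter only if someone marked the lock contended */
	if (__atomic_exchange_n(f, 0, __ATOMIC_RELEASE) == 2)
		syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, 0);
 }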
Peter Zijlstra Oct. 26, 2023, 8:54 a.m. UTC | #18
On Wed, Oct 25, 2023 at 01:17:31PM -0400, Steven Rostedt wrote:
> On Wed, 25 Oct 2023 18:24:35 +0200
> Mateusz Guzik <mjguzik@gmail.com> wrote:
> 
> > On Wed, Oct 25, 2023 at 11:42:34AM -0400, Mathieu Desnoyers wrote:
> > > On 2023-10-25 10:31, Steven Rostedt wrote:  
> > > > On Wed, 25 Oct 2023 15:55:45 +0200
> > > > Peter Zijlstra <peterz@infradead.org> wrote:  
> > > 
> > > [...]
> > > 
> > > After digging lore for context, here are some thoughts about the actual
> > > proposal: AFAIU the intent here is to boost the scheduling slice for a
> > > userspace thread running with a mutex held so it can complete faster,
> > > and therefore reduce contention.
> > > 
> > > I suspect this is not completely unrelated to priority inheritance
> > > futexes, except that one goal stated by Steven is to increase the
> > > owner slice without requiring to call a system call on the fast-path.
> 
> No, I wouldn't say it's the same as priority inheritance, which is to help
> with determinism and not performance. PI adds overhead but removes
> unbounded latency. On average, a non PI mutex is faster than PI mutex, but
> can suffer from unbounded priority inversion.

Matheusz is right though, what you're asking for is a (limited) priority
ceiling, which is a very primitive form of PI, which itself is a very
specific case of proxy execution :-)

Note that in kernel spinners have this priority ceiling by means of
preempt_disable().
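
For reference, a rough sketch of what an in-kernel spinner gets for free
(spin_lock() on a !RT kernel already implies it):

 #include <linux/preempt.h>

 /* sketch: the kernel-side "ceiling" around a spinning section */
 static void spin_with_ceiling(void)
 {
	preempt_disable();	/* cannot be scheduled out: top priority on this CPU */
	/* ... short spinning critical section ... */
	preempt_enable();	/* preemption point: schedules here if NEED_RESCHED is set */
 }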

> For this code, I took off my RT hat, and put on my performance hat.

Seems to me you took the brain along with the hat.

You're confusing cost of implementation with concept. Yes full blown PI
is fairly expensive, but the concept is still valid. Priority ceilings
were always an approximation.
Steven Rostedt Oct. 26, 2023, 1:16 p.m. UTC | #19
On Thu, 26 Oct 2023 10:44:02 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > Actually, it works with *any* system call. Not just sched_yield(). I just
> > used that as it was the best one to annotate "the kernel asked me to
> > schedule, I'm going to schedule". If you noticed, I did not modify
> > sched_yield() in the patch. The NEED_RESCHED_LAZY is still set, and without
> > the extend bit set, on return back to user space it will schedule.  
> 
> So I fundamentally *HATE* that you tie this whole thing to the
> NEED_RESCHED_LAZY thing; that's 100% the wrong layer to be doing this
> at.
> 
> It very much means you're creating an interface that won't work for a
> significant number of setups -- those that use the FULL preempt setting.

And why can't the FULL preempt setting still use NEED_RESCHED_LAZY?
PREEMPT_RT does. The beauty of NEED_RESCHED_LAZY is that it tells you
whether you *should* schedule, or whether you *must* schedule (NEED_RESCHED).

> 
> > > > set this bit and leave it there for as long as you want, and it should not
> > > > affect anything.    
> > > 
> > > It would affect the worst case interference terms of the system at the
> > > very least.  
> > 
> > If you are worried about that, it can easily be configurable to be turned
> > off. Seriously, I highly doubt that this would be even measurable as
> > interference. I could be wrong, I haven't tested that. It's something we
> > can look at, but until it's considered a problem it should not be a show
> > blocker.  
> 
> If everybody sets the thing and leaves it on, you basically double the
> worst case latency, no? And weren't you involved in a thread only last
> week where the complaint was that Chrome was a pig^W^W^W latency was too
> high?

In my first email about this:

  https://lore.kernel.org/all/20231024103426.4074d319@gandalf.local.home/

I said:

  If we are worried about abuse, we could even punish tasks that don't call
  sched_yield() by the time its extended time slice is taken.

To elaborate further on this punishment: if it does become an issue that a
bunch of tasks always have this bit set and do not give up the CPU in a
timely manner, such a task could be flagged so that the kernel ignores the
bit and/or removes some of its eligibility.

That is, it wouldn't take too long before the abuser gets whacked and is no
longer able to abuse.

But I figured we would look into that if EEVDF doesn't naturally take care
of it.

> 
> > > > If you look at what Thomas's PREEMPT_AUTO.patch    
> > > 
> > > I know what it does, it also means your thing doesn't work the moment
> > > you set things up to have the old full-preempt semantics back. It
> > > doesn't work in the presence of RT/DL tasks, etc..  
> > 
> > Note, I am looking at ways to make this work with full preempt semantics.  
> 
> By not relying on the PREEMPT_AUTO stuff. If you noodle with the code
> that actually sets preempt it should also work with preempt, but you're
> working at the wrong layer.

My guess is that NEED_RESCHED_LAZY will work with PREEMPT as well. That
code is still a work in progress, and this code is dependent on that. Right
now it depends on PREEMPT_AUTO because that's the only option that
currently gives us NEED_RESCHED_LAZY. From reading the discussions from
Thomas, it looks like NEED_RESCHED_LAZY will eventually be available in
CONFIG_PREEMPT.

> 
> Also see that old Oracle thread that got dug up.

I'll go back and read that.

> 
> > > More importantly, it doesn't work for RT/DL tasks, so having the bit set
> > > and not having OTHER policy is an error.  
> > 
> > It would basically be a nop.  
> 
> Well yes, but that is not a nice interface, is it: run your task as RT/DL
> and suddenly it behaves differently.

User space spin locks already run very differently under RT/DL today!

That could easily cause them to deadlock.

User space spin locks only make sense with SCHED_OTHER; otherwise great
care needs to be taken to avoid unbounded priority inversion, especially
with FIFO.

> > This is because these critical sections run much less than 8 atomic ops. And
> > when you are executing these critical sections millions of times a second,
> > that adds up quickly.  
> 
> But you wouldn't be doing syscalls on every section either. If syscalls
> were free (0 cycles) and you could hand-wave any syscall you pleased,
> how would you do this?
> 
> The typical futex like setup is you only syscall on contention, when
> userspace is going to be spinning and wasting cycles anyhow. The current
> problem is that futex_wait will instantly schedule-out / block, even if
> the lock owner is currently one instruction away from releasing the lock.

And that is exactly what user space adaptive spin locks are meant to solve,
which I'm 100% for! (I'm the one who talked André Almeida into working on
this.)

But as my tests show, the speed-up comes from keeping the lock holder from
being preempted. That is also why Thomas created NEED_RESCHED_LAZY for
PREEMPT_RT even though it already had adaptive spin locks.

-- Steve
Steven Rostedt Oct. 26, 2023, 1:40 p.m. UTC | #20
On Thu, 26 Oct 2023 10:54:14 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > No, I wouldn't say it's the same as priority inheritance, which is to help
> > with determinism and not performance. PI adds overhead but removes
> > unbounded latency. On average, a non PI mutex is faster than PI mutex, but
> > can suffer from unbounded priority inversion.  
> 
> Matheusz is right though, what you're asking for is a (limited) priority
> ceiling, which is a very primitive form of PI, which itself is a very
> specific case of proxy execution :-)
> 
> Note that in kernel spinners have this priority ceiling by means of
> preempt_disable().
> 
> > For this code, I took off my RT hat, and put on my performance hat.  
> 
> Seems to me you took the brain along with the hat.
> 
> You're confusing cost of implementation with concept. Yes full blown PI
> is fairly expensive, but the concept is still valid. Priority ceilings
> were always an approximation.

It's a very weak priority ceiling, which is why I didn't associate it with
one the way I would with preempt_disable() in the kernel (I have mentioned
in several of my talks that preempt_disable() is a form of PI, as it makes
the running task the highest priority task on the CPU).

The major difference is that this is time sensitive: the kernel gives the
task an arbitrary amount of time to finish up. A priority ceiling is
usually associated with an entire critical section, not just the start of
one with the ceiling taken away if you don't finish in time.

Even proxy execution doesn't work that way.

Hence, why I don't want to associate this with priority inheritance. The
time constraint is a fundamental difference.

-- Steve
Steven Rostedt Oct. 26, 2023, 3:49 p.m. UTC | #21
On Thu, 26 Oct 2023 09:40:35 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Hence, why I don't want to associate this with priority inheritance. The
> time constraint is a fundamental difference.

Let me add one more fundamental difference here that makes this solution
different than priority inheritance and ceiling.

PI and ceiling define the correctness of the system. If you get it wrong or
remove it, the system can be incorrect and lock up, fail deadlines, etc.
There's hundreds, if not thousands of papers mathematically defining the
correctness of PI, ceiling and proxy execution, as they are complex and
critical for the system to behave properly.

This feature is a performance boost only, and has nothing to do with
"correctness". That's because it has that arbitrary time where it can run a
little more. It's more like the difference between having something in
cache and a cache miss. This would cause many academics to quit and find a
job in sales if they had to prove the correctness of an algorithm that gave
you a boost for some random amount of time. The idea here is to help with
performance. If it exists, great, your application will likely perform
better. If it doesn't, no big deal, you may just have to deal with longer
wait times on critical sections.

This is why I do not want to associate this as another form of PI or
ceiling.

-- Steve
Daniel Bristot de Oliveira Oct. 26, 2023, 4:31 p.m. UTC | #22
On 10/26/23 17:49, Steven Rostedt wrote:
> On Thu, 26 Oct 2023 09:40:35 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
>> Hence, why I don't want to associate this with priority inheritance. The
>> time constraint is a fundamental difference.
> 
> Let me add one more fundamental difference here that makes this solution
> different than priority inheritance and ceiling.
> 
> PI and ceiling define the correctness of the system. If you get it wrong or
> remove it, the system can be incorrect and lock up, fail deadlines, etc.
> There's hundreds, if not thousands of papers mathematically defining the
> correctness of PI, ceiling and proxy execution, as they are complex and
> critical for the system to behave properly.
> 
> This feature is a performance boost only, and has nothing to do with
> "correctness". That's because it has that arbitrary time where it can run a
> little more. It's more like the difference between having something in
> cache and a cache miss. This would cause many academics to quit and find a
> job in sales if they had to prove the correctness of an algorithm that gave
> you a boost for some random amount of time. The idea here is to help with
> performance. If it exists, great, your application will likely perform
> better. If it doesn't, no big deal, you may just have to deal with longer
> wait times on critical sections.

terminologies, terminologies.... those academic people :-)

I think that this can also be seen as an extension of non-preemptive mode
to user space, but... not entirely; it is a ceiling to a
[ higher than fair / lower than RT ] priority?

And it is not global. It is partitioned: once the section starts, it stays
there, still preemptible by RT/DL?

[ trying to understand the implications of it ]

> 
> This is why I do not want to associate this as another form of PI or
> ceiling.
> 
> -- Steve
Steven Rostedt Oct. 26, 2023, 5:26 p.m. UTC | #23
On Thu, 26 Oct 2023 18:31:56 +0200
Daniel Bristot de Oliveira <bristot@kernel.org> wrote:


> > This feature is a performance boost only, and has nothing to do with
> > "correctness". That's because it has that arbitrary time where it can run a
> > little more. It's more like the difference between having something in
> > cache and a cache miss. This would cause many academics to quit and find a
> > job in sales if they had to prove the correctness of an algorithm that gave
> > you a boost for some random amount of time. The idea here is to help with
> > performance. If it exists, great, your application will likely perform
> > better. If it doesn't, no big deal, you may just have to deal with longer
> > wait times on critical sections.  
> 
> terminologies, terminologies.... those academic people :-)

I hope this doesn't cause you to quit and switch to a career in sales!

> 
> I think that this can also be seen as an extension of non-preemptive mode
> to user space, but... not entirely; it is a ceiling to a
> [ higher than fair / lower than RT ] priority?

Well, it's just an extended time slice for SCHED_OTHER (up to 1 ms at
1000Hz, or 4 ms at 250Hz). But if an RT or DL task were to wake up, it
would preempt the extended task immediately. This feature is at the whim of
the kernel implementation and provides no guarantees. It's just a hint from
user space asking the kernel for a little more time to get out of a
critical section when the time slice happens to end while the task is
inside one. The kernel is allowed to deny the request.

> 
> And it is not global. It is partitioned: once the section starts, it stays
> there, still preemptible by RT/DL?

Basically yes. Looking at the v6.6-rc4 kernel (which is where I started
from), the base time slice is 3ms.

  # cat /sys/kernel/debug/sched/base_slice_ns
  3000000

Note, when I upped this to 6ms, the benefit of this patch did drop. That
makes total sense, because a longer slice reduces the number of times a
critical section gets preempted. Basically, extending all time slices does
somewhat the same thing.

With this feature enabled, if the scheduler slice ends on a critical section
that has this special bit set, the kernel will give up to 1 more ms (at 1000
HZ) to get out of that section. It will also tell user space that it is
running on extended time by setting bit 1 (0x2). When user space leaves the
critical section, it should check that bit and, if it is set, make any
system call, at which point the kernel will call schedule. In my example, I
just used sched_yield(), but it would work with gettid() as well.

Sure, user space can ignore that bit from the kernel and continue, but when
that 1 ms is up, the kernel will preempt that task with prejudice,
regardless of whether it is in a critical section or not. It's in the task's
best interest to make that system call when it knows it's the best time to
do so (not within a critical section). If it does not, it risks being
preempted within a critical section. Not to mention that the EEVDF scheduler
will lower its eligibility for the next round.
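
To make that hand-off concrete, the exit side could look roughly like the
sketch below -- the helper name, the use of gettid() and the __atomic
accessors are illustrative choices layered on the POC's per-thread mmap'ed
flags word:

 #define _GNU_SOURCE
 #include <unistd.h>

 extern unsigned long *extend_flags;	/* per-thread mmap of /sys/kernel/extend_sched */

 static void leave_critical_section(void)
 {
	unsigned long prev;

	/* drop our "please extend me" request (bit 0) */
	prev = __atomic_exchange_n(extend_flags, 0UL, __ATOMIC_RELEASE);

	/*
	 * Bit 1 set means the kernel already granted extra time; enter the
	 * kernel now, before the ~1 ms budget runs out, or risk being
	 * preempted with prejudice and losing EEVDF eligibility.  Any
	 * syscall will do.
	 */
	if (prev & 2)
		gettid();
 }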

-- Steve
Peter Zijlstra Oct. 30, 2023, 1:29 p.m. UTC | #24
On Thu, Oct 26, 2023 at 09:16:58AM -0400, Steven Rostedt wrote:

> I said:
> 
>   If we are worried about abuse, we could even punish tasks that don't call
>   sched_yield() by the time its extended time slice is taken.

This is a user interface, so of course I'm worried about abuse. That's the
first thing you *should* think about.

Userspace is out to get you -- must assume hostile.

Notably, we were talking usec latencies in the Chrome thread, you're
adding 1000 usec latencies here (in the best case, delaying scheduling
until the next tick, 10000usec for the HZ=100 folks). This is quite
'unfortunate'.

On my very aged IVB-EP I can get 50us scheduling latencies on a good
day, on my brand spanking new SPR I can get 20us (more faster more
better etc..).

Ideally we don't allow userspace to extend much (if any) beyond the
granularity already imposed by the kernel's preempt/IRQ-disable regions.
Sadly we don't have a self-measure of that around.

So I had a poke at all this and ended up with the below. I still utterly
detest all this, but it appears to actually work -- although I don't see
much of an improvement; the numbers are somewhat unstable. (I say it works
because I see the 'yield -- made it' trace_printk when I do it right and
the 'timeout -- force resched' when I do it 'wrong'.)

This thing works across the board and gives userspace 50 usec, equal to
what the kernel already imposes (on the IVB).

I simply took a bit from the existing flags field, and userspace can use
BTR to test if the kernel cleared it -- in which case it needs yield
(and not any other syscall).

Additionally, doing a syscall with the bit set will SIGSEGV (when
DEBUG_RSEQ).

---
 include/linux/sched.h     | 17 +++++++++++++
 include/uapi/linux/rseq.h |  5 ++++
 kernel/entry/common.c     | 22 +++++++++++------
 kernel/rseq.c             | 63 ++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c       | 22 +++++++++++++++++
 5 files changed, 121 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7f17295931de..93c0f9ac6293 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -919,6 +919,9 @@ struct task_struct {
 #ifdef CONFIG_RT_MUTEXES
 	unsigned			sched_rt_mutex:1;
 #endif
+#ifdef CONFIG_RSEQ
+	unsigned			rseq_sched_delay:1;
+#endif
 
 	/* Bit to tell LSMs we're in execve(): */
 	unsigned			in_execve:1;
@@ -2451,6 +2454,20 @@ static inline void rseq_syscall(struct pt_regs *regs)
 
 #endif
 
+#ifdef CONFIG_RSEQ
+
+extern bool rseq_delay_resched(void);
+extern void rseq_delay_resched_fini(void);
+extern void rseq_delay_resched_tick(void);
+
+#else
+
+static inline bool rseq_delay_resched(void) { return false; }
+static inline void rseq_delay_resched_fini(void) { }
+static inline void rseq_delay_resched_tick(void) { }
+
+#endif
+
 #ifdef CONFIG_SCHED_CORE
 extern void sched_core_free(struct task_struct *tsk);
 extern void sched_core_fork(struct task_struct *p);
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac9..ec3b45f32bc8 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -26,6 +26,7 @@ enum rseq_cs_flags_bit {
 	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
+	RSEQ_CS_FLAG_DELAY_RESCHED_BIT		= 3,
 };
 
 enum rseq_cs_flags {
@@ -35,6 +36,8 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	=
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+	RSEQ_CS_FLAG_DELAY_RESCHED		=
+		(1U << RSEQ_CS_FLAG_DELAY_RESCHED_BIT),
 };
 
 /*
@@ -128,6 +131,8 @@ struct rseq {
 	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
 	 *     Inhibit instruction sequence block restart on migration for
 	 *     this thread.
+	 * - RSEQ_CS_DELAY_RESCHED
+	 *     Try delay resched...
 	 */
 	__u32 flags;
 
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d7ee4bc3f2ba..b969c19e29bc 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -145,7 +145,8 @@ void noinstr exit_to_user_mode(void)
 void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
 
 static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-					    unsigned long ti_work)
+					    unsigned long ti_work,
+					    bool irq)
 {
 	/*
 	 * Before returning to user space ensure that all pending work
@@ -155,8 +156,12 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & _TIF_NEED_RESCHED)
-			schedule();
+		if (ti_work & _TIF_NEED_RESCHED) {
+			if (irq && rseq_delay_resched())
+				clear_tsk_need_resched(current);
+			else
+				schedule();
+		}
 
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
@@ -190,7 +195,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 	return ti_work;
 }
 
-static void exit_to_user_mode_prepare(struct pt_regs *regs)
+static void exit_to_user_mode_prepare(struct pt_regs *regs, bool irq)
 {
 	unsigned long ti_work;
 
@@ -201,7 +206,10 @@ static void exit_to_user_mode_prepare(struct pt_regs *regs)
 
 	ti_work = read_thread_flags();
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
-		ti_work = exit_to_user_mode_loop(regs, ti_work);
+		ti_work = exit_to_user_mode_loop(regs, ti_work, irq);
+
+	if (irq)
+		rseq_delay_resched_fini();
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
@@ -282,7 +290,7 @@ static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *reg
 {
 	syscall_exit_to_user_mode_prepare(regs);
 	local_irq_disable_exit_to_user();
-	exit_to_user_mode_prepare(regs);
+	exit_to_user_mode_prepare(regs, false);
 }
 
 void syscall_exit_to_user_mode_work(struct pt_regs *regs)
@@ -306,7 +314,7 @@ noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
 noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 {
 	instrumentation_begin();
-	exit_to_user_mode_prepare(regs);
+	exit_to_user_mode_prepare(regs, true);
 	instrumentation_end();
 	__exit_to_user_mode();
 }
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9de6e35fe679..fce7a939f3a9 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -339,6 +339,64 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	force_sigsegv(sig);
 }
 
+bool rseq_delay_resched(void)
+{
+	struct task_struct *t = current;
+	u32 flags;
+
+	if (!IS_ENABLED(CONFIG_SCHED_HRTICK))
+		return false;
+
+	if (!t->rseq)
+		return false;
+
+	if (t->rseq_sched_delay)
+		return false;
+
+	if (copy_from_user(&flags, &t->rseq->flags, sizeof(flags)))
+		return false;
+
+	if (!(flags & RSEQ_CS_FLAG_DELAY_RESCHED))
+		return false;
+
+	flags &= ~RSEQ_CS_FLAG_DELAY_RESCHED;
+	if (copy_to_user(&t->rseq->flags, &flags, sizeof(flags)))
+		return false;
+
+	t->rseq_sched_delay = 1;
+
+	return true;
+}
+
+void rseq_delay_resched_fini(void)
+{
+#ifdef CONFIG_SCHED_HRTICK
+	extern void hrtick_local_start(u64 delay);
+	struct task_struct *t = current;
+	/*
+	 * IRQs off, guaranteed to return to userspace, start timer on this CPU
+	 * to limit the resched-overdraft.
+	 *
+	 * If your critical section is longer than 50 us you get to keep the
+	 * pieces.
+	 */
+	if (t->rseq_sched_delay)
+		hrtick_local_start(50 * NSEC_PER_USEC);
+#endif
+}
+
+void rseq_delay_resched_tick(void)
+{
+#ifdef CONFIG_SCHED_HRTICK
+	struct task_struct *t = current;
+
+	if (t->rseq_sched_delay) {
+		set_tsk_need_resched(t);
+		trace_printk("timeout -- force resched\n");
+	}
+#endif
+}
+
 #ifdef CONFIG_DEBUG_RSEQ
 
 /*
@@ -353,7 +411,10 @@ void rseq_syscall(struct pt_regs *regs)
 
 	if (!t->rseq)
 		return;
-	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
+
+	if (rseq_get_rseq_cs(t, &rseq_cs) ||
+	    in_rseq_cs(ip, &rseq_cs) ||
+	    rseq_cs.flags & RSEQ_CS_FLAG_DELAY_RESCHED)
 		force_sig(SIGSEGV);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6ec65bc2a91e..994497fe97a9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -776,6 +776,7 @@ void update_rq_clock(struct rq *rq)
 
 static void hrtick_clear(struct rq *rq)
 {
+	rseq_delay_resched_tick();
 	if (hrtimer_active(&rq->hrtick_timer))
 		hrtimer_cancel(&rq->hrtick_timer);
 }
@@ -791,6 +792,8 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
 
 	WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
 
+	rseq_delay_resched_tick();
+
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
 	rq->curr->sched_class->task_tick(rq, rq->curr, 1);
@@ -864,6 +867,16 @@ void hrtick_start(struct rq *rq, u64 delay)
 
 #endif /* CONFIG_SMP */
 
+void hrtick_local_start(u64 delay)
+{
+	struct rq *rq = this_rq();
+	struct rq_flags rf;
+
+	rq_lock(rq, &rf);
+	hrtick_start(rq, delay);
+	rq_unlock(rq, &rf);
+}
+
 static void hrtick_rq_init(struct rq *rq)
 {
 #ifdef CONFIG_SMP
@@ -6709,6 +6722,9 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
+#ifdef CONFIG_RSEQ
+	prev->rseq_sched_delay = 0;
+#endif
 #ifdef CONFIG_SCHED_DEBUG
 	rq->last_seen_need_resched_ns = 0;
 #endif
@@ -8607,6 +8623,12 @@ static void do_sched_yield(void)
  */
 SYSCALL_DEFINE0(sched_yield)
 {
+	if (current->rseq_sched_delay) {
+		trace_printk("yield -- made it\n");
+		schedule();
+		return 0;
+	}
+
 	do_sched_yield();
 	return 0;
 }
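
For completeness, the userspace half might look something like the sketch
below. It assumes glibc's rseq registration (__rseq_offset from
<sys/rseq.h>) plus a compiler that provides __builtin_thread_pointer(),
and it hard-codes the new flag bit from the patch above -- so treat the
names and details as illustrative only:

 #include <sched.h>
 #include <stddef.h>
 #include <stdint.h>
 #include <sys/rseq.h>		/* glibc >= 2.35: struct rseq, __rseq_offset */

 /* from the patch above; not in any released uapi header */
 #define RSEQ_CS_FLAG_DELAY_RESCHED	(1U << 3)

 static inline struct rseq *rseq_self(void)
 {
	/* glibc registers each thread's struct rseq at this TLS offset */
	return (struct rseq *)((char *)__builtin_thread_pointer() + __rseq_offset);
 }

 static inline void delay_resched_start(void)
 {
	/* ask the kernel to defer a lazy reschedule while we hold the lock */
	__atomic_fetch_or(&rseq_self()->flags, RSEQ_CS_FLAG_DELAY_RESCHED,
			  __ATOMIC_RELAXED);
 }

 static inline void delay_resched_end(void)
 {
	/*
	 * Clear our request before doing any syscall (a syscall with the
	 * bit still set SIGSEGVs under DEBUG_RSEQ).  If the kernel already
	 * cleared the bit, it granted us extra time and now expects a
	 * sched_yield(), not just any other syscall.
	 */
	uint32_t old = __atomic_fetch_and(&rseq_self()->flags,
					  ~RSEQ_CS_FLAG_DELAY_RESCHED,
					  __ATOMIC_RELAXED);

	if (!(old & RSEQ_CS_FLAG_DELAY_RESCHED))
		sched_yield();
 }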
Steven Rostedt Oct. 30, 2023, 1:52 p.m. UTC | #25
On Mon, 30 Oct 2023 14:29:49 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Oct 26, 2023 at 09:16:58AM -0400, Steven Rostedt wrote:
> 
> > I said:
> > 
> >   If we are worried about abuse, we could even punish tasks that don't call
> >   sched_yield() by the time its extended time slice is taken.  
> 
> This is a user interface, so of course I'm worried about abuse. That's the
> first thing you *should* think about.
> 
> Userspace is out to get you -- must assume hostile.

100% agree!

> 
> Notably, we were talking usec latencies in the Chrome thread, you're
> adding 1000 usec latencies here (in the best case, delaying scheduling
> until the next tick, 10000usec for the HZ=100 folks). This is quite
> 'unfortunate'.
> 
> On my very aged IVB-EP I can get 50us scheduling latencies on a good
> day, on my brand spanking new SPR I can get 20us (more faster more
> better etc..).
> 
> Ideally we don't allow userspace to extend much (if any) beyond the
> granularity already imposed by the kernel's preempt/IRQ-disable regions.
> Sadly we don't have a self-measure of that around.
> 
> So I had a poke at all this and ended up with the below. I still utterly
> detest all this, but it appears to actually work -- although I don't see
> much of an improvement; the numbers are somewhat unstable. (I say it works
> because I see the 'yield -- made it' trace_printk when I do it right and
> the 'timeout -- force resched' when I do it 'wrong'.)
> 
> This thing works across the board and gives userspace 50 usec, equal to
> what the kernel already imposes (on the IVB).
> 
> I simply took a bit from the existing flags field, and userspace can use
> BTR to test if the kernel cleared it -- in which case it needs yield
> (and not any other syscall).
> 
> Additionally, doing a syscall with the bit set will SIGSEGV (when
> DEBUG_RSEQ).
> 

Thanks for looking into this even though you detest it ;-)

Unfortunately, now that the merge window has opened (and someone reported a
bug in my code from linux-next :-( ), I need to take a step back from this
and may not be able to work on it again until plumbers. By then, I hope to
have time to dig deeper into what you have done here.

Thanks again Peter!

-- Steve

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9b13b7d4f1d3..fb540dd0dec0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -740,6 +740,10 @@  struct kmap_ctrl {
 #endif
 };
 
+struct extend_map {
+	long				flags;
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -802,6 +806,8 @@  struct task_struct {
 	unsigned int			core_occupation;
 #endif
 
+	struct extend_map		*extend_map;
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index c1f706038637..21d0e4d81d33 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -147,17 +147,32 @@  void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
 static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 					    unsigned long ti_work)
 {
+	unsigned long ignore_mask;
+
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
 	 */
 	while (ti_work & EXIT_TO_USER_MODE_WORK) {
+		ignore_mask = 0;
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
+		if (ti_work & _TIF_NEED_RESCHED) {
 			schedule();
 
+		} else if (ti_work & _TIF_NEED_RESCHED_LAZY) {
+			if (!current->extend_map ||
+			    !(current->extend_map->flags & 1)) {
+				schedule();
+			} else {
+				trace_printk("Extend!\n");
+				/* Allow to leave with NEED_RESCHED_LAZY still set */
+				ignore_mask |= _TIF_NEED_RESCHED_LAZY;
+				current->extend_map->flags |= 2;
+			}
+		}
+
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
 
@@ -184,6 +199,7 @@  static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		tick_nohz_user_enter_prepare();
 
 		ti_work = read_thread_flags();
+		ti_work &= ~ignore_mask;
 	}
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
diff --git a/kernel/exit.c b/kernel/exit.c
index edb50b4c9972..ddf89ec9ab62 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -906,6 +906,13 @@  void __noreturn do_exit(long code)
 	if (tsk->io_context)
 		exit_io_context(tsk);
 
+	if (tsk->extend_map) {
+		unsigned long addr = (unsigned long)tsk->extend_map;
+
+		virt_to_page(addr)->mapping = NULL;
+		free_page(addr);
+	}
+
 	if (tsk->splice_pipe)
 		free_pipe_info(tsk->splice_pipe);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b6d20dfb9a8..da2214082d25 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1166,6 +1166,8 @@  static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->wake_q.next = NULL;
 	tsk->worker_private = NULL;
 
+	tsk->extend_map = NULL;
+
 	kcov_task_init(tsk);
 	kmsan_task_create(tsk);
 	kmap_local_fork(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 976092b7bd45..297061cfa08d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -32,3 +32,4 @@  obj-y += core.o
 obj-y += fair.o
 obj-y += build_policy.o
 obj-y += build_utility.o
+obj-y += extend.o
diff --git a/kernel/sched/extend.c b/kernel/sched/extend.c
new file mode 100644
index 000000000000..a632e1a8f57b
--- /dev/null
+++ b/kernel/sched/extend.c
@@ -0,0 +1,90 @@ 
+#include <linux/kobject.h>
+#include <linux/pagemap.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+
+#ifdef CONFIG_SYSFS
+static ssize_t extend_sched_read(struct file *file,  struct kobject *kobj,
+				 struct bin_attribute *bin_attr,
+				 char *buf, loff_t off, size_t len)
+{
+	static const char output[] = "Extend scheduling time slice\n";
+
+	printk("%s:%d\n", __func__, __LINE__);
+	if (off >= sizeof(output))
+		return 0;
+
+	strscpy(buf, output + off, len);
+	return min((ssize_t)len, sizeof(output) - off - 1);
+}
+
+static ssize_t extend_sched_write(struct file *file, struct kobject *kobj,
+				  struct bin_attribute *bin_attr,
+				  char *buf, loff_t off, size_t len)
+{
+	printk("%s:%d\n", __func__, __LINE__);
+	return -EINVAL;
+}
+
+static vm_fault_t extend_sched_mmap_fault(struct vm_fault *vmf)
+{
+	vm_fault_t ret = VM_FAULT_SIGBUS;
+
+	trace_printk("%s:%d\n", __func__, __LINE__);
+	/* Only has one page */
+	if (vmf->pgoff || !current->extend_map)
+		return ret;
+
+	vmf->page = virt_to_page(current->extend_map);
+
+	get_page(vmf->page);
+	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+	vmf->page->index   = vmf->pgoff;
+
+	return 0;
+}
+
+static void extend_sched_mmap_open(struct vm_area_struct *vma)
+{
+	printk("%s:%d\n", __func__, __LINE__);
+	WARN_ON(!current->extend_map);
+}
+
+static const struct vm_operations_struct extend_sched_vmops = {
+	.open		= extend_sched_mmap_open,
+	.fault		= extend_sched_mmap_fault,
+};
+
+static int extend_sched_mmap(struct file *file, struct kobject *kobj,
+			     struct bin_attribute *attr,
+			     struct vm_area_struct *vma)
+{
+	if (current->extend_map)
+		return -EBUSY;
+
+	current->extend_map = page_to_virt(alloc_page(GFP_USER | __GFP_ZERO));
+	if (!current->extend_map)
+		return -ENOMEM;
+
+	vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP | VM_MAYWRITE, 0);
+	vma->vm_ops = &extend_sched_vmops;
+
+	return 0;
+}
+
+static struct bin_attribute extend_sched_attr = {
+	.attr = {
+		.name = "extend_sched",
+		.mode = 0777,
+	},
+	.read = &extend_sched_read,
+	.write = &extend_sched_write,
+	.mmap = &extend_sched_mmap,
+};
+
+static __init int extend_init(void)
+{
+	return sysfs_create_bin_file(kernel_kobj, &extend_sched_attr);
+}
+late_initcall(extend_init);
+#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 700b140ac1bb..17ca22e80384 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -993,9 +993,10 @@  static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool
 		resched_curr(rq);
 	} else {
 		/* Did the task ignore the lazy reschedule request? */
-		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
+		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) {
+			trace_printk("Force resched?\n");
 			resched_curr(rq);
-		else
+		} else
 			resched_curr_lazy(rq);
 	}
 	clear_buddies(cfs_rq, se);