[RFC,1/2] sched: Extended scheduler time slice

Message ID 20250131225942.365475324@goodmis.org (mailing list archive)
State New
Series sched: Extended Scheduler Time Slice revisited

Commit Message

Steven Rostedt Jan. 31, 2025, 10:58 p.m. UTC
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

This is to improve user space implemented spin locks or any critical
section. It may also be extended for VMs and their guest spin locks as
well, but that will come later.

This adds a new field in the struct rseq called cr_counter. This is a 32 bit
field where bit zero is a flag reserved for the kernel, and the other 31
bits can be used as a counter (although the kernel doesn't care how they
are used, as any bit set means the same).

This works in tandem with PREEMPT_LAZY, where a task can tell the kernel
via the rseq structure that it is in a critical section (like holding a
spin lock) that it will be leaving very shortly, and to ask the kernel to
not preempt it at the moment.

The way this works is that before entering a critical section, the user
space thread increments the cr_counter by 2 (skipping bit zero, which is
reserved for the kernel). If the task's time runs out and NEED_RESCHED_LAZY
is set, then on the way back out to user space, instead of calling
schedule(), the kernel will allow user space to continue to run. For the
moment, it lets it run for one more tick (this will be changed later).
When the kernel lets the thread have some extended time, it sets bit zero
of the rseq cr_counter to inform the user thread that it was granted
extended time and that it should make a system call immediately after it
leaves its critical section.

When the user thread leaves the critical section, it decrements the
counter by 2. If the counter then equals 1, the thread knows that the
kernel extended its time slice, and it makes a system call to allow the
kernel to schedule it.

If NEED_RESCHED is set, then the rseq is ignored and the kernel will
schedule.

Note, incrementing and decrementing the counter by 2 is just one
implementation that user space can use. As stated, any bit set in the
cr_counter from bit 1 to 31 will cause the kernel to try to grant extra
time.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/sched.h     | 10 ++++++++++
 include/uapi/linux/rseq.h | 24 ++++++++++++++++++++++++
 kernel/entry/common.c     | 14 +++++++++++++-
 kernel/rseq.c             | 30 ++++++++++++++++++++++++++++++
 4 files changed, 77 insertions(+), 1 deletion(-)
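
For illustration, here is a minimal user space sketch of the counter
protocol described in the commit message. It is not part of the patch: the
helper names, the hard-coded cr_counter offset, and the choice of getppid()
as the post-critical-section system call are assumptions for this example,
and it presumes glibc's rseq registration (__rseq_offset from <sys/rseq.h>,
glibc 2.35+), a compiler providing __builtin_thread_pointer(), and a kernel
carrying this patch with an rseq area large enough to include the new field.

#include <stdatomic.h>
#include <stdint.h>
#include <sys/rseq.h>
#include <sys/syscall.h>
#include <unistd.h>

#define RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED	(1U << 0)
/* Offset of the proposed cr_counter field; derived from the patch, which
 * appends it right after mm_cid in the uapi struct rseq. */
#define CR_COUNTER_OFFSET			28

static inline _Atomic uint32_t *cr_counter(void)
{
	/* glibc registers the rseq area at __rseq_offset from the thread pointer */
	return (_Atomic uint32_t *)((char *)__builtin_thread_pointer() +
				    __rseq_offset + CR_COUNTER_OFFSET);
}

static inline void crit_enter(void)
{
	/* Any bit set in bits 1-31 asks the kernel for a bit more time */
	atomic_fetch_add_explicit(cr_counter(), 2, memory_order_relaxed);
}

static inline void crit_exit(void)
{
	uint32_t val = atomic_fetch_sub_explicit(cr_counter(), 2,
						 memory_order_relaxed) - 2;

	/* Counter dropped to 1: the kernel set bit 0, meaning it granted
	 * extra time, so enter the kernel right away. Clearing the flag
	 * here in user space is an assumption of this sketch. */
	if (val == RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED) {
		atomic_store_explicit(cr_counter(), 0, memory_order_relaxed);
		syscall(SYS_getppid);	/* any cheap syscall will do */
	}
}

crit_enter()/crit_exit() would wrap the user space spin lock acquire and
release; in the common case where no extension was granted, the exit path
is a single atomic subtraction.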

Comments

Peter Zijlstra Feb. 1, 2025, 11:59 a.m. UTC | #1
On Fri, Jan 31, 2025 at 05:58:38PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> 
> This is to improve user space implemented spin locks or any critical
> section. It may also be extended for VMs and their guest spin locks as
> well, but that will come later.
> 
> This adds a new field in the struct rseq called cr_counter. This is a 32 bit
> field where bit zero is a flag reserved for the kernel, and the other 31
> bits can be used as a counter (although the kernel doesn't care how they
> are used, as any bit set means the same).
> 
> This works in tandem with PREEMPT_LAZY, where a task can tell the kernel
> via the rseq structure that it is in a critical section (like holding a
> spin lock) that it will be leaving very shortly, and to ask the kernel to
> not preempt it at the moment.

I still have full hate for this approach.
Steven Rostedt Feb. 1, 2025, 12:47 p.m. UTC | #2
On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@infradead.org> wrote:

>I still have full hate for this approach.

So what approach would you prefer?

-- Steve
Mathieu Desnoyers Feb. 1, 2025, 2:35 p.m. UTC | #3
On 2025-01-31 23:58, Steven Rostedt wrote:

[...]

> @@ -148,6 +160,18 @@ struct rseq {
>   	 */
>   	__u32 mm_cid;
>   
> +	/*
> +	 * The cr_counter is a way for user space to inform the kernel that
> +	 * it is in a critical section. If bits 1-31 are set, then the
> +	 * kernel may grant the thread a bit more time (but there is no
> +	 * guarantee of how much time or if it is granted at all). If the
> +	 * kernel does grant the thread extra time, it will set bit 0 to
> +	 * inform user space that it has granted the thread more time and that
> +	 * user space should call yield() as soon as it leaves its critical

Does it specifically need to be yield(), or can it be just "entering
the kernel" through any system call or trap ?

[...]

> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 9de6e35fe679..b792e36a3550 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -339,6 +339,36 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>   	force_sigsegv(sig);
>   }
>   
> +bool rseq_delay_resched(void)
> +{
> +	struct task_struct *t = current;
> +	u32 flags;
> +
> +	if (!t->rseq)
> +		return false;
> +
> +	/* Make sure the cr_counter exists */
> +	if (current->rseq_len <= offsetof(struct rseq, cr_counter))
> +		return false;

It would be clearer/more precise with < offsetofend(struct rseq, cr_counter)

Thanks,

Mathieu
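
For reference, offsetofend() is defined in include/linux/stddef.h as the
offset just past a member, so Mathieu's suggested check asks "does the
registered rseq area fully contain cr_counter?" rather than only "does it
extend past where cr_counter starts?":

#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER))

	/* Make sure the whole cr_counter field exists */
	if (current->rseq_len < offsetofend(struct rseq, cr_counter))
		return false;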
Peter Zijlstra Feb. 1, 2025, 6:11 p.m. UTC | #4
On Sat, Feb 01, 2025 at 07:47:32AM -0500, Steven Rostedt wrote:
> 
> 
> On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> >I still have full hate for this approach.
> 
> So what approach would you prefer?

The one that does not rely on the preemption method -- I think I posted
something along those lines, and someone else recently reposted something
based on it.

Tying things to the preemption method is absurdly bad design -- and I've
told you that before.
Steven Rostedt Feb. 1, 2025, 11:06 p.m. UTC | #5
On Sat, 1 Feb 2025 19:11:29 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Feb 01, 2025 at 07:47:32AM -0500, Steven Rostedt wrote:
> > 
> > 
> > On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@infradead.org> wrote:
> >   
> > >I still have full hate for this approach.  
> > 
> > So what approach would you prefer?  
> 
> The one that does not rely on the preemption method -- I think I posted
> something along those lines, and someone else recently reposted something
> based on it.
> 
> Tying things to the preemption method is absurdly bad design -- and I've
> told you that before.

How exactly is it "bad design"? The preemption method itself changes the
way applications get scheduled, and that can be very noticeable to the
applications themselves.

With no preemption, applications will see high latency every time any
application does a system call.

Voluntary preemption is a little more reactive, but more randomly so.

The preempt lazy kconfig has:

          This option provides a scheduler driven preemption model that
          is fundamentally similar to full preemption, but is less
          eager to preempt SCHED_NORMAL tasks in an attempt to
          reduce lock holder preemption and recover some of the performance
          gains seen from using Voluntary preemption.

This could be a config option called PREEMPT_USER_LAZY that extends the
"reduce lock holder preemption" idea to user space spin locks.

But if your issue is with relying on the preemption method, does that
mean you would prefer to have this feature for every preemption method?
That may still require using the LAZY flag, which can cause a schedule in
the kernel but not in user space?

Note, my group is actually more interested in implementing this for
VMs. But that requires another level of pointer indirection.

That is, qemu could create a device that shares memory between the
guest kernel and the qemu VCPU thread. The guest kernel could update
the counter in this shared memory before grabbing a raw_spin_lock, which
would act just like this patch set does. The difference would be that the
counter would need to live in a memory page that holds only this
information, and not in the rseq structure itself. Mathieu was concerned
about leaks and corruption of the rseq structure by a malicious guest.
Thus, the counter would have to live in a clean memory page that is shared
between the guest and the qemu thread.

The rseq would then have a pointer to this memory, and the host kernel
would then have to traverse that pointer to the location of the counter.

In other words, my real goal is to have this working for guests on
their raw_spin_locks. We first tried to do this in KVM directly, but
the KVM maintainers said this is more a generic scheduling issue and
doesn't belong in KVM. I agreed with them.

-- Steve
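
To make the pointer indirection concrete, here is a purely hypothetical
sketch; none of these structures, field names, or kernel accesses exist
today, and the actual layout would have to be worked out with the rseq and
KVM folks:

#include <linux/types.h>

/*
 * Hypothetical: a dedicated page shared between the guest kernel and the
 * qemu vCPU thread, holding nothing but the extension counter. Keeping it
 * out of the rseq area means a malicious guest can only influence the
 * scheduling hint, not corrupt the host's rseq state.
 */
struct vcpu_extend_area {
	__u32 cr_counter;	/* same bit layout as rseq->cr_counter */
};

/*
 * Hypothetical: the vCPU thread's rseq would carry a user space pointer to
 * that shared area, and the host kernel would follow it on the
 * return-to-user path instead of reading rseq->cr_counter directly, e.g.:
 *
 *	if (get_user(ptr, &t->rseq->cr_counter_ptr) || !ptr)
 *		return false;
 *	if (copy_from_user_nofault(&cnt, (u32 __user *)ptr, sizeof(cnt)))
 *		return false;
 */
struct rseq_vm_ext {
	__u64 cr_counter_ptr;	/* user address of a struct vcpu_extend_area */
};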
Steven Rostedt Feb. 1, 2025, 11:08 p.m. UTC | #6
On Sat, 1 Feb 2025 15:35:13 +0100
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2025-01-31 23:58, Steven Rostedt wrote:
> 
> [...]
> 
> > @@ -148,6 +160,18 @@ struct rseq {
> >   	 */
> >   	__u32 mm_cid;
> >   
> > +	/*
> > +	 * The cr_counter is a way for user space to inform the kernel that
> > +	 * it is in a critical section. If bits 1-31 are set, then the
> > +	 * kernel may grant the thread a bit more time (but there is no
> > +	 * guarantee of how much time or if it is granted at all). If the
> > +	 * kernel does grant the thread extra time, it will set bit 0 to
> > +	 * inform user space that it has granted the thread more time and that
> > +	 * user space should call yield() as soon as it leaves its critical  
> 
> Does it specifically need to be yield(), or can it be just "entering
> the kernel" through any system call or trap ?

No it doesn't need to be yield. That just seemed like the obvious
system call to use. But any system call would force a schedule. We
could just optimize yield() though.

> 
> [...]
> 
> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > index 9de6e35fe679..b792e36a3550 100644
> > --- a/kernel/rseq.c
> > +++ b/kernel/rseq.c
> > @@ -339,6 +339,36 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> >   	force_sigsegv(sig);
> >   }
> >   
> > +bool rseq_delay_resched(void)
> > +{
> > +	struct task_struct *t = current;
> > +	u32 flags;
> > +
> > +	if (!t->rseq)
> > +		return false;
> > +
> > +	/* Make sure the cr_counter exists */
> > +	if (current->rseq_len <= offsetof(struct rseq, cr_counter))
> > +		return false;  
> 
> It would be clearer/more precise with < offsetofend(struct rseq, cr_counter)

Ah yeah, thanks!

-- Steve
Linus Torvalds Feb. 1, 2025, 11:18 p.m. UTC | #7
On Sat, 1 Feb 2025 at 15:08, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> No it doesn't need to be yield. That just seemed like the obvious
> system call to use. But any system call would force a schedule. We
> could just optimize yield() though.

No, "optimizing" yield is not on the table.

Why? Because it has *semantics* that people depend on because it has
historical meaning. Things like "move the process to the end of the
scheduling queue of that priority".

That may sound like a good thing, but it's ABSOLUTELY NOT what you
should actually do unless you know *exactly* what the system behavior
is.

For example, maybe the reason the kernel set NEED_RESCHED_LAZY is that
a higher-priority process is ready to run. You haven't used up your
time slice yet, but something more important needs the CPU.

If you call "sched_yield()", sure, you'll run that higher priority
thing. So far so good.

But you *also* are literally telling the scheduler to put you at the
back of the queue, despite the fact that maybe you are still in line
to be run for *your* priority level. So now your performance will
absolutely suck, because you just told the scheduler that you are not
important, and other processes in your priority level should get
priority.

So no. "yield()" does not mean "just reschedule". It rally means
"yield my position in the scheduling queue".

You are literally better off using absolutely *ANY* other system call.

The fact that you are confused about this kind of very basic issue
does imply that this patch should absolutely be handled by somebody
who knows the scheduler better.

               Linus
Linus Torvalds Feb. 1, 2025, 11:35 p.m. UTC | #8
On Sat, 1 Feb 2025 at 15:18, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But you *also* are literally telling the scheduler to put you at the
> back of the queue, despite the fact that maybe you are still in line
> to be run for *your* priority level. So now your performance will
> absolutely suck, because you just told the scheduler that you are not
> important, and other processes in your priority level should get
> priority.

Side note: we've messed with this before, exactly because people have
been confused about this before.

I have this dim memory of us having at one point decided to actively
ignore that "put me last" part of sched_yield because we had huge
performance problems with people misusing "sched_yield()".

I didn't actually check what current active schedulers do.  I would
not be in the least surprised if different schedulers end up having
very different behavior (particularly any RT scheduler).

But the moral of the story ends up being the same: don't use yield()
unless you are on some embedded platform and know exactly what the
scheduling pattern is - or if you really want to say "I don't want to
run now, do *anything* else, my performance doesn't matter".

A traditional (reasonable) situation might be "I started another task
or thread, I need for it to run first and initialize things, I'm
polling for that to be done but I don't want to busy-loop if there is
real work to be done".

Where it really is a complete hack: "my performance doesn't matter
because it's a one-time startup thing, and I couldn't be arsed to have
real locking".

In fact, the whole "I couldn't be arsed" is basically the tag-line for
"yield()". Maybe we should rename the system call.

             Linus
Steven Rostedt Feb. 2, 2025, 3:22 a.m. UTC | #9
On Sat, 1 Feb 2025 15:18:15 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 1 Feb 2025 at 15:08, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > No it doesn't need to be yield. That just seemed like the obvious
> > system call to use. But any system call would force a schedule. We
> > could just optimize yield() though.  
> 
> No, "optimizing" yield is not on the table.

Note, I only mentioned this because Peter had it in the patch he
created when I first brought this up.

  https://lore.kernel.org/all/20231030132949.GA38123@noisy.programming.kicks-ass.net/

 SYSCALL_DEFINE0(sched_yield)
 {
+	if (current->rseq_sched_delay) {
+		trace_printk("yield -- made it\n");
+		schedule();
+		return 0;
+	}
+
 	do_sched_yield();
 	return 0;
 }


> 
> Why? Because it has *semantics* that people depend on because it has
> historical meaning. Things like "move the process to the end of the
> scheduling queue of that priority".

Yes, I know the historical meaning. It's also a system call I've been
telling people to avoid for the last 20 years. I haven't seen or heard
about any application that actually depends on that behavior today.

> 
> That may sound like a good thing, but it's ABSOLUTELY NOT what you
> should actually do unless you know *exactly* what the system behavior
> is.
> 
> For example, maybe the reason the kernel set NEED_RESCHED_LAZY is that
> a higher-priority process is ready to run. You haven't used up your
> time slice yet, but something more important needs the CPU.

If it's RT, it would set NEED_RESCHED and not LAZY. This is only for
SCHED_OTHER. Sure, it could be a task with a negative nice value.

> 
> So no. "yield()" does not mean "just reschedule". It rally means
> "yield my position in the scheduling queue".

The optimization would only treat sched_yield differently if the task
had asked for an extended time slice, and it was granted. Like in
Peter's patch, if that was the case, it would just call schedule and
return. This would not affect yield() for any other task not
implementing the rseq extend counter.

> 
> You are literally better off using absolutely *ANY* other system call.
> 
> The fact that you are confused about this kind of very basic issue
> does imply that this patch should absolutely be handled by somebody
> who knows the scheduler better.
> 

I'm not confused.

And before seeing Peter's use of yield(), I was reluctant to use it for
the very same reasons you mentioned above. In my test programs, I was
simply using getuid(), as that was one of the quickest syscalls.

-- Steve
Steven Rostedt Feb. 2, 2025, 3:26 a.m. UTC | #10
On Sat, 1 Feb 2025 15:35:53 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I didn't actually check what current active schedulers do.  I would
> not be in the least surprised if different schedulers end up having
> very different behavior (particularly any RT scheduler).

The only real use of sched_yield() I have ever seen in practice was in
an RT application where all tasks were given the same RT FIFO priority,
and the application would use the yield() system call to trigger the
next task.

That's also the only use case that really does have a strictly defined
behavior for yield(). In SCHED_FIFO, as the name suggests, tasks run in
first-in first-out order. yield() will put the task at the end of
that queue. You can use this method to implement a strict,
application-defined scheduler.

-- Steve.
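
For illustration, a minimal sketch of the pattern described above: worker
threads at the same SCHED_FIFO priority, using sched_yield() to hand the
CPU to the next thread in the FIFO queue. The structure is illustrative;
it assumes root (or CAP_SYS_NICE) to set SCHED_FIFO, and the threads would
need to be confined to one CPU for the hand-off to be strictly round-robin.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/*
 * Each worker does a slice of work, then hands the CPU to the next
 * SCHED_FIFO thread at the same priority. sched_yield() is well defined
 * here: the caller goes to the tail of the FIFO queue for its priority.
 */
static void *worker(void *arg)
{
	for (int i = 0; i < 5; i++) {
		printf("worker %ld: step %d\n", (long)arg, i);
		sched_yield();
	}
	return NULL;
}

int main(void)
{
	struct sched_param sp = { .sched_priority = 10 };
	pthread_attr_t attr;
	pthread_t t[2];

	pthread_attr_init(&attr);
	pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
	pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
	pthread_attr_setschedparam(&attr, &sp);

	for (long i = 0; i < 2; i++)
		pthread_create(&t[i], &attr, worker, (void *)i);
	for (int i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	return 0;
}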
Matthew Wilcox Feb. 2, 2025, 7:22 a.m. UTC | #11
On Sat, Feb 01, 2025 at 10:22:08PM -0500, Steven Rostedt wrote:
> And before seeing Peter's use of yield(), I was reluctant to use it for
> the very same reasons you mentioned above. In my test programs, I was
> simply using getuid(), as that was one of the quickest syscalls.

Is getuid() guaranteed to issue a syscall?  It feels like the kind of
information that a tricksy libc could cache.  Traditionally, I think
we've used getppid() as the canonical "very cheap syscall" as no layer
can cache that information.
Steven Rostedt Feb. 2, 2025, 10:29 p.m. UTC | #12
On Sun, 2 Feb 2025 07:22:08 +0000
Matthew Wilcox <willy@infradead.org> wrote:

> On Sat, Feb 01, 2025 at 10:22:08PM -0500, Steven Rostedt wrote:
> > And before seeing Peter's use of yield(), I was reluctant to use it for
> > the very same reasons you mentioned above. In my test programs, I was
> > simply using getuid(), as that was one of the quickest syscalls.  
> 
> Is getuid() guaranteed to issue a syscall?  It feels like the kind of
> information that a tricksy libc could cache.  Traditionally, I think
> we've used getppid() as the canonical "very cheap syscall" as no layer
> can cache that information.

Maybe that was what I used. Can't remember. And I think I even
open-coded it using syscall() so as to not even rely on glibc.

-- Steve
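
Along the lines of Willy's suggestion, a small sketch of an open-coded
kernel entry that no libc layer can cache or shortcut (the function name is
made up for the example):

#include <sys/syscall.h>
#include <unistd.h>

/* Force a trip into the kernel so it can reschedule us if it wants to.
 * getppid() is about as cheap as system calls get, and issuing it via
 * syscall() avoids any libc-level caching. */
static inline void kernel_resched_point(void)
{
	syscall(SYS_getppid);
}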

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 64934e0830af..8e983d8cf72d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2206,6 +2206,16 @@  static inline bool owner_on_cpu(struct task_struct *owner)
 unsigned long sched_cpu_util(int cpu);
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_RSEQ
+
+extern bool rseq_delay_resched(void);
+
+#else
+
+static inline bool rseq_delay_resched(void) { return false; }
+
+#endif
+
 #ifdef CONFIG_SCHED_CORE
 extern void sched_core_free(struct task_struct *tsk);
 extern void sched_core_fork(struct task_struct *p);
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac9..185fe9826ff9 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -37,6 +37,18 @@  enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
 };
 
+enum rseq_cr_flags_bit {
+	RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED_BIT	= 0,
+};
+
+enum rseq_cr_flags {
+	RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED	=
+	(1U << RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED_BIT),
+};
+
+#define RSEQ_CR_FLAG_IN_CRITICAL_SECTION_MASK	\
+	(~RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED)
+
 /*
  * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line. It is usually declared as
@@ -148,6 +160,18 @@  struct rseq {
 	 */
 	__u32 mm_cid;
 
+	/*
+	 * The cr_counter is a way for user space to inform the kernel that
+	 * it is in a critical section. If bits 1-31 are set, then the
+	 * kernel may grant the thread a bit more time (but there is no
+	 * guarantee of how much time or if it is granted at all). If the
+	 * kernel does grant the thread extra time, it will set bit 0 to
+	 * inform user space that it has granted the thread more time and that
+	 * user space should call yield() as soon as it leaves its critical
+	 * section.
+	 */
+	__u32 cr_counter;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index e33691d5adf7..50e35f153bf8 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -90,6 +90,8 @@  void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
 __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 						     unsigned long ti_work)
 {
+	unsigned long ignore_mask = 0;
+
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
@@ -98,9 +100,18 @@  __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
+		if (ti_work & _TIF_NEED_RESCHED) {
 			schedule();
 
+		} else if (ti_work & _TIF_NEED_RESCHED_LAZY) {
+			/* Allow to leave with NEED_RESCHED_LAZY still set */
+			if (rseq_delay_resched()) {
+				trace_printk("Avoid scheduling\n");
+				ignore_mask |= _TIF_NEED_RESCHED_LAZY;
+			} else
+				schedule();
+		}
+
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
 
@@ -127,6 +138,7 @@  __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		tick_nohz_user_enter_prepare();
 
 		ti_work = read_thread_flags();
+		ti_work &= ~ignore_mask;
 	}
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9de6e35fe679..b792e36a3550 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -339,6 +339,36 @@  void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	force_sigsegv(sig);
 }
 
+bool rseq_delay_resched(void)
+{
+	struct task_struct *t = current;
+	u32 flags;
+
+	if (!t->rseq)
+		return false;
+
+	/* Make sure the cr_counter exists */
+	if (current->rseq_len <= offsetof(struct rseq, cr_counter))
+		return false;
+
+	/* If this were to fault, it would likely cause a schedule anyway */
+	if (copy_from_user_nofault(&flags, &t->rseq->cr_counter, sizeof(flags)))
+		return false;
+
+	if (!(flags & RSEQ_CR_FLAG_IN_CRITICAL_SECTION_MASK))
+		return false;
+
+	trace_printk("Extend time slice\n");
+	flags |= RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED;
+
+	if (copy_to_user_nofault(&t->rseq->cr_counter, &flags, sizeof(flags))) {
+		trace_printk("Faulted writing rseq\n");
+		return false;
+	}
+
+	return true;
+}
+
 #ifdef CONFIG_DEBUG_RSEQ
 
 /*