
[RFC,v1,1/4] Introducing qpw_lock() and per-cpu queue & flush work

Message ID 20240622035815.569665-2-leobras@redhat.com (mailing list archive)
State New
Series Introduce QPW for per-cpu operations

Commit Message

Leonardo Bras June 22, 2024, 3:58 a.m. UTC
Some places in the kernel implement a parallel programming strategy
consisting of local_lock() for most of the work, with the rare remote
operations scheduled on the target cpu. This keeps cache bouncing low,
since the cacheline tends to stay local, and it avoids the cost of locks on
non-RT kernels, even though the few remote operations are expensive due to
scheduling overhead.

On the other hand, for RT workloads this can be a problem: getting an
important workload scheduled out to deal with some unrelated task is bound
to introduce unexpected deadline misses.

It's interesting, though, that on RT kernels local_lock()s become
spinlock()s. We can make use of that to avoid scheduling work on a remote
cpu: directly update another cpu's per-cpu structure while holding its
spinlock().

To make that possible, introduce a new set of functions for acquiring
another cpu's per-cpu "local" lock (qpw_{un,}lock*()), along with the
corresponding queue_percpu_work_on() and flush_percpu_work() helpers to run
the remote work.

On non-RT kernels, no behavior change is expected, as each of the
introduced helpers maps directly onto the current implementation:
qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
queue_percpu_work_on()  ->  queue_work_on()
flush_percpu_work()     ->  flush_work()

For RT kernels, though, qpw_{un,}lock*() will use the extra cpu parameter
to select the correct per-cpu structure to work on, and acquire the
spinlock for that cpu.

queue_percpu_work_on() will just call the requested function on the current
cpu, which will then operate on another cpu's per-cpu object. Since
local_lock()s become spinlock()s in PREEMPT_RT, doing so is safe.

flush_percpu_work() then becomes a no-op since no work is actually
scheduled on a remote cpu.

Some minimal code rework is needed to make this mechanism work: the
local_{un,}lock*() calls in the functions that are currently scheduled on
remote cpus need to be replaced by qpw_{un,}lock*(), so on RT kernels they
can reference a different cpu. It's also necessary to use a qpw_struct
instead of a work_struct, but the former just wraps a work_struct and, on
PREEMPT_RT, the target cpu.

This should have almost no impact on non-RT kernels: a few this_cpu_ptr()
calls become per_cpu_ptr(..., smp_processor_id()).
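
For illustration only, a rough sketch of how a drain-style per-cpu user
could be converted follows. The names (my_pcp, my_drain_fn, my_drain_all)
are made up for this example rather than taken from the series, and it
assumes the work is queued on a cpu-bound workqueue so that, on non-RT
kernels, the work function runs on the target cpu:

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/qpw.h>

struct my_pcp {
	local_lock_t lock;		/* becomes a spinlock_t on PREEMPT_RT */
	int count;
	struct qpw_struct qpw;		/* was: struct work_struct work */
};

static DEFINE_PER_CPU(struct my_pcp, my_pcp) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

/* Work function: on RT this may run on a cpu other than qpw_get_cpu(w). */
static void my_drain_fn(struct work_struct *w)
{
	int cpu = qpw_get_cpu(w);			/* was: smp_processor_id() */
	struct my_pcp *p = per_cpu_ptr(&my_pcp, cpu);	/* was: this_cpu_ptr(&my_pcp) */

	qpw_lock(&my_pcp.lock, cpu);			/* was: local_lock(&my_pcp.lock) */
	p->count = 0;
	qpw_unlock(&my_pcp.lock, cpu);			/* was: local_unlock(&my_pcp.lock) */
}

/* Request a drain of every cpu's structure and wait for completion. */
static void my_drain_all(struct workqueue_struct *wq)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct my_pcp *p = per_cpu_ptr(&my_pcp, cpu);

		INIT_QPW(&p->qpw, my_drain_fn, cpu);
		queue_percpu_work_on(cpu, wq, &p->qpw);	/* was: queue_work_on() */
	}

	for_each_online_cpu(cpu)			/* no-op on PREEMPT_RT */
		flush_percpu_work(&per_cpu_ptr(&my_pcp, cpu)->qpw);
}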

On RT kernels, this should improve performance and reduce latency by
removing scheduling noise.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)
 create mode 100644 include/linux/qpw.h

Comments

Waiman Long Sept. 4, 2024, 9:39 p.m. UTC | #1
On 6/21/24 23:58, Leonardo Bras wrote:
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
>
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with some unrelated task is
> sure to introduce unexpected deadline misses.
>
> It's interesting, though, that local_lock()s in RT kernels become
> spinlock(). We can make use of those to avoid scheduling work on a remote
> cpu by directly updating another cpu's per_cpu structure, while holding
> it's spinlock().
>
> In order to do that, it's necessary to introduce a new set of functions to
> make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
> and also the corresponding queue_percpu_work_on() and flush_percpu_work()
> helpers to run the remote work.
>
> On non-RT kernels, no changes are expected, as every one of the introduced
> helpers work the exactly same as the current implementation:
> qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
> queue_percpu_work_on()  ->  queue_work_on()
> flush_percpu_work()     ->  flush_work()
>
> For RT kernels, though, qpw_{un,}lock*() will use the extra cpu parameter
> to select the correct per-cpu structure to work on, and acquire the
> spinlock for that cpu.
>
> queue_percpu_work_on() will just call the requested function in the current
> cpu, which will operate in another cpu's per-cpu object. Since the
> local_locks() become spinlock()s in PREEMPT_RT, we are safe doing that.
>
> flush_percpu_work() then becomes a no-op since no work is actually
> scheduled on a remote cpu.
>
> Some minimal code rework is needed in order to make this mechanism work:
> The calls for local_{un,}lock*() on the functions that are currently
> scheduled on remote cpus need to be replaced by qpw_{un,}lock_n*(), so in
> RT kernels they can reference a different cpu. It's also necessary to use a
> qpw_struct instead of a work_struct, but it just contains a work struct
> and, in PREEMPT_RT, the target cpu.
>
> This should have almost no impact on non-RT kernels: few this_cpu_ptr()
> will become per_cpu_ptr(,smp_processor_id()).
>
> On RT kernels, this should improve performance and reduce latency by
> removing scheduling noise.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>   include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 88 insertions(+)
>   create mode 100644 include/linux/qpw.h
>
> diff --git a/include/linux/qpw.h b/include/linux/qpw.h
> new file mode 100644
> index 000000000000..ea2686a01e5e
> --- /dev/null
> +++ b/include/linux/qpw.h
> @@ -0,0 +1,88 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_QPW_H
> +#define _LINUX_QPW_H
> +
> +#include "linux/local_lock.h"
> +#include "linux/workqueue.h"
> +
> +#ifndef CONFIG_PREEMPT_RT
> +
> +struct qpw_struct {
> +	struct work_struct work;
> +};
> +
> +#define qpw_lock(lock, cpu)					\
> +	local_lock(lock)
> +
> +#define qpw_unlock(lock, cpu)					\
> +	local_unlock(lock)
> +
> +#define qpw_lock_irqsave(lock, flags, cpu)			\
> +	local_lock_irqsave(lock, flags)
> +
> +#define qpw_unlock_irqrestore(lock, flags, cpu)			\
> +	local_unlock_irqrestore(lock, flags)
> +
> +#define queue_percpu_work_on(c, wq, qpw)			\
> +	queue_work_on(c, wq, &(qpw)->work)
> +
> +#define flush_percpu_work(qpw)					\
> +	flush_work(&(qpw)->work)
> +
> +#define qpw_get_cpu(qpw)					\
> +	smp_processor_id()
> +
> +#define INIT_QPW(qpw, func, c)					\
> +	INIT_WORK(&(qpw)->work, (func))
> +
> +#else /* !CONFIG_PREEMPT_RT */
> +
> +struct qpw_struct {
> +	struct work_struct work;
> +	int cpu;
> +};
> +
> +#define qpw_lock(__lock, cpu)					\
> +	do {							\
> +		migrate_disable();				\
> +		spin_lock(per_cpu_ptr((__lock), cpu));		\
> +	} while (0)
> +
> +#define qpw_unlock(__lock, cpu)					\
> +	do {							\
> +		spin_unlock(per_cpu_ptr((__lock), cpu));	\
> +		migrate_enable();				\
> +	} while (0)

Why is there a migrate_disable/enable() call in qpw_lock/unlock()? The
rt_spin_lock/unlock() calls already include a
migrate_disable/enable() pair.

> +
> +#define qpw_lock_irqsave(lock, flags, cpu)			\
> +	do {							\
> +		typecheck(unsigned long, flags);		\
> +		flags = 0;					\
> +		qpw_lock(lock, cpu);				\
> +	} while (0)
> +
> +#define qpw_unlock_irqrestore(lock, flags, cpu)			\
> +	qpw_unlock(lock, cpu)
> +
> +#define queue_percpu_work_on(c, wq, qpw)			\
> +	do {							\
> +		struct qpw_struct *__qpw = (qpw);		\
> +		WARN_ON((c) != __qpw->cpu);			\
> +		__qpw->work.func(&__qpw->work);			\
> +	} while (0)
> +
> +#define flush_percpu_work(qpw)					\
> +	do {} while (0)
> +
> +#define qpw_get_cpu(w)						\
> +	container_of((w), struct qpw_struct, work)->cpu
> +
> +#define INIT_QPW(qpw, func, c)					\
> +	do {							\
> +		struct qpw_struct *__qpw = (qpw);		\
> +		INIT_WORK(&__qpw->work, (func));		\
> +		__qpw->cpu = (c);				\
> +	} while (0)
> +
> +#endif /* CONFIG_PREEMPT_RT */
> +#endif /* LINUX_QPW_H */

You may also consider adding a documentation file about the 
qpw_lock/unlock() calls.

Cheers,
Longman
Waiman Long Sept. 5, 2024, 12:08 a.m. UTC | #2
On 9/4/24 17:39, Waiman Long wrote:
> On 6/21/24 23:58, Leonardo Bras wrote:
>> Some places in the kernel implement a parallel programming strategy
>> consisting on local_locks() for most of the work, and some rare remote
>> operations are scheduled on target cpu. This keeps cache bouncing low 
>> since
>> cacheline tends to be mostly local, and avoids the cost of locks in 
>> non-RT
>> kernels, even though the very few remote operations will be expensive 
>> due
>> to scheduling overhead.
>>
>> On the other hand, for RT workloads this can represent a problem: 
>> getting
>> an important workload scheduled out to deal with some unrelated task is
>> sure to introduce unexpected deadline misses.
>>
>> It's interesting, though, that local_lock()s in RT kernels become
>> spinlock(). We can make use of those to avoid scheduling work on a 
>> remote
>> cpu by directly updating another cpu's per_cpu structure, while holding
>> it's spinlock().
>>
>> In order to do that, it's necessary to introduce a new set of 
>> functions to
>> make it possible to get another cpu's per-cpu "local" lock 
>> (qpw_{un,}lock*)
>> and also the corresponding queue_percpu_work_on() and 
>> flush_percpu_work()
>> helpers to run the remote work.
>>
>> On non-RT kernels, no changes are expected, as every one of the 
>> introduced
>> helpers work the exactly same as the current implementation:
>> qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
>> queue_percpu_work_on()  ->  queue_work_on()
>> flush_percpu_work()     ->  flush_work()
>>
>> For RT kernels, though, qpw_{un,}lock*() will use the extra cpu 
>> parameter
>> to select the correct per-cpu structure to work on, and acquire the
>> spinlock for that cpu.
>>
>> queue_percpu_work_on() will just call the requested function in the 
>> current
>> cpu, which will operate in another cpu's per-cpu object. Since the
>> local_locks() become spinlock()s in PREEMPT_RT, we are safe doing that.
>>
>> flush_percpu_work() then becomes a no-op since no work is actually
>> scheduled on a remote cpu.
>>
>> Some minimal code rework is needed in order to make this mechanism work:
>> The calls for local_{un,}lock*() on the functions that are currently
>> scheduled on remote cpus need to be replaced by qpw_{un,}lock_n*(), 
>> so in
>> RT kernels they can reference a different cpu. It's also necessary to 
>> use a
>> qpw_struct instead of a work_struct, but it just contains a work struct
>> and, in PREEMPT_RT, the target cpu.
>>
>> This should have almost no impact on non-RT kernels: few this_cpu_ptr()
>> will become per_cpu_ptr(,smp_processor_id()).
>>
>> On RT kernels, this should improve performance and reduce latency by
>> removing scheduling noise.
>>
>> Signed-off-by: Leonardo Bras <leobras@redhat.com>
>> ---
>>   include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 88 insertions(+)
>>   create mode 100644 include/linux/qpw.h
>>
>> diff --git a/include/linux/qpw.h b/include/linux/qpw.h
>> new file mode 100644
>> index 000000000000..ea2686a01e5e
>> --- /dev/null
>> +++ b/include/linux/qpw.h
>> @@ -0,0 +1,88 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_QPW_H
>> +#define _LINUX_QPW_H

I would suggest adding a comment with a brief description of what 
qpw_lock/unlock() are for and their use cases. The "qpw" prefix itself 
isn't intuitive enough for a casual reader to understand what they are for.

Cheers,
Longman
Leonardo Bras Sept. 11, 2024, 7:17 a.m. UTC | #3
On Wed, Sep 04, 2024 at 05:39:01PM -0400, Waiman Long wrote:
> On 6/21/24 23:58, Leonardo Bras wrote:
> > Some places in the kernel implement a parallel programming strategy
> > consisting on local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem: getting
> > an important workload scheduled out to deal with some unrelated task is
> > sure to introduce unexpected deadline misses.
> > 
> > It's interesting, though, that local_lock()s in RT kernels become
> > spinlock(). We can make use of those to avoid scheduling work on a remote
> > cpu by directly updating another cpu's per_cpu structure, while holding
> > it's spinlock().
> > 
> > In order to do that, it's necessary to introduce a new set of functions to
> > make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
> > and also the corresponding queue_percpu_work_on() and flush_percpu_work()
> > helpers to run the remote work.
> > 
> > On non-RT kernels, no changes are expected, as every one of the introduced
> > helpers work the exactly same as the current implementation:
> > qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
> > queue_percpu_work_on()  ->  queue_work_on()
> > flush_percpu_work()     ->  flush_work()
> > 
> > For RT kernels, though, qpw_{un,}lock*() will use the extra cpu parameter
> > to select the correct per-cpu structure to work on, and acquire the
> > spinlock for that cpu.
> > 
> > queue_percpu_work_on() will just call the requested function in the current
> > cpu, which will operate in another cpu's per-cpu object. Since the
> > local_locks() become spinlock()s in PREEMPT_RT, we are safe doing that.
> > 
> > flush_percpu_work() then becomes a no-op since no work is actually
> > scheduled on a remote cpu.
> > 
> > Some minimal code rework is needed in order to make this mechanism work:
> > The calls for local_{un,}lock*() on the functions that are currently
> > scheduled on remote cpus need to be replaced by qpw_{un,}lock_n*(), so in
> > RT kernels they can reference a different cpu. It's also necessary to use a
> > qpw_struct instead of a work_struct, but it just contains a work struct
> > and, in PREEMPT_RT, the target cpu.
> > 
> > This should have almost no impact on non-RT kernels: few this_cpu_ptr()
> > will become per_cpu_ptr(,smp_processor_id()).
> > 
> > On RT kernels, this should improve performance and reduce latency by
> > removing scheduling noise.
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> >   include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 88 insertions(+)
> >   create mode 100644 include/linux/qpw.h
> > 
> > diff --git a/include/linux/qpw.h b/include/linux/qpw.h
> > new file mode 100644
> > index 000000000000..ea2686a01e5e
> > --- /dev/null
> > +++ b/include/linux/qpw.h
> > @@ -0,0 +1,88 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_QPW_H
> > +#define _LINUX_QPW_H
> > +
> > +#include "linux/local_lock.h"
> > +#include "linux/workqueue.h"
> > +
> > +#ifndef CONFIG_PREEMPT_RT
> > +
> > +struct qpw_struct {
> > +	struct work_struct work;
> > +};
> > +
> > +#define qpw_lock(lock, cpu)					\
> > +	local_lock(lock)
> > +
> > +#define qpw_unlock(lock, cpu)					\
> > +	local_unlock(lock)
> > +
> > +#define qpw_lock_irqsave(lock, flags, cpu)			\
> > +	local_lock_irqsave(lock, flags)
> > +
> > +#define qpw_unlock_irqrestore(lock, flags, cpu)			\
> > +	local_unlock_irqrestore(lock, flags)
> > +
> > +#define queue_percpu_work_on(c, wq, qpw)			\
> > +	queue_work_on(c, wq, &(qpw)->work)
> > +
> > +#define flush_percpu_work(qpw)					\
> > +	flush_work(&(qpw)->work)
> > +
> > +#define qpw_get_cpu(qpw)					\
> > +	smp_processor_id()
> > +
> > +#define INIT_QPW(qpw, func, c)					\
> > +	INIT_WORK(&(qpw)->work, (func))
> > +
> > +#else /* !CONFIG_PREEMPT_RT */
> > +
> > +struct qpw_struct {
> > +	struct work_struct work;
> > +	int cpu;
> > +};
> > +
> > +#define qpw_lock(__lock, cpu)					\
> > +	do {							\
> > +		migrate_disable();				\
> > +		spin_lock(per_cpu_ptr((__lock), cpu));		\
> > +	} while (0)
> > +
> > +#define qpw_unlock(__lock, cpu)					\
> > +	do {							\
> > +		spin_unlock(per_cpu_ptr((__lock), cpu));	\
> > +		migrate_enable();				\
> > +	} while (0)
> 
> Why there is a migrate_disable/enable() call in qpw_lock/unlock()? The
> rt_spin_lock/unlock() calls have already include a migrate_disable/enable()
> pair.

This was copied from PREEMPT_RT=y local_locks.

In my tree, I see:

#define __local_unlock(__lock)					\
	do {							\
		spin_unlock(this_cpu_ptr((__lock)));		\
		migrate_enable();				\
	} while (0)
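
And for reference, the matching PREEMPT_RT __local_lock() in the same
local_lock_internal.h looks roughly like this (quoted from memory, so
treat it as a sketch):

#define __local_lock(__lock)					\
	do {							\
		migrate_disable();				\
		spin_lock(this_cpu_ptr((__lock)));		\
	} while (0)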

But you are right:
For PREEMPT_RT=y, spin_{un,}lock() is defined in spinlock_rt.h
as rt_spin_{un,}lock(), which already runs migrate_{en,dis}able().

On the other hand, spin_lock() will run migrate_disable() just before
the function returns, while local_lock() will run it before calling
spin_lock(), and thus before spin_acquire().

(local_unlock() looks like it has an unnecessary extra migrate_enable(),
though.)

I am not sure if it's actually necessary to run this extra
migrate_disable() in the local_lock() case; maybe Thomas could help us
understand this.

But sure, if we can remove it from local_{un,}lock(), we can certainly
also remove it from qpw.


> 
> > +
> > +#define qpw_lock_irqsave(lock, flags, cpu)			\
> > +	do {							\
> > +		typecheck(unsigned long, flags);		\
> > +		flags = 0;					\
> > +		qpw_lock(lock, cpu);				\
> > +	} while (0)
> > +
> > +#define qpw_unlock_irqrestore(lock, flags, cpu)			\
> > +	qpw_unlock(lock, cpu)
> > +
> > +#define queue_percpu_work_on(c, wq, qpw)			\
> > +	do {							\
> > +		struct qpw_struct *__qpw = (qpw);		\
> > +		WARN_ON((c) != __qpw->cpu);			\
> > +		__qpw->work.func(&__qpw->work);			\
> > +	} while (0)
> > +
> > +#define flush_percpu_work(qpw)					\
> > +	do {} while (0)
> > +
> > +#define qpw_get_cpu(w)						\
> > +	container_of((w), struct qpw_struct, work)->cpu
> > +
> > +#define INIT_QPW(qpw, func, c)					\
> > +	do {							\
> > +		struct qpw_struct *__qpw = (qpw);		\
> > +		INIT_WORK(&__qpw->work, (func));		\
> > +		__qpw->cpu = (c);				\
> > +	} while (0)
> > +
> > +#endif /* CONFIG_PREEMPT_RT */
> > +#endif /* LINUX_QPW_H */
> 
> You may also consider adding a documentation file about the
> qpw_lock/unlock() calls.

Sure, will do when I send the non-RFC version. Thanks for pointing that 
out!

> 
> Cheers,
> Longman
> 

Thanks!
Leo
Leonardo Bras Sept. 11, 2024, 7:18 a.m. UTC | #4
On Wed, Sep 04, 2024 at 08:08:12PM -0400, Waiman Long wrote:
> On 9/4/24 17:39, Waiman Long wrote:
> > On 6/21/24 23:58, Leonardo Bras wrote:
> > > Some places in the kernel implement a parallel programming strategy
> > > consisting on local_locks() for most of the work, and some rare remote
> > > operations are scheduled on target cpu. This keeps cache bouncing
> > > low since
> > > cacheline tends to be mostly local, and avoids the cost of locks in
> > > non-RT
> > > kernels, even though the very few remote operations will be
> > > expensive due
> > > to scheduling overhead.
> > > 
> > > On the other hand, for RT workloads this can represent a problem:
> > > getting
> > > an important workload scheduled out to deal with some unrelated task is
> > > sure to introduce unexpected deadline misses.
> > > 
> > > It's interesting, though, that local_lock()s in RT kernels become
> > > spinlock(). We can make use of those to avoid scheduling work on a
> > > remote
> > > cpu by directly updating another cpu's per_cpu structure, while holding
> > > it's spinlock().
> > > 
> > > In order to do that, it's necessary to introduce a new set of
> > > functions to
> > > make it possible to get another cpu's per-cpu "local" lock
> > > (qpw_{un,}lock*)
> > > and also the corresponding queue_percpu_work_on() and
> > > flush_percpu_work()
> > > helpers to run the remote work.
> > > 
> > > On non-RT kernels, no changes are expected, as every one of the
> > > introduced
> > > helpers work the exactly same as the current implementation:
> > > qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
> > > queue_percpu_work_on()  ->  queue_work_on()
> > > flush_percpu_work()     ->  flush_work()
> > > 
> > > For RT kernels, though, qpw_{un,}lock*() will use the extra cpu
> > > parameter
> > > to select the correct per-cpu structure to work on, and acquire the
> > > spinlock for that cpu.
> > > 
> > > queue_percpu_work_on() will just call the requested function in the
> > > current
> > > cpu, which will operate in another cpu's per-cpu object. Since the
> > > local_locks() become spinlock()s in PREEMPT_RT, we are safe doing that.
> > > 
> > > flush_percpu_work() then becomes a no-op since no work is actually
> > > scheduled on a remote cpu.
> > > 
> > > Some minimal code rework is needed in order to make this mechanism work:
> > > The calls for local_{un,}lock*() on the functions that are currently
> > > scheduled on remote cpus need to be replaced by qpw_{un,}lock_n*(),
> > > so in
> > > RT kernels they can reference a different cpu. It's also necessary
> > > to use a
> > > qpw_struct instead of a work_struct, but it just contains a work struct
> > > and, in PREEMPT_RT, the target cpu.
> > > 
> > > This should have almost no impact on non-RT kernels: few this_cpu_ptr()
> > > will become per_cpu_ptr(,smp_processor_id()).
> > > 
> > > On RT kernels, this should improve performance and reduce latency by
> > > removing scheduling noise.
> > > 
> > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > ---
> > >   include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 88 insertions(+)
> > >   create mode 100644 include/linux/qpw.h
> > > 
> > > diff --git a/include/linux/qpw.h b/include/linux/qpw.h
> > > new file mode 100644
> > > index 000000000000..ea2686a01e5e
> > > --- /dev/null
> > > +++ b/include/linux/qpw.h
> > > @@ -0,0 +1,88 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _LINUX_QPW_H
> > > +#define _LINUX_QPW_H
> 
> I would suggest adding a comment with a brief description of what
> qpw_lock/unlock() are for and their use cases. The "qpw" prefix itself isn't
> intuitive enough for a casual reader to understand what they are for.

Agreed. I am also open to discussing a more intuitive name for these.

> 
> Cheers,
> Longman
> 

Thanks!
Leo
Waiman Long Sept. 11, 2024, 1:39 p.m. UTC | #5
On 9/11/24 03:17, Leonardo Bras wrote:
> On Wed, Sep 04, 2024 at 05:39:01PM -0400, Waiman Long wrote:
>> On 6/21/24 23:58, Leonardo Bras wrote:
>>> Some places in the kernel implement a parallel programming strategy
>>> consisting on local_locks() for most of the work, and some rare remote
>>> operations are scheduled on target cpu. This keeps cache bouncing low since
>>> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
>>> kernels, even though the very few remote operations will be expensive due
>>> to scheduling overhead.
>>>
>>> On the other hand, for RT workloads this can represent a problem: getting
>>> an important workload scheduled out to deal with some unrelated task is
>>> sure to introduce unexpected deadline misses.
>>>
>>> It's interesting, though, that local_lock()s in RT kernels become
>>> spinlock(). We can make use of those to avoid scheduling work on a remote
>>> cpu by directly updating another cpu's per_cpu structure, while holding
>>> it's spinlock().
>>>
>>> In order to do that, it's necessary to introduce a new set of functions to
>>> make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
>>> and also the corresponding queue_percpu_work_on() and flush_percpu_work()
>>> helpers to run the remote work.
>>>
>>> On non-RT kernels, no changes are expected, as every one of the introduced
>>> helpers work the exactly same as the current implementation:
>>> qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
>>> queue_percpu_work_on()  ->  queue_work_on()
>>> flush_percpu_work()     ->  flush_work()
>>>
>>> For RT kernels, though, qpw_{un,}lock*() will use the extra cpu parameter
>>> to select the correct per-cpu structure to work on, and acquire the
>>> spinlock for that cpu.
>>>
>>> queue_percpu_work_on() will just call the requested function in the current
>>> cpu, which will operate in another cpu's per-cpu object. Since the
>>> local_locks() become spinlock()s in PREEMPT_RT, we are safe doing that.
>>>
>>> flush_percpu_work() then becomes a no-op since no work is actually
>>> scheduled on a remote cpu.
>>>
>>> Some minimal code rework is needed in order to make this mechanism work:
>>> The calls for local_{un,}lock*() on the functions that are currently
>>> scheduled on remote cpus need to be replaced by qpw_{un,}lock_n*(), so in
>>> RT kernels they can reference a different cpu. It's also necessary to use a
>>> qpw_struct instead of a work_struct, but it just contains a work struct
>>> and, in PREEMPT_RT, the target cpu.
>>>
>>> This should have almost no impact on non-RT kernels: few this_cpu_ptr()
>>> will become per_cpu_ptr(,smp_processor_id()).
>>>
>>> On RT kernels, this should improve performance and reduce latency by
>>> removing scheduling noise.
>>>
>>> Signed-off-by: Leonardo Bras <leobras@redhat.com>
>>> ---
>>>    include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 88 insertions(+)
>>>    create mode 100644 include/linux/qpw.h
>>>
>>> diff --git a/include/linux/qpw.h b/include/linux/qpw.h
>>> new file mode 100644
>>> index 000000000000..ea2686a01e5e
>>> --- /dev/null
>>> +++ b/include/linux/qpw.h
>>> @@ -0,0 +1,88 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +#ifndef _LINUX_QPW_H
>>> +#define _LINUX_QPW_H
>>> +
>>> +#include "linux/local_lock.h"
>>> +#include "linux/workqueue.h"
>>> +
>>> +#ifndef CONFIG_PREEMPT_RT
>>> +
>>> +struct qpw_struct {
>>> +	struct work_struct work;
>>> +};
>>> +
>>> +#define qpw_lock(lock, cpu)					\
>>> +	local_lock(lock)
>>> +
>>> +#define qpw_unlock(lock, cpu)					\
>>> +	local_unlock(lock)
>>> +
>>> +#define qpw_lock_irqsave(lock, flags, cpu)			\
>>> +	local_lock_irqsave(lock, flags)
>>> +
>>> +#define qpw_unlock_irqrestore(lock, flags, cpu)			\
>>> +	local_unlock_irqrestore(lock, flags)
>>> +
>>> +#define queue_percpu_work_on(c, wq, qpw)			\
>>> +	queue_work_on(c, wq, &(qpw)->work)
>>> +
>>> +#define flush_percpu_work(qpw)					\
>>> +	flush_work(&(qpw)->work)
>>> +
>>> +#define qpw_get_cpu(qpw)					\
>>> +	smp_processor_id()
>>> +
>>> +#define INIT_QPW(qpw, func, c)					\
>>> +	INIT_WORK(&(qpw)->work, (func))
>>> +
>>> +#else /* !CONFIG_PREEMPT_RT */
>>> +
>>> +struct qpw_struct {
>>> +	struct work_struct work;
>>> +	int cpu;
>>> +};
>>> +
>>> +#define qpw_lock(__lock, cpu)					\
>>> +	do {							\
>>> +		migrate_disable();				\
>>> +		spin_lock(per_cpu_ptr((__lock), cpu));		\
>>> +	} while (0)
>>> +
>>> +#define qpw_unlock(__lock, cpu)					\
>>> +	do {							\
>>> +		spin_unlock(per_cpu_ptr((__lock), cpu));	\
>>> +		migrate_enable();				\
>>> +	} while (0)
>> Why there is a migrate_disable/enable() call in qpw_lock/unlock()? The
>> rt_spin_lock/unlock() calls have already include a migrate_disable/enable()
>> pair.
> This was copied from PREEMPT_RT=y local_locks.
>
> In my tree, I see:
>
> #define __local_unlock(__lock)					\
> 	do {							\
> 		spin_unlock(this_cpu_ptr((__lock)));		\
> 		migrate_enable();				\
> 	} while (0)
>
> But you are right:
> For PREEMPT_RT=y, spin_{un,}lock() will be defined in spinlock_rt.h
> as rt_spin{un,}lock(), which already runs migrate_{en,dis}able().
>
> On the other hand, for spin_lock() will run migrate_disable() just before
> finishing the function, and local_lock() will run it before calling
> spin_lock() and thus, before spin_acquire().
>
> (local_unlock looks like to have an unnecessary extra migrate_enable(),
> though).
>
> I am not sure if it's actually necessary to run this extra
> migrate_disable() in local_lock() case, maybe Thomas could help us
> understand this.
>
> But sure, if we can remove this from local_{un,}lock(), I am sure we can
> also remove this from qpw.

I see. I believe the reason for this extra migrate_disable/enable() is
to protect the this_cpu_ptr() call: it prevents switching to another CPU
right after this_cpu_ptr() but before the migrate_disable() inside
rt_spin_lock(). So keep the migrate_disable/enable() as is.
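
To illustrate the window being closed (hypothetical interleaving, a sketch
only, not actual kernel code):

	lock = this_cpu_ptr(__lock);	/* resolves to CPU A's lock */
	/* <task is preempted and migrates to CPU B> */
	rt_spin_lock(lock);		/* migrate_disable() only happens here,
					 * so we now run on CPU B while holding
					 * CPU A's "local" lock */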

Cheers,
Longman

Patch

diff --git a/include/linux/qpw.h b/include/linux/qpw.h
new file mode 100644
index 000000000000..ea2686a01e5e
--- /dev/null
+++ b/include/linux/qpw.h
@@ -0,0 +1,88 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_QPW_H
+#define _LINUX_QPW_H
+
+#include "linux/local_lock.h"
+#include "linux/workqueue.h"
+
+#ifndef CONFIG_PREEMPT_RT
+
+struct qpw_struct {
+	struct work_struct work;
+};
+
+#define qpw_lock(lock, cpu)					\
+	local_lock(lock)
+
+#define qpw_unlock(lock, cpu)					\
+	local_unlock(lock)
+
+#define qpw_lock_irqsave(lock, flags, cpu)			\
+	local_lock_irqsave(lock, flags)
+
+#define qpw_unlock_irqrestore(lock, flags, cpu)			\
+	local_unlock_irqrestore(lock, flags)
+
+#define queue_percpu_work_on(c, wq, qpw)			\
+	queue_work_on(c, wq, &(qpw)->work)
+
+#define flush_percpu_work(qpw)					\
+	flush_work(&(qpw)->work)
+
+#define qpw_get_cpu(qpw)					\
+	smp_processor_id()
+
+#define INIT_QPW(qpw, func, c)					\
+	INIT_WORK(&(qpw)->work, (func))
+
+#else /* !CONFIG_PREEMPT_RT */
+
+struct qpw_struct {
+	struct work_struct work;
+	int cpu;
+};
+
+#define qpw_lock(__lock, cpu)					\
+	do {							\
+		migrate_disable();				\
+		spin_lock(per_cpu_ptr((__lock), cpu));		\
+	} while (0)
+
+#define qpw_unlock(__lock, cpu)					\
+	do {							\
+		spin_unlock(per_cpu_ptr((__lock), cpu));	\
+		migrate_enable();				\
+	} while (0)
+
+#define qpw_lock_irqsave(lock, flags, cpu)			\
+	do {							\
+		typecheck(unsigned long, flags);		\
+		flags = 0;					\
+		qpw_lock(lock, cpu);				\
+	} while (0)
+
+#define qpw_unlock_irqrestore(lock, flags, cpu)			\
+	qpw_unlock(lock, cpu)
+
+#define queue_percpu_work_on(c, wq, qpw)			\
+	do {							\
+		struct qpw_struct *__qpw = (qpw);		\
+		WARN_ON((c) != __qpw->cpu);			\
+		__qpw->work.func(&__qpw->work);			\
+	} while (0)
+
+#define flush_percpu_work(qpw)					\
+	do {} while (0)
+
+#define qpw_get_cpu(w)						\
+	container_of((w), struct qpw_struct, work)->cpu
+
+#define INIT_QPW(qpw, func, c)					\
+	do {							\
+		struct qpw_struct *__qpw = (qpw);		\
+		INIT_WORK(&__qpw->work, (func));		\
+		__qpw->cpu = (c);				\
+	} while (0)
+
+#endif /* CONFIG_PREEMPT_RT */
+#endif /* _LINUX_QPW_H */