
[v9,01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout()

Message ID 20241107190818.522639-2-ankur.a.arora@oracle.com (mailing list archive)
State New
Series arm64: support poll_idle()

Commit Message

Ankur Arora Nov. 7, 2024, 7:08 p.m. UTC
Add a timed variant of smp_cond_load_relaxed().

This is useful because arm64 supports polling on a conditional variable
by directly waiting on the cacheline instead of spin waiting for the
condition to change.

However, an implementation such as this has a problem that it can block
forever -- unless there's an explicit timeout or another out-of-band
mechanism which allows it to come out of the wait state periodically.

smp_cond_load_relaxed_timeout() supports these semantics by specifying
a time-check expression and an associated time-limit.

However, note that for the generic spin-wait implementation we want to
minimize the number of instructions executed in each iteration. So,
limit how often we evaluate the time-check expression by doing it once
every smp_cond_time_check_count iterations.

The inner loop in poll_idle() has a structure and constraints substantially
similar to those of smp_cond_load_relaxed_timeout(), so define
smp_cond_time_check_count to the same value used in poll_idle().
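
For illustration, a caller might look something like the following
(placeholder names only; the poll_idle() conversion itself comes later in
this series):

	/* Sketch: ti and POLL_LIMIT_NS are placeholders. */
	u64 start_ns = local_clock();
	unsigned long flags;

	flags = smp_cond_load_relaxed_timeout(&ti->flags,
					      VAL & _TIF_NEED_RESCHED,
					      local_clock(),
					      start_ns + POLL_LIMIT_NS);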

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/asm-generic/barrier.h | 42 +++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

Comments

Christoph Lameter (Ampere) Nov. 8, 2024, 2:33 a.m. UTC | #1
On Thu, 7 Nov 2024, Ankur Arora wrote:

> +#ifndef smp_cond_time_check_count
> +/*
> + * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns.
> + * This helps reduce the number of instructions executed while spin-waiting.
> + */
> +#define smp_cond_time_check_count	200
> +#endif

I don't like these loops that execute differently depending on the
hardware. Can we use cycles and ns instead to have defined periods of
time? Later patches establish the infrastructure to convert cycles to
nanoseconds and microseconds. Use that?

> +#ifndef smp_cond_load_relaxed_timeout
> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
> +				      time_limit_ns) ({			\
> +	typeof(ptr) __PTR = (ptr);					\
> +	__unqual_scalar_typeof(*ptr) VAL;				\
> +	unsigned int __count = 0;					\
> +	for (;;) {							\
> +		VAL = READ_ONCE(*__PTR);				\
> +		if (cond_expr)						\
> +			break;						\
> +		cpu_relax();						\
> +		if (__count++ < smp_cond_time_check_count)		\
> +			continue;					\
> +		if ((time_expr_ns) >= time_limit_ns)			\
> +			break;						\

Calling the clock retrieval function repeatedly should be fine and is
typically done in user space as well as in kernel space for functions that
need to wait short time periods.
Ankur Arora Nov. 8, 2024, 7:53 a.m. UTC | #2
Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Thu, 7 Nov 2024, Ankur Arora wrote:
>
>> +#ifndef smp_cond_time_check_count
>> +/*
>> + * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns.
>> + * This helps reduce the number of instructions executed while spin-waiting.
>> + */
>> +#define smp_cond_time_check_count	200
>> +#endif
>
> I don't like these loops that execute differently depending on the
> hardware. Can we use cycles and ns instead to have defined periods of
> time? Later patches establish the infrastructure to convert cycles to
> nanoseconds and microseconds. Use that?
>
>> +#ifndef smp_cond_load_relaxed_timeout
>> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
>> +				      time_limit_ns) ({			\
>> +	typeof(ptr) __PTR = (ptr);					\
>> +	__unqual_scalar_typeof(*ptr) VAL;				\
>> +	unsigned int __count = 0;					\
>> +	for (;;) {							\
>> +		VAL = READ_ONCE(*__PTR);				\
>> +		if (cond_expr)						\
>> +			break;						\
>> +		cpu_relax();						\
>> +		if (__count++ < smp_cond_time_check_count)		\
>> +			continue;					\
>> +		if ((time_expr_ns) >= time_limit_ns)			\
>> +			break;						\
>
> Calling the clock retrieval function repeatedly should be fine and is
> typically done in user space as well as in kernel space for functions that
> need to wait short time periods.

The problem is that you might have multiple CPUs polling in idle
for prolonged periods of time. And, so you want to minimize
your power/thermal envelope.

For instance see commit 4dc2375c1a4e "cpuidle: poll_state: Avoid
invoking local_clock() too often" which originally added a similar
rate limit to poll_idle() where they saw exactly that issue.
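
The rate-limited inner loop in poll_idle() is roughly this (paraphrased
sketch; limit and dev come from the enclosing function):

	u64 time_start = local_clock();
	unsigned int loop_count = 0;

	while (!need_resched()) {
		cpu_relax();
		if (loop_count++ < POLL_IDLE_RELAX_COUNT)
			continue;

		loop_count = 0;
		if (local_clock() - time_start > limit) {
			dev->poll_time_limit = true;
			break;
		}
	}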

--
ankur
Christoph Lameter (Ampere) Nov. 8, 2024, 7:41 p.m. UTC | #3
On Thu, 7 Nov 2024, Ankur Arora wrote:

> > Calling the clock retrieval function repeatedly should be fine and is
> > typically done in user space as well as in kernel space for functions that
> > need to wait short time periods.
>
> The problem is that you might have multiple CPUs polling in idle
> for prolonged periods of time. And, so you want to minimize
> your power/thermal envelope.

On ARM that maps to YIELD which does not do anything for the power
envelope AFAICT. It switches to the other hyperthread.

> For instance see commit 4dc2375c1a4e "cpuidle: poll_state: Avoid
> invoking local_clock() too often" which originally added a similar
> rate limit to poll_idle() where they saw exactly that issue.

Looping w/o calling local_clock may increase the wait period etc.

For power saving, most arches have special instructions like ARM's
WFE/WFET. These should then give more accurate wait times than the looping
approach?
Ankur Arora Nov. 8, 2024, 10:15 p.m. UTC | #4
Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Thu, 7 Nov 2024, Ankur Arora wrote:
>
>> > Calling the clock retrieval function repeatedly should be fine and is
>> > typically done in user space as well as in kernel space for functions that
>> > need to wait short time periods.
>>
>> The problem is that you might have multiple CPUs polling in idle
>> for prolonged periods of time. And, so you want to minimize
>> your power/thermal envelope.
>
> On ARM that maps to YIELD which does not do anything for the power
> envelope AFAICT. It switches to the other hyperthread.

Agreed. For arm64 patch-5 adds a specialized version.

For the fallback case when we don't have an event stream, the
arm64 version does use the same cpu_relax() loop but that's
not a production thing.
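
In rough outline, the arm64 flavour waits on the cacheline with
__cmpwait_relaxed() (i.e. WFE), relying on the event stream to bound how
long a wait can go without a wakeup (sketch only; patch-5 has the real
thing):

	#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
					      time_limit_ns) ({			\
		typeof(ptr) __PTR = (ptr);					\
		__unqual_scalar_typeof(*ptr) VAL;				\
		for (;;) {							\
			VAL = READ_ONCE(*__PTR);				\
			if (cond_expr)						\
				break;						\
			/* WFE: wakes on a store to *__PTR or on an		\
			 * event stream tick (~100us apart).			\
			 */							\
			__cmpwait_relaxed(__PTR, VAL);				\
			if ((time_expr_ns) >= (time_limit_ns))			\
				break;						\
		}								\
		(typeof(*ptr))VAL;						\
	})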

>> For instance see commit 4dc2375c1a4e "cpuidle: poll_state: Avoid
>> invoking local_clock() too often" which originally added a similar
>> rate limit to poll_idle() where they saw exactly that issue.
>
> Looping w/o calling local_clock may increase the wait period etc.

Yeah. I don't think that's a real problem for the poll_idle()
case as the only thing waiting on the other side of the possibly
delayed timer is a deeper idle state.

But, for any other potential users, the looping duration might be
too long (the generated code for x86 will execute around 200 * 7 = 1400
instructions before checking the timer, so a worst-case delay of,
say, around 1-2us.)

I'll note that in the comment around smp_cond_time_check_count
just to warn any future users.

> For power saving, most arches have special instructions like ARM's
> WFE/WFET. These should then give more accurate wait times than the looping
> approach?

Definitely true for WFET. The WFE can still overshoot because the
event stream has a period of 100us.

--
ankur
Christoph Lameter (Ampere) Nov. 12, 2024, 4:50 p.m. UTC | #5
On Fri, 8 Nov 2024, Ankur Arora wrote:

> > For power saving, most arches have special instructions like ARM's
> > WFE/WFET. These should then give more accurate wait times than the looping
> > approach?
>
> Definitely true for WFET. The WFE can still overshoot because the
> event stream has a period of 100us.

We can only use the event stream if we need to wait more than 100us.

The rest of the wait period can be covered by a busy loop. Thus we are
accurate.
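
Something along these lines (hypothetical sketch, not from this series;
ptr, old and timeout_ns are made-up names, and 100us stands in for the
event stream period):

	unsigned long old = READ_ONCE(*ptr), val;
	u64 now, deadline = local_clock() + timeout_ns;

	for (;;) {
		val = READ_ONCE(*ptr);
		if (val != old)
			break;
		now = local_clock();
		if (now >= deadline)
			break;
		if (deadline - now > 100 * NSEC_PER_USEC)
			__cmpwait_relaxed(ptr, val);	/* WFE, coarse wait */
		else
			cpu_relax();			/* busy-wait the tail */
	}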
Catalin Marinas Nov. 14, 2024, 5:22 p.m. UTC | #6
On Fri, Nov 08, 2024 at 11:41:08AM -0800, Christoph Lameter (Ampere) wrote:
> On Thu, 7 Nov 2024, Ankur Arora wrote:
> > > Calling the clock retrieval function repeatedly should be fine and is
> > > typically done in user space as well as in kernel space for functions that
> > > need to wait short time periods.
> >
> > The problem is that you might have multiple CPUs polling in idle
> > for prolonged periods of time. And, so you want to minimize
> > your power/thermal envelope.
> 
> On ARM that maps to YIELD which does not do anything for the power
> envelope AFAICT. It switches to the other hyperthread.

The issue is not necessarily arm64 but poll_idle() on other
architectures like x86 where, at the end of this series, they still call
cpu_relax() in a loop and check local_clock() every 200 or so iterations.
So I wouldn't want to revert the improvement in 4dc2375c1a4e
("cpuidle: poll_state: Avoid invoking local_clock() too often").

I agree that the 200 iterations here is pretty random; it was
something made up for poll_idle() specifically, and it could increase the
wait period in other situations (or on other architectures).

OTOH, I'm not sure we want to make this API too complex if the only
user for a while would be poll_idle(). We could add a comment that the
timeout granularity can be pretty coarse and architecture dependent (200
cpu_relax() calls in one deployment, 100us on arm64 with WFE).
Ankur Arora Nov. 15, 2024, 12:28 a.m. UTC | #7
Catalin Marinas <catalin.marinas@arm.com> writes:

> On Fri, Nov 08, 2024 at 11:41:08AM -0800, Christoph Lameter (Ampere) wrote:
>> On Thu, 7 Nov 2024, Ankur Arora wrote:
>> > > Calling the clock retrieval function repeatedly should be fine and is
>> > > typically done in user space as well as in kernel space for functions that
>> > > need to wait short time periods.
>> >
>> > The problem is that you might have multiple CPUs polling in idle
>> > for prolonged periods of time. And, so you want to minimize
>> > your power/thermal envelope.
>>
>> On ARM that maps to YIELD which does not do anything for the power
>> envelope AFAICT. It switches to the other hyperthread.
>
> The issue is not necessarily arm64 but poll_idle() on other
> architectures like x86 where, at the end of this series, they still call
> cpu_relax() in a loop and check local_clock() every 200 or so iterations.
> So I wouldn't want to revert the improvement in 4dc2375c1a4e
> ("cpuidle: poll_state: Avoid invoking local_clock() too often").
>
> I agree that the 200 iterations here is pretty random; it was
> something made up for poll_idle() specifically, and it could increase the
> wait period in other situations (or on other architectures).
>
> OTOH, I'm not sure we want to make this API too complex if the only
> user for a while would be poll_idle(). We could add a comment that the
> timeout granularity can be pretty coarse and architecture dependent (200
> cpu_relax() calls in one deployment, 100us on arm64 with WFE).

Yeah, agreed. Not worth over-engineering this interface, at least not
until there are other users. For now I'll just add a comment mentioning
that the time-check is only coarse grained and architecture dependent.

--
ankur
Ankur Arora Nov. 26, 2024, 5:01 a.m. UTC | #8
Ankur Arora <ankur.a.arora@oracle.com> writes:

> Add a timed variant of smp_cond_load_relaxed().
>
> This is useful because arm64 supports polling on a conditional variable
> by directly waiting on the cacheline instead of spin waiting for the
> condition to change.
>
> However, an implementation such as this has a problem that it can block
> forever -- unless there's an explicit timeout or another out-of-band
> mechanism which allows it to come out of the wait state periodically.
>
> smp_cond_load_relaxed_timeout() supports these semantics by specifying
> a time-check expression and an associated time-limit.
>
> However, note that for the generic spin-wait implementation we want to
> minimize the number of instructions executed in each iteration. So,
> limit how often we evaluate the time-check expression by doing it once
> every smp_cond_time_check_count iterations.
>
> The inner loop in poll_idle() has a structure and constraints substantially
> similar to those of smp_cond_load_relaxed_timeout(), so define
> smp_cond_time_check_count to the same value used in poll_idle().
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  include/asm-generic/barrier.h | 42 +++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
>
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..77726ef807e4 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -273,6 +273,48 @@ do {									\
>  })
>  #endif
>
> +#ifndef smp_cond_time_check_count
> +/*
> + * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns.
> + * This helps reduce the number of instructions executed while spin-waiting.
> + */
> +#define smp_cond_time_check_count	200
> +#endif
> +
> +/**
> + * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
> + * guarantees until a timeout expires.
> + * @ptr: pointer to the variable to wait on
> + * @cond: boolean expression to wait for
> + * @time_expr_ns: evaluates to the current time
> + * @time_limit_ns: compared against time_expr_ns
> + *
> + * Equivalent to using READ_ONCE() on the condition variable.
> + *
> + * Due to C lacking lambda expressions we load the value of *ptr into a
> + * pre-named variable @VAL to be used in @cond.

Based on the review comments so far I'm planning to add the following
text to this comment:

  Note that in the generic version the time check is done only coarsely
  to minimize instructions executed while spin-waiting.

  Architecture specific variations might also have their own timeout
  granularity.

Meanwhile, would appreciate more reviews.

Thanks
Ankur

> + */
> +#ifndef smp_cond_load_relaxed_timeout
> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
> +				      time_limit_ns) ({			\
> +	typeof(ptr) __PTR = (ptr);					\
> +	__unqual_scalar_typeof(*ptr) VAL;				\
> +	unsigned int __count = 0;					\
> +	for (;;) {							\
> +		VAL = READ_ONCE(*__PTR);				\
> +		if (cond_expr)						\
> +			break;						\
> +		cpu_relax();						\
> +		if (__count++ < smp_cond_time_check_count)		\
> +			continue;					\
> +		if ((time_expr_ns) >= time_limit_ns)			\
> +			break;						\
> +		__count = 0;						\
> +	}								\
> +	(typeof(*ptr))VAL;						\
> +})
> +#endif
> +
>  /*
>   * pmem_wmb() ensures that all stores for which the modification
>   * are written to persistent storage by preceding instructions have
Catalin Marinas Nov. 26, 2024, 10:36 a.m. UTC | #9
On Mon, Nov 25, 2024 at 09:01:56PM -0800, Ankur Arora wrote:
> Ankur Arora <ankur.a.arora@oracle.com> writes:
> > +/**
> > + * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
> > + * guarantees until a timeout expires.
> > + * @ptr: pointer to the variable to wait on
> > + * @cond: boolean expression to wait for
> > + * @time_expr_ns: evaluates to the current time
> > + * @time_limit_ns: compared against time_expr_ns
> > + *
> > + * Equivalent to using READ_ONCE() on the condition variable.
> > + *
> > + * Due to C lacking lambda expressions we load the value of *ptr into a
> > + * pre-named variable @VAL to be used in @cond.
> 
> Based on the review comments so far I'm planning to add the following
> text to this comment:
> 
>   Note that in the generic version the time check is done only coarsely
>   to minimize instructions executed while spin-waiting.
> 
>   Architecture specific variations might also have their own timeout
>   granularity.

Looks good.

> Meanwhile, would appreciate more reviews.

It's the middle of the merge window; usually not much review happens
unless it's for fixes/regressions.

Patch

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..77726ef807e4 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,48 @@  do {									\
 })
 #endif
 
+#ifndef smp_cond_time_check_count
+/*
+ * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns.
+ * This helps reduce the number of instructions executed while spin-waiting.
+ */
+#define smp_cond_time_check_count	200
+#endif
+
+/**
+ * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
+ * guarantees until a timeout expires.
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ * @time_expr_ns: evaluates to the current time
+ * @time_limit_ns: compared against time_expr_ns
+ *
+ * Equivalent to using READ_ONCE() on the condition variable.
+ *
+ * Due to C lacking lambda expressions we load the value of *ptr into a
+ * pre-named variable @VAL to be used in @cond.
+ */
+#ifndef smp_cond_load_relaxed_timeout
+#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
+				      time_limit_ns) ({			\
+	typeof(ptr) __PTR = (ptr);					\
+	__unqual_scalar_typeof(*ptr) VAL;				\
+	unsigned int __count = 0;					\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		cpu_relax();						\
+		if (__count++ < smp_cond_time_check_count)		\
+			continue;					\
+		if ((time_expr_ns) >= time_limit_ns)			\
+			break;						\
+		__count = 0;						\
+	}								\
+	(typeof(*ptr))VAL;						\
+})
+#endif
+
 /*
  * pmem_wmb() ensures that all stores for which the modification
  * are written to persistent storage by preceding instructions have