
[v1] rcu: Fix and improve RCU read lock checks when !CONFIG_DEBUG_LOCK_ALLOC

Message ID 20230711233816.2187577-1-dhavale@google.com (mailing list archive)
State New, archived
Series [v1] rcu: Fix and improve RCU read lock checks when !CONFIG_DEBUG_LOCK_ALLOC

Commit Message

Sandeep Dhavale July 11, 2023, 11:38 p.m. UTC
Currently, if CONFIG_DEBUG_LOCK_ALLOC is not set:

- rcu_read_lock_held() always returns 1
- rcu_read_lock_any_held() may return 0 with CONFIG_PREEMPT_RCU

This is inconsistent, and it was discovered while trying a fix for
the problem reported in [1] with CONFIG_DEBUG_LOCK_ALLOC not set and
CONFIG_PREEMPT_RCU enabled. The gist of the problem is that EROFS
wants to detect atomic context so it can do inline decompression
whenever possible, which is an important performance optimization.
It turns out that z_erofs_decompressqueue_endio() can be called from
blk_mq_flush_plug_list() with the RCU read lock held, and hence the
fix uses rcu_read_lock_any_held() to decide between sync/inline and
async decompression.

As per the documentation, we should report the lock as held if we
aren't certain. But it seems we can improve the checks for whether
the lock is held even if CONFIG_DEBUG_LOCK_ALLOC is not set, instead
of hardcoding the result to always be true.

* rcu_read_lock_held()
- For CONFIG_PREEMPT_RCU, using rcu_preempt_depth()
- Otherwise, using preemptible() (indirectly preempt_count())

* rcu_read_lock_bh_held()
- For CONFIG_PREEMPT_RT, using in_softirq() (indirectly softirq_count())
- Otherwise, using preemptible() (indirectly preempt_count())

Lastly, to fix the inconsistency, rcu_read_lock_any_held() is updated
to use the other rcu_read_lock_*_held() checks.

Two of the improved checks are moved to kernel/rcu/update.c because
rcupdate.h is included from very low level headers which do not know
what current (task_struct) is, so the macro rcu_preempt_depth() cannot
be expanded in rcupdate.h. See the original comment for
rcu_preempt_depth() in the patch at [2] for more information.
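
For reference, rcu_preempt_depth() under CONFIG_PREEMPT_RCU boils down
to the following (a paraphrased sketch, not the verbatim kernel source),
which makes the dependency on current explicit:

	/* Sketch: the PREEMPT_RCU nesting counter lives in task_struct. */
	#define rcu_preempt_depth() READ_ONCE(current->rcu_read_lock_nesting)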

[1]
https://lore.kernel.org/all/20230621220848.3379029-1-dhavale@google.com/
[2]
https://lore.kernel.org/all/1281392111-25060-8-git-send-email-paulmck@linux.vnet.ibm.com/

Reported-by: Will Shiu <Will.Shiu@mediatek.com>
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
---
 include/linux/rcupdate.h | 12 +++---------
 kernel/rcu/update.c      | 21 ++++++++++++++++++++-
 2 files changed, 23 insertions(+), 10 deletions(-)

Comments

Joel Fernandes July 12, 2023, 5:02 p.m. UTC | #1
On Tue, Jul 11, 2023 at 7:38 PM Sandeep Dhavale <dhavale@google.com> wrote:
>
> Currently, if CONFIG_DEBUG_LOCK_ALLOC is not set:
>
> - rcu_read_lock_held() always returns 1
> - rcu_read_lock_any_held() may return 0 with CONFIG_PREEMPT_RCU
>
> This is inconsistent, and it was discovered while trying a fix for
> the problem reported in [1] with CONFIG_DEBUG_LOCK_ALLOC not set and
> CONFIG_PREEMPT_RCU enabled. The gist of the problem is that EROFS
> wants to detect atomic context so it can do inline decompression
> whenever possible, which is an important performance optimization.
> It turns out that z_erofs_decompressqueue_endio() can be called from
> blk_mq_flush_plug_list() with the RCU read lock held, and hence the
> fix uses rcu_read_lock_any_held() to decide between sync/inline and
> async decompression.
>
> As per the documentation, we should report the lock as held if we
> aren't certain. But it seems we can improve the checks for whether
> the lock is held even if CONFIG_DEBUG_LOCK_ALLOC is not set, instead
> of hardcoding the result to always be true.
>
> * rcu_read_lock_held()
> - For CONFIG_PREEMPT_RCU, using rcu_preempt_depth()
> - Otherwise, using preemptible() (indirectly preempt_count())
>
> * rcu_read_lock_bh_held()
> - For CONFIG_PREEMPT_RT, using in_softirq() (indirectly softirq_count())
> - Otherwise, using preemptible() (indirectly preempt_count())
>
> Lastly, to fix the inconsistency, rcu_read_lock_any_held() is updated
> to use the other rcu_read_lock_*_held() checks.
>
> Two of the improved checks are moved to kernel/rcu/update.c because
> rcupdate.h is included from very low level headers which do not know
> what current (task_struct) is, so the macro rcu_preempt_depth() cannot
> be expanded in rcupdate.h. See the original comment for
> rcu_preempt_depth() in the patch at [2] for more information.
>
> [1]
> https://lore.kernel.org/all/20230621220848.3379029-1-dhavale@google.com/
> [2]
> https://lore.kernel.org/all/1281392111-25060-8-git-send-email-paulmck@linux.vnet.ibm.com/
>
> Reported-by: Will Shiu <Will.Shiu@mediatek.com>
> Signed-off-by: Sandeep Dhavale <dhavale@google.com>
> ---
>  include/linux/rcupdate.h | 12 +++---------
>  kernel/rcu/update.c      | 21 ++++++++++++++++++++-
>  2 files changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 5e5f920ade90..0d1d1d8c2360 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -319,14 +319,11 @@ int rcu_read_lock_any_held(void);
>  # define rcu_lock_acquire(a)           do { } while (0)
>  # define rcu_lock_release(a)           do { } while (0)
>
> -static inline int rcu_read_lock_held(void)
> -{
> -       return 1;
> -}
> +int rcu_read_lock_held(void);
>
>  static inline int rcu_read_lock_bh_held(void)
>  {
> -       return 1;
> +       return !preemptible() || in_softirq();
>  }
>
>  static inline int rcu_read_lock_sched_held(void)
> @@ -334,10 +331,7 @@ static inline int rcu_read_lock_sched_held(void)
>         return !preemptible();
>  }
>
> -static inline int rcu_read_lock_any_held(void)
> -{
> -       return !preemptible();
> -}
> +int rcu_read_lock_any_held(void);
>
>  static inline int debug_lockdep_rcu_enabled(void)
>  {
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 19bf6fa3ee6a..b34fc5bb96cf 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -390,8 +390,27 @@ int rcu_read_lock_any_held(void)
>  }
>  EXPORT_SYMBOL_GPL(rcu_read_lock_any_held);
>
> -#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> +#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>
> +int rcu_read_lock_held(void)
> +{
> +       if (IS_ENABLED(CONFIG_PREEMPT_RCU))
> +               return rcu_preempt_depth();
> +       return !preemptible();
> +}
> +EXPORT_SYMBOL_GPL(rcu_read_lock_held);
> +
> +int rcu_read_lock_any_held(void)
> +{
> +       if (rcu_read_lock_held() ||
> +           rcu_read_lock_bh_held() ||
> +           rcu_read_lock_sched_held())
> +               return 1;
> +       return !preemptible();

Actually even the original code is incorrect (the lockdep version)
since preemptible() cannot be relied upon if CONFIG_PREEMPT_COUNT is
not set. However, that's debug code. In this case, it is a real
user (the fs code). In non-preemptible kernels, we are always in an
RCU-sched section. So you can't really see if anyone called your code
under rcu_read_lock(). The rcu_read_lock/unlock() would be getting
NOOPed. In such a kernel, it will always tell your code it is in an
RCU reader. That's not ideal for that erofs code?
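
For reference, a trimmed paraphrase of the relevant definitions from
include/linux/rcupdate.h and include/linux/preempt.h (details vary by
kernel version):

	/* !CONFIG_PREEMPT_RCU: rcu_read_lock() is just preempt_disable(). */
	static inline void __rcu_read_lock(void)
	{
		preempt_disable();
	}

	/* !CONFIG_PREEMPT_COUNT: preempt_disable() is only a compiler
	 * barrier, and preemptible() is hardwired to 0, so neither one
	 * leaves any trace of the reader. */
	#define preempt_disable()	barrier()
	#define preemptible()		0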

Also, per that erofs code:
        /* Use (kthread_)work and sync decompression for atomic contexts only */
        if (!in_task() || irqs_disabled() || rcu_read_lock_any_held()) {

I guess you are also assuming that rcu_read_lock_any_held() tells you
something about atomicity but strictly speaking, it doesn't because
preemptible RCU readers are preemptible. You can't block but
preemption is possible so it is not "atomic". Maybe you meant "cannot
block"?

As such this patch looks correct to me, one thing I noticed is that
you can check rcu_is_watching() like the lockdep-enabled code does.
That will tell you also if a reader-section is possible because in
extended-quiescent-states, RCU readers should be non-existent or
that's a bug.

Could you also verify that this patch does not cause bloating of the
kernel if lockdep is disabled?

thanks,

 - Joel
Sandeep Dhavale July 12, 2023, 9:20 p.m. UTC | #2
Hi Joel,
Thank you for the review!
> Actually even the original code is incorrect (the lockdep version)
> since preemptible() cannot be relied upon if CONFIG_PREEMPT_COUNT is
> not set. However, that's debug code. In this case, it is a real
> user (the fs code). In non-preemptible kernels, we are always in an
> RCU-sched section. So you can't really see if anyone called your code
> under rcu_read_lock(). The rcu_read_lock/unlock() would be getting
> NOOPed. In such a kernel, it will always tell your code it is in an
> RCU reader. That's not ideal for that erofs code?
>
> Also, per that erofs code:
>         /* Use (kthread_)work and sync decompression for atomic contexts only */
>         if (!in_task() || irqs_disabled() || rcu_read_lock_any_held()) {
>
> I guess you are also assuming that rcu_read_lock_any_held() tells you
> something about atomicity but strictly speaking, it doesn't because
> preemptible RCU readers are preemptible. You can't block but
> preemption is possible so it is not "atomic". Maybe you meant "cannot
> block"?
>
Yes, you are correct. For decompression, erofs needs to grab the
pcluster mutex, so erofs wants to know whether it can block in order
to decide between sync and async decompression.

Good point about !CONFIG_PREEMPT_COUNT. Yes, in that case erofs
will always do async decompression, which is less than ideal. Maybe
the erofs check can be modified to account for !CONFIG_PREEMPT_COUNT.

> As such this patch looks correct to me, one thing I noticed is that
> you can check rcu_is_watching() like the lockdep-enabled code does.
> That will tell you also if a reader-section is possible because in
> extended-quiescent-states, RCU readers should be non-existent or
> that's a bug.
>
Please correct me if I am wrong, reading from the comment in
kernel/rcu/update.c rcu_read_lock_held_common()
..
  * The reason for this is that RCU ignores CPUs that are
 * in such a section, considering these as in extended quiescent state,
 * so such a CPU is effectively never in an RCU read-side critical section
 * regardless of what RCU primitives it invokes.

It seems rcu will treat this as lock not held rather than a fact that
lock is not held. Is my understanding correct?
The reason I chose not to consult rcu_is_watching() in this version
is that the "sleeping function called from invalid context" check
will still get triggered (please see __might_resched() in
kernel/sched/core.c), as it does not consult rcu_is_watching() but
instead looks at rcu_preempt_depth(), which will be non-zero if
rcu_read_lock() was called (only when CONFIG_PREEMPT_RCU is enabled).
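
For reference, the gate in kernel/sched/core.c that makes this so, in
a trimmed paraphrase (details vary by kernel version):

	static int preempt_count_equals(int preempt_offset)
	{
		/* rcu_preempt_depth() is nonzero inside rcu_read_lock()
		 * when CONFIG_PREEMPT_RCU is enabled, so __might_resched()
		 * treats the context as atomic without ever consulting
		 * rcu_is_watching(). */
		int nested = preempt_count() + rcu_preempt_depth();

		return (nested == preempt_offset);
	}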

> Could you also verify that this patch does not cause bloating of the
> kernel if lockdep is disabled?
>
Sure, I will do the comparison and send the details.

Thanks,
Sandeep.

> thanks,
>
>  - Joel
Joel Fernandes July 13, 2023, 12:32 a.m. UTC | #3
On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
[..]
> > As such this patch looks correct to me, one thing I noticed is that
> > you can check rcu_is_watching() like the lockdep-enabled code does.
> > That will tell you also if a reader-section is possible because in
> > extended-quiescent-states, RCU readers should be non-existent or
> > that's a bug.
> >
> Please correct me if I am wrong, reading from the comment in
> kernel/rcu/update.c rcu_read_lock_held_common()
> ..
>   * The reason for this is that RCU ignores CPUs that are
>  * in such a section, considering these as in extended quiescent state,
>  * so such a CPU is effectively never in an RCU read-side critical section
>  * regardless of what RCU primitives it invokes.
> 
> It seems rcu will treat this as lock not held rather than a fact that
> lock is not held. Is my understanding correct?

If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe you
mean it is not a fact for erofs?

> The reason I chose not to consult rcu_is_watching() in this version
> is that the "sleeping function called from invalid context" check
> will still get triggered (please see __might_resched() in
> kernel/sched/core.c), as it does not consult rcu_is_watching() but
> instead looks at rcu_preempt_depth(), which will be non-zero if
> rcu_read_lock() was called (only when CONFIG_PREEMPT_RCU is enabled).

I am assuming you mean you would grab the mutex accidentally when in an RCU
reader, and might_sleep() presumably in the mutex internal code will scream?

I would expect in the erofs code that rcu_is_watching() should always return
true, so it should not affect the decision of whether to block or not. I am
suggesting add the check for rcu_is_watching() into the *held() functions for
completeness.

// rcu_is_watching() is true when RCU is actively watching the CPU,
// so the early return below only fires in an extended quiescent state.
bool rcu_read_lock_any_held() {
	if (!rcu_is_watching())
		return false;
	// do the rest..
}

> > Could you also verify that this patch does not cause bloating of the
> > kernel if lockdep is disabled?
> >
> Sure, I will do the comparison and send the details.

Thanks! This is indeed an interesting use case of grabbing a mutex /
blocking in the reader.

thanks,

 - Joel
Gao Xiang July 13, 2023, 2:02 a.m. UTC | #4
On 2023/7/13 08:32, Joel Fernandes wrote:
> On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
> [..]
>>> As such this patch looks correct to me, one thing I noticed is that
>>> you can check rcu_is_watching() like the lockdep-enabled code does.
>>> That will tell you also if a reader-section is possible because in
>>> extended-quiescent-states, RCU readers should be non-existent or
>>> that's a bug.
>>>
>> Please correct me if I am wrong, reading from the comment in
>> kernel/rcu/update.c rcu_read_lock_held_common()
>> ..
>>    * The reason for this is that RCU ignores CPUs that are
>>   * in such a section, considering these as in extended quiescent state,
>>   * so such a CPU is effectively never in an RCU read-side critical section
>>   * regardless of what RCU primitives it invokes.
>>
>> It seems rcu will treat this as lock not held rather than a fact that
>> lock is not held. Is my understanding correct?
> 
> If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe you
> mean it is not a fact for erofs?

I'm not sure if I get what you mean, EROFS doesn't take any RCU read lock
here:

z_erofs_decompressqueue_endio() is actually a "bio->bi_end_io", which
previously could be called under two scenarios:

  1) under softirq context, which is actually part of device I/O completion;

  2) under threaded context, like what dm-verity or likewise calls.

But EROFS needs to decompress in a threaded context anyway, so we trigger
a workqueue to resolve case 1).


Recently, someone reported there could be some case 3) [I think it was
introduced recently but I have no time to dig into it]:

  case 3: under RCU read lock context, which is shown by this:
https://lore.kernel.org/r/4a8254eb-ac39-1e19-3d82-417d3a7b9f94@linux.alibaba.com/T/#u

  and such RCU read lock is taken in __blk_mq_run_dispatch_ops().

But as the commit shows, we only need to trigger a workqueue for cases 1)
and 3) due to performance reasons.

Hopefully that shows it more clearly.
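
A simplified sketch of the dispatch decision being described
(illustrative only, not the verbatim fs/erofs code; io and
z_erofs_workqueue stand in for the real objects):

	/* Cases 1) and 3): atomic or RCU-reader context, so defer to a
	 * workqueue; case 2): threaded context, so decompress inline. */
	if (!in_task() || irqs_disabled() || rcu_read_lock_any_held())
		queue_work(z_erofs_workqueue, &io->u.work);
	else
		z_erofs_decompressqueue_work(&io->u.work);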

Thanks,
Gao Xiang
Gao Xiang July 13, 2023, 2:10 a.m. UTC | #5
On 2023/7/13 10:02, Gao Xiang wrote:
> 
> 
> On 2023/7/13 08:32, Joel Fernandes wrote:
>> On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
>> [..]
>>>> As such this patch looks correct to me, one thing I noticed is that
>>>> you can check rcu_is_watching() like the lockdep-enabled code does.
>>>> That will tell you also if a reader-section is possible because in
>>>> extended-quiescent-states, RCU readers should be non-existent or
>>>> that's a bug.
>>>>
>>> Please correct me if I am wrong, reading from the comment in
>>> kernel/rcu/update.c rcu_read_lock_held_common()
>>> ..
>>>    * The reason for this is that RCU ignores CPUs that are
>>>   * in such a section, considering these as in extended quiescent state,
>>>   * so such a CPU is effectively never in an RCU read-side critical section
>>>   * regardless of what RCU primitives it invokes.
>>>
>>> It seems rcu will treat this as lock not held rather than a fact that
>>> lock is not held. Is my understanding correct?
>>
>> If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe you
>> mean it is not a fact for erofs?
> 
> I'm not sure if I get what you mean, EROFS doesn't take any RCU read lock
> here:
> 
> z_erofs_decompressqueue_endio() is actually a "bio->bi_end_io", which
> previously could be called under two scenarios:
> 
>   1) under softirq context, which is actually part of device I/O completion;
> 
>   2) under threaded context, like what dm-verity or likewise calls.
> 
> But EROFS needs to decompress in a threaded context anyway, so we trigger
> a workqueue to resolve case 1).
> 
> 
> Recently, someone reported there could be some case 3) [I think it was
> introduced recently but I have no time to dig into it]:
> 
>   case 3: under RCU read lock context, which is shown by this:
> https://lore.kernel.org/r/4a8254eb-ac39-1e19-3d82-417d3a7b9f94@linux.alibaba.com/T/#u

Sorry about the incorrect link (I really don't know who initially reported
this and on which device):

https://lore.kernel.org/linux-erofs/161f1615-3d85-cf47-d2d5-695adf1ca7d4@linux.alibaba.com/T/#t

> 
>   and such RCU read lock is taken in __blk_mq_run_dispatch_ops().
> 
> But as the commit shows, we only need to trigger a workqueue for cases 1)
> and 3) due to performance reasons.
> 
> Hopefully that shows it more clearly.
> 
> Thanks,
> Gao Xiang
Joel Fernandes July 13, 2023, 2:16 a.m. UTC | #6
> On Jul 12, 2023, at 10:02 PM, Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> 
> 
> 
>> On 2023/7/13 08:32, Joel Fernandes wrote:
>> On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
>> [..]
>>>> As such this patch looks correct to me, one thing I noticed is that
>>>> you can check rcu_is_watching() like the lockdep-enabled code does.
>>>> That will tell you also if a reader-section is possible because in
>>>> extended-quiescent-states, RCU readers should be non-existent or
>>>> that's a bug.
>>>> 
>>> Please correct me if I am wrong, reading from the comment in
>>> kernel/rcu/update.c rcu_read_lock_held_common()
>>> ..
>>>   * The reason for this is that RCU ignores CPUs that are
>>>  * in such a section, considering these as in extended quiescent state,
>>>  * so such a CPU is effectively never in an RCU read-side critical section
>>>  * regardless of what RCU primitives it invokes.
>>> 
>>> It seems rcu will treat this as lock not held rather than a fact that
>>> lock is not held. Is my understanding correct?
>> If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe you
>> mean it is not a fact for erofs?
> 
> I'm not sure if I get what you mean, EROFS doesn't take any RCU read lock

We are discussing the case 3 you mentioned below.

> here:
> 
> z_erofs_decompressqueue_endio() is actually a "bio->bi_end_io", which
> previously could be called under two scenarios:
> 
> 1) under softirq context, which is actually part of device I/O completion;
> 
> 2) under threaded context, like what dm-verity or likewise calls.
> 
> But EROFS needs to decompress in a threaded context anyway, so we trigger
> a workqueue to resolve case 1).
> 
> 
> Recently, someone reported there could be some case 3) [I think it was
> introduced recently but I have no time to dig into it]:
> 
> case 3: under RCU read lock context, which is shown by this:
> https://lore.kernel.org/r/4a8254eb-ac39-1e19-3d82-417d3a7b9f94@linux.alibaba.com/T/#u
> 
> and such RCU read lock is taken in __blk_mq_run_dispatch_ops().
> 
> But as the commit shows, we only need to trigger a workqueue for cases 1)
> and 3) due to performance reasons.
> 
> Hopefully that shows it more clearly.

Makes sense. Thanks,

 - Joel

> 
> Thanks,
> Gao Xiang
Paul E. McKenney July 13, 2023, 4:27 a.m. UTC | #7
On Thu, Jul 13, 2023 at 10:02:17AM +0800, Gao Xiang wrote:
> 
> 
> On 2023/7/13 08:32, Joel Fernandes wrote:
> > On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
> > [..]
> > > > As such this patch looks correct to me, one thing I noticed is that
> > > > you can check rcu_is_watching() like the lockdep-enabled code does.
> > > > That will tell you also if a reader-section is possible because in
> > > > extended-quiescent-states, RCU readers should be non-existent or
> > > > that's a bug.
> > > > 
> > > Please correct me if I am wrong, reading from the comment in
> > > kernel/rcu/update.c rcu_read_lock_held_common()
> > > ..
> > >    * The reason for this is that RCU ignores CPUs that are
> > >   * in such a section, considering these as in extended quiescent state,
> > >   * so such a CPU is effectively never in an RCU read-side critical section
> > >   * regardless of what RCU primitives it invokes.
> > > 
> > > It seems rcu will treat this as lock not held rather than a fact that
> > > lock is not held. Is my understanding correct?
> > 
> > If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe you
> > mean it is not a fact for erofs?
> 
> I'm not sure if I get what you mean, EROFS doesn't take any RCU read lock
> here:

The key point is that we need lockdep to report errors when
rcu_read_lock(), rcu_dereference(), and friends are used when RCU is
not watching.  We also need lockdep to report an error when someone
uses rcu_dereference() when RCU is not watching, but also forgets the
rcu_read_lock().

And this is the job of rcu_read_lock_held(), which is one reason why
that rcu_is_watching() is needed.
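
For reference, the lockdep-enabled build already implements this policy
along these lines (a trimmed paraphrase of rcu_read_lock_held_common()
in kernel/rcu/update.c):

	static bool rcu_read_lock_held_common(bool *ret)
	{
		if (!debug_lockdep_rcu_enabled()) {
			*ret = true;	/* not debugging: play it safe */
			return true;
		}
		if (!rcu_is_watching()) {
			*ret = false;	/* EQS: no legal RCU reader here */
			return true;
		}
		if (!rcu_lockdep_current_cpu_online()) {
			*ret = false;	/* offline CPU: no reader either */
			return true;
		}
		return false;	/* defer to lockdep's lock_is_held() */
	}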

> z_erofs_decompressqueue_endio() is actually a "bio->bi_end_io", which
> previously could be called under two scenarios:
> 
>  1) under softirq context, which is actually part of device I/O completion;
> 
>  2) under threaded context, like what dm-verity or likewise calls.
> 
> But EROFS needs to decompress in a threaded context anyway, so we trigger
> a workqueue to resolve case 1).
> 
> Recently, someone reported there could be some case 3) [I think it was
> introduced recently but I have no time to dig into it]:
> 
>  case 3: under RCU read lock context, which is shown by this:
> https://lore.kernel.org/r/4a8254eb-ac39-1e19-3d82-417d3a7b9f94@linux.alibaba.com/T/#u
> 
>  and such RCU read lock is taken in __blk_mq_run_dispatch_ops().
> 
> But as the commit shows, we only need to trigger a workqueue for cases 1)
> and 3) due to performance reasons.

Just out of curiosity, exactly how much is it costing to trigger the
workqueue?

> Hopefully that shows it more clearly.

One additional question...  What is your plan for kernels built with
CONFIG_PREEMPT_COUNT=n?  After all, in such kernels, there is no way
that I know of for code to determine whether it is in an RCU read-side
critical section, holding a spinlock, or running with preemption disabled.

						Thanx, Paul
Gao Xiang July 13, 2023, 4:41 a.m. UTC | #8
On 2023/7/13 12:27, Paul E. McKenney wrote:
> On Thu, Jul 13, 2023 at 10:02:17AM +0800, Gao Xiang wrote:
>>
>>
>> On 2023/7/13 08:32, Joel Fernandes wrote:
>>> On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
>>> [..]
>>>>> As such this patch looks correct to me, one thing I noticed is that
>>>>> you can check rcu_is_watching() like the lockdep-enabled code does.
>>>>> That will tell you also if a reader-section is possible because in
>>>>> extended-quiescent-states, RCU readers should be non-existent or
>>>>> that's a bug.
>>>>>
>>>> Please correct me if I am wrong, reading from the comment in
>>>> kernel/rcu/update.c rcu_read_lock_held_common()
>>>> ..
>>>>     * The reason for this is that RCU ignores CPUs that are
>>>>    * in such a section, considering these as in extended quiescent state,
>>>>    * so such a CPU is effectively never in an RCU read-side critical section
>>>>    * regardless of what RCU primitives it invokes.
>>>>
>>>> It seems rcu will treat this as lock not held rather than a fact that
>>>> lock is not held. Is my understanding correct?
>>>
>>> If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe you
>>> mean it is not a fact for erofs?
>>
>> I'm not sure if I get what you mean, EROFS doesn't take any RCU read lock
>> here:
> 
> The key point is that we need lockdep to report errors when
> rcu_read_lock(), rcu_dereference(), and friends are used when RCU is
> not watching.  We also need lockdep to report an error when someone
> uses rcu_dereference() when RCU is not watching, but also forgets the
> rcu_read_lock().
> 
> And this is the job of rcu_read_lock_held(), which is one reason why
> that rcu_is_watching() is needed.
> 
>> z_erofs_decompressqueue_endio() is actually a "bio->bi_end_io", which
>> previously could be called under two scenarios:
>>
>>   1) under softirq context, which is actually part of device I/O completion;
>>
>>   2) under threaded context, like what dm-verity or likewise calls.
>>
>> But EROFS needs to decompress in a threaded context anyway, so we trigger
>> a workqueue to resolve case 1).
>>
>> Recently, someone reported there could be some case 3) [I think it was
>> introduced recently but I have no time to dig into it]:
>>
>>   case 3: under RCU read lock context, which is shown by this:
>> https://lore.kernel.org/r/4a8254eb-ac39-1e19-3d82-417d3a7b9f94@linux.alibaba.com/T/#u
>>
>>   and such RCU read lock is taken in __blk_mq_run_dispatch_ops().
>>
>> But as the commit shows, we only need to trigger a workqueue for cases 1)
>> and 3) due to performance reasons.
> 
> Just out of curiosity, exactly how much is it costing to trigger the
> workqueue?

There are lots of performance issues here and even a plumber
topic last year to show that, see:

[1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
[2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
[3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
[4] https://lpc.events/event/16/contributions/1338/
and more.

I'm not sure if it's necessary to look into all of that,
and Sandeep knows more than I do (the scheduling issue
becomes vital on some aarch64 platforms.)

Thanks,
Gao Xiang
Gao Xiang July 13, 2023, 4:51 a.m. UTC | #9
On 2023/7/13 12:27, Paul E. McKenney wrote:
> On Thu, Jul 13, 2023 at 10:02:17AM +0800, Gao Xiang wrote:
>>
>>

... sorry forget some.

> 
> One additional question...  What is your plan for kernels built with
> CONFIG_PREEMPT_COUNT=n?  After all, in such kernels, there is no way
> that I know of for code to determine whether it is in an RCU read-side
> critical section, holding a spinlock, or running with preemption disabled.

I'm not sure if Android (or all targeted) users use or care about
this configuration (CONFIG_PREEMPT_COUNT=n), personally I think
for such configuration we could just fall back to the workqueue
approach all the time.

Anyway, such an optimization really comes from real workloads /
experience; users don't live well without such mitigation.

Thanks,
Gao Xiang

> 
> 						Thanx, Paul
Paul E. McKenney July 13, 2023, 4:52 a.m. UTC | #10
On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
> 
> 
> On 2023/7/13 12:27, Paul E. McKenney wrote:
> > On Thu, Jul 13, 2023 at 10:02:17AM +0800, Gao Xiang wrote:
> > > 
> > > 
> > > On 2023/7/13 08:32, Joel Fernandes wrote:
> > > > On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
> > > > [..]
> > > > > > As such this patch looks correct to me, one thing I noticed is that
> > > > > > you can check rcu_is_watching() like the lockdep-enabled code does.
> > > > > > That will tell you also if a reader-section is possible because in
> > > > > > extended-quiescent-states, RCU readers should be non-existent or
> > > > > > that's a bug.
> > > > > > 
> > > > > Please correct me if I am wrong, reading from the comment in
> > > > > kernel/rcu/update.c rcu_read_lock_held_common()
> > > > > ..
> > > > >     * The reason for this is that RCU ignores CPUs that are
> > > > >    * in such a section, considering these as in extended quiescent state,
> > > > >    * so such a CPU is effectively never in an RCU read-side critical section
> > > > >    * regardless of what RCU primitives it invokes.
> > > > > 
> > > > > It seems rcu will treat this as lock not held rather than a fact that
> > > > > lock is not held. Is my understanding correct?
> > > > 
> > > > If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe you
> > > > mean it is not a fact for erofs?
> > > 
> > > I'm not sure if I get what you mean, EROFS doesn't take any RCU read lock
> > > here:
> > 
> > The key point is that we need lockdep to report errors when
> > rcu_read_lock(), rcu_dereference(), and friends are used when RCU is
> > not watching.  We also need lockdep to report an error when someone
> > uses rcu_dereference() when RCU is not watching, but also forgets the
> > rcu_read_lock().
> > 
> > And this is the job of rcu_read_lock_held(), which is one reason why
> > that rcu_is_watching() is needed.
> > 
> > > z_erofs_decompressqueue_endio() is actually a "bio->bi_end_io", which
> > > previously could be called under two scenarios:
> > > 
> > >   1) under softirq context, which is actually part of device I/O completion;
> > > 
> > >   2) under threaded context, like what dm-verity or likewise calls.
> > > 
> > > But EROFS needs to decompress in a threaded context anyway, so we trigger
> > > a workqueue to resolve case 1).
> > > 
> > > Recently, someone reported there could be some case 3) [I think it was
> > > introduced recently but I have no time to dig into it]:
> > > 
> > >   case 3: under RCU read lock context, which is shown by this:
> > > https://lore.kernel.org/r/4a8254eb-ac39-1e19-3d82-417d3a7b9f94@linux.alibaba.com/T/#u
> > > 
> > >   and such RCU read lock is taken in __blk_mq_run_dispatch_ops().
> > > 
> > > But as the commit shows, we only need to trigger a workqueue for cases 1)
> > > and 3) due to performance reasons.
> > 
> > Just out of curiosity, exactly how much is it costing to trigger the
> > workqueue?
> 
> There are lots of performance issues here and even a plumber
> topic last year to show that, see:
> 
> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
> [4] https://lpc.events/event/16/contributions/1338/
> and more.
> 
> I'm not sure if it's necessary to look into all of that,
> and Sandeep knows more than I do (the scheduling issue
> becomes vital on some aarch64 platforms.)

Hmmm...  Please let me try again.

Assuming that this approach turns out to make sense, the resulting
patch will need to clearly state the performance benefits directly in
the commit log.

And of course, for the approach to make sense, it must avoid breaking
the existing lockdep-RCU debugging code.

Is that more clear?

							Thanx, Paul
Gao Xiang July 13, 2023, 4:59 a.m. UTC | #11
On 2023/7/13 12:52, Paul E. McKenney wrote:
> On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
>>
>>

...

>>
>> There are lots of performance issues here and even a plumber
>> topic last year to show that, see:
>>
>> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
>> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
>> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
>> [4] https://lpc.events/event/16/contributions/1338/
>> and more.
>>
>> I'm not sure if it's necessary to look info all of that,
>> andSandeep knows more than I am (the scheduling issue
>> becomes vital on some aarch64 platform.)
> 
> Hmmm...  Please let me try again.
> 
> Assuming that this approach turns out to make sense, the resulting
> patch will need to clearly state the performance benefits directly in
> the commit log.
> 
> And of course, for the approach to make sense, it must avoid breaking
> the existing lockdep-RCU debugging code.
> 
> Is that more clear?

Personally I'm not working on the Android platform any more so I don't
have a way to reproduce; hopefully Sandeep could give actual
numbers _again_ if dm-verity is enabled and trigger another
workqueue here and make a comparison showing why the scheduling latency
of the extra work becomes unacceptable.

Thanks,
Gao Xiang
Joel Fernandes July 13, 2023, 2:07 p.m. UTC | #12
On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> On 2023/7/13 12:52, Paul E. McKenney wrote:
> > On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
> >>
> >>
>
> ...
>
> >>
> >> There are lots of performance issues here and even a plumber
> >> topic last year to show that, see:
> >>
> >> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
> >> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
> >> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
> >> [4] https://lpc.events/event/16/contributions/1338/
> >> and more.
> >>
> >> I'm not sure if it's necessary to look into all of that,
> >> and Sandeep knows more than I do (the scheduling issue
> >> becomes vital on some aarch64 platforms.)
> >
> > Hmmm...  Please let me try again.
> >
> > Assuming that this approach turns out to make sense, the resulting
> > patch will need to clearly state the performance benefits directly in
> > the commit log.
> >
> > And of course, for the approach to make sense, it must avoid breaking
> > the existing lockdep-RCU debugging code.
> >
> > Is that more clear?
>
> Personally I'm not working on the Android platform any more so I don't
> have a way to reproduce; hopefully Sandeep could give actual
> numbers _again_ if dm-verity is enabled and trigger another
> workqueue here and make a comparison showing why the scheduling latency
> of the extra work becomes unacceptable.
>

Question from my side, are we talking about only performance issues or
also a crash? It appears z_erofs_decompress_pcluster() takes
mutex_lock(&pcl->lock);

So if it is either in an RCU read-side critical section or in an
atomic section, like the softirq path, then it may
schedule-while-atomic or trigger RCU warnings.

z_erofs_decompressqueue_endio
-> z_erofs_decompress_kickoff
 ->z_erofs_decompressqueue_work
  ->z_erofs_decompress_queue
   -> z_erofs_decompress_pcluster
    -> mutex_lock

Per Sandeep in [1], this stack happens under RCU read-lock in:

#define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
[...]
                rcu_read_lock();
                (dispatch_ops);
                rcu_read_unlock();
[...]

Coming from:
blk_mq_flush_plug_list ->
                           blk_mq_run_dispatch_ops(q,
                                __blk_mq_flush_plug_list(q, plug));

and __blk_mq_flush_plug_list does this:
          q->mq_ops->queue_rqs(&plug->mq_list);

This somehow ends up calling the bio_endio and the
z_erofs_decompressqueue_endio which grabs the mutex.

So... I have a question, it looks like one of the paths in
__blk_mq_run_dispatch_ops() uses SRCU, whereas the alternate
path uses RCU. Why does this alternate want to block even if it is not
supposed to? Is the real issue here that the BLK_MQ_F_BLOCKING should
be set? It sounds like you want to block in the "else" path even
though BLK_MQ_F_BLOCKING is not set:

/* run the code block in @dispatch_ops with rcu/srcu read lock held */
#define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
do {                                                            \
        if ((q)->tag_set->flags & BLK_MQ_F_BLOCKING) {          \
                struct blk_mq_tag_set *__tag_set = (q)->tag_set; \
                int srcu_idx;                                   \
                                                                \
                might_sleep_if(check_sleep);                    \
                srcu_idx = srcu_read_lock(__tag_set->srcu);     \
                (dispatch_ops);                                 \
                srcu_read_unlock(__tag_set->srcu, srcu_idx);    \
        } else {                                                \
                rcu_read_lock();                                \
                (dispatch_ops);                                 \
                rcu_read_unlock();                              \
        }                                                       \
} while (0);

It seems quite incorrect to me that erofs wants to block even though
clearly BLK_MQ_F_BLOCKING is not set. Did I miss something? I believe
it may be more beneficial to correct this properly now, rather than
maintaining the appearance of non-blocking operations and potentially
having to manage atomicity detection with preempt counters at a later
stage.

thanks,

 - Joel

[1] https://lore.kernel.org/all/CAB=BE-QV0PiKFpCOcdEUFxS4wJHsLCcsymAw+nP72Yr3b1WMng@mail.gmail.com/
Gao Xiang July 13, 2023, 2:34 p.m. UTC | #13
On 2023/7/13 22:07, Joel Fernandes wrote:
> On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>> On 2023/7/13 12:52, Paul E. McKenney wrote:
>>> On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
>>>>
>>>>
>>
>> ...
>>
>>>>
>>>> There are lots of performance issues here and even a plumber
>>>> topic last year to show that, see:
>>>>
>>>> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
>>>> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
>>>> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
>>>> [4] https://lpc.events/event/16/contributions/1338/
>>>> and more.
>>>>
>>>> I'm not sure if it's necessary to look into all of that,
>>>> and Sandeep knows more than I do (the scheduling issue
>>>> becomes vital on some aarch64 platforms.)
>>>
>>> Hmmm...  Please let me try again.
>>>
>>> Assuming that this approach turns out to make sense, the resulting
>>> patch will need to clearly state the performance benefits directly in
>>> the commit log.
>>>
>>> And of course, for the approach to make sense, it must avoid breaking
>>> the existing lockdep-RCU debugging code.
>>>
>>> Is that more clear?
>>
>> Personally I'm not working on the Android platform any more so I don't
>> have a way to reproduce; hopefully Sandeep could give actual
>> numbers _again_ if dm-verity is enabled and trigger another
>> workqueue here and make a comparison showing why the scheduling latency
>> of the extra work becomes unacceptable.
>>
> 
> Question from my side, are we talking about only performance issues or
> also a crash? It appears z_erofs_decompress_pcluster() takes
> mutex_lock(&pcl->lock);
> 
> So if it is either in an RCU read-side critical section or in an
> atomic section, like the softirq path, then it may
> schedule-while-atomic or trigger RCU warnings.
> 
> z_erofs_decompressqueue_endio
> -> z_erofs_decompress_kickoff
>   ->z_erofs_decompressqueue_work
>    ->z_erofs_decompress_queue
>     -> z_erofs_decompress_pcluster
>      -> mutex_lock
> 

Why does the softirq path not trigger a workqueue instead? Why would
it trigger "schedule-while-atomic" here in the softirq context?

> Per Sandeep in [1], this stack happens under RCU read-lock in:
> 
> #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
> [...]
>                  rcu_read_lock();
>                  (dispatch_ops);
>                  rcu_read_unlock();
> [...]
> 
> Coming from:
> blk_mq_flush_plug_list ->
>                             blk_mq_run_dispatch_ops(q,
>                                  __blk_mq_flush_plug_list(q, plug));
> 
> and __blk_mq_flush_plug_list does this:
>            q->mq_ops->queue_rqs(&plug->mq_list);
> 
> This somehow ends up calling the bio_endio and the
> z_erofs_decompressqueue_endio which grabs the mutex.
> 
> So... I have a question, it looks like one of the paths in
> __blk_mq_run_dispatch_ops() uses SRCU, whereas the alternate
> path uses RCU. Why does this alternate want to block even if it is not
> supposed to? Is the real issue here that the BLK_MQ_F_BLOCKING should
> be set? It sounds like you want to block in the "else" path even
> though BLK_MQ_F_BLOCKING is not set:

BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with.
That is block layer and mq device driver stuff; filesystems cannot set
this value.

As I said, as far as I understand, previously
.end_io() could only be called without RCU context, so it was fine,
but I don't know since when .end_io() can be called under some RCU
context now.


Thanks,
Gao Xiang
Joel Fernandes July 13, 2023, 3:33 p.m. UTC | #14
On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
>
>
> On 2023/7/13 22:07, Joel Fernandes wrote:
> > On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >> On 2023/7/13 12:52, Paul E. McKenney wrote:
> >>> On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
> >>>>
> >>>>
> >>
> >> ...
> >>
> >>>>
> >>>> There are lots of performance issues here and even a plumber
> >>>> topic last year to show that, see:
> >>>>
> >>>> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
> >>>> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
> >>>> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
> >>>> [4] https://lpc.events/event/16/contributions/1338/
> >>>> and more.
> >>>>
> >>>> I'm not sure if it's necessary to look into all of that,
> >>>> and Sandeep knows more than I do (the scheduling issue
> >>>> becomes vital on some aarch64 platforms.)
> >>>
> >>> Hmmm...  Please let me try again.
> >>>
> >>> Assuming that this approach turns out to make sense, the resulting
> >>> patch will need to clearly state the performance benefits directly in
> >>> the commit log.
> >>>
> >>> And of course, for the approach to make sense, it must avoid breaking
> >>> the existing lockdep-RCU debugging code.
> >>>
> >>> Is that more clear?
> >>
> >> Personally I'm not working on the Android platform any more so I don't
> >> have a way to reproduce; hopefully Sandeep could give actual
> >> numbers _again_ if dm-verity is enabled and trigger another
> >> workqueue here and make a comparison showing why the scheduling latency
> >> of the extra work becomes unacceptable.
> >>
> >
> > Question from my side, are we talking about only performance issues or
> > also a crash? It appears z_erofs_decompress_pcluster() takes
> > mutex_lock(&pcl->lock);
> >
> > So if it is either in an RCU read-side critical section or in an
> > atomic section, like the softirq path, then it may
> > schedule-while-atomic or trigger RCU warnings.
> >
> > z_erofs_decompressqueue_endio
> > -> z_erofs_decompress_kickoff
> >   ->z_erofs_decompressqueue_work
> >    ->z_erofs_decompress_queue
> >     -> z_erofs_decompress_pcluster
> >      -> mutex_lock
> >
>
> Why does the softirq path not trigger a workqueue instead?

I said "if it is". I was giving a scenario. mutex_lock() is not
allowed in softirq context or in an RCU-reader.

> > Per Sandeep in [1], this stack happens under RCU read-lock in:
> >
> > #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
> > [...]
> >                  rcu_read_lock();
> >                  (dispatch_ops);
> >                  rcu_read_unlock();
> > [...]
> >
> > Coming from:
> > blk_mq_flush_plug_list ->
> >                             blk_mq_run_dispatch_ops(q,
> >                                  __blk_mq_flush_plug_list(q, plug));
> >
> > and __blk_mq_flush_plug_list does this:
> >            q->mq_ops->queue_rqs(&plug->mq_list);
> >
> > This somehow ends up calling the bio_endio and the
> > z_erofs_decompressqueue_endio which grabs the mutex.
> >
> > So... I have a question, it looks like one of the paths in
>>> __blk_mq_run_dispatch_ops() uses SRCU, whereas the alternate
> > path uses RCU. Why does this alternate want to block even if it is not
> > supposed to? Is the real issue here that the BLK_MQ_F_BLOCKING should
> > be set? It sounds like you want to block in the "else" path even
> > though BLK_MQ_F_BLOCKING is not set:
>
> BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with.
> That is block layer and mq device driver stuff; filesystems cannot set
> this value.
>
> As I said, as far as I understand, previously
> .end_io() could only be called without RCU context, so it was fine,
> but I don't know since when .end_io() can be called under some RCU
> context now.

From what Sandeep described, the code path is in an RCU reader. My
question is more, why doesn't it use SRCU instead since it clearly
does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
dive needs to be made into that before concluding that the fix is to
use rcu_read_lock_any_held().

 - Joel
Alan Huang July 13, 2023, 4:09 p.m. UTC | #15
> On Jul 13, 2023, at 11:33 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>> 
>> 
>> 
>> On 2023/7/13 22:07, Joel Fernandes wrote:
>>> On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>> On 2023/7/13 12:52, Paul E. McKenney wrote:
>>>>> On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
>>>>>> 
>>>>>> 
>>>> 
>>>> ...
>>>> 
>>>>>> 
>>>>>> There are lots of performance issues here and even a plumber
>>>>>> topic last year to show that, see:
>>>>>> 
>>>>>> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
>>>>>> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
>>>>>> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
>>>>>> [4] https://lpc.events/event/16/contributions/1338/
>>>>>> and more.
>>>>>> 
>>>>>> I'm not sure if it's necessary to look into all of that,
>>>>>> and Sandeep knows more than I do (the scheduling issue
>>>>>> becomes vital on some aarch64 platforms.)
>>>>> 
>>>>> Hmmm...  Please let me try again.
>>>>> 
>>>>> Assuming that this approach turns out to make sense, the resulting
>>>>> patch will need to clearly state the performance benefits directly in
>>>>> the commit log.
>>>>> 
>>>>> And of course, for the approach to make sense, it must avoid breaking
>>>>> the existing lockdep-RCU debugging code.
>>>>> 
>>>>> Is that more clear?
>>>> 
>>>> Personally I'm not working on the Android platform any more so I don't
>>>> have a way to reproduce; hopefully Sandeep could give actual
>>>> numbers _again_ if dm-verity is enabled and trigger another
>>>> workqueue here and make a comparison showing why the scheduling latency
>>>> of the extra work becomes unacceptable.
>>>> 
>>> 
>>> Question from my side, are we talking about only performance issues or
>>> also a crash? It appears z_erofs_decompress_pcluster() takes
>>> mutex_lock(&pcl->lock);
>>> 
>>> So if it is either in an RCU read-side critical section or in an
>>> atomic section, like the softirq path, then it may
>>> schedule-while-atomic or trigger RCU warnings.
>>> 
>>> z_erofs_decompressqueue_endio
>>> -> z_erofs_decompress_kickoff
>>>  ->z_erofs_decompressqueue_work
>>>   ->z_erofs_decompress_queue
>>>    -> z_erofs_decompress_pcluster
>>>     -> mutex_lock
>>> 
>> 
>> Why does the softirq path not trigger a workqueue instead?
> 
> I said "if it is". I was giving a scenario. mutex_lock() is not
> allowed in softirq context or in an RCU-reader.
> 
>>> Per Sandeep in [1], this stack happens under RCU read-lock in:
>>> 
>>> #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
>>> [...]
>>>                 rcu_read_lock();
>>>                 (dispatch_ops);
>>>                 rcu_read_unlock();
>>> [...]
>>> 
>>> Coming from:
>>> blk_mq_flush_plug_list ->
>>>                            blk_mq_run_dispatch_ops(q,
>>>                                 __blk_mq_flush_plug_list(q, plug));
>>> 
>>> and __blk_mq_flush_plug_list does this:
>>>           q->mq_ops->queue_rqs(&plug->mq_list);
>>> 
>>> This somehow ends up calling the bio_endio and the
>>> z_erofs_decompressqueue_endio which grabs the mutex.
>>> 
>>> So... I have a question, it looks like one of the paths in
>>> __blk_mq_run_dispatch_ops() uses SRCU, whereas the alternate
>>> path uses RCU. Why does this alternate want to block even if it is not
>>> supposed to? Is the real issue here that the BLK_MQ_F_BLOCKING should
>>> be set? It sounds like you want to block in the "else" path even
>>> though BLK_MQ_F_BLOCKING is not set:
>> 
>> BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with.
>> That is block layer and mq device driver stuff; filesystems cannot set
>> this value.
>> 
>> As I said, as far as I understand, previously
>> .end_io() could only be called without RCU context, so it was fine,
>> but I don't know since when .end_io() can be called under some RCU
>> context now.
> 
> From what Sandeep described, the code path is in an RCU reader. My
> question is more, why doesn't it use SRCU instead since it clearly
> does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
> dive needs to be made into that before concluding that the fix is to
> use rcu_read_lock_any_held().

Copied from [1]:

"Background: Historically erofs would always schedule a kworker for
 decompression which would incur the scheduling cost regardless of
 the context. But z_erofs_decompressqueue_endio() may not always
 be in atomic context and we could actually benefit from doing the
 decompression in z_erofs_decompressqueue_endio() if we are in
 thread context, for example when running with dm-verity.
 This optimization was later added in patch [2] which has shown
 improvement in performance benchmarks."

I’m not sure if it is a design issue.


[1] https://lore.kernel.org/all/20230621220848.3379029-1-dhavale@google.com/

> 
> - Joel
Paul E. McKenney July 13, 2023, 4:33 p.m. UTC | #16
On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote:
> On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> > On 2023/7/13 22:07, Joel Fernandes wrote:
> > > On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> > >> On 2023/7/13 12:52, Paul E. McKenney wrote:
> > >>> On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
> > >>
> > >> ...
> > >>
> > >>>>
> > >>>> There are lots of performance issues here and even a plumber
> > >>>> topic last year to show that, see:
> > >>>>
> > >>>> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
> > >>>> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
> > >>>> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
> > >>>> [4] https://lpc.events/event/16/contributions/1338/
> > >>>> and more.
> > >>>>
> > >>>> I'm not sure if it's necessary to look into all of that,
> > >>>> and Sandeep knows more than I do (the scheduling issue
> > >>>> becomes vital on some aarch64 platforms.)
> > >>>
> > >>> Hmmm...  Please let me try again.
> > >>>
> > >>> Assuming that this approach turns out to make sense, the resulting
> > >>> patch will need to clearly state the performance benefits directly in
> > >>> the commit log.
> > >>>
> > >>> And of course, for the approach to make sense, it must avoid breaking
> > >>> the existing lockdep-RCU debugging code.
> > >>>
> > >>> Is that more clear?
> > >>
> > >> Personally I'm not working on the Android platform any more so I don't
> > >> have a way to reproduce; hopefully Sandeep could give actual
> > >> numbers _again_ if dm-verity is enabled and trigger another
> > >> workqueue here and make a comparison showing why the scheduling latency
> > >> of the extra work becomes unacceptable.
> > >>
> > >
> > > Question from my side, are we talking about only performance issues or
> > > also a crash? It appears z_erofs_decompress_pcluster() takes
> > > mutex_lock(&pcl->lock);
> > >
> > > So if it is either in an RCU read-side critical section or in an
> > > atomic section, like the softirq path, then it may
> > > schedule-while-atomic or trigger RCU warnings.
> > >
> > > z_erofs_decompressqueue_endio
> > > -> z_erofs_decompress_kickoff
> > >   ->z_erofs_decompressqueue_work
> > >    ->z_erofs_decompress_queue
> > >     -> z_erofs_decompress_pcluster
> > >      -> mutex_lock
> > >
> >
> > Why does the softirq path not trigger a workqueue instead?
> 
> I said "if it is". I was giving a scenario. mutex_lock() is not
> allowed in softirq context or in an RCU-reader.
> 
> > > Per Sandeep in [1], this stack happens under RCU read-lock in:
> > >
> > > #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
> > > [...]
> > >                  rcu_read_lock();
> > >                  (dispatch_ops);
> > >                  rcu_read_unlock();
> > > [...]
> > >
> > > Coming from:
> > > blk_mq_flush_plug_list ->
> > >                             blk_mq_run_dispatch_ops(q,
> > >                                  __blk_mq_flush_plug_list(q, plug));
> > >
> > > and __blk_mq_flush_plug_list does this:
> > >            q->mq_ops->queue_rqs(&plug->mq_list);
> > >
> > > This somehow ends up calling the bio_endio and the
> > > z_erofs_decompressqueue_endio which grabs the mutex.
> > >
> > > So... I have a question, it looks like one of the paths in
> > > __blk_mq_run_dispatch_ops() uses SRCU, whereas the alternate
> > > path uses RCU. Why does this alternate want to block even if it is not
> > > supposed to? Is the real issue here that the BLK_MQ_F_BLOCKING should
> > > be set? It sounds like you want to block in the "else" path even
> > > though BLK_MQ_F_BLOCKING is not set:
> >
> > BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with.
> > That is block layer and mq device driver stuff; filesystems cannot set
> > this value.
> >
> > As I said, as far as I understand, previously
> > .end_io() could only be called without RCU context, so it was fine,
> > but I don't know since when .end_io() can be called under some RCU
> > context now.
> 
> From what Sandeep described, the code path is in an RCU reader. My
> question is more, why doesn't it use SRCU instead since it clearly
> does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
> dive needs to be made into that before concluding that the fix is to
> use rcu_read_lock_any_held().

How can this be solved?

1.	Always use a workqueue.  Simple, but is said to have performance
	issues.

2.	Pass a flag in that indicates whether or not the caller is in an
	RCU read-side critical section.  Conceptually simple, but might
	or might not be reasonable to actually implement in the code as
	it exists now.	(You tell me!)

3.	Create a function in z_erofs that gives you a decent
	approximation, maybe something like the following.

4.	Other ideas here.

The following is untested, and is probably quite buggy, but it should
provide you with a starting point.

static bool z_erofs_wq_needed(void)
{
	if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth())
		return true;  // RCU reader
	if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && !preemptible())
		return true;  // non-preemptible
	if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
		return true;  // non-preeemptible kernel, so play it safe
	return false;
}
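
One hypothetical way the erofs check might consume such a helper
(illustration only; io and z_erofs_workqueue are placeholders, and this
is not part of the patch):

	/* Defer to (kthread_)work whenever blocking is not known-safe. */
	if (!in_task() || irqs_disabled() || z_erofs_wq_needed())
		queue_work(z_erofs_workqueue, &io->u.work);
	else
		z_erofs_decompressqueue_work(&io->u.work);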

You break it, you buy it!  ;-)

							Thanx, Paul
Sandeep Dhavale July 13, 2023, 5:05 p.m. UTC | #17
Hello All,

Let me answer some of the questions raised here.

* Performance aspect
EROFS is one of the popular filesystems of choice in Android for
read-only partitions, as it provides 30%+ storage space savings with
compression.
In addition to microbenchmarks, boot time and cold app launch
benchmarks are very important to the Android ecosystem, as they
directly translate to responsiveness from the user's point of view.
We saw some performance penalty in cold app launch benchmarks in a
few scenarios. Analysis showed that the significant variance was
coming from the scheduling cost, while the decompression cost was
more or less the same.
Please see [1], which shows scheduling costs for kthread vs kworker.

> Just out of curiosity, exactly how much is it costing to trigger the
workqueue?
I think the cost to trigger is not much; it's the actual scheduling
latency for the thread that we want to cut down. And if we are already
in thread context then there is no point in incurring any extra cost
if we can detect it reliably. That is what the erofs check is trying
to do.

>One additional question...  What is your plan for kernels built with
CONFIG_PREEMPT_COUNT=n?
If there is no reliable way to detect if we can block or not then in that
case erofs has no option but to schedule the kworker.

* Regarding BLK_MQ_F_BLOCKING
As mentioned by Gao in the thread, this is a property of the blk-mq
device underneath, so erofs cannot control it, as it has to work with
different types of block devices.

* Regarding rcu_is_watching()

> I am assuming you mean you would grab the mutex accidentally when in an RCU
> reader, and might_sleep() presumably in the mutex internal code will scream?

Thank you, Paul, for explaining in detail why it is important. I can
get the V2 going. From looking at the code in kernel/sched/core.c,
which only looks at rcu_preempt_depth(), I am thinking it may still
get triggered, IIUC.

> The following is untested, and is probably quite buggy, but it should
> provide you with a starting point.
..

Yes, that can fix the problem at hand, as the erofs check also looks
at rcu_preempt_depth(). A similar approach was discarded earlier
because rcu_preempt_depth() was thought to be too low level, and we
used rcu_read_lock_any_held(), which is the superset, until we
figured out the inconsistency with !CONFIG_DEBUG_LOCK_ALLOC.

Paul, Joel,
Shall we fix the rcu_read_lock_*held() regardless of how erofs
improves the check?

Thanks,
Sandeep.

[1] https://lore.kernel.org/linux-erofs/20230208093322.75816-1-hsiangkao@linux.alibaba.com/
Paul E. McKenney July 13, 2023, 5:35 p.m. UTC | #18
On Thu, Jul 13, 2023 at 10:05:46AM -0700, Sandeep Dhavale wrote:
> Hello All,
> 
> Let me answer some of the questions raised here.
> 
> ...

Thank you for the background.

> Paul, Joel,
> Shall we fix the rcu_read_lock_*held() regardless of how erofs
> improves the check?

Help me out here.  Exactly what is broken with rcu_read_lock_*held(),
keeping in mind their intended use for lockdep-based debugging?

							Thanx, Paul

> Thanks,
> Sandeep.
> 
> [1] https://lore.kernel.org/linux-erofs/20230208093322.75816-1-hsiangkao@linux.alibaba.com/
Paul E. McKenney July 13, 2023, 6:14 p.m. UTC | #19
On Fri, Jul 14, 2023 at 12:09:27AM +0800, Alan Huang wrote:
> 
> > On Jul 13, 2023, at 23:33, Joel Fernandes <joel@joelfernandes.org> wrote:
> > 
> > On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >> 
> >> 
> >> 
> >> On 2023/7/13 22:07, Joel Fernandes wrote:
> >>> On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >>>> On 2023/7/13 12:52, Paul E. McKenney wrote:
> >>>>> On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
> >>>>>> 
> >>>>>> 
> >>>> 
> >>>> ...
> >>>> 
> >>>>>> 
> >>>>>> There are lots of performance issues here and even a Plumbers
> >>>>>> topic last year to show that, see:
> >>>>>> 
> >>>>>> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
> >>>>>> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
> >>>>>> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
> >>>>>> [4] https://lpc.events/event/16/contributions/1338/
> >>>>>> and more.
> >>>>>> 
> >>>>>> I'm not sure if it's necessary to look into all of that,
> >>>>>> and Sandeep knows more than I do (the scheduling issue
> >>>>>> becomes vital on some aarch64 platforms.)
> >>>>> 
> >>>>> Hmmm...  Please let me try again.
> >>>>> 
> >>>>> Assuming that this approach turns out to make sense, the resulting
> >>>>> patch will need to clearly state the performance benefits directly in
> >>>>> the commit log.
> >>>>> 
> >>>>> And of course, for the approach to make sense, it must avoid breaking
> >>>>> the existing lockdep-RCU debugging code.
> >>>>> 
> >>>>> Is that more clear?
> >>>> 
> >>>> Personally I'm not working on the Android platform any more so I
> >>>> don't have a way to reproduce; hopefully Sandeep could give actual
> >>>> numbers _again_ if dm-verity is enabled and trigger another
> >>>> workqueue here and make a comparison of why the scheduling latency
> >>>> of the extra work becomes unacceptable.
> >>>> 
> >>> 
> >>> Question from my side, are we talking about only performance issues or
> >>> also a crash? It appears z_erofs_decompress_pcluster() takes
> >>> mutex_lock(&pcl->lock);
> >>> 
> >>> So if it is either in an RCU read-side critical section or in an
> >>> atomic section, like the softirq path, then it may
> >>> schedule-while-atomic or trigger RCU warnings.
> >>> 
> >>> z_erofs_decompressqueue_endio
> >>> -> z_erofs_decompress_kickoff
> >>>  ->z_erofs_decompressqueue_work
> >>>   ->z_erofs_decompress_queue
> >>>    -> z_erofs_decompress_pcluster
> >>>     -> mutex_lock
> >>> 
> >> 
> >> Why does the softirq path not trigger a workqueue instead?
> > 
> > I said "if it is". I was giving a scenario. mutex_lock() is not
> > allowed in softirq context or in an RCU-reader.
> > 
> >>> Per Sandeep in [1], this stack happens under RCU read-lock in:
> >>> 
> >>> #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
> >>> [...]
> >>>                 rcu_read_lock();
> >>>                 (dispatch_ops);
> >>>                 rcu_read_unlock();
> >>> [...]
> >>> 
> >>> Coming from:
> >>> blk_mq_flush_plug_list ->
> >>>                            blk_mq_run_dispatch_ops(q,
> >>>                                 __blk_mq_flush_plug_list(q, plug));
> >>> 
> >>> and __blk_mq_flush_plug_list does this:
> >>>           q->mq_ops->queue_rqs(&plug->mq_list);
> >>> 
> >>> This somehow ends up calling the bio_endio and the
> >>> z_erofs_decompressqueue_endio which grabs the mutex.
> >>> 
> >>> So... I have a question, it looks like one of the paths in
> >>> __blk_mq_run_dispatch_ops() uses SRCU, whereas the alternate
> >>> path uses RCU. Why does this alternate want to block even if it is not
> >>> supposed to? Is the real issue here that the BLK_MQ_F_BLOCKING should
> >>> be set? It sounds like you want to block in the "else" path even
> >>> though BLK_MQ_F_BLOCKING is not set:
> >> 
> >> BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with.
> >> That is block layer and mq device driver stuff; filesystems cannot set
> >> this value.
> >> 
> >> As I said, as far as I understand, previously
> >> .end_io() could only be called outside RCU context, so it was fine,
> >> but I don't know since when .end_io() can be called under some RCU
> >> context now.
> > 
> > From what Sandeep described, the code path is in an RCU reader. My
> > question is more, why doesn't it use SRCU instead since it clearly
> > does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
> > dive needs to be made into that before concluding that the fix is to
> > use rcu_read_lock_any_held().
> 
> Copied from [1]:
> 
> "Background: Historically erofs would always schedule a kworker for
>  decompression which would incur the scheduling cost regardless of
>  the context. But z_erofs_decompressqueue_endio() may not always
>  be in atomic context and we could actually benefit from doing the
>  decompression in z_erofs_decompressqueue_endio() if we are in
>  thread context, for example when running with dm-verity.
>  This optimization was later added in patch [2] which has shown
>  improvement in performance benchmarks.”
> 
> I’m not sure if it is a design issue.

I have no official opinion myself, but there are quite a few people
who firmly believe that any situation like this one (where driver or
file-system code needs to query the current context to see if blocking
is OK) constitutes a design flaw.  Such people might argue that this
code path should have a clearly documented context, and that if that
documentation states that the code might be in atomic context, then the
driver/fs should assume atomic context.  Alternatively, if driver/fs
needs the context to be non-atomic, the callers should make it so.

See for example in_atomic() and its comment header:

/*
 * Are we running in atomic context?  WARNING: this macro cannot
 * always detect atomic context; in particular, it cannot know about
 * held spinlocks in non-preemptible kernels.  Thus it should not be
 * used in the general case to determine whether sleeping is possible.
 * Do not use in_atomic() in driver code.
 */
#define in_atomic()	(preempt_count() != 0)
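To make that caveat concrete, here is a sketch: with
CONFIG_PREEMPT_COUNT=n, spin_lock() does not bump preempt_count(), so
in_atomic() reads 0 even where sleeping would be a bug:

static void in_atomic_caveat_demo(spinlock_t *lock)
{
	spin_lock(lock);
	// With CONFIG_PREEMPT_COUNT=n this WARN_ON() does not fire,
	// even though sleeping here would be a bug.
	WARN_ON(in_atomic());
	spin_unlock(lock);
}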

In the immortal words of Dan Frye, this should be good clean fun!  ;-)

							Thanx, Paul

> [1] https://lore.kernel.org/all/20230621220848.3379029-1-dhavale@google.com/
> 
> > 
> > - Joel
> 
>
Sandeep Dhavale July 13, 2023, 6:51 p.m. UTC | #20
> Thank you for the background.
>
> > Paul, Joel,
> > Shall we fix the rcu_read_lock_*held() regardless of how erofs
> > improves the check?
>
> Help me out here.  Exactly what is broken with rcu_read_lock_*held(),
> keeping in mind their intended use for lockdep-based debugging?
>
Hi Paul,
With !CONFIG_DEBUG_LOCK_ALLOC:
rcu_read_lock_held() -> always returns 1.
rcu_read_lock_any_held() -> returns !preemptible(), so it may return 0.

Now the current usages of rcu_read_lock_*held() are under RCU_LOCKDEP_WARN(),
which becomes a no-op with !CONFIG_DEBUG_LOCK_ALLOC (due to
debug_lockdep_rcu_enabled()), so this inconsistency is not causing any
problems right now. So my question was about your opinion on fixing this
for semantics, if it's worth correcting.
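For reference, with !CONFIG_DEBUG_LOCK_ALLOC those stubs boil down to
roughly the following (a paraphrased sketch of the rcupdate.h
definitions, not a verbatim copy):

static inline int rcu_read_lock_held(void)
{
	return 1;		// unconditionally claims the lock is held
}

static inline int rcu_read_lock_any_held(void)
{
	return !preemptible();	// may return 0 inside a preemptible-RCU reader
}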

Also, it would have been better IMO if there were a reliable API like
rcu_read_lock_*held(), rather than erofs trying to figure it out at a higher level.

> I have no official opinion myself, but there are quite a few people
...

Regarding erofs trying to detect this, I understand a few people can
have different opinions. Not scheduling a thread while already being
in thread context is reasonable in my opinion, and it has also shown
performance gains.

Thanks,
Sandeep.



>                                                         Thanx, Paul
>
> > Thanks,
> > Sandeep.
> >
> > [1] https://lore.kernel.org/linux-erofs/20230208093322.75816-1-hsiangkao@linux.alibaba.com/
>
Gao Xiang July 13, 2023, 7 p.m. UTC | #21
On 2023/7/14 02:14, Paul E. McKenney wrote:
> On Fri, Jul 14, 2023 at 12:09:27AM +0800, Alan Huang wrote:

...

>>>
>>>  From what Sandeep described, the code path is in an RCU reader. My
>>> question is more, why doesn't it use SRCU instead since it clearly
>>> does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
>>> dive needs to be made into that before concluding that the fix is to
>>> use rcu_read_lock_any_held().
>>
>> Copied from [1]:
>>
>> "Background: Historically erofs would always schedule a kworker for
>>   decompression which would incur the scheduling cost regardless of
>>   the context. But z_erofs_decompressqueue_endio() may not always
>>   be in atomic context and we could actually benefit from doing the
>>   decompression in z_erofs_decompressqueue_endio() if we are in
>>   thread context, for example when running with dm-verity.
>>   This optimization was later added in patch [2] which has shown
>>   improvement in performance benchmarks.”
>>
>> I’m not sure if it is a design issue.

What do you mean by a design issue, honestly?  I feel uneasy hearing this.

> 
> I have no official opinion myself, but there are quite a few people
> who firmly believe that any situation like this one (where driver or
> file-system code needs to query the current context to see if blocking
> is OK) constitutes a design flaw.  Such people might argue that this
> code path should have a clearly documented context, and that if that
> documentation states that the code might be in atomic context, then the
> driver/fs should assume atomic context.  Alternatively, if driver/fs


Honestly, I don't think a software decoder (for example, decompression)
should be left in atomic context.

Regardless of the decompression speed of the algorithms themselves
(consider very slow decompression on very slow devices), it means
that we also don't have a way to vmap or the like (consider
decompression plus handling extra deduplication copies) in atomic
context; even memory allocation has to be done in an atomic way.


Especially now that there are more cases that decode in RCU reader
context apart from softirq contexts?


> needs the context to be non-atomic, the callers should make it so.
> 
> See for example in_atomic() and its comment header:
> 
> /*
>   * Are we running in atomic context?  WARNING: this macro cannot
>   * always detect atomic context; in particular, it cannot know about
>   * held spinlocks in non-preemptible kernels.  Thus it should not be
>   * used in the general case to determine whether sleeping is possible.
>   * Do not use in_atomic() in driver code.
>   */
> #define in_atomic()	(preempt_count() != 0)
> 
> In the immortal words of Dan Frye, this should be good clean fun!  ;-)

Honestly, I think such a helper (to show whether we're in atomic context)
is useful to driver users; even if it could cause some false positives in
some configurations, that's acceptable.


Anyway, I could "Always use a workqueue.", but again, the original commit
was raised by a real vendor (OPPO), and I think if we drop this, all
downstream users who use dm-verity will be impacted, and individual end
users will not be happy either.

Thanks,
Gao Xiang
Paul E. McKenney July 13, 2023, 10:27 p.m. UTC | #22
On Fri, Jul 14, 2023 at 03:00:25AM +0800, Gao Xiang wrote:
> On 2023/7/14 02:14, Paul E. McKenney wrote:
> > On Fri, Jul 14, 2023 at 12:09:27AM +0800, Alan Huang wrote:
> 
> ...
> 
> > > > 
> > > >  From what Sandeep described, the code path is in an RCU reader. My
> > > > question is more, why doesn't it use SRCU instead since it clearly
> > > > does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
> > > > dive needs to be made into that before concluding that the fix is to
> > > > use rcu_read_lock_any_held().
> > > 
> > > Copied from [1]:
> > > 
> > > "Background: Historically erofs would always schedule a kworker for
> > >   decompression which would incur the scheduling cost regardless of
> > >   the context. But z_erofs_decompressqueue_endio() may not always
> > >   be in atomic context and we could actually benefit from doing the
> > >   decompression in z_erofs_decompressqueue_endio() if we are in
> > >   thread context, for example when running with dm-verity.
> > >   This optimization was later added in patch [2] which has shown
> > >   improvement in performance benchmarks.”
> > > 
> > > I’m not sure if it is a design issue.
> 
> What do you mean by a design issue, honestly?  I feel uneasy hearing this.
> 
> > I have no official opinion myself, but there are quite a few people
> > who firmly believe that any situation like this one (where driver or
> > file-system code needs to query the current context to see if blocking
> > is OK) constitutes a design flaw.  Such people might argue that this
> > code path should have a clearly documented context, and that if that
> > documentation states that the code might be in atomic context, then the
> > driver/fs should assume atomic context.  Alternatively, if driver/fs
> 
> Honestly, I don't think a software decoder (for example, decompression)
> should be left in atomic context.
> 
> Regardless of the decompression speed of the algorithms themselves
> (consider very slow decompression on very slow devices), it means
> that we also don't have a way to vmap or the like (consider
> decompression plus handling extra deduplication copies) in atomic
> context; even memory allocation has to be done in an atomic way.
> 
> Especially now that there are more cases that decode in RCU reader
> context apart from softirq contexts?

Just out of curiosity, why are these cases within RCU readers happening?
Is this due to recent MM changes?  But yes, you would have the same
problem in softirq context.

> > needs the context to be non-atomic, the callers should make it so.
> > 
> > See for example in_atomic() and its comment header:
> > 
> > /*
> >   * Are we running in atomic context?  WARNING: this macro cannot
> >   * always detect atomic context; in particular, it cannot know about
> >   * held spinlocks in non-preemptible kernels.  Thus it should not be
> >   * used in the general case to determine whether sleeping is possible.
> >   * Do not use in_atomic() in driver code.
> >   */
> > #define in_atomic()	(preempt_count() != 0)
> > 
> > In the immortal words of Dan Frye, this should be good clean fun!  ;-)
> 
> Honestly, I think such a helper (to show whether we're in atomic context)
> is useful to driver users; even if it could cause some false positives in
> some configurations, that's acceptable.
> 
> Anyway, I could "Always use a workqueue.", but again, the original commit
> was raised by a real vendor (OPPO), and I think if we drop this, all
> downstream users who use dm-verity will be impacted, and individual end
> users will not be happy either.

Well, given what we have discussed thus far, for some kernels, you would
always use a workqueue.  One option would be to force CONFIG_PREEMPT_COUNT
to always be enabled, but this option has been rejected multiple
times in the past.

But you are quite free to roll your own z_erofs_wq_needed().

							Thanx, Paul
Paul E. McKenney July 13, 2023, 10:49 p.m. UTC | #23
On Thu, Jul 13, 2023 at 11:51:34AM -0700, Sandeep Dhavale wrote:
> > Thank you for the background.
> >
> > > Paul, Joel,
> > > Shall we fix the rcu_read_lock_*held() regardless of how erofs
> > > improves the check?
> >
> > Help me out here.  Exactly what is broken with rcu_read_lock_*held(),
> > keeping in mind their intended use for lockdep-based debugging?
> >
> Hi Paul,
> With !CONFIG_DEBUG_LOCK_ALLOC:
> rcu_read_lock_held() -> always returns 1.
> rcu_read_lock_any_held() -> returns !preemptible(), so it may return 0.
> 
> Now the current usages of rcu_read_lock_*held() are under RCU_LOCKDEP_WARN(),
> which becomes a no-op with !CONFIG_DEBUG_LOCK_ALLOC (due to
> debug_lockdep_rcu_enabled()), so this inconsistency is not causing any
> problems right now. So my question was about your opinion on fixing this
> for semantics, if it's worth correcting.
> 
> Also, it would have been better IMO if there were a reliable API like
> rcu_read_lock_*held(), rather than erofs trying to figure it out at a higher level.

Sorry, but the current lockdep-support functions need to stay focused
on lockdep.  They are not set up for general use, as we already saw
with rcu_is_watching().

If you get a z_erofs_wq_needed() (or whatever) upstream, and if it turns
out that there is an RCU-specific portion that has clean semantics,
then I would be willing to look at pulling that portion into RCU.
Note "look at" as opposed to "unconditionally agree to".  ;-)

> > I have no official opinion myself, but there are quite a few people
> ...
> 
> Regarding erofs trying to detect this, I understand a few people can
> have different opinions. Not scheduling a thread while already being
> in thread context is reasonable in my opinion, and it has also shown
> performance gains.

You still haven't quantified the performance gains.  Presumably they
are most compelling with large numbers of small buffers to be decompressed.

But why not just make a z_erofs_wq_needed() or similar in your own
code, and push it upstream?  If the performance gains really are so
compelling, one would hope that some sort of reasonable arrangement
could be arrived at.

							Thanx, Paul

> Thanks,
> Sandeep.
> 
> 
> 
> >                                                         Thanx, Paul
> >
> > > Thanks,
> > > Sandeep.
> > >
> > > [1] https://lore.kernel.org/linux-erofs/20230208093322.75816-1-hsiangkao@linux.alibaba.com/
> >
Sandeep Dhavale July 13, 2023, 11:08 p.m. UTC | #24
>
> Sorry, but the current lockdep-support functions need to stay focused
> on lockdep.  They are not set up for general use, as we already saw
> with rcu_is_watching().
>
Ok, understood.

> If you get a z_erofs_wq_needed() (or whatever) upstream, and if it turns
> out that there is an RCU-specific portion that has clean semantics,
> then I would be willing to look at pulling that portion into RCU.
> Note "look at" as opposed to "unconditionally agree to".  ;-)
> > > I have no official opinion myself, but there are quite a few people
> > ...
> >
> > Regarding erofs trying to detect this, I understand a few people can
> > have different opinions. Not scheduling a thread while already being
> > in thread context is reasonable in my opinion, and it has also shown
> > performance gains.
>
> You still haven't quantified the performance gains.  Presumably they
> are most compelling with large numbers of small buffers to be decompressed.
>

Maybe you missed one of the replies. Link [1] shows the scheduling
overhead for a kworker vs a high-priority kthread. I think we can all
see that there is a non-zero cost associated with always scheduling vs
inline decompression.

> But why not just make a z_erofs_wq_needed() or similar in your own
> code, and push it upstream?  If the performance gains really are so
> compelling, one would hope that some sort of reasonable arrangement
> could be arrived at.
>
Yes, we will incorporate additional checks in erofs.

Thanks,
Sandeep.

[1] https://lore.kernel.org/linux-erofs/20230208093322.75816-1-hsiangkao@linux.alibaba.com/
Paul E. McKenney July 13, 2023, 11:28 p.m. UTC | #25
On Thu, Jul 13, 2023 at 04:08:29PM -0700, Sandeep Dhavale wrote:
> >
> > Sorry, but the current lockdep-support functions need to stay focused
> > on lockdep.  They are not set up for general use, as we already saw
> > with rcu_is_watching().
> >
> Ok, understood.
> 
> > If you get a z_erofs_wq_needed() (or whatever) upstream, and if it turns
> > out that there is an RCU-specific portion that has clean semantics,
> > then I would be willing to look at pulling that portion into RCU.
> > Note "look at" as opposed to "unconditionally agree to".  ;-)
> > > > I have no official opinion myself, but there are quite a few people
> > > ...
> > >
> > > Regarding erofs trying to detect this, I understand a few people can
> > > have different opinions. Not scheduling a thread while already being
> > > in thread context is reasonable in my opinion, and it has also shown
> > > performance gains.
> >
> > You still haven't quantified the performance gains.  Presumably they
> > are most compelling with large numbers of small buffers to be decompressed.
> 
> Maybe you missed one of the replies. Link [1] shows the scheduling
> overhead for a kworker vs a high-priority kthread. I think we can all
> see that there is a non-zero cost associated with always scheduling vs
> inline decompression.

Heh!  A reply I was not CCed on back in February.  ;-)

But data like that included directly in the commit log, gathered
specifically for that commit log's patch, would be very helpful.

> > But why not just make a z_erofs_wq_needed() or similar in your own
> > code, and push it upstream?  If the performance gains really are so
> > compelling, one would hope that some sort of reasonable arrangement
> > could be arrived at.
> >
> Yes, we will incorporate additional checks in erofs.

Sounds good to me!

							Thanx, Paul

> Thanks,
> Sandeep.
> 
> [1] https://lore.kernel.org/linux-erofs/20230208093322.75816-1-hsiangkao@linux.alibaba.com/
Paul E. McKenney July 14, 2023, 2:16 a.m. UTC | #26
On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote:
...
> > 
> > From what Sandeep described, the code path is in an RCU reader. My
> > question is more, why doesn't it use SRCU instead since it clearly
> > does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
> > dive needs to be made into that before concluding that the fix is to
> > use rcu_read_lock_any_held().
> 
> How can this be solved?
> 
> 1.	Always use a workqueue.  Simple, but is said to have performance
> 	issues.
> 
> 2.	Pass a flag in that indicates whether or not the caller is in an
> 	RCU read-side critical section.  Conceptually simple, but might
> 	or might not be reasonable to actually implement in the code as
> 	it exists now.	(You tell me!)
> 
> 3.	Create a function in z_erofs that gives you a decent
> 	approximation, maybe something like the following.
> 
> 4.	Other ideas here.

5.	#3 plus make the corresponding Kconfig option select
	PREEMPT_COUNT, assuming that any users needing compression in
	non-preemptible kernels are OK with PREEMPT_COUNT being set.
	(Some users of non-preemptible kernels object strenuously
	to the added overhead from CONFIG_PREEMPT_COUNT=y.)
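Concretely, option 5 would amount to a one-line Kconfig change along
these lines (hypothetical sketch; whether EROFS_FS_ZIP is the right
option to hang the select off, and its other lines, are assumptions):

config EROFS_FS_ZIP
	bool "EROFS Data Compression Support"
	select PREEMPT_COUNT	# hypothetical: make preempt_count() reliable
	...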

							Thanx, Paul

> ...
Gao Xiang July 14, 2023, 3:16 a.m. UTC | #27
On 2023/7/14 10:16, Paul E. McKenney wrote:
> On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote:
>> On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote:

...

>>>
>>> From what Sandeep described, the code path is in an RCU reader. My
>>> question is more, why doesn't it use SRCU instead since it clearly
>>> does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
>>> dive needs to be made into that before concluding that the fix is to
>>> use rcu_read_lock_any_held().
>>
>> How can this be solved?
>>
>> 1.	Always use a workqueue.  Simple, but is said to have performance
>> 	issues.
>>
>> 2.	Pass a flag in that indicates whether or not the caller is in an
>> 	RCU read-side critical section.  Conceptually simple, but might
>> 	or might not be reasonable to actually implement in the code as
>> 	it exists now.	(You tell me!)
>>
>> 3.	Create a function in z_erofs that gives you a decent
>> 	approximation, maybe something like the following.
>>
>> 4.	Other ideas here.
> 
> 5.	#3 plus make the corresponding Kconfig option select
> 	PREEMPT_COUNT, assuming that any users needing compression in
> 	non-preemptible kernels are OK with PREEMPT_COUNT being set.
> 	(Some users of non-preemptible kernels object strenuously
> 	to the added overhead from CONFIG_PREEMPT_COUNT=y.)

I'm not sure if it's a good idea; we need to work on
CONFIG_PREEMPT_COUNT=n (why not?), so we could just always trigger a
workqueue for this.

Anyway, before we proceed, I also think it'd be better to get some
performance numbers first for this (e.g. with dm-verity) and record
the numbers in the commit message to justify this.  Otherwise, I guess
the same question will be raised again and again.

Thanks,
Gao Xiang
Joel Fernandes July 14, 2023, 1:42 p.m. UTC | #28
On Thu, Jul 13, 2023 at 11:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
>
>
> On 2023/7/14 10:16, Paul E. McKenney wrote:
> > On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote:
> >> On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote:
>
> ...
>
> >>>
> >>> From what Sandeep described, the code path is in an RCU reader. My
> >>> question is more, why doesn't it use SRCU instead since it clearly
> >>> does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
> >>> dive needs to be made into that before concluding that the fix is to
> >>> use rcu_read_lock_any_held().
> >>
> >> How can this be solved?
> >>
> >> 1.   Always use a workqueue.  Simple, but is said to have performance
> >>      issues.
> >>
> >> 2.   Pass a flag in that indicates whether or not the caller is in an
> >>      RCU read-side critical section.  Conceptually simple, but might
> >>      or might not be reasonable to actually implement in the code as
> >>      it exists now.  (You tell me!)
> >>
> >> 3.   Create a function in z_erofs that gives you a decent
> >>      approximation, maybe something like the following.
> >>
> >> 4.   Other ideas here.
> >
> > 5.    #3 plus make the corresponding Kconfig option select
> >       PREEMPT_COUNT, assuming that any users needing compression in
> >       non-preemptible kernels are OK with PREEMPT_COUNT being set.
> >       (Some users of non-preemptible kernels object strenuously
> >       to the added overhead from CONFIG_PREEMPT_COUNT=y.)
>
> I'm not sure if it's a good idea

I think it is a fine idea.

> we need to work on
> CONFIG_PREEMPT_COUNT=n (why not?), so we could just always trigger a
> workqueue for this.
>

So CONFIG_PREEMPT_COUNT=n users don't deserve good performance? ;-)

thanks,

 - Joel
Gao Xiang July 14, 2023, 1:51 p.m. UTC | #29
On 2023/7/14 21:42, Joel Fernandes wrote:
> On Thu, Jul 13, 2023 at 11:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2023/7/14 10:16, Paul E. McKenney wrote:
>>> On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote:
>>>> On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote:
>>
>> ...
>>
>>>>>
>>>>> From what Sandeep described, the code path is in an RCU reader. My
>>>>> question is more, why doesn't it use SRCU instead since it clearly
>>>>> does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
>>>>> dive needs to be made into that before concluding that the fix is to
>>>>> use rcu_read_lock_any_held().
>>>>
>>>> How can this be solved?
>>>>
>>>> 1.   Always use a workqueue.  Simple, but is said to have performance
>>>>       issues.
>>>>
>>>> 2.   Pass a flag in that indicates whether or not the caller is in an
>>>>       RCU read-side critical section.  Conceptually simple, but might
>>>>       or might not be reasonable to actually implement in the code as
>>>>       it exists now.  (You tell me!)
>>>>
>>>> 3.   Create a function in z_erofs that gives you a decent
>>>>       approximation, maybe something like the following.
>>>>
>>>> 4.   Other ideas here.
>>>
>>> 5.    #3 plus make the corresponding Kconfig option select
>>>        PREEMPT_COUNT, assuming that any users needing compression in
>>>        non-preemptible kernels are OK with PREEMPT_COUNT being set.
>>>        (Some users of non-preemptible kernels object strenuously
>>>        to the added overhead from CONFIG_PREEMPT_COUNT=y.)
>>
>> I'm not sure if it's a good idea
> 
> I think it is a fine idea.
> 
>> we need to work on
>> CONFIG_PREEMPT_COUNT=n (why not?), so we could just always trigger a
>> workqueue for this.
>>
> 
> So CONFIG_PREEMPT_COUNT=n users don't deserve good performance? ;-)

I'm not sure if non-preemptible kernel users really care about
such sensitive latencies; I don't know, my 2 cents.

Thanks,
Gao Xiang

> 
> thanks,
> 
>   - Joel
Steven Rostedt July 14, 2023, 2:56 p.m. UTC | #30
On Fri, 14 Jul 2023 21:51:16 +0800
Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

> >> we need to work on
> >> CONFIG_PREEMPT_COUNT=n (why not?), so we could just always trigger a
> >> workqueue for this.
> >>  
> > 
> > So CONFIG_PREEMPT_COUNT=n users don't deserve good performance? ;-)  
> 
> I'm not sure if non-preemptible kernel users really care about
> such sensitive latencies; I don't know, my 2 cents.

CONFIG_PREEMPT_COUNT=n is for *performance* but not for *latency*. That is,
they care about the overall performance (batch processing) but not
interactive performance.

-- Steve
Paul E. McKenney July 14, 2023, 3:13 p.m. UTC | #31
On Fri, Jul 14, 2023 at 10:56:15AM -0400, Steven Rostedt wrote:
> On Fri, 14 Jul 2023 21:51:16 +0800
> Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> 
> > >> we need to work on
> > >> CONFIG_PREEMPT_COUNT=n (why not?), so we could just always trigger a
> > >> workqueue for this.
> > >>  
> > > 
> > > So CONFIG_PREEMPT_COUNT=n users don't deserve good performance? ;-)  
> > 
> > I'm not sure if non-preemptible kernel users really care about
> > such sensitive latencies; I don't know, my 2 cents.
> 
> CONFIG_PREEMPT_COUNT=n is for *performance* but not for *latency*. That is,
> they care about the overall performance (batch processing) but not
> interactive performance.

Users of CONFIG_PREEMPT_COUNT=n do care about latency, but normally not
in the sub-millisecond range.  If the February posting is representative
(no idea, myself), these latencies are in the tens of milliseconds.

So one question is "why not both?"

One way would be for the call chain to indicate when in atomic context,
and another would be use of SRCU, to Joel's earlier point.
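For reference, an SRCU-based reader would look roughly like this
(sketch only; the srcu_struct and function names are illustrative, and
blk-mq already maintains its own SRCU domain for the BLK_MQ_F_BLOCKING
case):

DEFINE_SRCU(dispatch_srcu);

static void dispatch_blocking_ok(void (*dispatch)(void))
{
	int idx = srcu_read_lock(&dispatch_srcu);

	dispatch();	// sleeping here is legal, unlike under rcu_read_lock()
	srcu_read_unlock(&dispatch_srcu, idx);
}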

							Thanx, Paul
Alan Huang July 14, 2023, 3:35 p.m. UTC | #32
> On Jul 14, 2023, at 10:16, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote:
>> On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote:
...
>> 
>> How can this be solved?
>> 
>> 1. Always use a workqueue.  Simple, but is said to have performance
>> issues.
>> 
>> 2. Pass a flag in that indicates whether or not the caller is in an
>> RCU read-side critical section.  Conceptually simple, but might
>> or might not be reasonable to actually implement in the code as
>> it exists now. (You tell me!)
>> 
>> 3. Create a function in z_erofs that gives you a decent
>> approximation, maybe something like the following.
>> 
>> 4. Other ideas here.
> 
> 5. #3 plus make the corresponding Kconfig option select
> PREEMPT_COUNT, assuming that any users needing compression in
> non-preemptible kernels are OK with PREEMPT_COUNT being set.
> (Some users of non-preemptible kernels object strenuously
> to the added overhead from CONFIG_PREEMPT_COUNT=y.)

6. Set one bit in bio->bi_private, check the bit and flip it in the rcu_read_lock() path,
then in z_erofs_decompressqueue_endio(), check if the bit has changed.

Not sure if this is feasible or acceptable. :)

Alan Huang July 14, 2023, 3:54 p.m. UTC | #33
> On Jul 14, 2023, at 23:35, Alan Huang <mmpgouride@gmail.com> wrote:
> 
...
> 
> 6. Set one bit in bio->bi_private, check the bit and flip it in the rcu_read_lock() path,
> then in z_erofs_decompressqueue_endio(), check if the bit has changed.

Seems bad; reading and modifying bi_private is a bad idea.

> 
> Not sure if this is feasible or acceptable. :)
Paul E. McKenney July 14, 2023, 5:02 p.m. UTC | #34
On Fri, Jul 14, 2023 at 11:54:47PM +0800, Alan Huang wrote:
> 
> > On Jul 14, 2023, at 23:35, Alan Huang <mmpgouride@gmail.com> wrote:
> > 
...
> > 
> > 6. Set one bit in bio->bi_private, check the bit and flip it in rcu_read_lock() path,
> > then in z_erofs_decompressqueue_endio, check if the bit has changed.
> 
> Seems bad; reading and modifying bi_private is a bad idea.

Is there some other field that would work?

							Thanx, Paul

> > Not sure if this is feasible or acceptable. :)
> > 
> >> 
> >> Thanx, Paul
> >> 
> >>> The following is untested, and is probably quite buggy, but it should
> >>> provide you with a starting point.
> >>> 
> >>> static bool z_erofs_wq_needed(void)
> >>> {
> >>> 	if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth())
> >>> 		return true;  // RCU reader
> >>> 	if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && !preemptible())
> >>> 		return true;  // non-preemptible
> >>> 	if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
> >>> 		return true;  // non-preemptible kernel, so play it safe
> >>> 	return false;
> >>> }
> >>> 
> >>> You break it, you buy it!  ;-)
> >>> 
> >>> Thanx, Paul
> 
>
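A minimal sketch of how such a helper might be consulted at the erofs
kickoff site, assuming the z_erofs_decompressqueue field layout and the
z_erofs_workqueue / z_erofs_decompressqueue_work() names implied by the
call chain quoted above (illustrative only, not a merged fix):

static void z_erofs_decompress_kickoff_sketch(struct z_erofs_decompressqueue *io)
{
	if (z_erofs_wq_needed()) {
		/* Possibly atomic or inside an RCU reader: defer to a kworker. */
		queue_work(z_erofs_workqueue, &io->u.work);
		return;
	}
	/* Known-preemptible process context: decompress inline for low latency. */
	z_erofs_decompressqueue_work(&io->u.work);
}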
Alan Huang July 14, 2023, 6:40 p.m. UTC | #35
> On Jul 15, 2023, at 01:02, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Fri, Jul 14, 2023 at 11:54:47PM +0800, Alan Huang wrote:
>> 
>>> On Jul 14, 2023, at 23:35, Alan Huang <mmpgouride@gmail.com> wrote:
>>> 
>>>> 
>>>> On Jul 14, 2023, at 10:16, Paul E. McKenney <paulmck@kernel.org> wrote:
>>>> 
>>>> On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote:
>>>>> On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote:
>>>>>> On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>>>> On 2023/7/13 22:07, Joel Fernandes wrote:
>>>>>>>> On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>>>>>> On 2023/7/13 12:52, Paul E. McKenney wrote:
>>>>>>>>>> On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
>>>>>>>>> 
>>>>>>>>> ...
>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> There are lots of performance issues here, and there was even a
>>>>>>>>>>> Plumbers topic last year showing that; see:
>>>>>>>>>>> 
>>>>>>>>>>> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
>>>>>>>>>>> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
>>>>>>>>>>> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
>>>>>>>>>>> [4] https://lpc.events/event/16/contributions/1338/
>>>>>>>>>>> and more.
>>>>>>>>>>> 
>>>>>>>>>>> I'm not sure if it's necessary to look into all of that,
>>>>>>>>>>> and Sandeep knows more than I do (the scheduling issue
>>>>>>>>>>> becomes vital on some aarch64 platforms.)
>>>>>>>>>> 
>>>>>>>>>> Hmmm...  Please let me try again.
>>>>>>>>>> 
>>>>>>>>>> Assuming that this approach turns out to make sense, the resulting
>>>>>>>>>> patch will need to clearly state the performance benefits directly in
>>>>>>>>>> the commit log.
>>>>>>>>>> 
>>>>>>>>>> And of course, for the approach to make sense, it must avoid breaking
>>>>>>>>>> the existing lockdep-RCU debugging code.
>>>>>>>>>> 
>>>>>>>>>> Is that more clear?
>>>>>>>>> 
>>>>>>>>> Personally I'm not working on the Android platform any more so I don't
>>>>>>>>> have a way to reproduce; hopefully Sandeep could give actual
>>>>>>>>> numbers _again_ with dm-verity enabled and triggering another
>>>>>>>>> workqueue here, and make a comparison showing why the scheduling
>>>>>>>>> latency of the extra work becomes unacceptable.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Question from my side, are we talking about only performance issues or
>>>>>>>> also a crash? It appears z_erofs_decompress_pcluster() takes
>>>>>>>> mutex_lock(&pcl->lock);
>>>>>>>> 
>>>>>>>> So if it is either in an RCU read-side critical section or in an
>>>>>>>> atomic section, like the softirq path, then it may
>>>>>>>> schedule-while-atomic or trigger RCU warnings.
>>>>>>>> 
>>>>>>>> z_erofs_decompressqueue_endio
>>>>>>>>  -> z_erofs_decompress_kickoff
>>>>>>>>   -> z_erofs_decompressqueue_work
>>>>>>>>    -> z_erofs_decompress_queue
>>>>>>>>     -> z_erofs_decompress_pcluster
>>>>>>>>      -> mutex_lock
>>>>>>>> 
>>>>>>> 
>>>>>>> Why does the softirq path not trigger a workqueue instead?
>>>>>> 
>>>>>> I said "if it is". I was giving a scenario. mutex_lock() is not
>>>>>> allowed in softirq context or in an RCU-reader.
>>>>>> 
>>>>>>>> Per Sandeep in [1], this stack happens under RCU read-lock in:
>>>>>>>> 
>>>>>>>> #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
>>>>>>>> [...]
>>>>>>>>               rcu_read_lock();
>>>>>>>>               (dispatch_ops);
>>>>>>>>               rcu_read_unlock();
>>>>>>>> [...]
>>>>>>>> 
>>>>>>>> Coming from:
>>>>>>>> blk_mq_flush_plug_list ->
>>>>>>>>                          blk_mq_run_dispatch_ops(q,
>>>>>>>>                               __blk_mq_flush_plug_list(q, plug));
>>>>>>>> 
>>>>>>>> and __blk_mq_flush_plug_list does this:
>>>>>>>>         q->mq_ops->queue_rqs(&plug->mq_list);
>>>>>>>> 
>>>>>>>> This somehow ends up calling bio_endio() and then
>>>>>>>> z_erofs_decompressqueue_endio(), which grabs the mutex.
>>>>>>>> 
>>>>>>>> So... I have a question: it looks like one of the paths in
>>>>>>>> __blk_mq_run_dispatch_ops() uses SRCU, whereas the alternate
>>>>>>>> path uses RCU. Why does this alternate path want to block even if it is not
>>>>>>>> supposed to? Is the real issue here that the BLK_MQ_F_BLOCKING should
>>>>>>>> be set? It sounds like you want to block in the "else" path even
>>>>>>>> though BLK_MQ_F_BLOCKING is not set:
>>>>>>> 
>>>>>>> BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with.
>>>>>>> That is block layer and mq device driver stuff; filesystems cannot set
>>>>>>> this value.
>>>>>>> 
>>>>>>> As I said, as far as I understand, previously
>>>>>>> .end_io() could only be called outside of RCU context, so it was fine,
>>>>>>> but I don't know when .end_io() started to be called under some RCU
>>>>>>> context.
>>>>>> 
>>>>>> From what Sandeep described, the code path is in an RCU reader. My
>>>>>> question is more, why doesn't it use SRCU instead since it clearly
>>>>>> does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
>>>>>> dive needs to be made into that before concluding that the fix is to
>>>>>> use rcu_read_lock_any_held().
>>>>> 
>>>>> How can this be solved?
>>>>> 
>>>>> 1. Always use a workqueue.  Simple, but is said to have performance
>>>>> issues.
>>>>> 
>>>>> 2. Pass a flag in that indicates whether or not the caller is in an
>>>>> RCU read-side critical section.  Conceptually simple, but might
>>>>> or might not be reasonable to actually implement in the code as
>>>>> it exists now. (You tell me!)
>>>>> 
>>>>> 3. Create a function in z_erofs that gives you a decent
>>>>> approximation, maybe something like the following.
>>>>> 
>>>>> 4. Other ideas here.
>>>> 
>>>> 5. #3 plus make the corresponding Kconfig option select
>>>> PREEMPT_COUNT, assuming that any users needing compression in
>>>> non-preemptible kernels are OK with PREEMPT_COUNT being set.
>>>> (Some users of non-preemptible kernels object strenuously
>>>> to the added overhead from CONFIG_PREEMPT_COUNT=y.)
>>> 
>>> 6. Set one bit in bio->bi_private, check the bit and flip it in rcu_read_lock() path,
>>> then in z_erofs_decompressqueue_endio, check if the bit has changed.
>> 
>> Seems bad; reading and modifying bi_private is a bad idea.
> 
> Is there some other field that would work?

Maybe bio->bi_opf; btrfs uses some bits of it.

> 
> Thanx, Paul
> 
>>> Not sure if this is feasible or acceptable. :)
>>> 
>>>> 
>>>> Thanx, Paul
>>>> 
>>>>> The following is untested, and is probably quite buggy, but it should
>>>>> provide you with a starting point.
>>>>> 
>>>>> static bool z_erofs_wq_needed(void)
>>>>> {
>>>>> 	if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth())
>>>>> 		return true;  // RCU reader
>>>>> 	if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && !preemptible())
>>>>> 		return true;  // non-preemptible
>>>>> 	if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
>>>>> 		return true;  // non-preemptible kernel, so play it safe
>>>>> 	return false;
>>>>> }
>>>>> 
>>>>> You break it, you buy it!  ;-)
>>>>> 
>>>>> Thanx, Paul
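A loose sketch of this bi_opf idea, assuming a driver-private bit such as
REQ_DRV could be claimed for it (whether that is acceptable is exactly the
open question; the flag alias and both helpers below are invented names):

#include <linux/blk_types.h>

/* Hypothetical alias onto the driver-private request flag bit. */
#define REQ_Z_EROFS_RCU_READER	REQ_DRV

/* Dispatch side (sketch): mark a bio issued under rcu_read_lock(). */
static inline void z_erofs_mark_rcu_bio(struct bio *bio)
{
	bio->bi_opf |= REQ_Z_EROFS_RCU_READER;
}

/* End-I/O side (sketch): an RCU-marked bio must not sleep, so defer. */
static inline bool z_erofs_bio_needs_wq(const struct bio *bio)
{
	return bio->bi_opf & REQ_Z_EROFS_RCU_READER;
}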
Paul E. McKenney July 14, 2023, 6:44 p.m. UTC | #36
On Sat, Jul 15, 2023 at 02:40:08AM +0800, Alan Huang wrote:
> > On Jul 15, 2023, at 01:02, Paul E. McKenney <paulmck@kernel.org> wrote:
> > On Fri, Jul 14, 2023 at 11:54:47PM +0800, Alan Huang wrote:
> >>> On Jul 14, 2023, at 23:35, Alan Huang <mmpgouride@gmail.com> wrote:
> >>>> On Jul 14, 2023, at 10:16, Paul E. McKenney <paulmck@kernel.org> wrote:

[ . . . ]

> >>>>> How can this be solved?
> >>>>> 
> >>>>> 1. Always use a workqueue.  Simple, but is said to have performance
> >>>>> issues.
> >>>>> 
> >>>>> 2. Pass a flag in that indicates whether or not the caller is in an
> >>>>> RCU read-side critical section.  Conceptually simple, but might
> >>>>> or might not be reasonable to actually implement in the code as
> >>>>> it exists now. (You tell me!)
> >>>>> 
> >>>>> 3. Create a function in z_erofs that gives you a decent
> >>>>> approximation, maybe something like the following.
> >>>>> 
> >>>>> 4. Other ideas here.
> >>>> 
> >>>> 5. #3 plus make the corresponding Kconfig option select
> >>>> PREEMPT_COUNT, assuming that any users needing compression in
> >>>> non-preemptible kernels are OK with PREEMPT_COUNT being set.
> >>>> (Some users of non-preemptible kernels object strenuously
> >>>> to the added overhead from CONFIG_PREEMPT_COUNT=y.)
> >>> 
> >>> 6. Set one bit in bio->bi_private, check the bit and flip it in rcu_read_lock() path,
> >>> then in z_erofs_decompressqueue_endio, check if the bit has changed.
> >> 
> >> Seems bad; reading and modifying bi_private is a bad idea.
> > 
> > Is there some other field that would work?
> 
> Maybe bio->bi_opf; btrfs uses some bits of it.

Sandeep, thoughts?


							Thanx, Paul

> >>> Not sure if this is feasible or acceptable. :)
> >>> 
> >>>> 
> >>>> Thanx, Paul
> >>>> 
> >>>>> The following is untested, and is probably quite buggy, but it should
> >>>>> provide you with a starting point.
> >>>>> 
> >>>>> static bool z_erofs_wq_needed(void)
> >>>>> {
> >>>>> 	if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth())
> >>>>> 		return true;  // RCU reader
> >>>>> 	if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && !preemptible())
> >>>>> 		return true;  // non-preemptible
> >>>>> 	if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
> >>>>> 		return true;  // non-preemptible kernel, so play it safe
> >>>>> 	return false;
> >>>>> }
> >>>>> 
> >>>>> You break it, you buy it!  ;-)
> >>>>> 
> >>>>> Thanx, Paul
> 
>
Sandeep Dhavale July 14, 2023, 7:15 p.m. UTC | #37
>
> Sandeep, thoughts?
>
I prefer to modify the erofs check and contain the fix there, as erofs
cares about it and that is least invasive. For contexts where erofs
cannot tell for sure, it will always schedule the kworker (as with
CONFIG_PREEMPT_COUNT=n).

I will also do measurements to see if erofs should continue to check
for context and what the benefits are.

Thanks,
Sandeep.
>                                                         Thanx, Paul
>
Paul E. McKenney July 14, 2023, 7:36 p.m. UTC | #38
On Fri, Jul 14, 2023 at 12:15:23PM -0700, Sandeep Dhavale wrote:
> >
> > Sandeep, thoughts?
> >
> I prefer to modify the erofs check and contain the fix there, as erofs
> cares about it and that is least invasive. For contexts where erofs
> cannot tell for sure, it will always schedule the kworker (as with
> CONFIG_PREEMPT_COUNT=n).
> 
> I will also do measurements to see if erofs should continue to check
> for context and what the benefits are.

At the end of the day, I guess it is Gao Xiang's and Chao Yu's decision.

							Thanx, Paul
diff mbox series

Patch

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 5e5f920ade90..0d1d1d8c2360 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -319,14 +319,11 @@  int rcu_read_lock_any_held(void);
 # define rcu_lock_acquire(a)		do { } while (0)
 # define rcu_lock_release(a)		do { } while (0)
 
-static inline int rcu_read_lock_held(void)
-{
-	return 1;
-}
+int rcu_read_lock_held(void);
 
 static inline int rcu_read_lock_bh_held(void)
 {
-	return 1;
+	return !preemptible() || in_softirq();
 }
 
 static inline int rcu_read_lock_sched_held(void)
@@ -334,10 +331,7 @@  static inline int rcu_read_lock_sched_held(void)
 	return !preemptible();
 }
 
-static inline int rcu_read_lock_any_held(void)
-{
-	return !preemptible();
-}
+int rcu_read_lock_any_held(void);
 
 static inline int debug_lockdep_rcu_enabled(void)
 {
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 19bf6fa3ee6a..b34fc5bb96cf 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -390,8 +390,27 @@  int rcu_read_lock_any_held(void)
 }
 EXPORT_SYMBOL_GPL(rcu_read_lock_any_held);
 
-#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 
+int rcu_read_lock_held(void)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RCU))
+		return rcu_preempt_depth();
+	return !preemptible();
+}
+EXPORT_SYMBOL_GPL(rcu_read_lock_held);
+
+int rcu_read_lock_any_held(void)
+{
+	if (rcu_read_lock_held() ||
+	    rcu_read_lock_bh_held() ||
+	    rcu_read_lock_sched_held())
+		return 1;
+	return !preemptible();
+}
+EXPORT_SYMBOL_GPL(rcu_read_lock_any_held);
+
+#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 /**
  * wakeme_after_rcu() - Callback function to awaken a task after grace period
  * @head: Pointer to rcu_head member within rcu_synchronize structure
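As an illustration of the behavioral change (not part of the patch): with
CONFIG_DEBUG_LOCK_ALLOC=n and CONFIG_PREEMPT_RCU=y, the old hardcoded
"return 1" would trip the first check below even from preemptible process
context with no reader held, whereas the patched rcu_read_lock_held()
tracks the reader via rcu_preempt_depth(); rcu_held_check_demo() is an
illustrative name.

static void rcu_held_check_demo(void)
{
	/* Preemptible process context, outside any reader: now reports 0. */
	WARN_ON_ONCE(rcu_read_lock_held());

	rcu_read_lock();
	WARN_ON_ONCE(!rcu_read_lock_held());		/* rcu_preempt_depth() > 0 */
	WARN_ON_ONCE(!rcu_read_lock_any_held());	/* now consistent */
	rcu_read_unlock();
}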