Message ID | 20230711233816.2187577-1-dhavale@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v1] rcu: Fix and improve RCU read lock checks when !CONFIG_DEBUG_LOCK_ALLOC |
On Tue, Jul 11, 2023 at 7:38 PM Sandeep Dhavale <dhavale@google.com> wrote:
>
> Currently if CONFIG_DEBUG_LOCK_ALLOC is not set
>
> - rcu_read_lock_held() always returns 1
> - rcu_read_lock_any_held() may return 0 with CONFIG_PREEMPT_RCU
>
> This is inconsistent, and it was discovered when trying a fix for the
> problem reported at [1] when CONFIG_DEBUG_LOCK_ALLOC is not set and
> CONFIG_PREEMPT_RCU is enabled. The gist of the problem is that EROFS
> wants to detect atomic context so it can do inline decompression
> whenever possible, which is an important performance optimization. It
> turns out that z_erofs_decompressqueue_endio() can be called from
> blk_mq_flush_plug_list() with the RCU lock held, and hence the fix
> uses rcu_read_lock_any_held() to decide between sync/inline and async
> decompression.
>
> As per the documentation, we should report the lock as held if we
> aren't certain. But it seems we can improve the checks for whether the
> lock is held even if CONFIG_DEBUG_LOCK_ALLOC is not set, instead of
> hardcoding them to always return true:
>
> * rcu_read_lock_held()
>   - for CONFIG_PREEMPT_RCU, using rcu_preempt_depth()
>   - otherwise, using preemptible() (indirectly preempt_count())
>
> * rcu_read_lock_bh_held()
>   - for CONFIG_PREEMPT_RT, using in_softirq() (indirectly softirq_count())
>   - otherwise, using preemptible() (indirectly preempt_count())
>
> Lastly, to fix the inconsistency, rcu_read_lock_any_held() is updated
> to use the other rcu_read_lock_*_held() checks.
>
> Two of the improved checks are moved to kernel/rcu/update.c because
> rcupdate.h is included from very low level headers which do not know
> what current (task_struct) is, so the macro rcu_preempt_depth() cannot
> be expanded in rcupdate.h. See the original comment for
> rcu_preempt_depth() in the patch at [2] for more information.
>
> [1] https://lore.kernel.org/all/20230621220848.3379029-1-dhavale@google.com/
> [2] https://lore.kernel.org/all/1281392111-25060-8-git-send-email-paulmck@linux.vnet.ibm.com/
>
> Reported-by: Will Shiu <Will.Shiu@mediatek.com>
> Signed-off-by: Sandeep Dhavale <dhavale@google.com>
> ---
>  include/linux/rcupdate.h | 12 +++---------
>  kernel/rcu/update.c      | 21 ++++++++++++++++++++-
>  2 files changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 5e5f920ade90..0d1d1d8c2360 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -319,14 +319,11 @@ int rcu_read_lock_any_held(void);
>  # define rcu_lock_acquire(a) do { } while (0)
>  # define rcu_lock_release(a) do { } while (0)
>
> -static inline int rcu_read_lock_held(void)
> -{
> -	return 1;
> -}
> +int rcu_read_lock_held(void);
>
>  static inline int rcu_read_lock_bh_held(void)
>  {
> -	return 1;
> +	return !preemptible() || in_softirq();
>  }
>
>  static inline int rcu_read_lock_sched_held(void)
> @@ -334,10 +331,7 @@ static inline int rcu_read_lock_sched_held(void)
>  	return !preemptible();
>  }
>
> -static inline int rcu_read_lock_any_held(void)
> -{
> -	return !preemptible();
> -}
> +int rcu_read_lock_any_held(void);
>
>  static inline int debug_lockdep_rcu_enabled(void)
>  {
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 19bf6fa3ee6a..b34fc5bb96cf 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -390,8 +390,27 @@ int rcu_read_lock_any_held(void)
>  }
>  EXPORT_SYMBOL_GPL(rcu_read_lock_any_held);
>
> -#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> +#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>
> +int rcu_read_lock_held(void)
> +{
> +	if (IS_ENABLED(CONFIG_PREEMPT_RCU))
> +		return rcu_preempt_depth();
> +	return !preemptible();
> +}
> +EXPORT_SYMBOL_GPL(rcu_read_lock_held);
> +
> +int rcu_read_lock_any_held(void)
> +{
> +	if (rcu_read_lock_held() ||
> +	    rcu_read_lock_bh_held() ||
> +	    rcu_read_lock_sched_held())
> +		return 1;
> +	return !preemptible();

Actually, even the original code (the lockdep version) is incorrect, since
preemptible() cannot be relied upon if CONFIG_PREEMPT_COUNT is not set.
However, that's debug code. In this case, it is a real user (the fs code).
In non-preemptible kernels, we are always in an RCU-sched section, so you
can't really see if anyone called your code under rcu_read_lock(); the
rcu_read_lock/unlock() would be getting NOOPed. In such a kernel, it will
always tell your code it is in an RCU reader. That's not ideal for that
erofs code?

Also, per that erofs code:

	/* Use (kthread_)work and sync decompression for atomic contexts only */
	if (!in_task() || irqs_disabled() || rcu_read_lock_any_held()) {

I guess you are also assuming that rcu_read_lock_any_held() tells you
something about atomicity, but strictly speaking it doesn't, because
preemptible RCU readers are preemptible. You can't block, but preemption
is possible, so it is not "atomic". Maybe you meant "cannot block"?

As such this patch looks correct to me. One thing I noticed is that you
can check rcu_is_watching() like the lockdep-enabled code does. That will
also tell you if a reader section is possible, because in extended
quiescent states RCU readers should be non-existent, or that's a bug.

Could you also verify that this patch does not cause bloating of the
kernel if lockdep is disabled?

thanks,

- Joel
Hi Joel,

Thank you for the review!

> Actually even the original code is incorrect (the lockdep version)
> since preemptible() cannot be relied upon if CONFIG_PREEMPT_COUNT is
> not set. However, that's debug code. In this case, it is a real
> user (the fs code). In non-preemptible kernels, we are always in an
> RCU-sched section. So you can't really see if anyone called your code
> under rcu_read_lock(). The rcu_read_lock/unlock() would be getting
> NOOPed. In such a kernel, it will always tell your code it is in an
> RCU reader. That's not ideal for that erofs code?
>
> Also, per that erofs code:
> /* Use (kthread_)work and sync decompression for atomic contexts only */
> if (!in_task() || irqs_disabled() || rcu_read_lock_any_held()) {
>
> I guess you are also assuming that rcu_read_lock_any_held() tells you
> something about atomicity but strictly speaking, it doesn't because
> preemptible RCU readers are preemptible. You can't block but
> preemption is possible so it is not "atomic". Maybe you meant "cannot
> block"?
>
Yes, you are correct. For decompression, erofs needs to grab the
pcluster mutex, so erofs wants to know whether it can block in order to
decide between sync and async decompression.

Good point about !CONFIG_PREEMPT_COUNT. Yes, in that case erofs will
always do async decompression, which is less than ideal. Maybe the
erofs check can be modified to account for !CONFIG_PREEMPT_COUNT.

> As such this patch looks correct to me, one thing I noticed is that
> you can check rcu_is_watching() like the lockdep-enabled code does.
> That will tell you also if a reader-section is possible because in
> extended-quiescent-states, RCU readers should be non-existent or
> that's a bug.
>
Please correct me if I am wrong; reading from the comment in
kernel/rcu/update.c rcu_read_lock_held_common():

 * The reason for this is that RCU ignores CPUs that are
 * in such a section, considering these as in extended quiescent state,
 * so such a CPU is effectively never in an RCU read-side critical section
 * regardless of what RCU primitives it invokes.

It seems RCU will treat this as the lock not being held, rather than it
being a fact that the lock is not held. Is my understanding correct?

The reason I chose not to consult rcu_is_watching() in this version is
that the "sleeping function called from invalid context" check will
still get triggered (please see __might_resched() in
kernel/sched/core.c), as it does not consult rcu_is_watching() but
instead looks at rcu_preempt_depth(), which will be non-zero if
rcu_read_lock() was called (only when CONFIG_PREEMPT_RCU is enabled).

> Could you also verify that this patch does not cause bloating of the
> kernel if lockdep is disabled?
>
Sure, I will do the comparison and send the details.

Thanks,
Sandeep.

> thanks,
>
> - Joel
On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
[..]
> > As such this patch looks correct to me, one thing I noticed is that
> > you can check rcu_is_watching() like the lockdep-enabled code does.
> > That will tell you also if a reader-section is possible because in
> > extended-quiescent-states, RCU readers should be non-existent or
> > that's a bug.
> >
> Please correct me if I am wrong, reading from the comment in
> kernel/rcu/update.c rcu_read_lock_held_common()
> ..
> * The reason for this is that RCU ignores CPUs that are
> * in such a section, considering these as in extended quiescent state,
> * so such a CPU is effectively never in an RCU read-side critical section
> * regardless of what RCU primitives it invokes.
>
> It seems rcu will treat this as lock not held rather than a fact that
> lock is not held. Is my understanding correct?

If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe
you mean it is not a fact for erofs?

> The reason I chose not to consult rcu_is_watching() in this version
> is because check "sleeping function called from invalid context"
> will still get triggered (please see kernel/sched/core.c __might_resched())
> as it does not consult rcu_is_watching() instead looks at
> rcu_preempt_depth() which will be non-zero if rcu_read_lock()
> was called (only when CONFIG_PREEMPT_RCU is enabled).

I am assuming you mean you would grab the mutex accidentally when in an
RCU reader, and might_sleep(), presumably in the mutex internal code,
will scream? I would expect that in the erofs code rcu_is_watching()
should always return true, so it should not affect the decision of
whether to block or not. I am suggesting adding the check for
rcu_is_watching() into the *held() functions for completeness:

bool rcu_read_lock_any_held(void)
{
	/* !rcu_is_watching() is false when RCU is actively watching
	 * the CPU for readers. */
	if (!rcu_is_watching())
		return false;
	// do the rest..
}

> > Could you also verify that this patch does not cause bloating of the
> > kernel if lockdep is disabled?
> >
> Sure, I will do the comparison and send the details.

Thanks! This is indeed an interesting usecase of grabbing a mutex /
blocking in the reader.

thanks,

- Joel
On 2023/7/13 08:32, Joel Fernandes wrote:
> On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
> [..]
>>> As such this patch looks correct to me, one thing I noticed is that
>>> you can check rcu_is_watching() like the lockdep-enabled code does.
>>> That will tell you also if a reader-section is possible because in
>>> extended-quiescent-states, RCU readers should be non-existent or
>>> that's a bug.
>>>
>> Please correct me if I am wrong, reading from the comment in
>> kernel/rcu/update.c rcu_read_lock_held_common()
>> ..
>> * The reason for this is that RCU ignores CPUs that are
>> * in such a section, considering these as in extended quiescent state,
>> * so such a CPU is effectively never in an RCU read-side critical section
>> * regardless of what RCU primitives it invokes.
>>
>> It seems rcu will treat this as lock not held rather than a fact that
>> lock is not held. Is my understanding correct?
>
> If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe you
> mean it is not a fact for erofs?

I'm not sure if I get what you mean; EROFS doesn't take any RCU read
lock here:

z_erofs_decompressqueue_endio() is actually a "bio->bi_end_io", which
previously could be called under two scenarios:

1) under softirq context, which is actually part of device I/O completion;

2) under threaded context, like what dm-verity or likewise calls.

But EROFS needs to decompress in a threaded context anyway, so we
trigger a workqueue to resolve case 1).

Recently, someone reported there could be some case 3) [I think it was
introduced recently but I have no time to dig into it]:

case 3: under RCU read lock context, which is shown by this:
https://lore.kernel.org/r/4a8254eb-ac39-1e19-3d82-417d3a7b9f94@linux.alibaba.com/T/#u

and such an RCU read lock is taken in __blk_mq_run_dispatch_ops().

But as the commit shows, we only need to trigger a workqueue for cases
1) and 3), due to performance reasons.

Hopefully this makes it clearer.

Thanks,
Gao Xiang
On 2023/7/13 10:02, Gao Xiang wrote:
>
> On 2023/7/13 08:32, Joel Fernandes wrote:
>> On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
>> [..]
>>>> As such this patch looks correct to me, one thing I noticed is that
>>>> you can check rcu_is_watching() like the lockdep-enabled code does.
>>>> That will tell you also if a reader-section is possible because in
>>>> extended-quiescent-states, RCU readers should be non-existent or
>>>> that's a bug.
>>>>
>>> Please correct me if I am wrong, reading from the comment in
>>> kernel/rcu/update.c rcu_read_lock_held_common()
>>> ..
>>> * The reason for this is that RCU ignores CPUs that are
>>> * in such a section, considering these as in extended quiescent state,
>>> * so such a CPU is effectively never in an RCU read-side critical section
>>> * regardless of what RCU primitives it invokes.
>>>
>>> It seems rcu will treat this as lock not held rather than a fact that
>>> lock is not held. Is my understanding correct?
>>
>> If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe you
>> mean it is not a fact for erofs?
>
> I'm not sure if I get what you mean, EROFS doesn't take any RCU read lock
> here:
>
> z_erofs_decompressqueue_endio() is actually a "bio->bi_end_io", previously
> which can be called under two scenarios:
>
> 1) under softirq context, which is actually part of device I/O completion;
>
> 2) under threaded context, like what dm-verity or likewise calls.
>
> But EROFS needs to decompress in a threaded context anyway, so we trigger
> a workqueue to resolve the case 1).
>
> Recently, someone reported there could be some case 3) [I think it was
> introduced recently but I have no time to dig into it]:
>
> case 3: under RCU read lock context, which is shown by this:
> https://lore.kernel.org/r/4a8254eb-ac39-1e19-3d82-417d3a7b9f94@linux.alibaba.com/T/#u

Sorry about the incorrect link (I really don't know who initially
reported this and on which device):
https://lore.kernel.org/linux-erofs/161f1615-3d85-cf47-d2d5-695adf1ca7d4@linux.alibaba.com/T/#t

> and such RCU read lock is taken in __blk_mq_run_dispatch_ops().
>
> But as the commit shown, we only need to trigger a workqueue for case 1)
> and 3) due to performance reasons.
>
> Hopefully I show it more clear.
>
> Thanks,
> Gao Xiang
> On Jul 12, 2023, at 10:02 PM, Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
>> On 2023/7/13 08:32, Joel Fernandes wrote:
>> On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
>> [..]
>>>> As such this patch looks correct to me, one thing I noticed is that
>>>> you can check rcu_is_watching() like the lockdep-enabled code does.
>>>> That will tell you also if a reader-section is possible because in
>>>> extended-quiescent-states, RCU readers should be non-existent or
>>>> that's a bug.
>>>>
>>> Please correct me if I am wrong, reading from the comment in
>>> kernel/rcu/update.c rcu_read_lock_held_common()
>>> ..
>>> * The reason for this is that RCU ignores CPUs that are
>>> * in such a section, considering these as in extended quiescent state,
>>> * so such a CPU is effectively never in an RCU read-side critical section
>>> * regardless of what RCU primitives it invokes.
>>>
>>> It seems rcu will treat this as lock not held rather than a fact that
>>> lock is not held. Is my understanding correct?
>> If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe you
>> mean it is not a fact for erofs?
>
> I'm not sure if I get what you mean, EROFS doesn't take any RCU read lock

We are discussing the case 3) you mentioned below.

> here:
>
> z_erofs_decompressqueue_endio() is actually a "bio->bi_end_io", previously
> which can be called under two scenarios:
>
> 1) under softirq context, which is actually part of device I/O completion;
>
> 2) under threaded context, like what dm-verity or likewise calls.
>
> But EROFS needs to decompress in a threaded context anyway, so we trigger
> a workqueue to resolve the case 1).
>
> Recently, someone reported there could be some case 3) [I think it was
> introduced recently but I have no time to dig into it]:
>
> case 3: under RCU read lock context, which is shown by this:
> https://lore.kernel.org/r/4a8254eb-ac39-1e19-3d82-417d3a7b9f94@linux.alibaba.com/T/#u
>
> and such RCU read lock is taken in __blk_mq_run_dispatch_ops().
>
> But as the commit shown, we only need to trigger a workqueue for case 1)
> and 3) due to performance reasons.
>
> Hopefully I show it more clear.

Makes sense. Thanks,

- Joel

> Thanks,
> Gao Xiang
On Thu, Jul 13, 2023 at 10:02:17AM +0800, Gao Xiang wrote:
> On 2023/7/13 08:32, Joel Fernandes wrote:
> > On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
> > [..]
> > > > As such this patch looks correct to me, one thing I noticed is that
> > > > you can check rcu_is_watching() like the lockdep-enabled code does.
> > > > That will tell you also if a reader-section is possible because in
> > > > extended-quiescent-states, RCU readers should be non-existent or
> > > > that's a bug.
> > > >
> > > Please correct me if I am wrong, reading from the comment in
> > > kernel/rcu/update.c rcu_read_lock_held_common()
> > > ..
> > > * The reason for this is that RCU ignores CPUs that are
> > > * in such a section, considering these as in extended quiescent state,
> > > * so such a CPU is effectively never in an RCU read-side critical section
> > > * regardless of what RCU primitives it invokes.
> > >
> > > It seems rcu will treat this as lock not held rather than a fact that
> > > lock is not held. Is my understanding correct?
> >
> > If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe you
> > mean it is not a fact for erofs?
>
> I'm not sure if I get what you mean, EROFS doesn't take any RCU read lock
> here:

The key point is that we need lockdep to report errors when
rcu_read_lock(), rcu_dereference(), and friends are used when RCU is
not watching. We also need lockdep to report an error when someone
uses rcu_dereference() when RCU is not watching, but also forgets the
rcu_read_lock().

And this is the job of rcu_read_lock_held(), which is one reason why
that rcu_is_watching() is needed.

> z_erofs_decompressqueue_endio() is actually a "bio->bi_end_io", previously
> which can be called under two scenarios:
>
> 1) under softirq context, which is actually part of device I/O completion;
>
> 2) under threaded context, like what dm-verity or likewise calls.
>
> But EROFS needs to decompress in a threaded context anyway, so we trigger
> a workqueue to resolve the case 1).
>
> Recently, someone reported there could be some case 3) [I think it was
> introduced recently but I have no time to dig into it]:
>
> case 3: under RCU read lock context, which is shown by this:
> https://lore.kernel.org/r/4a8254eb-ac39-1e19-3d82-417d3a7b9f94@linux.alibaba.com/T/#u
>
> and such RCU read lock is taken in __blk_mq_run_dispatch_ops().
>
> But as the commit shown, we only need to trigger a workqueue for case 1)
> and 3) due to performance reasons.

Just out of curiosity, exactly how much is it costing to trigger the
workqueue?

> Hopefully I show it more clear.

One additional question... What is your plan for kernels built with
CONFIG_PREEMPT_COUNT=n? After all, in such kernels, there is no way
that I know of for code to determine whether it is in an RCU read-side
critical section, holding a spinlock, or running with preemption
disabled.

							Thanx, Paul
On 2023/7/13 12:27, Paul E. McKenney wrote:
> On Thu, Jul 13, 2023 at 10:02:17AM +0800, Gao Xiang wrote:
>> On 2023/7/13 08:32, Joel Fernandes wrote:
>>> On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
>>> [..]
>>>>> As such this patch looks correct to me, one thing I noticed is that
>>>>> you can check rcu_is_watching() like the lockdep-enabled code does.
>>>>> That will tell you also if a reader-section is possible because in
>>>>> extended-quiescent-states, RCU readers should be non-existent or
>>>>> that's a bug.
>>>>>
>>>> Please correct me if I am wrong, reading from the comment in
>>>> kernel/rcu/update.c rcu_read_lock_held_common()
>>>> ..
>>>> * The reason for this is that RCU ignores CPUs that are
>>>> * in such a section, considering these as in extended quiescent state,
>>>> * so such a CPU is effectively never in an RCU read-side critical section
>>>> * regardless of what RCU primitives it invokes.
>>>>
>>>> It seems rcu will treat this as lock not held rather than a fact that
>>>> lock is not held. Is my understanding correct?
>>>
>>> If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe you
>>> mean it is not a fact for erofs?
>>
>> I'm not sure if I get what you mean, EROFS doesn't take any RCU read lock
>> here:
>
> The key point is that we need lockdep to report errors when
> rcu_read_lock(), rcu_dereference(), and friends are used when RCU is
> not watching. We also need lockdep to report an error when someone
> uses rcu_dereference() when RCU is not watching, but also forgets the
> rcu_read_lock().
>
> And this is the job of rcu_read_lock_held(), which is one reason why
> that rcu_is_watching() is needed.
>
>> z_erofs_decompressqueue_endio() is actually a "bio->bi_end_io", previously
>> which can be called under two scenarios:
>>
>> 1) under softirq context, which is actually part of device I/O completion;
>>
>> 2) under threaded context, like what dm-verity or likewise calls.
>>
>> But EROFS needs to decompress in a threaded context anyway, so we trigger
>> a workqueue to resolve the case 1).
>>
>> Recently, someone reported there could be some case 3) [I think it was
>> introduced recently but I have no time to dig into it]:
>>
>> case 3: under RCU read lock context, which is shown by this:
>> https://lore.kernel.org/r/4a8254eb-ac39-1e19-3d82-417d3a7b9f94@linux.alibaba.com/T/#u
>>
>> and such RCU read lock is taken in __blk_mq_run_dispatch_ops().
>>
>> But as the commit shown, we only need to trigger a workqueue for case 1)
>> and 3) due to performance reasons.
>
> Just out of curiosity, exactly how much is it costing to trigger the
> workqueue?

There are lots of performance issues here, and there was even a Plumbers
topic last year to show that, see:

[1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
[2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
[3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
[4] https://lpc.events/event/16/contributions/1338/

and more.

I'm not sure if it's necessary to look into all of that, and Sandeep
knows more than I do (the scheduling issue becomes vital on some
aarch64 platforms).

Thanks,
Gao Xiang
On 2023/7/13 12:27, Paul E. McKenney wrote:
> On Thu, Jul 13, 2023 at 10:02:17AM +0800, Gao Xiang wrote:
>>
>> ...

Sorry, forgot some.

> One additional question... What is your plan for kernels built with
> CONFIG_PREEMPT_COUNT=n? After all, in such kernels, there is no way
> that I know of for code to determine whether it is in an RCU read-side
> critical section, holding a spinlock, or running with preemption
> disabled.

I'm not sure if Android (or all targeted) users use or care about this
configuration (CONFIG_PREEMPT_COUNT=n); personally I think for such a
configuration we could just fall back to the workqueue approach all
the time. Anyway, such optimization really comes from real workloads /
experience; users don't live well without such a mitigation.

Thanks,
Gao Xiang

>
> 							Thanx, Paul
On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
> On 2023/7/13 12:27, Paul E. McKenney wrote:
> > On Thu, Jul 13, 2023 at 10:02:17AM +0800, Gao Xiang wrote:
> > > On 2023/7/13 08:32, Joel Fernandes wrote:
> > > > On Wed, Jul 12, 2023 at 02:20:56PM -0700, Sandeep Dhavale wrote:
> > > > [..]
> > > > > > As such this patch looks correct to me, one thing I noticed is that
> > > > > > you can check rcu_is_watching() like the lockdep-enabled code does.
> > > > > > That will tell you also if a reader-section is possible because in
> > > > > > extended-quiescent-states, RCU readers should be non-existent or
> > > > > > that's a bug.
> > > > > >
> > > > > Please correct me if I am wrong, reading from the comment in
> > > > > kernel/rcu/update.c rcu_read_lock_held_common()
> > > > > ..
> > > > > * The reason for this is that RCU ignores CPUs that are
> > > > > * in such a section, considering these as in extended quiescent state,
> > > > > * so such a CPU is effectively never in an RCU read-side critical section
> > > > > * regardless of what RCU primitives it invokes.
> > > > >
> > > > > It seems rcu will treat this as lock not held rather than a fact that
> > > > > lock is not held. Is my understanding correct?
> > > >
> > > > If RCU treats it as a lock not held, that is a fact for RCU ;-). Maybe you
> > > > mean it is not a fact for erofs?
> > >
> > > I'm not sure if I get what you mean, EROFS doesn't take any RCU read lock
> > > here:
> >
> > The key point is that we need lockdep to report errors when
> > rcu_read_lock(), rcu_dereference(), and friends are used when RCU is
> > not watching. We also need lockdep to report an error when someone
> > uses rcu_dereference() when RCU is not watching, but also forgets the
> > rcu_read_lock().
> >
> > And this is the job of rcu_read_lock_held(), which is one reason why
> > that rcu_is_watching() is needed.
> >
> > > z_erofs_decompressqueue_endio() is actually a "bio->bi_end_io", previously
> > > which can be called under two scenarios:
> > >
> > > 1) under softirq context, which is actually part of device I/O completion;
> > >
> > > 2) under threaded context, like what dm-verity or likewise calls.
> > >
> > > But EROFS needs to decompress in a threaded context anyway, so we trigger
> > > a workqueue to resolve the case 1).
> > >
> > > Recently, someone reported there could be some case 3) [I think it was
> > > introduced recently but I have no time to dig into it]:
> > >
> > > case 3: under RCU read lock context, which is shown by this:
> > > https://lore.kernel.org/r/4a8254eb-ac39-1e19-3d82-417d3a7b9f94@linux.alibaba.com/T/#u
> > >
> > > and such RCU read lock is taken in __blk_mq_run_dispatch_ops().
> > >
> > > But as the commit shown, we only need to trigger a workqueue for case 1)
> > > and 3) due to performance reasons.
> >
> > Just out of curiosity, exactly how much is it costing to trigger the
> > workqueue?
>
> There are lots of performance issues here and even a plumber
> topic last year to show that, see:
>
> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
> [4] https://lpc.events/event/16/contributions/1338/
> and more.
>
> I'm not sure if it's necessary to look into all of that, and Sandeep
> knows more than I do (the scheduling issue becomes vital on some
> aarch64 platforms.)

Hmmm... Please let me try again.

Assuming that this approach turns out to make sense, the resulting
patch will need to clearly state the performance benefits directly in
the commit log.

And of course, for the approach to make sense, it must avoid breaking
the existing lockdep-RCU debugging code.

Is that more clear?

							Thanx, Paul
On 2023/7/13 12:52, Paul E. McKenney wrote:
> On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
>>
>> ...
>>
>> There are lots of performance issues here and even a plumber
>> topic last year to show that, see:
>>
>> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
>> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
>> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
>> [4] https://lpc.events/event/16/contributions/1338/
>> and more.
>>
>> I'm not sure if it's necessary to look into all of that, and Sandeep
>> knows more than I do (the scheduling issue becomes vital on some
>> aarch64 platforms.)
>
> Hmmm... Please let me try again.
>
> Assuming that this approach turns out to make sense, the resulting
> patch will need to clearly state the performance benefits directly in
> the commit log.
>
> And of course, for the approach to make sense, it must avoid breaking
> the existing lockdep-RCU debugging code.
>
> Is that more clear?

Personally I'm not working on the Android platform any more, so I
don't have a way to reproduce this; hopefully Sandeep could give
actual numbers _again_ for when dm-verity is enabled and another
workqueue is triggered here, and make a comparison showing why the
scheduling latency of the extra work becomes unacceptable.

Thanks,
Gao Xiang
On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> On 2023/7/13 12:52, Paul E. McKenney wrote:
> > On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
> >>
> >> ...
> >>
> >> There are lots of performance issues here and even a plumber
> >> topic last year to show that, see:
> >>
> >> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
> >> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
> >> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
> >> [4] https://lpc.events/event/16/contributions/1338/
> >> and more.
> >>
> >> I'm not sure if it's necessary to look into all of that, and Sandeep
> >> knows more than I do (the scheduling issue becomes vital on some
> >> aarch64 platforms.)
> >
> > Hmmm... Please let me try again.
> >
> > Assuming that this approach turns out to make sense, the resulting
> > patch will need to clearly state the performance benefits directly in
> > the commit log.
> >
> > And of course, for the approach to make sense, it must avoid breaking
> > the existing lockdep-RCU debugging code.
> >
> > Is that more clear?
>
> Personally I'm not working on Android platform any more so I don't
> have a way to reproduce, hopefully Sandeep could give actual numbers
> _again_ if dm-verity is enabled and trigger another workqueue here
> and make a comparison why the scheduling latency of the extra work
> becomes unacceptable.
>

Question from my side: are we talking about only performance issues, or
also a crash? It appears z_erofs_decompress_pcluster() takes
mutex_lock(&pcl->lock);

So if it is either in an RCU read-side critical section or in an atomic
section, like the softirq path, then it may schedule-while-atomic or
trigger RCU warnings.

z_erofs_decompressqueue_endio
 -> z_erofs_decompress_kickoff
  -> z_erofs_decompressqueue_work
   -> z_erofs_decompress_queue
    -> z_erofs_decompress_pcluster
     -> mutex_lock

Per Sandeep in [1], this stack happens under the RCU read lock in:

#define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
[...]
		rcu_read_lock();
		(dispatch_ops);
		rcu_read_unlock();
[...]

Coming from:

blk_mq_flush_plug_list ->
	blk_mq_run_dispatch_ops(q,
		__blk_mq_flush_plug_list(q, plug));

and __blk_mq_flush_plug_list does this:

	q->mq_ops->queue_rqs(&plug->mq_list);

This somehow ends up calling bio_endio and z_erofs_decompressqueue_endio,
which grabs the mutex.

So... I have a question: it looks like one of the paths in
__blk_mq_run_dispatch_ops() uses SRCU, whereas the alternate path uses
RCU. Why does this alternate path want to block even if it is not
supposed to? Is the real issue here that BLK_MQ_F_BLOCKING should be
set? It sounds like you want to block in the "else" path even though
BLK_MQ_F_BLOCKING is not set:

/* run the code block in @dispatch_ops with rcu/srcu read lock held */
#define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
do {								\
	if ((q)->tag_set->flags & BLK_MQ_F_BLOCKING) {		\
		struct blk_mq_tag_set *__tag_set = (q)->tag_set; \
		int srcu_idx;					\
								\
		might_sleep_if(check_sleep);			\
		srcu_idx = srcu_read_lock(__tag_set->srcu);	\
		(dispatch_ops);					\
		srcu_read_unlock(__tag_set->srcu, srcu_idx);	\
	} else {						\
		rcu_read_lock();				\
		(dispatch_ops);					\
		rcu_read_unlock();				\
	}							\
} while (0)

It seems quite incorrect to me that erofs wants to block even though
BLK_MQ_F_BLOCKING is clearly not set. Did I miss something? I believe
it may be more beneficial to correct this properly now, rather than
maintaining the appearance of non-blocking operations and potentially
having to manage atomicity detection with preempt counters at a later
stage.

thanks,

- Joel

[1] https://lore.kernel.org/all/CAB=BE-QV0PiKFpCOcdEUFxS4wJHsLCcsymAw+nP72Yr3b1WMng@mail.gmail.com/
On 2023/7/13 22:07, Joel Fernandes wrote:
> On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

[...]

> Question from my side, are we talking about only performance issues or
> also a crash? It appears z_erofs_decompress_pcluster() takes
> mutex_lock(&pcl->lock);
>
> So if it is either in an RCU read-side critical section or in an
> atomic section, like the softirq path, then it may
> schedule-while-atomic or trigger RCU warnings.
>
> z_erofs_decompressqueue_endio
>  -> z_erofs_decompress_kickoff
>   -> z_erofs_decompressqueue_work
>    -> z_erofs_decompress_queue
>     -> z_erofs_decompress_pcluster
>      -> mutex_lock

Why does the softirq path not trigger a workqueue instead? Why does it
trigger "schedule-while-atomic" in the softirq context here?

> Per Sandeep in [1], this stack happens under RCU read-lock in:
>
> #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
> [...]
> 		rcu_read_lock();
> 		(dispatch_ops);
> 		rcu_read_unlock();
> [...]
>
> Coming from:
> blk_mq_flush_plug_list ->
> 	blk_mq_run_dispatch_ops(q,
> 		__blk_mq_flush_plug_list(q, plug));
>
> and __blk_mq_flush_plug_list does this:
> 	q->mq_ops->queue_rqs(&plug->mq_list);
>
> This somehow ends up calling bio_endio and
> z_erofs_decompressqueue_endio, which grabs the mutex.
>
> So... I have a question, it looks like one of the paths in
> __blk_mq_run_dispatch_ops() uses SRCU, whereas the alternate
> path uses RCU. Why does this alternate path want to block even if it is
> not supposed to? Is the real issue here that BLK_MQ_F_BLOCKING should
> be set? It sounds like you want to block in the "else" path even
> though BLK_MQ_F_BLOCKING is not set:

BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with.
That is block-layer and mq device driver stuff; filesystems cannot set
this value.

As I said, as far as I understand, previously .end_io() could only be
called without RCU context, so it would be fine, but I don't know when
.end_io() can be called under some RCU context now.

Thanks,
Gao Xiang
On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> On 2023/7/13 22:07, Joel Fernandes wrote:
> > On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

[...]

> > Question from my side, are we talking about only performance issues or
> > also a crash?
> > It appears z_erofs_decompress_pcluster() takes
> > mutex_lock(&pcl->lock);
> >
> > So if it is either in an RCU read-side critical section or in an
> > atomic section, like the softirq path, then it may
> > schedule-while-atomic or trigger RCU warnings.
> >
> > z_erofs_decompressqueue_endio
> >  -> z_erofs_decompress_kickoff
> >   -> z_erofs_decompressqueue_work
> >    -> z_erofs_decompress_queue
> >     -> z_erofs_decompress_pcluster
> >      -> mutex_lock
>
> Why does the softirq path not trigger a workqueue instead?

I said "if it is". I was giving a scenario. mutex_lock() is not
allowed in softirq context or in an RCU reader.

[...]

> BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with.
> That is block layer and mq device driver stuffs. filesystems cannot set
> this value.
>
> As I said, as far as I understand, previously,
> .end_io() can only be called without RCU context, so it will be fine,
> but I don't know when .end_io() can be called under some RCU context
> now.
From what Sandeep described, the code path is in an RCU reader. My question is more, why doesn't it use SRCU instead since it clearly does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper dive needs to be made into that before concluding that the fix is to use rcu_read_lock_any_held(). - Joel
> On Jul 13, 2023, at 23:33, Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

[...]

>>> Question from my side, are we talking about only performance issues or
>>> also a crash?
[...]

>> As I said, as far as I understand, previously,
>> .end_io() can only be called without RCU context, so it will be fine,
>> but I don't know when .end_io() can be called under some RCU context
>> now.
>
> From what Sandeep described, the code path is in an RCU reader. My
> question is more, why doesn't it use SRCU instead since it clearly
> does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
> dive needs to be made into that before concluding that the fix is to
> use rcu_read_lock_any_held().

Copied from [1]:

"Background: Historically erofs would always schedule a kworker for
decompression which would incur the scheduling cost regardless of
the context. But z_erofs_decompressqueue_endio() may not always
be in atomic context and we could actually benefit from doing the
decompression in z_erofs_decompressqueue_endio() if we are in
thread context, for example when running with dm-verity.
This optimization was later added in patch [2] which has shown
improvement in performance benchmarks."

I'm not sure if it is a design issue.

[1] https://lore.kernel.org/all/20230621220848.3379029-1-dhavale@google.com/

>
> - Joel
On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote:
> On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> > On 2023/7/13 22:07, Joel Fernandes wrote:
> > > On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> > >> On 2023/7/13 12:52, Paul E. McKenney wrote:

[...]

> > > Question from my side, are we talking about only performance issues or
> > > also a crash?
[...]

> > BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with.
> > That is block layer and mq device driver stuffs. filesystems cannot set
> > this value.
> >
> > As I said, as far as I understand, previously,
> > .end_io() can only be called without RCU context, so it will be fine,
> > but I don't know when .end_io() can be called under some RCU context
> > now.
>
> From what Sandeep described, the code path is in an RCU reader. My
> question is more, why doesn't it use SRCU instead since it clearly
> does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
> dive needs to be made into that before concluding that the fix is to
> use rcu_read_lock_any_held().

How can this be solved?

1. Always use a workqueue. Simple, but is said to have performance
   issues.

2. Pass a flag in that indicates whether or not the caller is in an
   RCU read-side critical section. Conceptually simple, but might or
   might not be reasonable to actually implement in the code as it
   exists now. (You tell me!)

3. Create a function in z_erofs that gives you a decent approximation,
   maybe something like the following.

4. Other ideas here.

The following is untested, and is probably quite buggy, but it should
provide you with a starting point.

static bool z_erofs_wq_needed(void)
{
	if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth())
		return true;  // RCU reader
	if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && !preemptible())
		return true;  // non-preemptible
	if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
		return true;  // non-preemptible kernel, so play it safe
	return false;
}

You break it, you buy it!  ;-)

							Thanx, Paul
Hello All,

Let me answer some of the questions raised here.

* Performance aspect

EROFS is one of the popular filesystems of choice in Android for
read-only partitions, as it provides 30%+ storage space savings with
compression. In addition to microbenchmarks, boot times and cold app
launch benchmarks are very important to the Android ecosystem as they
directly translate to responsiveness from the user's point of view. We
saw some performance penalty in cold app launch benchmarks in a few
scenarios. Analysis showed that the significant variance was coming
from the scheduling cost, while decompression cost was more or less
the same. Please see [1], which shows scheduling costs for kthread vs
kworker.

> Just out of curiosity, exactly how much is it costing to trigger the
> workqueue?

I think the cost to trigger is not much; it's the actual scheduling
latency for the thread that we want to cut down. And if we are already
in thread context then there is no point in incurring any extra cost
if we can detect it reliably. That is what the erofs check is trying
to do.

> One additional question...  What is your plan for kernels built with
> CONFIG_PREEMPT_COUNT=n?

If there is no reliable way to detect whether we can block or not,
then in that case erofs has no option but to schedule the kworker.

* Regarding BLK_MQ_F_BLOCKING

As mentioned by Gao in the thread, this is a property of the blk-mq
device underneath, so erofs cannot control it as it has to work with
different types of block devices.

* Regarding rcu_is_watching()

> I am assuming you mean you would grab the mutex accidentally when in an RCU
> reader, and might_sleep() presumably in the mutex internal code will scream?

Thank you Paul for explaining in detail why it is important. I can get
the V2 going.

From looking at the code in kernel/sched/core.c, which only looks at
rcu_preempt_depth(), I am thinking it may still get triggered IIUC.
> The following is untested, and is probably quite buggy, but it should
> provide you with a starting point.
..

Yes, that can fix the problem at hand, as the erofs check also looks
for rcu_preempt_depth(). A similar approach was discarded as
rcu_preempt_depth() was thought to be too low-level, and we used
rcu_read_lock_any_held(), which is the superset, until we figured out
the inconsistency with !CONFIG_DEBUG_LOCK_ALLOC.

Paul, Joel,
Shall we fix the rcu_read_lock_*held() regardless of how erofs
improves the check?

Thanks,
Sandeep.

[1] https://lore.kernel.org/linux-erofs/20230208093322.75816-1-hsiangkao@linux.alibaba.com/
On Thu, Jul 13, 2023 at 10:05:46AM -0700, Sandeep Dhavale wrote:
> Hello All,
>
> Let me answer some of the questions raised here.

[...]

> Thank you Paul for explaining in detail why it is important. I can get
> the V2 going.
> From looking at the code at kernel/sched/core.c, which only looks
> at rcu_preempt_depth(), I am thinking it may still get triggered IIUC.
>
> > The following is untested, and is probably quite buggy, but it should
> > provide you with a starting point.
> ..
>
> Yes, that can fix the problem at hand as the erofs check also looks
> for rcu_preempt_depth(). A similar approach was discarded as
> rcu_preempt_depth() was thought to be too low-level, and we used
> rcu_read_lock_any_held(), which is the superset, until we figured out
> the inconsistency with !CONFIG_DEBUG_LOCK_ALLOC.

Thank you for the background.

> Paul, Joel,
> Shall we fix the rcu_read_lock_*held() regardless of how erofs
> improves the check?

Help me out here. Exactly what is broken with rcu_read_lock_*held(),
keeping in mind their intended use for lockdep-based debugging?

							Thanx, Paul

> Thanks,
> Sandeep.
>
> [1] https://lore.kernel.org/linux-erofs/20230208093322.75816-1-hsiangkao@linux.alibaba.com/
On Fri, Jul 14, 2023 at 12:09:27AM +0800, Alan Huang wrote:
>
> > On Jul 13, 2023, at 23:33, Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

[...]

> >>>> Personally I'm not working on Android platform any more so I don't
> >>>> have a way to reproduce, hopefully Sandeep could give actual
> >>>> numbers _again_ if dm-verity is enabled and trigger another
> >>>> workqueue here and make a comparison why the scheduling latency of
> >>>> the extra work becomes unacceptable.
[...]

> >> BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with.
> >> That is block layer and mq device driver stuffs. filesystems cannot set
> >> this value.
> >>
> >> As I said, as far as I understand, previously,
> >> .end_io() can only be called without RCU context, so it will be fine,
> >> but I don't know when .end_io() can be called under some RCU context
> >> now.
> >
> > From what Sandeep described, the code path is in an RCU reader. My
> > question is more, why doesn't it use SRCU instead since it clearly
> > does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper
> > dive needs to be made into that before concluding that the fix is to
> > use rcu_read_lock_any_held().
>
> Copied from [1]:
>
> "Background: Historically erofs would always schedule a kworker for
> decompression which would incur the scheduling cost regardless of
> the context. But z_erofs_decompressqueue_endio() may not always
> be in atomic context and we could actually benefit from doing the
> decompression in z_erofs_decompressqueue_endio() if we are in
> thread context, for example when running with dm-verity.
> This optimization was later added in patch [2] which has shown
> improvement in performance benchmarks."
>
> I'm not sure if it is a design issue.

I have no official opinion myself, but there are quite a few people
who firmly believe that any situation like this one (where driver or
file-system code needs to query the current context to see if blocking
is OK) constitutes a design flaw. Such people might argue that this
code path should have a clearly documented context, and that if that
documentation states that the code might be in atomic context, then the
driver/fs should assume atomic context. Alternatively, if driver/fs
needs the context to be non-atomic, the callers should make it so.

See for example in_atomic() and its comment header:

/*
 * Are we running in atomic context?  WARNING: this macro cannot
 * always detect atomic context; in particular, it cannot know about
 * held spinlocks in non-preemptible kernels.  Thus it should not be
 * used in the general case to determine whether sleeping is possible.
 * Do not use in_atomic() in driver code.
 */
#define in_atomic()	(preempt_count() != 0)

In the immortal words of Dan Frye, this should be good clean fun!  ;-)

							Thanx, Paul

> [1] https://lore.kernel.org/all/20230621220848.3379029-1-dhavale@google.com/
>
> >
> > - Joel
> Thank you for the background.
>
> > Paul, Joel,
> > Shall we fix the rcu_read_lock_*held() regardless of how erofs
> > improves the check?
>
> Help me out here. Exactly what is broken with rcu_read_lock_*held(),
> keeping in mind their intended use for lockdep-based debugging?
>
Hi Paul,

With !CONFIG_DEBUG_LOCK_ALLOC:

rcu_read_lock_held()     -> always returns 1.
rcu_read_lock_any_held() -> returns !preemptible(), so it may return 0.

Now, current usages of rcu_read_lock_*held() are under
RCU_LOCKDEP_WARN(), which becomes a no-op with !CONFIG_DEBUG_LOCK_ALLOC
(due to debug_lockdep_rcu_enabled()), so this inconsistency is not
causing any problems right now. So my question was about your opinion
on fixing this for the sake of consistent semantics, if it's worth
correcting.

Also, it would have been better IMO if there were a reliable API for
rcu_read_lock_*held() rather than erofs trying to figure it out at a
higher level.

> I have no official opinion myself, but there are quite a few people
...

Regarding erofs trying to detect this, I understand a few people can
have different opinions. Not scheduling a thread while already being in
thread context is reasonable in my opinion, and it has also shown
performance gains.

Thanks,
Sandeep.

> 							Thanx, Paul
On 2023/7/14 02:14, Paul E. McKenney wrote:
> On Fri, Jul 14, 2023 at 12:09:27AM +0800, Alan Huang wrote:

[...]

>> I'm not sure if it is a design issue.

What do you mean by a design issue, honestly? I feel uneasy hearing
this.

> I have no official opinion myself, but there are quite a few people
> who firmly believe that any situation like this one (where driver or
> file-system code needs to query the current context to see if blocking
> is OK) constitutes a design flaw. Such people might argue that this
> code path should have a clearly documented context, and that if that
> documentation states that the code might be in atomic context, then the
> driver/fs should assume atomic context. Alternatively, if driver/fs

I don't think a software decoder (for example, decompression) should be
left in atomic context, honestly.
Regardless of the decompression speed of the algorithms themselves (consider very slow decompression on very slow devices), it also means that we have no way to vmap or the like (consider decompression plus handling extra deduplication copies) in atomic context; even memory allocation has to be done atomically. Especially since we now have more cases that decode in RCU reader context in addition to softirq context? > needs the context to be non-atomic, the callers should make it so. > > See for example in_atomic() and its comment header: > > /* > * Are we running in atomic context? WARNING: this macro cannot > * always detect atomic context; in particular, it cannot know about > * held spinlocks in non-preemptible kernels. Thus it should not be > * used in the general case to determine whether sleeping is possible. > * Do not use in_atomic() in driver code. > */ > #define in_atomic() (preempt_count() != 0) > > In the immortal words of Dan Frye, this should be good clean fun! ;-) Honestly, I think such a helper (to show whether we are in atomic context) is useful to driver authors; even if it could cause some false positives in some configurations, that is acceptable. Anyway, I could "Always use a workqueue.", but again, the original commit was raised by a real vendor (OPPO), and I think if we drop this, all downstream users of dm-verity will be impacted and individual end users will not be happy either. Thanks, Gao Xiang
On Fri, Jul 14, 2023 at 03:00:25AM +0800, Gao Xiang wrote: > On 2023/7/14 02:14, Paul E. McKenney wrote: > > On Fri, Jul 14, 2023 at 12:09:27AM +0800, Alan Huang wrote: > > ... > > > > > > > > > From what Sandeep described, the code path is in an RCU reader. My > > > > question is more, why doesn't it use SRCU instead since it clearly > > > > does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper > > > > dive needs to be made into that before concluding that the fix is to > > > > use rcu_read_lock_any_held(). > > > > > > Copied from [1]: > > > > > > "Background: Historically erofs would always schedule a kworker for > > > decompression which would incur the scheduling cost regardless of > > > the context. But z_erofs_decompressqueue_endio() may not always > > > be in atomic context and we could actually benefit from doing the > > > decompression in z_erofs_decompressqueue_endio() if we are in > > > thread context, for example when running with dm-verity. > > > This optimization was later added in patch [2] which has shown > > > improvement in performance benchmarks.” > > > > > > I’m not sure if it is a design issue. > > What do you mean a design issue, honestly? I feel uneasy to hear this. > > > I have no official opinion myself, but there are quite a few people > > who firmly believe that any situation like this one (where driver or > > file-system code needs to query the current context to see if blocking > > is OK) constitutes a design flaw. Such people might argue that this > > code path should have a clearly documented context, and that if that > > documentation states that the code might be in atomic context, then the > > driver/fs should assume atomic context. Alternatively, if driver/fs > > I don't think a software decoder (for example, decompression) should be > left in the atomic context honestly. 
> > Regardless of the decompression speed of some algorithm theirselves > (considering very slow decompression on very slow devices), it means > that we also don't have a way to vmap or likewise (considering > decompression + handle extra deduplication copies) in the atomic > context, even memory allocation has to be in an atomic way. > > Especially now have more cases that decodes in the RCU reader context > apart from softirq contexts? Just out of curiosity, why are these cases within RCU readers happening? Is this due to recent MM changes? But yes, you would have the same problem in softirq context. > > needs the context to be non-atomic, the callers should make it so. > > > > See for example in_atomic() and its comment header: > > > > /* > > * Are we running in atomic context? WARNING: this macro cannot > > * always detect atomic context; in particular, it cannot know about > > * held spinlocks in non-preemptible kernels. Thus it should not be > > * used in the general case to determine whether sleeping is possible. > > * Do not use in_atomic() in driver code. > > */ > > #define in_atomic() (preempt_count() != 0) > > > > In the immortal words of Dan Frye, this should be good clean fun! ;-) > > Honestly, I think such helper (to show whether it's in the atomic context) > is useful to driver users, even it could cause some false positive in some > configuration but it's acceptable. > > Anyway, I could "Always use a workqueue.", but again, the original commit > was raised by a real vendor (OPPO), and I think if dropping this, all > downstream users which use dm-verity will be impacted and individual end > users will not be happy as well. Well, given what we have discussed thus far, for some kernels, you would always use a workqueue. One option would be to force CONFIG_PREEMPT_COUNT to always be enabled, but this option has been rejected multiple times in the past. But you are quite free to roll your own z_erofs_wq_needed(). Thanx, Paul
On Thu, Jul 13, 2023 at 11:51:34AM -0700, Sandeep Dhavale wrote: > > Thank you for the background. > > > > > Paul, Joel, > > > Shall we fix the rcu_read_lock_*held() regardless of how erofs > > > improves the check? > > > > Help me out here. Exactly what is broken with rcu_read_lock_*held(), > > keeping in mind their intended use for lockdep-based debugging? > > > Hi Paul, > With !CONFIG_DEBUG_ALLOC_LOCK > rcu_read_lock_held() -> Always returns 1. > rcu_read_lock_any_held()-> returns !preemptible() so may return 0. > > Now current usages for rcu_read_lock_*held() are under RCU_LOCKDEP_WARN() > which becomes noOP with !CONFIG_DEBUG_ALLOC_LOCK > (due to debug_lockdep_rcu_enabled()) so this inconsistency is not causing > any problems right now. So my question was about your opinion for fixing this > for semantics if it's worth correcting. > > Also it would have been better IMO if there was a reliable API > for rcu_read_lock_*held() than erofs trying to figure it out at a higher level. Sorry, but the current lockdep-support functions need to stay focused on lockdep. They are not set up for general use, as we already saw with rcu_is_watching(). If you get a z_erofs_wq_needed() (or whatever) upstream, and if it turns out that there is an RCU-specific portion that has clean semantics, then I would be willing to look at pulling that portion into RCU. Note "look at" as opposed to "unconditionally agree to". ;-) > > I have no official opinion myself, but there are quite a few people > ... > > Regarding erofs trying to detect this, I understand few people can > have different > opinions. Not scheduling a thread while being in a thread context itself > is reasonable in my opinion which also has shown performance gains. You still haven't quantified the performance gains. Presumably they are most compelling with large numbers of small buffers to be decrypted. But why not just make a z_erofs_wq_needed() or similar in your own code, and push it upstream? 
If the performance gains really are so compelling, one would hope that some sort of reasonable arrangement could be arrived at. Thanx, Paul > Thanks, > Sandeep. > > > > > Thanx, Paul > > > > > Thanks, > > > Sandeep. > > > > > > [1] https://lore.kernel.org/linux-erofs/20230208093322.75816-1-hsiangkao@linux.alibaba.com/ > > > > -- > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. > >
> > Sorry, but the current lockdep-support functions need to stay focused > on lockdep. They are not set up for general use, as we already saw > with rcu_is_watching(). > Ok, understood. > If you get a z_erofs_wq_needed() (or whatever) upstream, and if it turns > out that there is an RCU-specific portion that has clean semantics, > then I would be willing to look at pulling that portion into RCU. > Note "look at" as opposed to "unconditionally agree to". ;-) > > > I have no official opinion myself, but there are quite a few people > > ... > > > > Regarding erofs trying to detect this, I understand few people can > > have different > > opinions. Not scheduling a thread while being in a thread context itself > > is reasonable in my opinion which also has shown performance gains. > > You still haven't quantified the performance gains. Presumably they > are most compelling with large numbers of small buffers to be decrypted. > Maybe you missed one of the replies. Link [1] shows the scheduling overhead for kworker vs a high-priority kthread. I think we can all see that there is a non-zero cost associated with always scheduling vs inline decompression. > But why not just make a z_erofs_wq_needed() or similar in your own > code, and push it upstream? If the performance gains really are so > compelling, one would hope that some sort of reasonable arrangement > could be arrived at. > Yes, we will incorporate additional checks in erofs. Thanks, Sandeep. [1] https://lore.kernel.org/linux-erofs/20230208093322.75816-1-hsiangkao@linux.alibaba.com/
On Thu, Jul 13, 2023 at 04:08:29PM -0700, Sandeep Dhavale wrote: > > > > Sorry, but the current lockdep-support functions need to stay focused > > on lockdep. They are not set up for general use, as we already saw > > with rcu_is_watching(). > > > Ok, understood. > > > If you get a z_erofs_wq_needed() (or whatever) upstream, and if it turns > > out that there is an RCU-specific portion that has clean semantics, > > then I would be willing to look at pulling that portion into RCU. > > Note "look at" as opposed to "unconditionally agree to". ;-) > > > > I have no official opinion myself, but there are quite a few people > > > ... > > > > > > Regarding erofs trying to detect this, I understand few people can > > > have different > > > opinions. Not scheduling a thread while being in a thread context itself > > > is reasonable in my opinion which also has shown performance gains. > > > > You still haven't quantified the performance gains. Presumably they > > are most compelling with large numbers of small buffers to be decrypted. > > Maybe you missed one of the replies. Link [1] shows the scheduling overhead > for kworker vs high pri kthread. I think we can all see that there is non-zero > cost associated with always scheduling vs inline decompression. Heh! A reply I was not CCed on back in February. ;-) But data like that included directly in the commit log, gathered specifically for that commit log's patch, would be very helpful. > > But why not just make a z_erofs_wq_needed() or similar in your own > > code, and push it upstream? If the performance gains really are so > > compelling, one would hope that some sort of reasonable arrangement > > could be arrived at. > > > Yes, we will incorporate additional checks in erofs. Sounds good to me! Thanx, Paul > Thanks, > Sandeep. > > [1] https://lore.kernel.org/linux-erofs/20230208093322.75816-1-hsiangkao@linux.alibaba.com/
On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote: > On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote: > > On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > > > On 2023/7/13 22:07, Joel Fernandes wrote: > > > > On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > > > >> On 2023/7/13 12:52, Paul E. McKenney wrote: > > > >>> On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote: > > > >> > > > >> ... > > > >> > > > >>>> > > > >>>> There are lots of performance issues here and even a plumber > > > >>>> topic last year to show that, see: > > > >>>> > > > >>>> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org > > > >>>> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com > > > >>>> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com > > > >>>> [4] https://lpc.events/event/16/contributions/1338/ > > > >>>> and more. > > > >>>> > > > >>>> I'm not sure if it's necessary to look info all of that, > > > >>>> andSandeep knows more than I am (the scheduling issue > > > >>>> becomes vital on some aarch64 platform.) > > > >>> > > > >>> Hmmm... Please let me try again. > > > >>> > > > >>> Assuming that this approach turns out to make sense, the resulting > > > >>> patch will need to clearly state the performance benefits directly in > > > >>> the commit log. > > > >>> > > > >>> And of course, for the approach to make sense, it must avoid breaking > > > >>> the existing lockdep-RCU debugging code. > > > >>> > > > >>> Is that more clear? > > > >> > > > >> Personally I'm not working on Android platform any more so I don't > > > >> have a way to reproduce, hopefully Sandeep could give actually > > > >> number _again_ if dm-verity is enabled and trigger another > > > >> workqueue here and make a comparsion why the scheduling latency of > > > >> the extra work becomes unacceptable. 
> > > >> > > > > > > > > Question from my side, are we talking about only performance issues or > > > > also a crash? It appears z_erofs_decompress_pcluster() takes > > > > mutex_lock(&pcl->lock); > > > > > > > > So if it is either in an RCU read-side critical section or in an > > > > atomic section, like the softirq path, then it may > > > > schedule-while-atomic or trigger RCU warnings. > > > > > > > > z_erofs_decompressqueue_endio > > > > -> z_erofs_decompress_kickoff > > > > ->z_erofs_decompressqueue_work > > > > ->z_erofs_decompress_queue > > > > -> z_erofs_decompress_pcluster > > > > -> mutex_lock > > > > > > > > > > Why does the softirq path not trigger a workqueue instead? > > > > I said "if it is". I was giving a scenario. mutex_lock() is not > > allowed in softirq context or in an RCU-reader. > > > > > > Per Sandeep in [1], this stack happens under RCU read-lock in: > > > > > > > > #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \ > > > > [...] > > > > rcu_read_lock(); > > > > (dispatch_ops); > > > > rcu_read_unlock(); > > > > [...] > > > > > > > > Coming from: > > > > blk_mq_flush_plug_list -> > > > > blk_mq_run_dispatch_ops(q, > > > > __blk_mq_flush_plug_list(q, plug)); > > > > > > > > and __blk_mq_flush_plug_list does this: > > > > q->mq_ops->queue_rqs(&plug->mq_list); > > > > > > > > This somehow ends up calling the bio_endio and the > > > > z_erofs_decompressqueue_endio which grabs the mutex. > > > > > > > > So... I have a question, it looks like one of the paths in > > > > __blk_mq_run_dispatch_ops() uses SRCU. Where are as the alternate > > > > path uses RCU. Why does this alternate want to block even if it is not > > > > supposed to? Is the real issue here that the BLK_MQ_F_BLOCKING should > > > > be set? It sounds like you want to block in the "else" path even > > > > though BLK_MQ_F_BLOCKING is not set: > > > > > > BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with. 
> > > That is block layer and mq device driver stuffs. filesystems cannot set > > > this value. > > > > > > As I said, as far as I understand, previously, > > > .end_io() can only be called without RCU context, so it will be fine, > > > but I don't know when .end_io() can be called under some RCU context > > > now. > > > > >From what Sandeep described, the code path is in an RCU reader. My > > question is more, why doesn't it use SRCU instead since it clearly > > does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper > > dive needs to be made into that before concluding that the fix is to > > use rcu_read_lock_any_held(). > > How can this be solved? > > 1. Always use a workqueue. Simple, but is said to have performance > issues. > > 2. Pass a flag in that indicates whether or not the caller is in an > RCU read-side critical section. Conceptually simple, but might > or might not be reasonable to actually implement in the code as > it exists now. (You tell me!) > > 3. Create a function in z_erofs that gives you a decent > approximation, maybe something like the following. > > 4. Other ideas here. 5. #3 plus make the corresponding Kconfig option select PREEMPT_COUNT, assuming that any users needing compression in non-preemptible kernels are OK with PREEMPT_COUNT being set. (Some users of non-preemptible kernels object strenuously to the added overhead from CONFIG_PREEMPT_COUNT=y.) Thanx, Paul > The following is untested, and is probably quite buggy, but it should > provide you with a starting point. > > static bool z_erofs_wq_needed(void) > { > if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth()) > return true; // RCU reader > if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && !preemptible()) > return true; // non-preemptible > if (!IS_ENABLED(CONFIG_PREEMPT_COUNT)) > return true; // non-preemptible kernel, so play it safe > return false; > } > > You break it, you buy it! ;-) > > Thanx, Paul
On 2023/7/14 10:16, Paul E. McKenney wrote: > On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote: >> On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote: ... >>> >>> >From what Sandeep described, the code path is in an RCU reader. My >>> question is more, why doesn't it use SRCU instead since it clearly >>> does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper >>> dive needs to be made into that before concluding that the fix is to >>> use rcu_read_lock_any_held(). >> >> How can this be solved? >> >> 1. Always use a workqueue. Simple, but is said to have performance >> issues. >> >> 2. Pass a flag in that indicates whether or not the caller is in an >> RCU read-side critical section. Conceptually simple, but might >> or might not be reasonable to actually implement in the code as >> it exists now. (You tell me!) >> >> 3. Create a function in z_erofs that gives you a decent >> approximation, maybe something like the following. >> >> 4. Other ideas here. > > 5. #3 plus make the corresponding Kconfig option select > PREEMPT_COUNT, assuming that any users needing compression in > non-preemptible kernels are OK with PREEMPT_COUNT being set. > (Some users of non-preemptible kernels object strenuously > to the added overhead from CONFIG_PREEMPT_COUNT=y.) I'm not sure if it's a good idea; we still need to work with CONFIG_PREEMPT_COUNT=n (why not?), where we could just always trigger a workqueue. Anyway, before we proceed, I also think it'd be better to get some performance numbers for this first (e.g. with dm-verity) and record them in the commit message to justify it. Otherwise, I guess the same question will be raised again and again. Thanks, Gao Xiang
On Thu, Jul 13, 2023 at 11:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > > > > On 2023/7/14 10:16, Paul E. McKenney wrote: > > On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote: > >> On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote: > > ... > > >>> > >>> >From what Sandeep described, the code path is in an RCU reader. My > >>> question is more, why doesn't it use SRCU instead since it clearly > >>> does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper > >>> dive needs to be made into that before concluding that the fix is to > >>> use rcu_read_lock_any_held(). > >> > >> How can this be solved? > >> > >> 1. Always use a workqueue. Simple, but is said to have performance > >> issues. > >> > >> 2. Pass a flag in that indicates whether or not the caller is in an > >> RCU read-side critical section. Conceptually simple, but might > >> or might not be reasonable to actually implement in the code as > >> it exists now. (You tell me!) > >> > >> 3. Create a function in z_erofs that gives you a decent > >> approximation, maybe something like the following. > >> > >> 4. Other ideas here. > > > > 5. #3 plus make the corresponding Kconfig option select > > PREEMPT_COUNT, assuming that any users needing compression in > > non-preemptible kernels are OK with PREEMPT_COUNT being set. > > (Some users of non-preemptible kernels object strenuously > > to the added overhead from CONFIG_PREEMPT_COUNT=y.) > > I'm not sure if it's a good idea I think it is a fine idea. > we need to work on > CONFIG_PREEMPT_COUNT=n (why not?), we could just always trigger a > workqueue for this. > So CONFIG_PREEMPT_COUNT=n users don't deserve good performance? ;-) thanks, - Joel
On 2023/7/14 21:42, Joel Fernandes wrote: > On Thu, Jul 13, 2023 at 11:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: >> >> >> >> On 2023/7/14 10:16, Paul E. McKenney wrote: >>> On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote: >>>> On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote: >> >> ... >> >>>>> >>>>> >From what Sandeep described, the code path is in an RCU reader. My >>>>> question is more, why doesn't it use SRCU instead since it clearly >>>>> does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper >>>>> dive needs to be made into that before concluding that the fix is to >>>>> use rcu_read_lock_any_held(). >>>> >>>> How can this be solved? >>>> >>>> 1. Always use a workqueue. Simple, but is said to have performance >>>> issues. >>>> >>>> 2. Pass a flag in that indicates whether or not the caller is in an >>>> RCU read-side critical section. Conceptually simple, but might >>>> or might not be reasonable to actually implement in the code as >>>> it exists now. (You tell me!) >>>> >>>> 3. Create a function in z_erofs that gives you a decent >>>> approximation, maybe something like the following. >>>> >>>> 4. Other ideas here. >>> >>> 5. #3 plus make the corresponding Kconfig option select >>> PREEMPT_COUNT, assuming that any users needing compression in >>> non-preemptible kernels are OK with PREEMPT_COUNT being set. >>> (Some users of non-preemptible kernels object strenuously >>> to the added overhead from CONFIG_PREEMPT_COUNT=y.) >> >> I'm not sure if it's a good idea > > I think it is a fine idea. > >> we need to work on >> CONFIG_PREEMPT_COUNT=n (why not?), we could just always trigger a >> workqueue for this. >> > > So CONFIG_PREEMPT_COUNT=n users don't deserve good performance? ;-) I'm not sure if non-preemptible kernel users really care about such sensitive latencies, I don't know, my 2 cents. Thanks, Gao Xiang > > thanks, > > - Joel
On Fri, 14 Jul 2023 21:51:16 +0800 Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > >> we need to work on > >> CONFIG_PREEMPT_COUNT=n (why not?), we could just always trigger a > >> workqueue for this. > >> > > > > So CONFIG_PREEMPT_COUNT=n users don't deserve good performance? ;-) > > I'm not sure if non-preemptible kernel users really care about > such sensitive latencies, I don't know, my 2 cents. CONFIG_PREEMPT_COUNT=n is for *performance* but not for *latency*. That is, they care about the overall performance (batch processing) but not interactive performance. -- Steve
On Fri, Jul 14, 2023 at 10:56:15AM -0400, Steven Rostedt wrote: > On Fri, 14 Jul 2023 21:51:16 +0800 > Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > > > >> we need to work on > > >> CONFIG_PREEMPT_COUNT=n (why not?), we could just always trigger a > > >> workqueue for this. > > >> > > > > > > So CONFIG_PREEMPT_COUNT=n users don't deserve good performance? ;-) > > > > I'm not sure if non-preemptible kernel users really care about > > such sensitive latencies, I don't know, my 2 cents. > > CONFIG_PREEMPT_COUNT=n is for *performance* but not for *latency*. That is, > they care about the overall performance (batch processing) but not > interactive performance. Users of CONFIG_PREEMPT_COUNT=n do care about latency, but normally not in the sub-millisecond range. If the February posting is representative (no idea, myself), these latencies are in the tens of milliseconds. So one question is "why not both?" One way would be for the call chain to indicate when in atomic context, and another would be use of SRCU, to Joel's earlier point. Thanx, Paul
> 2023年7月14日 10:16,Paul E. McKenney <paulmck@kernel.org> 写道: > > On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote: >> On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote: >>> On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: >>>> On 2023/7/13 22:07, Joel Fernandes wrote: >>>>> On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: >>>>>> On 2023/7/13 12:52, Paul E. McKenney wrote: >>>>>>> On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote: >>>>>> >>>>>> ... >>>>>> >>>>>>>> >>>>>>>> There are lots of performance issues here and even a plumber >>>>>>>> topic last year to show that, see: >>>>>>>> >>>>>>>> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org >>>>>>>> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com >>>>>>>> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com >>>>>>>> [4] https://lpc.events/event/16/contributions/1338/ >>>>>>>> and more. >>>>>>>> >>>>>>>> I'm not sure if it's necessary to look info all of that, >>>>>>>> andSandeep knows more than I am (the scheduling issue >>>>>>>> becomes vital on some aarch64 platform.) >>>>>>> >>>>>>> Hmmm... Please let me try again. >>>>>>> >>>>>>> Assuming that this approach turns out to make sense, the resulting >>>>>>> patch will need to clearly state the performance benefits directly in >>>>>>> the commit log. >>>>>>> >>>>>>> And of course, for the approach to make sense, it must avoid breaking >>>>>>> the existing lockdep-RCU debugging code. >>>>>>> >>>>>>> Is that more clear? >>>>>> >>>>>> Personally I'm not working on Android platform any more so I don't >>>>>> have a way to reproduce, hopefully Sandeep could give actually >>>>>> number _again_ if dm-verity is enabled and trigger another >>>>>> workqueue here and make a comparsion why the scheduling latency of >>>>>> the extra work becomes unacceptable. 
>>>>>> >>>>> >>>>> Question from my side, are we talking about only performance issues or >>>>> also a crash? It appears z_erofs_decompress_pcluster() takes >>>>> mutex_lock(&pcl->lock); >>>>> >>>>> So if it is either in an RCU read-side critical section or in an >>>>> atomic section, like the softirq path, then it may >>>>> schedule-while-atomic or trigger RCU warnings. >>>>> >>>>> z_erofs_decompressqueue_endio >>>>> -> z_erofs_decompress_kickoff >>>>> ->z_erofs_decompressqueue_work >>>>> ->z_erofs_decompress_queue >>>>> -> z_erofs_decompress_pcluster >>>>> -> mutex_lock >>>>> >>>> >>>> Why does the softirq path not trigger a workqueue instead? >>> >>> I said "if it is". I was giving a scenario. mutex_lock() is not >>> allowed in softirq context or in an RCU-reader. >>> >>>>> Per Sandeep in [1], this stack happens under RCU read-lock in: >>>>> >>>>> #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \ >>>>> [...] >>>>> rcu_read_lock(); >>>>> (dispatch_ops); >>>>> rcu_read_unlock(); >>>>> [...] >>>>> >>>>> Coming from: >>>>> blk_mq_flush_plug_list -> >>>>> blk_mq_run_dispatch_ops(q, >>>>> __blk_mq_flush_plug_list(q, plug)); >>>>> >>>>> and __blk_mq_flush_plug_list does this: >>>>> q->mq_ops->queue_rqs(&plug->mq_list); >>>>> >>>>> This somehow ends up calling the bio_endio and the >>>>> z_erofs_decompressqueue_endio which grabs the mutex. >>>>> >>>>> So... I have a question, it looks like one of the paths in >>>>> __blk_mq_run_dispatch_ops() uses SRCU. Where are as the alternate >>>>> path uses RCU. Why does this alternate want to block even if it is not >>>>> supposed to? Is the real issue here that the BLK_MQ_F_BLOCKING should >>>>> be set? It sounds like you want to block in the "else" path even >>>>> though BLK_MQ_F_BLOCKING is not set: >>>> >>>> BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with. >>>> That is block layer and mq device driver stuffs. filesystems cannot set >>>> this value. 
>>>> >>>> As I said, as far as I understand, previously, >>>> .end_io() can only be called without RCU context, so it will be fine, >>>> but I don't know when .end_io() can be called under some RCU context >>>> now. >>> >>>> From what Sandeep described, the code path is in an RCU reader. My >>> question is more, why doesn't it use SRCU instead since it clearly >>> does so if BLK_MQ_F_BLOCKING. What are the tradeoffs? IMHO, a deeper >>> dive needs to be made into that before concluding that the fix is to >>> use rcu_read_lock_any_held(). >> >> How can this be solved? >> >> 1. Always use a workqueue. Simple, but is said to have performance >> issues. >> >> 2. Pass a flag in that indicates whether or not the caller is in an >> RCU read-side critical section. Conceptually simple, but might >> or might not be reasonable to actually implement in the code as >> it exists now. (You tell me!) >> >> 3. Create a function in z_erofs that gives you a decent >> approximation, maybe something like the following. >> >> 4. Other ideas here. > > 5. #3 plus make the corresponding Kconfig option select > PREEMPT_COUNT, assuming that any users needing compression in > non-preemptible kernels are OK with PREEMPT_COUNT being set. > (Some users of non-preemptible kernels object strenuously > to the added overhead from CONFIG_PREEMPT_COUNT=y.) 6. Set one bit in bio->bi_private, check the bit and flip it in rcu_read_lock() path, then in z_erofs_decompressqueue_endio, check if the bit has changed. Not sure if this is feasible or acceptable. :) > > Thanx, Paul > >> The following is untested, and is probably quite buggy, but it should >> provide you with a starting point. 
>> >> static bool z_erofs_wq_needed(void) >> { >> if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth()) >> return true; // RCU reader >> if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && !preemptible()) >> return true; // non-preemptible >> if (!IS_ENABLED(CONFIG_PREEMPT_COUNT)) >> return true; // non-preeemptible kernel, so play it safe >> return false; >> } >> >> You break it, you buy it! ;-) >> >> Thanx, Paul
> 2023年7月14日 23:35,Alan Huang <mmpgouride@gmail.com> 写道: > >> >> 2023年7月14日 10:16,Paul E. McKenney <paulmck@kernel.org> 写道: >> >> On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote: >>> On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote: >>>> On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: >>>>> On 2023/7/13 22:07, Joel Fernandes wrote: >>>>>> On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: >>>>>>> On 2023/7/13 12:52, Paul E. McKenney wrote: >>>>>>>> On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote: >>>>>>> >>>>>>> ... >>>>>>> >>>>>>>>> >>>>>>>>> There are lots of performance issues here and even a plumber >>>>>>>>> topic last year to show that, see: >>>>>>>>> >>>>>>>>> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org >>>>>>>>> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com >>>>>>>>> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com >>>>>>>>> [4] https://lpc.events/event/16/contributions/1338/ >>>>>>>>> and more. >>>>>>>>> >>>>>>>>> I'm not sure if it's necessary to look info all of that, >>>>>>>>> andSandeep knows more than I am (the scheduling issue >>>>>>>>> becomes vital on some aarch64 platform.) >>>>>>>> >>>>>>>> Hmmm... Please let me try again. >>>>>>>> >>>>>>>> Assuming that this approach turns out to make sense, the resulting >>>>>>>> patch will need to clearly state the performance benefits directly in >>>>>>>> the commit log. >>>>>>>> >>>>>>>> And of course, for the approach to make sense, it must avoid breaking >>>>>>>> the existing lockdep-RCU debugging code. >>>>>>>> >>>>>>>> Is that more clear? 
On Fri, Jul 14, 2023 at 11:54:47PM +0800, Alan Huang wrote:
> 
> > 2023年7月14日 23:35,Alan Huang <mmpgouride@gmail.com> 写道:
> > 
> >> 
> >> 2023年7月14日 10:16,Paul E. McKenney <paulmck@kernel.org> 写道:
> >> 
> >> On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote:
> >>> On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote:
> >>>> On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >>>>> On 2023/7/13 22:07, Joel Fernandes wrote:
> >>>>>> On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >>>>>>> On 2023/7/13 12:52, Paul E. McKenney wrote:
> >>>>>>>> On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
> >>>>>>> 
> >>>>>>> ...
> >>>>>>> 
> >>>>>>>>> There are lots of performance issues here and even a plumber
> >>>>>>>>> topic last year to show that, see:
> >>>>>>>>> 
> >>>>>>>>> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
> >>>>>>>>> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
> >>>>>>>>> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
> >>>>>>>>> [4] https://lpc.events/event/16/contributions/1338/
> >>>>>>>>> and more.
> >>>>>>>>> 
> >>>>>>>>> I'm not sure if it's necessary to look into all of that,
> >>>>>>>>> and Sandeep knows more than I do (the scheduling issue
> >>>>>>>>> becomes vital on some aarch64 platforms.)
> >>>>>>>> 
> >>>>>>>> Hmmm...  Please let me try again.
> >>>>>>>> 
> >>>>>>>> Assuming that this approach turns out to make sense, the resulting
> >>>>>>>> patch will need to clearly state the performance benefits directly in
> >>>>>>>> the commit log.
> >>>>>>>> 
> >>>>>>>> And of course, for the approach to make sense, it must avoid breaking
> >>>>>>>> the existing lockdep-RCU debugging code.
> >>>>>>>> 
> >>>>>>>> Is that more clear?
> >>>>>>> 
> >>>>>>> Personally I'm not working on the Android platform any more so I don't
> >>>>>>> have a way to reproduce; hopefully Sandeep could give actual
> >>>>>>> numbers _again_ if dm-verity is enabled and trigger another
> >>>>>>> workqueue here, and make a comparison of why the scheduling latency of
> >>>>>>> the extra work becomes unacceptable.
> >>>>>> 
> >>>>>> Question from my side, are we talking about only performance issues or
> >>>>>> also a crash?  It appears z_erofs_decompress_pcluster() takes
> >>>>>> mutex_lock(&pcl->lock);
> >>>>>> 
> >>>>>> So if it is either in an RCU read-side critical section or in an
> >>>>>> atomic section, like the softirq path, then it may
> >>>>>> schedule-while-atomic or trigger RCU warnings.
> >>>>>> 
> >>>>>> z_erofs_decompressqueue_endio
> >>>>>>  -> z_erofs_decompress_kickoff
> >>>>>>   -> z_erofs_decompressqueue_work
> >>>>>>    -> z_erofs_decompress_queue
> >>>>>>     -> z_erofs_decompress_pcluster
> >>>>>>      -> mutex_lock
> >>>>> 
> >>>>> Why does the softirq path not trigger a workqueue instead?
> >>>> 
> >>>> I said "if it is".  I was giving a scenario.  mutex_lock() is not
> >>>> allowed in softirq context or in an RCU reader.
> >>>> 
> >>>>>> Per Sandeep in [1], this stack happens under RCU read-lock in:
> >>>>>> 
> >>>>>> #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
> >>>>>> [...]
> >>>>>>         rcu_read_lock();
> >>>>>>         (dispatch_ops);
> >>>>>>         rcu_read_unlock();
> >>>>>> [...]
> >>>>>> 
> >>>>>> Coming from:
> >>>>>> blk_mq_flush_plug_list ->
> >>>>>>         blk_mq_run_dispatch_ops(q,
> >>>>>>                 __blk_mq_flush_plug_list(q, plug));
> >>>>>> 
> >>>>>> and __blk_mq_flush_plug_list does this:
> >>>>>>         q->mq_ops->queue_rqs(&plug->mq_list);
> >>>>>> 
> >>>>>> This somehow ends up calling the bio_endio and the
> >>>>>> z_erofs_decompressqueue_endio which grabs the mutex.
> >>>>>> 
> >>>>>> So... I have a question, it looks like one of the paths in
> >>>>>> __blk_mq_run_dispatch_ops() uses SRCU, whereas the alternate
> >>>>>> path uses RCU.  Why does this alternate path want to block even if it is
> >>>>>> not supposed to?  Is the real issue here that BLK_MQ_F_BLOCKING should
> >>>>>> be set?  It sounds like you want to block in the "else" path even
> >>>>>> though BLK_MQ_F_BLOCKING is not set:
> >>>>> 
> >>>>> BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with.
> >>>>> That is block layer and mq device driver stuff; filesystems cannot set
> >>>>> this value.
> >>>>> 
> >>>>> As I said, as far as I understand, previously
> >>>>> .end_io() could only be called without RCU context, so it was fine,
> >>>>> but I don't know when .end_io() can be called under some RCU context
> >>>>> now.
> >>>> 
> >>>> From what Sandeep described, the code path is in an RCU reader.  My
> >>>> question is more, why doesn't it use SRCU instead, since it clearly
> >>>> does so if BLK_MQ_F_BLOCKING?  What are the tradeoffs?  IMHO, a deeper
> >>>> dive needs to be made into that before concluding that the fix is to
> >>>> use rcu_read_lock_any_held().
> >>> 
> >>> How can this be solved?
> >>> 
> >>> 1.   Always use a workqueue.  Simple, but is said to have performance
> >>>      issues.
> >>> 
> >>> 2.   Pass a flag in that indicates whether or not the caller is in an
> >>>      RCU read-side critical section.  Conceptually simple, but might
> >>>      or might not be reasonable to actually implement in the code as
> >>>      it exists now.  (You tell me!)
> >>> 
> >>> 3.   Create a function in z_erofs that gives you a decent
> >>>      approximation, maybe something like the following.
> >>> 
> >>> 4.   Other ideas here.
> >> 
> >> 5.   #3 plus make the corresponding Kconfig option select
> >>      PREEMPT_COUNT, assuming that any users needing compression in
> >>      non-preemptible kernels are OK with PREEMPT_COUNT being set.
> >>      (Some users of non-preemptible kernels object strenuously
> >>      to the added overhead from CONFIG_PREEMPT_COUNT=y.)
> > 
> > 6.   Set one bit in bio->bi_private, check the bit and flip it in the
> >      rcu_read_lock() path, then in z_erofs_decompressqueue_endio, check
> >      if the bit has changed.
> 
> Seems bad, read and modify bi_private is a bad idea.

Is there some other field that would work?

							Thanx, Paul

> > Not sure if this is feasible or acceptable. :)
> > 
> >> 						Thanx, Paul
> >> 
> >>> The following is untested, and is probably quite buggy, but it should
> >>> provide you with a starting point.
> >>> 
> >>> static bool z_erofs_wq_needed(void)
> >>> {
> >>> 	if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth())
> >>> 		return true;  // RCU reader
> >>> 	if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && !preemptible())
> >>> 		return true;  // non-preemptible
> >>> 	if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
> >>> 		return true;  // non-preemptible kernel, so play it safe
> >>> 	return false;
> >>> }
> >>> 
> >>> You break it, you buy it!  ;-)
> >>> 
> >>> 						Thanx, Paul
> 2023年7月15日 01:02,Paul E. McKenney <paulmck@kernel.org> 写道:
> 
> On Fri, Jul 14, 2023 at 11:54:47PM +0800, Alan Huang wrote:
>> 
>>> 2023年7月14日 23:35,Alan Huang <mmpgouride@gmail.com> 写道:
>>> 
>>>> 
>>>> 2023年7月14日 10:16,Paul E. McKenney <paulmck@kernel.org> 写道:
>>>> 
>>>> On Thu, Jul 13, 2023 at 09:33:35AM -0700, Paul E. McKenney wrote:
>>>>> On Thu, Jul 13, 2023 at 11:33:24AM -0400, Joel Fernandes wrote:
>>>>>> On Thu, Jul 13, 2023 at 10:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>>>> On 2023/7/13 22:07, Joel Fernandes wrote:
>>>>>>>> On Thu, Jul 13, 2023 at 12:59 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>>>>>> On 2023/7/13 12:52, Paul E. McKenney wrote:
>>>>>>>>>> On Thu, Jul 13, 2023 at 12:41:09PM +0800, Gao Xiang wrote:
>>>>>>>>> 
>>>>>>>>> ...
>>>>>>>>> 
>>>>>>>>>>> There are lots of performance issues here and even a plumber
>>>>>>>>>>> topic last year to show that, see:
>>>>>>>>>>> 
>>>>>>>>>>> [1] https://lore.kernel.org/r/20230519001709.2563-1-tj@kernel.org
>>>>>>>>>>> [2] https://lore.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
>>>>>>>>>>> [3] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
>>>>>>>>>>> [4] https://lpc.events/event/16/contributions/1338/
>>>>>>>>>>> and more.
>>>>>>>>>>> 
>>>>>>>>>>> I'm not sure if it's necessary to look into all of that,
>>>>>>>>>>> and Sandeep knows more than I do (the scheduling issue
>>>>>>>>>>> becomes vital on some aarch64 platforms.)
>>>>>>>>>> 
>>>>>>>>>> Hmmm...  Please let me try again.
>>>>>>>>>> 
>>>>>>>>>> Assuming that this approach turns out to make sense, the resulting
>>>>>>>>>> patch will need to clearly state the performance benefits directly in
>>>>>>>>>> the commit log.
>>>>>>>>>> 
>>>>>>>>>> And of course, for the approach to make sense, it must avoid breaking
>>>>>>>>>> the existing lockdep-RCU debugging code.
>>>>>>>>>> 
>>>>>>>>>> Is that more clear?
>>>>>>>>> 
>>>>>>>>> Personally I'm not working on the Android platform any more so I don't
>>>>>>>>> have a way to reproduce; hopefully Sandeep could give actual
>>>>>>>>> numbers _again_ if dm-verity is enabled and trigger another
>>>>>>>>> workqueue here, and make a comparison of why the scheduling latency of
>>>>>>>>> the extra work becomes unacceptable.
>>>>>>>> 
>>>>>>>> Question from my side, are we talking about only performance issues or
>>>>>>>> also a crash?  It appears z_erofs_decompress_pcluster() takes
>>>>>>>> mutex_lock(&pcl->lock);
>>>>>>>> 
>>>>>>>> So if it is either in an RCU read-side critical section or in an
>>>>>>>> atomic section, like the softirq path, then it may
>>>>>>>> schedule-while-atomic or trigger RCU warnings.
>>>>>>>> 
>>>>>>>> z_erofs_decompressqueue_endio
>>>>>>>>  -> z_erofs_decompress_kickoff
>>>>>>>>   -> z_erofs_decompressqueue_work
>>>>>>>>    -> z_erofs_decompress_queue
>>>>>>>>     -> z_erofs_decompress_pcluster
>>>>>>>>      -> mutex_lock
>>>>>>> 
>>>>>>> Why does the softirq path not trigger a workqueue instead?
>>>>>> 
>>>>>> I said "if it is".  I was giving a scenario.  mutex_lock() is not
>>>>>> allowed in softirq context or in an RCU reader.
>>>>>> 
>>>>>>>> Per Sandeep in [1], this stack happens under RCU read-lock in:
>>>>>>>> 
>>>>>>>> #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
>>>>>>>> [...]
>>>>>>>>         rcu_read_lock();
>>>>>>>>         (dispatch_ops);
>>>>>>>>         rcu_read_unlock();
>>>>>>>> [...]
>>>>>>>> 
>>>>>>>> Coming from:
>>>>>>>> blk_mq_flush_plug_list ->
>>>>>>>>         blk_mq_run_dispatch_ops(q,
>>>>>>>>                 __blk_mq_flush_plug_list(q, plug));
>>>>>>>> 
>>>>>>>> and __blk_mq_flush_plug_list does this:
>>>>>>>>         q->mq_ops->queue_rqs(&plug->mq_list);
>>>>>>>> 
>>>>>>>> This somehow ends up calling the bio_endio and the
>>>>>>>> z_erofs_decompressqueue_endio which grabs the mutex.
>>>>>>>> 
>>>>>>>> So... I have a question, it looks like one of the paths in
>>>>>>>> __blk_mq_run_dispatch_ops() uses SRCU, whereas the alternate
>>>>>>>> path uses RCU.  Why does this alternate path want to block even if it is
>>>>>>>> not supposed to?  Is the real issue here that BLK_MQ_F_BLOCKING should
>>>>>>>> be set?  It sounds like you want to block in the "else" path even
>>>>>>>> though BLK_MQ_F_BLOCKING is not set:
>>>>>>> 
>>>>>>> BLK_MQ_F_BLOCKING is not a flag that a filesystem can do anything with.
>>>>>>> That is block layer and mq device driver stuff; filesystems cannot set
>>>>>>> this value.
>>>>>>> 
>>>>>>> As I said, as far as I understand, previously
>>>>>>> .end_io() could only be called without RCU context, so it was fine,
>>>>>>> but I don't know when .end_io() can be called under some RCU context
>>>>>>> now.
>>>>>> 
>>>>>> From what Sandeep described, the code path is in an RCU reader.  My
>>>>>> question is more, why doesn't it use SRCU instead, since it clearly
>>>>>> does so if BLK_MQ_F_BLOCKING?  What are the tradeoffs?  IMHO, a deeper
>>>>>> dive needs to be made into that before concluding that the fix is to
>>>>>> use rcu_read_lock_any_held().
>>>>> 
>>>>> How can this be solved?
>>>>> 
>>>>> 1.   Always use a workqueue.  Simple, but is said to have performance
>>>>>      issues.
>>>>> 
>>>>> 2.   Pass a flag in that indicates whether or not the caller is in an
>>>>>      RCU read-side critical section.  Conceptually simple, but might
>>>>>      or might not be reasonable to actually implement in the code as
>>>>>      it exists now.  (You tell me!)
>>>>> 
>>>>> 3.   Create a function in z_erofs that gives you a decent
>>>>>      approximation, maybe something like the following.
>>>>> 
>>>>> 4.   Other ideas here.
>>>> 
>>>> 5.   #3 plus make the corresponding Kconfig option select
>>>>      PREEMPT_COUNT, assuming that any users needing compression in
>>>>      non-preemptible kernels are OK with PREEMPT_COUNT being set.
>>>>      (Some users of non-preemptible kernels object strenuously
>>>>      to the added overhead from CONFIG_PREEMPT_COUNT=y.)
>>> 
>>> 6.   Set one bit in bio->bi_private, check the bit and flip it in the
>>>      rcu_read_lock() path, then in z_erofs_decompressqueue_endio, check
>>>      if the bit has changed.
>> 
>> Seems bad, read and modify bi_private is a bad idea.
> 
> Is there some other field that would work?

Maybe bio->bi_opf, btrfs uses some bits of it.

> 
> 							Thanx, Paul
> 
>>> Not sure if this is feasible or acceptable. :)
>>> 
>>>> 						Thanx, Paul
>>>> 
>>>>> The following is untested, and is probably quite buggy, but it should
>>>>> provide you with a starting point.
>>>>> 
>>>>> static bool z_erofs_wq_needed(void)
>>>>> {
>>>>> 	if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth())
>>>>> 		return true;  // RCU reader
>>>>> 	if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && !preemptible())
>>>>> 		return true;  // non-preemptible
>>>>> 	if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
>>>>> 		return true;  // non-preemptible kernel, so play it safe
>>>>> 	return false;
>>>>> }
>>>>> 
>>>>> You break it, you buy it!  ;-)
>>>>> 
>>>>> 						Thanx, Paul
On Sat, Jul 15, 2023 at 02:40:08AM +0800, Alan Huang wrote:
> > 2023年7月15日 01:02,Paul E. McKenney <paulmck@kernel.org> 写道:
> > On Fri, Jul 14, 2023 at 11:54:47PM +0800, Alan Huang wrote:
> >>> 2023年7月14日 23:35,Alan Huang <mmpgouride@gmail.com> 写道:
> >>>> 2023年7月14日 10:16,Paul E. McKenney <paulmck@kernel.org> 写道:

[ . . . ]

> >>>>> How can this be solved?
> >>>>> 
> >>>>> 1.   Always use a workqueue.  Simple, but is said to have performance
> >>>>>      issues.
> >>>>> 
> >>>>> 2.   Pass a flag in that indicates whether or not the caller is in an
> >>>>>      RCU read-side critical section.  Conceptually simple, but might
> >>>>>      or might not be reasonable to actually implement in the code as
> >>>>>      it exists now.  (You tell me!)
> >>>>> 
> >>>>> 3.   Create a function in z_erofs that gives you a decent
> >>>>>      approximation, maybe something like the following.
> >>>>> 
> >>>>> 4.   Other ideas here.
> >>>> 
> >>>> 5.   #3 plus make the corresponding Kconfig option select
> >>>>      PREEMPT_COUNT, assuming that any users needing compression in
> >>>>      non-preemptible kernels are OK with PREEMPT_COUNT being set.
> >>>>      (Some users of non-preemptible kernels object strenuously
> >>>>      to the added overhead from CONFIG_PREEMPT_COUNT=y.)
> >>> 
> >>> 6.   Set one bit in bio->bi_private, check the bit and flip it in the
> >>>      rcu_read_lock() path, then in z_erofs_decompressqueue_endio, check
> >>>      if the bit has changed.
> >> 
> >> Seems bad, read and modify bi_private is a bad idea.
> > 
> > Is there some other field that would work?
> 
> Maybe bio->bi_opf, btrfs uses some bits of it.

Sandeep, thoughts?

							Thanx, Paul

> >>> Not sure if this is feasible or acceptable. :)
> >>> 
> >>>> 						Thanx, Paul
> >>>> 
> >>>>> The following is untested, and is probably quite buggy, but it should
> >>>>> provide you with a starting point.
> >>>>> 
> >>>>> static bool z_erofs_wq_needed(void)
> >>>>> {
> >>>>> 	if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth())
> >>>>> 		return true;  // RCU reader
> >>>>> 	if (IS_ENABLED(CONFIG_PREEMPT_COUNT) && !preemptible())
> >>>>> 		return true;  // non-preemptible
> >>>>> 	if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
> >>>>> 		return true;  // non-preemptible kernel, so play it safe
> >>>>> 	return false;
> >>>>> }
> >>>>> 
> >>>>> You break it, you buy it!  ;-)
> >>>>> 
> >>>>> 						Thanx, Paul
> Sandeep, thoughts?

I prefer to modify the erofs check and contain the fix there, as erofs
cares about it and it is the least invasive option. For contexts where
erofs cannot tell for sure, it will always schedule a kworker (as with
CONFIG_PREEMPT_COUNT=n).

I will also do measurements to see if erofs should continue to check for
context and what the benefits are.

Thanks,
Sandeep.

> 							Thanx, Paul
On Fri, Jul 14, 2023 at 12:15:23PM -0700, Sandeep Dhavale wrote:
> > Sandeep, thoughts?
> 
> I prefer to modify erofs check and contain the fix there as erofs
> cares about it and it's least invasive. For contexts where erofs
> cannot tell for sure, it will always schedule kworker
> (like CONFIG_PREEMPT_COUNT=n).
> 
> I will also do measurements to see if erofs should continue to check
> for context and what are the benefits.

At the end of the day, I guess it is Gao Xiang's and Chao Yu's decision.

							Thanx, Paul
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 5e5f920ade90..0d1d1d8c2360 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -319,14 +319,11 @@ int rcu_read_lock_any_held(void);
 # define rcu_lock_acquire(a)		do { } while (0)
 # define rcu_lock_release(a)		do { } while (0)
 
-static inline int rcu_read_lock_held(void)
-{
-	return 1;
-}
+int rcu_read_lock_held(void);
 
 static inline int rcu_read_lock_bh_held(void)
 {
-	return 1;
+	return !preemptible() || in_softirq();
 }
 
 static inline int rcu_read_lock_sched_held(void)
@@ -334,10 +331,7 @@ static inline int rcu_read_lock_sched_held(void)
 	return !preemptible();
 }
 
-static inline int rcu_read_lock_any_held(void)
-{
-	return !preemptible();
-}
+int rcu_read_lock_any_held(void);
 
 static inline int debug_lockdep_rcu_enabled(void)
 {
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 19bf6fa3ee6a..b34fc5bb96cf 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -390,8 +390,27 @@ int rcu_read_lock_any_held(void)
 }
 EXPORT_SYMBOL_GPL(rcu_read_lock_any_held);
 
-#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 
+int rcu_read_lock_held(void)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RCU))
+		return rcu_preempt_depth();
+	return !preemptible();
+}
+EXPORT_SYMBOL_GPL(rcu_read_lock_held);
+
+int rcu_read_lock_any_held(void)
+{
+	if (rcu_read_lock_held() ||
+	    rcu_read_lock_bh_held() ||
+	    rcu_read_lock_sched_held())
+		return 1;
+	return !preemptible();
+}
+EXPORT_SYMBOL_GPL(rcu_read_lock_any_held);
+
+#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 /**
  * wakeme_after_rcu() - Callback function to awaken a task after grace period
  * @head: Pointer to rcu_head member within rcu_synchronize structure
Currently if CONFIG_DEBUG_LOCK_ALLOC is not set:

- rcu_read_lock_held() always returns 1
- rcu_read_lock_any_held() may return 0 with CONFIG_PREEMPT_RCU

This is inconsistent, and it was discovered when trying a fix for the
problem reported in [1] with CONFIG_DEBUG_LOCK_ALLOC not set and
CONFIG_PREEMPT_RCU enabled. The gist of the problem is that EROFS wants
to detect atomic context so it can do inline decompression whenever
possible; this is an important performance optimization. It turns out
that z_erofs_decompressqueue_endio() can be called from
blk_mq_flush_plug_list() with the RCU lock held, and hence the fix uses
rcu_read_lock_any_held() to decide between sync/inline decompression
and async decompression.

As per the documentation, we should report the lock as held if we
aren't certain. But it seems we can improve the checks for whether the
lock is held even if CONFIG_DEBUG_LOCK_ALLOC is not set, instead of
hardcoding them to always return true:

* rcu_read_lock_held()
  - for CONFIG_PREEMPT_RCU, using rcu_preempt_depth()
  - otherwise, using preemptible() (indirectly preempt_count())

* rcu_read_lock_bh_held()
  - for CONFIG_PREEMPT_RT, using in_softirq() (indirectly softirq_count())
  - otherwise, using preemptible() (indirectly preempt_count())

Lastly, to fix the inconsistency, rcu_read_lock_any_held() is updated
to use the other rcu_read_lock_*_held() checks.

Two of the improved checks are moved to kernel/rcu/update.c because
rcupdate.h is included from very low-level headers which do not know
what current (task_struct) is, so the macro rcu_preempt_depth() cannot
be expanded in rcupdate.h. See the original comment for
rcu_preempt_depth() in the patch at [2] for more information.
[1] https://lore.kernel.org/all/20230621220848.3379029-1-dhavale@google.com/
[2] https://lore.kernel.org/all/1281392111-25060-8-git-send-email-paulmck@linux.vnet.ibm.com/

Reported-by: Will Shiu <Will.Shiu@mediatek.com>
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
---
 include/linux/rcupdate.h | 12 +++---------
 kernel/rcu/update.c      | 21 ++++++++++++++++++++-
 2 files changed, 23 insertions(+), 10 deletions(-)