| Message ID | 20240622035815.569665-1-leobras@redhat.com (mailing list archive) |
|---|---|
| Series | Introduce QPW for per-cpu operations |
Hi,

you've included tglx, which is great, but there's also a LOCKING PRIMITIVES
section in MAINTAINERS so I've added folks from there in my reply.

Link to full series:
https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/

On 6/22/24 5:58 AM, Leonardo Bras wrote:
> The problem:
> Some places in the kernel implement a parallel programming strategy
> consisting of local_locks() for most of the work, while some rare remote
> operations are scheduled on the target cpu. This keeps cache bouncing low,
> since the cacheline tends to be mostly local, and avoids the cost of locks
> in non-RT kernels, even though the very few remote operations will be
> expensive due to scheduling overhead.
>
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with remote requests is
> sure to introduce unexpected deadline misses.
>
> The idea:
> Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> In this case, instead of scheduling work on a remote cpu, it should
> be safe to grab that remote cpu's per-cpu spinlock and run the required
> work locally. The major cost, which is un/locking in every local function,
> already happens in PREEMPT_RT.

I've also noticed this a while ago (likely in the context of rewriting SLUB
to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of
the idea. But I forgot the details about why, so I'll let the locking
experts reply...

> Also, there is no need to worry about extra cache bouncing:
> the cacheline invalidation already happens due to schedule_work_on().
>
> This will avoid schedule_work_on(), and thus avoid scheduling out an
> RT workload.
>
> For patches 2, 3 & 4, I noticed that just grabbing the lock and executing
> the function locally is much faster than scheduling it on a remote cpu.
>
> Proposed solution:
> A new interface called Queue PerCPU Work (QPW), which should replace
> Work Queue in the above mentioned use case.
>
> If PREEMPT_RT=n, this interface just wraps the current
> local_locks + WorkQueue behavior, so no change in runtime is expected.
>
> If PREEMPT_RT=y, queue_percpu_work_on(cpu,...) will lock that cpu's
> per-cpu structure and perform work on it locally. This is possible
> because in functions that can be used for performing remote work on
> remote per-cpu structures, the local_lock (which is already
> a this_cpu spinlock()) will be replaced by a qpw_spinlock(), which
> is able to get the per_cpu spinlock() for the cpu passed as parameter.
>
> Patch 1 implements the QPW interface, and patches 2, 3 & 4 replace the
> current local_lock + WorkQueue interface with the QPW interface in
> swap, memcontrol & slub.
>
> Please let me know what you think of this, and please suggest
> improvements.
>
> Thanks a lot!
> Leo
>
> Leonardo Bras (4):
>   Introducing qpw_lock() and per-cpu queue & flush work
>   swap: apply new queue_percpu_work_on() interface
>   memcontrol: apply new queue_percpu_work_on() interface
>   slub: apply new queue_percpu_work_on() interface
>
>  include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++
>  mm/memcontrol.c     | 20 ++++++-----
>  mm/slub.c           | 26 ++++++++------
>  mm/swap.c           | 26 +++++++-------
>  4 files changed, 127 insertions(+), 33 deletions(-)
>  create mode 100644 include/linux/qpw.h
>
>
> base-commit: 50736169ecc8387247fe6a00932852ce7b057083
On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote: > Hi, > > you've included tglx, which is great, but there's also LOCKING PRIMITIVES > section in MAINTAINERS so I've added folks from there in my reply. Thanks! > Link to full series: > https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/ > And apologies to Leonardo... I think this is a follow-up of: https://lpc.events/event/17/contributions/1484/ and I did remember we had a quick chat after that which I suggested it's better to change to a different name, sorry that I never found time to write a proper rely to your previous seriese [1] as promised. [1]: https://lore.kernel.org/lkml/20230729083737.38699-2-leobras@redhat.com/ > On 6/22/24 5:58 AM, Leonardo Bras wrote: > > The problem: > > Some places in the kernel implement a parallel programming strategy > > consisting on local_locks() for most of the work, and some rare remote > > operations are scheduled on target cpu. This keeps cache bouncing low since > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT > > kernels, even though the very few remote operations will be expensive due > > to scheduling overhead. > > > > On the other hand, for RT workloads this can represent a problem: getting > > an important workload scheduled out to deal with remote requests is > > sure to introduce unexpected deadline misses. > > > > The idea: > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks. > > In this case, instead of scheduling work on a remote cpu, it should > > be safe to grab that remote cpu's per-cpu spinlock and run the required > > work locally. Tha major cost, which is un/locking in every local function, > > already happens in PREEMPT_RT. > > I've also noticed this a while ago (likely in the context of rewriting SLUB > to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of > the idea. But I forgot the details about why, so I'll let the the locking > experts reply... > I think it's a good idea, especially the new name is less confusing ;-) So I wonder Thomas' thoughts as well. And I think a few (micro-)benchmark numbers will help. Regards, Boqun > > Also, there is no need to worry about extra cache bouncing: > > The cacheline invalidation already happens due to schedule_work_on(). > > > > This will avoid schedule_work_on(), and thus avoid scheduling-out an > > RT workload. > > > > For patches 2, 3 & 4, I noticed just grabing the lock and executing > > the function locally is much faster than just scheduling it on a > > remote cpu. > > > > Proposed solution: > > A new interface called Queue PerCPU Work (QPW), which should replace > > Work Queue in the above mentioned use case. > > > > If PREEMPT_RT=n, this interfaces just wraps the current > > local_locks + WorkQueue behavior, so no expected change in runtime. > > > > If PREEMPT_RT=y, queue_percpu_work_on(cpu,...) will lock that cpu's > > per-cpu structure and perform work on it locally. This is possible > > because on functions that can be used for performing remote work on > > remote per-cpu structures, the local_lock (which is already > > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which > > is able to get the per_cpu spinlock() for the cpu passed as parameter. > > > > Patch 1 implements QPW interface, and patches 2, 3 & 4 replaces the > > current local_lock + WorkQueue interface by the QPW interface in > > swap, memcontrol & slub interface. 
> > > > Please let me know what you think on that, and please suggest > > improvements. > > > > Thanks a lot! > > Leo > > > > Leonardo Bras (4): > > Introducing qpw_lock() and per-cpu queue & flush work > > swap: apply new queue_percpu_work_on() interface > > memcontrol: apply new queue_percpu_work_on() interface > > slub: apply new queue_percpu_work_on() interface > > > > include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++ > > mm/memcontrol.c | 20 ++++++----- > > mm/slub.c | 26 ++++++++------ > > mm/swap.c | 26 +++++++------- > > 4 files changed, 127 insertions(+), 33 deletions(-) > > create mode 100644 include/linux/qpw.h > > > > > > base-commit: 50736169ecc8387247fe6a00932852ce7b057083 >
On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote: > Hi, > > you've included tglx, which is great, but there's also LOCKING PRIMITIVES > section in MAINTAINERS so I've added folks from there in my reply. > Link to full series: > https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/ Thanks Vlastimil! > > On 6/22/24 5:58 AM, Leonardo Bras wrote: > > The problem: > > Some places in the kernel implement a parallel programming strategy > > consisting on local_locks() for most of the work, and some rare remote > > operations are scheduled on target cpu. This keeps cache bouncing low since > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT > > kernels, even though the very few remote operations will be expensive due > > to scheduling overhead. > > > > On the other hand, for RT workloads this can represent a problem: getting > > an important workload scheduled out to deal with remote requests is > > sure to introduce unexpected deadline misses. > > > > The idea: > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks. > > In this case, instead of scheduling work on a remote cpu, it should > > be safe to grab that remote cpu's per-cpu spinlock and run the required > > work locally. Tha major cost, which is un/locking in every local function, > > already happens in PREEMPT_RT. > > I've also noticed this a while ago (likely in the context of rewriting SLUB > to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of > the idea. But I forgot the details about why, so I'll let the the locking > experts reply... > > > Also, there is no need to worry about extra cache bouncing: > > The cacheline invalidation already happens due to schedule_work_on(). > > > > This will avoid schedule_work_on(), and thus avoid scheduling-out an > > RT workload. > > > > For patches 2, 3 & 4, I noticed just grabing the lock and executing > > the function locally is much faster than just scheduling it on a > > remote cpu. > > > > Proposed solution: > > A new interface called Queue PerCPU Work (QPW), which should replace > > Work Queue in the above mentioned use case. > > > > If PREEMPT_RT=n, this interfaces just wraps the current > > local_locks + WorkQueue behavior, so no expected change in runtime. > > > > If PREEMPT_RT=y, queue_percpu_work_on(cpu,...) will lock that cpu's > > per-cpu structure and perform work on it locally. This is possible > > because on functions that can be used for performing remote work on > > remote per-cpu structures, the local_lock (which is already > > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which > > is able to get the per_cpu spinlock() for the cpu passed as parameter. > > > > Patch 1 implements QPW interface, and patches 2, 3 & 4 replaces the > > current local_lock + WorkQueue interface by the QPW interface in > > swap, memcontrol & slub interface. > > > > Please let me know what you think on that, and please suggest > > improvements. > > > > Thanks a lot! 
> > Leo > > > > Leonardo Bras (4): > > Introducing qpw_lock() and per-cpu queue & flush work > > swap: apply new queue_percpu_work_on() interface > > memcontrol: apply new queue_percpu_work_on() interface > > slub: apply new queue_percpu_work_on() interface > > > > include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++ > > mm/memcontrol.c | 20 ++++++----- > > mm/slub.c | 26 ++++++++------ > > mm/swap.c | 26 +++++++------- > > 4 files changed, 127 insertions(+), 33 deletions(-) > > create mode 100644 include/linux/qpw.h > > > > > > base-commit: 50736169ecc8387247fe6a00932852ce7b057083 >
On Mon, Jun 24, 2024 at 03:54:14PM -0700, Boqun Feng wrote:
> On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote:
> > Hi,
> >
> > you've included tglx, which is great, but there's also a LOCKING PRIMITIVES
> > section in MAINTAINERS so I've added folks from there in my reply.
>
> Thanks!
>
> > Link to full series:
> > https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/
>
> And apologies to Leonardo... I think this is a follow-up of:
>
> https://lpc.events/event/17/contributions/1484/
>
> and I do remember we had a quick chat after that, in which I suggested it's
> better to change to a different name; sorry that I never found time to
> write a proper reply to your previous series [1] as promised.
>
> [1]: https://lore.kernel.org/lkml/20230729083737.38699-2-leobras@redhat.com/

That's correct, I commented about this at the end of the above presentation.
Don't worry, and thanks for suggesting the per-cpu naming, it was very
helpful in designing this solution.

> > On 6/22/24 5:58 AM, Leonardo Bras wrote:
> > > The problem:
> > > Some places in the kernel implement a parallel programming strategy
> > > consisting of local_locks() for most of the work, while some rare remote
> > > operations are scheduled on the target cpu. This keeps cache bouncing low,
> > > since the cacheline tends to be mostly local, and avoids the cost of locks
> > > in non-RT kernels, even though the very few remote operations will be
> > > expensive due to scheduling overhead.
> > >
> > > On the other hand, for RT workloads this can represent a problem: getting
> > > an important workload scheduled out to deal with remote requests is
> > > sure to introduce unexpected deadline misses.
> > >
> > > The idea:
> > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > > In this case, instead of scheduling work on a remote cpu, it should
> > > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > > work locally. The major cost, which is un/locking in every local function,
> > > already happens in PREEMPT_RT.
> >
> > I've also noticed this a while ago (likely in the context of rewriting SLUB
> > to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of
> > the idea. But I forgot the details about why, so I'll let the locking
> > experts reply...
>
> I think it's a good idea, especially since the new name is less confusing ;-)
> So I wonder about Thomas' thoughts as well.

Thanks!

> And I think a few (micro-)benchmark numbers will help.

Last year I got some numbers on how replacing local_locks with
spinlocks would impact memcontrol.c cache operations:

https://lore.kernel.org/all/20230125073502.743446-1-leobras@redhat.com/

tl;dr: It increased clocks spent in the most common this_cpu operations,
while reducing clocks spent in remote operations (drain_all_stock).

In the RT case, since local locks are already spinlocks, this cost is
already paid, so we can get results like these:

drain_all_stock
cpus  Upstream      Patched       Diff (cycles)   Diff(%)
1     44331.10831   38978.03581   -5353.072507    -12.07520567
8     43992.96512   39026.76654   -4966.198572    -11.2886198
128   156274.6634   58053.87421   -98220.78915    -62.85138425

Upstream: clocks to schedule work on the remote CPU (performing the work not accounted)
Patched:  clocks to grab the remote cpu's spinlock and perform the needed work locally

Do you have other suggestions to use as (micro-)benchmarking?

Thanks!
Leo > > Regards, > Boqun > > > > Also, there is no need to worry about extra cache bouncing: > > > The cacheline invalidation already happens due to schedule_work_on(). > > > > > > This will avoid schedule_work_on(), and thus avoid scheduling-out an > > > RT workload. > > > > > > For patches 2, 3 & 4, I noticed just grabing the lock and executing > > > the function locally is much faster than just scheduling it on a > > > remote cpu. > > > > > > Proposed solution: > > > A new interface called Queue PerCPU Work (QPW), which should replace > > > Work Queue in the above mentioned use case. > > > > > > If PREEMPT_RT=n, this interfaces just wraps the current > > > local_locks + WorkQueue behavior, so no expected change in runtime. > > > > > > If PREEMPT_RT=y, queue_percpu_work_on(cpu,...) will lock that cpu's > > > per-cpu structure and perform work on it locally. This is possible > > > because on functions that can be used for performing remote work on > > > remote per-cpu structures, the local_lock (which is already > > > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which > > > is able to get the per_cpu spinlock() for the cpu passed as parameter. > > > > > > Patch 1 implements QPW interface, and patches 2, 3 & 4 replaces the > > > current local_lock + WorkQueue interface by the QPW interface in > > > swap, memcontrol & slub interface. > > > > > > Please let me know what you think on that, and please suggest > > > improvements. > > > > > > Thanks a lot! > > > Leo > > > > > > Leonardo Bras (4): > > > Introducing qpw_lock() and per-cpu queue & flush work > > > swap: apply new queue_percpu_work_on() interface > > > memcontrol: apply new queue_percpu_work_on() interface > > > slub: apply new queue_percpu_work_on() interface > > > > > > include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++ > > > mm/memcontrol.c | 20 ++++++----- > > > mm/slub.c | 26 ++++++++------ > > > mm/swap.c | 26 +++++++------- > > > 4 files changed, 127 insertions(+), 33 deletions(-) > > > create mode 100644 include/linux/qpw.h > > > > > > > > > base-commit: 50736169ecc8387247fe6a00932852ce7b057083 > > >
On Mon, Jun 24, 2024 at 11:57:57PM -0300, Leonardo Bras wrote: > On Mon, Jun 24, 2024 at 03:54:14PM -0700, Boqun Feng wrote: > > On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote: > > > Hi, > > > > > > you've included tglx, which is great, but there's also LOCKING PRIMITIVES > > > section in MAINTAINERS so I've added folks from there in my reply. > > > > Thanks! > > > > > Link to full series: > > > https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/ > > > > > > > And apologies to Leonardo... I think this is a follow-up of: > > > > https://lpc.events/event/17/contributions/1484/ > > > > and I did remember we had a quick chat after that which I suggested it's > > better to change to a different name, sorry that I never found time to > > write a proper rely to your previous seriese [1] as promised. > > > > [1]: https://lore.kernel.org/lkml/20230729083737.38699-2-leobras@redhat.com/ > > That's correct, I commented about this in the end of above presentation. > Don't worry, and thanks for suggesting the per-cpu naming, it was very > helpful on designing this solution. > > > > > > On 6/22/24 5:58 AM, Leonardo Bras wrote: > > > > The problem: > > > > Some places in the kernel implement a parallel programming strategy > > > > consisting on local_locks() for most of the work, and some rare remote > > > > operations are scheduled on target cpu. This keeps cache bouncing low since > > > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT > > > > kernels, even though the very few remote operations will be expensive due > > > > to scheduling overhead. > > > > > > > > On the other hand, for RT workloads this can represent a problem: getting > > > > an important workload scheduled out to deal with remote requests is > > > > sure to introduce unexpected deadline misses. > > > > > > > > The idea: > > > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks. > > > > In this case, instead of scheduling work on a remote cpu, it should > > > > be safe to grab that remote cpu's per-cpu spinlock and run the required > > > > work locally. Tha major cost, which is un/locking in every local function, > > > > already happens in PREEMPT_RT. > > > > > > I've also noticed this a while ago (likely in the context of rewriting SLUB > > > to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of > > > the idea. But I forgot the details about why, so I'll let the the locking > > > experts reply... > > > > > > > I think it's a good idea, especially the new name is less confusing ;-) > > So I wonder Thomas' thoughts as well. > > Thanks! > > > > > And I think a few (micro-)benchmark numbers will help. > > Last year I got some numbers on how replacing local_locks with > spinlocks would impact memcontrol.c cache operations: > > https://lore.kernel.org/all/20230125073502.743446-1-leobras@redhat.com/ > > tl;dr: It increased clocks spent in the most common this_cpu operations, > while reducing clocks spent in remote operations (drain_all_stock). 
> > In RT case, since local locks are already spinlocks, this cost is > already paid, so we can get results like these: > > drain_all_stock > cpus Upstream Patched Diff (cycles) Diff(%) > 1 44331.10831 38978.03581 -5353.072507 -12.07520567 > 8 43992.96512 39026.76654 -4966.198572 -11.2886198 > 128 156274.6634 58053.87421 -98220.78915 -62.85138425 > > Upstream: Clocks to schedule work on remote CPU (performing not accounted) > Patched: Clocks to grab remote cpu's spinlock and perform the needed work > locally. This looks good as a micro-benchmark. And it answers why we need patch #3 in this series. It'll be better if we have something similar for patch #2 and #4. Besides, micro-benchmarks are usually a bit artifical IMO, it's better if we have the data to prove that your changes improve the performance from a more global view. For example, could you find or create a use case where flush_slab() becomes somewhat a hot path? And we can then know the performance gain from your changes in that use case. Maybe Vlastimil has something in his mind already? ;-) Also keep in mind that your changes apply to RT, so a natural follow-up question would be: will it hurt the system latency? I know litte about this area, so I must defer this to experts. The above concern brings another opportunity: would it make sense to use real locks instead of queuing work on a remote CPU in the case when RT is not needed, but CPU isolation is important? I.e. nohz_full situations? > > Do you have other suggestions to use as (micro-) benchmarking? > My overall suggestion is that you do find a valuable pattern where queuing remote work may not be the best option, but usually a real world usage would make more sense for the extra complexity that we will pay. Does this make sense? Regards, Boqun > Thanks! > Leo > > > > > > Regards, > > Boqun > > > > > > Also, there is no need to worry about extra cache bouncing: > > > > The cacheline invalidation already happens due to schedule_work_on(). > > > > > > > > This will avoid schedule_work_on(), and thus avoid scheduling-out an > > > > RT workload. > > > > > > > > For patches 2, 3 & 4, I noticed just grabing the lock and executing > > > > the function locally is much faster than just scheduling it on a > > > > remote cpu. > > > > > > > > Proposed solution: > > > > A new interface called Queue PerCPU Work (QPW), which should replace > > > > Work Queue in the above mentioned use case. > > > > > > > > If PREEMPT_RT=n, this interfaces just wraps the current > > > > local_locks + WorkQueue behavior, so no expected change in runtime. > > > > > > > > If PREEMPT_RT=y, queue_percpu_work_on(cpu,...) will lock that cpu's > > > > per-cpu structure and perform work on it locally. This is possible > > > > because on functions that can be used for performing remote work on > > > > remote per-cpu structures, the local_lock (which is already > > > > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which > > > > is able to get the per_cpu spinlock() for the cpu passed as parameter. > > > > > > > > Patch 1 implements QPW interface, and patches 2, 3 & 4 replaces the > > > > current local_lock + WorkQueue interface by the QPW interface in > > > > swap, memcontrol & slub interface. > > > > > > > > Please let me know what you think on that, and please suggest > > > > improvements. > > > > > > > > Thanks a lot! 
> > > > Leo > > > > > > > > Leonardo Bras (4): > > > > Introducing qpw_lock() and per-cpu queue & flush work > > > > swap: apply new queue_percpu_work_on() interface > > > > memcontrol: apply new queue_percpu_work_on() interface > > > > slub: apply new queue_percpu_work_on() interface > > > > > > > > include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++ > > > > mm/memcontrol.c | 20 ++++++----- > > > > mm/slub.c | 26 ++++++++------ > > > > mm/swap.c | 26 +++++++------- > > > > 4 files changed, 127 insertions(+), 33 deletions(-) > > > > create mode 100644 include/linux/qpw.h > > > > > > > > > > > > base-commit: 50736169ecc8387247fe6a00932852ce7b057083 > > > > > >
On Tue, Jun 25, 2024 at 10:51:13AM -0700, Boqun Feng wrote: > On Mon, Jun 24, 2024 at 11:57:57PM -0300, Leonardo Bras wrote: > > On Mon, Jun 24, 2024 at 03:54:14PM -0700, Boqun Feng wrote: > > > On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote: > > > > Hi, > > > > > > > > you've included tglx, which is great, but there's also LOCKING PRIMITIVES > > > > section in MAINTAINERS so I've added folks from there in my reply. > > > > > > Thanks! > > > > > > > Link to full series: > > > > https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/ > > > > > > > > > > And apologies to Leonardo... I think this is a follow-up of: > > > > > > https://lpc.events/event/17/contributions/1484/ > > > > > > and I did remember we had a quick chat after that which I suggested it's > > > better to change to a different name, sorry that I never found time to > > > write a proper rely to your previous seriese [1] as promised. > > > > > > [1]: https://lore.kernel.org/lkml/20230729083737.38699-2-leobras@redhat.com/ > > > > That's correct, I commented about this in the end of above presentation. > > Don't worry, and thanks for suggesting the per-cpu naming, it was very > > helpful on designing this solution. > > > > > > > > > On 6/22/24 5:58 AM, Leonardo Bras wrote: > > > > > The problem: > > > > > Some places in the kernel implement a parallel programming strategy > > > > > consisting on local_locks() for most of the work, and some rare remote > > > > > operations are scheduled on target cpu. This keeps cache bouncing low since > > > > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT > > > > > kernels, even though the very few remote operations will be expensive due > > > > > to scheduling overhead. > > > > > > > > > > On the other hand, for RT workloads this can represent a problem: getting > > > > > an important workload scheduled out to deal with remote requests is > > > > > sure to introduce unexpected deadline misses. > > > > > > > > > > The idea: > > > > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks. > > > > > In this case, instead of scheduling work on a remote cpu, it should > > > > > be safe to grab that remote cpu's per-cpu spinlock and run the required > > > > > work locally. Tha major cost, which is un/locking in every local function, > > > > > already happens in PREEMPT_RT. > > > > > > > > I've also noticed this a while ago (likely in the context of rewriting SLUB > > > > to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of > > > > the idea. But I forgot the details about why, so I'll let the the locking > > > > experts reply... > > > > > > > > > > I think it's a good idea, especially the new name is less confusing ;-) > > > So I wonder Thomas' thoughts as well. > > > > Thanks! > > > > > > > > And I think a few (micro-)benchmark numbers will help. > > > > Last year I got some numbers on how replacing local_locks with > > spinlocks would impact memcontrol.c cache operations: > > > > https://lore.kernel.org/all/20230125073502.743446-1-leobras@redhat.com/ > > > > tl;dr: It increased clocks spent in the most common this_cpu operations, > > while reducing clocks spent in remote operations (drain_all_stock). 
> > > > In RT case, since local locks are already spinlocks, this cost is > > already paid, so we can get results like these: > > > > drain_all_stock > > cpus Upstream Patched Diff (cycles) Diff(%) > > 1 44331.10831 38978.03581 -5353.072507 -12.07520567 > > 8 43992.96512 39026.76654 -4966.198572 -11.2886198 > > 128 156274.6634 58053.87421 -98220.78915 -62.85138425 > > > > Upstream: Clocks to schedule work on remote CPU (performing not accounted) > > Patched: Clocks to grab remote cpu's spinlock and perform the needed work > > locally. > > This looks good as a micro-benchmark. And it answers why we need patch > #3 in this series. It'll be better if we have something similar for > patch #2 and #4. I suppose that given the parallel programming scheme is the same, the results tend to be similar, but sure, I can provide such tests. > > Besides, micro-benchmarks are usually a bit artifical IMO, it's better > if we have the data to prove that your changes improve the performance > from a more global view. For example, could you find or create a use > case where flush_slab() becomes somewhat a hot path? And we can then > know the performance gain from your changes in that use case. Maybe > Vlastimil has something in his mind already? ;-) > > Also keep in mind that your changes apply to RT, so a natural follow-up > question would be: will it hurt the system latency? I know litte about > this area, so I must defer this to experts. While we notice some performance improvements, the whole deal of this patchset is not to gain performance, but to reduce latency: When we call schedule_work_on() or queue_work_on(), we end up having a processor being interrupted (IPI) to deal with the required work. If this processor is running a RT task, it introduces latency. So by removing some of those IPIs we have a noticeable reduction in max latency, in tests such as cyclictest and oslat. Maybe it's a good idea to include those in this cover letter. > > The above concern brings another opportunity: would it make sense to use > real locks instead of queuing work on a remote CPU in the case when RT > is not needed, but CPU isolation is important? I.e. nohz_full > situations? By having this qpw interface, that is easily achievable: We can add a kernel parameter that makes qpw_*locks use spinlocks if isolation is enabled. Even though this could be an static branch, this would cost some overhead in non-isolated + non-RT though. But in any case, I am open on implementing this if there is an use-case. > > > > > Do you have other suggestions to use as (micro-) benchmarking? > > > > My overall suggestion is that you do find a valuable pattern where > queuing remote work may not be the best option, but usually a real world > usage would make more sense for the extra complexity that we will pay. > > Does this make sense? Yes, it does. There are scenarios which will cause a lot of queue_work_on, and this patchset would increase performance in RT. I think Marcelo showed me some example a while ago in mm/. But my goal would be just to show that this change does not increase overhead, actually can have some improvements in RT, and achieves latency reduction which is the desired feature. Thanks! Leo > > Regards, > Boqun > > > Thanks! > > Leo > > > > > > > > > > Regards, > > > Boqun > > > > > > > > Also, there is no need to worry about extra cache bouncing: > > > > > The cacheline invalidation already happens due to schedule_work_on(). 
> > > > > > > > > > This will avoid schedule_work_on(), and thus avoid scheduling-out an > > > > > RT workload. > > > > > > > > > > For patches 2, 3 & 4, I noticed just grabing the lock and executing > > > > > the function locally is much faster than just scheduling it on a > > > > > remote cpu. > > > > > > > > > > Proposed solution: > > > > > A new interface called Queue PerCPU Work (QPW), which should replace > > > > > Work Queue in the above mentioned use case. > > > > > > > > > > If PREEMPT_RT=n, this interfaces just wraps the current > > > > > local_locks + WorkQueue behavior, so no expected change in runtime. > > > > > > > > > > If PREEMPT_RT=y, queue_percpu_work_on(cpu,...) will lock that cpu's > > > > > per-cpu structure and perform work on it locally. This is possible > > > > > because on functions that can be used for performing remote work on > > > > > remote per-cpu structures, the local_lock (which is already > > > > > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which > > > > > is able to get the per_cpu spinlock() for the cpu passed as parameter. > > > > > > > > > > Patch 1 implements QPW interface, and patches 2, 3 & 4 replaces the > > > > > current local_lock + WorkQueue interface by the QPW interface in > > > > > swap, memcontrol & slub interface. > > > > > > > > > > Please let me know what you think on that, and please suggest > > > > > improvements. > > > > > > > > > > Thanks a lot! > > > > > Leo > > > > > > > > > > Leonardo Bras (4): > > > > > Introducing qpw_lock() and per-cpu queue & flush work > > > > > swap: apply new queue_percpu_work_on() interface > > > > > memcontrol: apply new queue_percpu_work_on() interface > > > > > slub: apply new queue_percpu_work_on() interface > > > > > > > > > > include/linux/qpw.h | 88 +++++++++++++++++++++++++++++++++++++++++++++ > > > > > mm/memcontrol.c | 20 ++++++----- > > > > > mm/slub.c | 26 ++++++++------ > > > > > mm/swap.c | 26 +++++++------- > > > > > 4 files changed, 127 insertions(+), 33 deletions(-) > > > > > create mode 100644 include/linux/qpw.h > > > > > > > > > > > > > > > base-commit: 50736169ecc8387247fe6a00932852ce7b057083 > > > > > > > > > >
On Mon, Jun 24, 2024 at 11:57:57PM -0300, Leonardo Bras wrote: > On Mon, Jun 24, 2024 at 03:54:14PM -0700, Boqun Feng wrote: > > On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote: > > > Hi, > > > > > > you've included tglx, which is great, but there's also LOCKING PRIMITIVES > > > section in MAINTAINERS so I've added folks from there in my reply. > > > > Thanks! > > > > > Link to full series: > > > https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/ > > > > > > > And apologies to Leonardo... I think this is a follow-up of: > > > > https://lpc.events/event/17/contributions/1484/ > > > > and I did remember we had a quick chat after that which I suggested it's > > better to change to a different name, sorry that I never found time to > > write a proper rely to your previous seriese [1] as promised. > > > > [1]: https://lore.kernel.org/lkml/20230729083737.38699-2-leobras@redhat.com/ > > That's correct, I commented about this in the end of above presentation. > Don't worry, and thanks for suggesting the per-cpu naming, it was very > helpful on designing this solution. > > > > > > On 6/22/24 5:58 AM, Leonardo Bras wrote: > > > > The problem: > > > > Some places in the kernel implement a parallel programming strategy > > > > consisting on local_locks() for most of the work, and some rare remote > > > > operations are scheduled on target cpu. This keeps cache bouncing low since > > > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT > > > > kernels, even though the very few remote operations will be expensive due > > > > to scheduling overhead. > > > > > > > > On the other hand, for RT workloads this can represent a problem: getting > > > > an important workload scheduled out to deal with remote requests is > > > > sure to introduce unexpected deadline misses. > > > > > > > > The idea: > > > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks. > > > > In this case, instead of scheduling work on a remote cpu, it should > > > > be safe to grab that remote cpu's per-cpu spinlock and run the required > > > > work locally. Tha major cost, which is un/locking in every local function, > > > > already happens in PREEMPT_RT. > > > > > > I've also noticed this a while ago (likely in the context of rewriting SLUB > > > to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of > > > the idea. But I forgot the details about why, so I'll let the the locking > > > experts reply... > > > > > > > I think it's a good idea, especially the new name is less confusing ;-) > > So I wonder Thomas' thoughts as well. > > Thanks! > > > > > And I think a few (micro-)benchmark numbers will help. > > Last year I got some numbers on how replacing local_locks with > spinlocks would impact memcontrol.c cache operations: > > https://lore.kernel.org/all/20230125073502.743446-1-leobras@redhat.com/ > > tl;dr: It increased clocks spent in the most common this_cpu operations, > while reducing clocks spent in remote operations (drain_all_stock). 
> > In RT case, since local locks are already spinlocks, this cost is > already paid, so we can get results like these: > > drain_all_stock > cpus Upstream Patched Diff (cycles) Diff(%) > 1 44331.10831 38978.03581 -5353.072507 -12.07520567 > 8 43992.96512 39026.76654 -4966.198572 -11.2886198 > 128 156274.6634 58053.87421 -98220.78915 -62.85138425 > > Upstream: Clocks to schedule work on remote CPU (performing not accounted) > Patched: Clocks to grab remote cpu's spinlock and perform the needed work > locally. > > Do you have other suggestions to use as (micro-) benchmarking? > > Thanks! > Leo One improvement which was noted when mm/page_alloc.c was converted to spinlock + remote drain was that, it can bypass waiting for kwork to be scheduled (on heavily loaded CPUs). commit 443c2accd1b6679a1320167f8f56eed6536b806e Author: Nicolas Saenz Julienne <nsaenzju@redhat.com> Date: Fri Jun 24 13:54:22 2022 +0100 mm/page_alloc: remotely drain per-cpu lists Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu drain work queued by __drain_all_pages(). So introduce a new mechanism to remotely drain the per-cpu lists. It is made possible by remotely locking 'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this new scheme is that drain operations are now migration safe. There was no observed performance degradation vs. the previous scheme. Both netperf and hackbench were run in parallel to triggering the __drain_all_pages(NULL, true) code path around ~100 times per second. The new scheme performs a bit better (~5%), although the important point here is there are no performance regressions vs. the previous mechanism. Per-cpu lists draining happens only in slow paths. Minchan Kim tested an earlier version and reported; My workload is not NOHZ CPUs but run apps under heavy memory pressure so they goes to direct reclaim and be stuck on drain_all_pages until work on workqueue run. unit: nanosecond max(dur) avg(dur) count(dur) 166713013 487511.77786438033 1283 From traces, system encountered the drain_all_pages 1283 times and worst case was 166ms and avg was 487us. The other problem was alloc_contig_range in CMA. The PCP draining takes several hundred millisecond sometimes though there is no memory pressure or a few of pages to be migrated out but CPU were fully booked. Your patch perfectly removed those wasted time.
On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote: > Hi, > > you've included tglx, which is great, but there's also LOCKING PRIMITIVES > section in MAINTAINERS so I've added folks from there in my reply. > Link to full series: > https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/ > > On 6/22/24 5:58 AM, Leonardo Bras wrote: > > The problem: > > Some places in the kernel implement a parallel programming strategy > > consisting on local_locks() for most of the work, and some rare remote > > operations are scheduled on target cpu. This keeps cache bouncing low since > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT > > kernels, even though the very few remote operations will be expensive due > > to scheduling overhead. > > > > On the other hand, for RT workloads this can represent a problem: getting > > an important workload scheduled out to deal with remote requests is > > sure to introduce unexpected deadline misses. > > > > The idea: > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks. > > In this case, instead of scheduling work on a remote cpu, it should > > be safe to grab that remote cpu's per-cpu spinlock and run the required > > work locally. Tha major cost, which is un/locking in every local function, > > already happens in PREEMPT_RT. > > I've also noticed this a while ago (likely in the context of rewriting SLUB > to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of > the idea. But I forgot the details about why, so I'll let the the locking > experts reply... Thomas?
On Sat, Jun 22, 2024 at 12:58:08AM -0300, Leonardo Bras wrote:
> The problem:
> Some places in the kernel implement a parallel programming strategy
> consisting of local_locks() for most of the work, while some rare remote
> operations are scheduled on the target cpu. This keeps cache bouncing low,
> since the cacheline tends to be mostly local, and avoids the cost of locks
> in non-RT kernels, even though the very few remote operations will be
> expensive due to scheduling overhead.
>
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with remote requests is
> sure to introduce unexpected deadline misses.

Another hang with a busy polling workload (kernel update hangs on
grub2-probe):

[342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds.
[342431.665458] Tainted: G W X ------- --- 5.14.0-438.el9s.x86_64+rt #1
[342431.665488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[342431.665515] task:grub2-probe state:D stack:0 pid:24484 ppid:24455 flags:0x00004002
[342431.665523] Call Trace:
[342431.665525]  <TASK>
[342431.665527]  __schedule+0x22a/0x580
[342431.665537]  schedule+0x30/0x80
[342431.665539]  schedule_timeout+0x153/0x190
[342431.665543]  ? preempt_schedule_thunk+0x16/0x30
[342431.665548]  ? preempt_count_add+0x70/0xa0
[342431.665554]  __wait_for_common+0x8b/0x1c0
[342431.665557]  ? __pfx_schedule_timeout+0x10/0x10
[342431.665560]  __flush_work.isra.0+0x15b/0x220
[342431.665565]  ? __pfx_wq_barrier_func+0x10/0x10
[342431.665570]  __lru_add_drain_all+0x17d/0x220
[342431.665576]  invalidate_bdev+0x28/0x40
[342431.665583]  blkdev_common_ioctl+0x714/0xa30
[342431.665588]  ? bucket_table_alloc.isra.0+0x1/0x150
[342431.665593]  ? cp_new_stat+0xbb/0x180
[342431.665599]  blkdev_ioctl+0x112/0x270
[342431.665603]  ? security_file_ioctl+0x2f/0x50
[342431.665609]  __x64_sys_ioctl+0x87/0xc0
[342431.665614]  do_syscall_64+0x5c/0xf0
[342431.665619]  ? __ct_user_enter+0x89/0x130
[342431.665623]  ? syscall_exit_to_user_mode+0x22/0x40
[342431.665625]  ? do_syscall_64+0x6b/0xf0
[342431.665627]  ? __ct_user_enter+0x89/0x130
[342431.665629]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[342431.665635] RIP: 0033:0x7f39856c757b
[342431.665666] RSP: 002b:00007ffd9541c488 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[342431.665670] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f39856c757b
[342431.665673] RDX: 0000000000000000 RSI: 0000000000001261 RDI: 0000000000000005
[342431.665674] RBP: 00007ffd9541c540 R08: 0000000000000003 R09: 006164732f766564
[342431.665676] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd9543ca68
[342431.665678] R13: 000055ea758a0708 R14: 000055ea759de338 R15: 00007f398586f000
On Tue, 23 Jul 2024 14:14:34 -0300 Marcelo Tosatti <mtosatti@redhat.com> > On Sat, Jun 22, 2024 at 12:58:08AM -0300, Leonardo Bras wrote: > > The problem: > > Some places in the kernel implement a parallel programming strategy > > consisting on local_locks() for most of the work, and some rare remote > > operations are scheduled on target cpu. This keeps cache bouncing low since > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT > > kernels, even though the very few remote operations will be expensive due > > to scheduling overhead. > > > > On the other hand, for RT workloads this can represent a problem: getting > > an important workload scheduled out to deal with remote requests is > > sure to introduce unexpected deadline misses. > > Another hang with a busy polling workload (kernel update hangs on > grub2-probe): > > [342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds. > [342431.665458] Tainted: G W X ------- --- 5.14.0-438.el9s.x86_64+rt #1 > [342431.665488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [342431.665515] task:grub2-probe state:D stack:0 pid:24484 ppid:24455 flags:0x00004002 > [342431.665523] Call Trace: > [342431.665525] <TASK> > [342431.665527] __schedule+0x22a/0x580 > [342431.665537] schedule+0x30/0x80 > [342431.665539] schedule_timeout+0x153/0x190 > [342431.665543] ? preempt_schedule_thunk+0x16/0x30 > [342431.665548] ? preempt_count_add+0x70/0xa0 > [342431.665554] __wait_for_common+0x8b/0x1c0 > [342431.665557] ? __pfx_schedule_timeout+0x10/0x10 > [342431.665560] __flush_work.isra.0+0x15b/0x220 The fresh new flush_percpu_work() is nop with CONFIG_PREEMPT_RT enabled, why are you testing it with 5.14.0-438.el9s.x86_64+rt instead of mainline? Or what are you testing? BTW the hang fails to show the unexpected deadline misses. > [342431.665565] ? __pfx_wq_barrier_func+0x10/0x10 > [342431.665570] __lru_add_drain_all+0x17d/0x220 > [342431.665576] invalidate_bdev+0x28/0x40 > [342431.665583] blkdev_common_ioctl+0x714/0xa30 > [342431.665588] ? bucket_table_alloc.isra.0+0x1/0x150 > [342431.665593] ? cp_new_stat+0xbb/0x180 > [342431.665599] blkdev_ioctl+0x112/0x270 > [342431.665603] ? security_file_ioctl+0x2f/0x50 > [342431.665609] __x64_sys_ioctl+0x87/0xc0
On Fri, Sep 06, 2024 at 06:19:08AM +0800, Hillf Danton wrote: > On Tue, 23 Jul 2024 14:14:34 -0300 Marcelo Tosatti <mtosatti@redhat.com> > > On Sat, Jun 22, 2024 at 12:58:08AM -0300, Leonardo Bras wrote: > > > The problem: > > > Some places in the kernel implement a parallel programming strategy > > > consisting on local_locks() for most of the work, and some rare remote > > > operations are scheduled on target cpu. This keeps cache bouncing low since > > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT > > > kernels, even though the very few remote operations will be expensive due > > > to scheduling overhead. > > > > > > On the other hand, for RT workloads this can represent a problem: getting > > > an important workload scheduled out to deal with remote requests is > > > sure to introduce unexpected deadline misses. > > > > Another hang with a busy polling workload (kernel update hangs on > > grub2-probe): > > > > [342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds. > > [342431.665458] Tainted: G W X ------- --- 5.14.0-438.el9s.x86_64+rt #1 > > [342431.665488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > [342431.665515] task:grub2-probe state:D stack:0 pid:24484 ppid:24455 flags:0x00004002 > > [342431.665523] Call Trace: > > [342431.665525] <TASK> > > [342431.665527] __schedule+0x22a/0x580 > > [342431.665537] schedule+0x30/0x80 > > [342431.665539] schedule_timeout+0x153/0x190 > > [342431.665543] ? preempt_schedule_thunk+0x16/0x30 > > [342431.665548] ? preempt_count_add+0x70/0xa0 > > [342431.665554] __wait_for_common+0x8b/0x1c0 > > [342431.665557] ? __pfx_schedule_timeout+0x10/0x10 > > [342431.665560] __flush_work.isra.0+0x15b/0x220 > > The fresh new flush_percpu_work() is nop with CONFIG_PREEMPT_RT enabled, why > are you testing it with 5.14.0-438.el9s.x86_64+rt instead of mainline? Or what > are you testing? I am demonstrating a type of bug that can happen without Leo's patch. > BTW the hang fails to show the unexpected deadline misses. Yes, because in this case the realtime app with FIFO priority never stops running, therefore grub2-probe hangs and is unable to execute: > > [342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds > > > [342431.665565] ? __pfx_wq_barrier_func+0x10/0x10 > > [342431.665570] __lru_add_drain_all+0x17d/0x220 > > [342431.665576] invalidate_bdev+0x28/0x40 > > [342431.665583] blkdev_common_ioctl+0x714/0xa30 > > [342431.665588] ? bucket_table_alloc.isra.0+0x1/0x150 > > [342431.665593] ? cp_new_stat+0xbb/0x180 > > [342431.665599] blkdev_ioctl+0x112/0x270 > > [342431.665603] ? security_file_ioctl+0x2f/0x50 > > [342431.665609] __x64_sys_ioctl+0x87/0xc0 Does that make sense now? Thanks!
On Fri, Sep 06, 2024 at 06:19:08AM +0800, Hillf Danton wrote: > On Tue, 23 Jul 2024 14:14:34 -0300 Marcelo Tosatti <mtosatti@redhat.com> > > On Sat, Jun 22, 2024 at 12:58:08AM -0300, Leonardo Bras wrote: > > > The problem: > > > Some places in the kernel implement a parallel programming strategy > > > consisting on local_locks() for most of the work, and some rare remote > > > operations are scheduled on target cpu. This keeps cache bouncing low since > > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT > > > kernels, even though the very few remote operations will be expensive due > > > to scheduling overhead. > > > > > > On the other hand, for RT workloads this can represent a problem: getting > > > an important workload scheduled out to deal with remote requests is > > > sure to introduce unexpected deadline misses. > > > > Another hang with a busy polling workload (kernel update hangs on > > grub2-probe): > > > > [342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds. > > [342431.665458] Tainted: G W X ------- --- 5.14.0-438.el9s.x86_64+rt #1 > > [342431.665488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > [342431.665515] task:grub2-probe state:D stack:0 pid:24484 ppid:24455 flags:0x00004002 > > [342431.665523] Call Trace: > > [342431.665525] <TASK> > > [342431.665527] __schedule+0x22a/0x580 > > [342431.665537] schedule+0x30/0x80 > > [342431.665539] schedule_timeout+0x153/0x190 > > [342431.665543] ? preempt_schedule_thunk+0x16/0x30 > > [342431.665548] ? preempt_count_add+0x70/0xa0 > > [342431.665554] __wait_for_common+0x8b/0x1c0 > > [342431.665557] ? __pfx_schedule_timeout+0x10/0x10 > > [342431.665560] __flush_work.isra.0+0x15b/0x220 > > The fresh new flush_percpu_work() is nop with CONFIG_PREEMPT_RT enabled, why > are you testing it with 5.14.0-438.el9s.x86_64+rt instead of mainline? Or what > are you testing? > > BTW the hang fails to show the unexpected deadline misses. I think he is showing a client case in which my patchset would be helpful, and avoid those stalls in RT=y. > > > [342431.665565] ? __pfx_wq_barrier_func+0x10/0x10 > > [342431.665570] __lru_add_drain_all+0x17d/0x220 > > [342431.665576] invalidate_bdev+0x28/0x40 > > [342431.665583] blkdev_common_ioctl+0x714/0xa30 > > [342431.665588] ? bucket_table_alloc.isra.0+0x1/0x150 > > [342431.665593] ? cp_new_stat+0xbb/0x180 > > [342431.665599] blkdev_ioctl+0x112/0x270 > > [342431.665603] ? security_file_ioctl+0x2f/0x50 > > [342431.665609] __x64_sys_ioctl+0x87/0xc0 >
On Wed, 11 Sep 2024 00:04:46 -0300 Marcelo Tosatti <mtosatti@redhat.com> > On Fri, Sep 06, 2024 at 06:19:08AM +0800, Hillf Danton wrote: > > On Tue, 23 Jul 2024 14:14:34 -0300 Marcelo Tosatti <mtosatti@redhat.com> > > > On Sat, Jun 22, 2024 at 12:58:08AM -0300, Leonardo Bras wrote: > > > > The problem: > > > > Some places in the kernel implement a parallel programming strategy > > > > consisting on local_locks() for most of the work, and some rare remote > > > > operations are scheduled on target cpu. This keeps cache bouncing low since > > > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT > > > > kernels, even though the very few remote operations will be expensive due > > > > to scheduling overhead. > > > > > > > > On the other hand, for RT workloads this can represent a problem: getting > > > > an important workload scheduled out to deal with remote requests is > > > > sure to introduce unexpected deadline misses. > > > > > > Another hang with a busy polling workload (kernel update hangs on > > > grub2-probe): > > > > > > [342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds. > > > [342431.665458] Tainted: G W X ------- --- 5.14.0-438.el9s.x86_64+rt #1 > > > [342431.665488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > [342431.665515] task:grub2-probe state:D stack:0 pid:24484 ppid:24455 flags:0x00004002 > > > [342431.665523] Call Trace: > > > [342431.665525] <TASK> > > > [342431.665527] __schedule+0x22a/0x580 > > > [342431.665537] schedule+0x30/0x80 > > > [342431.665539] schedule_timeout+0x153/0x190 > > > [342431.665543] ? preempt_schedule_thunk+0x16/0x30 > > > [342431.665548] ? preempt_count_add+0x70/0xa0 > > > [342431.665554] __wait_for_common+0x8b/0x1c0 > > > [342431.665557] ? __pfx_schedule_timeout+0x10/0x10 > > > [342431.665560] __flush_work.isra.0+0x15b/0x220 > > > > The fresh new flush_percpu_work() is nop with CONFIG_PREEMPT_RT enabled, why > > are you testing it with 5.14.0-438.el9s.x86_64+rt instead of mainline? Or what > > are you testing? > > I am demonstrating a type of bug that can happen without Leo's patch. > > > BTW the hang fails to show the unexpected deadline misses. > > Yes, because in this case the realtime app with FIFO priority never > stops running, therefore grub2-probe hangs and is unable to execute: > Thanks, I see why it is a type of bug that can happen without Leo's patch. Because linux kernel is never the pill to kill all pains in the field, I prefer to think instead it represents no real idea of 5.14-xxx-rt at product designing stage - what is kernel reaction to 600s cpu hog for instance?. More interesting, what would you comment if task hang is replaced with oom? Given locality cut by this patchset, lock contention follows up and opens the window for priority inversion, right? > > > [342431.665417] INFO: task grub2-probe:24484 blocked for more than 622 seconds > > > > > [342431.665565] ? __pfx_wq_barrier_func+0x10/0x10 > > > [342431.665570] __lru_add_drain_all+0x17d/0x220 > > > [342431.665576] invalidate_bdev+0x28/0x40 > > > [342431.665583] blkdev_common_ioctl+0x714/0xa30 > > > [342431.665588] ? bucket_table_alloc.isra.0+0x1/0x150 > > > [342431.665593] ? cp_new_stat+0xbb/0x180 > > > [342431.665599] blkdev_ioctl+0x112/0x270 > > > [342431.665603] ? security_file_ioctl+0x2f/0x50 > > > [342431.665609] __x64_sys_ioctl+0x87/0xc0