Message ID | 20250319071330.898763-1-gthelen@google.com (mailing list archive)
---|---
State | New
Series | cgroup/rstat: avoid disabling irqs for O(num_cpu)
(fix mistyped CC address to Eric) On Wed, Mar 19, 2025 at 12:13 AM Greg Thelen <gthelen@google.com> wrote: > > From: Eric Dumazet <edumazet@google.com> > > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while > iterating all possible cpus. It only drops the lock if there is > scheduler or spin lock contention. If neither, then interrupts can be > disabled for a long time. On large machines this can disable interrupts > for a long enough time to drop network packets. On 400+ CPU machines > I've seen interrupt disabled for over 40 msec. > > Prevent rstat from disabling interrupts while processing all possible > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This > approach was previously discussed in > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/, > though this was in the context of an non-irq rstat spin lock. > > Benchmark this change with: > 1) a single stat_reader process with 400 threads, each reading a test > memcg's memory.stat repeatedly for 10 seconds. > 2) 400 memory hog processes running in the test memcg and repeatedly > charging memory until oom killed. Then they repeat charging and oom > killing. > > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds > interrupts are disabled by rstat for 45341 usec: > # => started at: _raw_spin_lock_irq > # => ended at: cgroup_rstat_flush > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > stat_rea-96532 52d.... 0us*: _raw_spin_lock_irq > stat_rea-96532 52d.... 45342us : cgroup_rstat_flush > stat_rea-96532 52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush > stat_rea-96532 52d.... 45343us : <stack trace> > => memcg1_stat_format > => memory_stat_format > => memory_stat_show > => seq_read_iter > => vfs_read > => ksys_read > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the > longest holder. The longest irqs-off holder has irqs disabled for > 4142 usec, a huge reduction from previous 45341 usec rstat finding. > > Running stat_reader memory.stat reader for 10 seconds: > - without memory hogs: 9.84M accesses => 12.7M accesses > - with memory hogs: 9.46M accesses => 11.1M accesses > The throughput of memory.stat access improves. > > The mode of memory.stat access latency after grouping by of 2 buckets: > - without memory hogs: 64 usec => 16 usec > - with memory hogs: 64 usec => 8 usec > The memory.stat latency improves. 
> > Signed-off-by: Eric Dumazet <edumazet@google.com> > Signed-off-by: Greg Thelen <gthelen@google.com> > Tested-by: Greg Thelen <gthelen@google.com> > --- > kernel/cgroup/rstat.c | 12 +++++------- > 1 file changed, 5 insertions(+), 7 deletions(-) > > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c > index aac91466279f..976c24b3671a 100644 > --- a/kernel/cgroup/rstat.c > +++ b/kernel/cgroup/rstat.c > @@ -323,13 +323,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp) > rcu_read_unlock(); > } > > - /* play nice and yield if necessary */ > - if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) { > - __cgroup_rstat_unlock(cgrp, cpu); > - if (!cond_resched()) > - cpu_relax(); > - __cgroup_rstat_lock(cgrp, cpu); > - } > + /* play nice and avoid disabling interrupts for a long time */ > + __cgroup_rstat_unlock(cgrp, cpu); > + if (!cond_resched()) > + cpu_relax(); > + __cgroup_rstat_lock(cgrp, cpu); > } > } > > -- > 2.49.0.rc1.451.g8f38331e32-goog >
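[Editor's note: the stat_reader benchmark used for the numbers above is not included in the posting. A minimal userspace sketch of such a reader is shown below; the memcg path, thread count, and buffer size are assumptions, not taken from the patch.]

/* Hypothetical memory.stat reader: NR_THREADS threads re-open and read
 * the file in a loop for DURATION_SEC seconds, counting total reads.
 * Build with: gcc -O2 -pthread stat_reader.c -o stat_reader
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NR_THREADS	400
#define DURATION_SEC	10

static const char *stat_path = "/sys/fs/cgroup/test/memory.stat";
static _Atomic long total_reads;

static void *reader(void *arg)
{
	char buf[8192];
	time_t end = time(NULL) + DURATION_SEC;

	while (time(NULL) < end) {
		int fd = open(stat_path, O_RDONLY);

		if (fd < 0)
			break;
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		close(fd);
		total_reads++;
	}
	return NULL;
}

int main(void)
{
	pthread_t tids[NR_THREADS];

	for (int i = 0; i < NR_THREADS; i++)
		pthread_create(&tids[i], NULL, reader, NULL);
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(tids[i], NULL);
	printf("memory.stat reads: %ld\n", total_reads);
	return 0;
}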
Hello. On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen <gthelen@google.com> wrote: > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while > iterating all possible cpus. It only drops the lock if there is > scheduler or spin lock contention. If neither, then interrupts can be > disabled for a long time. On large machines this can disable interrupts > for a long enough time to drop network packets. On 400+ CPU machines > I've seen interrupt disabled for over 40 msec. This is peanuts, watchdog_thresh defaults to 10000 msec. (Tongue-in-cheek, to put that threshold into relation but I see the problem.) > The mode of memory.stat access latency after grouping by of 2 buckets: power > - without memory hogs: 64 usec => 16 usec > - with memory hogs: 64 usec => 8 usec > The memory.stat latency improves. > > Signed-off-by: Eric Dumazet <edumazet@google.com> > Signed-off-by: Greg Thelen <gthelen@google.com> > Tested-by: Greg Thelen <gthelen@google.com> > --- > kernel/cgroup/rstat.c | 12 +++++------- > 1 file changed, 5 insertions(+), 7 deletions(-) FTR, the lock may end up split per-subsys [1] but this would still make sense for memcg's one. (I wonder if Tejun would consider it small enough then to avoid interrupt disabling. Then this could be converted to more widely used cond_resched_lock().) [1] https://lore.kernel.org/r/20250227215543.49928-4-inwardvessel@gmail.com/ But all in all, thanks for this and Acked-by: Michal Koutný <mkoutny@suse.com>
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote: > From: Eric Dumazet <edumazet@google.com> > > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while > iterating all possible cpus. It only drops the lock if there is > scheduler or spin lock contention. If neither, then interrupts can be > disabled for a long time. On large machines this can disable interrupts > for a long enough time to drop network packets. On 400+ CPU machines > I've seen interrupt disabled for over 40 msec. > > Prevent rstat from disabling interrupts while processing all possible > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This > approach was previously discussed in > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/, > though this was in the context of an non-irq rstat spin lock. > > Benchmark this change with: > 1) a single stat_reader process with 400 threads, each reading a test > memcg's memory.stat repeatedly for 10 seconds. > 2) 400 memory hog processes running in the test memcg and repeatedly > charging memory until oom killed. Then they repeat charging and oom > killing. > > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds > interrupts are disabled by rstat for 45341 usec: > # => started at: _raw_spin_lock_irq > # => ended at: cgroup_rstat_flush > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > stat_rea-96532 52d.... 0us*: _raw_spin_lock_irq > stat_rea-96532 52d.... 45342us : cgroup_rstat_flush > stat_rea-96532 52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush > stat_rea-96532 52d.... 45343us : <stack trace> > => memcg1_stat_format > => memory_stat_format > => memory_stat_show > => seq_read_iter > => vfs_read > => ksys_read > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the > longest holder. The longest irqs-off holder has irqs disabled for > 4142 usec, a huge reduction from previous 45341 usec rstat finding. > > Running stat_reader memory.stat reader for 10 seconds: > - without memory hogs: 9.84M accesses => 12.7M accesses > - with memory hogs: 9.46M accesses => 11.1M accesses > The throughput of memory.stat access improves. > > The mode of memory.stat access latency after grouping by of 2 buckets: > - without memory hogs: 64 usec => 16 usec > - with memory hogs: 64 usec => 8 usec > The memory.stat latency improves. 
> > Signed-off-by: Eric Dumazet <edumazet@google.com> > Signed-off-by: Greg Thelen <gthelen@google.com> > Tested-by: Greg Thelen <gthelen@google.com> > --- > kernel/cgroup/rstat.c | 12 +++++------- > 1 file changed, 5 insertions(+), 7 deletions(-) > > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c > index aac91466279f..976c24b3671a 100644 > --- a/kernel/cgroup/rstat.c > +++ b/kernel/cgroup/rstat.c > @@ -323,13 +323,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp) > rcu_read_unlock(); > } > > - /* play nice and yield if necessary */ > - if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) { > - __cgroup_rstat_unlock(cgrp, cpu); > - if (!cond_resched()) > - cpu_relax(); > - __cgroup_rstat_lock(cgrp, cpu); > - } > + /* play nice and avoid disabling interrupts for a long time */ > + __cgroup_rstat_unlock(cgrp, cpu); > + if (!cond_resched()) > + cpu_relax(); > + __cgroup_rstat_lock(cgrp, cpu); > } > } Is not this going a little too far? the lock + irq trip is quite expensive in its own right and now is going to be paid for each cpu, as in the total time spent executing cgroup_rstat_flush_locked is going to go up. Would your problem go away toggling this every -- say -- 8 cpus? Just a suggestion.
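[Editor's note: to make the batching suggestion concrete, a variant that only drops the lock every few CPUs could look roughly like the sketch below. FLUSH_BATCH and the batch counter are hypothetical and not part of the posted patch; the per-cpu flushing body is elided, and the other identifiers follow the code quoted above.]

/* Hypothetical batching sketch: pay the unlock/relock (and irq toggle)
 * only once per FLUSH_BATCH cpus instead of on every iteration.
 */
#define FLUSH_BATCH	8

static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
{
	int cpu, batch = 0;

	lockdep_assert_held(&cgroup_rstat_lock);

	for_each_possible_cpu(cpu) {
		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);

		for (; pos; pos = pos->rstat_flush_next) {
			/* ... per-cpu flushing as in the existing code ... */
		}

		/* play nice, but amortize the lock + irq round trip */
		if (++batch % FLUSH_BATCH == 0) {
			__cgroup_rstat_unlock(cgrp, cpu);
			if (!cond_resched())
				cpu_relax();
			__cgroup_rstat_lock(cgrp, cpu);
		}
	}
}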
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote: > From: Eric Dumazet <edumazet@google.com> > > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while > iterating all possible cpus. It only drops the lock if there is > scheduler or spin lock contention. If neither, then interrupts can be > disabled for a long time. On large machines this can disable interrupts > for a long enough time to drop network packets. On 400+ CPU machines > I've seen interrupt disabled for over 40 msec. > > Prevent rstat from disabling interrupts while processing all possible > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This > approach was previously discussed in > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/, > though this was in the context of an non-irq rstat spin lock. > > Benchmark this change with: > 1) a single stat_reader process with 400 threads, each reading a test > memcg's memory.stat repeatedly for 10 seconds. > 2) 400 memory hog processes running in the test memcg and repeatedly > charging memory until oom killed. Then they repeat charging and oom > killing. > > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds > interrupts are disabled by rstat for 45341 usec: > # => started at: _raw_spin_lock_irq > # => ended at: cgroup_rstat_flush > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > stat_rea-96532 52d.... 0us*: _raw_spin_lock_irq > stat_rea-96532 52d.... 45342us : cgroup_rstat_flush > stat_rea-96532 52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush > stat_rea-96532 52d.... 45343us : <stack trace> > => memcg1_stat_format > => memory_stat_format > => memory_stat_show > => seq_read_iter > => vfs_read > => ksys_read > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the > longest holder. The longest irqs-off holder has irqs disabled for > 4142 usec, a huge reduction from previous 45341 usec rstat finding. > > Running stat_reader memory.stat reader for 10 seconds: > - without memory hogs: 9.84M accesses => 12.7M accesses > - with memory hogs: 9.46M accesses => 11.1M accesses > The throughput of memory.stat access improves. > > The mode of memory.stat access latency after grouping by of 2 buckets: > - without memory hogs: 64 usec => 16 usec > - with memory hogs: 64 usec => 8 usec > The memory.stat latency improves. 
> > Signed-off-by: Eric Dumazet <edumazet@google.com> > Signed-off-by: Greg Thelen <gthelen@google.com> > Tested-by: Greg Thelen <gthelen@google.com> > --- > kernel/cgroup/rstat.c | 12 +++++------- > 1 file changed, 5 insertions(+), 7 deletions(-) > > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c > index aac91466279f..976c24b3671a 100644 > --- a/kernel/cgroup/rstat.c > +++ b/kernel/cgroup/rstat.c > @@ -323,13 +323,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp) > rcu_read_unlock(); > } > > - /* play nice and yield if necessary */ > - if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) { > - __cgroup_rstat_unlock(cgrp, cpu); > - if (!cond_resched()) > - cpu_relax(); > - __cgroup_rstat_lock(cgrp, cpu); > - } > + /* play nice and avoid disabling interrupts for a long time */ > + __cgroup_rstat_unlock(cgrp, cpu); > + if (!cond_resched()) > + cpu_relax(); > + __cgroup_rstat_lock(cgrp, cpu); This patch looks good as-is, so feel free to add: Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> That being said I think we can do further cleanups here now. We should probably inline cgroup_rstat_flush_locked() into cgroup_rstat_flush(), and move the lock hold and release into the loop. cgroup_rstat_flush_hold() can simply call cgroup_rstat_flush() then hold the lock. Something like this on top: diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index 976c24b3671a7..4f4b8d22555d7 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -299,17 +299,29 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop) spin_unlock_irq(&cgroup_rstat_lock); } -/* see cgroup_rstat_flush() */ -static void cgroup_rstat_flush_locked(struct cgroup *cgrp) - __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock) +/** + * cgroup_rstat_flush - flush stats in @cgrp's subtree + * @cgrp: target cgroup + * + * Collect all per-cpu stats in @cgrp's subtree into the global counters + * and propagate them upwards. After this function returns, all cgroups in + * the subtree have up-to-date ->stat. + * + * This also gets all cgroups in the subtree including @cgrp off the + * ->updated_children lists. + * + * This function may block. + */ +__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) { int cpu; - lockdep_assert_held(&cgroup_rstat_lock); - + might_sleep(); for_each_possible_cpu(cpu) { struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu); + /* Reacquire for every CPU to avoiding disabing IRQs too long */ + __cgroup_rstat_lock(cgrp, cpu); for (; pos; pos = pos->rstat_flush_next) { struct cgroup_subsys_state *css; @@ -322,37 +334,12 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp) css->ss->css_rstat_flush(css, cpu); rcu_read_unlock(); } - - /* play nice and avoid disabling interrupts for a long time */ __cgroup_rstat_unlock(cgrp, cpu); if (!cond_resched()) cpu_relax(); - __cgroup_rstat_lock(cgrp, cpu); } } -/** - * cgroup_rstat_flush - flush stats in @cgrp's subtree - * @cgrp: target cgroup - * - * Collect all per-cpu stats in @cgrp's subtree into the global counters - * and propagate them upwards. After this function returns, all cgroups in - * the subtree have up-to-date ->stat. - * - * This also gets all cgroups in the subtree including @cgrp off the - * ->updated_children lists. - * - * This function may block. 
- */ -__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) -{ - might_sleep(); - - __cgroup_rstat_lock(cgrp, -1); - cgroup_rstat_flush_locked(cgrp); - __cgroup_rstat_unlock(cgrp, -1); -} - /** * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold * @cgrp: target cgroup @@ -365,9 +352,8 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) void cgroup_rstat_flush_hold(struct cgroup *cgrp) __acquires(&cgroup_rstat_lock) { - might_sleep(); + cgroup_rstat_flush(cgrp); __cgroup_rstat_lock(cgrp, -1); - cgroup_rstat_flush_locked(cgrp); } /**
On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote: > On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote: > > From: Eric Dumazet <edumazet@google.com> > > > > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while > > iterating all possible cpus. It only drops the lock if there is > > scheduler or spin lock contention. If neither, then interrupts can be > > disabled for a long time. On large machines this can disable interrupts > > for a long enough time to drop network packets. On 400+ CPU machines > > I've seen interrupt disabled for over 40 msec. > > > > Prevent rstat from disabling interrupts while processing all possible > > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This > > approach was previously discussed in > > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/, > > though this was in the context of an non-irq rstat spin lock. > > > > Benchmark this change with: > > 1) a single stat_reader process with 400 threads, each reading a test > > memcg's memory.stat repeatedly for 10 seconds. > > 2) 400 memory hog processes running in the test memcg and repeatedly > > charging memory until oom killed. Then they repeat charging and oom > > killing. > > > > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds > > interrupts are disabled by rstat for 45341 usec: > > # => started at: _raw_spin_lock_irq > > # => ended at: cgroup_rstat_flush > > # > > # > > # _------=> CPU# > > # / _-----=> irqs-off/BH-disabled > > # | / _----=> need-resched > > # || / _---=> hardirq/softirq > > # ||| / _--=> preempt-depth > > # |||| / _-=> migrate-disable > > # ||||| / delay > > # cmd pid |||||| time | caller > > # \ / |||||| \ | / > > stat_rea-96532 52d.... 0us*: _raw_spin_lock_irq > > stat_rea-96532 52d.... 45342us : cgroup_rstat_flush > > stat_rea-96532 52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush > > stat_rea-96532 52d.... 45343us : <stack trace> > > => memcg1_stat_format > > => memory_stat_format > > => memory_stat_show > > => seq_read_iter > > => vfs_read > > => ksys_read > > => do_syscall_64 > > => entry_SYSCALL_64_after_hwframe > > > > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the > > longest holder. The longest irqs-off holder has irqs disabled for > > 4142 usec, a huge reduction from previous 45341 usec rstat finding. > > > > Running stat_reader memory.stat reader for 10 seconds: > > - without memory hogs: 9.84M accesses => 12.7M accesses > > - with memory hogs: 9.46M accesses => 11.1M accesses > > The throughput of memory.stat access improves. > > > > The mode of memory.stat access latency after grouping by of 2 buckets: > > - without memory hogs: 64 usec => 16 usec > > - with memory hogs: 64 usec => 8 usec > > The memory.stat latency improves. 
> > > > Signed-off-by: Eric Dumazet <edumazet@google.com> > > Signed-off-by: Greg Thelen <gthelen@google.com> > > Tested-by: Greg Thelen <gthelen@google.com> > > --- > > kernel/cgroup/rstat.c | 12 +++++------- > > 1 file changed, 5 insertions(+), 7 deletions(-) > > > > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c > > index aac91466279f..976c24b3671a 100644 > > --- a/kernel/cgroup/rstat.c > > +++ b/kernel/cgroup/rstat.c > > @@ -323,13 +323,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp) > > rcu_read_unlock(); > > } > > > > - /* play nice and yield if necessary */ > > - if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) { > > - __cgroup_rstat_unlock(cgrp, cpu); > > - if (!cond_resched()) > > - cpu_relax(); > > - __cgroup_rstat_lock(cgrp, cpu); > > - } > > + /* play nice and avoid disabling interrupts for a long time */ > > + __cgroup_rstat_unlock(cgrp, cpu); > > + if (!cond_resched()) > > + cpu_relax(); > > + __cgroup_rstat_lock(cgrp, cpu); > > } > > } > > Is not this going a little too far? > > the lock + irq trip is quite expensive in its own right and now is > going to be paid for each cpu, as in the total time spent executing > cgroup_rstat_flush_locked is going to go up. > > Would your problem go away toggling this every -- say -- 8 cpus? I was concerned about this too, and about more lock bouncing, but the testing suggests that this actually overall improves the latency of cgroup_rstat_flush_locked() (at least on tested HW). So I don't think we need to do something like this unless a regression is observed. > > Just a suggestion. >
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote: > From: Eric Dumazet <edumazet@google.com> > > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while > iterating all possible cpus. It only drops the lock if there is > scheduler or spin lock contention. If neither, then interrupts can be > disabled for a long time. On large machines this can disable interrupts > for a long enough time to drop network packets. On 400+ CPU machines > I've seen interrupt disabled for over 40 msec. > > Prevent rstat from disabling interrupts while processing all possible > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This > approach was previously discussed in > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/, > though this was in the context of an non-irq rstat spin lock. > > Benchmark this change with: > 1) a single stat_reader process with 400 threads, each reading a test > memcg's memory.stat repeatedly for 10 seconds. > 2) 400 memory hog processes running in the test memcg and repeatedly > charging memory until oom killed. Then they repeat charging and oom > killing. > > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds > interrupts are disabled by rstat for 45341 usec: > # => started at: _raw_spin_lock_irq > # => ended at: cgroup_rstat_flush > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > stat_rea-96532 52d.... 0us*: _raw_spin_lock_irq > stat_rea-96532 52d.... 45342us : cgroup_rstat_flush > stat_rea-96532 52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush > stat_rea-96532 52d.... 45343us : <stack trace> > => memcg1_stat_format > => memory_stat_format > => memory_stat_show > => seq_read_iter > => vfs_read > => ksys_read > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the > longest holder. The longest irqs-off holder has irqs disabled for > 4142 usec, a huge reduction from previous 45341 usec rstat finding. > > Running stat_reader memory.stat reader for 10 seconds: > - without memory hogs: 9.84M accesses => 12.7M accesses > - with memory hogs: 9.46M accesses => 11.1M accesses > The throughput of memory.stat access improves. > > The mode of memory.stat access latency after grouping by of 2 buckets: > - without memory hogs: 64 usec => 16 usec > - with memory hogs: 64 usec => 8 usec > The memory.stat latency improves. > > Signed-off-by: Eric Dumazet <edumazet@google.com> > Signed-off-by: Greg Thelen <gthelen@google.com> > Tested-by: Greg Thelen <gthelen@google.com> Applied to cgroup/for-6.15. Thanks.
On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote: ... > Is not this going a little too far? > > the lock + irq trip is quite expensive in its own right and now is > going to be paid for each cpu, as in the total time spent executing > cgroup_rstat_flush_locked is going to go up. Lock/unlock when the cacheline is already on the cpu is pretty cheap and on modern x86 cpus at least irq on/off are also only a few cycles, so you probably wouldn't be able to tell the difference. Thanks.
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote: > From: Eric Dumazet <edumazet@google.com> > > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while > iterating all possible cpus. It only drops the lock if there is > scheduler or spin lock contention. If neither, then interrupts can be > disabled for a long time. On large machines this can disable interrupts > for a long enough time to drop network packets. On 400+ CPU machines > I've seen interrupt disabled for over 40 msec. > > Prevent rstat from disabling interrupts while processing all possible > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This > approach was previously discussed in > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/, > though this was in the context of an non-irq rstat spin lock. > > Benchmark this change with: > 1) a single stat_reader process with 400 threads, each reading a test > memcg's memory.stat repeatedly for 10 seconds. > 2) 400 memory hog processes running in the test memcg and repeatedly > charging memory until oom killed. Then they repeat charging and oom > killing. > > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds > interrupts are disabled by rstat for 45341 usec: > # => started at: _raw_spin_lock_irq > # => ended at: cgroup_rstat_flush > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > stat_rea-96532 52d.... 0us*: _raw_spin_lock_irq > stat_rea-96532 52d.... 45342us : cgroup_rstat_flush > stat_rea-96532 52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush > stat_rea-96532 52d.... 45343us : <stack trace> > => memcg1_stat_format > => memory_stat_format > => memory_stat_show > => seq_read_iter > => vfs_read > => ksys_read > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the > longest holder. The longest irqs-off holder has irqs disabled for > 4142 usec, a huge reduction from previous 45341 usec rstat finding. > > Running stat_reader memory.stat reader for 10 seconds: > - without memory hogs: 9.84M accesses => 12.7M accesses > - with memory hogs: 9.46M accesses => 11.1M accesses > The throughput of memory.stat access improves. > > The mode of memory.stat access latency after grouping by of 2 buckets: > - without memory hogs: 64 usec => 16 usec > - with memory hogs: 64 usec => 8 usec > The memory.stat latency improves. > > Signed-off-by: Eric Dumazet <edumazet@google.com> > Signed-off-by: Greg Thelen <gthelen@google.com> > Tested-by: Greg Thelen <gthelen@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote: > From: Eric Dumazet <edumazet@google.com> > > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while > iterating all possible cpus. It only drops the lock if there is > scheduler or spin lock contention. If neither, then interrupts can be > disabled for a long time. On large machines this can disable interrupts > for a long enough time to drop network packets. On 400+ CPU machines > I've seen interrupt disabled for over 40 msec. Which kernel was this observed on in production? > > Prevent rstat from disabling interrupts while processing all possible > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. Doing for each cpu might be too extreme. Have you tried with some batching? > This > approach was previously discussed in > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/, > though this was in the context of an non-irq rstat spin lock. > > Benchmark this change with: > 1) a single stat_reader process with 400 threads, each reading a test > memcg's memory.stat repeatedly for 10 seconds. > 2) 400 memory hog processes running in the test memcg and repeatedly > charging memory until oom killed. Then they repeat charging and oom > killing. > Though this benchmark seems too extreme but userspace holding off irqs for that long time is bad. BTW are these memory hoggers, creating anon memory or file memory? Is [z]swap enabled? For the long term, I think we can use make this work without disabling irqs, similar to how networking manages sock lock. > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds > interrupts are disabled by rstat for 45341 usec: > # => started at: _raw_spin_lock_irq > # => ended at: cgroup_rstat_flush > # > # > # _------=> CPU# > # / _-----=> irqs-off/BH-disabled > # | / _----=> need-resched > # || / _---=> hardirq/softirq > # ||| / _--=> preempt-depth > # |||| / _-=> migrate-disable > # ||||| / delay > # cmd pid |||||| time | caller > # \ / |||||| \ | / > stat_rea-96532 52d.... 0us*: _raw_spin_lock_irq > stat_rea-96532 52d.... 45342us : cgroup_rstat_flush > stat_rea-96532 52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush > stat_rea-96532 52d.... 45343us : <stack trace> > => memcg1_stat_format > => memory_stat_format > => memory_stat_show > => seq_read_iter > => vfs_read > => ksys_read > => do_syscall_64 > => entry_SYSCALL_64_after_hwframe > > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the > longest holder. The longest irqs-off holder has irqs disabled for > 4142 usec, a huge reduction from previous 45341 usec rstat finding. > > Running stat_reader memory.stat reader for 10 seconds: > - without memory hogs: 9.84M accesses => 12.7M accesses > - with memory hogs: 9.46M accesses => 11.1M accesses > The throughput of memory.stat access improves. > > The mode of memory.stat access latency after grouping by of 2 buckets: > - without memory hogs: 64 usec => 16 usec > - with memory hogs: 64 usec => 8 usec > The memory.stat latency improves. So, things are improving even without batching. I wonder if there are less readers then how will this look like. Can you try with single reader as well? > > Signed-off-by: Eric Dumazet <edumazet@google.com> > Signed-off-by: Greg Thelen <gthelen@google.com> > Tested-by: Greg Thelen <gthelen@google.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
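[Editor's note: for readers unfamiliar with the networking comparison above, lock_sock() holds its spinlock (with BH disabled) only long enough to hand off an ownership flag, and the actual work then runs without irqs/BH held off. A conceptual sketch of that ownership pattern is below; the names are purely illustrative and this is not a proposal for the rstat code.]

/* Illustrative "owner" lock in the style of lock_sock()/release_sock():
 * the spinlock protects only the owned flag, so it is held (and BH
 * disabled) for just a few instructions; a long-running flush would
 * execute while merely owning the lock, with interrupts enabled.
 */
struct owned_lock {
	spinlock_t		slock;
	bool			owned;
	wait_queue_head_t	wq;
};

static void owned_lock_acquire(struct owned_lock *l)
{
	spin_lock_bh(&l->slock);
	wait_event_cmd(l->wq, !l->owned,
		       spin_unlock_bh(&l->slock), spin_lock_bh(&l->slock));
	l->owned = true;
	spin_unlock_bh(&l->slock);
}

static void owned_lock_release(struct owned_lock *l)
{
	spin_lock_bh(&l->slock);
	l->owned = false;
	spin_unlock_bh(&l->slock);
	wake_up(&l->wq);
}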
On Wed, Mar 19, 2025 at 05:16:02PM +0000, Yosry Ahmed wrote: > @@ -365,9 +352,8 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) > void cgroup_rstat_flush_hold(struct cgroup *cgrp) > __acquires(&cgroup_rstat_lock) > { > - might_sleep(); > + cgroup_rstat_flush(cgrp); > __cgroup_rstat_lock(cgrp, -1); > - cgroup_rstat_flush_locked(cgrp); > } Might as well remove cgroup_rstat_flush_hold/release entirely? There are no external users, and the concept seems moot when the lock is dropped per default. cgroup_base_stat_cputime_show() can open-code the lock/unlock to stabilize the counts while reading. (btw, why do we not have any locking around the root stats in cgroup_base_stat_cputime_show()? There isn't anything preventing a reader from seeing all zeroes if another reader runs the memset() on cgrp->bstat, is there? Or double times...)
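[Editor's note: with flush_hold/release gone, cgroup_base_stat_cputime_show() might open-code the locking roughly as below. This is a sketch assuming the surrounding rstat.c code (including the ntime field and the -1 cpu_in_loop convention) stays as in the quoted patches; it is not a posted change.]

/* Sketch: flush first (which now takes/drops the lock per cpu), then
 * take cgroup_rstat_lock only to read a stable snapshot of the counts.
 */
void cgroup_base_stat_cputime_show(struct seq_file *seq)
{
	struct cgroup *cgrp = seq_css(seq)->cgroup;
	u64 usage, utime, stime, ntime;

	if (cgroup_parent(cgrp)) {
		cgroup_rstat_flush(cgrp);
		__cgroup_rstat_lock(cgrp, -1);
		usage = cgrp->bstat.cputime.sum_exec_runtime;
		cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
			       &utime, &stime);
		ntime = cgrp->bstat.ntime;
		__cgroup_rstat_unlock(cgrp, -1);
	} else {
		/* root: fill the values via root_cgroup_cputime(), as today */
	}

	/* ... do_div() and seq_printf() of the values, unchanged ... */
}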
On Wed, Mar 19, 2025 at 02:06:43PM -0400, Johannes Weiner wrote: > On Wed, Mar 19, 2025 at 05:16:02PM +0000, Yosry Ahmed wrote: > > @@ -365,9 +352,8 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) > > void cgroup_rstat_flush_hold(struct cgroup *cgrp) > > __acquires(&cgroup_rstat_lock) > > { > > - might_sleep(); > > + cgroup_rstat_flush(cgrp); > > __cgroup_rstat_lock(cgrp, -1); > > - cgroup_rstat_flush_locked(cgrp); > > } > > Might as well remove cgroup_rstat_flush_hold/release entirely? There > are no external users, and the concept seems moot when the lock is > dropped per default. cgroup_base_stat_cputime_show() can open-code the > lock/unlock to stabilize the counts while reading. Yeah I missed the fact that the users are internal because the functions are not static. I also don't see the point of keeping them. Tejun/Greg, should I send a patch on top of this one or do you prefer sending a new version? > (btw, why do we not have any locking around the root stats in > cgroup_base_stat_cputime_show()? There isn't anything preventing a > reader from seeing all zeroes if another reader runs the memset() on > cgrp->bstat, is there? Or double times...) (I think root_cgroup_cputime() operates on a stack allocated bstat, not cgrp->bstat)
On Wed, Mar 19, 2025 at 6:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote: > > From: Eric Dumazet <edumazet@google.com> > > > > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while > > iterating all possible cpus. It only drops the lock if there is > > scheduler or spin lock contention. If neither, then interrupts can be > > disabled for a long time. On large machines this can disable interrupts > > for a long enough time to drop network packets. On 400+ CPU machines > > I've seen interrupt disabled for over 40 msec. > > Which kernel was this observed on in production? > I had these issues with the latest upstream kernels, and have seen a 80 ms delay. ./trace-cgroup_rstat_flush_locked.sh Attaching 3 probes... ^C @cgroup_rstat_flush_locked_us: [64, 128) 2 |@ | [128, 256) 3 |@ | [256, 512) 2 |@ | [512, 1K) 6 |@@@ | [1K, 2K) 8 |@@@@@ | [2K, 4K) 3 |@ | [4K, 8K) 0 | | [8K, 16K) 0 | | [16K, 32K) 4 |@@ | [32K, 64K) 83 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64K, 128K) 74 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | @max_us: 79214 Greg in his tests was able to get 45 ms "only" > > > > > Prevent rstat from disabling interrupts while processing all possible > > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. > > Doing for each cpu might be too extreme. Have you tried with some > batching? > Not really, modern platforms are doing the unlock/lock pair faster than all the cache line misses done in the per-cpu loop. > > > This > > approach was previously discussed in > > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/, > > though this was in the context of an non-irq rstat spin lock. > > > > Benchmark this change with: > > 1) a single stat_reader process with 400 threads, each reading a test > > memcg's memory.stat repeatedly for 10 seconds. > > 2) 400 memory hog processes running in the test memcg and repeatedly > > charging memory until oom killed. Then they repeat charging and oom > > killing. > > > > Though this benchmark seems too extreme but userspace holding off irqs > for that long time is bad. BTW are these memory hoggers, creating anon > memory or file memory? Is [z]swap enabled? > > For the long term, I think we can use make this work without disabling > irqs, similar to how networking manages sock lock. > > > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds > > interrupts are disabled by rstat for 45341 usec: > > # => started at: _raw_spin_lock_irq > > # => ended at: cgroup_rstat_flush > > # > > # > > # _------=> CPU# > > # / _-----=> irqs-off/BH-disabled > > # | / _----=> need-resched > > # || / _---=> hardirq/softirq > > # ||| / _--=> preempt-depth > > # |||| / _-=> migrate-disable > > # ||||| / delay > > # cmd pid |||||| time | caller > > # \ / |||||| \ | / > > stat_rea-96532 52d.... 0us*: _raw_spin_lock_irq > > stat_rea-96532 52d.... 45342us : cgroup_rstat_flush > > stat_rea-96532 52d.... 45342us : tracer_hardirqs_on > <-cgroup_rstat_flush > > stat_rea-96532 52d.... 45343us : <stack trace> > > => memcg1_stat_format > > => memory_stat_format > > => memory_stat_show > > => seq_read_iter > > => vfs_read > > => ksys_read > > => do_syscall_64 > > => entry_SYSCALL_64_after_hwframe > > > > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the > > longest holder. The longest irqs-off holder has irqs disabled for > > 4142 usec, a huge reduction from previous 45341 usec rstat finding. 
> > > > Running stat_reader memory.stat reader for 10 seconds: > > - without memory hogs: 9.84M accesses => 12.7M accesses > > - with memory hogs: 9.46M accesses => 11.1M accesses > > The throughput of memory.stat access improves. > > > > The mode of memory.stat access latency after grouping by of 2 buckets: > > - without memory hogs: 64 usec => 16 usec > > - with memory hogs: 64 usec => 8 usec > > The memory.stat latency improves. > > So, things are improving even without batching. I wonder if there are > less readers then how will this look like. Can you try with single > reader as well? > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com> > > Signed-off-by: Greg Thelen <gthelen@google.com> > > Tested-by: Greg Thelen <gthelen@google.com> > > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >
On Wed, Mar 19, 2025 at 06:35:02PM +0000, Yosry Ahmed wrote: > On Wed, Mar 19, 2025 at 02:06:43PM -0400, Johannes Weiner wrote: > > On Wed, Mar 19, 2025 at 05:16:02PM +0000, Yosry Ahmed wrote: > > > @@ -365,9 +352,8 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) > > > void cgroup_rstat_flush_hold(struct cgroup *cgrp) > > > __acquires(&cgroup_rstat_lock) > > > { > > > - might_sleep(); > > > + cgroup_rstat_flush(cgrp); > > > __cgroup_rstat_lock(cgrp, -1); > > > - cgroup_rstat_flush_locked(cgrp); > > > } > > > > Might as well remove cgroup_rstat_flush_hold/release entirely? There > > are no external users, and the concept seems moot when the lock is > > dropped per default. cgroup_base_stat_cputime_show() can open-code the > > lock/unlock to stabilize the counts while reading. > > Yeah I missed the fact that the users are internal because the functions > are not static. I also don't see the point of keeping them. > > Tejun/Greg, should I send a patch on top of this one or do you prefer > sending a new version? Please send a patch on top. Thanks.
On Wed, Mar 19, 2025 at 06:35:02PM +0000, Yosry Ahmed wrote: > On Wed, Mar 19, 2025 at 02:06:43PM -0400, Johannes Weiner wrote: > > (btw, why do we not have any locking around the root stats in > > cgroup_base_stat_cputime_show()? There isn't anything preventing a > > reader from seeing all zeroes if another reader runs the memset() on > > cgrp->bstat, is there? Or double times...) > > (I think root_cgroup_cputime() operates on a stack allocated bstat, not > cgrp->bstat)

That was the case until:

commit b824766504e49f3fdcbb8c722e70996a78c3636e
Author: Chen Ridong <chenridong@huawei.com>
Date:   Thu Jul 4 14:01:19 2024 +0000

    cgroup/rstat: add force idle show helper

Now it's doing this:

void cgroup_base_stat_cputime_show(struct seq_file *seq)
{
	struct cgroup *cgrp = seq_css(seq)->cgroup;

	if (cgroup_parent(cgrp)) {
		...
	} else {
		/* cgrp->bstat of root is not actually used, reuse it */
		root_cgroup_cputime(&cgrp->bstat);
		usage = cgrp->bstat.cputime.sum_exec_runtime;
		utime = cgrp->bstat.cputime.utime;
		stime = cgrp->bstat.cputime.stime;
		ntime = cgrp->bstat.ntime;
	}
}

and then:

static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
{
	struct task_cputime *cputime = &bstat->cputime;
	int i;

	memset(bstat, 0, sizeof(*bstat));

	/* various += on bstat and cputime members */
}

So all readers are mucking with the root cgroup's bstat.
On Wed, Mar 19, 2025 at 03:16:42PM -0400, Johannes Weiner wrote: > On Wed, Mar 19, 2025 at 06:35:02PM +0000, Yosry Ahmed wrote: > > On Wed, Mar 19, 2025 at 02:06:43PM -0400, Johannes Weiner wrote: > > > (btw, why do we not have any locking around the root stats in > > > cgroup_base_stat_cputime_show()? There isn't anything preventing a > > > reader from seeing all zeroes if another reader runs the memset() on > > > cgrp->bstat, is there? Or double times...) > > > > (I think root_cgroup_cputime() operates on a stack allocated bstat, not > > cgrp->bstat) > > That was the case until: > > commit b824766504e49f3fdcbb8c722e70996a78c3636e > Author: Chen Ridong <chenridong@huawei.com> > Date: Thu Jul 4 14:01:19 2024 +0000 Nevermind, Tejun pointed me to the follow-up fix he's got already queued up: https://web.git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/commit/?id=c4af66a95aa3bc1d4f607ebd4eea524fb58946e3 That brings it all back on the stack.
On Wed, Mar 19, 2025 at 10:53 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote: > > From: Eric Dumazet <edumazet@google.com> > > > > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while > > iterating all possible cpus. It only drops the lock if there is > > scheduler or spin lock contention. If neither, then interrupts can be > > disabled for a long time. On large machines this can disable interrupts > > for a long enough time to drop network packets. On 400+ CPU machines > > I've seen interrupt disabled for over 40 msec. > > Which kernel was this observed on in production? > > > > > Prevent rstat from disabling interrupts while processing all possible > > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. > > Doing for each cpu might be too extreme. Have you tried with some > batching? > > > This > > approach was previously discussed in > > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/, > > though this was in the context of an non-irq rstat spin lock. > > > > Benchmark this change with: > > 1) a single stat_reader process with 400 threads, each reading a test > > memcg's memory.stat repeatedly for 10 seconds. > > 2) 400 memory hog processes running in the test memcg and repeatedly > > charging memory until oom killed. Then they repeat charging and oom > > killing. > > > > Though this benchmark seems too extreme but userspace holding off irqs > for that long time is bad. BTW are these memory hoggers, creating anon > memory or file memory? Is [z]swap enabled? The memory hoggers were anon, without any form of swap. I think the other questions were answered in other replies, but feel free to re-ask and I'll provide details. > For the long term, I think we can use make this work without disabling > irqs, similar to how networking manages sock lock. > > > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds > > interrupts are disabled by rstat for 45341 usec: > > # => started at: _raw_spin_lock_irq > > # => ended at: cgroup_rstat_flush > > # > > # > > # _------=> CPU# > > # / _-----=> irqs-off/BH-disabled > > # | / _----=> need-resched > > # || / _---=> hardirq/softirq > > # ||| / _--=> preempt-depth > > # |||| / _-=> migrate-disable > > # ||||| / delay > > # cmd pid |||||| time | caller > > # \ / |||||| \ | / > > stat_rea-96532 52d.... 0us*: _raw_spin_lock_irq > > stat_rea-96532 52d.... 45342us : cgroup_rstat_flush > > stat_rea-96532 52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush > > stat_rea-96532 52d.... 45343us : <stack trace> > > => memcg1_stat_format > > => memory_stat_format > > => memory_stat_show > > => seq_read_iter > > => vfs_read > > => ksys_read > > => do_syscall_64 > > => entry_SYSCALL_64_after_hwframe > > > > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the > > longest holder. The longest irqs-off holder has irqs disabled for > > 4142 usec, a huge reduction from previous 45341 usec rstat finding. > > > > Running stat_reader memory.stat reader for 10 seconds: > > - without memory hogs: 9.84M accesses => 12.7M accesses > > - with memory hogs: 9.46M accesses => 11.1M accesses > > The throughput of memory.stat access improves. > > > > The mode of memory.stat access latency after grouping by of 2 buckets: > > - without memory hogs: 64 usec => 16 usec > > - with memory hogs: 64 usec => 8 usec > > The memory.stat latency improves. > > So, things are improving even without batching. 
I wonder if there are > less readers then how will this look like. Can you try with single > reader as well? > > > > > Signed-off-by: Eric Dumazet <edumazet@google.com> > > Signed-off-by: Greg Thelen <gthelen@google.com> > > Tested-by: Greg Thelen <gthelen@google.com> > > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> >
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index aac91466279f..976c24b3671a 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -323,13 +323,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 			rcu_read_unlock();
 		}
 
-		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-			__cgroup_rstat_unlock(cgrp, cpu);
-			if (!cond_resched())
-				cpu_relax();
-			__cgroup_rstat_lock(cgrp, cpu);
-		}
+		/* play nice and avoid disabling interrupts for a long time */
+		__cgroup_rstat_unlock(cgrp, cpu);
+		if (!cond_resched())
+			cpu_relax();
+		__cgroup_rstat_lock(cgrp, cpu);
 	}
 }