Message ID | 20230323040037.2389095-2-yosryahmed@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Make rstat flushing IRQ and sleep friendly | expand |
On Wed, Mar 22, 2023 at 9:00 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > Currently, when sleeping is not allowed during rstat flushing, we hold > the global rstat lock with interrupts disabled throughout the entire > flush operation. Flushing in an O(# cgroups * # cpus) operation, and > having interrupts disabled throughout is dangerous. > > For some contexts, we may not want to sleep, but can be interrupted > (e.g. while holding a spinlock or RCU read lock). As such, do not > disable interrupts throughout rstat flushing, only when holding the > percpu lock. This breaks down the O(# cgroups * # cpus) duration with > interrupts disabled to a series of O(# cgroups) durations. > > Furthermore, if a cpu spinning waiting for the global rstat lock, it > doesn't need to spin with interrupts disabled anymore. > > If the caller of rstat flushing needs interrupts to be disabled, it's up > to them to decide that, and it should be fine to hold the global rstat > lock with interrupts disabled. There is currently a single context that > may invoke rstat flushing with interrupts disabled, the > mem_cgroup_flush_stats() call in mem_cgroup_usage(), if called from > mem_cgroup_threshold(). > > To make it safe to hold the global rstat lock with interrupts enabled, > make sure we only flush from in_task() contexts. The side effect of that > we read stale stats in interrupt context, but this should be okay, as > flushing in interrupt context is dangerous anyway as it is an expensive > operation, so reading stale stats is safer. > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Couple of questions: 1. What exactly is cgroup_rstat_lock protecting? Can we just remove it altogether? 2. Are we really calling rstat flush in irq context? 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only done for root memcg. Why is mem_cgroup_threshold() interested in root memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
On Wed, Mar 22, 2023 at 9:29 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Wed, Mar 22, 2023 at 9:00 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > Currently, when sleeping is not allowed during rstat flushing, we hold > > the global rstat lock with interrupts disabled throughout the entire > > flush operation. Flushing in an O(# cgroups * # cpus) operation, and > > having interrupts disabled throughout is dangerous. > > > > For some contexts, we may not want to sleep, but can be interrupted > > (e.g. while holding a spinlock or RCU read lock). As such, do not > > disable interrupts throughout rstat flushing, only when holding the > > percpu lock. This breaks down the O(# cgroups * # cpus) duration with > > interrupts disabled to a series of O(# cgroups) durations. > > > > Furthermore, if a cpu spinning waiting for the global rstat lock, it > > doesn't need to spin with interrupts disabled anymore. > > > > If the caller of rstat flushing needs interrupts to be disabled, it's up > > to them to decide that, and it should be fine to hold the global rstat > > lock with interrupts disabled. There is currently a single context that > > may invoke rstat flushing with interrupts disabled, the > > mem_cgroup_flush_stats() call in mem_cgroup_usage(), if called from > > mem_cgroup_threshold(). > > > > To make it safe to hold the global rstat lock with interrupts enabled, > > make sure we only flush from in_task() contexts. The side effect of that > > we read stale stats in interrupt context, but this should be okay, as > > flushing in interrupt context is dangerous anyway as it is an expensive > > operation, so reading stale stats is safer. > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > Couple of questions: > > 1. What exactly is cgroup_rstat_lock protecting? Can we just remove it > altogether? I believe it protects the global state variables that we flush into. For example, for memcg, it protects mem_cgroup->vmstats. I tried removing the lock and allowing concurrent flushing on different cpus, by changing mem_cgroup->vmstats to use atomics instead, but that turned out to be a little expensive. Also, cgroup_rstat_lock is already contended by different flushers (mitigated by stats_flush_lock on the memcg side). If we remove it, concurrent flushers contend on every single percpu lock instead, which also seems to be expensive. > 2. Are we really calling rstat flush in irq context? I think it is possible through the charge/uncharge path: memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I added the protection against flushing in an interrupt context for future callers as well, as it may cause a deadlock if we don't disable interrupts when acquiring cgroup_rstat_lock. > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > done for root memcg. Why is mem_cgroup_threshold() interested in root > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? I am not sure, but the code looks like event notifications may be set up on root memcg, which is why we need to check thresholds. Even if mem_cgroup_threshold() does not flush memcg stats, the purpose of this patch is to make sure the rstat flushing code itself is not disabling interrupts; which it currently does for any unsleepable context, even if it is interruptible.
On Wed, Mar 22, 2023 at 10:15 PM Yosry Ahmed <yosryahmed@google.com> wrote: > [...] > > Couple of questions: > > > > 1. What exactly is cgroup_rstat_lock protecting? Can we just remove it > > altogether? > > I believe it protects the global state variables that we flush into. > For example, for memcg, it protects mem_cgroup->vmstats. > > I tried removing the lock and allowing concurrent flushing on > different cpus, by changing mem_cgroup->vmstats to use atomics > instead, but that turned out to be a little expensive. Also, > cgroup_rstat_lock is already contended by different flushers > (mitigated by stats_flush_lock on the memcg side). If we remove it, > concurrent flushers contend on every single percpu lock instead, which > also seems to be expensive. We should add a comment on what it is protecting. I think block rstat are fine but memcg and bpf would need this. > > > 2. Are we really calling rstat flush in irq context? > > I think it is possible through the charge/uncharge path: > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > added the protection against flushing in an interrupt context for > future callers as well, as it may cause a deadlock if we don't disable > interrupts when acquiring cgroup_rstat_lock. > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > I am not sure, but the code looks like event notifications may be set > up on root memcg, which is why we need to check thresholds. This is something we should deprecate as root memcg's usage is ill defined. > > Even if mem_cgroup_threshold() does not flush memcg stats, the purpose > of this patch is to make sure the rstat flushing code itself is not > disabling interrupts; which it currently does for any unsleepable > context, even if it is interruptible. Basically I am saying we should aim for VM_BUG_ON(!in_task()) in the flush function rather than adding should_skip_flush() which does not stop potential new irq flushers.
On Wed, Mar 22, 2023 at 11:33 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Wed, Mar 22, 2023 at 10:15 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > [...] > > > Couple of questions: > > > > > > 1. What exactly is cgroup_rstat_lock protecting? Can we just remove it > > > altogether? > > > > I believe it protects the global state variables that we flush into. > > For example, for memcg, it protects mem_cgroup->vmstats. > > > > I tried removing the lock and allowing concurrent flushing on > > different cpus, by changing mem_cgroup->vmstats to use atomics > > instead, but that turned out to be a little expensive. Also, > > cgroup_rstat_lock is already contended by different flushers > > (mitigated by stats_flush_lock on the memcg side). If we remove it, > > concurrent flushers contend on every single percpu lock instead, which > > also seems to be expensive. > > We should add a comment on what it is protecting. I think block rstat > are fine but memcg and bpf would need this. I think it also protects the cpu base stats flushed by cgroup_base_stat_flush(). I will add a comment in the next version. > > > > > > 2. Are we really calling rstat flush in irq context? > > > > I think it is possible through the charge/uncharge path: > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > > added the protection against flushing in an interrupt context for > > future callers as well, as it may cause a deadlock if we don't disable > > interrupts when acquiring cgroup_rstat_lock. > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > > > I am not sure, but the code looks like event notifications may be set > > up on root memcg, which is why we need to check thresholds. > > This is something we should deprecate as root memcg's usage is ill defined. Right, but I think this would be orthogonal to this patch series. > > > > > Even if mem_cgroup_threshold() does not flush memcg stats, the purpose > > of this patch is to make sure the rstat flushing code itself is not > > disabling interrupts; which it currently does for any unsleepable > > context, even if it is interruptible. > > Basically I am saying we should aim for VM_BUG_ON(!in_task()) in the > flush function rather than adding should_skip_flush() which does not > stop potential new irq flushers. I wanted to start with VM_BUG_ON(!in_task()) but I wasn't sure that all contexts that call rstat flushing are not in irq contexts. I added should_skip_flush() so that if there are existing flushers in irq context, or new flushers are added, we are protected against a deadlock. We can change should_skip_flush() to have a WARN_ON_ONCE(!in_task()) to alert in this case. If you prefer removing should_skip_flush() and just adding VM_BUG_ON(!in_task()) we can do that, but personally I was not confident enough that we have no code paths today that may attempt flushing from irq context.
On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > [...] > > > > > > > 2. Are we really calling rstat flush in irq context? > > > > > > I think it is possible through the charge/uncharge path: > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > > > added the protection against flushing in an interrupt context for > > > future callers as well, as it may cause a deadlock if we don't disable > > > interrupts when acquiring cgroup_rstat_lock. > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > > > > > I am not sure, but the code looks like event notifications may be set > > > up on root memcg, which is why we need to check thresholds. > > > > This is something we should deprecate as root memcg's usage is ill defined. > > Right, but I think this would be orthogonal to this patch series. > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock without either breaking a link between mem_cgroup_threshold and cgroup_rstat_lock or make mem_cgroup_threshold work without disabling irqs. So, this patch can not be applied before either of those two tasks are done (and we may find more such scenarios).
On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote: > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > [...] > > > > > > > > > 2. Are we really calling rstat flush in irq context? > > > > > > > > I think it is possible through the charge/uncharge path: > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > > > > added the protection against flushing in an interrupt context for > > > > future callers as well, as it may cause a deadlock if we don't disable > > > > interrupts when acquiring cgroup_rstat_lock. > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > > > > > > > I am not sure, but the code looks like event notifications may be set > > > > up on root memcg, which is why we need to check thresholds. > > > > > > This is something we should deprecate as root memcg's usage is ill defined. > > > > Right, but I think this would be orthogonal to this patch series. > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock > without either breaking a link between mem_cgroup_threshold and > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling > irqs. > > So, this patch can not be applied before either of those two tasks are > done (and we may find more such scenarios). Could you elaborate why? My understanding is that with an in_task() check to make sure we only acquire cgroup_rstat_lock from non-irq context it should be fine to acquire cgroup_rstat_lock without disabling interrupts.
On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > [...] > > > > > > > > > > > 2. Are we really calling rstat flush in irq context? > > > > > > > > > > I think it is possible through the charge/uncharge path: > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > > > > > added the protection against flushing in an interrupt context for > > > > > future callers as well, as it may cause a deadlock if we don't disable > > > > > interrupts when acquiring cgroup_rstat_lock. > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > > > > > > > > > I am not sure, but the code looks like event notifications may be set > > > > > up on root memcg, which is why we need to check thresholds. > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined. > > > > > > Right, but I think this would be orthogonal to this patch series. > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock > > without either breaking a link between mem_cgroup_threshold and > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling > > irqs. > > > > So, this patch can not be applied before either of those two tasks are > > done (and we may find more such scenarios). > > > Could you elaborate why? > > My understanding is that with an in_task() check to make sure we only > acquire cgroup_rstat_lock from non-irq context it should be fine to > acquire cgroup_rstat_lock without disabling interrupts. From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken with irq disabled while other code paths will take cgroup_rstat_lock with irq enabled. This is a potential deadlock hazard unless cgroup_rstat_lock is always taken with irq disabled.
On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > [...] > > > > > > > > > > > > > 2. Are we really calling rstat flush in irq context? > > > > > > > > > > > > I think it is possible through the charge/uncharge path: > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > > > > > > added the protection against flushing in an interrupt context for > > > > > > future callers as well, as it may cause a deadlock if we don't disable > > > > > > interrupts when acquiring cgroup_rstat_lock. > > > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > > > > > > > > > > > I am not sure, but the code looks like event notifications may be set > > > > > > up on root memcg, which is why we need to check thresholds. > > > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined. > > > > > > > > Right, but I think this would be orthogonal to this patch series. > > > > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock > > > without either breaking a link between mem_cgroup_threshold and > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling > > > irqs. > > > > > > So, this patch can not be applied before either of those two tasks are > > > done (and we may find more such scenarios). > > > > > > Could you elaborate why? > > > > My understanding is that with an in_task() check to make sure we only > > acquire cgroup_rstat_lock from non-irq context it should be fine to > > acquire cgroup_rstat_lock without disabling interrupts. > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken > with irq disabled while other code paths will take cgroup_rstat_lock > with irq enabled. This is a potential deadlock hazard unless > cgroup_rstat_lock is always taken with irq disabled. Oh you are making sure it is not taken in the irq context through should_skip_flush(). Hmm seems like a hack. Normally it is recommended to actually remove all such users instead of silently ignoring/bypassing the functionality. So, how about removing mem_cgroup_flush_stats() from mem_cgroup_usage(). It will break the known chain which is taking cgroup_rstat_lock with irq disabled and you can add WARN_ON_ONCE(!in_task()).
On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote: > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > [...] > > > > > > > > > > > > > > > 2. Are we really calling rstat flush in irq context? > > > > > > > > > > > > > > I think it is possible through the charge/uncharge path: > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > > > > > > > added the protection against flushing in an interrupt context for > > > > > > > future callers as well, as it may cause a deadlock if we don't disable > > > > > > > interrupts when acquiring cgroup_rstat_lock. > > > > > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > > > > > > > > > > > > > I am not sure, but the code looks like event notifications may be set > > > > > > > up on root memcg, which is why we need to check thresholds. > > > > > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined. > > > > > > > > > > Right, but I think this would be orthogonal to this patch series. > > > > > > > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock > > > > without either breaking a link between mem_cgroup_threshold and > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling > > > > irqs. > > > > > > > > So, this patch can not be applied before either of those two tasks are > > > > done (and we may find more such scenarios). > > > > > > > > > Could you elaborate why? > > > > > > My understanding is that with an in_task() check to make sure we only > > > acquire cgroup_rstat_lock from non-irq context it should be fine to > > > acquire cgroup_rstat_lock without disabling interrupts. > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken > > with irq disabled while other code paths will take cgroup_rstat_lock > > with irq enabled. This is a potential deadlock hazard unless > > cgroup_rstat_lock is always taken with irq disabled. > > Oh you are making sure it is not taken in the irq context through > should_skip_flush(). Hmm seems like a hack. Normally it is recommended > to actually remove all such users instead of silently > ignoring/bypassing the functionality. It is a workaround, we simply accept to read stale stats in irq context instead of the expensive flush operation. > > So, how about removing mem_cgroup_flush_stats() from > mem_cgroup_usage(). It will break the known chain which is taking > cgroup_rstat_lock with irq disabled and you can add > WARN_ON_ONCE(!in_task()). This changes the behavior in a more obvious way because: 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage() path is also exercised in a lot of paths outside irq context, this will change the behavior for any event thresholds on the root memcg. With proposed skipped flushing in irq context we only change the behavior in a small subset of cases. I think we can skip flushing in irq context for now, and separately deprecate threshold events for the root memcg. When that is done we can come back and remove should_skip_flush() and add a VM_BUG_ON or WARN_ON_ONCE instead. WDYT? 2. mem_cgroup_usage() is also used when reading usage from userspace. This should be an easy workaround though.
On Thu, Mar 23, 2023 at 9:18 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > 2. Are we really calling rstat flush in irq context? > > > > > > > > > > > > > > > > I think it is possible through the charge/uncharge path: > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > > > > > > > > added the protection against flushing in an interrupt context for > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable > > > > > > > > interrupts when acquiring cgroup_rstat_lock. > > > > > > > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > > > > > > > > > > > > > > > I am not sure, but the code looks like event notifications may be set > > > > > > > > up on root memcg, which is why we need to check thresholds. > > > > > > > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined. > > > > > > > > > > > > Right, but I think this would be orthogonal to this patch series. > > > > > > > > > > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock > > > > > without either breaking a link between mem_cgroup_threshold and > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling > > > > > irqs. > > > > > > > > > > So, this patch can not be applied before either of those two tasks are > > > > > done (and we may find more such scenarios). > > > > > > > > > > > > Could you elaborate why? > > > > > > > > My understanding is that with an in_task() check to make sure we only > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to > > > > acquire cgroup_rstat_lock without disabling interrupts. > > > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken > > > with irq disabled while other code paths will take cgroup_rstat_lock > > > with irq enabled. This is a potential deadlock hazard unless > > > cgroup_rstat_lock is always taken with irq disabled. > > > > Oh you are making sure it is not taken in the irq context through > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended > > to actually remove all such users instead of silently > > ignoring/bypassing the functionality. > > It is a workaround, we simply accept to read stale stats in irq > context instead of the expensive flush operation. > > > > > So, how about removing mem_cgroup_flush_stats() from > > mem_cgroup_usage(). It will break the known chain which is taking > > cgroup_rstat_lock with irq disabled and you can add > > WARN_ON_ONCE(!in_task()). > > This changes the behavior in a more obvious way because: > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage() > path is also exercised in a lot of paths outside irq context, this > will change the behavior for any event thresholds on the root memcg. > With proposed skipped flushing in irq context we only change the > behavior in a small subset of cases. > > I think we can skip flushing in irq context for now, and separately > deprecate threshold events for the root memcg. When that is done we > can come back and remove should_skip_flush() and add a VM_BUG_ON or > WARN_ON_ONCE instead. WDYT? > > 2. mem_cgroup_usage() is also used when reading usage from userspace. > This should be an easy workaround though. This is a cgroup v1 behavior and to me it is totally reasonable to get the 2 second stale root's usage. Even if you want to skip flushing in irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the rstat core code. This way we will know if other subsystems are doing the same or not.
On Thu, Mar 23, 2023 at 9:29 AM Shakeel Butt <shakeelb@google.com> wrote: > > On Thu, Mar 23, 2023 at 9:18 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > 2. Are we really calling rstat flush in irq context? > > > > > > > > > > > > > > > > > > I think it is possible through the charge/uncharge path: > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > > > > > > > > > added the protection against flushing in an interrupt context for > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable > > > > > > > > > interrupts when acquiring cgroup_rstat_lock. > > > > > > > > > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > > > > > > > > > > > > > > > > > I am not sure, but the code looks like event notifications may be set > > > > > > > > > up on root memcg, which is why we need to check thresholds. > > > > > > > > > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined. > > > > > > > > > > > > > > Right, but I think this would be orthogonal to this patch series. > > > > > > > > > > > > > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock > > > > > > without either breaking a link between mem_cgroup_threshold and > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling > > > > > > irqs. > > > > > > > > > > > > So, this patch can not be applied before either of those two tasks are > > > > > > done (and we may find more such scenarios). > > > > > > > > > > > > > > > Could you elaborate why? > > > > > > > > > > My understanding is that with an in_task() check to make sure we only > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to > > > > > acquire cgroup_rstat_lock without disabling interrupts. > > > > > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken > > > > with irq disabled while other code paths will take cgroup_rstat_lock > > > > with irq enabled. This is a potential deadlock hazard unless > > > > cgroup_rstat_lock is always taken with irq disabled. > > > > > > Oh you are making sure it is not taken in the irq context through > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended > > > to actually remove all such users instead of silently > > > ignoring/bypassing the functionality. > > > > It is a workaround, we simply accept to read stale stats in irq > > context instead of the expensive flush operation. > > > > > > > > So, how about removing mem_cgroup_flush_stats() from > > > mem_cgroup_usage(). It will break the known chain which is taking > > > cgroup_rstat_lock with irq disabled and you can add > > > WARN_ON_ONCE(!in_task()). > > > > This changes the behavior in a more obvious way because: > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage() > > path is also exercised in a lot of paths outside irq context, this > > will change the behavior for any event thresholds on the root memcg. > > With proposed skipped flushing in irq context we only change the > > behavior in a small subset of cases. > > > > I think we can skip flushing in irq context for now, and separately > > deprecate threshold events for the root memcg. When that is done we > > can come back and remove should_skip_flush() and add a VM_BUG_ON or > > WARN_ON_ONCE instead. WDYT? > > > > 2. mem_cgroup_usage() is also used when reading usage from userspace. > > This should be an easy workaround though. > > This is a cgroup v1 behavior and to me it is totally reasonable to get > the 2 second stale root's usage. Even if you want to skip flushing in > irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the > rstat core code. This way we will know if other subsystems are doing > the same or not. We can do that. Basically in mem_cgroup_usage() have: /* Some useful comment */ if (in_task()) mem_cgroup_flush_stats(); and in cgroup_rstat_flush() have: WARN_ON_ONCE(!in_task()); I am assuming VM_BUG_ON is not used outside mm code. The only thing that worries me is that if there is another unlikely path somewhere that flushes stats in irq context we may run into a deadlock. I am a little bit nervous about not skipping flushing if !in_task() in cgroup_rstat_flush().
On Thu, Mar 23, 2023 at 9:37 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, Mar 23, 2023 at 9:29 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > On Thu, Mar 23, 2023 at 9:18 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > > > 2. Are we really calling rstat flush in irq context? > > > > > > > > > > > > > > > > > > > > I think it is possible through the charge/uncharge path: > > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > > > > > > > > > > added the protection against flushing in an interrupt context for > > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable > > > > > > > > > > interrupts when acquiring cgroup_rstat_lock. > > > > > > > > > > > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > > > > > > > > > > > > > > > > > > > I am not sure, but the code looks like event notifications may be set > > > > > > > > > > up on root memcg, which is why we need to check thresholds. > > > > > > > > > > > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined. > > > > > > > > > > > > > > > > Right, but I think this would be orthogonal to this patch series. > > > > > > > > > > > > > > > > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock > > > > > > > without either breaking a link between mem_cgroup_threshold and > > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling > > > > > > > irqs. > > > > > > > > > > > > > > So, this patch can not be applied before either of those two tasks are > > > > > > > done (and we may find more such scenarios). > > > > > > > > > > > > > > > > > > Could you elaborate why? > > > > > > > > > > > > My understanding is that with an in_task() check to make sure we only > > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to > > > > > > acquire cgroup_rstat_lock without disabling interrupts. > > > > > > > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken > > > > > with irq disabled while other code paths will take cgroup_rstat_lock > > > > > with irq enabled. This is a potential deadlock hazard unless > > > > > cgroup_rstat_lock is always taken with irq disabled. > > > > > > > > Oh you are making sure it is not taken in the irq context through > > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended > > > > to actually remove all such users instead of silently > > > > ignoring/bypassing the functionality. > > > > > > It is a workaround, we simply accept to read stale stats in irq > > > context instead of the expensive flush operation. > > > > > > > > > > > So, how about removing mem_cgroup_flush_stats() from > > > > mem_cgroup_usage(). It will break the known chain which is taking > > > > cgroup_rstat_lock with irq disabled and you can add > > > > WARN_ON_ONCE(!in_task()). > > > > > > This changes the behavior in a more obvious way because: > > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage() > > > path is also exercised in a lot of paths outside irq context, this > > > will change the behavior for any event thresholds on the root memcg. > > > With proposed skipped flushing in irq context we only change the > > > behavior in a small subset of cases. > > > > > > I think we can skip flushing in irq context for now, and separately > > > deprecate threshold events for the root memcg. When that is done we > > > can come back and remove should_skip_flush() and add a VM_BUG_ON or > > > WARN_ON_ONCE instead. WDYT? > > > > > > 2. mem_cgroup_usage() is also used when reading usage from userspace. > > > This should be an easy workaround though. > > > > This is a cgroup v1 behavior and to me it is totally reasonable to get > > the 2 second stale root's usage. Even if you want to skip flushing in > > irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the > > rstat core code. This way we will know if other subsystems are doing > > the same or not. > > We can do that. Basically in mem_cgroup_usage() have: > > /* Some useful comment */ > if (in_task()) > mem_cgroup_flush_stats(); > > and in cgroup_rstat_flush() have: > WARN_ON_ONCE(!in_task()); > > I am assuming VM_BUG_ON is not used outside mm code. > > The only thing that worries me is that if there is another unlikely > path somewhere that flushes stats in irq context we may run into a > deadlock. I am a little bit nervous about not skipping flushing if > !in_task() in cgroup_rstat_flush(). I think it is a good thing. We will find such scenarios and fix those instead of hiding them forever or keeping the door open for new such scenarios.
On Thu, Mar 23, 2023 at 9:45 AM Shakeel Butt <shakeelb@google.com> wrote: > > On Thu, Mar 23, 2023 at 9:37 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Thu, Mar 23, 2023 at 9:29 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > On Thu, Mar 23, 2023 at 9:18 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > > > > > 2. Are we really calling rstat flush in irq context? > > > > > > > > > > > > > > > > > > > > > > I think it is possible through the charge/uncharge path: > > > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > > > > > > > > > > > added the protection against flushing in an interrupt context for > > > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable > > > > > > > > > > > interrupts when acquiring cgroup_rstat_lock. > > > > > > > > > > > > > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > > > > > > > > > > > > > > > > > > > > > I am not sure, but the code looks like event notifications may be set > > > > > > > > > > > up on root memcg, which is why we need to check thresholds. > > > > > > > > > > > > > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined. > > > > > > > > > > > > > > > > > > Right, but I think this would be orthogonal to this patch series. > > > > > > > > > > > > > > > > > > > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock > > > > > > > > without either breaking a link between mem_cgroup_threshold and > > > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling > > > > > > > > irqs. > > > > > > > > > > > > > > > > So, this patch can not be applied before either of those two tasks are > > > > > > > > done (and we may find more such scenarios). > > > > > > > > > > > > > > > > > > > > > Could you elaborate why? > > > > > > > > > > > > > > My understanding is that with an in_task() check to make sure we only > > > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to > > > > > > > acquire cgroup_rstat_lock without disabling interrupts. > > > > > > > > > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken > > > > > > with irq disabled while other code paths will take cgroup_rstat_lock > > > > > > with irq enabled. This is a potential deadlock hazard unless > > > > > > cgroup_rstat_lock is always taken with irq disabled. > > > > > > > > > > Oh you are making sure it is not taken in the irq context through > > > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended > > > > > to actually remove all such users instead of silently > > > > > ignoring/bypassing the functionality. > > > > > > > > It is a workaround, we simply accept to read stale stats in irq > > > > context instead of the expensive flush operation. > > > > > > > > > > > > > > So, how about removing mem_cgroup_flush_stats() from > > > > > mem_cgroup_usage(). It will break the known chain which is taking > > > > > cgroup_rstat_lock with irq disabled and you can add > > > > > WARN_ON_ONCE(!in_task()). > > > > > > > > This changes the behavior in a more obvious way because: > > > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage() > > > > path is also exercised in a lot of paths outside irq context, this > > > > will change the behavior for any event thresholds on the root memcg. > > > > With proposed skipped flushing in irq context we only change the > > > > behavior in a small subset of cases. > > > > > > > > I think we can skip flushing in irq context for now, and separately > > > > deprecate threshold events for the root memcg. When that is done we > > > > can come back and remove should_skip_flush() and add a VM_BUG_ON or > > > > WARN_ON_ONCE instead. WDYT? > > > > > > > > 2. mem_cgroup_usage() is also used when reading usage from userspace. > > > > This should be an easy workaround though. > > > > > > This is a cgroup v1 behavior and to me it is totally reasonable to get > > > the 2 second stale root's usage. Even if you want to skip flushing in > > > irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the > > > rstat core code. This way we will know if other subsystems are doing > > > the same or not. > > > > We can do that. Basically in mem_cgroup_usage() have: > > > > /* Some useful comment */ > > if (in_task()) > > mem_cgroup_flush_stats(); > > > > and in cgroup_rstat_flush() have: > > WARN_ON_ONCE(!in_task()); > > > > I am assuming VM_BUG_ON is not used outside mm code. > > > > The only thing that worries me is that if there is another unlikely > > path somewhere that flushes stats in irq context we may run into a > > deadlock. I am a little bit nervous about not skipping flushing if > > !in_task() in cgroup_rstat_flush(). > > I think it is a good thing. We will find such scenarios and fix those > instead of hiding them forever or keeping the door open for new such > scenarios. Sure, I can do that in the next version. I will include a patch that adds an in_task() check to mem_cgroup_usage() before this one. Since BUG_ON() is discouraged and VM_BUG_ON() is mm specific, I guess we are left with WARN_ON_ONCE() for the rstat core code, right? Thanks Shakeel. Any other thoughts I should address for the next version?
On Thu, Mar 23, 2023 at 09:17:33AM -0700, Yosry Ahmed wrote: > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > 2. Are we really calling rstat flush in irq context? > > > > > > > > > > > > > > > > I think it is possible through the charge/uncharge path: > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > > > > > > > > added the protection against flushing in an interrupt context for > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable > > > > > > > > interrupts when acquiring cgroup_rstat_lock. > > > > > > > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > > > > > > > > > > > > > > > I am not sure, but the code looks like event notifications may be set > > > > > > > > up on root memcg, which is why we need to check thresholds. > > > > > > > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined. > > > > > > > > > > > > Right, but I think this would be orthogonal to this patch series. > > > > > > > > > > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock > > > > > without either breaking a link between mem_cgroup_threshold and > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling > > > > > irqs. > > > > > > > > > > So, this patch can not be applied before either of those two tasks are > > > > > done (and we may find more such scenarios). > > > > > > > > > > > > Could you elaborate why? > > > > > > > > My understanding is that with an in_task() check to make sure we only > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to > > > > acquire cgroup_rstat_lock without disabling interrupts. > > > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken > > > with irq disabled while other code paths will take cgroup_rstat_lock > > > with irq enabled. This is a potential deadlock hazard unless > > > cgroup_rstat_lock is always taken with irq disabled. > > > > Oh you are making sure it is not taken in the irq context through > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended > > to actually remove all such users instead of silently > > ignoring/bypassing the functionality. +1 It shouldn't silently skip the requested operation, rather it shouldn't be requested from an incompatible context. > > So, how about removing mem_cgroup_flush_stats() from > > mem_cgroup_usage(). It will break the known chain which is taking > > cgroup_rstat_lock with irq disabled and you can add > > WARN_ON_ONCE(!in_task()). > > This changes the behavior in a more obvious way because: > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage() > path is also exercised in a lot of paths outside irq context, this > will change the behavior for any event thresholds on the root memcg. > With proposed skipped flushing in irq context we only change the > behavior in a small subset of cases. Can you do /* Note: stale usage data when called from irq context!! */ if (in_task()) mem_cgroup_flush_stats() directly in the callsite? Maybe even include the whole callchain in the comment that's currently broken and needs fixing/removing.
On Thu, Mar 23, 2023 at 10:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Mar 23, 2023 at 09:17:33AM -0700, Yosry Ahmed wrote: > > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > 2. Are we really calling rstat flush in irq context? > > > > > > > > > > > > > > > > > > I think it is possible through the charge/uncharge path: > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > > > > > > > > > added the protection against flushing in an interrupt context for > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable > > > > > > > > > interrupts when acquiring cgroup_rstat_lock. > > > > > > > > > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > > > > > > > > > > > > > > > > > I am not sure, but the code looks like event notifications may be set > > > > > > > > > up on root memcg, which is why we need to check thresholds. > > > > > > > > > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined. > > > > > > > > > > > > > > Right, but I think this would be orthogonal to this patch series. > > > > > > > > > > > > > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock > > > > > > without either breaking a link between mem_cgroup_threshold and > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling > > > > > > irqs. > > > > > > > > > > > > So, this patch can not be applied before either of those two tasks are > > > > > > done (and we may find more such scenarios). > > > > > > > > > > > > > > > Could you elaborate why? > > > > > > > > > > My understanding is that with an in_task() check to make sure we only > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to > > > > > acquire cgroup_rstat_lock without disabling interrupts. > > > > > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken > > > > with irq disabled while other code paths will take cgroup_rstat_lock > > > > with irq enabled. This is a potential deadlock hazard unless > > > > cgroup_rstat_lock is always taken with irq disabled. > > > > > > Oh you are making sure it is not taken in the irq context through > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended > > > to actually remove all such users instead of silently > > > ignoring/bypassing the functionality. > > +1 > > It shouldn't silently skip the requested operation, rather it > shouldn't be requested from an incompatible context. > > > > So, how about removing mem_cgroup_flush_stats() from > > > mem_cgroup_usage(). It will break the known chain which is taking > > > cgroup_rstat_lock with irq disabled and you can add > > > WARN_ON_ONCE(!in_task()). > > > > This changes the behavior in a more obvious way because: > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage() > > path is also exercised in a lot of paths outside irq context, this > > will change the behavior for any event thresholds on the root memcg. > > With proposed skipped flushing in irq context we only change the > > behavior in a small subset of cases. > > Can you do > > /* Note: stale usage data when called from irq context!! */ > if (in_task()) > mem_cgroup_flush_stats() > > directly in the callsite? Maybe even include the whole callchain in > the comment that's currently broken and needs fixing/removing. Yeah, we can do that in mem_cgroup_usage(), which is the only context that I am aware of that may flush from irq context. We can also add WARN_ON_ONCE(!in_task()) in the rstat core flushing code to catch any other code paths that we are not aware of -- which may result in a deadlock, but hopefully if there is a violation it will be caught soon enough.
On Thu, Mar 23, 2023 at 11:09:30AM -0700, Yosry Ahmed wrote: > On Thu, Mar 23, 2023 at 10:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Thu, Mar 23, 2023 at 09:17:33AM -0700, Yosry Ahmed wrote: > > > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > > > > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > > > 2. Are we really calling rstat flush in irq context? > > > > > > > > > > > > > > > > > > > > I think it is possible through the charge/uncharge path: > > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I > > > > > > > > > > added the protection against flushing in an interrupt context for > > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable > > > > > > > > > > interrupts when acquiring cgroup_rstat_lock. > > > > > > > > > > > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only > > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root > > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ? > > > > > > > > > > > > > > > > > > > > I am not sure, but the code looks like event notifications may be set > > > > > > > > > > up on root memcg, which is why we need to check thresholds. > > > > > > > > > > > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined. > > > > > > > > > > > > > > > > Right, but I think this would be orthogonal to this patch series. > > > > > > > > > > > > > > > > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock > > > > > > > without either breaking a link between mem_cgroup_threshold and > > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling > > > > > > > irqs. > > > > > > > > > > > > > > So, this patch can not be applied before either of those two tasks are > > > > > > > done (and we may find more such scenarios). > > > > > > > > > > > > > > > > > > Could you elaborate why? > > > > > > > > > > > > My understanding is that with an in_task() check to make sure we only > > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to > > > > > > acquire cgroup_rstat_lock without disabling interrupts. > > > > > > > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken > > > > > with irq disabled while other code paths will take cgroup_rstat_lock > > > > > with irq enabled. This is a potential deadlock hazard unless > > > > > cgroup_rstat_lock is always taken with irq disabled. > > > > > > > > Oh you are making sure it is not taken in the irq context through > > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended > > > > to actually remove all such users instead of silently > > > > ignoring/bypassing the functionality. > > > > +1 > > > > It shouldn't silently skip the requested operation, rather it > > shouldn't be requested from an incompatible context. > > > > > > So, how about removing mem_cgroup_flush_stats() from > > > > mem_cgroup_usage(). It will break the known chain which is taking > > > > cgroup_rstat_lock with irq disabled and you can add > > > > WARN_ON_ONCE(!in_task()). > > > > > > This changes the behavior in a more obvious way because: > > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage() > > > path is also exercised in a lot of paths outside irq context, this > > > will change the behavior for any event thresholds on the root memcg. > > > With proposed skipped flushing in irq context we only change the > > > behavior in a small subset of cases. > > > > Can you do > > > > /* Note: stale usage data when called from irq context!! */ > > if (in_task()) > > mem_cgroup_flush_stats() > > > > directly in the callsite? Maybe even include the whole callchain in > > the comment that's currently broken and needs fixing/removing. > > Yeah, we can do that in mem_cgroup_usage(), which is the only context > that I am aware of that may flush from irq context. We can also add > WARN_ON_ONCE(!in_task()) in the rstat core flushing code to catch any > other code paths that we are not aware of -- which may result in a > deadlock, but hopefully if there is a violation it will be caught soon > enough. That sounds good to me, thanks!
On Thu, Mar 23, 2023 at 9:52 AM Yosry Ahmed <yosryahmed@google.com> wrote: > [...] > > Sure, I can do that in the next version. I will include a patch that > adds an in_task() check to mem_cgroup_usage() before this one. Since > BUG_ON() is discouraged and VM_BUG_ON() is mm specific, I guess we are > left with WARN_ON_ONCE() for the rstat core code, right? > > Thanks Shakeel. Any other thoughts I should address for the next version? This is all for now (at least for this patch).
Hello, On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote: > Currently, when sleeping is not allowed during rstat flushing, we hold > the global rstat lock with interrupts disabled throughout the entire > flush operation. Flushing in an O(# cgroups * # cpus) operation, and > having interrupts disabled throughout is dangerous. > > For some contexts, we may not want to sleep, but can be interrupted > (e.g. while holding a spinlock or RCU read lock). As such, do not > disable interrupts throughout rstat flushing, only when holding the > percpu lock. This breaks down the O(# cgroups * # cpus) duration with > interrupts disabled to a series of O(# cgroups) durations. > > Furthermore, if a cpu spinning waiting for the global rstat lock, it > doesn't need to spin with interrupts disabled anymore. I'm generally not a fan of big spin locks w/o irq protection. They too often become a source of unpredictable latency spikes. As you said, the global rstat lock can be held for quite a while. Removing _irq makes irq latency better on the CPU but on the other hand it makes a lot more likely that the lock is gonna be held even longer, possibly significantly so depending on the configuration and workload which will in turn stall other CPUs waiting for the lock. Sure, irqs are being serviced quicker but if the cost is more and longer !irq context multi-cpu stalls, what's the point? I don't think there's anything which requires the global lock to be held throughout the entire flushing sequence and irq needs to be disabled when grabbing the percpu lock anyway, so why not just release the global lock on CPU boundaries instead? We don't really lose anything significant that way. The durations of irq disabled sections are still about the same as in the currently proposed solution at O(# cgroups) and we avoid the risk of holding the global lock for too long unexpectedly from getting hit repeatedly by irqs while holding the global lock. Thanks.
On Thu, Mar 23, 2023 at 6:39 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote: > > Currently, when sleeping is not allowed during rstat flushing, we hold > > the global rstat lock with interrupts disabled throughout the entire > > flush operation. Flushing in an O(# cgroups * # cpus) operation, and > > having interrupts disabled throughout is dangerous. > > > > For some contexts, we may not want to sleep, but can be interrupted > > (e.g. while holding a spinlock or RCU read lock). As such, do not > > disable interrupts throughout rstat flushing, only when holding the > > percpu lock. This breaks down the O(# cgroups * # cpus) duration with > > interrupts disabled to a series of O(# cgroups) durations. > > > > Furthermore, if a cpu spinning waiting for the global rstat lock, it > > doesn't need to spin with interrupts disabled anymore. > > I'm generally not a fan of big spin locks w/o irq protection. They too often > become a source of unpredictable latency spikes. As you said, the global > rstat lock can be held for quite a while. Removing _irq makes irq latency > better on the CPU but on the other hand it makes a lot more likely that the > lock is gonna be held even longer, possibly significantly so depending on > the configuration and workload which will in turn stall other CPUs waiting > for the lock. Sure, irqs are being serviced quicker but if the cost is more > and longer !irq context multi-cpu stalls, what's the point? > > I don't think there's anything which requires the global lock to be held > throughout the entire flushing sequence and irq needs to be disabled when > grabbing the percpu lock anyway, so why not just release the global lock on > CPU boundaries instead? We don't really lose anything significant that way. > The durations of irq disabled sections are still about the same as in the > currently proposed solution at O(# cgroups) and we avoid the risk of holding > the global lock for too long unexpectedly from getting hit repeatedly by > irqs while holding the global lock. Thanks for taking a look! I think a problem with this approach is that we risk having to contend for the global lock at every CPU boundary in atomic contexts. Right now we contend for the global lock once, and once we have it we go through all CPUs to flush, only having to contend with updates taking the percpu locks at this point. If we unconditionally release & reacquire the global lock at every CPU boundary then we may contend for it much more frequently with concurrent flushers. On the memory controller side, concurrent flushers are already held back to avoid a thundering herd problem on the global rstat lock, but flushers from outside the memory controller can still compete together or with a flusher from the memory controller. In this case, we risk contending the global lock more and concurrent flushers taking a longer period of time, which may end up causing multi-CPU stalls anyway, right? Also, if we keep _irq when spinning for the lock, then concurrent flushers still need to spin with irq disabled -- another problem that this series tries to fix. This is particularly a problem for flushers in atomic contexts. There is a flusher in mem_cgroup_wb_stats() that flushes while holding another spinlock, and a flusher in mem_cgroup_usage() that flushes with irqs disabled. If flushing takes a longer period of time due to repeated lock contention, it affects such atomic context negatively. I am not sure how all of this matters in practice, it depends heavily on the workloads and the configuration like you mentioned. I am just pointing out the potential disadvantages of reacquiring the lock at every CPU boundary in atomic contexts. > > Thanks. > > -- > tejun
On 3/24/23 03:22, Yosry Ahmed wrote: > On Thu, Mar 23, 2023 at 6:39 PM Tejun Heo <tj@kernel.org> wrote: >> Hello, >> >> On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote: >>> Currently, when sleeping is not allowed during rstat flushing, we hold >>> the global rstat lock with interrupts disabled throughout the entire >>> flush operation. Flushing in an O(# cgroups * # cpus) operation, and >>> having interrupts disabled throughout is dangerous. >>> >>> For some contexts, we may not want to sleep, but can be interrupted >>> (e.g. while holding a spinlock or RCU read lock). As such, do not >>> disable interrupts throughout rstat flushing, only when holding the >>> percpu lock. This breaks down the O(# cgroups * # cpus) duration with >>> interrupts disabled to a series of O(# cgroups) durations. >>> >>> Furthermore, if a cpu spinning waiting for the global rstat lock, it >>> doesn't need to spin with interrupts disabled anymore. >> I'm generally not a fan of big spin locks w/o irq protection. They too often >> become a source of unpredictable latency spikes. As you said, the global >> rstat lock can be held for quite a while. Removing _irq makes irq latency >> better on the CPU but on the other hand it makes a lot more likely that the >> lock is gonna be held even longer, possibly significantly so depending on >> the configuration and workload which will in turn stall other CPUs waiting >> for the lock. Sure, irqs are being serviced quicker but if the cost is more >> and longer !irq context multi-cpu stalls, what's the point? >> >> I don't think there's anything which requires the global lock to be held >> throughout the entire flushing sequence and irq needs to be disabled when >> grabbing the percpu lock anyway, so why not just release the global lock on >> CPU boundaries instead? We don't really lose anything significant that way. >> The durations of irq disabled sections are still about the same as in the >> currently proposed solution at O(# cgroups) and we avoid the risk of holding >> the global lock for too long unexpectedly from getting hit repeatedly by >> irqs while holding the global lock. > Thanks for taking a look! > > I think a problem with this approach is that we risk having to contend > for the global lock at every CPU boundary in atomic contexts. Right Isn't it the plan to just do a trylock in atomic contexts so that it won't get stuck spinning for the lock for an indeterminate amount of time? > now we contend for the global lock once, and once we have it we go > through all CPUs to flush, only having to contend with updates taking > the percpu locks at this point. If we unconditionally release & > reacquire the global lock at every CPU boundary then we may contend > for it much more frequently with concurrent flushers. Note that with the use of qspinlock in all the major arches, the impact of thundering herds of lockers are much less serious than before. There are certainly some overhead in doing multiple lock acquires and releases, but that shouldn't been too excessive. I am all in for reducing lock hold time as much as possible as it will improve the response time. Cheers, Longman
On Fri, Mar 24, 2023 at 7:12 AM Waiman Long <longman@redhat.com> wrote: > > On 3/24/23 03:22, Yosry Ahmed wrote: > > On Thu, Mar 23, 2023 at 6:39 PM Tejun Heo <tj@kernel.org> wrote: > >> Hello, > >> > >> On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote: > >>> Currently, when sleeping is not allowed during rstat flushing, we hold > >>> the global rstat lock with interrupts disabled throughout the entire > >>> flush operation. Flushing in an O(# cgroups * # cpus) operation, and > >>> having interrupts disabled throughout is dangerous. > >>> > >>> For some contexts, we may not want to sleep, but can be interrupted > >>> (e.g. while holding a spinlock or RCU read lock). As such, do not > >>> disable interrupts throughout rstat flushing, only when holding the > >>> percpu lock. This breaks down the O(# cgroups * # cpus) duration with > >>> interrupts disabled to a series of O(# cgroups) durations. > >>> > >>> Furthermore, if a cpu spinning waiting for the global rstat lock, it > >>> doesn't need to spin with interrupts disabled anymore. > >> I'm generally not a fan of big spin locks w/o irq protection. They too often > >> become a source of unpredictable latency spikes. As you said, the global > >> rstat lock can be held for quite a while. Removing _irq makes irq latency > >> better on the CPU but on the other hand it makes a lot more likely that the > >> lock is gonna be held even longer, possibly significantly so depending on > >> the configuration and workload which will in turn stall other CPUs waiting > >> for the lock. Sure, irqs are being serviced quicker but if the cost is more > >> and longer !irq context multi-cpu stalls, what's the point? > >> > >> I don't think there's anything which requires the global lock to be held > >> throughout the entire flushing sequence and irq needs to be disabled when > >> grabbing the percpu lock anyway, so why not just release the global lock on > >> CPU boundaries instead? We don't really lose anything significant that way. > >> The durations of irq disabled sections are still about the same as in the > >> currently proposed solution at O(# cgroups) and we avoid the risk of holding > >> the global lock for too long unexpectedly from getting hit repeatedly by > >> irqs while holding the global lock. > > Thanks for taking a look! > > > > I think a problem with this approach is that we risk having to contend > > for the global lock at every CPU boundary in atomic contexts. Right > Isn't it the plan to just do a trylock in atomic contexts so that it > won't get stuck spinning for the lock for an indeterminate amount of time? Not exactly. On the memory controller side, we currently only allow one flusher at a time and force all flushers to flush the full hierarchy, such that concurrent flushers can skip. This is done for both atomic and non-atomic contexts. For flushers outside the memory controller, they can still contend the lock among themselves or with flushers in the memory controller. In this case, instead of contending the lock once, they contend it at each CPU boundary. > > now we contend for the global lock once, and once we have it we go > > through all CPUs to flush, only having to contend with updates taking > > the percpu locks at this point. If we unconditionally release & > > reacquire the global lock at every CPU boundary then we may contend > > for it much more frequently with concurrent flushers. > > Note that with the use of qspinlock in all the major arches, the impact > of thundering herds of lockers are much less serious than before. There > are certainly some overhead in doing multiple lock acquires and > releases, but that shouldn't been too excessive. I ran some tests to measure this. Since I am using a cgroup v1 hierarchy, I cannot reproduce contention between memory controller flushers and non-memory controller flushers, so I removed the "one memory flusher only" restriction to have concurrent memory flushers compete for the global rstat lock to measure the impact: Before (only one flusher allowed to compete for the global rstat lock): ---cgroup_rstat_flush | --1.27%--cgroup_rstat_flush_locked | --0.94%--mem_cgroup_css_rstat_flush After (concurrent flushers allowed to compete for the global rstat lock): ---cgroup_rstat_flush | |--4.94%--_raw_spin_lock | | | --4.94%--queued_spin_lock_slowpath | --0.92%--cgroup_rstat_flush_locked | --0.56%--mem_cgroup_css_rstat_flush This was run with 20 processes trying to flush concurrently, so it may be excessive, but it seems like in this case lock contention makes a significant difference. Again, this is not a regression for non-atomic flushers, as they already compete for the lock at every CPU boundary, but for atomic flushers that don't give up the lock at all today, it would be a regression to start competing for the lock at every CPU boundary. This patch series aims to minimize the number of atomic flushers (brings them down to two, one of which is not common), so this may be fine. My main concern is that for some flushers that this series converts from atomic to non-atomic, we may notice a regression later and revert it (e.g. refault path), which is why I have them in separate patches. If we regress the atomic flushing path, it would be a larger surgery to restore the performance for these paths -- which is why I would rather keep the atomic path without excessive lock contention. Thoughts? > > I am all in for reducing lock hold time as much as possible as it will > improve the response time. > > Cheers, > Longman >
Hello, On Fri, Mar 24, 2023 at 12:22:09AM -0700, Yosry Ahmed wrote: > I think a problem with this approach is that we risk having to contend > for the global lock at every CPU boundary in atomic contexts. Right > now we contend for the global lock once, and once we have it we go > through all CPUs to flush, only having to contend with updates taking > the percpu locks at this point. If we unconditionally release & > reacquire the global lock at every CPU boundary then we may contend > for it much more frequently with concurrent flushers. > > On the memory controller side, concurrent flushers are already held > back to avoid a thundering herd problem on the global rstat lock, but > flushers from outside the memory controller can still compete together > or with a flusher from the memory controller. In this case, we risk > contending the global lock more and concurrent flushers taking a > longer period of time, which may end up causing multi-CPU stalls > anyway, right? Also, if we keep _irq when spinning for the lock, then > concurrent flushers still need to spin with irq disabled -- another > problem that this series tries to fix. > > This is particularly a problem for flushers in atomic contexts. There > is a flusher in mem_cgroup_wb_stats() that flushes while holding > another spinlock, and a flusher in mem_cgroup_usage() that flushes > with irqs disabled. If flushing takes a longer period of time due to > repeated lock contention, it affects such atomic context negatively. > > I am not sure how all of this matters in practice, it depends heavily > on the workloads and the configuration like you mentioned. I am just > pointing out the potential disadvantages of reacquiring the lock at > every CPU boundary in atomic contexts. So, I'm not too convinced by the arguments for a couple reasons: * It's not very difficult to create conditions where a contented non-irq protected spinlock is held unnecessarily long due to heavy IRQ irq load on the holding CPU. This can easily extend the amount of time the lock is held by multiple times if not orders of magnitude. That is likely a significantly worse problem than the contention on the lock cacheline which will lead to a lot more gradual degradation. * If concurrent flushing is an actual problem, we need and can implement a better solution. There's quite a bit of maneuvering room here given that the flushing operations are mostly idempotent in close time proximity and there's no real atomicity requirement across different segments of flushing operations. Thanks.
On Fri, Mar 24, 2023 at 6:54 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Fri, Mar 24, 2023 at 12:22:09AM -0700, Yosry Ahmed wrote: > > I think a problem with this approach is that we risk having to contend > > for the global lock at every CPU boundary in atomic contexts. Right > > now we contend for the global lock once, and once we have it we go > > through all CPUs to flush, only having to contend with updates taking > > the percpu locks at this point. If we unconditionally release & > > reacquire the global lock at every CPU boundary then we may contend > > for it much more frequently with concurrent flushers. > > > > On the memory controller side, concurrent flushers are already held > > back to avoid a thundering herd problem on the global rstat lock, but > > flushers from outside the memory controller can still compete together > > or with a flusher from the memory controller. In this case, we risk > > contending the global lock more and concurrent flushers taking a > > longer period of time, which may end up causing multi-CPU stalls > > anyway, right? Also, if we keep _irq when spinning for the lock, then > > concurrent flushers still need to spin with irq disabled -- another > > problem that this series tries to fix. > > > > This is particularly a problem for flushers in atomic contexts. There > > is a flusher in mem_cgroup_wb_stats() that flushes while holding > > another spinlock, and a flusher in mem_cgroup_usage() that flushes > > with irqs disabled. If flushing takes a longer period of time due to > > repeated lock contention, it affects such atomic context negatively. > > > > I am not sure how all of this matters in practice, it depends heavily > > on the workloads and the configuration like you mentioned. I am just > > pointing out the potential disadvantages of reacquiring the lock at > > every CPU boundary in atomic contexts. > > So, I'm not too convinced by the arguments for a couple reasons: > > * It's not very difficult to create conditions where a contented non-irq > protected spinlock is held unnecessarily long due to heavy IRQ irq load on > the holding CPU. This can easily extend the amount of time the lock is > held by multiple times if not orders of magnitude. That is likely a > significantly worse problem than the contention on the lock cacheline > which will lead to a lot more gradual degradation. I agree that can be a problem, it depends on the specific workload and configuration. The continuous lock contention at each CPU boundary causes a regression (see my reply to Waiman), but I am not sure if it's worse than the scenario you are describing. > > * If concurrent flushing is an actual problem, we need and can implement a > better solution. There's quite a bit of maneuvering room here given that > the flushing operations are mostly idempotent in close time proximity and > there's no real atomicity requirement across different segments of > flushing operations. Concurrent flushing can be a problem for some workloads, especially in the MM code we flush in the reclaim and refault paths. This is currently mitigated by only allowing one flusher at a time from the memcg side, but contention can still happen with flushing when a cgroup is being freed or other flushers in other subsystems. I tried allowing concurrent flushing by completely removing the global rstat lock, and only depending on the percpu locks for synchronization. For this to be correct the global stat counters need to be atomic, this introduced a slow down for flushing in general. I also noticed heavier lock contention on the percpu locks, since all flushers try to acquire all locks in the same order. I even tried implementing a simple retry scheme where we try to acquire the percpu lock, and if we fail we queue the current cpu and move to the next one. This ended up causing a little bit of slowness as well. Together with the slowness introduced by atomic operations it seemed like a significant regression in the simple flushing path. Don't get me wrong, I am all for modifying the current approach, I just want to make sure we are making the correct decision for *most* workloads. Keep in mind that this series aims to minimize the number of flushers from atomic contexts as well, and for non-atomic flushers we allow giving up the lock at CPU boundaries anyway. The current approach only keeps the lock held throughout for atomic flushers. Any ideas here are welcome! > > Thanks. > > -- > tejun
On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <yosryahmed@google.com> wrote: > [...] > Any ideas here are welcome! > Let's move forward. It seems like we are not going to reach an agreement on making cgroup_rstat_lock a non-irq lock. However there is agreement on the memcg code of not flushing in irq context and the cleanup Johannes has requested. Let's proceed with those for now. We can come back to cgroup_rstat_lock later if we still see issues in production. Tejun, do you have any concerns on adding WARN_ON_ONCE(!in_task()) in the rstat flushing code?
On Fri, Mar 24, 2023 at 9:31 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > [...] > > Any ideas here are welcome! > > > > Let's move forward. It seems like we are not going to reach an > agreement on making cgroup_rstat_lock a non-irq lock. However there is > agreement on the memcg code of not flushing in irq context and the > cleanup Johannes has requested. Let's proceed with those for now. We > can come back to cgroup_rstat_lock later if we still see issues in > production. Even if we do not flush from irq context, we still flush from atomic contexts that will currently hold the lock with irqs disabled throughout the entire flush sequence. A primary purpose of this reason is to avoid that. We can either: (a) Proceed with the following approach of making cgroup_rstat_lock a non-irq lock. (b) Proceed with Tejun's suggestion of always releasing and reacquiring the lock at CPU boundaries, even for atomic flushes (if the spinlock needs a break ofc). (c) Something else. I am happy to proceed with any solution, but we need to address the fact that interrupts are always disabled throughout the flush. My main concern about Tejun's suggestion is atomic contexts having to contend cgroup_rstat_lock much more than they do now, but it's still better than what we have today. > > Tejun, do you have any concerns on adding WARN_ON_ONCE(!in_task()) in > the rstat flushing code?
On Fri, Mar 24, 2023 at 9:37 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Fri, Mar 24, 2023 at 9:31 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > [...] > > > Any ideas here are welcome! > > > > > > > Let's move forward. It seems like we are not going to reach an > > agreement on making cgroup_rstat_lock a non-irq lock. However there is > > agreement on the memcg code of not flushing in irq context and the > > cleanup Johannes has requested. Let's proceed with those for now. We > > can come back to cgroup_rstat_lock later if we still see issues in > > production. > > Even if we do not flush from irq context, we still flush from atomic > contexts that will currently hold the lock with irqs disabled > throughout the entire flush sequence. A primary purpose of this reason > is to avoid that. > > We can either: > (a) Proceed with the following approach of making cgroup_rstat_lock a > non-irq lock. > (b) Proceed with Tejun's suggestion of always releasing and > reacquiring the lock at CPU boundaries, even for atomic flushes (if > the spinlock needs a break ofc). > (c) Something else. (d) keep the status quo regarding cgroup_rstat_lock (e) decouple the discussion of cgroup_rstat_lock from the agreed improvements. Send the patches for the agreed ones and continue discussing cgroup_rstat_lock.
On Fri, Mar 24, 2023 at 9:46 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Fri, Mar 24, 2023 at 9:37 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Fri, Mar 24, 2023 at 9:31 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > [...] > > > > Any ideas here are welcome! > > > > > > > > > > Let's move forward. It seems like we are not going to reach an > > > agreement on making cgroup_rstat_lock a non-irq lock. However there is > > > agreement on the memcg code of not flushing in irq context and the > > > cleanup Johannes has requested. Let's proceed with those for now. We > > > can come back to cgroup_rstat_lock later if we still see issues in > > > production. > > > > Even if we do not flush from irq context, we still flush from atomic > > contexts that will currently hold the lock with irqs disabled > > throughout the entire flush sequence. A primary purpose of this reason > > is to avoid that. > > > > We can either: > > (a) Proceed with the following approach of making cgroup_rstat_lock a > > non-irq lock. > > (b) Proceed with Tejun's suggestion of always releasing and > > reacquiring the lock at CPU boundaries, even for atomic flushes (if > > the spinlock needs a break ofc). > > (c) Something else. > > (d) keep the status quo regarding cgroup_rstat_lock > (e) decouple the discussion of cgroup_rstat_lock from the agreed > improvements. Send the patches for the agreed ones and continue > discussing cgroup_rstat_lock. Ah, I lost sight of the fact that the rest of the patch series does not strictly depend on this patch. I will respin the rest of the patch series separately. Thanks, Shakeel. Meanwhile, it would be useful to reach an agreement here to stop acquiring the cgroup_rstat_lock for a long time with irq disabled in atomic contexts. Tejun, if having the lock be non-irq is a non-starter for you, I can send a patch that instead gives up the lock and reacquires it at every CPU boundary unconditionally -- or perhaps every N CPU boundaries to avoid excessively releasing and reacquiring the lock. Something like: static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep) { ... for_each_possible_cpu(cpu) { ... /* Always yield the at CPU boundaries to enable irqs */ spin_unlock_irq(&cgroup_rstat_lock); /* if @may_sleep, play nice and yield if necessary */ if (may_sleep) cond_resched(); spin_lock_irq(&cgroup_rstat_lock); } } If you have other ideas to avoid disabling irq's for the entire flush sequence I am also open to that. Thanks!
Hello, Yosry. On Mon, Mar 27, 2023 at 04:23:13PM -0700, Yosry Ahmed wrote: > Tejun, if having the lock be non-irq is a non-starter for you, I can This is an actual hazard. We see in prod these unprotected locks causing very big spikes in tail latencies and they can be tricky to root cause too and given the way rstat lock is used it's highly likely to be involved in those scenarios with the proposed change, so it's gonna be a nack from my end. > send a patch that instead gives up the lock and reacquires it at every > CPU boundary unconditionally -- or perhaps every N CPU boundaries to > avoid excessively releasing and reacquiring the lock. I'd just do the simple thing and see whether there's any perf penalty before making it complicated. I'd be pretty surprised if unlocking and relocking the same spinlock adds any noticeable overhead here. Thanks.
Hi Tejun, On Wed, 29 Mar 2023, Tejun Heo wrote: > On Mon, Mar 27, 2023 at 04:23:13PM -0700, Yosry Ahmed wrote: > > Tejun, if having the lock be non-irq is a non-starter for you, I can > > This is an actual hazard. We see in prod these unprotected locks causing > very big spikes in tail latencies and they can be tricky to root cause too > and given the way rstat lock is used it's highly likely to be involved in > those scenarios with the proposed change, so it's gonna be a nack from my > end. Butting in here, I'm fascinated. This is certainly not my area, I know nothing about rstat, but this is the first time I ever heard someone arguing for more disabling of interrupts rather than less. An interrupt coming in while holding a contended resource can certainly add to latencies, that I accept of course. But until now, I thought it was agreed best practice to disable irqs only regretfully, when strictly necessary. If that has changed, I for one want to know about it. How should we now judge which spinlocks should disable interrupts and which should not? Page table locks are currently my main interest - should those be changed? Thanks, Hugh
Hello, Hugh. How have you been? On Wed, Mar 29, 2023 at 12:22:24PM -0700, Hugh Dickins wrote: > Hi Tejun, > Butting in here, I'm fascinated. This is certainly not my area, I know > nothing about rstat, but this is the first time I ever heard someone > arguing for more disabling of interrupts rather than less. > > An interrupt coming in while holding a contended resource can certainly > add to latencies, that I accept of course. But until now, I thought it > was agreed best practice to disable irqs only regretfully, when strictly > necessary. > > If that has changed, I for one want to know about it. How should we > now judge which spinlocks should disable interrupts and which should not? > Page table locks are currently my main interest - should those be changed? For rstat, it's a simple case because the global lock here wraps around per-cpu locks which have to be irq-safe, so the only difference we get between making the global irq-unsafe and keeping it so but releasing inbetween is: Global lock held: G IRQ disabled: I Percpu lock held: P 1. IRQ unsafe GGGGGGGGGGGGGGG~~GGGGG IIII IIII IIII ~~ IIII PPPP PPPP PPPP ~~ PPPP 2. IRQ safe released inbetween cpus GGGG GGGG GGGG ~~ GGGG IIII IIII IIII ~~ IIII PPPP PPPP PPPP ~~ PPPP #2 seems like the obvious thing to do here given how the lock is used and each P section may take a bit of time. So, in the rstat case, the choice is, at least to me, obvious, but even for more generic cases where the bulk of actual work isn't done w/ irq disabled, I don't think the picture is as simple as "use the least protected variant possible" anymore because the underlying hardware changed. For an SMP kernel running on an UP system, "the least protected variant" is the obvious choice to make because you don't lose anything by holding a spinlock longer than necessary. However, as you increase the number of CPUs, there rises a tradeoff between local irq servicing latency and global lock contention. Imagine a, say, 128 cpu system with a few cores servicing relatively high frequency interrupts. Let's say there's a mildly hot lock. Usually, it shows up in the system profile but only just. Let's say something happens and the irq rate on those cores went up for some reason to the point where it becomes a rather common occurrence when the lock is held on one of those cpus, irqs are likely to intervene lengthening how long the lock is held, sometimes, signficantly. Now because the lock is on average held for much longer, it become a lot hotter as more CPUs would stall on it and depending on luck or lack thereof these stalls can span many CPUs on the system for quite a while. This is actually something we saw in production. So, in general, there's a trade off between local irq service latency and inducing global lock contention when using unprotected locks. With more and more CPUs, the balance keeps shifting. The balance still very much depends on the specifics of a given lock but yeah I think it's something we need to be a lot more careful about now. Thanks.
On Wed, 29 Mar 2023, Tejun Heo wrote: > Hello, Hugh. How have you been? > > On Wed, Mar 29, 2023 at 12:22:24PM -0700, Hugh Dickins wrote: > > Hi Tejun, > > Butting in here, I'm fascinated. This is certainly not my area, I know > > nothing about rstat, but this is the first time I ever heard someone > > arguing for more disabling of interrupts rather than less. > > > > An interrupt coming in while holding a contended resource can certainly > > add to latencies, that I accept of course. But until now, I thought it > > was agreed best practice to disable irqs only regretfully, when strictly > > necessary. > > > > If that has changed, I for one want to know about it. How should we > > now judge which spinlocks should disable interrupts and which should not? > > Page table locks are currently my main interest - should those be changed? > > For rstat, it's a simple case because the global lock here wraps around > per-cpu locks which have to be irq-safe, so the only difference we get > between making the global irq-unsafe and keeping it so but releasing > inbetween is: > > Global lock held: G > IRQ disabled: I > Percpu lock held: P > > 1. IRQ unsafe > > GGGGGGGGGGGGGGG~~GGGGG > IIII IIII IIII ~~ IIII > PPPP PPPP PPPP ~~ PPPP > > 2. IRQ safe released inbetween cpus > > GGGG GGGG GGGG ~~ GGGG > IIII IIII IIII ~~ IIII > PPPP PPPP PPPP ~~ PPPP > > #2 seems like the obvious thing to do here given how the lock is used and > each P section may take a bit of time. Many thanks for the detailed response. I'll leave it to the rstat folks, to agree or disagree with your analysis there. > > So, in the rstat case, the choice is, at least to me, obvious, but even for > more generic cases where the bulk of actual work isn't done w/ irq disabled, > I don't think the picture is as simple as "use the least protected variant > possible" anymore because the underlying hardware changed. > > For an SMP kernel running on an UP system, "the least protected variant" is > the obvious choice to make because you don't lose anything by holding a > spinlock longer than necessary. However, as you increase the number of CPUs, > there rises a tradeoff between local irq servicing latency and global lock > contention. > > Imagine a, say, 128 cpu system with a few cores servicing relatively high > frequency interrupts. Let's say there's a mildly hot lock. Usually, it shows > up in the system profile but only just. Let's say something happens and the > irq rate on those cores went up for some reason to the point where it > becomes a rather common occurrence when the lock is held on one of those > cpus, irqs are likely to intervene lengthening how long the lock is held, > sometimes, signficantly. Now because the lock is on average held for much > longer, it become a lot hotter as more CPUs would stall on it and depending > on luck or lack thereof these stalls can span many CPUs on the system for > quite a while. This is actually something we saw in production. > > So, in general, there's a trade off between local irq service latency and > inducing global lock contention when using unprotected locks. With more and > more CPUs, the balance keeps shifting. The balance still very much depends > on the specifics of a given lock but yeah I think it's something we need to > be a lot more careful about now. And this looks a very plausible argument to me: I'll let it sink in. But I hadn't heard that the RT folks were clamouring for more irq disabling: perhaps they partition their machines with more care, and are not devotees of high CPU counts. What I hope is that others will chime in one way or the other - it does sound as if a reappraisal of the balances is overdue. Thanks, Hugh (disabling interrupts for as long as he can)
Thanks for a great discussion, Tejun and Hugh. On Wed, Mar 29, 2023 at 1:38 PM Hugh Dickins <hughd@google.com> wrote: > > On Wed, 29 Mar 2023, Tejun Heo wrote: > > > Hello, Hugh. How have you been? > > > > On Wed, Mar 29, 2023 at 12:22:24PM -0700, Hugh Dickins wrote: > > > Hi Tejun, > > > Butting in here, I'm fascinated. This is certainly not my area, I know > > > nothing about rstat, but this is the first time I ever heard someone > > > arguing for more disabling of interrupts rather than less. > > > > > > An interrupt coming in while holding a contended resource can certainly > > > add to latencies, that I accept of course. But until now, I thought it > > > was agreed best practice to disable irqs only regretfully, when strictly > > > necessary. > > > > > > If that has changed, I for one want to know about it. How should we > > > now judge which spinlocks should disable interrupts and which should not? > > > Page table locks are currently my main interest - should those be changed? > > > > For rstat, it's a simple case because the global lock here wraps around > > per-cpu locks which have to be irq-safe, so the only difference we get > > between making the global irq-unsafe and keeping it so but releasing > > inbetween is: > > > > Global lock held: G > > IRQ disabled: I > > Percpu lock held: P > > > > 1. IRQ unsafe > > > > GGGGGGGGGGGGGGG~~GGGGG > > IIII IIII IIII ~~ IIII > > PPPP PPPP PPPP ~~ PPPP > > > > 2. IRQ safe released inbetween cpus > > > > GGGG GGGG GGGG ~~ GGGG > > IIII IIII IIII ~~ IIII > > PPPP PPPP PPPP ~~ PPPP > > > > #2 seems like the obvious thing to do here given how the lock is used and > > each P section may take a bit of time. > > Many thanks for the detailed response. I'll leave it to the rstat folks, > to agree or disagree with your analysis there. Thanks for the analysis, Tejun, it does indeed make sense. I perf'd releasing and reacquiring the lock at each CPU boundary and the overhead seems to be minimal. It would be higher with contention, but all memcg flushers should be held back by the memcg code, and flushers outside memcg are not frequent (reading blkcg and cpu base stats from user space, and when a cgroup is being removed). I realized that after v2 of this patch series [1], we would only end up with two atomic flushing contexts, mem_cgroup_wb_stats() and mem_cgroup_usage(). The latter is already disabling irqs for other reasons, so anything we do within the rstat core code doesn't really help, it needs to be addressed separately. So only the call site in mem_cgroup_wb_stats() would benefit from not having irqs disabled throughout the flush. I will hold off on sending a patch until I observe that this call site is causing us pain and/or other atomic call sites emerge (or we have to revert one of the ones we made non-atomic), so that we don't hurt other flushers unnecessarily. Does this make sense to you? [1] https://lore.kernel.org/linux-mm/20230328221644.803272-1-yosryahmed@google.com/ > > > > > So, in the rstat case, the choice is, at least to me, obvious, but even for > > more generic cases where the bulk of actual work isn't done w/ irq disabled, > > I don't think the picture is as simple as "use the least protected variant > > possible" anymore because the underlying hardware changed. > > > > For an SMP kernel running on an UP system, "the least protected variant" is > > the obvious choice to make because you don't lose anything by holding a > > spinlock longer than necessary. However, as you increase the number of CPUs, > > there rises a tradeoff between local irq servicing latency and global lock > > contention. > > > > Imagine a, say, 128 cpu system with a few cores servicing relatively high > > frequency interrupts. Let's say there's a mildly hot lock. Usually, it shows > > up in the system profile but only just. Let's say something happens and the > > irq rate on those cores went up for some reason to the point where it > > becomes a rather common occurrence when the lock is held on one of those > > cpus, irqs are likely to intervene lengthening how long the lock is held, > > sometimes, signficantly. Now because the lock is on average held for much > > longer, it become a lot hotter as more CPUs would stall on it and depending > > on luck or lack thereof these stalls can span many CPUs on the system for > > quite a while. This is actually something we saw in production. > > > > So, in general, there's a trade off between local irq service latency and > > inducing global lock contention when using unprotected locks. With more and > > more CPUs, the balance keeps shifting. The balance still very much depends > > on the specifics of a given lock but yeah I think it's something we need to > > be a lot more careful about now. > > And this looks a very plausible argument to me: I'll let it sink in. > > But I hadn't heard that the RT folks were clamouring for more irq disabling: > perhaps they partition their machines with more care, and are not devotees > of high CPU counts. > > What I hope is that others will chime in one way or the other - > it does sound as if a reappraisal of the balances is overdue. > > Thanks, > Hugh (disabling interrupts for as long as he can)
Hello, Hugh. On Wed, Mar 29, 2023 at 01:38:48PM -0700, Hugh Dickins wrote: > > So, in general, there's a trade off between local irq service latency and > > inducing global lock contention when using unprotected locks. With more and > > more CPUs, the balance keeps shifting. The balance still very much depends > > on the specifics of a given lock but yeah I think it's something we need to > > be a lot more careful about now. > > And this looks a very plausible argument to me: I'll let it sink in. Another somewhat relevant change is that flipping irq on/off used to be relatively expensive on older x86 cpus. I forget all details about when and how much but they should be a lot cheaper now. No idea about !x86 cpus tho. > But I hadn't heard that the RT folks were clamouring for more irq disabling: > perhaps they partition their machines with more care, and are not devotees > of high CPU counts. I think RT folks care a lot more about raw IRQ disables. These shouldn't actually disable IRQs on RT kernels. Thanks.
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index 831f1f472bb8..af11e28bd055 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -7,6 +7,7 @@ #include <linux/btf.h> #include <linux/btf_ids.h> +/* This lock should only be held from task context */ static DEFINE_SPINLOCK(cgroup_rstat_lock); static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock); @@ -210,14 +211,24 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep) /* if @may_sleep, play nice and yield if necessary */ if (may_sleep && (need_resched() || spin_needbreak(&cgroup_rstat_lock))) { - spin_unlock_irq(&cgroup_rstat_lock); + spin_unlock(&cgroup_rstat_lock); if (!cond_resched()) cpu_relax(); - spin_lock_irq(&cgroup_rstat_lock); + spin_lock(&cgroup_rstat_lock); } } } +static bool should_skip_flush(void) +{ + /* + * We acquire cgroup_rstat_lock without disabling interrupts, so we + * should not try to acquire from interrupt contexts to avoid deadlocks. + * It can be expensive to flush stats in interrupt context anyway. + */ + return !in_task(); +} + /** * cgroup_rstat_flush - flush stats in @cgrp's subtree * @cgrp: target cgroup @@ -229,15 +240,18 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep) * This also gets all cgroups in the subtree including @cgrp off the * ->updated_children lists. * - * This function may block. + * This function is safe to call from contexts that disable interrupts, but + * @may_sleep must be set to false, otherwise the function may block. */ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) { - might_sleep(); + if (should_skip_flush()) + return; - spin_lock_irq(&cgroup_rstat_lock); + might_sleep(); + spin_lock(&cgroup_rstat_lock); cgroup_rstat_flush_locked(cgrp, true); - spin_unlock_irq(&cgroup_rstat_lock); + spin_unlock(&cgroup_rstat_lock); } /** @@ -248,11 +262,12 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp) */ void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp) { - unsigned long flags; + if (should_skip_flush()) + return; - spin_lock_irqsave(&cgroup_rstat_lock, flags); + spin_lock(&cgroup_rstat_lock); cgroup_rstat_flush_locked(cgrp, false); - spin_unlock_irqrestore(&cgroup_rstat_lock, flags); + spin_unlock(&cgroup_rstat_lock); } /** @@ -267,8 +282,11 @@ void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp) void cgroup_rstat_flush_hold(struct cgroup *cgrp) __acquires(&cgroup_rstat_lock) { + if (should_skip_flush()) + return; + might_sleep(); - spin_lock_irq(&cgroup_rstat_lock); + spin_lock(&cgroup_rstat_lock); cgroup_rstat_flush_locked(cgrp, true); } @@ -278,7 +296,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp) void cgroup_rstat_flush_release(void) __releases(&cgroup_rstat_lock) { - spin_unlock_irq(&cgroup_rstat_lock); + spin_unlock(&cgroup_rstat_lock); } int cgroup_rstat_init(struct cgroup *cgrp)
Currently, when sleeping is not allowed during rstat flushing, we hold the global rstat lock with interrupts disabled throughout the entire flush operation. Flushing in an O(# cgroups * # cpus) operation, and having interrupts disabled throughout is dangerous. For some contexts, we may not want to sleep, but can be interrupted (e.g. while holding a spinlock or RCU read lock). As such, do not disable interrupts throughout rstat flushing, only when holding the percpu lock. This breaks down the O(# cgroups * # cpus) duration with interrupts disabled to a series of O(# cgroups) durations. Furthermore, if a cpu spinning waiting for the global rstat lock, it doesn't need to spin with interrupts disabled anymore. If the caller of rstat flushing needs interrupts to be disabled, it's up to them to decide that, and it should be fine to hold the global rstat lock with interrupts disabled. There is currently a single context that may invoke rstat flushing with interrupts disabled, the mem_cgroup_flush_stats() call in mem_cgroup_usage(), if called from mem_cgroup_threshold(). To make it safe to hold the global rstat lock with interrupts enabled, make sure we only flush from in_task() contexts. The side effect of that we read stale stats in interrupt context, but this should be okay, as flushing in interrupt context is dangerous anyway as it is an expensive operation, so reading stale stats is safer. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> --- kernel/cgroup/rstat.c | 40 +++++++++++++++++++++++++++++----------- 1 file changed, 29 insertions(+), 11 deletions(-)