Message ID | 20230821205458.1764662-4-yosryahmed@google.com
---|---
State | New
Series | memcg: non-unified flushing for userspace stats
On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > Unified flushing allows for great concurrency for paths that attempt to > flush the stats, at the expense of potential staleness and a single > flusher paying the extra cost of flushing the full tree. > > This tradeoff makes sense for in-kernel flushers that may observe high > concurrency (e.g. reclaim, refault). For userspace readers, stale stats > may be unexpected and problematic, especially when such stats are used > for critical paths such as userspace OOM handling. Additionally, a > userspace reader will occasionally pay the cost of flushing the entire > hierarchy, which also causes problems in some cases [1]. > > Opt userspace reads out of unified flushing. This makes the cost of > reading the stats more predictable (proportional to the size of the > subtree), as well as the freshness of the stats. Since userspace readers > are not expected to have similar concurrency to in-kernel flushers, > serializing them among themselves and among in-kernel flushers should be > okay. > > This was tested on a machine with 256 cpus by running a synthetic test > script that creates 50 top-level cgroups, each with 5 children (250 > leaf cgroups). Each leaf cgroup has 10 processes running that allocate > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel > unified flusher). Concurrently, one thread is spawned per-cgroup to read > the stats every second (including root, top-level, and leaf cgroups -- > so total 251 threads). No regressions were observed in the total running > time, which means that non-unified userspace readers are not slowing > down in-kernel unified flushers: I have to admit I am rather confused by cgroup_rstat_flush (and cgroup_rstat_flush_locked). The former says it can block but the latter doesn't ever block, and even if it drops the cgroup_rstat_lock it merely cond_resched()s or busy-loops. How much contention and yielding can you see with this patch? What is the worst case? How bad can a random user make the situation by going crazy and trying to flush from many different contexts?
On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > > Unified flushing allows for great concurrency for paths that attempt to > > flush the stats, at the expense of potential staleness and a single > > flusher paying the extra cost of flushing the full tree. > > > > This tradeoff makes sense for in-kernel flushers that may observe high > > concurrency (e.g. reclaim, refault). For userspace readers, stale stats > > may be unexpected and problematic, especially when such stats are used > > for critical paths such as userspace OOM handling. Additionally, a > > userspace reader will occasionally pay the cost of flushing the entire > > hierarchy, which also causes problems in some cases [1]. > > > > Opt userspace reads out of unified flushing. This makes the cost of > > reading the stats more predictable (proportional to the size of the > > subtree), as well as the freshness of the stats. Since userspace readers > > are not expected to have similar concurrency to in-kernel flushers, > > serializing them among themselves and among in-kernel flushers should be > > okay. > > > > This was tested on a machine with 256 cpus by running a synthetic test > > The script that creates 50 top-level cgroups, each with 5 children (250 > > leaf cgroups). Each leaf cgroup has 10 processes running that allocate > > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel > > unified flusher). Concurrently, one thread is spawned per-cgroup to read > > the stats every second (including root, top-level, and leaf cgroups -- > > so total 251 threads). No regressions were observed in the total running > > time; which means that non-unified userspace readers are not slowing > > down in-kernel unified flushers: > > I have to admit I am rather confused by cgroup_rstat_flush (and > cgroup_rstat_flush_locked). The former says it can block but the later > doesn't ever block and even if it drops the cgroup_rstat_lock it merely > cond_rescheds or busy loops. How much of a contention and yielding can > you see with this patch? What is the worst case? How bad a random user > can make the situation by going crazy and trying to flush from many > different contexts? Userspace readers (or more generically non-unified flushers) can all collectively only block a single unified flusher at most. Specifically, one userspace reader goes to flush and holds cgroup_rstat_lock, meanwhile an in-kernel flusher (e.g. reclaim) goes and tries to flush, and spins on cgroup_rstat_lock. Other in-kernel (unified) flushers will just see another unified flusher in progress and skip. So userspace can only block a single in-kernel reclaimer. Not that it's not really that bad because: (a) As you note, cgroup_rstat_flush() does not really "block", it's cpu-bound. Even when it cond_resched()'s, it yields the lock first. So it can't really hold anyone hostage for long. (b) I assume a random user can only read their own stats, which should be a relatively small subtree, quick to flush. I am assuming a random user cannot read root's memory.stat (which is most expensive). (c) Excessive flushing doesn't really build up because there will be nothing to flush and the lock will be released very shortly after it's held. So to answer your question, I don't think a random user can really affect the system in a significant way by constantly flushing. 
In fact, in the test script (which I am now attaching, in case you're interested), there are hundreds of threads that are reading stats of different cgroups every 1s, and I don't see any negative effects on in-kernel flushers in this case (reclaimers). > -- > Michal Hocko > SUSE Labs
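For context, the shape of the change under discussion is roughly the sketch below. The helper and variable names (e.g. unified_flush_ongoing, flush_stats_for_read()) are illustrative placeholders, not the exact identifiers in the patch:

```c
static atomic_t unified_flush_ongoing;

/* In-kernel flushers (reclaim, refault): unified, best-effort. */
static void flush_stats_unified(void)
{
	/* Only one unified flusher at a time; others skip and tolerate staleness. */
	if (atomic_xchg(&unified_flush_ongoing, 1))
		return;

	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);	/* full tree */
	atomic_set(&unified_flush_ongoing, 0);
}

/* Userspace readers of memory.stat: flush only the subtree being read. */
static void flush_stats_for_read(struct mem_cgroup *memcg)
{
	cgroup_rstat_flush(memcg->css.cgroup);	/* contends directly on cgroup_rstat_lock */
}
```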
On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: [...] > So to answer your question, I don't think a random user can really > affect the system in a significant way by constantly flushing. In > fact, in the test script (which I am now attaching, in case you're > interested), there are hundreds of threads that are reading stats of > different cgroups every 1s, and I don't see any negative effects on > in-kernel flushers in this case (reclaimers). I suspect you have missed my point. Maybe I am just misunderstanding the code but it seems to me that the lock dropping inside cgroup_rstat_flush_locked effectively allows an unbounded number of contenders, which is really dangerous when it is triggerable from userspace. The number of spinners at any moment is always bounded by the number of CPUs, but depending on timing many potential spinners might be cond_resched()'ed away, and the worst-case latency to complete can be really high. Makes more sense?
On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > [...] > > So to answer your question, I don't think a random user can really > > affect the system in a significant way by constantly flushing. In > > fact, in the test script (which I am now attaching, in case you're > > interested), there are hundreds of threads that are reading stats of > > different cgroups every 1s, and I don't see any negative effects on > > in-kernel flushers in this case (reclaimers). > > I suspect you have missed my point. I suspect you are right :) > Maybe I am just misunderstanding > the code but it seems to me that the lock dropping inside > cgroup_rstat_flush_locked effectivelly allows unbounded number of > contenders which is really dangerous when it is triggerable from the > userspace. The number of spinners at a moment is always bound by the > number CPUs but depending on timing many potential spinners might be > cond_rescheded and the worst time latency to complete can be really > high. Makes more sense? I think I understand better now. So basically because we might drop the lock and resched, there can be nr_cpus spinners + other spinners that are currently scheduled away, so these will need to wait to be scheduled and then start spinning on the lock. This may happen for one reader multiple times during its read, which is what can cause a high worst case latency. I hope I understood you correctly this time. Did I? So the logic to give up the lock and sleep was introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") in 4.18. It has been possible for userspace to trigger this scenario by reading cpu.stat, which has been using rstat since then. On the memcg side, it was also possible to trigger this behavior between commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat") and commit fd25a9e0e23b ("memcg: unify memcg stat flushing") (i.e between 5.13 and 5.16). I am not sure there has been any problems from this, but perhaps Tejun can answer this better than I can. The way I think about it is that random userspace reads will mostly be reading their subtrees, which is generally not very large (and can be limited), so every individual read should be cheap enough. Also, consequent flushes on overlapping subtrees will have very little to do as there won't be many pending updates, they should also be very cheap. So unless multiple jobs on the same machine are collectively trying to act maliciously (purposefully or not) and concurrently spawn multiple readers of different parts of the hierarchy (and maintain enough activity to generate stat updates to flush), I don't think it's a concern. I also imagine (but haven't checked) that there is some locking at some level that will throttle a malicious job that spawns multiple readers to the same memory.stat file. I hope this answers your question. > -- > Michal Hocko > SUSE Labs
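The yielding being discussed lives in kernel/cgroup/rstat.c; paraphrased (with the per-cpu flushing details elided, not a verbatim copy), it looks roughly like this:

```c
static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
{
	int cpu;

	lockdep_assert_held(&cgroup_rstat_lock);

	for_each_possible_cpu(cpu) {
		/* ... pop updated cgroups off this CPU's update tree and flush them ... */

		/* play nice and yield the global lock if someone needs it or the CPU */
		if (may_sleep && (need_resched() ||
				  spin_needbreak(&cgroup_rstat_lock))) {
			spin_unlock_irq(&cgroup_rstat_lock);
			if (!cond_resched())
				cpu_relax();
			spin_lock_irq(&cgroup_rstat_lock);
		}
	}
}
```

Every time the lock is dropped here, another flusher (in-kernel or a userspace reader) can grab it, which is the window that lets contenders pile up.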
On Wed 23-08-23 07:55:40, Yosry Ahmed wrote: > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > > [...] > > > So to answer your question, I don't think a random user can really > > > affect the system in a significant way by constantly flushing. In > > > fact, in the test script (which I am now attaching, in case you're > > > interested), there are hundreds of threads that are reading stats of > > > different cgroups every 1s, and I don't see any negative effects on > > > in-kernel flushers in this case (reclaimers). > > > > I suspect you have missed my point. > > I suspect you are right :) > > > > Maybe I am just misunderstanding > > the code but it seems to me that the lock dropping inside > > cgroup_rstat_flush_locked effectivelly allows unbounded number of > > contenders which is really dangerous when it is triggerable from the > > userspace. The number of spinners at a moment is always bound by the > > number CPUs but depending on timing many potential spinners might be > > cond_rescheded and the worst time latency to complete can be really > > high. Makes more sense? > > I think I understand better now. So basically because we might drop > the lock and resched, there can be nr_cpus spinners + other spinners > that are currently scheduled away, so these will need to wait to be > scheduled and then start spinning on the lock. This may happen for one > reader multiple times during its read, which is what can cause a high > worst case latency. > > I hope I understood you correctly this time. Did I? Yes. I would just add that this could also influence the worst case latency for a different reader - so an adversary user can stall others. Exposing a shared global lock in uncontrolable way over generally available user interface is not really a great idea IMHO.
On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote: > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > > > [...] > > > > So to answer your question, I don't think a random user can really > > > > affect the system in a significant way by constantly flushing. In > > > > fact, in the test script (which I am now attaching, in case you're > > > > interested), there are hundreds of threads that are reading stats of > > > > different cgroups every 1s, and I don't see any negative effects on > > > > in-kernel flushers in this case (reclaimers). > > > > > > I suspect you have missed my point. > > > > I suspect you are right :) > > > > > > > Maybe I am just misunderstanding > > > the code but it seems to me that the lock dropping inside > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of > > > contenders which is really dangerous when it is triggerable from the > > > userspace. The number of spinners at a moment is always bound by the > > > number CPUs but depending on timing many potential spinners might be > > > cond_rescheded and the worst time latency to complete can be really > > > high. Makes more sense? > > > > I think I understand better now. So basically because we might drop > > the lock and resched, there can be nr_cpus spinners + other spinners > > that are currently scheduled away, so these will need to wait to be > > scheduled and then start spinning on the lock. This may happen for one > > reader multiple times during its read, which is what can cause a high > > worst case latency. > > > > I hope I understood you correctly this time. Did I? > > Yes. I would just add that this could also influence the worst case > latency for a different reader - so an adversary user can stall others. I can add that for v2 to the commit log, thanks. > Exposing a shared global lock in uncontrolable way over generally > available user interface is not really a great idea IMHO. I think that's how it was always meant to be when it was designed. The global rstat lock has always existed and was always available to userspace readers. The memory controller took a different path at some point with unified flushing, but that was mainly because of high concurrency from in-kernel flushers, not because userspace readers caused a problem. Outside of memcg, the core cgroup code has always exercised this global lock when reading cpu.stat since rstat's introduction. I assume there hasn't been any problems since it's still there. I was hoping Tejun would confirm/deny this.
On Thu, Aug 24, 2023 at 11:15 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote: > > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > > > > [...] > > > > > So to answer your question, I don't think a random user can really > > > > > affect the system in a significant way by constantly flushing. In > > > > > fact, in the test script (which I am now attaching, in case you're > > > > > interested), there are hundreds of threads that are reading stats of > > > > > different cgroups every 1s, and I don't see any negative effects on > > > > > in-kernel flushers in this case (reclaimers). > > > > > > > > I suspect you have missed my point. > > > > > > I suspect you are right :) > > > > > > > > > > Maybe I am just misunderstanding > > > > the code but it seems to me that the lock dropping inside > > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of > > > > contenders which is really dangerous when it is triggerable from the > > > > userspace. The number of spinners at a moment is always bound by the > > > > number CPUs but depending on timing many potential spinners might be > > > > cond_rescheded and the worst time latency to complete can be really > > > > high. Makes more sense? > > > > > > I think I understand better now. So basically because we might drop > > > the lock and resched, there can be nr_cpus spinners + other spinners > > > that are currently scheduled away, so these will need to wait to be > > > scheduled and then start spinning on the lock. This may happen for one > > > reader multiple times during its read, which is what can cause a high > > > worst case latency. > > > > > > I hope I understood you correctly this time. Did I? > > > > Yes. I would just add that this could also influence the worst case > > latency for a different reader - so an adversary user can stall others. > > I can add that for v2 to the commit log, thanks. > > > Exposing a shared global lock in uncontrolable way over generally > > available user interface is not really a great idea IMHO. > > I think that's how it was always meant to be when it was designed. The > global rstat lock has always existed and was always available to > userspace readers. The memory controller took a different path at some > point with unified flushing, but that was mainly because of high > concurrency from in-kernel flushers, not because userspace readers > caused a problem. Outside of memcg, the core cgroup code has always > exercised this global lock when reading cpu.stat since rstat's > introduction. I assume there hasn't been any problems since it's still > there. I was hoping Tejun would confirm/deny this. One thing we can do to remedy this situation is to replace the global rstat lock with a mutex, and drop the resched/lock dropping condition. Tejun suggested this in the previous thread. This effectively reverts 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") since now all the flushing contexts are sleepable. My synthetic stress test does not show any regressions with mutexes, and there is a small boost to reading latency (probably because we stop dropping the lock / rescheduling). 
Not sure if we may start seeing need_resched warnings on big flushes though. One other concern that Shakeel pointed out to me is preemption. If someone holding the mutex gets preempted this may starve other waiters. We can disable preemption while we hold the mutex, not sure if that's a common pattern though.
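A minimal sketch of that alternative, assuming all flushing contexts are sleepable (the function name is hypothetical):

```c
static DEFINE_MUTEX(cgroup_rstat_mutex);

/* Hypothetical sleepable variant: waiters sleep on the mutex instead of spinning. */
void cgroup_rstat_flush_sleepable(struct cgroup *cgrp)
{
	mutex_lock(&cgroup_rstat_mutex);
	/* same per-cpu flush loop as today, minus the unlock/cond_resched() yielding */
	cgroup_rstat_flush_locked(cgrp, true);
	mutex_unlock(&cgroup_rstat_mutex);
}
```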
On Thu 24-08-23 11:50:51, Yosry Ahmed wrote: > On Thu, Aug 24, 2023 at 11:15 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote: > > > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > > > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > > > > > [...] > > > > > > So to answer your question, I don't think a random user can really > > > > > > affect the system in a significant way by constantly flushing. In > > > > > > fact, in the test script (which I am now attaching, in case you're > > > > > > interested), there are hundreds of threads that are reading stats of > > > > > > different cgroups every 1s, and I don't see any negative effects on > > > > > > in-kernel flushers in this case (reclaimers). > > > > > > > > > > I suspect you have missed my point. > > > > > > > > I suspect you are right :) > > > > > > > > > > > > > Maybe I am just misunderstanding > > > > > the code but it seems to me that the lock dropping inside > > > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of > > > > > contenders which is really dangerous when it is triggerable from the > > > > > userspace. The number of spinners at a moment is always bound by the > > > > > number CPUs but depending on timing many potential spinners might be > > > > > cond_rescheded and the worst time latency to complete can be really > > > > > high. Makes more sense? > > > > > > > > I think I understand better now. So basically because we might drop > > > > the lock and resched, there can be nr_cpus spinners + other spinners > > > > that are currently scheduled away, so these will need to wait to be > > > > scheduled and then start spinning on the lock. This may happen for one > > > > reader multiple times during its read, which is what can cause a high > > > > worst case latency. > > > > > > > > I hope I understood you correctly this time. Did I? > > > > > > Yes. I would just add that this could also influence the worst case > > > latency for a different reader - so an adversary user can stall others. > > > > I can add that for v2 to the commit log, thanks. > > > > > Exposing a shared global lock in uncontrolable way over generally > > > available user interface is not really a great idea IMHO. > > > > I think that's how it was always meant to be when it was designed. The > > global rstat lock has always existed and was always available to > > userspace readers. The memory controller took a different path at some > > point with unified flushing, but that was mainly because of high > > concurrency from in-kernel flushers, not because userspace readers > > caused a problem. Outside of memcg, the core cgroup code has always > > exercised this global lock when reading cpu.stat since rstat's > > introduction. I assume there hasn't been any problems since it's still > > there. I suspect nobody has just considered a malfunctioning or adversary workloads so far. > > I was hoping Tejun would confirm/deny this. Yes, that would be interesting to hear. > One thing we can do to remedy this situation is to replace the global > rstat lock with a mutex, and drop the resched/lock dropping condition. > Tejun suggested this in the previous thread. 
> This effectively reverts > 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") > since now all the flushing contexts are sleepable. I would have a very daring question. Do we really need a global lock in the first place? AFAIU this lock serializes (kinda, as the lock can be dropped midway) flushers and the cgroup_rstat_flush_hold/release caller (a single one ATM). I can see that cgroup_base_stat_cputime_show would like to have a consistent view of multiple stats, but can we live without that strong guarantee, or replace the lock with a seqlock instead? > My synthetic stress test does not show any regressions with mutexes, > and there is a small boost to reading latency (probably because we > stop dropping the lock / rescheduling). Not sure if we may start > seeing need_resched warnings on big flushes though. Reading 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") it seems the point of moving away from the mutex was to have a more usable API. > One other concern that Shakeel pointed out to me is preemption. If > someone holding the mutex gets preempted this may starve other > waiters. We can disable preemption while we hold the mutex, not sure > if that's a common pattern though. No, not really. It is expected that the holder of a mutex can sleep and can be preempted as well. I might be wrong but the whole discussion so far suggests that the global rstat lock should be reconsidered. From my personal experience, global locks easily triggerable from userspace are just a recipe for problems. Stats reading should interfere with the system runtime as little as possible, and it should be deterministic wrt runtime as well.
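For reference, the seqlock idea would look something like the read side below: a rough sketch only, assuming a writer takes write_seqlock() around the flush, and not something that has been prototyped here:

```c
static DEFINE_SEQLOCK(cgroup_bstat_seq);	/* hypothetical */

static void cgroup_base_stat_read(struct cgroup *cgrp,
				  struct cgroup_base_stat *out)
{
	unsigned int seq;

	do {
		seq = read_seqbegin(&cgroup_bstat_seq);
		*out = cgrp->bstat;	/* plain copy, retried if a flusher raced with us */
	} while (read_seqretry(&cgroup_bstat_seq, seq));
}
```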
On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 24-08-23 11:50:51, Yosry Ahmed wrote: > > On Thu, Aug 24, 2023 at 11:15 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Thu, Aug 24, 2023 at 12:13 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Wed 23-08-23 07:55:40, Yosry Ahmed wrote: > > > > > On Wed, Aug 23, 2023 at 12:33 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Tue 22-08-23 08:30:05, Yosry Ahmed wrote: > > > > > > > On Tue, Aug 22, 2023 at 2:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > > > On Mon 21-08-23 20:54:58, Yosry Ahmed wrote: > > > > > > [...] > > > > > > > So to answer your question, I don't think a random user can really > > > > > > > affect the system in a significant way by constantly flushing. In > > > > > > > fact, in the test script (which I am now attaching, in case you're > > > > > > > interested), there are hundreds of threads that are reading stats of > > > > > > > different cgroups every 1s, and I don't see any negative effects on > > > > > > > in-kernel flushers in this case (reclaimers). > > > > > > > > > > > > I suspect you have missed my point. > > > > > > > > > > I suspect you are right :) > > > > > > > > > > > > > > > > Maybe I am just misunderstanding > > > > > > the code but it seems to me that the lock dropping inside > > > > > > cgroup_rstat_flush_locked effectivelly allows unbounded number of > > > > > > contenders which is really dangerous when it is triggerable from the > > > > > > userspace. The number of spinners at a moment is always bound by the > > > > > > number CPUs but depending on timing many potential spinners might be > > > > > > cond_rescheded and the worst time latency to complete can be really > > > > > > high. Makes more sense? > > > > > > > > > > I think I understand better now. So basically because we might drop > > > > > the lock and resched, there can be nr_cpus spinners + other spinners > > > > > that are currently scheduled away, so these will need to wait to be > > > > > scheduled and then start spinning on the lock. This may happen for one > > > > > reader multiple times during its read, which is what can cause a high > > > > > worst case latency. > > > > > > > > > > I hope I understood you correctly this time. Did I? > > > > > > > > Yes. I would just add that this could also influence the worst case > > > > latency for a different reader - so an adversary user can stall others. > > > > > > I can add that for v2 to the commit log, thanks. > > > > > > > Exposing a shared global lock in uncontrolable way over generally > > > > available user interface is not really a great idea IMHO. > > > > > > I think that's how it was always meant to be when it was designed. The > > > global rstat lock has always existed and was always available to > > > userspace readers. The memory controller took a different path at some > > > point with unified flushing, but that was mainly because of high > > > concurrency from in-kernel flushers, not because userspace readers > > > caused a problem. Outside of memcg, the core cgroup code has always > > > exercised this global lock when reading cpu.stat since rstat's > > > introduction. I assume there hasn't been any problems since it's still > > > there. > > I suspect nobody has just considered a malfunctioning or adversary > workloads so far. Perhaps that also means it's not a problem in practice, or at least I hope so :) > > > > I was hoping Tejun would confirm/deny this. 
> > Yes, that would be interesting to hear. > > > One thing we can do to remedy this situation is to replace the global > > rstat lock with a mutex, and drop the resched/lock dropping condition. > > Tejun suggested this in the previous thread. This effectively reverts > > 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") > > since now all the flushing contexts are sleepable. > > I would have a very daring question. Do we really need a global lock in > the first place? AFAIU this locks serializes (kinda as the lock can be > dropped midway) flushers and cgroup_rstat_flush_hold/release caller (a > single one ATM). I can see cgroup_base_stat_cputime_show would like to > have a consistent view on multiple stats but can we live without a > strong guarantee or to replace the lock with seqlock instead? Unfortunately, it's more difficult than this. I thought about breaking down that lock and falled back to this solution. See below. > > > My synthetic stress test does not show any regressions with mutexes, > > and there is a small boost to reading latency (probably because we > > stop dropping the lock / rescheduling). Not sure if we may start > > seeing need_resched warnings on big flushes though. > > Reading 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") > it seems the point of moving away from mutex was to have a more usable > API. Right, we needed an atomic interface for flushing, but that later turned out to cause some problems, so we reworked the code such that we never have to flush atomically. Now we can go back to the mutex if it makes things better, I am not really sure how much it helps though. > > > One other concern that Shakeel pointed out to me is preemption. If > > someone holding the mutex gets preempted this may starve other > > waiters. We can disable preemption while we hold the mutex, not sure > > if that's a common pattern though. > > No, not really. It is expected that holder of mutex can sleep and can be > preempted as well. Maybe not for this specific case because it's a global mutex and holding it for too long might cause problems? Is it bad to disable preemption while holding a mutex? > > I might be wrong but the whole discussion so far suggests that the > global rstat lock should be reconsidered. From my personal experience > global locks easily triggerable from the userspace are just a receip for > problems. Stats reading shouldn't be interfering with the system runtime > as much as possible and they should be deterministic wrt runtime as > well. The problem is that the global lock also serializes the global counters that we flush to. I will talk from the memcg flushing perspective as that's what I am familiar with. I am not sure how much this is transferable to other flushers. On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock synchronizes access to multiple counters, for this discussion what's most important are: - The global stat counters of the memcg being flushed (e.g. memcg->vmstats->state). - The pending stat counters of the parent being flushed (e.g. parent->vmstats->state_pending). To get rid of this lock, we either need to use atomics for those counters, or have fine-grained locks. I experimented a while back with atomic and flushing was significantly more expensive. The number of atomic operations would scale with O(# cgroups * # cpus) and can grow unbounded. The other option is fine-grained locks. In this case we would need to lock both the memcg being flushed and its parent together. 
This can go wrong with ordering, and for top-level memcgs the root memcg lock will become the new global lock anyway. One way to overcome this is to change the parent's pending stat counters to also be percpu. This will increase the memory usage of the stat counters per-memcg, by hundreds of bytes per cpu. Let's assume that's okay, so we only need to lock one cgroup at a time. There are more problems. We also have a percpu lock (cgroup_rstat_cpu_lock), which we use to lock the percpu tree which has the cgroups that have updates on this cpu. It is held by both flushing contexts and updating contexts (hot paths). Ideally we don't want to spin on a per-cgroup (non percpu) lock while holding the percpu lock, as flushers of different cpus will start blocking one another, as well as blocking updaters. On the other hand, we need to hold percpu lock to pop a cgroup from that tree and lock it. It's a chicken and egg problem. Also, if we release the percpu lock while flushing, we open another can of worms: (a) Concurrent updates can keep updating the tree putting us in an endless flushing loop. We need some sort of generation tracking for this. (b) Concurrent flushing can flush a parent prematurely on the same cpu as we are flushing a child, and not get the updates from the child. One possible scheme to handle the above is as follows: 1. Hold the percpu lock, find the cgroup that needs to be flushed next. 2. Trylock that cgroup. If we succeed, we flush it with both the percpu and the per-cgroup locks held. 3. If we fail, release the percpu lock and spin on the per-cgroup lock. 4. When we get the per-cgroup lock, take the percpu lock again, and make sure that the locked cgroup is still the correct cgroup to flush. If not, repeat. 5. Flush the cgroup, and go back to step 1 to get the next cgroup. Of course this is complex and error-prone, and might introduce significant overheads due to the number of locks we need to take compared with what we currently have. I guess what I am trying to say is, breaking down that lock is a major surgery that might require re-designing or re-implementing some parts of rstat. I would be extremely happy to be proven wrong. If we can break down that lock then there is no need for unified flushing even for in-kernel contexts, and we can all live happily ever after with cheap(ish) and accurate stats flushing. I really hope we can move forward with the problems at hand (sometimes reads are expensive, sometimes reads are stale), and not block fixing them until we can come up with an alternative to that global lock (unless, of course, there is a simpler way of doing that). Sorry for the very long reply :) Thanks!
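A sketch of steps 1-5 above, under the stated assumptions (a hypothetical per-cgroup flush_lock, parent pending counters made percpu, and helpers like next_updated_cgroup() invented purely for illustration; irq handling elided):

```c
static void flush_one_cpu(struct cgroup *root, int cpu)
{
	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
	struct cgroup *pos;

	raw_spin_lock(cpu_lock);
	while ((pos = next_updated_cgroup(root, cpu))) {	/* step 1 */
		if (spin_trylock(&pos->flush_lock)) {		/* step 2 */
			flush_cgroup_on_cpu(pos, cpu);
			spin_unlock(&pos->flush_lock);
			continue;
		}
		raw_spin_unlock(cpu_lock);			/* step 3: drop percpu lock */
		spin_lock(&pos->flush_lock);			/*         and wait for the cgroup */
		raw_spin_lock(cpu_lock);			/* step 4: retake and revalidate */
		if (still_pending(pos, cpu))
			flush_cgroup_on_cpu(pos, cpu);		/* step 5 */
		spin_unlock(&pos->flush_lock);
	}
	raw_spin_unlock(cpu_lock);
}
```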
On Fri 25-08-23 08:14:54, Yosry Ahmed wrote: > On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote: [...] > > I might be wrong but the whole discussion so far suggests that the > > global rstat lock should be reconsidered. From my personal experience, > > global locks easily triggerable from userspace are just a recipe for > > problems. Stats reading should interfere with the system runtime > > as little as possible, and it should be deterministic wrt runtime as > > well. > > The problem is that the global lock also serializes the global > counters that we flush to. I will talk from the memcg flushing > perspective as that's what I am familiar with. I am not sure how much > this is transferable to other flushers. > > On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock > synchronizes access to multiple counters, for this discussion what's > most important are: > - The global stat counters of the memcg being flushed (e.g. > memcg->vmstats->state). > - The pending stat counters of the parent being flushed (e.g. > parent->vmstats->state_pending). I haven't digested the rest of the email yet (Friday brain, sorry) but I do not think you are addressing this particular part, so let me ask before I dive more into the following. I really do not follow the serialization requirement here because the lock doesn't really serialize the flushing, does it? At least not in the sense of a single caller doing the flushing atomically with respect to other flushers. It is possible that the current flusher simply drops the lock midway and another one retakes the lock and performs the operation again. So what additional flushing synchronization does it provide, and why can't parallel flushers simply compete over the pcp spinlocks? So what am I missing?
On Fri, Aug 25, 2023 at 11:17 AM Michal Hocko <mhocko@suse.com> wrote: > > On Fri 25-08-23 08:14:54, Yosry Ahmed wrote: > > On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > I might be wrong but the whole discussion so far suggests that the > > > global rstat lock should be reconsidered. From my personal experience > > > global locks easily triggerable from the userspace are just a receip for > > > problems. Stats reading shouldn't be interfering with the system runtime > > > as much as possible and they should be deterministic wrt runtime as > > > well. > > > > The problem is that the global lock also serializes the global > > counters that we flush to. I will talk from the memcg flushing > > perspective as that's what I am familiar with. I am not sure how much > > this is transferable to other flushers. > > > > On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock > > synchronizes access to multiple counters, for this discussion what's > > most important are: > > - The global stat counters of the memcg being flushed (e.g. > > memcg->vmstats->state). > > - The pending stat counters of the parent being flushed (e.g. > > parent->vmstats->state_pending). > > I haven't digested the rest of the email yet (Friday brain, sorry) but I > do not think you are adressing this particular part so let me ask before > I dive more into the following. I really do not follow the serialization > requirement here because the lock doesn't really serialize the flushing, > does it? At least not in a sense of a single caller to do the flushing > atomicaly from other flushers. It is possible that the current flusher > simply drops the lock midway and another one retakes the lock and > performs the operation again. So what additional flushing > synchronization does it provide and why cannot parallel flushers simply > compete over pcp spinlocks? > > So what am I missing? Those counters are non-atomic. The lock makes sure we don't have two concurrent flushers updating the same counter locklessly and non-atomically, which would be possible if we flush the same cgroup on two different cpus in parallel. > -- > Michal Hocko > SUSE Labs
On Fri 25-08-23 11:21:16, Yosry Ahmed wrote: > On Fri, Aug 25, 2023 at 11:17 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Fri 25-08-23 08:14:54, Yosry Ahmed wrote: > > > On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote: > > [...] > > > > I might be wrong but the whole discussion so far suggests that the > > > > global rstat lock should be reconsidered. From my personal experience > > > > global locks easily triggerable from the userspace are just a receip for > > > > problems. Stats reading shouldn't be interfering with the system runtime > > > > as much as possible and they should be deterministic wrt runtime as > > > > well. > > > > > > The problem is that the global lock also serializes the global > > > counters that we flush to. I will talk from the memcg flushing > > > perspective as that's what I am familiar with. I am not sure how much > > > this is transferable to other flushers. > > > > > > On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock > > > synchronizes access to multiple counters, for this discussion what's > > > most important are: > > > - The global stat counters of the memcg being flushed (e.g. > > > memcg->vmstats->state). > > > - The pending stat counters of the parent being flushed (e.g. > > > parent->vmstats->state_pending). > > > > I haven't digested the rest of the email yet (Friday brain, sorry) but I > > do not think you are adressing this particular part so let me ask before > > I dive more into the following. I really do not follow the serialization > > requirement here because the lock doesn't really serialize the flushing, > > does it? At least not in a sense of a single caller to do the flushing > > atomicaly from other flushers. It is possible that the current flusher > > simply drops the lock midway and another one retakes the lock and > > performs the operation again. So what additional flushing > > synchronization does it provide and why cannot parallel flushers simply > > compete over pcp spinlocks? > > > > So what am I missing? > > Those counters are non-atomic. The lock makes sure we don't have two > concurrent flushers updating the same counter locklessly and > non-atomically, which would be possible if we flush the same cgroup on > two different cpus in parallel. pcp lock (cpu_lock) guarantees the very same, doesn't it?
On Fri 25-08-23 20:43:02, Michal Hocko wrote: > On Fri 25-08-23 11:21:16, Yosry Ahmed wrote: > > On Fri, Aug 25, 2023 at 11:17 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Fri 25-08-23 08:14:54, Yosry Ahmed wrote: > > > > On Fri, Aug 25, 2023 at 12:05 AM Michal Hocko <mhocko@suse.com> wrote: > > > [...] > > > > > I might be wrong but the whole discussion so far suggests that the > > > > > global rstat lock should be reconsidered. From my personal experience > > > > > global locks easily triggerable from the userspace are just a receip for > > > > > problems. Stats reading shouldn't be interfering with the system runtime > > > > > as much as possible and they should be deterministic wrt runtime as > > > > > well. > > > > > > > > The problem is that the global lock also serializes the global > > > > counters that we flush to. I will talk from the memcg flushing > > > > perspective as that's what I am familiar with. I am not sure how much > > > > this is transferable to other flushers. > > > > > > > > On the memcg side (see mem_cgroup_css_rstat_flush()), the global lock > > > > synchronizes access to multiple counters, for this discussion what's > > > > most important are: > > > > - The global stat counters of the memcg being flushed (e.g. > > > > memcg->vmstats->state). > > > > - The pending stat counters of the parent being flushed (e.g. > > > > parent->vmstats->state_pending). > > > > > > I haven't digested the rest of the email yet (Friday brain, sorry) but I > > > do not think you are adressing this particular part so let me ask before > > > I dive more into the following. I really do not follow the serialization > > > requirement here because the lock doesn't really serialize the flushing, > > > does it? At least not in a sense of a single caller to do the flushing > > > atomicaly from other flushers. It is possible that the current flusher > > > simply drops the lock midway and another one retakes the lock and > > > performs the operation again. So what additional flushing > > > synchronization does it provide and why cannot parallel flushers simply > > > compete over pcp spinlocks? > > > > > > So what am I missing? > > > > Those counters are non-atomic. The lock makes sure we don't have two > > concurrent flushers updating the same counter locklessly and > > non-atomically, which would be possible if we flush the same cgroup on > > two different cpus in parallel. > > pcp lock (cpu_lock) guarantees the very same, doesn't it? Nope, it doesn't. I really need to have a deeper look.
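Concretely, the updates the global lock is protecting look roughly like this; a simplified model of mem_cgroup_css_rstat_flush(), with the index and field layout abbreviated, not the actual function body:

```c
static void memcg_flush_one(struct mem_cgroup *memcg, int idx, long delta)
{
	struct mem_cgroup *parent = parent_mem_cgroup(memcg);

	/*
	 * Plain, non-atomic read-modify-write on shared (non-percpu) counters.
	 * Two flushers working on the same memcg from different CPUs each hold
	 * a *different* percpu cgroup_rstat_cpu_lock, so those locks alone do
	 * not prevent them from racing here -- the global cgroup_rstat_lock does.
	 */
	memcg->vmstats->state[idx] += delta;
	if (parent)
		parent->vmstats->state_pending[idx] += delta;
}
```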
Done my homework and studied the rstat code more (sorry, I should have done that earlier). On Fri 25-08-23 08:14:54, Yosry Ahmed wrote: [...] > I guess what I am trying to say is, breaking down that lock is a major > surgery that might require re-designing or re-implementing some parts > of rstat. I would be extremely happy to be proven wrong. If we can > break down that lock then there is no need for unified flushing even > for in-kernel contexts, and we can all live happily ever after with > cheap(ish) and accurate stats flushing. Yes, this seems like a big change that also overcomplicates the whole thing. I am not sure it is worth it. > I really hope we can move forward with the problems at hand (sometimes > reads are expensive, sometimes reads are stale), and not block fixing > them until we can come up with an alternative to that global lock > (unless, of course, there is a simpler way of doing that). Well, I really have to say that I do not like the notion that reading stats is unpredictable. This just makes it really hard to use. If precision is to be sacrificed, then this should be preferable to potentially high global lock contention. We already have that model in place for /proc/vmstat (a configurable timeout for the flusher and a way to flush explicitly). I appreciate that you would like to have better precision, but as you have explored, the locking is really hard to get rid of here. So from my POV I would prefer to avoid flushing from the stats reading path and implement force flushing by writing to the stat file. If the 2s flushing interval is considered too coarse, I would be OK with allowing it to be set from userspace. This way this would be more in line with /proc/vmstat, which seems to be working quite well. If this is not acceptable or is deemed the wrong approach long term, then it would be good to at least reconsider the current cgroup_rstat_lock, either by turning it into a mutex or by dropping the yielding code, which can severely affect the worst-case latency AFAIU. > Sorry for the very long reply :) Thanks for bearing with me and taking time to formulate all this!
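The model being suggested would look roughly like the sketch below: a periodic flusher with a (possibly tunable) interval, plus an explicit flush triggered by a write, akin to /proc/sys/vm/stat_refresh. The interface and function names here are made up for illustration; no memory.stat_refresh file exists:

```c
static unsigned long stats_flush_interval_ms = 2000;	/* could become a tunable */

static void periodic_stats_flush(struct work_struct *w);
static DECLARE_DELAYED_WORK(stats_flush_dwork, periodic_stats_flush);

static void periodic_stats_flush(struct work_struct *w)
{
	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
	queue_delayed_work(system_unbound_wq, &stats_flush_dwork,
			   msecs_to_jiffies(stats_flush_interval_ms));
}

/* Hypothetical "write to force a flush" handler for the cgroup being read. */
static ssize_t memory_stat_refresh_write(struct kernfs_open_file *of, char *buf,
					 size_t nbytes, loff_t off)
{
	cgroup_rstat_flush(mem_cgroup_from_css(of_css(of))->css.cgroup);
	return nbytes;
}
```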
On Mon, Aug 28, 2023 at 8:47 AM Michal Hocko <mhocko@suse.com> wrote: > > Done my homework and studied the rstat code more (sorry should have done > that earlier). > > On Fri 25-08-23 08:14:54, Yosry Ahmed wrote: > [...] > > I guess what I am trying to say is, breaking down that lock is a major > > surgery that might require re-designing or re-implementing some parts > > of rstat. I would be extremely happy to be proven wrong. If we can > > break down that lock then there is no need for unified flushing even > > for in-kernel contexts, and we can all live happily ever after with > > cheap(ish) and accurate stats flushing. > > Yes, this seems like a big change and also over complicating the whole > thing. I am not sure this is worth it. > > > I really hope we can move forward with the problems at hand (sometimes > > reads are expensive, sometimes reads are stale), and not block fixing > > them until we can come up with an alternative to that global lock > > (unless, of course, there is a simpler way of doing that). > > Well, I really have to say that I do not like the notion that reading > stats is unpredictable. This just makes it really hard to use. If > the precision is to be sarificed then this should be preferable over > potentially high global lock contention. We already have that model in > place of /proc/vmstat (configurable timeout for flusher and a way to > flush explicitly). I appreciate you would like to have a better > precision but as you have explored the locking is really hard to get rid > of here. Reading the stats *is* unpredictable today. In terms of accuracy/staleness and cost. Avoiding the flush entirely on the read path will surely make the cost very stable and cheap, but will make accuracy even less predictable. > > So from my POV I would prefer to avoid flushing from the stats reading > path and implement force flushing by writing to stat file. If the 2s > flushing interval is considered to coarse I would be OK to allow setting > it from userspace. This way this would be more in line with /proc/vmstat > which seems to be working quite well. > > If this is not accaptable or deemed a wrong approach long term then it > would be good to reonsider the current cgroup_rstat_lock at least. > Either by turning it into mutex or by dropping the yielding code which > can severly affect the worst case latency AFAIU. Honestly I think it's better if we do it the other way around. We make flushing on the stats reading path non-unified and deterministic. That model also exists and is used for cpu.stat. If we find a problem with the locking being held from userspace, we can then remove flushing from the read path and add interface(s) to configure the periodic flusher and do a force flush. I would like to avoid introducing additional interfaces and configuration knobs unless it's necessary. Also, if we remove the flush entirely the cost will become really cheap. We will have a hard time reversing that in the future if we want to change the implementation. IOW, moving forward with this change seems much more reversible than adopting the /proc/vmstat model. If using a mutex will make things better, we can do that now. It doesn't introduce performance issues in my testing. My only concern is someone sleeping or getting preempted while holding the mutex, so I would prefer disabling preemption while we flush if that doesn't cause problems. Thanks!
On Mon, Aug 28, 2023 at 9:15 AM Yosry Ahmed <yosryahmed@google.com> wrote: > [...] > > > > Well, I really have to say that I do not like the notion that reading > > stats is unpredictable. This just makes it really hard to use. If > > the precision is to be sarificed then this should be preferable over > > potentially high global lock contention. We already have that model in > > place of /proc/vmstat (configurable timeout for flusher and a way to > > flush explicitly). I appreciate you would like to have a better > > precision but as you have explored the locking is really hard to get rid > > of here. > > Reading the stats *is* unpredictable today. In terms of > accuracy/staleness and cost. Avoiding the flush entirely on the read > path will surely make the cost very stable and cheap, but will make > accuracy even less predictable. > On average you would get the stats at most 2 second old, so I would not say it is less predictable. > > > > So from my POV I would prefer to avoid flushing from the stats reading > > path and implement force flushing by writing to stat file. If the 2s > > flushing interval is considered to coarse I would be OK to allow setting > > it from userspace. This way this would be more in line with /proc/vmstat > > which seems to be working quite well. > > > > If this is not accaptable or deemed a wrong approach long term then it > > would be good to reonsider the current cgroup_rstat_lock at least. > > Either by turning it into mutex or by dropping the yielding code which > > can severly affect the worst case latency AFAIU. > > Honestly I think it's better if we do it the other way around. We make > flushing on the stats reading path non-unified and deterministic. That > model also exists and is used for cpu.stat. If we find a problem with > the locking being held from userspace, we can then remove flushing > from the read path and add interface(s) to configure the periodic > flusher and do a force flush. > Here I agree with you. Let's go with the approach which is easy to undo for now. Though I prefer the new explicit interface for flushing, that step would be very hard to undo. Let's reevaluate if the proposed approach shows negative impact on production traffic and I think Cloudflare folks can give us the results soon.
On Mon, Aug 28, 2023 at 10:00 AM Shakeel Butt <shakeelb@google.com> wrote: > > On Mon, Aug 28, 2023 at 9:15 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > [...] > > > > > > Well, I really have to say that I do not like the notion that reading > > > stats is unpredictable. This just makes it really hard to use. If > > > the precision is to be sarificed then this should be preferable over > > > potentially high global lock contention. We already have that model in > > > place of /proc/vmstat (configurable timeout for flusher and a way to > > > flush explicitly). I appreciate you would like to have a better > > > precision but as you have explored the locking is really hard to get rid > > > of here. > > > > Reading the stats *is* unpredictable today. In terms of > > accuracy/staleness and cost. Avoiding the flush entirely on the read > > path will surely make the cost very stable and cheap, but will make > > accuracy even less predictable. > > > > On average you would get the stats at most 2 second old, so I would > not say it is less predictable. > > > > > > > So from my POV I would prefer to avoid flushing from the stats reading > > > path and implement force flushing by writing to stat file. If the 2s > > > flushing interval is considered to coarse I would be OK to allow setting > > > it from userspace. This way this would be more in line with /proc/vmstat > > > which seems to be working quite well. > > > > > > If this is not accaptable or deemed a wrong approach long term then it > > > would be good to reonsider the current cgroup_rstat_lock at least. > > > Either by turning it into mutex or by dropping the yielding code which > > > can severly affect the worst case latency AFAIU. > > > > Honestly I think it's better if we do it the other way around. We make > > flushing on the stats reading path non-unified and deterministic. That > > model also exists and is used for cpu.stat. If we find a problem with > > the locking being held from userspace, we can then remove flushing > > from the read path and add interface(s) to configure the periodic > > flusher and do a force flush. > > > > Here I agree with you. Let's go with the approach which is easy to > undo for now. Though I prefer the new explicit interface for flushing, > that step would be very hard to undo. Let's reevaluate if the proposed > approach shows negative impact on production traffic and I think > Cloudflare folks can give us the results soon. Do you prefer we also switch to using a mutex (with preemption disabled) to avoid the scenario Michal described where flushers give up the lock and sleep resulting in an unbounded wait time in the worst case?
On 8/28/23 13:07, Yosry Ahmed wrote: > >> Here I agree with you. Let's go with the approach which is easy to >> undo for now. Though I prefer the new explicit interface for flushing, >> that step would be very hard to undo. Let's reevaluate if the proposed >> approach shows negative impact on production traffic and I think >> Cloudflare folks can give us the results soon. > Do you prefer we also switch to using a mutex (with preemption > disabled) to avoid the scenario Michal described where flushers give > up the lock and sleep resulting in an unbounded wait time in the worst > case? Locking with a mutex with preemption disabled is an oxymoron. Use a spinlock if you want preemption disabled. The purpose of using a mutex is to allow the lock owner to sleep, but you can't sleep with preemption disabled. You need to enable preemption first. You can disable preemption for a short time in a non-sleeping section of the lock critical section, but I would not recommend disabling preemption for the whole critical section. Cheers, Longman
On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote: > > > On 8/28/23 13:07, Yosry Ahmed wrote: > > > >> Here I agree with you. Let's go with the approach which is easy to > >> undo for now. Though I prefer the new explicit interface for flushing, > >> that step would be very hard to undo. Let's reevaluate if the proposed > >> approach shows negative impact on production traffic and I think > >> Cloudflare folks can give us the results soon. > > Do you prefer we also switch to using a mutex (with preemption > > disabled) to avoid the scenario Michal described where flushers give > > up the lock and sleep resulting in an unbounded wait time in the worst > > case? > > Locking with mutex with preemption disabled is an oxymoron. Use spinlock > if you want to have preemption disabled. The purpose of usiing mutex is > to allow the lock owner to sleep, but you can't sleep with preemption > disabled. You need to enable preemption first. You can disable > preemption for a short time in a non-sleeping section of the lock > critical section, but I would not recommend disabling preemption for the > whole critical section. I thought using a mutex with preemption disabled would at least allow waiters to sleep rather than spin, is this not correct (or doesn't matter) ? > > Cheers, > Longman >
On 8/28/23 13:28, Yosry Ahmed wrote: > On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote: >> >> On 8/28/23 13:07, Yosry Ahmed wrote: >>>> Here I agree with you. Let's go with the approach which is easy to >>>> undo for now. Though I prefer the new explicit interface for flushing, >>>> that step would be very hard to undo. Let's reevaluate if the proposed >>>> approach shows negative impact on production traffic and I think >>>> Cloudflare folks can give us the results soon. >>> Do you prefer we also switch to using a mutex (with preemption >>> disabled) to avoid the scenario Michal described where flushers give >>> up the lock and sleep resulting in an unbounded wait time in the worst >>> case? >> Locking with mutex with preemption disabled is an oxymoron. Use spinlock >> if you want to have preemption disabled. The purpose of usiing mutex is >> to allow the lock owner to sleep, but you can't sleep with preemption >> disabled. You need to enable preemption first. You can disable >> preemption for a short time in a non-sleeping section of the lock >> critical section, but I would not recommend disabling preemption for the >> whole critical section. > I thought using a mutex with preemption disabled would at least allow > waiters to sleep rather than spin, is this not correct (or doesn't > matter) ? Because of optimistic spinning, a mutex lock waiter will only sleep if the lock holder sleep or when its time slice run out. So the waiters are likely to spin for quite a while before they go to sleep. Cheers, Longman
On 8/28/23 13:35, Waiman Long wrote: > On 8/28/23 13:28, Yosry Ahmed wrote: >> On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote: >>> >>> On 8/28/23 13:07, Yosry Ahmed wrote: >>>>> Here I agree with you. Let's go with the approach which is easy to >>>>> undo for now. Though I prefer the new explicit interface for >>>>> flushing, >>>>> that step would be very hard to undo. Let's reevaluate if the >>>>> proposed >>>>> approach shows negative impact on production traffic and I think >>>>> Cloudflare folks can give us the results soon. >>>> Do you prefer we also switch to using a mutex (with preemption >>>> disabled) to avoid the scenario Michal described where flushers give >>>> up the lock and sleep resulting in an unbounded wait time in the worst >>>> case? >>> Locking with mutex with preemption disabled is an oxymoron. Use >>> spinlock >>> if you want to have preemption disabled. The purpose of usiing mutex is >>> to allow the lock owner to sleep, but you can't sleep with preemption >>> disabled. You need to enable preemption first. You can disable >>> preemption for a short time in a non-sleeping section of the lock >>> critical section, but I would not recommend disabling preemption for >>> the >>> whole critical section. >> I thought using a mutex with preemption disabled would at least allow >> waiters to sleep rather than spin, is this not correct (or doesn't >> matter) ? > > Because of optimistic spinning, a mutex lock waiter will only sleep if > the lock holder sleep or when its time slice run out. So the waiters > are likely to spin for quite a while before they go to sleep. Perhaps you can add a mutex at the read side so that only 1 reader can contend with the global rstat spinlock at any time if this is a concern. Regards, Longman
On Mon, Aug 28, 2023 at 10:43 AM Waiman Long <longman@redhat.com> wrote: > > > On 8/28/23 13:35, Waiman Long wrote: > > On 8/28/23 13:28, Yosry Ahmed wrote: > >> On Mon, Aug 28, 2023 at 10:27 AM Waiman Long <longman@redhat.com> wrote: > >>> > >>> On 8/28/23 13:07, Yosry Ahmed wrote: > >>>>> Here I agree with you. Let's go with the approach which is easy to > >>>>> undo for now. Though I prefer the new explicit interface for > >>>>> flushing, > >>>>> that step would be very hard to undo. Let's reevaluate if the > >>>>> proposed > >>>>> approach shows negative impact on production traffic and I think > >>>>> Cloudflare folks can give us the results soon. > >>>> Do you prefer we also switch to using a mutex (with preemption > >>>> disabled) to avoid the scenario Michal described where flushers give > >>>> up the lock and sleep resulting in an unbounded wait time in the worst > >>>> case? > >>> Locking with mutex with preemption disabled is an oxymoron. Use > >>> spinlock > >>> if you want to have preemption disabled. The purpose of usiing mutex is > >>> to allow the lock owner to sleep, but you can't sleep with preemption > >>> disabled. You need to enable preemption first. You can disable > >>> preemption for a short time in a non-sleeping section of the lock > >>> critical section, but I would not recommend disabling preemption for > >>> the > >>> whole critical section. > >> I thought using a mutex with preemption disabled would at least allow > >> waiters to sleep rather than spin, is this not correct (or doesn't > >> matter) ? > > > > Because of optimistic spinning, a mutex lock waiter will only sleep if > > the lock holder sleep or when its time slice run out. So the waiters > > are likely to spin for quite a while before they go to sleep. I see. Thanks for the explanation. > > Perhaps you can add a mutex at the read side so that only 1 reader can > contend with the global rstat spinlock at any time if this is a concern. I guess we can keep it simple for now and add that later if needed. For unified flushers we already can only have one. If we see a problem from the stat reading side we can add a mutex there. Thanks!
On Mon 28-08-23 13:27:23, Waiman Long wrote: > > On 8/28/23 13:07, Yosry Ahmed wrote: > > > > > Here I agree with you. Let's go with the approach which is easy to > > > undo for now. Though I prefer the new explicit interface for flushing, > > > that step would be very hard to undo. Let's reevaluate if the proposed > > > approach shows negative impact on production traffic and I think > > > Cloudflare folks can give us the results soon. > > Do you prefer we also switch to using a mutex (with preemption > > disabled) to avoid the scenario Michal described where flushers give > > up the lock and sleep resulting in an unbounded wait time in the worst > > case? > > Locking with mutex with preemption disabled is an oxymoron. I believe Yosry wanted to disable preemption _after_ the lock is taken to reduce the time spent while it is held. The idea to use the mutex is to reduce spinning and more importantly to get rid of lock dropping part. It is not really clear (but unlikely) we can drop it while preserving the spinlock as the thing scales with O(#cgroups x #cpus) in the worst case.
On 8/29/23 03:27, Michal Hocko wrote: > On Mon 28-08-23 13:27:23, Waiman Long wrote: >> On 8/28/23 13:07, Yosry Ahmed wrote: >>>> Here I agree with you. Let's go with the approach which is easy to >>>> undo for now. Though I prefer the new explicit interface for flushing, >>>> that step would be very hard to undo. Let's reevaluate if the proposed >>>> approach shows negative impact on production traffic and I think >>>> Cloudflare folks can give us the results soon. >>> Do you prefer we also switch to using a mutex (with preemption >>> disabled) to avoid the scenario Michal described where flushers give >>> up the lock and sleep resulting in an unbounded wait time in the worst >>> case? >> Locking with mutex with preemption disabled is an oxymoron. > I believe Yosry wanted to disable preemption _after_ the lock is taken > to reduce the time spent while it is held. The idea to use the mutex is > to reduce spinning and more importantly to get rid of lock dropping > part. It is not really clear (but unlikely) we can drop it while > preserving the spinlock as the thing scales with O(#cgroups x #cpus) > in the worst case. As I have said later in my email, I am not against disabling preemption selectively on some parts of the lock critical section where preemption is undesirable. However, I am against disabling preemption for the whole duration of the code where the mutex lock is held as it defeats the purpose of using mutex in the first place. Cheers, Longman
On Tue 29-08-23 11:05:28, Waiman Long wrote: > On 8/29/23 03:27, Michal Hocko wrote: > > On Mon 28-08-23 13:27:23, Waiman Long wrote: > > > On 8/28/23 13:07, Yosry Ahmed wrote: > > > > > Here I agree with you. Let's go with the approach which is easy to > > > > > undo for now. Though I prefer the new explicit interface for flushing, > > > > > that step would be very hard to undo. Let's reevaluate if the proposed > > > > > approach shows negative impact on production traffic and I think > > > > > Cloudflare folks can give us the results soon. > > > > Do you prefer we also switch to using a mutex (with preemption > > > > disabled) to avoid the scenario Michal described where flushers give > > > > up the lock and sleep resulting in an unbounded wait time in the worst > > > > case? > > > Locking with mutex with preemption disabled is an oxymoron. > > I believe Yosry wanted to disable preemption _after_ the lock is taken > > to reduce the time spent while it is held. The idea to use the mutex is > > to reduce spinning and more importantly to get rid of lock dropping > > part. It is not really clear (but unlikely) we can drop it while > > preserving the spinlock as the thing scales with O(#cgroups x #cpus) > > in the worst case. > > As I have said later in my email, I am not against disabling preemption > selectively on some parts of the lock critical section where preemption is > undesirable. However, I am against disabling preemption for the whole > duration of the code where the mutex lock is held as it defeats the purpose > of using mutex in the first place. I certainly agree this is an antipattern.
On Tue, Aug 29, 2023 at 8:17 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 29-08-23 11:05:28, Waiman Long wrote: > > On 8/29/23 03:27, Michal Hocko wrote: > > > On Mon 28-08-23 13:27:23, Waiman Long wrote: > > > > On 8/28/23 13:07, Yosry Ahmed wrote: > > > > > > Here I agree with you. Let's go with the approach which is easy to > > > > > > undo for now. Though I prefer the new explicit interface for flushing, > > > > > > that step would be very hard to undo. Let's reevaluate if the proposed > > > > > > approach shows negative impact on production traffic and I think > > > > > > Cloudflare folks can give us the results soon. > > > > > Do you prefer we also switch to using a mutex (with preemption > > > > > disabled) to avoid the scenario Michal described where flushers give > > > > > up the lock and sleep resulting in an unbounded wait time in the worst > > > > > case? > > > > Locking with mutex with preemption disabled is an oxymoron. > > > I believe Yosry wanted to disable preemption _after_ the lock is taken > > > to reduce the time spent while it is held. The idea to use the mutex is > > > to reduce spinning and more importantly to get rid of lock dropping > > > part. It is not really clear (but unlikely) we can drop it while > > > preserving the spinlock as the thing scales with O(#cgroups x #cpus) > > > in the worst case. > > > > As I have said later in my email, I am not against disabling preemption > > selectively on some parts of the lock critical section where preemption is > > undesirable. However, I am against disabling preemption for the whole > > duration of the code where the mutex lock is held as it defeats the purpose > > of using mutex in the first place. > > I certainly agree this is an antipattern. So I guess the verdict is to avoid using a mutex here for now. I sent a v2 which includes an additional small patch suggested by Michal Koutny and an updated changelog for this patch to document this discussion and possible alternatives we can do if things go wrong with this approach: https://lore.kernel.org/lkml/20230828233319.340712-1-yosryahmed@google.com/
Hello, On Fri, Aug 25, 2023 at 09:05:46AM +0200, Michal Hocko wrote: > > > I think that's how it was always meant to be when it was designed. The > > > global rstat lock has always existed and was always available to > > > userspace readers. The memory controller took a different path at some > > > point with unified flushing, but that was mainly because of high > > > concurrency from in-kernel flushers, not because userspace readers > > > caused a problem. Outside of memcg, the core cgroup code has always > > > exercised this global lock when reading cpu.stat since rstat's > > > introduction. I assume there hasn't been any problems since it's still > > > there. > > I suspect nobody has just considered a malfunctioning or adversary > workloads so far. > > > > I was hoping Tejun would confirm/deny this. > > Yes, that would be interesting to hear. So, the assumptions in the original design were: * Writers are high freq but readers are lower freq and can block. * The global lock is mutex. * Back-to-back reads won't have too much to do because it only has to flush what's been accumulated since the last flush which took place just before. It's likely that the userspace side is gonna be just fine if we restore the global lock to be a mutex and let them be. Most of the problems are caused by trying to allow flushing from non-sleepable and kernel contexts. Would it make sense to distinguish what can and can't wait and make the latter group always use cached value? e.g. even in kernel, during oom kill, waiting doesn't really matter and it can just wait to obtain the up-to-date numbers. Thanks.
On Tue, Aug 29, 2023 at 11:44 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Fri, Aug 25, 2023 at 09:05:46AM +0200, Michal Hocko wrote: > > > > I think that's how it was always meant to be when it was designed. The > > > > global rstat lock has always existed and was always available to > > > > userspace readers. The memory controller took a different path at some > > > > point with unified flushing, but that was mainly because of high > > > > concurrency from in-kernel flushers, not because userspace readers > > > > caused a problem. Outside of memcg, the core cgroup code has always > > > > exercised this global lock when reading cpu.stat since rstat's > > > > introduction. I assume there hasn't been any problems since it's still > > > > there. > > > > I suspect nobody has just considered a malfunctioning or adversary > > workloads so far. > > > > > > I was hoping Tejun would confirm/deny this. > > > > Yes, that would be interesting to hear. > > So, the assumptions in the original design were: > > * Writers are high freq but readers are lower freq and can block. > > * The global lock is mutex. > > * Back-to-back reads won't have too much to do because it only has to flush > what's been accumulated since the last flush which took place just before. > > It's likely that the userspace side is gonna be just fine if we restore the > global lock to be a mutex and let them be. Most of the problems are caused > by trying to allow flushing from non-sleepable and kernel contexts. So basically restore the flush without disabling preemption, and if a userspace reader gets preempted while holding the mutex it's probably okay because we won't have high concurrency among userspace readers? I think Shakeel was worried that this may cause a priority inversion where a low priority task is preempted while holding the mutex, and prevents high priority tasks from acquiring it to read the stats and take actions (e.g. userspace OOMs). > Would it > make sense to distinguish what can and can't wait and make the latter group > always use cached value? e.g. even in kernel, during oom kill, waiting > doesn't really matter and it can just wait to obtain the up-to-date numbers. The problem with waiting for in-kernel flushers is that high concurrency leads to terrible serialization. Running a stress test with 100s of reclaimers where everyone waits shows ~ 3x slowdowns. This patch series is indeed trying to formalize a distinction between waiters who can wait and those who can't on the memcg side: - Unified flushers always flush the entire tree and only flush if no one else is flushing (no waiting), otherwise they use cached data and hope the concurrent flushing helps. This is what we currently do for most memcg contexts. This patch series opts userspace reads out of it. - Non-unified flushers only flush the subtree they care about and they wait if there are other flushers. This is what we currently do for some zswap accounting code. This patch series opts userspace readers into this. The problem Michal is raising is that dropping the lock can lead to an unbounded number of waiters and longer worst case latency. Especially that this is directly influenced by userspace. Reintroducing the mutex and removing the lock dropping code fixes that problem, but then if the mutex holder gets preempted, we face another problem. 
Personally I think there is a good chance there won't be userspace latency problems due to the lock as usually there isn't high concurrency among userspace readers, and we can deal with that problem if/when it happens. So far no problem is happening for cpu.stat which has the same potential problem.
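As a rough illustration of the distinction drawn above, the unified path can be sketched as below; the names are illustrative and this is a simplification of the actual mm/memcontrol.c logic, not a copy of it:

#include <linux/atomic.h>
#include <linux/cgroup.h>

static atomic_t unified_flush_ongoing = ATOMIC_INIT(0);

/*
 * Unified flush (sketch): flush the whole tree, but skip entirely if
 * another unified flusher is already at it -- callers then settle for
 * slightly stale, cached stats instead of waiting.
 */
static void unified_flush(struct cgroup *root_cgrp)
{
	if (atomic_xchg(&unified_flush_ongoing, 1))
		return;

	cgroup_rstat_flush(root_cgrp);
	atomic_set(&unified_flush_ongoing, 0);
}

The non-unified path, by contrast, simply calls cgroup_rstat_flush() on the cgroup of interest and may block on the global rstat lock until it gets its turn.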
Hello, On Tue, Aug 29, 2023 at 12:13:31PM -0700, Yosry Ahmed wrote: ... > > So, the assumptions in the original design were: > > > > * Writers are high freq but readers are lower freq and can block. > > > > * The global lock is mutex. > > > > * Back-to-back reads won't have too much to do because it only has to flush > > what's been accumulated since the last flush which took place just before. > > > > It's likely that the userspace side is gonna be just fine if we restore the > > global lock to be a mutex and let them be. Most of the problems are caused > > by trying to allow flushing from non-sleepable and kernel contexts. > > So basically restore the flush without disabling preemption, and if a > userspace reader gets preempted while holding the mutex it's probably > okay because we won't have high concurrency among userspace readers? > > I think Shakeel was worried that this may cause a priority inversion > where a low priority task is preempted while holding the mutex, and > prevents high priority tasks from acquiring it to read the stats and > take actions (e.g. userspace OOMs). We'll have to see but I'm not sure this is going to be a huge problem. The most common way that priority inversions are resolved is through work conservation - ie. the system just runs out of other things to do, so whatever is blocking the system gets to run and unblocks it. It only really becomes a problem when work conservation breaks down on CPU side which happens if the one holding the resource is 1. blocked on IOs (usually through memory allocation but can be anything) 2. throttled by cpu.max. #1 is not a factor here. #2 is but that is a factor for everything in the kernel and should really be solved from cpu.max side. So, I think in practice, this should be fine, or at least not worse than anything else. > > Would it > > make sense to distinguish what can and can't wait and make the latter group > > always use cached value? e.g. even in kernel, during oom kill, waiting > > doesn't really matter and it can just wait to obtain the up-to-date numbers. > > The problem with waiting for in-kernel flushers is that high > concurrency leads to terrible serialization. Running a stress test > with 100s of reclaimers where everyone waits shows ~ 3x slowdowns. > > This patch series is indeed trying to formalize a distinction between > waiters who can wait and those who can't on the memcg side: > > - Unified flushers always flush the entire tree and only flush if no > one else is flushing (no waiting), otherwise they use cached data and > hope the concurrent flushing helps. This is what we currently do for > most memcg contexts. This patch series opts userspace reads out of it. > > - Non-unified flushers only flush the subtree they care about and they > wait if there are other flushers. This is what we currently do for > some zswap accounting code. This patch series opts userspace readers > into this. > > The problem Michal is raising is that dropping the lock can lead to an > unbounded number of waiters and longer worst case latency. Especially > that this is directly influenced by userspace. Reintroducing the mutex > and removing the lock dropping code fixes that problem, but then if > the mutex holder gets preempted, we face another problem. > > Personally I think there is a good chance there won't be userspace > latency problems due to the lock as usually there isn't high > concurrency among userspace readers, and we can deal with that problem > if/when it happens. 
> So far no problem is happening for cpu.stat which > has the same potential problem. Maybe leave the global lock as-is and gate the userland flushers with a mutex so that there's only ever one contending on the rstat lock from the userland side? Thanks.
On Tue, Aug 29, 2023 at 12:36 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Tue, Aug 29, 2023 at 12:13:31PM -0700, Yosry Ahmed wrote: > ... > > > So, the assumptions in the original design were: > > > > > > * Writers are high freq but readers are lower freq and can block. > > > > > > * The global lock is mutex. > > > > > > * Back-to-back reads won't have too much to do because it only has to flush > > > what's been accumulated since the last flush which took place just before. > > > > > > It's likely that the userspace side is gonna be just fine if we restore the > > > global lock to be a mutex and let them be. Most of the problems are caused > > > by trying to allow flushing from non-sleepable and kernel contexts. > > > > So basically restore the flush without disabling preemption, and if a > > userspace reader gets preempted while holding the mutex it's probably > > okay because we won't have high concurrency among userspace readers? > > > > I think Shakeel was worried that this may cause a priority inversion > > where a low priority task is preempted while holding the mutex, and > > prevents high priority tasks from acquiring it to read the stats and > > take actions (e.g. userspace OOMs). > > We'll have to see but I'm not sure this is going to be a huge problem. The > most common way that priority inversions are resolved is through work > conservation - ie. the system just runs out of other things to do, so > whatever is blocking the system gets to run and unblocks it. It only really > becomes a problem when work conservation breaks down on CPU side which > happens if the one holding the resource is 1. blocked on IOs (usually > through memory allocation but can be anything) 2. throttled by cpu.max. > > #1 is not a factor here. #2 is but that is a factor for everything in the > kernel and should really be solved from cpu.max side. So, I think in > practice, this should be fine, or at least not worse than anything else. If that's not a concern I can just append a patch that changes the spinlock to a mutex to this series. Shakeel, wdyt? > > > > Would it > > > make sense to distinguish what can and can't wait and make the latter group > > > always use cached value? e.g. even in kernel, during oom kill, waiting > > > doesn't really matter and it can just wait to obtain the up-to-date numbers. > > > > The problem with waiting for in-kernel flushers is that high > > concurrency leads to terrible serialization. Running a stress test > > with 100s of reclaimers where everyone waits shows ~ 3x slowdowns. > > > > This patch series is indeed trying to formalize a distinction between > > waiters who can wait and those who can't on the memcg side: > > > > - Unified flushers always flush the entire tree and only flush if no > > one else is flushing (no waiting), otherwise they use cached data and > > hope the concurrent flushing helps. This is what we currently do for > > most memcg contexts. This patch series opts userspace reads out of it. > > > > - Non-unified flushers only flush the subtree they care about and they > > wait if there are other flushers. This is what we currently do for > > some zswap accounting code. This patch series opts userspace readers > > into this. > > > > The problem Michal is raising is that dropping the lock can lead to an > > unbounded number of waiters and longer worst case latency. Especially > > that this is directly influenced by userspace. 
> > Reintroducing the mutex > > and removing the lock dropping code fixes that problem, but then if > > the mutex holder gets preempted, we face another problem. > > > > Personally I think there is a good chance there won't be userspace > > latency problems due to the lock as usually there isn't high > > concurrency among userspace readers, and we can deal with that problem > > if/when it happens. So far no problem is happening for cpu.stat which > > has the same potential problem. > > Maybe leave the global lock as-is and gate the userland flushers with a > mutex so that there's only ever one contenting on the rstat lock from > userland side? Waiman suggested this as well. We can do that for sure, although I think we should wait until we are sure it's needed. One question. If whoever is holding that mutex is either flushing with the spinlock held or spinning (i.e. not sleepable or preemptible), wouldn't this be equivalent to just replacing the spinlock with a mutex and disabling preemption while holding it?
Hello, On Tue, Aug 29, 2023 at 12:54:06PM -0700, Yosry Ahmed wrote: ... > > Maybe leave the global lock as-is and gate the userland flushers with a > > mutex so that there's only ever one contenting on the rstat lock from > > userland side? > > Waiman suggested this as well. We can do that for sure, although I > think we should wait until we are sure it's needed. > > One question. If whoever is holding that mutex is either flushing with > the spinlock held or spinning (i.e not sleepable or preemptable), > wouldn't this be equivalent to just changing the spinlock with a mutex > and disable preemption while holding it? Well, it creates layering so that userspace can't flood the inner lock which can cause contention issues for kernel side users. Not sleeping while actively flushing is a side-effect too but the code at least doesn't look as anti-patterny as disabling preemption right after grabbing a mutex. I don't have a strong preference. As long as we stay away from introducing a new user interface construct and can address the noticed scalability issues, it should be fine. Note that there are other ways to address priority inversions and contentions too - e.g. we can always bounce flushing to a [kthread_]kworker and rate limit (or rather latency limit) how often different classes of users can trigger flushing. I don't think we have to go there yet but if the simpler measures don't work out, there are still many ways to solve the problem within the kernel. Thanks.
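For completeness, a very rough sketch of the kworker alternative mentioned above, with made-up names and a simple once-per-second rate limit; nothing like this is part of the series, it only illustrates the idea:

#include <linux/workqueue.h>
#include <linux/ratelimit.h>
#include <linux/cgroup.h>

/* Hypothetical root cgroup pointer, assumed to be set up elsewhere. */
static struct cgroup *stats_root_cgrp;

static void stats_flush_workfn(struct work_struct *work)
{
	cgroup_rstat_flush(stats_root_cgrp);
}
static DECLARE_WORK(stats_flush_work, stats_flush_workfn);

/* Allow at most one userspace-triggered flush per second. */
static DEFINE_RATELIMIT_STATE(user_flush_rs, HZ, 1);

static void user_read_flush(void)
{
	if (__ratelimit(&user_flush_rs))
		queue_work(system_unbound_wq, &stats_flush_work);

	/* Wait for any queued flush; otherwise read whatever is cached. */
	flush_work(&stats_flush_work);
}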
On Tue, Aug 29, 2023 at 1:12 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Tue, Aug 29, 2023 at 12:54:06PM -0700, Yosry Ahmed wrote: > ... > > > Maybe leave the global lock as-is and gate the userland flushers with a > > > mutex so that there's only ever one contenting on the rstat lock from > > > userland side? > > > > Waiman suggested this as well. We can do that for sure, although I > > think we should wait until we are sure it's needed. > > > > One question. If whoever is holding that mutex is either flushing with > > the spinlock held or spinning (i.e not sleepable or preemptable), > > wouldn't this be equivalent to just changing the spinlock with a mutex > > and disable preemption while holding it? > > Well, it creates layering so that userspace can't flood the inner lock which > can cause contention issues for kernel side users. Not sleeping while > actively flushing is an side-effect too but the code at least doesn't look > as anti-patterny as disabling preemption right after grabbing a mutex. I see. At most one kernel side flusher will be spinning for the lock at any given point anyway, but I guess having that one kernel side flusher competing against one user side flusher is better than competing with N flushers. I will add a mutex on the userspace read side then and spin a v3. Hopefully this addresses Michal's concern as well. The lock dropping logic will still exist for the inner lock, but when one userspace reader drops the inner lock other readers won't be able to pick it up. > > I don't have a strong preference. As long as we stay away from introducing a > new user interface construct and can address the noticed scalability issues, > it should be fine. Note that there are other ways to address priority > inversions and contentions too - e.g. we can always bounce flushing to a > [kthread_]kworker and rate limit (or rather latency limit) how often > different classes of users can trigger flushing. I don't think we have to go > there yet but if the simpler meaures don't work out, there are still many > ways to solve the problem within the kernel. I whole-heartedly agree with the preference to fix the problem within the kernel with minimal/none user space involvement. Thanks!
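A minimal sketch of the layering being settled on here, assuming names along these lines (the actual v3 may differ): userspace stat readers serialize on an outer mutex, so at most one of them contends on the inner cgroup_rstat_lock with in-kernel flushers at any time:

#include <linux/mutex.h>
#include <linux/cgroup.h>
#include <linux/memcontrol.h>

/* Outer mutex taken only by userspace stat readers. */
static DEFINE_MUTEX(stats_user_flush_mutex);

static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
{
	mutex_lock(&stats_user_flush_mutex);
	/*
	 * The inner global rstat spinlock is taken (and may still be
	 * dropped for cond_resched()) inside cgroup_rstat_flush(), but
	 * only one userspace reader can reach it at a time.
	 */
	cgroup_rstat_flush(memcg->css.cgroup);
	mutex_unlock(&stats_user_flush_mutex);
}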
On Tue 29-08-23 13:20:34, Yosry Ahmed wrote: [...] > I will add a mutex on the userspace read side then and spin a v3. > Hopefully this addresses Michal's concern as well. The lock dropping > logic will still exist for the inner lock, but when one userspace > reader drops the inner lock other readers won't be able to pick it up. Yes, that would minimize the risk of the worst case pathological behavior. > > I don't have a strong preference. As long as we stay away from introducing a > > new user interface construct and can address the noticed scalability issues, > > it should be fine. Note that there are other ways to address priority > > inversions and contentions too - e.g. we can always bounce flushing to a > > [kthread_]kworker and rate limit (or rather latency limit) how often > > different classes of users can trigger flushing. I don't think we have to go > > there yet but if the simpler meaures don't work out, there are still many > > ways to solve the problem within the kernel. > > I whole-heartedly agree with the preference to fix the problem within > the kernel with minimal/none user space involvement. Let's see. While I would love to see a solution that works for everybody without an explicit interface, we have hit problems with locks involved in stat files in the past.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90f08b35fa77..d3b13a06224c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1606,7 +1606,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4048,7 +4048,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4123,7 +4123,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4625,7 +4625,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6640,7 +6640,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
Unified flushing allows for great concurrency for paths that attempt to flush the stats, at the expense of potential staleness and a single flusher paying the extra cost of flushing the full tree.

This tradeoff makes sense for in-kernel flushers that may observe high concurrency (e.g. reclaim, refault). For userspace readers, stale stats may be unexpected and problematic, especially when such stats are used for critical paths such as userspace OOM handling. Additionally, a userspace reader will occasionally pay the cost of flushing the entire hierarchy, which also causes problems in some cases [1].

Opt userspace reads out of unified flushing. This makes the cost of reading the stats more predictable (proportional to the size of the subtree), as well as the freshness of the stats. Since userspace readers are not expected to have similar concurrency to in-kernel flushers, serializing them among themselves and among in-kernel flushers should be okay.

This was tested on a machine with 256 cpus by running a synthetic test script that creates 50 top-level cgroups, each with 5 children (250 leaf cgroups). Each leaf cgroup has 10 processes running that allocate memory beyond the cgroup limit, invoking reclaim (which is an in-kernel unified flusher). Concurrently, one thread is spawned per-cgroup to read the stats every second (including root, top-level, and leaf cgroups -- so total 251 threads). No regressions were observed in the total running time, which means that non-unified userspace readers are not slowing down in-kernel unified flushers:

Base (mm-unstable):
real	0m18.228s
user	0m9.463s
sys	60m15.879s

real	0m20.828s
user	0m8.535s
sys	70m12.364s

real	0m19.789s
user	0m9.177s
sys	66m10.798s

With this patch:
real	0m19.632s
user	0m8.608s
sys	64m23.483s

real	0m18.463s
user	0m7.465s
sys	60m34.089s

real	0m20.309s
user	0m7.754s
sys	68m2.392s

Additionally, the average latency for reading stats went from roughly 40 ms to 5 ms, because we mostly read the stats of leaf cgroups in this script, so we only have to flush one cgroup, instead of *sometimes* flushing the entire tree with unified flushing.

[1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
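The do_stats_flush() helper that the diff above switches these readers to is added by an earlier patch in the series and is not shown on this page; going by the discussion in this thread, it presumably boils down to flushing only the given memcg's subtree, roughly as sketched below (illustrative, not the actual patch):

#include <linux/cgroup.h>
#include <linux/memcontrol.h>

/*
 * Sketch: flush rstat for a single memcg subtree instead of doing a
 * unified whole-tree flush, so userspace readers pay a cost that is
 * proportional to the subtree they are actually reading.
 */
static void do_stats_flush(struct mem_cgroup *memcg)
{
	cgroup_rstat_flush(memcg->css.cgroup);
}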