Message ID: 20230821205458.1764662-1-yosryahmed@google.com (mailing list archive)
Series: memcg: non-unified flushing for userspace stats
Hello.

On Mon, Aug 21, 2023 at 08:54:55PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> For userspace reads, unified flushing leads to non-deterministic stats
> staleness and reading cost.

I only skimmed previous threads but I don't remember if it was resolved:
a) periodic flushing was too widely spaced for user space readers (i.e. a
   2s delay is too much [1]), or
b) periodic flushing didn't catch up (i.e. a full tree flush can
   occasionally take more than 2s), leading to extra staleness?

[1] Assuming that the nr_cpus*MEMCG_CHARGE_BATCH error bound is also too
much for userspace readers, correct?

> The cost of userspace reads are now deterministic, and depend on the
> size of the subtree being read. This should fix both the *sometimes*
> expensive reads (due to flushing the entire tree) and occasional
> staleness (due to skipping flushing).

This is nice, thanks to the atomic removal in commit 0a2dc6ac3329
("cgroup: remove cgroup_rstat_flush_atomic()"). I think the smaller
chunks with yielding could be universally better (last words :-).

Thanks,
Michal
On Tue, Aug 22, 2023 at 6:01 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Mon, Aug 21, 2023 at 08:54:55PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > For userspace reads, unified flushing leads to non-deterministic stats
> > staleness and reading cost.
>
> I only skimmed previous threads but I don't remember if it was resolved:
> a) periodic flushing was too widely spaced for user space readers (i.e.
>    a 2s delay is too much [1]), or
> b) periodic flushing didn't catch up (i.e. a full tree flush can
>    occasionally take more than 2s), leading to extra staleness?

The idea is that having stats anywhere between 0 and 2 seconds stale is
inconsistent, especially when compared to other values (such as
memory.usage_in_bytes), which can give an inconsistent and stale view of
the system. For some readers, 2s is too widely spaced (we have readers
that read every second).

A time-only bound is scary because on large systems a lot can happen in
a second. There will always be a delay anyway, but ideally we want to
minimize it. I also think 2s is not a strict bound (e.g. flushing is
taking a lot of time, or a flush started but the cgroup you care about
hasn't been flushed yet, etc.). There is also the problem of the
variable cost of reading.

> [1] Assuming that the nr_cpus*MEMCG_CHARGE_BATCH error bound is also too
> much for userspace readers, correct?

I can't tell for sure to be honest, but given the continuously
increasing number of CPUs on modern systems, and the fact that
nr_cpus*MEMCG_CHARGE_BATCH can be localized in one cgroup or spread
across the hierarchy, I think it's better if we drop it for userspace
reads if possible.

> > The cost of userspace reads are now deterministic, and depend on the
> > size of the subtree being read. This should fix both the *sometimes*
> > expensive reads (due to flushing the entire tree) and occasional
> > staleness (due to skipping flushing).
> This is nice, thanks to the atomic removal in commit 0a2dc6ac3329
> ("cgroup: remove cgroup_rstat_flush_atomic()"). I think the smaller
> chunks with yielding could be universally better (last words :-).

I was hoping we could remove unified flushing completely, but stress
testing with hundreds of concurrent reclaimers shows a lot of
regression. Maybe it doesn't matter in practice, but I don't want to
pull that trigger :)

FWIW, with unified flushing we are getting great concurrency for
in-kernel flushers like reclaim or refault, but at the expense of stats
staleness. I really wonder what hidden cost we are paying due to the
stale stats. I would imagine any problems that manifest from this would
be difficult to tie back to the stale stats.

> Thanks,
> Michal
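[Editor's note: to give a sense of scale for the nr_cpus*MEMCG_CHARGE_BATCH
error bound discussed above, here is a rough back-of-envelope sketch. It
assumes MEMCG_CHARGE_BATCH is 64 (its value in recent kernels) and a 4 KiB
page size; both numbers are assumptions for illustration, not taken from
this thread.]

```python
# Rough worst-case staleness of a page-based memcg stat if flushing is
# skipped while the accumulated per-CPU error stays within the batch
# bound. Assumes MEMCG_CHARGE_BATCH = 64 and 4 KiB pages (illustrative
# values, not from the thread).

MEMCG_CHARGE_BATCH = 64   # per-CPU events batched before propagation
PAGE_SIZE = 4096          # bytes per page (assumed)

def worst_case_error_bytes(nr_cpus: int) -> int:
    """Upper bound on how stale a page-counting stat can be, in bytes."""
    return nr_cpus * MEMCG_CHARGE_BATCH * PAGE_SIZE

for cpus in (8, 64, 256):
    mib = worst_case_error_bytes(cpus) / (1 << 20)
    print(f"{cpus:4d} CPUs -> up to {mib:.0f} MiB of unflushed page stats")
```

On a 256-CPU machine this comes to 64 MiB that may be unaccounted in a
single stat, which illustrates why the bound grows uncomfortable on large
systems and can be concentrated in one cgroup or spread across the tree.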