Message ID: 20240327213108.2384666-1-yuanchu@google.com
Series: mm: workingset reporting
On Wed, Mar 27, 2024 at 02:30:59PM -0700, Yuanchu Xie wrote:
>
> Promotion/Demotion
> Similar to proactive reclaim, a workingset report enables demotion to a
> slower tier of memory.
> For promotion, the workingset report interfaces need to be extended to
> report hotness and gather hotness information from the devices[1].
>
> [1] https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
>
> Sysfs and Cgroup Interfaces
> ==========
> The interfaces are detailed in the patches that introduce them. The main
> idea here is that we break down the workingset per-node, per-memcg into
> time intervals (ms), e.g.
>
> 1000 anon=137368 file=24530
> 20000 anon=34342 file=0
> 30000 anon=353232 file=333608
> 40000 anon=407198 file=206052
> 9223372036854775807 anon=4925624 file=892892
>
> I realize this does not generalize well to hotness information, but I
> lack the intuition for an abstraction that presents hotness in a useful
> way. Based on a recent proposal for move_phys_pages[2], it seems like
> userspace tiering software would like to move specific physical pages,
> instead of informing the kernel "move x number of hot pages to y
> device". Please advise.
>
> [2] https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/
>

Please note that this proposed interface (move_phys_pages) is very
unlikely to be received upstream due to side channel concerns. Instead,
it's more likely that the tiering component will expose a "promote X
pages from tier A to tier B", and the kernel component would then
use/consume hotness information to determine which pages to promote.

(Just as one example, there are many more realistic designs)

So if there is a way to expose workingset data to the mm/memory-tiers.c
component instead of via sysfs/cgroup - that is preferable. The
'move_phys_pages' interface is more of an experimental interface to test
the effectiveness of this approach without having to plumb out the
entire system.

Any userland interface should definitely not be designed to generate
physical address information for consumption unless it is hard-locked
behind admin caps.

Regards,
Gregory
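For illustration, here is a minimal userspace sketch that consumes the
per-interval output quoted above and totals the memory that has been idle
for longer than a chosen cutoff. The report path, units, and exact field
layout are assumptions for this sketch, not guarantees of the posted
series:

/*
 * Sum the anon/file amounts reported in age buckets at or beyond a
 * "cold" cutoff. Path and field layout are hypothetical.
 */
#include <stdio.h>

int main(void)
{
        /* Hypothetical per-memcg report file; the real path is defined
         * by the posted patches. */
        FILE *f = fopen("/sys/fs/cgroup/workload/memory.workingset.report", "r");
        unsigned long long age_ms, anon, file;
        unsigned long long cold_anon = 0, cold_file = 0;
        const unsigned long long cutoff_ms = 30000; /* arbitrary cutoff */

        if (!f)
                return 1;

        /* Each line: <interval in ms> anon=<amount> file=<amount> */
        while (fscanf(f, "%llu anon=%llu file=%llu",
                      &age_ms, &anon, &file) == 3) {
                if (age_ms >= cutoff_ms) {
                        cold_anon += anon;
                        cold_file += file;
                }
        }
        fclose(f);

        printf("cold: anon=%llu file=%llu\n", cold_anon, cold_file);
        return 0;
}

A proactive-reclaim or tiering daemon could use such a total to decide
how much memory to ask the kernel to reclaim or demote.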
On Wed, Mar 27, 2024 at 2:44 PM Gregory Price <gregory.price@memverge.com> wrote:
>
> On Wed, Mar 27, 2024 at 02:30:59PM -0700, Yuanchu Xie wrote:
> > I realize this does not generalize well to hotness information, but I
> > lack the intuition for an abstraction that presents hotness in a useful
> > way. Based on a recent proposal for move_phys_pages[2], it seems like
> > userspace tiering software would like to move specific physical pages,
> > instead of informing the kernel "move x number of hot pages to y
> > device". Please advise.
> >
> > [2] https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/
> >
>
> Please note that this proposed interface (move_phys_pages) is very
> unlikely to be received upstream due to side channel concerns. Instead,
> it's more likely that the tiering component will expose a "promote X
> pages from tier A to tier B", and the kernel component would then
> use/consume hotness information to determine which pages to promote.

I see that mm/memory-tiers.c only has support for demotion. What kind
of hotness information do devices typically provide? The OCP proposal
is not very specific about this.
A list of hot pages with configurable threshold?
Access frequency for all pages at configured granularity?
Is there a way to tell which NUMA node is accessing them, for page promotion?

> (Just as one example, there are many more realistic designs)
>
> So if there is a way to expose workingset data to the mm/memory-tiers.c
> component instead of via sysfs/cgroup - that is preferable.

Appreciate the feedback. The data in its current form might be useful
to inform demotion decisions, but for promotion, are you aware of any
recent developments? I would like to encode hotness as workingset data
as well.

> The 'move_phys_pages' interface is more of an experimental interface to
> test the effectiveness of this approach without having to plumb out the
> entire system. Any userland interface should definitely not be designed
> to generate physical address information for consumption unless it is
> hard-locked behind admin caps.
>
> Regards,
> Gregory
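To make the report shapes asked about above concrete, here is a rough
sketch of what each could look like as a data structure. These are
illustrative assumptions only, not any existing device or kernel
interface:

/* Illustrative only; no such structures exist today. */
#include <linux/types.h>

/* (a) a list of hot pages above a configurable threshold */
struct hot_page_list {
        u64 threshold;          /* device-defined hotness threshold */
        u32 nr_pages;
        u64 hpa[];              /* host physical addresses considered hot */
};

/* (b) access frequency per region at a configured granularity */
struct access_histogram {
        u64 base_hpa;
        u64 granularity;        /* bytes covered by each counter */
        u32 nr_counters;
        u32 counters[];         /* accesses per region since last read */
};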
On Wed, Mar 27, 2024 at 03:53:39PM -0700, Yuanchu Xie wrote:
> On Wed, Mar 27, 2024 at 2:44 PM Gregory Price
> <gregory.price@memverge.com> wrote:
> >
> > Please note that this proposed interface (move_phys_pages) is very
> > unlikely to be received upstream due to side channel concerns. Instead,
> > it's more likely that the tiering component will expose a "promote X
> > pages from tier A to tier B", and the kernel component would then
> > use/consume hotness information to determine which pages to promote.
>
> I see that mm/memory-tiers.c only has support for demotion. What kind
> of hotness information do devices typically provide? The OCP proposal
> is not very specific about this.
> A list of hot pages with configurable threshold?
> Access frequency for all pages at configured granularity?
> Is there a way to tell which NUMA node is accessing them, for page promotion?

(Caveat: I'm not a memory-tiers maintainer, you may want to poke at them
directly for more information; this is simply spitballing an idea.)

I don't know of any public proposals of explicit hotness information
provided by hardware yet, just the general proposal. For the sake of
simplicity, I would make the assumption that you have the least
information possible - a simple list of "hot addresses" in Host Physical
Address format. I.e. there's some driver function that amounts to:

        uint32_t device_get_hot_addresses(uint64_t *addresses, uint32_t buf_max);

where the return value is the number of addresses the device returned,
and buf_max is the number of addresses that can be read. A driver
providing this functionality would then register it as a callback when
its memory becomes a member of some NUMA node.

Re: source node - devices have no real way of determining upstream
source information.

> >
> > (Just as one example, there are many more realistic designs)
> >
> > So if there is a way to expose workingset data to the mm/memory-tiers.c
> > component instead of via sysfs/cgroup - that is preferable.
>
> Appreciate the feedback. The data in its current form might be useful
> to inform demotion decisions, but for promotion, are you aware of any
> recent developments? I would like to encode hotness as workingset data
> as well.

There were some recent patches to DAMON about promotion/demotion. You
might look there.

~Gregory
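As a rough sketch of the registration idea described above, assuming a
hypothetical hook in mm/memory-tiers.c; none of these names exist in the
kernel today, and locking is omitted:

/*
 * Hypothetical callback registration for device-provided hot-address
 * lists. Illustration only.
 */
#include <linux/types.h>
#include <linux/list.h>

struct hotness_source {
        struct list_head list;
        int nid;        /* NUMA node backed by this device's memory */
        /*
         * Fill @addresses with up to @buf_max hot host physical
         * addresses; return the number of addresses written.
         */
        u32 (*get_hot_addresses)(u64 *addresses, u32 buf_max);
};

static LIST_HEAD(hotness_sources);

/* A device driver would call this when its memory joins a NUMA node. */
void memory_tier_register_hotness_source(struct hotness_source *src)
{
        list_add_tail(&src->list, &hotness_sources);
}

/*
 * A promotion worker could then periodically walk hotness_sources, pull
 * a batch of addresses from each source, and feed them into whatever
 * "promote X pages from tier A to tier B" mechanism is eventually
 * adopted; that part is intentionally left out here.
 */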