
[RFC,v3,0/8] mm: workingset reporting

Message ID: 20240327213108.2384666-1-yuanchu@google.com

Message

Yuanchu Xie March 27, 2024, 9:30 p.m. UTC
This patch series provides workingset reporting of user pages in
lruvecs, whose coldness can be tracked by accessed bits and fd
references. However, the concept of a workingset applies generically to
all types of memory, which could be kernel slab caches, discardable
userspace caches (databases), or CXL.mem. Therefore, data sources might
come from slab shrinkers, device drivers, or userspace. IMO, the kernel
should provide a set of workingset interfaces generic enough to
accommodate the various use cases, and extensible to potential future
use cases. The current proposed interfaces are not sufficient in that
regard, but I would like to start somewhere, solicit feedback, and
iterate.

Use cases
==========
Job scheduling
For data center machines, workingset information allows the job
scheduler to right-size each job and land more jobs on the same host or
NUMA node. In the case of a job with a growing workingset, policy
decisions can be made to migrate other jobs off the host/NUMA node, or
to oom-kill the misbehaving job. If the job shape is very different
from the machine shape, knowing the per-node workingset can also help
inform page allocation policies.

Proactive reclaim
Workingset information allows a container manager to proactively
reclaim memory without impacting a job's performance. While PSI may
provide a reactive signal that proactive reclaim has reclaimed too
much, workingset reporting enables the policy to be more accurate and
flexible.

Ballooning (similar to proactive reclaim)
While this patch series does not extend the virtio-balloon device,
balloon policies benefit from workingset reports to more precisely
determine the size of the memory balloon. On desktops/laptops/mobile
devices where memory is scarce and overcommitted, balloon sizing across
multiple VMs running on the same device can be orchestrated with
workingset reports from each one.

Promotion/Demotion
Similar to proactive reclaim, a workingset report enables demotion to a
slower tier of memory.
For promotion, the workingset report interfaces need to be extended to
report hotness and gather hotness information from the devices[1].

[1]
https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1

Sysfs and Cgroup Interfaces
==========
The interfaces are detailed in the patches that introduce them. The
main idea here is that we break down the workingset per node, per
memcg, into time intervals (ms), e.g.

1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
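
For illustration, here is a minimal userspace sketch of how one such
report line could be parsed (the exact file locations and units come
from the individual patches and are not restated here):

#include <stdio.h>
#include <inttypes.h>

/* Parse one report line, e.g. "1000 anon=137368 file=24530". */
static int parse_bin(const char *line, uint64_t *interval_ms,
                     uint64_t *anon, uint64_t *file)
{
        int n = sscanf(line, "%" SCNu64 " anon=%" SCNu64 " file=%" SCNu64,
                       interval_ms, anon, file);
        return n == 3 ? 0 : -1;
}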

I realize this does not generalize well to hotness information, but I
lack the intuition for an abstraction that presents hotness in a useful
way. Based on a recent proposal for move_phys_pages[2], it seems like
userspace tiering software would like to move specific physical pages,
instead of informing the kernel "move x number of hot pages to y
device". Please advise.

[2]
https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/

Implementation
==========
Currently, the reporting of user pages is based on MGLRU, and
therefore requires CONFIG_LRU_GEN=y. We would benefit from more MGLRU
generations for a more fine-grained workingset report. I will make the
generation count configurable in the next version. The workingset
reporting mechanism is gated behind CONFIG_WORKINGSET_REPORT, and the
aging thread is behind CONFIG_WORKINGSET_REPORT_AGING.
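
Putting the above together, a kernel config that exercises everything
in this series would include the following (the first option is a hard
dependency; the aging thread is optional):

CONFIG_LRU_GEN=y
CONFIG_WORKINGSET_REPORT=y
CONFIG_WORKINGSET_REPORT_AGING=y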

--
Changes from RFC v2 -> RFC v3:
- Update to v6.8
- Added an aging kernel thread (gated behind config)
- Added basic selftests for sysfs interface files
- Track swapped out pages for reaccesses
- Refactoring and cleanup
- Dropped the virtio-balloon extension to make things manageable

Changes from RFC v1 -> RFC v2:
- Refactored the patches into smaller pieces
- Renamed interfaces and functions from wss to wsr (Working Set Reporting)
- Fixed build errors when CONFIG_WSR is not set
- Changed working_set_num_bins to u8 for virtio-balloon
- Added support for per-NUMA node reporting for virtio-balloon

[rfc v1]
https://lore.kernel.org/linux-mm/20230509185419.1088297-1-yuanchu@google.com/
[rfc v2]
https://lore.kernel.org/linux-mm/20230621180454.973862-1-yuanchu@google.com/

Yuanchu Xie (8):
  mm: multi-gen LRU: ignore non-leaf pmd_young for force_scan=true
  mm: aggregate working set information into histograms
  mm: use refresh interval to rate-limit workingset report aggregation
  mm: report workingset during memory pressure driven scanning
  mm: extend working set reporting to memcgs
  mm: add per-memcg reaccess histogram
  mm: add kernel aging thread for workingset reporting
  mm: test system-wide workingset reporting

 drivers/base/node.c                           |   3 +
 include/linux/memcontrol.h                    |   5 +
 include/linux/mmzone.h                        |   4 +
 include/linux/workingset_report.h             | 107 +++
 mm/Kconfig                                    |  15 +
 mm/Makefile                                   |   2 +
 mm/internal.h                                 |  45 ++
 mm/memcontrol.c                               | 386 ++++++++-
 mm/mmzone.c                                   |   2 +
 mm/vmscan.c                                   |  95 ++-
 mm/workingset.c                               |   9 +-
 mm/workingset_report.c                        | 757 ++++++++++++++++++
 mm/workingset_report_aging.c                  | 127 +++
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   3 +
 .../testing/selftests/mm/workingset_report.c  | 315 ++++++++
 .../testing/selftests/mm/workingset_report.h  |  37 +
 .../selftests/mm/workingset_report_test.c     | 328 ++++++++
 18 files changed, 2231 insertions(+), 10 deletions(-)
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c
 create mode 100644 mm/workingset_report_aging.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.h
 create mode 100644 tools/testing/selftests/mm/workingset_report_test.c

Comments

Gregory Price March 27, 2024, 9:44 p.m. UTC | #1
On Wed, Mar 27, 2024 at 02:30:59PM -0700, Yuanchu Xie wrote:
> 
> Promotion/Demotion
> Similar to proactive reclaim, a workingset report enables demotion to a
> slower tier of memory.
> For promotion, the workingset report interfaces need to be extended to
> report hotness and gather hotness information from the devices[1].
> 
> [1]
> https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
> 
> Sysfs and Cgroup Interfaces
> ==========
> The interfaces are detailed in the patches that introduce them. The
> main idea here is that we break down the workingset per node, per
> memcg, into time intervals (ms), e.g.
> 
> 1000 anon=137368 file=24530
> 20000 anon=34342 file=0
> 30000 anon=353232 file=333608
> 40000 anon=407198 file=206052
> 9223372036854775807 anon=4925624 file=892892
> 
> I realize this does not generalize well to hotness information, but I
> lack the intuition for an abstraction that presents hotness in a useful
> way. Based on a recent proposal for move_phys_pages[2], it seems like
> userspace tiering software would like to move specific physical pages,
> instead of informing the kernel "move x number of hot pages to y
> device". Please advise.
> 
> [2]
> https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/
> 

Please note that this proposed interface (move_phys_pages) is very
unlikely to be accepted upstream due to side channel concerns. Instead,
it's more likely that the tiering component will expose a "promote X
pages from tier A to tier B", and the kernel component would then
use/consume hotness information to determine which pages to promote.

(Just as one example, there are many more realistic designs)

So if there is a way to expose workingset data to the mm/memory-tiers.c
component instead of via sysfs/cgroup - that is preferable.
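
For concreteness, one purely hypothetical shape such an in-kernel
interface could take (nothing like this exists today):

/*
 * Hypothetical: promote up to @nr_pages of the hottest pages from
 * @from_tier to @to_tier; the tiering code picks the pages based on
 * whatever hotness data it has consumed.
 */
int memory_tier_promote_pages(int from_tier, int to_tier,
                              unsigned long nr_pages);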

The 'move_phys_pages' interface is more of an experimental interface to
test the effectiveness of this approach without having to plumb out the
entire system. Any userland interface should definitely not be designed
to generate physical address information for consumption unless it is
hard-locked behind admin caps.

Regards,
Gregory
Yuanchu Xie March 27, 2024, 10:53 p.m. UTC | #2
On Wed, Mar 27, 2024 at 2:44 PM Gregory Price
<gregory.price@memverge.com> wrote:
>
> On Wed, Mar 27, 2024 at 02:30:59PM -0700, Yuanchu Xie wrote:
> > I realize this does not generalize well to hotness information, but I
> > lack the intuition for an abstraction that presents hotness in a useful
> > way. Based on a recent proposal for move_phys_pages[2], it seems like
> > userspace tiering software would like to move specific physical pages,
> > instead of informing the kernel "move x number of hot pages to y
> > device". Please advise.
> >
> > [2]
> > https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/
> >
>
> Please note that this proposed interface (move_phys_pages) is very
> unlikely to be accepted upstream due to side channel concerns. Instead,
> it's more likely that the tiering component will expose a "promote X
> pages from tier A to tier B", and the kernel component would then
> use/consume hotness information to determine which pages to promote.

I see that mm/memory-tiers.c only has support for demotion. What kind
of hotness information do devices typically provide? The OCP proposal
is not very specific about this.
A list of hot pages with configurable threshold?
Access frequency for all pages at configured granularity?
Is there a way to tell which NUMA node is accessing them, for page promotion?
>
> (Just as one example, there are many more realistic designs)
>
> So if there is a way to expose workingset data to the mm/memory-tiers.c
> component instead of via sysfs/cgroup - that is preferable.

Appreciate the feedback. The data in its current form might be useful
to inform demotion decisions, but for promotion, are you aware of any
recent developments? I would like to encode hotness as workingset data
as well.
>
> The 'move_phys_pages' interface is more of an experimental interface to
> test the effectiveness of this approach without having to plumb out the
> entire system. Any userland interface should definitely not be designed
> to generate physical address information for consumption unless it is
> hard-locked behind admin caps.
>
> Regards,
> Gregory
Gregory Price March 29, 2024, 5:28 p.m. UTC | #3
On Wed, Mar 27, 2024 at 03:53:39PM -0700, Yuanchu Xie wrote:
> On Wed, Mar 27, 2024 at 2:44 PM Gregory Price
> <gregory.price@memverge.com> wrote:
> >
> > Please note that this proposed interface (move_phys_pages) is very
> > unlikely to be accepted upstream due to side channel concerns. Instead,
> > it's more likely that the tiering component will expose a "promote X
> > pages from tier A to tier B", and the kernel component would then
> > use/consume hotness information to determine which pages to promote.
> 
> I see that mm/memory-tiers.c only has support for demotion. What kind
> of hotness information do devices typically provide? The OCP proposal
> is not very specific about this.
> A list of hot pages with configurable threshold?
> Access frequency for all pages at configured granularity?
> Is there a way to tell which NUMA node is accessing them, for page promotion?

(caveat: I'm not a memory-tiers maintainer, you may want to poke at them
directly for more information; this is simply spitballing an idea)

I don't know of any public proposals of explicit hotness information
provided by hardware yet, just the general proposal.

For the sake of simplicity, I would make the assumption that you have
the least information possible - a simple list of "hot addresses" in
Host Physical Address format.

I.e. there's some driver function that amounts to:

uint32_t device_get_hot_addresses(uint64_t *addresses, uint32_t buf_max);

Where the return value is the number of addresses the device returned, and
the buf_max is the number of addresses that can be read.

Drivers providing this functionality would then register this as a
callback when their memory becomes a member of some NUMA node.
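
A minimal sketch of what that registration might look like (all names
hypothetical; nothing like this exists in the tree today):

struct hotness_source_ops {
        /*
         * Fill @addresses with up to @buf_max hot Host Physical
         * Addresses; returns the number of entries written.
         */
        u32 (*get_hot_addresses)(u64 *addresses, u32 buf_max);
};

/* Called by the driver when its memory joins NUMA node @nid. */
int memory_tier_register_hotness_source(int nid,
                                        struct hotness_source_ops *ops);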


Re: source node -
Devices have no real way of determining upstream source information.

> >
> > (Just as one example, there are many more realistic designs)
> >
> > So if there is a way to expose workingset data to the mm/memory-tiers.c
> > component instead of via sysfs/cgroup - that is preferable.
> 
> Appreciate the feedback. The data in its current form might be useful
> to inform demotion decisions, but for promotion, are you aware of any
> recent developments? I would like to encode hotness as workingset data
> as well.

There were some recent patches to DAMON about promotion/demotion.  You
might look there.

~Gregory