[RFC,v3,0/3] move_phys_pages syscall - migrate page contents given

Message ID 20240319172609.332900-1-gregory.price@memverge.com (mailing list archive)

Message

Gregory Price March 19, 2024, 5:26 p.m. UTC
v3:
- pull forward to v6.8
- style and small fixups recommended by jcameron
- update syscall number (will do all archs when RFC tag drops)
- update for new folio code
- added OCP link to device-tracked address hotness proposal
- kept void* over __u64 simply because it integrates cleanly with
  existing migration code. If there's strong opinions, I can refactor.

This patch set is a proposal for a syscall analogous to move_pages,
that migrates pages between NUMA nodes using physical addressing.

The intent is to better enable user-land system-wide memory tiering
as CXL devices begin to provide memory resources on the PCIe bus.

For example, user-land software which is making decisions based on
data sources which expose physical address information no longer
must convert that information to virtual addressing to act upon it
(see background for info on how physical addresses are acquired).

The syscall requires CAP_SYS_ADMIN, since physical address source
information is typically protected by the same (or CAP_SYS_NICE).

This patch set is broken into 3 patches:
  1) refactor of existing migration code for code reuse
  2) The sys_move_phys_pages system call.
  3) ktest of the syscall

The sys_move_phys_pages system call validates that the page may be
migrated by checking the migratable status of each vma mapping the page,
and the intersection of the cpuset policies of each vma's task.
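
As a rough illustration of the check described above, the rule can be
modeled in a few lines of Python. This is a toy model and not the kernel
code: the Mapping type, its fields, and may_migrate are hypothetical
names, and treating a page with no mappings as not migratable is an
assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One vma mapping of a physical page (hypothetical model type)."""
    vma_migratable: bool            # may pages in this vma be migrated?
    task_allowed_nodes: frozenset   # nodes allowed by the task's cpuset


def may_migrate(mappings, target_node):
    """Return True if every mapping permits moving the page to target_node."""
    if not mappings:
        # Assumption of this sketch: an unmapped page is simply skipped.
        return False
    if not all(m.vma_migratable for m in mappings):
        return False
    # The target must be allowed by every task that maps the page.
    allowed = frozenset.intersection(*(m.task_allowed_nodes for m in mappings))
    return target_node in allowed
```

With two tasks mapping the same page, one restricted to nodes {0, 1} and
the other to {1, 2}, only node 1 would pass this check.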


Background:

Userspace job schedulers, memory managers, and tiering software
solutions depend on page migration syscalls to reallocate resources
across NUMA nodes. Currently, these calls enable movement of memory
associated with a specific PID. Moves can be requested in coarse,
process-sized strokes (as with migrate_pages), and on specific virtual
pages (via move_pages).

However, a number of profiling mechanisms provide system-wide information
that would benefit from a physical-addressing version of move_pages.

There are presently at least 4 ways userland can acquire physical
address information for use with this interface, and 1 hardware offload
mechanism being proposed by opencompute.

1) /proc/pid/pagemap: can be used to do page table translations.
     This is only really useful for testing, and the ktest was
     written using this functionality.

2) X86:  IBS (AMD) and PEBS (Intel) can be configured to return physical
     and/or virtual address information.

3) zoneinfo:  /proc/zoneinfo exposes the start PFN of zones

4) /sys/kernel/mm/page_idle:  A way to query whether a PFN is idle.
   So long as the page size is known, this can be used to identify
   system-wide idle pages that could be migrated to lower tiers.

   https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html

5) CXL Offloaded Hotness Monitoring (Proposed): a CXL memory device
   may provide hot/cold information about its memory. For example,
   it may report the hottest device addresses (0-based) or physical
   addresses (if it has access to decoders to convert the bases).

   A device physical address (DPA) can be cheaply converted to a host
   physical address (HPA) by combining it with the information exposed
   under /sys/bus/cxl/ (region address bases).

See: https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
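
For source (1), a minimal Python sketch of the pagemap lookup follows.
It assumes the entry layout documented for /proc/pid/pagemap (one 64-bit
entry per virtual page, PFN in bits 0-54, present flag in bit 63). The
function names are illustrative only. Without CAP_SYS_ADMIN the kernel
reports the PFN as zero, which the sketch treats as "unknown".

```python
import os
import struct

PAGEMAP_ENTRY_BYTES = 8          # one 64-bit entry per virtual page
PFN_MASK = (1 << 55) - 1         # bits 0-54 hold the PFN when present
PAGE_PRESENT = 1 << 63           # bit 63: page is present in RAM


def pagemap_offset(vaddr, page_size):
    """File offset of the pagemap entry covering vaddr."""
    return (vaddr // page_size) * PAGEMAP_ENTRY_BYTES


def decode_pagemap_entry(entry, page_size):
    """Physical address of the page start, or None if it cannot be known."""
    if not entry & PAGE_PRESENT:
        return None              # not resident (for example swapped out)
    pfn = entry & PFN_MASK
    if pfn == 0:
        return None              # PFN hidden from unprivileged readers
    return pfn * page_size


def virt_to_phys(pid, vaddr):
    """Translate one virtual address of `pid`; needs privilege for real PFNs."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/pagemap", "rb", buffering=0) as f:
        f.seek(pagemap_offset(vaddr, page_size))
        raw = f.read(PAGEMAP_ENTRY_BYTES)
    if len(raw) != PAGEMAP_ENTRY_BYTES:
        return None              # address outside the process address space
    (entry,) = struct.unpack("=Q", raw)   # native byte order
    base = decode_pagemap_entry(entry, page_size)
    return None if base is None else base + (vaddr % page_size)
```

A test harness could call virt_to_phys(os.getpid(), addr) on a buffer it
has just touched and hand the result to the new syscall.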


Information from these sources facilitates systemwide resource management,
but with the limitations of migrate_pages and move_pages applying to
individual tasks, their outputs must be converted back to virtual addresses
and re-associated with specific PIDs.
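
To illustrate why such data carries no task context, here is a sketch of
how the idle page bitmap from (4) is indexed. Per the idle page tracking
documentation linked above, the bitmap is an array of native-endian
8-byte words in which PFN i maps to bit i % 64 of word i / 64. The
helper names are hypothetical, and reading the file requires root.

```python
import struct

WORD_BYTES = 8
BITS_PER_WORD = 64
BITMAP_PATH = "/sys/kernel/mm/page_idle/bitmap"


def idle_word_offset(pfn):
    """Byte offset of the 8-byte word that holds the idle bit for pfn."""
    return (pfn // BITS_PER_WORD) * WORD_BYTES


def idle_pfns_in_word(word, first_pfn):
    """List the PFNs flagged idle in one bitmap word starting at first_pfn."""
    return [first_pfn + bit for bit in range(BITS_PER_WORD) if (word >> bit) & 1]


def read_idle_word(pfn, path=BITMAP_PATH):
    """Read the bitmap word covering pfn (unbuffered, 8-byte aligned read)."""
    with open(path, "rb", buffering=0) as f:
        f.seek(idle_word_offset(pfn))
        raw = f.read(WORD_BYTES)
    if len(raw) != WORD_BYTES:
        raise ValueError("pfn is beyond the end of the idle bitmap")
    (word,) = struct.unpack("=Q", raw)    # native byte order
    return word
```

Note that nothing in this path yields a PID or a virtual address; the
output is a list of PFNs only.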

Doing this reverse-translation outside of the kernel requires considerable
space and compute, and it will have to be performed again by the existing
system calls.  Much of this work can be avoided if the pages can be
migrated directly with physical memory addressing.
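
For the CXL case in (5), the conversion is simple arithmetic in the
easiest configuration. The sketch below assumes a region that is not
interleaved, so a single endpoint decoder maps one contiguous DPA range
onto the region; interleaved regions need additional math that is not
shown. All parameter names are hypothetical, and the values are assumed
to have been read from /sys/bus/cxl/ beforehand.

```python
def dpa_to_hpa(dpa, decoder_dpa_base, decoder_dpa_size, region_hpa_base):
    """Convert a device physical address to a host physical address.

    Assumes a non-interleaved region: the decoder's DPA range
    [decoder_dpa_base, decoder_dpa_base + decoder_dpa_size) maps linearly
    onto host addresses starting at region_hpa_base.
    """
    offset = dpa - decoder_dpa_base
    if not 0 <= offset < decoder_dpa_size:
        raise ValueError("DPA is outside the range mapped by this decoder")
    return region_hpa_base + offset
```

The resulting HPA could then be rounded down to a page boundary and
passed to the syscall without ever identifying an owning task.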

Gregory Price (3):
  mm/migrate: refactor add_page_for_migration for code re-use
  mm/migrate: Create move_phys_pages syscall
  ktest: sys_move_phys_pages ktest

 arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
 include/linux/syscalls.h                |   5 +
 include/uapi/asm-generic/unistd.h       |   8 +-
 kernel/sys_ni.c                         |   1 +
 mm/migrate.c                            | 288 ++++++++++++++++++++----
 tools/include/uapi/asm-generic/unistd.h |   8 +-
 tools/testing/selftests/mm/migration.c  |  99 ++++++++
 8 files changed, 370 insertions(+), 41 deletions(-)

Comments

Huang, Ying March 20, 2024, 2:48 a.m. UTC | #1
Gregory Price <gourry.memverge@gmail.com> writes:

> Doing this reverse-translation outside of the kernel requires considerable
> space and compute, and it will have to be performed again by the existing
> system calls.  Much of this work can be avoided if the pages can be
> migrated directly with physical memory addressing.

One difficulty with the physical address approach is that we lack the
user-space-specified policy information needed to make decisions.  For
example, users may want to pin some pages in DRAM to improve latency, or
pin some pages in CXL memory to do some best-effort work.  To make the
correct decision, we need the PID and virtual address.

Yes, I see that you have tried to avoid breaking the existing policy
in the code.  But it seems better to consider the policy beforehand, to
avoid making the wrong decision in the first place.

--
Best Regards,
Huang, Ying
Gregory Price March 20, 2024, 4:39 a.m. UTC | #2
On Wed, Mar 20, 2024 at 10:48:44AM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> 
> > Doing this reverse-translation outside of the kernel requires considerable
> > space and compute, and it will have to be performed again by the existing
> > system calls.  Much of this work can be avoided if the pages can be
> > migrated directly with physical memory addressing.
> 
> One difficulty with the physical address approach is that we lack the
> user-space-specified policy information needed to make decisions.  For
> example, users may want to pin some pages in DRAM to improve latency, or
> pin some pages in CXL memory to do some best-effort work.  To make the
> correct decision, we need the PID and virtual address.
> 

I think of this as a second or third order problem.  The core problem
right now isn't the practicality of how userland would actually use this
interface - the core problem is whether the data generated by offloaded
monitoring is even worth collecting and operating on in the first place.  

So this is a quick hack to do some research about whether it's even
worth developing the whole abstraction described by Willy.

This is why it's labeled RFC.  I upped a v3 because I know of two groups
actively looking at using it for research, and because the folio updates
broke the old version.  It's also easier for me to engage through the
list than via private channels for this particular work.


Do I suggest we merge this interface as-is? No, too many concerns about
side channels.  However, it's a clean reuse of move_pages code to
bootstrap the investigation, and it at least gets the gears turning.

Example notes from a sidebar earlier today:

* An interesting proposal from Dan Williams would be to provide some
  sort of `/sys/.../memory_tiering/tierN/promote_hot` interface, with
  a callback mechanism into the relevant hardware drivers that allows
  for this to be abstracted.  This could be done on some interval and
  some threshold (# pages, hotness threshold, etc).


The code to execute promotions ends up looking like what I have now:

1) Validate the page is eligible to be promoted by walking the vmas
2) Invoke the existing move_pages code

The above idea can be implemented trivially in userland without
having to plumb through a whole brand new callback system.
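
As a sketch of the userland side, the selection step (some number of
pages, some hotness threshold) could look like the following. The input
format, a dict of page-aligned physical address to access count, is an
assumption; real hardware or driver output may look quite different, and
select_promotions is a hypothetical name.

```python
def select_promotions(hotness_by_paddr, hotness_threshold, max_pages):
    """Pick up to max_pages physical pages whose count meets the threshold.

    Hottest pages come first; ties are broken by ascending address so the
    result is deterministic.
    """
    candidates = [
        (count, paddr)
        for paddr, count in hotness_by_paddr.items()
        if count >= hotness_threshold
    ]
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [paddr for _count, paddr in candidates[:max_pages]]
```

Run on some interval, the returned list is what would be handed to the
migration call, which then does the vma validation itself.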


Sometimes you have to post stupid ideas to get to the good ones :]

~Gregory
Huang, Ying March 20, 2024, 6:01 a.m. UTC | #3
Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Mar 20, 2024 at 10:48:44AM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> 
>> > Doing this reverse-translation outside of the kernel requires considerable
>> > space and compute, and it will have to be performed again by the existing
>> > system calls.  Much of this work can be avoided if the pages can be
>> > migrated directly with physical memory addressing.
>> 
>> One difficulty with the physical address approach is that we lack the
>> user-space-specified policy information needed to make decisions.  For
>> example, users may want to pin some pages in DRAM to improve latency, or
>> pin some pages in CXL memory to do some best-effort work.  To make the
>> correct decision, we need the PID and virtual address.
>> 
>
> I think of this as a second or third order problem.  The core problem
> right now isn't the practicality of how userland would actually use this
> interface - the core problem is whether the data generated by offloaded
> monitoring is even worth collecting and operating on in the first place.  
>
> So this is a quick hack to do some research about whether it's even
> worth developing the whole abstraction described by Willy.
>
> This is why it's labeled RFC.  I upped a v3 because I know of two groups
> actively looking at using it for research, and because the folio updates
> broke the old version.  It's also easier for me to engage through the
> list than via private channels for this particular work.
>
>
> Do I suggest we merge this interface as-is? No, too many concerns about
> side channels.  However, it's a clean reuse of move_pages code to
> bootstrap the investigation, and it at least gets the gears turning.

Got it!  Thanks for the detailed explanation.

I think that one of the difficulties of offloaded monitoring is that
it's hard to obey these user-specified policies.  The policies may
become more complex in the future, for example, allocating DRAM among
workloads.

> Example notes from a sidebar earlier today:
>
> * An interesting proposal from Dan Williams would be to provide some
>   sort of `/sys/.../memory_tiering/tierN/promote_hot` interface, with
>   a callback mechanism into the relevant hardware drivers that allows
>   for this to be abstracted.  This could be done on some interval and
>   some threshold (# pages, hotness threshold, etc).
>
>
> The code to execute promotions ends up looking like what I have now:
>
> 1) Validate the page is eligible to be promoted by walking the vmas
> 2) Invoke the existing move_pages code
>
> The above idea can be implemented trivially in userland without
> having to plumb through a whole brand new callback system.
>
>
> Sometimes you have to post stupid ideas to get to the good ones :]
>

--
Best Regards,
Huang, Ying