[RFC,0/6] Supporting GMEM (generalized memory management) for external memory devices

Message ID 20231128125025.4449-1-weixi.zhu@huawei.com (mailing list archive)

Message

zhuweixi Nov. 28, 2023, 12:50 p.m. UTC
The problem:

Accelerator driver developers are forced to reinvent external MM subsystems
case by case, because Linux core MM only considers host memory resources.
These reinvented MM subsystems are comparable in size to Linux MM itself
(80K LoC): Nvidia-UVM has 70K, AMD GPU has 14K and Huawei NPU has 30K.
Meanwhile, more and more vendors are implementing their own
accelerators, e.g. Microsoft's Maia 100. At the same time,
application-level developers suffer from poor programmability -- they must
consider parallel address spaces and be careful about the limited device
DRAM capacity. This can be alleviated if a malloc()-ed virtual address can
be shared by the accelerator, or if the abundant host DRAM can
transparently back up the device local memory.

These external MM systems share similar mechanisms except for the
hardware-dependent part, so reinventing them effectively introduces
redundant code (14K~70K for each case). Such development and maintenance is
not cheap. Furthermore, to share a malloc()-ed virtual address, device drivers
need to deeply interact with Linux MM via low-level MM APIs, e.g. MMU
notifiers/HMM. This raises the bar for driver development, since developers
must understand how Linux MM works. Further, it creates code maintenance
problems -- any changes to Linux MM potentially require coordinated changes
to accelerator drivers using low-level MM APIs.

Putting a cache-coherent bus between host and device will not make these
external MM subsystems disappear. For example, a throughput-oriented
accelerator will not tolerate running heavy memory-access workloads through
a host MMU/IOMMU across a remote bus. Therefore, devices will still have
their own MMU and pick a simpler page table format for lower address
translation overhead, requiring external MM subsystems.

--------------------

What GMEM (Generalized Memory Management [1]) does:

GMEM extends Linux MM to share its machine-independent MM code. Only a
high-level interface is provided for device drivers. This prevents
accelerator drivers from reinventing the wheel, but relies on drivers to
implement the hardware-dependent functions declared by GMEM. GMEM's key
interfaces include gm_dev_create(), gm_as_create(), gm_as_attach() and
gm_dev_register_physmem(). Here is a brief description of how a device driver
utilizes them (a sketch of the glue code follows the list):
1. At boot time, call gm_dev_create() and register the implementation of
   hardware-dependent functions as declared in struct gm_mmu.
     - If the device has local DRAM, call gm_dev_register_physmem() to
       register available physical addresses.
2. When a device context is initialized (e.g. triggered by ioctl), check if
   the current CPU process has been attached to a gmem address space
   (struct gm_as). If not, call gm_as_create() and point current->mm->gm_as
   to it.
3. Call gm_as_attach() to attach the device context to a gmem address space.
4. Invoke gm_dev_fault() to resolve a page fault or prepare data before
   device computation happens.
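
For illustration only, the condensed sketch below shows how a driver might
wire these steps together. The gm_*() argument lists are assumptions inferred
from this description rather than the actual prototypes in
include/linux/gmem.h; the npu_* names, NPU_DRAM_START/END and npu_gm_mmu_ops
are hypothetical placeholders:

/* Hypothetical driver glue for steps 1-4 above. */
static struct gm_dev *npu_dev;

static int npu_probe(void)                        /* step 1: at boot time */
{
        npu_dev = gm_dev_create(&npu_gm_mmu_ops); /* struct gm_mmu callbacks */
        if (IS_ERR(npu_dev))
                return PTR_ERR(npu_dev);
        /* register local device DRAM, if present */
        return gm_dev_register_physmem(npu_dev, NPU_DRAM_START, NPU_DRAM_END);
}

static int npu_open_context(void)                 /* steps 2 and 3: on ioctl */
{
        struct gm_as *as = current->mm->gm_as;

        if (!as) {                                /* first device context */
                as = gm_as_create(current->mm);
                current->mm->gm_as = as;
        }
        return gm_as_attach(as, npu_dev);
}

static int npu_resolve_fault(unsigned long va, size_t size)   /* step 4 */
{
        return gm_dev_fault(current->mm, va, size, npu_dev);
}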

GMEM has changed the following assumptions in Linux MM:
  1. An mm_struct not only handles a single CPU context, but may also handle
     external memory contexts encapsulated as gm_context listed in
     mm->gm_as. An external memory context can include a few or all of the
     following parts: an external MMU (that requires TLB invalidation), an
     external page table (that requires PTE manipulation) and external DRAM
     (that requires physical memory management).
  2. Faulting a MAP_PRIVATE VMA with no CPU PTE found does not necessarily
     mean that a zero-filled physical page should be mapped. The virtual
     page may have been mapped to an external memory device.
  3. Unmapping a page may include sending device TLB invalidation (even if
     its MMU shares CPU page table) and manipulating device PTEs.

--------------------

Semantics of new syscalls:

1. mmap(..., MAP_PRIVATE | MAP_PEER_SHARED)
    Allocates a virtual address range that is shared between the CPU and all
    attached devices. Data is guaranteed to be coherent whenever the
    address is accessed by either the CPU or any attached device. If the
    device does not support page faults, then the device driver is responsible
    for faulting memory in before data gets accessed. By default, CPU DRAM
    can be used as swap backing for the device local memory.
2. hmadvise(NUMA_id, va_start, size, memory_hint)
    Issues a memory hint for a given VMA. This extends the traditional
    madvise() syscall with an extra argument so that programmers have better
    control over heterogeneous devices registered as NUMA nodes. One useful
    memory hint could be MADV_PREFETCH, which guarantees that the physical
    data of the given VMA [VA, VA+size) is migrated to NUMA node #id. Another
    useful memory hint is MADV_DONTNEED, which helps increase device memory
    utilization. It is worth considering extending the existing madvise()
    syscall with one additional argument instead of adding a new syscall. A
    user-space sketch of both calls is shown below.
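
In the sketch, MAP_PEER_SHARED, __NR_hmadvise and MADV_PREFETCH are given
placeholder values, since the real numbers are whatever this series assigns
in its uapi headers:

#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Placeholder values -- the real ones come from this series' uapi headers. */
#define MAP_PEER_SHARED 0x8000000
#define __NR_hmadvise   462
#define MADV_PREFETCH   24

static long hmadvise(int numa_id, void *va, size_t size, int hint)
{
        return syscall(__NR_hmadvise, numa_id, va, size, hint);
}

int main(void)
{
        size_t len = 1UL << 30;         /* 1 GiB shared with the device */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_PEER_SHARED, -1, 0);

        if (buf == MAP_FAILED)
                return 1;
        memset(buf, 0, len);            /* touched by the CPU... */
        /* ...then prefetched to the device registered as NUMA node 1 */
        hmadvise(1, buf, len, MADV_PREFETCH);
        return 0;
}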

--------------------

Implementation details

1. New VMA flag: MAP_PEER_SHARED

This new flag helps isolate the GMEM feature, so that common processes with
no device attached do not need to maintain any logical page table. It
can be deleted if the extra overhead from GMEM is acceptable.

2. MMU functions
The device driver must implement the MMU functions declared in struct
gm_mmu.

VA functions: peer_va_alloc_fixed(), peer_va_free()

They are used to negotiate a VMA that is available to both a host
process and a device process at mmap() time. This is because some
accelerators like Intel Xeon Phi or Huawei's Ascend NPU have their
acceleration tasks executed within a device CPU process context. Some
accelerators may also choose a different format of virtual address
space.

PA functions: alloc_page(), free_page(), prepare_page()

Alloc_page() and free_page() are used to allocate and free device physical
pages. Prepare_page() is used to zero-fill or DMA the data of a physical
page. These functions were removed from the submitted patch, since GMEM
does not need to invoke them when testing Huawei's NPU accelerator. The NPU
accelerator has an OS running in the device that manages the device
physical memory. However, even for such a device it is better for the host
to directly manage device physical memory, which saves device HBM and
avoids synchronizing management status between the host and device.

Page-table functions: pmap_create()/destroy()/enter()/release()/protect()

They are used to create and destroy device page tables, install and
uninstall page table entries and to change the protection of page table
entries.

TLB-invalidation functions: tlb_invl(), tlb_invl_coalesced()

They are used to invalidate the TLB entries of a given range of VA or
invalidate a given list of VMAs.

Wrapper functions: peer_map() and peer_unmap()

These two functions are used to create or destroy a device mapping which
could include allocating physical memory and copying data. They effectively
wrap the PA functions, Page-table functions and TLB-invalidation
functions. Implementing these steps together allows devices to optimize the
communication cost between host and device. However, it requires the device
driver to correctly order these steps.
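
Put together, a driver's registration might look roughly like the sketch
below. The member names simply mirror the function names listed above; the
real layout of struct gm_mmu in include/linux/gmem.h may differ, and every
npu_* callback is a hypothetical driver function:

static struct gm_mmu npu_gm_mmu_ops = {
        /* VA functions: negotiate a common VMA at mmap() time */
        .peer_va_alloc_fixed  = npu_va_alloc_fixed,
        .peer_va_free         = npu_va_free,

        /* Page-table functions */
        .pmap_create          = npu_pmap_create,
        .pmap_destroy         = npu_pmap_destroy,
        .pmap_enter           = npu_pmap_enter,
        .pmap_release         = npu_pmap_release,
        .pmap_protect         = npu_pmap_protect,

        /* TLB-invalidation functions */
        .tlb_invl             = npu_tlb_invl,
        .tlb_invl_coalesced   = npu_tlb_invl_coalesced,

        /* Wrappers: map/unmap including PA allocation, DMA and TLB flush */
        .peer_map             = npu_peer_map,
        .peer_unmap           = npu_peer_unmap,
};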

3. Tracking logical mappings:

Each process starts maintaining an xarray in mm->vm_obj->logical_page_table
the first time the host process calls mmap(MAP_PRIVATE | MAP_PEER_SHARED).
When a virtual page gets touched, its mapping status is created and stored
in struct gm_mapping. The logical page table is utilized to query the
struct gm_mapping given a virtual address. GMEM extends Linux MM to update
and look up these logical mappings. For example, in the patch set we modify
the page fault path to additionally check the logical mapping of
MAP_PEER_SHARED VMAs and identify whether a device page should be migrated.
Similarly, if the device driver wants to resolve a device page fault or
prefetch data, the driver should call gm_dev_fault(). This function
examines the mapping status and determines whether the device driver should
migrate a CPU page to device or install a zero-filled device page.
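
A minimal sketch of what such a lookup could look like (the real helpers live
in mm/vm_object.c; indexing the xarray by virtual page frame number is an
assumption here):

/*
 * Hypothetical lookup on the per-mm logical page table. struct gm_mapping
 * records whether the virtual page currently lives in host DRAM or on an
 * attached device.
 */
static struct gm_mapping *logical_page_lookup(struct mm_struct *mm,
                                              unsigned long va)
{
        struct vm_object *obj = mm->vm_obj;

        if (!obj)       /* no MAP_PEER_SHARED VMA has been created yet */
                return NULL;
        return xa_load(&obj->logical_page_table, va >> PAGE_SHIFT);
}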

The logical mapping abstraction enhances the extensibility of Linux core MM
(a virtual page may be mapped to a device physical page without any CPU PTE
installed). The current implementation is not complete, since it only
covers anonymous VMAs with the MAP_PEER_SHARED flag. The future plan for the
logical page table is to provide a generic abstraction layer that supports
common anonymous memory (I am looking at you, transparent huge pages) and
file-backed memory.

--------------------

Use cases

GMEM has been tested on Huawei's NPU (neural processing unit) device driver.
The original NPU device driver has approximately 30,000 lines of code for
memory management. In contrast, the GMEM-based one has less than 30
lines of code calling the GMEM API, with approximately 3,700 lines of code
implementing the MMU functions. This effectively saves over 26,200 lines
of MM code for one driver. Therefore, developers from accelerator vendors,
including Nvidia, AMD, Intel and other companies, are welcome to discuss
whether GMEM could be helpful.

Using a GMEM-based driver, it is possible to write C-style accelerator code
with malloc(), whose underlying mmap() syscall should include
MAP_PEER_SHARED according to the current GMEM implementation. Importantly, GMEM
guarantees a coherent view of memory between the host and all attached
devices. This means that any data written by the CPU or any attached
accelerator can be seen by the next memory load instruction issued by any
attached accelerator or the CPU. Furthermore, the NPU device was able to
oversubscribe memory by swapping memory to host DDR. Note that this memory
oversubscription mechanism can be universal if the physical memory
management is provided by GMEM. Other potential use cases of GMEM could
include the IOMMU driver, KVM and RDMA drivers, as long as the device needs
to manage external memory resources like VMAs, MMUs or local DRAMs.

--------------------

Discussion

Physical memory management
Most accelerators require the host OS to manage device DRAM. Even
accelerators capable of running an OS inside the device can benefit from
it, since it helps avoid synchronizing management status between the host
and device. At the Linux OSS EU Summit 2023, Hannes Reinecke from SUSE Labs
suggested that people are concerned with the memory consumption of struct
page (which must handle all generic scenarios in the kernel). This suggests
a possible solution: instead of reusing Linux struct page and the
ZONE_DEVICE mechanism, GMEM can implement an isolated buddy allocator for
the device to instantiate and register. The isolation is useful because
device DRAM physical address space is independent. Furthermore, the
isolated buddy allocator can utilize a customized struct page that consumes
less memory. It is worth discussing if accelerator vendors desire this
solution.

MMU functions
The MMU functions peer_map() and peer_unmap() overlap the other functions,
leaving open the question of whether the MMU functions should be decoupled
into more basic operations. Decoupling them could prevent device drivers
from coalescing these basic steps within a single host-device communication
operation, while coupling them makes it more difficult for device drivers
to utilize the GMEM interface.

The idea of GMEM originated from Weixi's PhD study with
Prof. Scott Rixner and Prof. Alan L. Cox at Rice University.

[1] https://arxiv.org/abs/2310.12554.

Weixi Zhu (6):
  mm/gmem: add heterogeneous NUMA node
  mm/gmem: add arch-independent abstraction to track address mapping
    status
  mm/gmem: add GMEM (Generalized Memory Management) interface for
    external accelerators
  mm/gmem: add new syscall hmadvise() to issue memory hints for
    heterogeneous NUMA nodes
  mm/gmem: resolve VMA conflicts for attached peer devices
  mm/gmem: extending Linux core MM to support unified virtual address
    space

 arch/arm64/include/asm/unistd.h         |   2 +-
 arch/arm64/include/asm/unistd32.h       |   2 +
 drivers/base/node.c                     |   6 +
 fs/proc/task_mmu.c                      |   3 +
 include/linux/gmem.h                    | 368 ++++++++++++
 include/linux/mm.h                      |   8 +
 include/linux/mm_types.h                |   5 +
 include/linux/nodemask.h                |  10 +
 include/uapi/asm-generic/mman-common.h  |   4 +
 include/uapi/asm-generic/unistd.h       |   5 +-
 init/main.c                             |   2 +
 kernel/fork.c                           |   5 +
 kernel/sys_ni.c                         |   2 +
 mm/Kconfig                              |  14 +
 mm/Makefile                             |   1 +
 mm/gmem.c                               | 746 ++++++++++++++++++++++++
 mm/huge_memory.c                        |  85 ++-
 mm/memory.c                             |  42 +-
 mm/mempolicy.c                          |   4 +
 mm/mmap.c                               |  40 +-
 mm/oom_kill.c                           |   2 +
 mm/page_alloc.c                         |   3 +
 mm/vm_object.c                          | 309 ++++++++++
 tools/include/uapi/asm-generic/unistd.h |   5 +-
 24 files changed, 1654 insertions(+), 19 deletions(-)
 create mode 100644 include/linux/gmem.h
 create mode 100644 mm/gmem.c
 create mode 100644 mm/vm_object.c

Comments

Dave Airlie Nov. 29, 2023, 5:14 a.m. UTC | #1
On Tue, 28 Nov 2023 at 23:07, Christian König <christian.koenig@amd.com> wrote:
>
> Am 28.11.23 um 13:50 schrieb Weixi Zhu:
> > [...]
> > GMEM has changed the following assumptions in Linux MM:
> >    1. An mm_struct not only handle a single CPU context, but may also handle
> >       external memory contexts encapsulated as gm_context listed in
> >       mm->gm_as. An external memory context can include a few or all of the
> >       following parts: an external MMU (that requires TLB invalidation), an
> >       external page table (that requires PTE manipulation) and external DRAM
> >       (that requires physical memory management).
>
> Well that is pretty much exactly what AMD has already proposed with KFD
> and was rejected for rather good reasons.

> >
> > MMU functions
> > The MMU functions peer_map() and peer_unmap() overlap other functions,
> > leaving a question if the MMU functions should be decoupled as more basic
> > operations. Decoupling them could potentially prevent device drivers
> > coalescing these basic steps within a single host-device communication
> > operation, while coupling them makes it more difficult for device drivers
> > to utilize GMEM interface.
>
> Well to be honest all of this sounds like history to me. We have already
> seen the same basic approach in KFD, HMM and to some extend in TTM as well.
>
> And all of them more or less failed. Why should this here be different?


Any info we have on why this has failed to work in the past would be
useful to provide. This is one of those cases where we may not have
documented the bad ideas to stop future developers from thinking they
are bad.

I do think we would want more common code in this area, but I would
think we'd have it more on the driver infrastructure side, than in the
core mm.

Dave.
zhuweixi Nov. 29, 2023, 8:27 a.m. UTC | #2
Glad to hear that more sharable code is desirable. 
IMHO, for a common MM subsystem, it is more beneficial for 
GMEM to extend core MM instead of building a separate one.

As stated in the beginning of my RFC letter, MM systems are 
large and similar. Even a sophisticated one like Linux MM
that has evolved over decades still suffers from an increasing 
number of bugs[1]. So, directly extending core MM to support
devices not only avoids opening a new box of bugs, but also 
allows the community to concentrate on maintaining one single 
MM system. On the other hand, GMEM does not hurt core MM
if a CPU process is not attached to device contexts.

@Christian, could you provide more information on what AMD
proposed with KFD and why it was rejected?

[1] Huang, Jian, Moinuddin K. Qureshi, and Karsten Schwan. "An evolutionary study of linux memory management for fun and profit." 2016 USENIX Annual Technical Conference (USENIX ATC 16). 2016.

Thanks,
Weixi

Christian König Nov. 29, 2023, 3:22 p.m. UTC | #3
Am 29.11.23 um 09:27 schrieb zhuweixi:
> Glad to hear that more sharable code is desirable.
> IMHO, for a common MM subsystem, it is more beneficial for
> GMEM to extend core MM instead of building a separate one.
>
> As stated in the beginning of my RFC letter, MM systems are
> large and similar. Even a sophisticated one like Linux MM
> that has evolved over decades still suffers from an increasing
> number of bugs[1]. So, directly extending core MM to support
> devices not only avoids opening a new box of bugs, but also
> allows the community to concentrate on maintaining one single
> MM system. On the other side, GMEM does no hurt to core MM
> If a CPU process is not attached with device contexts.
>
> @Christian, could you provide more information on what AMD
> proposed with KFD and why it was rejected?

Well, this is going to be a longer explanation.

The combination of KFD and HMM is based on essentially the same idea as
this code here. Even the initial KFD implementation was very similar in
the sense that it added device contexts to mm_struct and tried to manage
GPU/acceleration MM the same way as CPU MM. In other words it was
basically identical to your gm_dev_create() and gm_mmu approach.

As mentioned before this initial proposal was rejected, for more 
background see the discussion around "amdkfd: Add amdkfd skeleton 
driver" on the dri-devel mailing list between 2013 and 2014. You need to 
dig up the whole discussion from the mailing list, but summarizing it 
the general feeling was that it would be a mistake to tie device drivers 
too close to CPU memory management (and stable UAPI) without validating
that this is really the right thing to do.

So instead of the original implementation KFD has gone upstream with a 
much less invasive approach where device contexts are just looked up on
demand for each mm_struct. Felix can probably provide some pointers
to the implementation.

On the initially supported hardware the KFD used the PCIe ATC feature to
allow routing of memory accesses directly into the associated CPU
process address space; later on we switched to an MMU notifier/HMM based
approach to give similar functionality to the userspace stack on top of
it for devices which don't support ATC. The ATC path was just recently
completely removed and we are now only using MMU notifiers/HMM.

HMM tried to add functionality similar to what you propose with the mmap()
flag and hmadvise() call. The hmadvise() extension actually looks so 
familiar to the HMM proposal that I would expect that this is actually 
based on that code.

All this turned out to have some major design issues.

First of all you have a rather large group of use cases where you don't 
want your device to mirror the address space of your process. Just think 
of things like QEMU, KVM, XEN, in general virtualization and container
handling. Linux has the mantra that everything is a file and if it's not 
a file it should be a file and when you tie device memory management 
into CPU memory management you are pretty much violating exactly that.

Second this doesn't integrate well with the filesystem layer in Linux. 
For example we do have struct pages for HMM exposed device memory, but 
for I/O we still migrate this back to system memory because of (for 
example) the page lock requirements around writeback.

Then third, it turned out that the requirements for CPU address space
management and device address space management are just massively 
different. For example huge and giant pages are a must have for modern 
devices, on the CPU side we are barely switching over to folios now to 
add similar functionality.

The argument that shared memory management leads to fewer bugs has also
absolutely not been proven true. Instead we literally spent months if not
years hunting down bugs which resulted from interactions between CPU and
devices.
...

There are a couple of more things on this contra side to that approach, 
but I think that would just make this mail unnecessary long.

To sum it up from over a decade of experience working in this area I can 
just say that CPU and device memory management should absolutely *NOT* 
be mixed. We had those ideas multiple times before, but they either 
failed because they didn't integrate well with the core OS or because the
hardware support was just lagging behind the actual requirements.

What can be done and where I completely agree with Dave is that having 
common components which provide device drivers with the necessary
functionality to manage their device address space is a really good idea.
Danilo is for example working on a GPUVM component to have common 
virtual address space management and I'm at least sometimes working on 
MMU notifier/HMM improvements.

Providing SVM functionality to your userspace stack is still a really 
good idea, but it should be done with MMU notifiers and components which 
are separate to your CPU memory management instead of tying it directly 
to the CPU address space.

Regards,
Christian.

Zeng, Oak Nov. 29, 2023, 10:23 p.m. UTC | #4
Hi Weixi,

Even though Christian has listed reasons for rejecting this proposal (and they are very reasonable to me), I would like to keep an open mind and further explore the possibility here. Since current GPU drivers use an hmm-based implementation (AMD and NV have done this; at Intel we are catching up), I want to explore how much we can benefit from the proposed approach and how your approach can solve some pain points of our development. So basically what I am questioning here is: what is the advantage of your approach over hmm?

To implement a UVM (unified virtual address space b/t cpu and gpu device) with hmm, the driver essentially needs to implement the functions below:

1. Device page table update. Your approach requires the same because this is device-specific code.

2. Migration functions to migrate memory b/t system memory and GPU local memory. My understanding is, even though you generalized this a bit (e.g. modified the cpu page fault path, provided a "general" gm_dev_fault handler), the device driver still needs to provide migration functions because they have to be device specific (i.e., using the device dma/copy engine for performance). Right?

3. GPU physical memory management. This part is now in drm/buddy, shared by all drivers. I think with your approach, the driver still needs to provide callback functions to allocate/free physical pages. Right? Or do you let linux core mm buddy manage device memory directly?

4. madvise/hints/virtual address range management. This has been a pain point for us. Right now the device driver has to maintain its own virtual address range data structure to track hints and other range-based memory attributes. The driver needs to sync with linux vmas and explicitly deal with range split/merging... HMM doesn't provide support in this area. Your approach seems cleaner/simpler to me...


So above I have examined some key factors of a gpu UVM memory manager. I think for #1 and #2, hmm provides pretty good abstractions/tools for address space mirroring and migration helpers. For #3, since we have a common drm/buddy layer, I don't think it is a big problem for driver writers now.
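
For reference, the hmm_range_fault() mirroring loop that covers #1/#2, adapted
from Documentation/mm/hmm.rst, looks roughly like this (driver locking and
device page-table details are omitted):

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

int driver_mirror_range(struct mmu_interval_notifier *notifier,
                        unsigned long start, unsigned long end,
                        unsigned long *pfns)
{
        struct hmm_range range = {
                .notifier       = notifier,
                .start          = start,
                .end            = end,
                .hmm_pfns       = pfns,
                .default_flags  = HMM_PFN_REQ_FAULT,
        };
        int ret;

again:
        range.notifier_seq = mmu_interval_read_begin(notifier);
        mmap_read_lock(notifier->mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(notifier->mm);
        if (ret) {
                if (ret == -EBUSY)
                        goto again;
                return ret;
        }
        /* take the driver page-table lock here */
        if (mmu_interval_read_retry(notifier, range.notifier_seq))
                goto again;     /* drop the driver lock before retrying */
        /* translate range.hmm_pfns[] into device PTEs, then drop the lock */
        return 0;
}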

I do see #4 as something you solved more beautifully, though it requires a new system call.

Oak


zhuweixi Nov. 30, 2023, 7:22 a.m. UTC | #5
Add @Oak to the KFD discussion. I will reply separately to elaborate on your questions about GMEM's differences from HMM/MMU notifiers.

Christian, thanks for pointing me to that AMDKFD discussion. I have read the discussion around the AMDKFD skeleton patch and found the previous discussion in the following URLs:
https://lore.kernel.org/dri-devel/1405028848-5660-1-git-send-email-oded.gabbay@amd.com/#r
https://lore.kernel.org/dri-devel/20140711154231.GB1870@gmail.com/

I believe AMDKFD's original patch was rejected mostly because it inserted vendor-specific stuff into the generic core MM. Jérôme clearly stated this issue in the second URL. If the code is vendor-specific then it has no place in core MM, period.

But how does that vendor-specific solution relate to a generalized solution like GMEM? The initial AMDKFD patch doesn't work for Nvidia or Intel.

In fact I think the rejection of the initial AMDKFD patch supports GMEM's idea -- there could have been a simpler AMDKFD implementation if the core MM had been extended by GMEM. Also, after 9 years, many other companies have been building their own accelerators, especially now that the GPT family has made a much bigger success. Don't we want to advance Linux's core MM toward more friendly and generalized support for the upcoming new vendors?

Now answering Christian's design concerns:

1. "There are cases that do not want to share CPU address space"
Maybe, but I am not fully convinced. The current case we can find is when a NIC utilizes the IOMMU for security. For this case, GMEM implemented generalized VMA support and tested it with NICs using both Intel-IOMMU and Arm-SMMU. This cut 600 LoC of IOVA management code from the IOMMU driver, but it is still not included in this RFC patch -- I cannot find other cases demanding this isolation. The isolation is also unnecessary -- the NIC can enable the IOMMU SVM feature to share the CPU address space. As for KVM, it is essentially a host process that utilizes two different MMUs within the same address space, so it fits GMEM's design...

2. "This does not integrate well with the filesystem layer in Linux..."
To be honest, not using a logical page table for anonymous memory is why Linux THP fails compared with FreeBSD's superpages, but I am not going to elaborate on it here. But yes, I am looking into merging struct vm_object->logical_page_table with struct address_space->i_pages. This would naturally support devices oversubscribing both host DRAM and disks. As explained in my cover letter, struct vm_object borrows FreeBSD's VM design -- it provides a unified abstraction layer for anonymous memory, file-backed memory, etc.

3. "The requirements for CPU address space management and device address space management are just massively different. For example huge and giant pages are a must-have for modern devices..."
I think you are asking two questions. First, is VA space a problem? GMEM assumes that the device VA space should be covered by the CPU VA space (sorry, i386). Should we consider devices using more VA bits than the CPU (64-bit)? Second, yes, modern accelerators definitely demand large pages. From my experience, both Nvidia GPUs and Huawei Ascend NPUs suffer from performance issues when using page sizes smaller than 2MB. However, GMEM does not stop a device from using a different page size. A device can choose a 64KB page size running on an x86 host, and GMEM will still work -- whether the CPU page fault goes down the 2MB-THP or the 4KB path, GMEM looks up struct vm_object to examine whether a virtual-to-physical mapping exists in the device page table. If the faulted VA is covered by a 64KB device mapping, at least one 4KB sub-page must be migrated and the 64KB device mapping must be invalidated. The device can either keep the remaining 15 4KB physical pages and create 15 "contiguous" (with a hole) 4KB mappings, or simply wait for the next device page fault to migrate one 4KB page and install a 64KB mapping. The policy is left for the device to choose, but the mechanisms are provided by GMEM. So, the current assumption of GMEM is just that device page sizes must be multiples of the CPU base page size.
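
For illustration, here is a minimal sketch (pseudo-kernel C) of the lookup described above on the CPU fault path. struct vm_object, mm->vm_obj->logical_page_table and struct gm_mapping are from this patch set; the helper names and the exact xarray indexing below are assumptions made for the example:

/*
 * Hypothetical helper called early in the anonymous fault path of a
 * MAP_PEER_SHARED VMA. Returns true if the VA was found on a device
 * and has been migrated back; false means "proceed as usual".
 */
static bool peer_shared_check_fault(struct vm_fault *vmf)
{
	struct vm_object *obj = vmf->vma->vm_mm->vm_obj;
	/* assumed: the logical page table is indexed by the virtual PFN */
	struct gm_mapping *gm = xa_load(&obj->logical_page_table,
					vmf->address >> PAGE_SHIFT);

	if (!gm || !gm_mapping_on_device(gm))	/* hypothetical helper */
		return false;

	/*
	 * The VA is backed by device memory, possibly under a larger
	 * (e.g. 64KB) device mapping: migrate at least the faulting 4KB
	 * sub-page to host DRAM and let the driver invalidate the
	 * covering device PTE through its gm_mmu callbacks.
	 */
	gmem_migrate_to_host(vmf, gm);		/* hypothetical */
	return true;
}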

4. "The argument that shared memory management leads to fewer bugs has also absolutely not been proven true. Instead, we literally spent months if not years hunting down bugs which resulted from the interaction between CPU and devices."
This is another case supporting GMEM. Don't developers want to let GMEM handle the CPU-device interaction so that they can save those months of debugging cost?

PS: hmadvise() is based on the idea of Nvidia's cudaMemAdvise(), which provides abundant and useful memory policies. HMM extended mbind() instead.
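
For reference, a minimal user-space sketch of how the proposed API would be used, in the spirit of cudaMemAdvise()/cudaMemPrefetchAsync(). hmadvise(NUMA_id, va_start, size, memory_hint), MAP_PEER_SHARED and MADV_PREFETCH are from this RFC; the syscall number, the flag values and the NPU node id below are placeholders, not real definitions:

#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MAP_PEER_SHARED
#define MAP_PEER_SHARED  0x8000000   /* placeholder value */
#endif
#ifndef MADV_PREFETCH
#define MADV_PREFETCH    100         /* placeholder value */
#endif
#ifndef __NR_hmadvise
#define __NR_hmadvise    460         /* placeholder syscall number */
#endif

static long hmadvise(int nid, void *va, size_t size, int hint)
{
	return syscall(__NR_hmadvise, nid, va, size, hint);
}

int main(void)
{
	size_t sz = 1UL << 30;
	int npu_node = 2;   /* device registered as a NUMA node (assumed id) */

	/* VA shared coherently between the CPU and all attached devices. */
	char *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_PEER_SHARED, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	buf[0] = 1;   /* touch from the CPU */

	/* Hint: migrate the physical data of [buf, buf+sz) to the NPU node. */
	hmadvise(npu_node, buf, sz, MADV_PREFETCH);

	/* ... launch accelerator work that reads/writes buf ... */
	return 0;
}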

-Weixi

-----Original Message-----
From: Christian König <christian.koenig@amd.com> 
Sent: Wednesday, November 29, 2023 11:22 PM
To: zhuweixi <weixi.zhu@huawei.com>; Dave Airlie <airlied@gmail.com>
Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; weixi.zhu@openeuler.sh; mgorman@suse.de; jglisse@redhat.com; rcampbell@nvidia.com; jhubbard@nvidia.com; apopple@nvidia.com; mhairgrove@nvidia.com; ziy@nvidia.com; alexander.deucher@amd.com; Xinhui.Pan@amd.com; amd-gfx@lists.freedesktop.org; Felix.Kuehling@amd.com; ogabbay@kernel.org; dri-devel@lists.freedesktop.org; jgg@nvidia.com; leonro@nvidia.com; zhenyuw@linux.intel.com; zhi.a.wang@intel.com; intel-gvt-dev@lists.freedesktop.org; intel-gfx@lists.freedesktop.org; jani.nikula@linux.intel.com; joonas.lahtinen@linux.intel.com; rodrigo.vivi@intel.com; tvrtko.ursulin@linux.intel.com
Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

Am 29.11.23 um 09:27 schrieb zhuweixi:
> Glad to hear that more sharable code is desirable.
> IMHO, for a common MM subsystem, it is more beneficial for GMEM to 
> extend core MM instead of building a separate one.
>
> As stated in the beginning of my RFC letter, MM systems are large and 
> similar. Even a sophisticated one like Linux MM that has evolved over 
> decades still suffers from an increasing number of bugs[1]. So, 
> directly extending core MM to support devices not only avoids opening 
> a new box of bugs, but also allows the community to concentrate on 
> maintaining one single MM system. On the other hand, GMEM does not hurt
> core MM if a CPU process is not attached to any device context.
>
> @Christian, could you provide more information on what AMD proposed 
> with KFD and why it was rejected?

Well, this is going to be a longer explanation.

The combination of KFD and HMM is essentially based on the same idea as this code here. Even the initial KFD implementation was very similar in the sense that it added device contexts to mm_struct and tried to manage GPU/acceleration MM the same way as CPU MM. In other words, it was basically identical to your gm_dev_create() and gm_mmu approach.

As mentioned before, this initial proposal was rejected; for more background see the discussion around "amdkfd: Add amdkfd skeleton driver" on the dri-devel mailing list between 2013 and 2014. You need to dig up the whole discussion from the mailing list, but summarizing it, the general feeling was that it would be a mistake to tie device drivers too closely to CPU memory management (and stable UAPI) without validating that this is really the right thing to do.

So instead of the original implementation, KFD went upstream with a much less invasive approach where device contexts are just looked up on demand for each mm_struct. Felix can probably provide some pointers to the implementation.

On the initially supported hardware, the KFD used the PCIe ATC feature to allow routing of memory accesses directly into the associated CPU process address space. Later on we switched to an MMU notifier/HMM based approach to give similar functionality to the userspace stack on top of it for devices which don't support ATC; the ATC path was just recently completely removed and we are now only using MMU notifiers/HMM.

HMM tried to add functionality similar to what you propose with the mmap() flag and hmadvise() call. The hmadvise() extension actually looks so similar to the HMM proposal that I would expect it is actually based on that code.

All this turned out to have some major design issues.

First of all, you have a rather large group of use cases where you don't want your device to mirror the address space of your process. Just think of things like QEMU, KVM, Xen, and virtualization and container handling in general. Linux has the mantra that everything is a file, and if it's not a file it should be a file; when you tie device memory management into CPU memory management you are pretty much violating exactly that.

Second, this doesn't integrate well with the filesystem layer in Linux. For example, we do have struct pages for HMM-exposed device memory, but for I/O we still migrate this back to system memory because of (for example) the page lock requirements around writeback.

Third, it turned out that the requirements for CPU address space management and device address space management are just massively different. For example, huge and giant pages are a must-have for modern devices, while on the CPU side we are only now switching over to folios to add similar functionality.

The argument that shared memory management leads to fewer bugs has also absolutely not been proven true. Instead, we literally spent months if not years hunting down bugs which resulted from the interaction between CPU and devices.
...

There are a couple more things on the contra side of that approach, but I think listing them would just make this mail unnecessarily long.

To sum it up, from over a decade of experience working in this area I can just say that CPU and device memory management should absolutely *NOT* be mixed. We have had those ideas multiple times before, but they either failed because they didn't integrate well with the core OS, or because the hardware support is just lagging behind the actual requirements.

What can be done, and where I completely agree with Dave, is that having common components which provide device drivers with the necessary functionality to manage their device address space is a really good idea. Danilo is for example working on a GPUVM component for common virtual address space management, and I'm at least sometimes working on MMU notifier/HMM improvements.

Providing SVM functionality to your userspace stack is still a really good idea, but it should be done with MMU notifiers and components which are separate from your CPU memory management, instead of tying it directly to the CPU address space.
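
For readers unfamiliar with that pattern, here is a compressed sketch of the MMU notifier/HMM mirroring approach referred to above: register an interval notifier for a VA range, fault/collect the CPU pages with hmm_range_fault(), then program the device page table after re-checking the notifier sequence. Error handling, the retry loop and most locking are omitted; treat it as an outline rather than working driver code (the drv_* names are made up):

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>

static bool drv_invalidate(struct mmu_interval_notifier *mni,
			   const struct mmu_notifier_range *range,
			   unsigned long cur_seq)
{
	/* zap the matching device PTEs here, then record the new seq */
	mmu_interval_set_seq(mni, cur_seq);
	return true;
}

static const struct mmu_interval_notifier_ops drv_mni_ops = {
	.invalidate = drv_invalidate,
};

static int drv_mirror_range(struct mmu_interval_notifier *mni,
			    unsigned long start, unsigned long npages,
			    unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier	= mni,
		.start		= start,
		.end		= start + (npages << PAGE_SHIFT),
		.hmm_pfns	= pfns,
		.default_flags	= HMM_PFN_REQ_FAULT,
	};
	int ret;

	ret = mmu_interval_notifier_insert(mni, current->mm, start,
					   npages << PAGE_SHIFT, &drv_mni_ops);
	if (ret)
		return ret;

	range.notifier_seq = mmu_interval_read_begin(mni);
	mmap_read_lock(current->mm);
	ret = hmm_range_fault(&range);	/* fault in and collect CPU pages */
	mmap_read_unlock(current->mm);
	if (ret)
		return ret;

	/* ...check mmu_interval_read_retry() and program the device PTEs... */
	return 0;
}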

Regards,
Christian.

>
> [1] Huang, Jian, Moinuddin K. Qureshi, and Karsten Schwan. "An evolutionary study of linux memory management for fun and profit." 2016 USENIX Annual Technical Conference (USENIX ATC 16). 2016.
>
> Thanks,
> Weixi
>
> -----Original Message-----
> From: Dave Airlie <airlied@gmail.com>
> Sent: Wednesday, November 29, 2023 1:15 PM
> To: Christian König <christian.koenig@amd.com>
> Cc: zhuweixi <weixi.zhu@huawei.com>; linux-mm@kvack.org; 
> linux-kernel@vger.kernel.org; akpm@linux-foundation.org; 
> weixi.zhu@openeuler.sh; mgorman@suse.de; jglisse@redhat.com; 
> rcampbell@nvidia.com; jhubbard@nvidia.com; apopple@nvidia.com; 
> mhairgrove@nvidia.com; ziy@nvidia.com; alexander.deucher@amd.com; 
> Xinhui.Pan@amd.com; amd-gfx@lists.freedesktop.org; 
> Felix.Kuehling@amd.com; ogabbay@kernel.org; 
> dri-devel@lists.freedesktop.org; jgg@nvidia.com; leonro@nvidia.com; 
> zhenyuw@linux.intel.com; zhi.a.wang@intel.com; 
> intel-gvt-dev@lists.freedesktop.org; intel-gfx@lists.freedesktop.org; 
> jani.nikula@linux.intel.com; joonas.lahtinen@linux.intel.com; 
> rodrigo.vivi@intel.com; tvrtko.ursulin@linux.intel.com
> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory 
> management) for external memory devices
>
> On Tue, 28 Nov 2023 at 23:07, Christian König <christian.koenig@amd.com> wrote:
>> Am 28.11.23 um 13:50 schrieb Weixi Zhu:
>>> The problem:
>>>
>>> Accelerator driver developers are forced to reinvent external MM 
>>> subsystems case by case, because Linux core MM only considers host memory resources.
>>> These reinvented MM subsystems have similar orders of magnitude of 
>>> LoC as Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has 14K and 
>>> Huawei NPU has 30K. Meanwhile, more and more vendors are 
>>> implementing their own accelerators, e.g. Microsoft's Maia 100. At 
>>> the same time, application-level developers suffer from poor 
>>> programmability -- they must consider parallel address spaces and be 
>>> careful about the limited device DRAM capacity. This can be 
>>> alleviated if a malloc()-ed virtual address can be shared by the 
>>> accelerator, or the abundant host DRAM can further transparently backup the device local memory.
>>>
>>> These external MM systems share similar mechanisms except for the 
>>> hardware-dependent part, so reinventing them is effectively 
>>> introducing redundant code (14K~70K for each case). Such 
>>> developing/maintaining is not cheap. Furthermore, to share a 
>>> malloc()-ed virtual address, device drivers need to deeply interact 
>>> with Linux MM via low-level MM APIs, e.g. MMU notifiers/HMM. This 
>>> raises the bar for driver development, since developers must 
>>> understand how Linux MM works. Further, it creates code maintenance 
>>> problems -- any changes to Linux MM potentially require coordinated changes to accelerator drivers using low-level MM APIs.
>>>
>>> Putting a cache-coherent bus between host and device will not make 
>>> these external MM subsystems disappear. For example, a 
>>> throughput-oriented accelerator will not tolerate executing heavy 
>>> memory access workload with a host MMU/IOMMU via a remote bus. 
>>> Therefore, devices will still have their own MMU and pick a simpler 
>>> page table format for lower address translation overhead, requiring external MM subsystems.
>>>
>>> --------------------
>>>
>>> What GMEM (Generalized Memory Management [1]) does:
>>>
>>> GMEM extends Linux MM to share its machine-independent MM code. Only 
>>> high-level interface is provided for device drivers. This prevents 
>>> accelerator drivers from reinventing the wheel, but relies on 
>>> drivers to implement their hardware-dependent functions declared by 
>>> GMEM. GMEM's key interface include gm_dev_create(), gm_as_create(), 
>>> gm_as_attach() and gm_dev_register_physmem(). Here briefly describe 
>>> how a device driver utilizes them:
>>> 1. At boot time, call gm_dev_create() and registers the implementation of
>>>      hardware-dependent functions as declared in struct gm_mmu.
>>>        - If the device has local DRAM, call gm_dev_register_physmem() to
>>>          register available physical addresses.
>>> 2. When a device context is initialized (e.g. triggered by ioctl), check if
>>>      the current CPU process has been attached to a gmem address space
>>>      (struct gm_as). If not, call gm_as_create() and point current->mm->gm_as
>>>      to it.
>>> 3. Call gm_as_attach() to attach the device context to a gmem address space.
>>> 4. Invoke gm_dev_fault() to resolve a page fault or prepare data before
>>>      device computation happens.
>>>
>>> GMEM has changed the following assumptions in Linux MM:
>>>     1. An mm_struct not only handle a single CPU context, but may also handle
>>>        external memory contexts encapsulated as gm_context listed in
>>>        mm->gm_as. An external memory context can include a few or all of the
>>>        following parts: an external MMU (that requires TLB invalidation), an
>>>        external page table (that requires PTE manipulation) and external DRAM
>>>        (that requires physical memory management).
>> Well that is pretty much exactly what AMD has already proposed with 
>> KFD and was rejected for rather good reasons.
>>> MMU functions
>>> The MMU functions peer_map() and peer_unmap() overlap other 
>>> functions, leaving a question if the MMU functions should be 
>>> decoupled as more basic operations. Decoupling them could 
>>> potentially prevent device drivers coalescing these basic steps 
>>> within a single host-device communication operation, while coupling 
>>> them makes it more difficult for device drivers to utilize GMEM interface.
>> Well to be honest all of this sounds like history to me. We have 
>> already seen the same basic approach in KFD, HMM and to some extent in TTM as well.
>>
>> And all of them more or less failed. Why should this here be different?
>
> Any info we have on why this has failed to work in the past would be 
> useful to provide. This is one of those cases where we may not have 
> documented the bad ideas to stop future developers from thinking they 
> are bad.
>
> I do think we would want more common code in this area, but I would 
> think we'd have it more on the driver infrastructure side, than in the 
> core mm.
>
> Dave.
Christian König Nov. 30, 2023, 8:27 a.m. UTC | #6
Hi Oak,

yeah, #4 is indeed a really good point and I think Felix will agree to 
that as well.

HMM is basically still missing a way to advise device attributes for the CPU address space. Both the migration strategy and device-specific information (like cache preferences) fall into this category.

Since there is a device-specific component in those attributes as well, I think device-specific IOCTLs still make sense to update them, but HMM should offer the functionality to manage and store that information.

Split and merge of VMAs only become a problem if you attach that information to VMAs; if you keep it completely separate, then it doesn't become an issue either. The downside of this approach is that you don't get automatically extending attribute ranges for growing VMAs, for example.
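
As a rough illustration of that split, a per-mm (or per-device-context) range store could look like the sketch below: a maple tree keyed by VA holding the hints, written by device IOCTLs and read by the migration code, never touched by VMA split/merge. The struct, the field names and where the tree lives are assumptions for the example, not existing HMM API:

#include <linux/maple_tree.h>

struct dev_attr {
	int preferred_nid;	/* preferred placement for migration */
	unsigned int flags;	/* e.g. cache preference bits */
};

/* one tree per mm or per device context, initialized with mt_init() */
static int dev_attr_set(struct maple_tree *mt, unsigned long start,
			unsigned long end, struct dev_attr *attr)
{
	return mtree_store_range(mt, start, end - 1, attr, GFP_KERNEL);
}

static struct dev_attr *dev_attr_lookup(struct maple_tree *mt,
					unsigned long addr)
{
	return mtree_load(mt, addr);
}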

Regards,
Christian.

Am 29.11.23 um 23:23 schrieb Zeng, Oak:
> Hi Weixi,
>
> Even though Christian has listed reasons rejecting this proposal (yes they are very reasonable to me), I would open my mind and further explore the possibility here. Since the current GPU driver uses a hmm based implementation (AMD and NV has done this; At Intel we are catching up), I want to explore how much we can benefit from the proposed approach and how your approach can solve some pain points of our development. So basically what I am questioning here is: what is the advantage of your approach against hmm.
>
> To implement a UVM (unified virtual address space b/t cpu and gpu device), with hmm, driver essentially need to implement below functions:
>
> 1. device page table update. Your approach requires the same because this is device specific codes
>
> 2. Some migration functions to migrate memory b/t system memory and GPU local memory. My understanding is, even though you generalized this a bit, such as modified cpu page fault path, provided "general" gm_dev_fault handler... but device driver still need to provide migration functions because migration functions have to be device specific (i.e., using device dma/copy engine for performance purpose). Right?
>
> 3. GPU physical memory management, this part is now in drm/buddy, shared by all drivers. I think with your approach, driver still need to provide callback functions to allocate/free physical pages. Right? Or do you let linux core mm buddy manage device memory directly?
>
> 4. madvise/hints/virtual address range management. This has been pain point for us. Right now device driver has to maintain certain virtual address range data structure to maintain hints and other virtual address range based memory attributes. Driver need to sync with linux vma. Driver need to explicitly deal with range split/merging... HMM doesn't provide support in this area. Your approach seems cleaner/simpler to me...
>
>
> So in above, I have examined the some key factors of a gpu UVM memory manager. I think for #1 and #2, hmm has provide pretty good abstraction/tools for address space mirroring and migration helpers. For #3, since we have a common drm/buddy layer, I don't think it is a big problem for driver writer now.
>
> I do see #4 is something you solved more beautifully, requires new system call though.
>
> Oak
>
>
>> -----Original Message-----
>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>> Christian König
>> Sent: Tuesday, November 28, 2023 8:09 AM
>> To: Weixi Zhu <weixi.zhu@huawei.com>; linux-mm@kvack.org; linux-
>> kernel@vger.kernel.org; akpm@linux-foundation.org; Danilo Krummrich
>> <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter
>> <daniel@ffwll.ch>
>> Cc: dri-devel@lists.freedesktop.org; leonro@nvidia.com; apopple@nvidia.com;
>> amd-gfx@lists.freedesktop.org; mgorman@suse.de; ziy@nvidia.com; Wang, Zhi
>> A <zhi.a.wang@intel.com>; rcampbell@nvidia.com; jgg@nvidia.com;
>> weixi.zhu@openeuler.sh; jhubbard@nvidia.com; intel-gfx@lists.freedesktop.org;
>> mhairgrove@nvidia.com; jglisse@redhat.com; Vivi, Rodrigo
>> <rodrigo.vivi@intel.com>; intel-gvt-dev@lists.freedesktop.org;
>> tvrtko.ursulin@linux.intel.com; Felix.Kuehling@amd.com; Xinhui.Pan@amd.com;
>> alexander.deucher@amd.com; ogabbay@kernel.org
>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>> management) for external memory devices
>>
>> Adding a few missing important people to the explicit to list.
>>
>> Am 28.11.23 um 13:50 schrieb Weixi Zhu:
>>> The problem:
>>>
>>> Accelerator driver developers are forced to reinvent external MM subsystems
>>> case by case, because Linux core MM only considers host memory resources.
>>> These reinvented MM subsystems have similar orders of magnitude of LoC as
>>> Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has 14K and Huawei NPU
>> has
>>> 30K. Meanwhile, more and more vendors are implementing their own
>>> accelerators, e.g. Microsoft's Maia 100. At the same time,
>>> application-level developers suffer from poor programmability -- they must
>>> consider parallel address spaces and be careful about the limited device
>>> DRAM capacity. This can be alleviated if a malloc()-ed virtual address can
>>> be shared by the accelerator, or the abundant host DRAM can further
>>> transparently backup the device local memory.
>>>
>>> These external MM systems share similar mechanisms except for the
>>> hardware-dependent part, so reinventing them is effectively introducing
>>> redundant code (14K~70K for each case). Such developing/maintaining is not
>>> cheap. Furthermore, to share a malloc()-ed virtual address, device drivers
>>> need to deeply interact with Linux MM via low-level MM APIs, e.g. MMU
>>> notifiers/HMM. This raises the bar for driver development, since developers
>>> must understand how Linux MM works. Further, it creates code maintenance
>>> problems -- any changes to Linux MM potentially require coordinated changes
>>> to accelerator drivers using low-level MM APIs.
>>>
>>> Putting a cache-coherent bus between host and device will not make these
>>> external MM subsystems disappear. For example, a throughput-oriented
>>> accelerator will not tolerate executing heavy memory access workload with
>>> a host MMU/IOMMU via a remote bus. Therefore, devices will still have
>>> their own MMU and pick a simpler page table format for lower address
>>> translation overhead, requiring external MM subsystems.
>>>
>>> --------------------
>>>
>>> What GMEM (Generalized Memory Management [1]) does:
>>>
>>> GMEM extends Linux MM to share its machine-independent MM code. Only
>>> high-level interface is provided for device drivers. This prevents
>>> accelerator drivers from reinventing the wheel, but relies on drivers to
>>> implement their hardware-dependent functions declared by GMEM. GMEM's
>> key
>>> interface include gm_dev_create(), gm_as_create(), gm_as_attach() and
>>> gm_dev_register_physmem(). Here briefly describe how a device driver
>>> utilizes them:
>>> 1. At boot time, call gm_dev_create() and registers the implementation of
>>>      hardware-dependent functions as declared in struct gm_mmu.
>>>        - If the device has local DRAM, call gm_dev_register_physmem() to
>>>          register available physical addresses.
>>> 2. When a device context is initialized (e.g. triggered by ioctl), check if
>>>      the current CPU process has been attached to a gmem address space
>>>      (struct gm_as). If not, call gm_as_create() and point current->mm->gm_as
>>>      to it.
>>> 3. Call gm_as_attach() to attach the device context to a gmem address space.
>>> 4. Invoke gm_dev_fault() to resolve a page fault or prepare data before
>>>      device computation happens.
>>>
>>> GMEM has changed the following assumptions in Linux MM:
>>>     1. An mm_struct not only handle a single CPU context, but may also handle
>>>        external memory contexts encapsulated as gm_context listed in
>>>        mm->gm_as. An external memory context can include a few or all of the
>>>        following parts: an external MMU (that requires TLB invalidation), an
>>>        external page table (that requires PTE manipulation) and external DRAM
>>>        (that requires physical memory management).
>>>     2. Faulting a MAP_PRIVATE VMA with no CPU PTE found does not necessarily
>>>        mean that a zero-filled physical page should be mapped. The virtual
>>>        page may have been mapped to an external memory device.
>>>     3. Unmapping a page may include sending device TLB invalidation (even if
>>>        its MMU shares CPU page table) and manipulating device PTEs.
>>>
>>> --------------------
>>>
>>> Semantics of new syscalls:
>>>
>>> 1. mmap(..., MAP_PRIVATE | MAP_PEER_SHARED)
>>>       Allocate virtual address that is shared between the CPU and all
>>>       attached devices. Data is guaranteed to be coherent whenever the
>>>       address is accessed by either CPU or any attached device. If the device
>>>       does not support page fault, then device driver is responsible for
>>>       faulting memory before data gets accessed. By default, the CPU DRAM is
>>>       can be used as a swap backup for the device local memory.
>>> 2. hmadvise(NUMA_id, va_start, size, memory_hint)
>>>       Issuing memory hint for a given VMA. This extends traditional madvise()
>>>       syscall with an extra argument so that programmers have better control
>>>       with heterogeneous devices registered as NUMA nodes. One useful
>> memory
>>>       hint could be MADV_PREFETCH, which guarantees that the physical data of
>>>       the given VMA [VA, VA+size) is migrated to NUMA node #id. Another
>>>       useful memory hint is MADV_DONTNEED. This is helpful to increase device
>>>       memory utilization. It is worth considering extending the existing
>>>       madvise() syscall with one additional argument.
>>>
>>> --------------------
>>>
>>> Implementation details
>>>
>>> 1. New VMA flag: MAP_PEER_SHARED
>>>
>>> This new flag helps isolate GMEM feature, so that common processes with
>>> no device attached does not need to maintain any logical page table. It
>>> can be deleted if the extra overhead from GMEM is acceptable.
>>>
>>> 2. MMU functions
>>> The device driver must implement the MMU functions declared in struct
>>> gm_mmu.
>>>
>>> VA functions: peer_va_alloc_fixed(), peer_va_free()
>>>
>>> They are used to negotiate a common available VMA between a host
>>> process and a device process at the mmap() time. This is because some
>>> accelerators like Intel Xeon Phi or Huawei's Ascend NPU have their
>>> acceleration tasks executed within a device CPU process context. Some
>>> accelerators may also choose a different format of virtual address
>>> space.
>>>
>>> PA functions: alloc_page(), free_page(), prepare_page()
>>>
>>> Alloc_page() and free_page() are used to allocate and free device physical
>>> pages. Prepare_page() is used to zero-fill or DMA the data of a physical
>>> page. These functions were removed from the submitted patch, since GMEM
>>> does not need to invoke them when testing Huawei's NPU accelerator. The
>> NPU
>>> accelerator has an OS running in the device that manages the device
>>> physical memory. However, even for such a device it is better for the host
>>> to directly manage device physical memory, which saves device HBM and
>>> avoids synchronizing management status between the host and device.
>>>
>>> Page-table functions: pmap_create()/destroy()/enter()/release()/protect()
>>>
>>> They are used to create and destroy device page tables, install and
>>> uninstall page table entries and to change the protection of page table
>>> entries.
>>>
>>> TLB-invalidation functions: tlb_invl(), tlb_invl_coalesced()
>>>
>>> They are used to invalidate the TLB entries of a given range of VA or
>>> invalidate a given list of VMAs.
>>>
>>> Wrapper functions: peer_map() and peer_unmap()
>>>
>>> These two functions are used to create or destroy a device mapping which
>>> could include allocating physical memory and copying data. They effectively
>>> wraps the PA functions, Page-table functions and TLB-invalidation
>>> functions. Implementing these steps together allows devices to optimize the
>>> communication cost between host and device. However, it requires the device
>>> driver to correctly order these steps.
>>>
>>> 3. Tracking logical mappings:
>>>
>>> Each process starts maintaining an xarray in mm->vm_obj->logical_page_table
>>> at the first time a host process calls mmap(MAP_PRIVATE |
>> MAP_PEER_SHARED).
>>> When a virtual page gets touched, its mapping status is created and stored
>>> in struct gm_mapping. The logical page table is utilized to query the
>>> struct gm_mapping given a virtual address. GMEM extends Linux MM to
>> update
>>> and lookup these logical mappings. For example, in the patch set we modify
>>> the page fault path of to additionally check the logical mapping of
>>> MAP_PEER_SHARED VMAs and identify if a device page should be migrated.
>>> Similarly, if the device driver wants to resolve a device page fault or
>>> prefetch data, the driver should call gm_dev_fault(). This function
>>> examines the mapping status and determines whether the device driver should
>>> migrate a CPU page to device or install a zero-filled device page.
>>>
>>> The logical mapping abstraction enhances the extensibility of Linux core MM
>>> (a virtual page may be mapped to a device physical page without any CPU PTE
>>> installed). The current implementation is not complete, since it only
>>> focused on anonymous VMAs with MAP_PEER_SHARED flag. The future plan of
>>> logical page table is to provide a generic abstraction layer that support
>>> common anonymous memory (I am looking at you, transparent huge pages)
>> and
>>> file-backed memory.
>>>
>>> --------------------
>>>
>>> Use cases
>>>
>>> GMEM has been tested over Huawei's NPU (neural process unit) device driver.
>>> The original NPU device driver has approximately 30,000 lines of code for
>>> memory management. On the contrary, the GMEM-based one has less than 30
>>> lines of code calling GMEM API, with approximately 3,700 lines of code
>>> implementing the MMU functions. This effectively saves over 26,200 lines
>>> of MM code for one driver. Therefore, developers from accelerator vendors,
>>> including Nvidia, AMD, Intel and other companies are welcome to discuss if
>>> GMEM could be helpful.
>>>
>>> Using GMEM-based driver, it is possible to write a C-style accelerator code
>>> with malloc(), whose underlying mmap() syscall should include
>>> MAP_PEER_SHARED according to current GMEM implementation. Importantly,
>> GMEM
>>> guarantees a coherent view of memory between the host and all attached
>>> devices. This means that any data written by the CPU or any attached
>>> accelerator can be seen by the next memory load instruction issued by any
>>> attached accelerator or the CPU. Furthermore, the NPU device was able to
>>> oversubscribe memory by swapping memory to host DDR. Note that this
>> memory
>>> oversubscription mechanism can be universal if the physical memory
>>> management is provided by GMEM. Other potential use cases of GMEM could
>>> include the IOMMU driver, KVM and RDMA drivers, as long as the device needs
>>> to manage external memory resources like VMAs, MMUs or local DRAMs.
>>>
>>> --------------------
>>>
>>> Discussion
>>>
>>> Physical memory management
>>> Most accelerators require the host OS to manage device DRAM. Even
>>> accelerators capable of running an OS inside the driver can benefit from
>>> it, since it helps avoid synchronizing management status between the host
>>> and device. In Linux OSS EU summit 2023, Hannes Reinecke from SUSE Labs
>>> suggested that people are concerned with the memory consumption of struct
>>> page (which considers all generic scenarios for the kernel). This leads to
>>> a possible solution that, instead of reusing Linux struct page and
>>> ZONE_DEVICE mechanism, GMEM can implement an isolated buddy allocator
>> for
>>> the device to instantiate and register. The isolation is useful because
>>> device DRAM physical address space is independent. Furthermore, the
>>> isolated buddy allocator can utilize a customized struct page that consumes
>>> less memory. It is worth discussing if accelerator vendors desire this
>>> solution.
>>>
>>> MMU functions
>>> The MMU functions peer_map() and peer_unmap() overlap other functions,
>>> leaving a question if the MMU functions should be decoupled as more basic
>>> operations. Decoupling them could potentially prevent device drivers
>>> coalescing these basic steps within a single host-device communication
>>> operation, while coupling them makes it more difficult for device drivers
>>> to utilize GMEM interface.
>>>
>>> The idea of GMEM was originated from Weixi's PhD study with
>>> Prof. Scott Rixner and Prof. Alan L. Cox at Rice University.
>>>
>>> [1] https://arxiv.org/abs/2310.12554.
>>>
>>> Weixi Zhu (6):
>>>     mm/gmem: add heterogeneous NUMA node
>>>     mm/gmem: add arch-independent abstraction to track address mapping
>>>       status
>>>     mm/gmem: add GMEM (Generalized Memory Management) interface for
>>>       external accelerators
>>>     mm/gmem: add new syscall hmadvise() to issue memory hints for
>>>       heterogeneous NUMA nodes
>>>     mm/gmem: resolve VMA conflicts for attached peer devices
>>>     mm/gmem: extending Linux core MM to support unified virtual address
>>>       space
>>>
>>>    arch/arm64/include/asm/unistd.h         |   2 +-
>>>    arch/arm64/include/asm/unistd32.h       |   2 +
>>>    drivers/base/node.c                     |   6 +
>>>    fs/proc/task_mmu.c                      |   3 +
>>>    include/linux/gmem.h                    | 368 ++++++++++++
>>>    include/linux/mm.h                      |   8 +
>>>    include/linux/mm_types.h                |   5 +
>>>    include/linux/nodemask.h                |  10 +
>>>    include/uapi/asm-generic/mman-common.h  |   4 +
>>>    include/uapi/asm-generic/unistd.h       |   5 +-
>>>    init/main.c                             |   2 +
>>>    kernel/fork.c                           |   5 +
>>>    kernel/sys_ni.c                         |   2 +
>>>    mm/Kconfig                              |  14 +
>>>    mm/Makefile                             |   1 +
>>>    mm/gmem.c                               | 746 ++++++++++++++++++++++++
>>>    mm/huge_memory.c                        |  85 ++-
>>>    mm/memory.c                             |  42 +-
>>>    mm/mempolicy.c                          |   4 +
>>>    mm/mmap.c                               |  40 +-
>>>    mm/oom_kill.c                           |   2 +
>>>    mm/page_alloc.c                         |   3 +
>>>    mm/vm_object.c                          | 309 ++++++++++
>>>    tools/include/uapi/asm-generic/unistd.h |   5 +-
>>>    24 files changed, 1654 insertions(+), 19 deletions(-)
>>>    create mode 100644 include/linux/gmem.h
>>>    create mode 100644 mm/gmem.c
>>>    create mode 100644 mm/vm_object.c
>>>
zhuweixi Nov. 30, 2023, 10:48 a.m. UTC | #7
Glad to know that there is a common demand for a new syscall like hmadvise(). I expect it would also be useful for homogeneous NUMA cases. Credit to the cudaMemAdvise() API, which brought this idea to GMEM's design.

To answer @Oak's questions about GMEM vs. HMM,

Here is the major difference:
  GMEM's main target is to stop drivers from reinventing MM code, while HMM/MMU notifiers provide a compatible struct page solution and a coordination mechanism for existing device-driver MMs, which requires adding extra code to interact with CPU MM.

A straightforward qualitative result for the main target: after integrating Huawei's Ascend NPU driver with GMEM's interface, 30,000 lines of MM code were cut, leaving <100 lines invoking the GMEM interface and 3,700 lines implementing vendor-specific functions. Some code from those 3,700 lines should be further moved into GMEM as generalized features, such as device memory oversubscription, but that is not included in this RFC patch yet.

A list of high-level differences: 
  1. With HMM/MMU notifiers, drivers need to first implement a full MM subsystem. With GMEM, drivers can reuse Linux's core MM.

  2. HMM encodes device mapping information in the CPU arch-dependent PTEs, while GMEM proposes an abstraction layer in vm_object. Since GMEM's approach further decouples the arch-related stuff, drivers do not need to implement separate code for x86, ARM, etc.

  3. MMU notifiers register hooks at certain core MM events, while GMEM declares basic functions and internally invokes them (see the sketch after this list). GMEM requires less from the driver side -- there is no need to understand how core MM behaves at certain MMU events. GMEM also expects fewer bugs than MMU notifiers: implementing basic operations with standard declarations vs. implementing whatever random device MM logic in MMU notifiers.

  4. GMEM plans to support a more lightweight physical memory management. The discussion about this part can be found in my cover letter. The question is whether struct page should stay compatible (directly using HMM's ZONE_DEVICE solution), or whether a trimmed, smaller struct page that satisfies the generalized demands of accelerators is preferable.

  5. GMEM has been demonstrated to allow device memory oversubscription (a GMEM-based 32GB NPU card can run a GPT model oversubscribing 500GB of host DDR), while drivers using HMM/MMU notifiers must implement this logic one by one. I will submit this part in a future RFC patch.
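
To illustrate point 3, this is roughly what the driver-side registration could look like. The function groups (VA, page-table, TLB-invalidation and wrapper functions) are the ones listed in my cover letter, but the exact field names, signatures and the npu_* helpers are assumptions made up for this sketch:

static struct gm_mmu npu_mmu = {
	/* VA negotiation, for devices running their own process context */
	.peer_va_alloc_fixed	= npu_va_alloc_fixed,
	.peer_va_free		= npu_va_free,

	/* device page-table manipulation */
	.pmap_create		= npu_pmap_create,
	.pmap_destroy		= npu_pmap_destroy,
	.pmap_enter		= npu_pmap_enter,
	.pmap_release		= npu_pmap_release,
	.pmap_protect		= npu_pmap_protect,

	/* device TLB invalidation */
	.tlb_invl		= npu_tlb_invl,
	.tlb_invl_coalesced	= npu_tlb_invl_coalesced,

	/* coalesced map/unmap wrappers around the operations above */
	.peer_map		= npu_peer_map,
	.peer_unmap		= npu_peer_unmap,
};

/* at probe time (arguments assumed):
 *	dev = gm_dev_create(&npu_mmu, ...);
 *	gm_dev_register_physmem(dev, dram_begin, dram_end);
 */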

I want to reiterate that GMEM's shared address space support is a bonus result, not the main contribution... It was done because it was not difficult to implement an internal CPU-device coordination mechanism once core MM is extended by GMEM to support devices.

-Weixi

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@gmail.com> 
Sent: Thursday, November 30, 2023 4:28 PM
To: Zeng, Oak <oak.zeng@intel.com>; Christian König <christian.koenig@amd.com>; zhuweixi <weixi.zhu@huawei.com>; linux-mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gvt-dev@lists.freedesktop.org; rcampbell@nvidia.com; mhairgrove@nvidia.com; jgg@nvidia.com; weixi.zhu@openeuler.sh; jhubbard@nvidia.com; intel-gfx@lists.freedesktop.org; apopple@nvidia.com; Xinhui.Pan@amd.com; amd-gfx@lists.freedesktop.org; tvrtko.ursulin@linux.intel.com; ogabbay@kernel.org; jglisse@redhat.com; dri-devel@lists.freedesktop.org; ziy@nvidia.com; Vivi, Rodrigo <rodrigo.vivi@intel.com>; alexander.deucher@amd.com; leonro@nvidia.com; Felix.Kuehling@amd.com; Wang, Zhi A <zhi.a.wang@intel.com>; mgorman@suse.de
Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

Hi Oak,

yeah, #4 is indeed a really good point and I think Felix will agree to that as well.

HMM is basically still missing a way to advise device attributes for the CPU address space. Both the migration strategy and device-specific information (like cache preferences) fall into this category.

Since there is a device-specific component in those attributes as well, I think device-specific IOCTLs still make sense to update them, but HMM should offer the functionality to manage and store that information.

Split and merge of VMAs only become a problem if you attach that information to VMAs; if you keep it completely separate, then it doesn't become an issue either. The downside of this approach is that you don't get automatically extending attribute ranges for growing VMAs, for example.

Regards,
Christian.

Am 29.11.23 um 23:23 schrieb Zeng, Oak:
> Hi Weixi,
>
> Even though Christian has listed reasons rejecting this proposal (yes they are very reasonable to me), I would open my mind and further explore the possibility here. Since the current GPU driver uses a hmm based implementation (AMD and NV has done this; At Intel we are catching up), I want to explore how much we can benefit from the proposed approach and how your approach can solve some pain points of our development. So basically what I am questioning here is: what is the advantage of your approach against hmm.
>
> To implement a UVM (unified virtual address space b/t cpu and gpu device), with hmm, driver essentially need to implement below functions:
>
> 1. device page table update. Your approach requires the same because 
> this is device specific codes
>
> 2. Some migration functions to migrate memory b/t system memory and GPU local memory. My understanding is, even though you generalized this a bit, such as modified cpu page fault path, provided "general" gm_dev_fault handler... but device driver still need to provide migration functions because migration functions have to be device specific (i.e., using device dma/copy engine for performance purpose). Right?
>
> 3. GPU physical memory management, this part is now in drm/buddy, shared by all drivers. I think with your approach, driver still need to provide callback functions to allocate/free physical pages. Right? Or do you let linux core mm buddy manage device memory directly?
>
> 4. madvise/hints/virtual address range management. This has been pain point for us. Right now device driver has to maintain certain virtual address range data structure to maintain hints and other virtual address range based memory attributes. Driver need to sync with linux vma. Driver need to explicitly deal with range split/merging... HMM doesn't provide support in this area. Your approach seems cleaner/simpler to me...
>
>
> So in above, I have examined the some key factors of a gpu UVM memory manager. I think for #1 and #2, hmm has provide pretty good abstraction/tools for address space mirroring and migration helpers. For #3, since we have a common drm/buddy layer, I don't think it is a big problem for driver writer now.
>
> I do see #4 is something you solved more beautifully, requires new system call though.
>
> Oak
>
>
>> -----Original Message-----
>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf 
>> Of Christian König
>> Sent: Tuesday, November 28, 2023 8:09 AM
>> To: Weixi Zhu <weixi.zhu@huawei.com>; linux-mm@kvack.org; linux- 
>> kernel@vger.kernel.org; akpm@linux-foundation.org; Danilo Krummrich 
>> <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel Vetter 
>> <daniel@ffwll.ch>
>> Cc: dri-devel@lists.freedesktop.org; leonro@nvidia.com; 
>> apopple@nvidia.com; amd-gfx@lists.freedesktop.org; mgorman@suse.de; 
>> ziy@nvidia.com; Wang, Zhi A <zhi.a.wang@intel.com>; 
>> rcampbell@nvidia.com; jgg@nvidia.com; weixi.zhu@openeuler.sh; 
>> jhubbard@nvidia.com; intel-gfx@lists.freedesktop.org; 
>> mhairgrove@nvidia.com; jglisse@redhat.com; Vivi, Rodrigo 
>> <rodrigo.vivi@intel.com>; intel-gvt-dev@lists.freedesktop.org;
>> tvrtko.ursulin@linux.intel.com; Felix.Kuehling@amd.com; 
>> Xinhui.Pan@amd.com; alexander.deucher@amd.com; ogabbay@kernel.org
>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>> management) for external memory devices
>>
>> Adding a few missing important people to the explicit to list.
>>
>> Am 28.11.23 um 13:50 schrieb Weixi Zhu:
>>> The problem:
>>>
>>> Accelerator driver developers are forced to reinvent external MM 
>>> subsystems case by case, because Linux core MM only considers host memory resources.
>>> These reinvented MM subsystems have similar orders of magnitude of 
>>> LoC as Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has 14K and 
>>> Huawei NPU
>> has
>>> 30K. Meanwhile, more and more vendors are implementing their own 
>>> accelerators, e.g. Microsoft's Maia 100. At the same time, 
>>> application-level developers suffer from poor programmability -- 
>>> they must consider parallel address spaces and be careful about the 
>>> limited device DRAM capacity. This can be alleviated if a 
>>> malloc()-ed virtual address can be shared by the accelerator, or the 
>>> abundant host DRAM can further transparently backup the device local memory.
>>>
>>> These external MM systems share similar mechanisms except for the 
>>> hardware-dependent part, so reinventing them is effectively 
>>> introducing redundant code (14K~70K for each case). Such 
>>> developing/maintaining is not cheap. Furthermore, to share a 
>>> malloc()-ed virtual address, device drivers need to deeply interact 
>>> with Linux MM via low-level MM APIs, e.g. MMU notifiers/HMM. This 
>>> raises the bar for driver development, since developers must 
>>> understand how Linux MM works. Further, it creates code maintenance 
>>> problems -- any changes to Linux MM potentially require coordinated changes to accelerator drivers using low-level MM APIs.
>>>
>>> Putting a cache-coherent bus between host and device will not make 
>>> these external MM subsystems disappear. For example, a 
>>> throughput-oriented accelerator will not tolerate executing heavy 
>>> memory access workload with a host MMU/IOMMU via a remote bus. 
>>> Therefore, devices will still have their own MMU and pick a simpler 
>>> page table format for lower address translation overhead, requiring external MM subsystems.
>>>
>>> --------------------
>>>
>>> What GMEM (Generalized Memory Management [1]) does:
>>>
>>> GMEM extends Linux MM to share its machine-independent MM code. Only 
>>> high-level interface is provided for device drivers. This prevents 
>>> accelerator drivers from reinventing the wheel, but relies on 
>>> drivers to implement their hardware-dependent functions declared by 
>>> GMEM. GMEM's
>> key
>>> interface include gm_dev_create(), gm_as_create(), gm_as_attach() 
>>> and gm_dev_register_physmem(). Here briefly describe how a device 
>>> driver utilizes them:
>>> 1. At boot time, call gm_dev_create() and registers the implementation of
>>>      hardware-dependent functions as declared in struct gm_mmu.
>>>        - If the device has local DRAM, call gm_dev_register_physmem() to
>>>          register available physical addresses.
>>> 2. When a device context is initialized (e.g. triggered by ioctl), check if
>>>      the current CPU process has been attached to a gmem address space
>>>      (struct gm_as). If not, call gm_as_create() and point current->mm->gm_as
>>>      to it.
>>> 3. Call gm_as_attach() to attach the device context to a gmem address space.
>>> 4. Invoke gm_dev_fault() to resolve a page fault or prepare data before
>>>      device computation happens.
>>>
>>> GMEM has changed the following assumptions in Linux MM:
>>>     1. An mm_struct not only handle a single CPU context, but may also handle
>>>        external memory contexts encapsulated as gm_context listed in
>>>        mm->gm_as. An external memory context can include a few or all of the
>>>        following parts: an external MMU (that requires TLB invalidation), an
>>>        external page table (that requires PTE manipulation) and external DRAM
>>>        (that requires physical memory management).
>>>     2. Faulting a MAP_PRIVATE VMA with no CPU PTE found does not necessarily
>>>        mean that a zero-filled physical page should be mapped. The virtual
>>>        page may have been mapped to an external memory device.
>>>     3. Unmapping a page may include sending device TLB invalidation (even if
>>>        its MMU shares CPU page table) and manipulating device PTEs.
>>>
>>> --------------------
>>>
>>> Semantics of new syscalls:
>>>
>>> 1. mmap(..., MAP_PRIVATE | MAP_PEER_SHARED)
>>>       Allocate virtual address that is shared between the CPU and all
>>>       attached devices. Data is guaranteed to be coherent whenever the
>>>       address is accessed by either CPU or any attached device. If the device
>>>       does not support page fault, then device driver is responsible for
>>>       faulting memory before data gets accessed. By default, the CPU DRAM is
>>>       can be used as a swap backup for the device local memory.
>>> 2. hmadvise(NUMA_id, va_start, size, memory_hint)
>>>       Issuing memory hint for a given VMA. This extends traditional madvise()
>>>       syscall with an extra argument so that programmers have better control
>>>       with heterogeneous devices registered as NUMA nodes. One 
>>> useful
>> memory
>>>       hint could be MADV_PREFETCH, which guarantees that the physical data of
>>>       the given VMA [VA, VA+size) is migrated to NUMA node #id. Another
>>>       useful memory hint is MADV_DONTNEED. This is helpful to increase device
>>>       memory utilization. It is worth considering extending the existing
>>>       madvise() syscall with one additional argument.
>>>
>>> --------------------
>>>
>>> Implementation details
>>>
>>> 1. New VMA flag: MAP_PEER_SHARED
>>>
>>> This new flag helps isolate GMEM feature, so that common processes 
>>> with no device attached does not need to maintain any logical page 
>>> table. It can be deleted if the extra overhead from GMEM is acceptable.
>>>
>>> 2. MMU functions
>>> The device driver must implement the MMU functions declared in 
>>> struct gm_mmu.
>>>
>>> VA functions: peer_va_alloc_fixed(), peer_va_free()
>>>
>>> They are used to negotiate a common available VMA between a host 
>>> process and a device process at the mmap() time. This is because 
>>> some accelerators like Intel Xeon Phi or Huawei's Ascend NPU have 
>>> their acceleration tasks executed within a device CPU process 
>>> context. Some accelerators may also choose a different format of 
>>> virtual address space.
>>>
>>> PA functions: alloc_page(), free_page(), prepare_page()
>>>
>>> Alloc_page() and free_page() are used to allocate and free device 
>>> physical pages. Prepare_page() is used to zero-fill or DMA the data 
>>> of a physical page. These functions were removed from the submitted 
>>> patch, since GMEM does not need to invoke them when testing Huawei's 
>>> NPU accelerator. The
>> NPU
>>> accelerator has an OS running in the device that manages the device 
>>> physical memory. However, even for such a device it is better for 
>>> the host to directly manage device physical memory, which saves 
>>> device HBM and avoids synchronizing management status between the host and device.
>>>
>>> Page-table functions: 
>>> pmap_create()/destroy()/enter()/release()/protect()
>>>
>>> They are used to create and destroy device page tables, install and 
>>> uninstall page table entries and to change the protection of page 
>>> table entries.
>>>
>>> TLB-invalidation functions: tlb_invl(), tlb_invl_coalesced()
>>>
>>> They are used to invalidate the TLB entries of a given range of VA 
>>> or invalidate a given list of VMAs.
>>>
>>> Wrapper functions: peer_map() and peer_unmap()
>>>
>>> These two functions are used to create or destroy a device mapping 
>>> which could include allocating physical memory and copying data. 
>>> They effectively wraps the PA functions, Page-table functions and 
>>> TLB-invalidation functions. Implementing these steps together allows 
>>> devices to optimize the communication cost between host and device. 
>>> However, it requires the device driver to correctly order these steps.
>>>
>>> 3. Tracking logical mappings:
>>>
>>> Each process starts maintaining an xarray in 
>>> mm->vm_obj->logical_page_table at the first time a host process 
>>> calls mmap(MAP_PRIVATE |
>> MAP_PEER_SHARED).
>>> When a virtual page gets touched, its mapping status is created and 
>>> stored in struct gm_mapping. The logical page table is utilized to 
>>> query the struct gm_mapping given a virtual address. GMEM extends 
>>> Linux MM to
>> update
>>> and lookup these logical mappings. For example, in the patch set we 
>>> modify the page fault path of to additionally check the logical 
>>> mapping of MAP_PEER_SHARED VMAs and identify if a device page should be migrated.
>>> Similarly, if the device driver wants to resolve a device page fault 
>>> or prefetch data, the driver should call gm_dev_fault(). This 
>>> function examines the mapping status and determines whether the 
>>> device driver should migrate a CPU page to device or install a zero-filled device page.
>>>
>>> The logical mapping abstraction enhances the extensibility of Linux 
>>> core MM (a virtual page may be mapped to a device physical page 
>>> without any CPU PTE installed). The current implementation is not 
>>> complete, since it only focused on anonymous VMAs with 
>>> MAP_PEER_SHARED flag. The future plan of logical page table is to 
>>> provide a generic abstraction layer that support common anonymous 
>>> memory (I am looking at you, transparent huge pages)
>> and
>>> file-backed memory.
>>>
>>> --------------------
>>>
>>> Use cases
>>>
>>> GMEM has been tested over Huawei's NPU (neural process unit) device driver.
>>> The original NPU device driver has approximately 30,000 lines of 
>>> code for memory management. On the contrary, the GMEM-based one has 
>>> less than 30 lines of code calling GMEM API, with approximately 
>>> 3,700 lines of code implementing the MMU functions. This effectively 
>>> saves over 26,200 lines of MM code for one driver. Therefore, 
>>> developers from accelerator vendors, including Nvidia, AMD, Intel 
>>> and other companies are welcome to discuss if GMEM could be helpful.
>>>
>>> Using GMEM-based driver, it is possible to write a C-style 
>>> accelerator code with malloc(), whose underlying mmap() syscall 
>>> should include MAP_PEER_SHARED according to current GMEM 
>>> implementation. Importantly,
>> GMEM
>>> guarantees a coherent view of memory between the host and all 
>>> attached devices. This means that any data written by the CPU or any 
>>> attached accelerator can be seen by the next memory load instruction 
>>> issued by any attached accelerator or the CPU. Furthermore, the NPU 
>>> device was able to oversubscribe memory by swapping memory to host 
>>> DDR. Note that this
>> memory
>>> oversubscription mechanism can be universal if the physical memory 
>>> management is provided by GMEM. Other potential use cases of GMEM 
>>> could include the IOMMU driver, KVM and RDMA drivers, as long as the 
>>> device needs to manage external memory resources like VMAs, MMUs or local DRAMs.
>>>
>>> --------------------
>>>
>>> Discussion
>>>
>>> Physical memory management
>>> Most accelerators require the host OS to manage device DRAM. Even 
>>> accelerators capable of running an OS inside the driver can benefit 
>>> from it, since it helps avoid synchronizing management status 
>>> between the host and device. In Linux OSS EU summit 2023, Hannes 
>>> Reinecke from SUSE Labs suggested that people are concerned with the 
>>> memory consumption of struct page (which considers all generic 
>>> scenarios for the kernel). This leads to a possible solution that, 
>>> instead of reusing Linux struct page and ZONE_DEVICE mechanism, GMEM 
>>> can implement an isolated buddy allocator
>> for
>>> the device to instantiate and register. The isolation is useful 
>>> because device DRAM physical address space is independent. 
>>> Furthermore, the isolated buddy allocator can utilize a customized 
>>> struct page that consumes less memory. It is worth discussing if 
>>> accelerator vendors desire this solution.
>>>
>>> MMU functions
>>> The MMU functions peer_map() and peer_unmap() overlap other 
>>> functions, leaving a question if the MMU functions should be 
>>> decoupled as more basic operations. Decoupling them could 
>>> potentially prevent device drivers coalescing these basic steps 
>>> within a single host-device communication operation, while coupling 
>>> them makes it more difficult for device drivers to utilize GMEM interface.
>>>
>>> The idea of GMEM was originated from Weixi's PhD study with Prof. 
>>> Scott Rixner and Prof. Alan L. Cox at Rice University.
>>>
>>> [1] https://arxiv.org/abs/2310.12554.
>>>
>>> Weixi Zhu (6):
>>>     mm/gmem: add heterogeneous NUMA node
>>>     mm/gmem: add arch-independent abstraction to track address mapping
>>>       status
>>>     mm/gmem: add GMEM (Generalized Memory Management) interface for
>>>       external accelerators
>>>     mm/gmem: add new syscall hmadvise() to issue memory hints for
>>>       heterogeneous NUMA nodes
>>>     mm/gmem: resolve VMA conflicts for attached peer devices
>>>     mm/gmem: extending Linux core MM to support unified virtual address
>>>       space
>>>
>>>    arch/arm64/include/asm/unistd.h         |   2 +-
>>>    arch/arm64/include/asm/unistd32.h       |   2 +
>>>    drivers/base/node.c                     |   6 +
>>>    fs/proc/task_mmu.c                      |   3 +
>>>    include/linux/gmem.h                    | 368 ++++++++++++
>>>    include/linux/mm.h                      |   8 +
>>>    include/linux/mm_types.h                |   5 +
>>>    include/linux/nodemask.h                |  10 +
>>>    include/uapi/asm-generic/mman-common.h  |   4 +
>>>    include/uapi/asm-generic/unistd.h       |   5 +-
>>>    init/main.c                             |   2 +
>>>    kernel/fork.c                           |   5 +
>>>    kernel/sys_ni.c                         |   2 +
>>>    mm/Kconfig                              |  14 +
>>>    mm/Makefile                             |   1 +
>>>    mm/gmem.c                               | 746 ++++++++++++++++++++++++
>>>    mm/huge_memory.c                        |  85 ++-
>>>    mm/memory.c                             |  42 +-
>>>    mm/mempolicy.c                          |   4 +
>>>    mm/mmap.c                               |  40 +-
>>>    mm/oom_kill.c                           |   2 +
>>>    mm/page_alloc.c                         |   3 +
>>>    mm/vm_object.c                          | 309 ++++++++++
>>>    tools/include/uapi/asm-generic/unistd.h |   5 +-
>>>    24 files changed, 1654 insertions(+), 19 deletions(-)
>>>    create mode 100644 include/linux/gmem.h
>>>    create mode 100644 mm/gmem.c
>>>    create mode 100644 mm/vm_object.c
>>>
Christian König Nov. 30, 2023, 1:05 p.m. UTC | #8
Am 30.11.23 um 08:22 schrieb zhuweixi:
> Add @Oak to the KFD discussion. I will reply separately elaborating your questions on GMEM's difference from HMM/MMU notifiers.
>
> Christian, thanks for pointing me to that AMDKFD discussion. I have read the discussion around the AMDKFD skeleton patch and found the previous discussion in the following URLs:
> https://lore.kernel.org/dri-devel/1405028848-5660-1-git-send-email-oded.gabbay@amd.com/#r
> https://lore.kernel.org/dri-devel/20140711154231.GB1870@gmail.com/
>
> I believe AMDKFD's original patch was rejected mostly because of inserting vendor-specific stuff to the generic core MM.  Jérôme has clearly stated this issue in the second URL. If the code is vendor-specific then it has no place in core MM, period.
>
> But why does that vendor-specific solution relate to a generalized solution like GMEM? The initial AMDKFD patch doesn't work for Nvidia or Intel.

KFD was meant to be a vendor agnostic framework, very similar to what 
you propose here.

It's just that it was seen as vendor specific because nobody else 
actually wanted to design their drivers this way.

>
> In fact I think the rejection of the initial AMDKFD patch supports GMEM's idea -- there could have been a simpler AMDKFD implementation if the core MM had been extended by GMEM. Also, in the 9 years since then, many more companies have been building their own accelerators, especially now that the GPT family has had such a big success. Don't we want to advance Linux's core MM to provide more friendly and generalized support for the upcoming new vendors?

Well exactly that's the big point: Absolutely not!

We really should never ever encourage people to bind their device 
address space to the CPU address space. This is a very special use case 
and limits the driver design to only this use case.

We have exercised this approach to a rather extreme degree with KFD and 
I can clearly say that doing this was a really big mistake.

As far as I can see you are about to repeat that mistake and even 
encourage others to do so as well.

> Now answering Christian's design concerns:
>
> 1. "There are cases that do not want to share CPU address space"
> Maybe, but I am not fully convinced. The current case we can find is when a NIC utilizes IOMMU for security. For this case, GMEM implemented a generalized VMA support and tested it with NICs using both Intel-IOMMU/Arm-SMMU. This cut 600 LoC of IOVA management code from the IOMMU driver, but it is still not included in this RFC patch -- I cannot find other cases demanding this isolation. The isolation is also unnecessary -- the NIC can enable the IOMMU SVM feature to share the CPU address space. As of KVM, it is essentially a host process that utilizes two different MMUs within the same address space, so it fits GMEM's design...

Maybe I don't completely follow here how you want to save LoC for the 
IOMMU implementation of NICs, but at least for the PASID/PRI support AMD 
just recently went in exactly the opposite direction:

commit 5a0b11a180a9b82b4437a4be1cf73530053f139b
Author: Vasant Hegde <vasant.hegde@amd.com>
Date:   Fri Oct 6 09:57:02 2023 +0000

     iommu/amd: Remove iommu_v2 module

     AMD GPU driver which was the only in-kernel user of iommu_v2 module
     removed dependency on iommu_v2 module.

     Also we are working on adding SVA support in AMD IOMMU driver. Device
     drivers are expected to use common SVA framework to enable device
     PASID/PRI features.

     Removing iommu_v2 module and then adding SVA simplifies the 
development.
     Hence remove iommu_v2 module.

As I wrote before this IOMMU V2 driver was basically binding the CPU 
address space to IOMMU devices using the PASID. For an example see 
function amd_iommu_bind_pasid().

This turned out to be not as useful as we hoped it would be. Essentially 
the use cases where you want to give a device access to the whole address 
space of a process are extremely limited. That's why we are removing it 
and switching over to a separate SVA implementation which doesn't depend 
on the CPU address space.


But the virtualization use case I mentioned is completely independent of 
IOMMU. In KVM/XEN/etc. there is a functionality called native context; 
basically, this means that instead of passing through a complete device 
isolated by the IOMMU, only specific kernel functionalities are exposed 
to the guest operating system through QEMU.

See here for an example how OpenGL is implemented on top of this: 
https://docs.mesa3d.org/drivers/virgl.html

This is actually using the separation between device memory management 
and CPU memory management and is basically a killer argument why those 
two topics should be separated. Otherwise it's impossible for QEMU to 
actually handle multiple independent device memory address spaces inside 
a single CPU memory address space.

> 2. "This does not integrate well with the filesystem layer in Linux..."
> To be honest, not using a logical page table for anonymous memory is why Linux THP fails compared with FreeBSD's superpages, but I am not going to elaborate on it here. But yes, I am looking into merging struct vm_object->logical_page_table with struct address_space->i_pages. This will provide natural support for devices oversubscribing both host DRAM and disks. As explained in my cover letter, struct vm_object borrows FreeBSD's VM design -- it provides a unified abstraction layer for anonymous memory, file-backed memory, etc.

I'm not that deep into this stuff, so leaving this to the experts on 
FreeBSD.

> 3. "Requirements to CPU address space management and device address space management are just massively different. For example huge and giant pages are a must have for modern devices..."
> I think you are asking two questions. First, is VA space a problem?

No, this is about something completely different.

> GMEM assumes that device VA space should be covered by CPU VA space (sorry i386), ...
[SNIP]

I'm removing this because you were talking about something different 
than what I meant.

I will try to explain the background on an example outside of machine 
learning and compute since this framework should be applicable to every 
use case and not be limited to those. Otherwise Linux would sooner or 
later just be applicable to only those use cases.

So let's take a look at how modern games use a GPU for example. On 
startup a rather large part of the GPU address space is allocated, for 
example 64GiB. Then the necessary resources (images, texture, vertices, 
shaders etc..) are loaded into separate buffer objects.

Those resources are then mapped into the allocated address space on a 
page-by-page basis. So you basically don't have large VMAs which cover one 
resource; rather, the page tables are used as a remapping table into the 
available resources. This increases the number of virtual mappings 
drastically -- it's kind of comparable to how an anon_vma works inside a 
VMA on Linux.

Those mappings also are not set up at start and then used throughout the 
whole lifetime of the process, but rather are changed very dynamically, 
sometimes resulting in thousands of mapping operations per second.

In addition to that, devices have page table features which CPUs don't 
have. These range from support for partially resident textures to flags 
that control how caching and dynamic color space compression are done.

So the mappings contain tons of device specific information and it's 
most likely not even possible to handle all of this with a device 
independent mmap() call.
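
To make the mapping pattern concrete, below is roughly what such a 
page-granular rebind looks like from userspace through Vulkan's sparse 
binding API. It is only an illustration of the pattern, not a description 
of any particular kernel driver: it assumes a sparse-capable queue, a 
buffer created with VK_BUFFER_CREATE_SPARSE_BINDING_BIT and a 
VkDeviceMemory allocation already exist, and it omits error handling and 
synchronization.

    #include <vulkan/vulkan.h>

    /* Illustrative only: rebind one 64 KiB page of a sparse buffer to a
     * new physical backing; games issue this kind of operation thousands
     * of times per second. */
    static void rebind_page(VkQueue sparse_queue, VkBuffer sparse_buffer,
                            VkDeviceMemory backing_memory,
                            VkDeviceSize page_index, VkDeviceSize backing_offset)
    {
            VkSparseMemoryBind bind = {
                    .resourceOffset = page_index * 65536,  /* page inside the buffer */
                    .size           = 65536,
                    .memory         = backing_memory,
                    .memoryOffset   = backing_offset,
            };
            VkSparseBufferMemoryBindInfo buffer_bind = {
                    .buffer    = sparse_buffer,   /* created with SPARSE_BINDING */
                    .bindCount = 1,
                    .pBinds    = &bind,
            };
            VkBindSparseInfo bind_info = {
                    .sType           = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO,
                    .bufferBindCount = 1,
                    .pBufferBinds    = &buffer_bind,
            };

            vkQueueBindSparse(sparse_queue, 1, &bind_info, VK_NULL_HANDLE);
    }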

> 4. "The argument that a shared memory management leads to less bugs has also absolutely not be proven true. Instead we literally spend month if not years hunting down bugs which resulted from interaction between CPU and devices."
> This is another case supporting GMEM. Don't developers want to let GMEM handle the CPU-device interaction so that they can waive months of debugging cost?

No, we already have HMM for that.

Regards,
Christian.

>
> PS, hmadvise() is based on the idea of Nvidia's cudaMemAdvise() which provides abundant and useful memory policies. HMM extended mbind() instead.
>
> -Weixi
>
> -----Original Message-----
> From: Christian König <christian.koenig@amd.com>
> Sent: Wednesday, November 29, 2023 11:22 PM
> To: zhuweixi <weixi.zhu@huawei.com>; Dave Airlie <airlied@gmail.com>
> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; weixi.zhu@openeuler.sh; mgorman@suse.de; jglisse@redhat.com; rcampbell@nvidia.com; jhubbard@nvidia.com; apopple@nvidia.com; mhairgrove@nvidia.com; ziy@nvidia.com; alexander.deucher@amd.com; Xinhui.Pan@amd.com; amd-gfx@lists.freedesktop.org; Felix.Kuehling@amd.com; ogabbay@kernel.org; dri-devel@lists.freedesktop.org; jgg@nvidia.com; leonro@nvidia.com; zhenyuw@linux.intel.com; zhi.a.wang@intel.com; intel-gvt-dev@lists.freedesktop.org; intel-gfx@lists.freedesktop.org; jani.nikula@linux.intel.com; joonas.lahtinen@linux.intel.com; rodrigo.vivi@intel.com; tvrtko.ursulin@linux.intel.com
> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
>
> Am 29.11.23 um 09:27 schrieb zhuweixi:
>> Glad to hear that more sharable code is desirable.
>> IMHO, for a common MM subsystem, it is more beneficial for GMEM to
>> extend core MM instead of building a separate one.
>>
>> As stated in the beginning of my RFC letter, MM systems are large and
>> similar. Even a sophisticated one like Linux MM that has evolved over
>> decades still suffers from an increasing number of bugs[1]. So,
>> directly extending core MM to support devices not only avoids opening
>> a new box of bugs, but also allows the community to concentrate on
>> maintaining one single MM system. On the other side, GMEM does not hurt
>> core MM if a CPU process is not attached to device contexts.
>>
>> @Christian, could you provide more information on what AMD proposed
>> with KFD and why it was rejected?
> Well, this is going to be a longer explanation.
>
> The combination of KFD and HMM is essentially based on the same idea as this code here. Even the initial KFD implementation was very similar in the sense that it added device contexts to mm_struct and tried to manage GPU/acceleration MM the same way as CPU MM. In other words it was basically identical to your gm_dev_create() and gm_mmu approach.
>
> As mentioned before this initial proposal was rejected; for more background see the discussion around "amdkfd: Add amdkfd skeleton driver" on the dri-devel mailing list between 2013 and 2014. You need to dig up the whole discussion from the mailing list, but summarizing it, the general feeling was that it would be a mistake to tie device drivers too close to CPU memory management (and stable UAPI) without validating that this is really the right thing to do.
>
> So instead of the original implementation KFD has gone upstream with a much less invasive approach where device contexts are just looked up on demand for each mm_struct. Felix can probably provide some pointers to the implementation.
>
> On the initially supported hardware the KFD used the PCIe ATC feature to allow routing of memory accesses directly into the associated CPU process address space; later on we switched to an MMU notifier/HMM based approach to give similar functionality to the userspace stack on top of it for devices which don't support ATC. The ATC path was just recently completely removed and we are now only using MMU notifiers/HMM.
>
> HMM tried to add similar functionality like you propose with the mmap() flag and hmadvise() call. The hmadvise() extension actually looks so familiar to the HMM proposal that I would expect that this is actually based on that code.
>
> All this turned out to have some major design issues.
>
> First of all you have a rather large group of use cases where you don't want your device to mirror the address space of your process. Just think of things like QEMU, KVM, XEN, and in general virtualization and container handling. Linux has the mantra that everything is a file, and if it's not a file it should be a file, and when you tie device memory management into CPU memory management you are pretty much violating exactly that.
>
> Second this doesn't integrate well with the filesystem layer in Linux.
> For example we do have struct pages for HMM exposed device memory, but for I/O we still migrate this back to system memory because of (for
> example) the page lock requirements around writeback.
>
> Then third it turned out that the requirements to CPU address space management and device address space management are just massively different. For example huge and giant pages are a must have for modern devices, on the CPU side we are barely switching over to folios now to add similar functionality.
>
> The argument that shared memory management leads to fewer bugs has also absolutely not been proven true. Instead we have literally spent months if not years hunting down bugs which resulted from interaction between CPU and devices.
> ...
>
> There are a couple of more things on this contra side to that approach, but I think that would just make this mail unnecessary long.
>
> To sum it up, from over a decade of experience working in this area I can just say that CPU and device memory management should absolutely *NOT* be mixed. We had those ideas multiple times before, but they either failed because they didn't integrate well with the core OS, or the hardware support was just lagging behind the actual requirements.
>
> What can be done, and where I completely agree with Dave, is that having common components which provide device drivers with the necessary functionality to manage their device address space is a really good idea.
> Danilo is for example working on a GPUVM component to have common virtual address space management and I'm at least sometimes working on MMU notifier/HMM improvements.
>
> Providing SVM functionality to your userspace stack is still a really good idea, but it should be done with MMU notifiers and components which are separate from your CPU memory management instead of tying it directly to the CPU address space.
>
> Regards,
> Christian.
>
>> [1] Huang, Jian, Moinuddin K. Qureshi, and Karsten Schwan. "An evolutionary study of linux memory management for fun and profit." 2016 USENIX Annual Technical Conference (USENIX ATC 16). 2016.
>>
>> Thanks,
>> Weixi
>>
>> -----Original Message-----
>> From: Dave Airlie <airlied@gmail.com>
>> Sent: Wednesday, November 29, 2023 1:15 PM
>> To: Christian König <christian.koenig@amd.com>
>> Cc: zhuweixi <weixi.zhu@huawei.com>; linux-mm@kvack.org;
>> linux-kernel@vger.kernel.org; akpm@linux-foundation.org;
>> weixi.zhu@openeuler.sh; mgorman@suse.de; jglisse@redhat.com;
>> rcampbell@nvidia.com; jhubbard@nvidia.com; apopple@nvidia.com;
>> mhairgrove@nvidia.com; ziy@nvidia.com; alexander.deucher@amd.com;
>> Xinhui.Pan@amd.com; amd-gfx@lists.freedesktop.org;
>> Felix.Kuehling@amd.com; ogabbay@kernel.org;
>> dri-devel@lists.freedesktop.org; jgg@nvidia.com; leonro@nvidia.com;
>> zhenyuw@linux.intel.com; zhi.a.wang@intel.com;
>> intel-gvt-dev@lists.freedesktop.org; intel-gfx@lists.freedesktop.org;
>> jani.nikula@linux.intel.com; joonas.lahtinen@linux.intel.com;
>> rodrigo.vivi@intel.com; tvrtko.ursulin@linux.intel.com
>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>> management) for external memory devices
>>
>> On Tue, 28 Nov 2023 at 23:07, Christian König <christian.koenig@amd.com> wrote:
>>> Am 28.11.23 um 13:50 schrieb Weixi Zhu:
>>>> The problem:
>>>>
>>>> Accelerator driver developers are forced to reinvent external MM
>>>> subsystems case by case, because Linux core MM only considers host memory resources.
>>>> These reinvented MM subsystems have similar orders of magnitude of
>>>> LoC as Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has 14K and
>>>> Huawei NPU has 30K. Meanwhile, more and more vendors are
>>>> implementing their own accelerators, e.g. Microsoft's Maia 100. At
>>>> the same time, application-level developers suffer from poor
>>>> programmability -- they must consider parallel address spaces and be
>>>> careful about the limited device DRAM capacity. This can be
>>>> alleviated if a malloc()-ed virtual address can be shared by the
>>>> accelerator, or the abundant host DRAM can further transparently backup the device local memory.
>>>>
>>>> These external MM systems share similar mechanisms except for the
>>>> hardware-dependent part, so reinventing them is effectively
>>>> introducing redundant code (14K~70K for each case). Such
>>>> developing/maintaining is not cheap. Furthermore, to share a
>>>> malloc()-ed virtual address, device drivers need to deeply interact
>>>> with Linux MM via low-level MM APIs, e.g. MMU notifiers/HMM. This
>>>> raises the bar for driver development, since developers must
>>>> understand how Linux MM works. Further, it creates code maintenance
>>>> problems -- any changes to Linux MM potentially require coordinated changes to accelerator drivers using low-level MM APIs.
>>>>
>>>> Putting a cache-coherent bus between host and device will not make
>>>> these external MM subsystems disappear. For example, a
>>>> throughput-oriented accelerator will not tolerate executing heavy
>>>> memory access workload with a host MMU/IOMMU via a remote bus.
>>>> Therefore, devices will still have their own MMU and pick a simpler
>>>> page table format for lower address translation overhead, requiring external MM subsystems.
>>>>
>>>> --------------------
>>>>
>>>> What GMEM (Generalized Memory Management [1]) does:
>>>>
>>>> GMEM extends Linux MM to share its machine-independent MM code. Only
>>>> high-level interface is provided for device drivers. This prevents
>>>> accelerator drivers from reinventing the wheel, but relies on
>>>> drivers to implement their hardware-dependent functions declared by
>>>> GMEM. GMEM's key interface include gm_dev_create(), gm_as_create(),
>>>> gm_as_attach() and gm_dev_register_physmem(). Here briefly describe
>>>> how a device driver utilizes them:
>>>> 1. At boot time, call gm_dev_create() and registers the implementation of
>>>>       hardware-dependent functions as declared in struct gm_mmu.
>>>>         - If the device has local DRAM, call gm_dev_register_physmem() to
>>>>           register available physical addresses.
>>>> 2. When a device context is initialized (e.g. triggered by ioctl), check if
>>>>       the current CPU process has been attached to a gmem address space
>>>>       (struct gm_as). If not, call gm_as_create() and point current->mm->gm_as
>>>>       to it.
>>>> 3. Call gm_as_attach() to attach the device context to a gmem address space.
>>>> 4. Invoke gm_dev_fault() to resolve a page fault or prepare data before
>>>>       device computation happens.
>>>>
>>>> GMEM has changed the following assumptions in Linux MM:
>>>>      1. An mm_struct not only handle a single CPU context, but may also handle
>>>>         external memory contexts encapsulated as gm_context listed in
>>>>         mm->gm_as. An external memory context can include a few or all of the
>>>>         following parts: an external MMU (that requires TLB invalidation), an
>>>>         external page table (that requires PTE manipulation) and external DRAM
>>>>         (that requires physical memory management).
>>> Well that is pretty much exactly what AMD has already proposed with
>>> KFD and was rejected for rather good reasons.
>>>> MMU functions
>>>> The MMU functions peer_map() and peer_unmap() overlap other
>>>> functions, leaving a question if the MMU functions should be
>>>> decoupled as more basic operations. Decoupling them could
>>>> potentially prevent device drivers coalescing these basic steps
>>>> within a single host-device communication operation, while coupling
>>>> them makes it more difficult for device drivers to utilize GMEM interface.
>>> Well to be honest all of this sounds like history to me. We have
>>> already seen the same basic approach in KFD, HMM and to some extend in TTM as well.
>>>
>>> And all of them more or less failed. Why should this here be different?
>> Any info we have on why this has failed to work in the past would be
>> useful to provide. This is one of those cases where we may not have
>> documented the bad ideas to stop future developers from thinking they
>> are bad.
>>
>> I do think we would want more common code in this area, but I would
>> think we'd have it more on the driver infrastructure side, than in the
>> core mm.
>>
>> Dave.
David Hildenbrand Nov. 30, 2023, 2:55 p.m. UTC | #9
On 29.11.23 09:27, zhuweixi wrote:
> Glad to hear that more sharable code is desirable.
> IMHO, for a common MM subsystem, it is more beneficial for
> GMEM to extend core MM instead of building a separate one.

More core-mm complexity, awesome, we all love that! ;)
zhuweixi Dec. 1, 2023, 2:37 a.m. UTC | #10
From your argument on KVM I can see that the biggest miscommunication between us is that you believed that GMEM wanted to share the whole address space. No, that is not the case. GMEM only provides coordination via certain mmap() calls. So you are raising a case supporting GMEM again -- passing through part of the CPU address space instead of the whole CPU address space is exactly what GMEM can do. On the other side, the IOMMU SVA feature wildly binds the whole address space -- since the hardware feature is to directly share the whole CPU page table.

"We really should never ever encourage people to bind their device address space to the CPU address space. This is a very special use case and limits the driver design to only this use case.
We have exercised this approach to a rather extreme degree with KFD and I can clearly say that doing this was a really big mistake.
As far as I can see you are about to repeat that mistake and even encourage others to do so as well."

-- The behavior of internally "attaching device contexts to mm_struct" in GMEM is ultimately a different approach to coordinating the CPU and devices. I want to replace MMU notifiers with this approach because I want to protect core MM from random interactions with external driver MMs. Both GMEM and MMU notifiers bind device contexts to the CPU context, not putting them in the same address space. If someone is against GMEM's approach for binding CPU and device contexts, then they should be against MMU notifiers as well.

Currently, from our discussion I think I received two messages:
	1. The original AMDKFD design was rejected because it inserted vendor-specific stuff into the generic core MM.
	2. The rejection from #1 led to your opinion that device and core MM cannot be mixed together.

I think #1 really encouraged me that GMEM could help the AMDKFD driver. However, I am also confused about why GMEM must be compared with a vendor-specific driver. AMDKFD was only considering a very special use case: AMD GPUs using the AMD IOMMU.
However, GMEM is trying to consider all generalized cases of memory devices. The device can be an Nvidia GPU or a Huawei NPU that uses its own MMU, an AMD/Intel GPU that uses the IOMMU, or a device from one of hundreds of new accelerator vendors.

-Weixi

-----Original Message-----
From: Christian König <christian.koenig@amd.com> 
Sent: Thursday, November 30, 2023 9:05 PM
To: zhuweixi <weixi.zhu@huawei.com>; Dave Airlie <airlied@gmail.com>
Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; weixi.zhu@openeuler.sh; mgorman@suse.de; jglisse@redhat.com; rcampbell@nvidia.com; jhubbard@nvidia.com; apopple@nvidia.com; mhairgrove@nvidia.com; ziy@nvidia.com; alexander.deucher@amd.com; Xinhui.Pan@amd.com; amd-gfx@lists.freedesktop.org; Felix.Kuehling@amd.com; ogabbay@kernel.org; dri-devel@lists.freedesktop.org; jgg@nvidia.com; leonro@nvidia.com; zhenyuw@linux.intel.com; zhi.a.wang@intel.com; intel-gvt-dev@lists.freedesktop.org; intel-gfx@lists.freedesktop.org; jani.nikula@linux.intel.com; joonas.lahtinen@linux.intel.com; rodrigo.vivi@intel.com; tvrtko.ursulin@linux.intel.com; Danilo Krummrich <dakr@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Zeng, Oak <oak.zeng@intel.com>
Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

Am 30.11.23 um 08:22 schrieb zhuweixi:
> Add @Oak to the KFD discussion. I will reply separately elaborating your questions on GMEM's difference from HMM/MMU notifiers.
>
> Christian, thanks for pointing me to that AMDKFD discussion. I have read the discussion around the AMDKFD skeleton patch and found the previous discussion in the following URLs:
> https://lore.kernel.org/dri-devel/1405028848-5660-1-git-send-email-oded.gabbay@amd.com/#r
> https://lore.kernel.org/dri-devel/20140711154231.GB1870@gmail.com/
>
> I believe AMDKFD's original patch was rejected mostly because of inserting vendor-specific stuff to the generic core MM.  Jérôme has clearly stated this issue in the second URL. If the code is vendor-specific then it has no place in core MM, period.
>
> But why does that vendor-specific solution relate to a generalized solution like GMEM? The initial AMDKFD patch doesn't work for Nvidia or Intel.

KFD was meant to be a vendor agnostic framework, very similar to what you propose here.

It's just that it was seen as vendor specific because nobody else actually wanted to design their drivers this way.

>
> In fact I think the rejection of the initial AMDKFD patch supports GMEM's idea -- there could have been a simpler AMDKFD implementation if the core MM had been extended by GMEM. Also, in the 9 years since then, many more companies have been building their own accelerators, especially now that the GPT family has had such a big success. Don't we want to advance Linux's core MM to provide more friendly and generalized support for the upcoming new vendors?

Well exactly that's the big point: Absolutely not!

We really should never ever encourage people to bind their device address space to the CPU address space. This is a very special use case and limits the driver design to only this use case.

We have exercised this approach to a rather extreme degree with KFD and I can clearly say that doing this was a really big mistake.

As far as I can see you are about to repeat that mistake and even encourage others to do so as well.

> Now answering Christian's design concerns:
>
> 1. "There are cases that do not want to share CPU address space"
> Maybe, but I am not fully convinced. The current case we can find is when a NIC utilizes IOMMU for security. For this case, GMEM implemented a generalized VMA support and tested it with NICs using both Intel-IOMMU/Arm-SMMU. This cut 600 LoC of IOVA management code from the IOMMU driver, but it is still not included in this RFC patch -- I cannot find other cases demanding this isolation. The isolation is also unnecessary -- the NIC can enable the IOMMU SVM feature to share the CPU address space. As of KVM, it is essentially a host process that utilizes two different MMUs within the same address space, so it fits GMEM's design...

Maybe I don't completely follow here how you want to save LoC for the IOMMU implementation of NICs, but at least for the PASID/PRI support AMD just recently went in exactly the opposite direction:

commit 5a0b11a180a9b82b4437a4be1cf73530053f139b
Author: Vasant Hegde <vasant.hegde@amd.com>
Date:   Fri Oct 6 09:57:02 2023 +0000

     iommu/amd: Remove iommu_v2 module

     AMD GPU driver which was the only in-kernel user of iommu_v2 module
     removed dependency on iommu_v2 module.

     Also we are working on adding SVA support in AMD IOMMU driver. Device
     drivers are expected to use common SVA framework to enable device
     PASID/PRI features.

     Removing iommu_v2 module and then adding SVA simplifies the development.
     Hence remove iommu_v2 module.

As I wrote before this IOMMU V2 driver was basically binding the CPU address space to IOMMU devices using the PASID. For an example see function amd_iommu_bind_pasid().

This turned out to be not as useful as we hoped it would be. Essentially the use cases where you want to give a device access to the whole address space of a process are extremely limited. That's why we are removing it and switching over to a separate SVA implementation which doesn't depend on the CPU address space.


But the virtualization use case I mentioned is completely independent of IOMMU. In KVM/XEN/etc. there is a functionality called native context; basically, this means that instead of passing through a complete device isolated by the IOMMU, only specific kernel functionalities are exposed to the guest operating system through QEMU.

See here for an example how OpenGL is implemented on top of this: 
https://docs.mesa3d.org/drivers/virgl.html

This is actually using the separation between device memory management and CPU memory management and is basically a killer argument why those two topics should be separated. Otherwise it's impossible for QEMU to actually handle multiple independent device memory address spaces inside a single CPU memory address space.

> 2. "This does not integrate well with the filesystem layer in Linux..."
> To be honest, not using a logical page table for anonymous memory is why Linux THP fails compared with FreeBSD's superpages, but I am not going to elaborate on it here. But yes, I am looking into merging struct vm_object->logical_page_table with struct address_space->i_pages. This will provide natural support for devices oversubscribing both host DRAM and disks. As explained in my cover letter, struct vm_object borrows FreeBSD's VM design -- it provides a unified abstraction layer for anonymous memory, file-backed memory, etc.

I'm not that deep into this stuff, so leaving this to the experts on FreeBSD.

> 3. "Requirements to CPU address space management and device address space management are just massively different. For example huge and giant pages are a must have for modern devices..."
> I think you are asking two questions. First, is VA space a problem?

No, this is about something completely different.

> GMEM assumes that device VA space should be covered by CPU VA space (sorry i386), ...
[SNIP]

I'm removing this because you were talking about something different than what I meant.

I will try to explain the background on an example outside of machine learning and compute since this framework should be applicable to every use case and not be limited to those. Otherwise Linux would sooner or later just be applicable to only those use cases.

So let's take a look at how modern games use a GPU for example. On startup a rather large part of the GPU address space is allocated, for example 64GiB. Then the necessary resources (images, texture, vertices, shaders etc..) are loaded into separate buffer objects.

Those resources are then mapped into the allocated address space on a page-by-page basis. So you basically don't have large VMAs which cover one resource; rather, the page tables are used as a remapping table into the available resources. This increases the number of virtual mappings drastically -- it's kind of comparable to how an anon_vma works inside a VMA on Linux.

Those mappings also are not set up at start and then used throughout the whole lifetime of the process, but rather are changed very dynamically, sometimes resulting in thousands of mapping operations per second.

In addition to that, devices have page table features which CPUs don't have. These range from support for partially resident textures to flags that control how caching and dynamic color space compression are done.

So the mappings contain tons of device specific information and it's most likely not even possible to handle all of this with a device independent mmap() call.

> 4. "The argument that a shared memory management leads to less bugs has also absolutely not be proven true. Instead we literally spend month if not years hunting down bugs which resulted from interaction between CPU and devices."
> This is another case supporting GMEM. Don't developers want to let GMEM handle the CPU-device interaction so that they can waive months of debugging cost?

No, we already have HMM for that.

Regards,
Christian.

>
> PS, hmadvise() is based on the idea of Nvidia's cudaMemAdvise() which provides abundant and useful memory policies. HMM extended mbind() instead.
>
> -Weixi
>
> -----Original Message-----
> From: Christian König <christian.koenig@amd.com>
> Sent: Wednesday, November 29, 2023 11:22 PM
> To: zhuweixi <weixi.zhu@huawei.com>; Dave Airlie <airlied@gmail.com>
> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; 
> akpm@linux-foundation.org; weixi.zhu@openeuler.sh; mgorman@suse.de; 
> jglisse@redhat.com; rcampbell@nvidia.com; jhubbard@nvidia.com; 
> apopple@nvidia.com; mhairgrove@nvidia.com; ziy@nvidia.com; 
> alexander.deucher@amd.com; Xinhui.Pan@amd.com; 
> amd-gfx@lists.freedesktop.org; Felix.Kuehling@amd.com; 
> ogabbay@kernel.org; dri-devel@lists.freedesktop.org; jgg@nvidia.com; 
> leonro@nvidia.com; zhenyuw@linux.intel.com; zhi.a.wang@intel.com; 
> intel-gvt-dev@lists.freedesktop.org; intel-gfx@lists.freedesktop.org; 
> jani.nikula@linux.intel.com; joonas.lahtinen@linux.intel.com; 
> rodrigo.vivi@intel.com; tvrtko.ursulin@linux.intel.com
> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory 
> management) for external memory devices
>
> Am 29.11.23 um 09:27 schrieb zhuweixi:
>> Glad to hear that more sharable code is desirable.
>> IMHO, for a common MM subsystem, it is more beneficial for GMEM to 
>> extend core MM instead of building a separate one.
>>
>> As stated in the beginning of my RFC letter, MM systems are large and 
>> similar. Even a sophisticated one like Linux MM that has evolved over 
>> decades still suffers from an increasing number of bugs[1]. So, 
>> directly extending core MM to support devices not only avoids opening 
>> a new box of bugs, but also allows the community to concentrate on 
>> maintaining one single MM system. On the other side, GMEM does not 
>> hurt core MM if a CPU process is not attached to device contexts.
>>
>> @Christian, could you provide more information on what AMD proposed 
>> with KFD and why it was rejected?
> Well, this is going to be a longer explanation.
>
> The combination of KFD and HMM is essentially based on the same idea as this code here. Even the initial KFD implementation was very similar in the sense that it added device contexts to mm_struct and tried to manage GPU/acceleration MM the same way as CPU MM. In other words it was basically identical to your gm_dev_create() and gm_mmu approach.
>
> As mentioned before this initial proposal was rejected; for more background see the discussion around "amdkfd: Add amdkfd skeleton driver" on the dri-devel mailing list between 2013 and 2014. You need to dig up the whole discussion from the mailing list, but summarizing it, the general feeling was that it would be a mistake to tie device drivers too close to CPU memory management (and stable UAPI) without validating that this is really the right thing to do.
>
> So instead of the original implementation KFD has gone upstream with a much less invasive approach where device contexts are just looked up on demand for each mm_struct. Felix can probably provide some pointers to the implementation.
>
> On the initially supported hardware the KFD used the PCIe ATC feature to allow routing of memory accesses directly into the associated CPU process address space; later on we switched to an MMU notifier/HMM based approach to give similar functionality to the userspace stack on top of it for devices which don't support ATC. The ATC path was just recently completely removed and we are now only using MMU notifiers/HMM.
>
> HMM tried to add similar functionality like you propose with the mmap() flag and hmadvise() call. The hmadvise() extension actually looks so familiar to the HMM proposal that I would expect that this is actually based on that code.
>
> All this turned out to have some major design issues.
>
> First of all you have a rather large group of use cases where you don't want your device to mirror the address space of your process. Just think of things like QEMU, KVM, XEN, and in general virtualization and container handling. Linux has the mantra that everything is a file, and if it's not a file it should be a file, and when you tie device memory management into CPU memory management you are pretty much violating exactly that.
>
> Second this doesn't integrate well with the filesystem layer in Linux.
> For example we do have struct pages for HMM exposed device memory, but 
> for I/O we still migrate this back to system memory because of (for
> example) the page lock requirements around writeback.
>
> Then third it turned out that the requirements to CPU address space management and device address space management are just massively different. For example huge and giant pages are a must have for modern devices, on the CPU side we are barely switching over to folios now to add similar functionality.
>
> The argument that shared memory management leads to fewer bugs has also absolutely not been proven true. Instead we have literally spent months if not years hunting down bugs which resulted from interaction between CPU and devices.
> ...
>
> There are a couple of more things on this contra side to that approach, but I think that would just make this mail unnecessary long.
>
> To sum it up, from over a decade of experience working in this area I can just say that CPU and device memory management should absolutely *NOT* be mixed. We had those ideas multiple times before, but they either failed because they didn't integrate well with the core OS, or the hardware support was just lagging behind the actual requirements.
>
> What can be done, and where I completely agree with Dave, is that having common components which provide device drivers with the necessary functionality to manage their device address space is a really good idea.
> Danilo is for example working on a GPUVM component to have common virtual address space management and I'm at least sometimes working on MMU notifier/HMM improvements.
>
> Providing SVM functionality to your userspace stack is still a really good idea, but it should be done with MMU notifiers and components which are separate from your CPU memory management instead of tying it directly to the CPU address space.
>
> Regards,
> Christian.
>
>> [1] Huang, Jian, Moinuddin K. Qureshi, and Karsten Schwan. "An evolutionary study of linux memory management for fun and profit." 2016 USENIX Annual Technical Conference (USENIX ATC 16). 2016.
>>
>> Thanks,
>> Weixi
>>
>> -----Original Message-----
>> From: Dave Airlie <airlied@gmail.com>
>> Sent: Wednesday, November 29, 2023 1:15 PM
>> To: Christian König <christian.koenig@amd.com>
>> Cc: zhuweixi <weixi.zhu@huawei.com>; linux-mm@kvack.org; 
>> linux-kernel@vger.kernel.org; akpm@linux-foundation.org; 
>> weixi.zhu@openeuler.sh; mgorman@suse.de; jglisse@redhat.com; 
>> rcampbell@nvidia.com; jhubbard@nvidia.com; apopple@nvidia.com; 
>> mhairgrove@nvidia.com; ziy@nvidia.com; alexander.deucher@amd.com; 
>> Xinhui.Pan@amd.com; amd-gfx@lists.freedesktop.org; 
>> Felix.Kuehling@amd.com; ogabbay@kernel.org; 
>> dri-devel@lists.freedesktop.org; jgg@nvidia.com; leonro@nvidia.com; 
>> zhenyuw@linux.intel.com; zhi.a.wang@intel.com; 
>> intel-gvt-dev@lists.freedesktop.org; intel-gfx@lists.freedesktop.org; 
>> jani.nikula@linux.intel.com; joonas.lahtinen@linux.intel.com; 
>> rodrigo.vivi@intel.com; tvrtko.ursulin@linux.intel.com
>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>> management) for external memory devices
>>
>> On Tue, 28 Nov 2023 at 23:07, Christian König <christian.koenig@amd.com> wrote:
>>> Am 28.11.23 um 13:50 schrieb Weixi Zhu:
>>>> The problem:
>>>>
>>>> Accelerator driver developers are forced to reinvent external MM 
>>>> subsystems case by case, because Linux core MM only considers host memory resources.
>>>> These reinvented MM subsystems have similar orders of magnitude of 
>>>> LoC as Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has 14K and 
>>>> Huawei NPU has 30K. Meanwhile, more and more vendors are 
>>>> implementing their own accelerators, e.g. Microsoft's Maia 100. At 
>>>> the same time, application-level developers suffer from poor 
>>>> programmability -- they must consider parallel address spaces and 
>>>> be careful about the limited device DRAM capacity. This can be 
>>>> alleviated if a malloc()-ed virtual address can be shared by the 
>>>> accelerator, or the abundant host DRAM can further transparently backup the device local memory.
>>>>
>>>> These external MM systems share similar mechanisms except for the 
>>>> hardware-dependent part, so reinventing them is effectively 
>>>> introducing redundant code (14K~70K for each case). Such 
>>>> developing/maintaining is not cheap. Furthermore, to share a 
>>>> malloc()-ed virtual address, device drivers need to deeply interact 
>>>> with Linux MM via low-level MM APIs, e.g. MMU notifiers/HMM. This 
>>>> raises the bar for driver development, since developers must 
>>>> understand how Linux MM works. Further, it creates code maintenance 
>>>> problems -- any changes to Linux MM potentially require coordinated changes to accelerator drivers using low-level MM APIs.
>>>>
>>>> Putting a cache-coherent bus between host and device will not make 
>>>> these external MM subsystems disappear. For example, a 
>>>> throughput-oriented accelerator will not tolerate executing heavy 
>>>> memory access workload with a host MMU/IOMMU via a remote bus.
>>>> Therefore, devices will still have their own MMU and pick a simpler 
>>>> page table format for lower address translation overhead, requiring external MM subsystems.
>>>>
>>>> --------------------
>>>>
>>>> What GMEM (Generalized Memory Management [1]) does:
>>>>
>>>> GMEM extends Linux MM to share its machine-independent MM code. 
>>>> Only high-level interface is provided for device drivers. This 
>>>> prevents accelerator drivers from reinventing the wheel, but relies 
>>>> on drivers to implement their hardware-dependent functions declared 
>>>> by GMEM. GMEM's key interface include gm_dev_create(), 
>>>> gm_as_create(),
>>>> gm_as_attach() and gm_dev_register_physmem(). Here briefly describe 
>>>> how a device driver utilizes them:
>>>> 1. At boot time, call gm_dev_create() and registers the implementation of
>>>>       hardware-dependent functions as declared in struct gm_mmu.
>>>>         - If the device has local DRAM, call gm_dev_register_physmem() to
>>>>           register available physical addresses.
>>>> 2. When a device context is initialized (e.g. triggered by ioctl), check if
>>>>       the current CPU process has been attached to a gmem address space
>>>>       (struct gm_as). If not, call gm_as_create() and point current->mm->gm_as
>>>>       to it.
>>>> 3. Call gm_as_attach() to attach the device context to a gmem address space.
>>>> 4. Invoke gm_dev_fault() to resolve a page fault or prepare data before
>>>>       device computation happens.
>>>>
>>>> GMEM has changed the following assumptions in Linux MM:
>>>>      1. An mm_struct not only handle a single CPU context, but may also handle
>>>>         external memory contexts encapsulated as gm_context listed in
>>>>         mm->gm_as. An external memory context can include a few or all of the
>>>>         following parts: an external MMU (that requires TLB invalidation), an
>>>>         external page table (that requires PTE manipulation) and external DRAM
>>>>         (that requires physical memory management).
>>> Well that is pretty much exactly what AMD has already proposed with 
>>> KFD and was rejected for rather good reasons.
>>>> MMU functions
>>>> The MMU functions peer_map() and peer_unmap() overlap other 
>>>> functions, leaving a question if the MMU functions should be 
>>>> decoupled as more basic operations. Decoupling them could 
>>>> potentially prevent device drivers coalescing these basic steps 
>>>> within a single host-device communication operation, while coupling 
>>>> them makes it more difficult for device drivers to utilize GMEM interface.
>>> Well to be honest all of this sounds like history to me. We have 
>>> already seen the same basic approach in KFD, HMM and to some extend in TTM as well.
>>>
>>> And all of them more or less failed. Why should this here be different?
>> Any info we have on why this has failed to work in the past would be 
>> useful to provide. This is one of those cases where we may not have 
>> documented the bad ideas to stop future developers from thinking they 
>> are bad.
>>
>> I do think we would want more common code in this area, but I would 
>> think we'd have it more on the driver infrastructure side, than in 
>> the core mm.
>>
>> Dave.
zhuweixi Dec. 1, 2023, 2:44 a.m. UTC | #11
Thanks! I am planning to present GMEM in Linux MM Alignment Sessions so I can collect more input from the mm developers.

@Christian @Oak I will also send you invitations once a presentation is scheduled. :)

-Weixi

-----Original Message-----
From: David Hildenbrand <david@redhat.com> 
Sent: Thursday, November 30, 2023 10:55 PM
To: zhuweixi <weixi.zhu@huawei.com>; Dave Airlie <airlied@gmail.com>; Christian König <christian.koenig@amd.com>
Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org; weixi.zhu@openeuler.sh; mgorman@suse.de; jglisse@redhat.com; rcampbell@nvidia.com; jhubbard@nvidia.com; apopple@nvidia.com; mhairgrove@nvidia.com; ziy@nvidia.com; alexander.deucher@amd.com; Xinhui.Pan@amd.com; amd-gfx@lists.freedesktop.org; Felix.Kuehling@amd.com; ogabbay@kernel.org; dri-devel@lists.freedesktop.org; jgg@nvidia.com; leonro@nvidia.com; zhenyuw@linux.intel.com; zhi.a.wang@intel.com; intel-gvt-dev@lists.freedesktop.org; intel-gfx@lists.freedesktop.org; jani.nikula@linux.intel.com; joonas.lahtinen@linux.intel.com; rodrigo.vivi@intel.com; tvrtko.ursulin@linux.intel.com
Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

On 29.11.23 09:27, zhuweixi wrote:
> Glad to hear that more sharable code is desirable.
> IMHO, for a common MM subsystem, it is more beneficial for GMEM to 
> extend core MM instead of building a separate one.

More core-mm complexity, awesome, we all love that! ;)

--
Cheers,

David / dhildenb
Zeng, Oak Dec. 1, 2023, 5:48 a.m. UTC | #12
See inline comments

> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> zhuweixi
> Sent: Thursday, November 30, 2023 5:48 AM
> To: Christian König <ckoenig.leichtzumerken@gmail.com>; Zeng, Oak
> <oak.zeng@intel.com>; Christian König <christian.koenig@amd.com>; linux-
> mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org;
> Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel
> Vetter <daniel@ffwll.ch>
> Cc: tvrtko.ursulin@linux.intel.com; rcampbell@nvidia.com; apopple@nvidia.com;
> ziy@nvidia.com; weixi.zhu@openeuler.sh; jhubbard@nvidia.com; intel-
> gfx@lists.freedesktop.org; mhairgrove@nvidia.com; Wang, Zhi A
> <zhi.a.wang@intel.com>; Xinhui.Pan@amd.com; amd-gfx@lists.freedesktop.org;
> jglisse@redhat.com; dri-devel@lists.freedesktop.org; jgg@nvidia.com; Vivi,
> Rodrigo <rodrigo.vivi@intel.com>; alexander.deucher@amd.com;
> Felix.Kuehling@amd.com; intel-gvt-dev@lists.freedesktop.org;
> ogabbay@kernel.org; leonro@nvidia.com; mgorman@suse.de
> Subject: RE: [RFC PATCH 0/6] Supporting GMEM (generalized memory
> management) for external memory devices
> 
> Glad to know that there is a common demand for a new syscall like hmadvise(). I
> expect it would also be useful for homogeneous NUMA cases. Credits to
> cudaMemAdvise() API which brought this idea to GMEM's design.
> 
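(For comparison, this is roughly what the cudaMemAdvise() model looks like on the CUDA side. The snippet uses only the public CUDA runtime API and is not meant to suggest that the hmadvise() syscall proposed here shares this exact signature.)

    #include <cuda_runtime.h>

    /* Managed allocation plus placement/access hints via the CUDA runtime. */
    void advise_example(size_t bytes, int dev)
    {
            void *p;

            cudaMallocManaged(&p, bytes, cudaMemAttachGlobal);
            /* Prefer device-local placement for this range. */
            cudaMemAdvise(p, bytes, cudaMemAdviseSetPreferredLocation, dev);
            /* Keep a CPU mapping so CPU accesses don't always fault/migrate. */
            cudaMemAdvise(p, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
            /* Optionally migrate up front instead of faulting on first touch. */
            cudaMemPrefetchAsync(p, bytes, dev, 0);
    }
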
> To answer @Oak's questions about GMEM vs. HMM,
> 
> Here is the major difference:
>   GMEM's main target is to stop drivers from reinventing MM code, while
> HMM/MMU notifiers provide a compatible struct page solution and a
> coordination mechanism for existing device driver MMs that requires adding
> extra code to interact with CPU MM.
> 
> A straightforward qualitative result for the main target: after integrating Huawei's
> Ascend NPU driver with GMEM's interface, 30,000 lines of MM code were cut,
> leaving <100 lines invoking GMEM interface and 3,700 lines implementing vendor-
> specific functions. Some code from the 3,700 lines should be further moved to
> GMEM as a generalized feature like device memory oversubscription, but not
> included in this RFC patch yet.
> 
> A list of high-level differences:
>   1. With HMM/MMU notifiers, drivers need to first implement a full MM
> subsystem. With GMEM, drivers can reuse Linux's core MM.

A full mm subsystem essentially has the functions below:

Physical memory management: neither your approach nor the hmm-based solution provides device physical memory management. You mentioned you have a plan, but at least for now the driver needs to manage device physical memory.

Virtual address space management: both approaches leverage Linux core mm (VMAs) for this.

Data eviction, migration: with hmm, the driver needs to implement this. It is not clear whether gmem has this function. I guess even if gmem has it, it might be a slow cpu data copy, compared to a modern gpu's fast data copy engine.

Device page table update, va-pa mapping: I think it is the driver's responsibility in both approaches.

So from the point of reusing core MM, I don't see a big difference. Maybe you did it more elegantly. I think it is very possible that with your approach the driver can be simpler, with less code.

> 
>   2. HMM encodes device mapping information in the CPU arch-dependent PTEs,
> while GMEM proposes an abstraction layer in vm_object. Since GMEM's
> approach further decouples the arch-related stuff, drivers do not need to
> implement separate code for X86/ARM and etc.

I don't understand this... with hmm, when a virtual address range's backing store is in device memory, the cpu pte is encoded to point to device memory. The device page table is also encoded to point to the same device memory location. But since device memory is not accessible to the CPU (DEVICE_PRIVATE), when the cpu accesses this virtual address there is a cpu page fault. Device mapping info is still in the device page table, not in cpu ptes.

I do not see with hmm why the driver needs to implement x86/arm code... the driver only takes care of the device page table. Hmm/core mm take care of the cpu page table, right?

> 
>   3. MMU notifiers register hooks at certain core MM events, while GMEM
> declares basic functions and internally invokes them. GMEM requires less from
> the driver side -- no need to understand what core MM behaves at certain MMU
> events. GMEM also expects fewer bugs than MMU notifiers: implementing basic
> operations with standard declarations vs. implementing whatever random device
> MM logic in MMU notifiers.

This seems true to me. I feel the mmu notifier thing, especially the synchronization/lock design (those sequence numbers, interacting with the driver lock, and the mmap lock), is very complicated. I indeed spent time understanding the specification documented in hmm.rst...

Your approach seems better.
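
(For reference, the fault/retry pattern that hmm.rst documents boils down to roughly the sketch below. Only mmu_interval_read_begin(), hmm_range_fault(), mmu_interval_read_retry() and the mmap/mutex lock helpers are real kernel APIs; struct drv_ctx and its fields are made up for illustration, and the fixed-size pfn array is a simplification.)

    #include <linux/hmm.h>
    #include <linux/mm.h>
    #include <linux/mmu_notifier.h>
    #include <linux/mutex.h>

    /* Hypothetical driver context, illustration only. */
    struct drv_ctx {
            struct mmu_interval_notifier notifier;
            struct mm_struct *mm;
            struct mutex pt_lock;          /* protects the device page table */
    };

    static int drv_populate_range(struct drv_ctx *ctx, unsigned long start,
                                  unsigned long npages)
    {
            unsigned long pfns[16];        /* assume npages <= 16 in this sketch */
            struct hmm_range range = {
                    .notifier      = &ctx->notifier,
                    .start         = start,
                    .end           = start + npages * PAGE_SIZE,
                    .hmm_pfns      = pfns,
                    .default_flags = HMM_PFN_REQ_FAULT,
            };
            int ret;

    again:
            range.notifier_seq = mmu_interval_read_begin(&ctx->notifier);
            mmap_read_lock(ctx->mm);
            ret = hmm_range_fault(&range);
            mmap_read_unlock(ctx->mm);
            if (ret == -EBUSY)
                    goto again;
            if (ret)
                    return ret;

            mutex_lock(&ctx->pt_lock);
            if (mmu_interval_read_retry(&ctx->notifier, range.notifier_seq)) {
                    mutex_unlock(&ctx->pt_lock);
                    goto again;            /* an invalidation raced with us */
            }
            /* ... program the device page table from pfns[] ... */
            mutex_unlock(&ctx->pt_lock);
            return 0;
    }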

> 
>   4. GMEM plans to support a more lightweight physical memory management.
> The discussion about this part can be found in my cover letter. The question is
> whether struct page should be compatible (directly use HMM's ZONE_DEVICE
> solution) or a trimmed, smaller struct page that satisfies generalized demands
> from accelerators is preferable?
> 
>   5. GMEM has been demonstrated to allow device memory oversubscription (a
> GMEM-based 32GB NPU card can run a GPT model oversubscribing 500GB host
> DDR), while drivers using HMM/MMU notifier must implement this logic one by
> one. I will submit this part in a future RFC patch.

When device memory is oversubscribed, do you call a driver callback function to evict device memory to system memory? Or just do a cpu copy? Copying with the device's fast copy engine is faster.

I can see that even though with both approaches we need to implement a driver copy function, with your approach the driver logic can be simplified. With today's drm/ttm, I do see that the logic in the memory eviction area is very complicated. Those eviction fences (some call them suspend fences), dma-fence enable-signalling... very complicated to me.

Essentially, evicting device memory to system memory is no different from evicting system memory to disk... so if your approach can leverage some linux core mm eviction logic, I do see it could simplify things here...

> 
> I want to reiterate that GMEM's shared address space support is a bonus result,
> not a main contribution... It was done because it was not difficult to implement
> internal CPU-device coordination mechanism when core MM is extended by
> GMEM to support devices.

Besides memory eviction/oversubscription, there are a few other pain points when I use hmm:

1) hmm doesn't support file-backed memory, so it is hard to share memory b/t processes in a gpu environment. You mentioned you have a plan... How hard is it to support file-backed memory in your approach?
2) virtual address range based memory attributes/hints: with hmadvise, where do you save the memory attribute of a virtual address range? Do you need to extend vm_area_struct to save it? With hmm, we have to maintain such information in the driver. This ends up with pretty complicated logic to split/merge those address ranges. I know core mm has similar logic to split/merge vmas... (one way to keep such attributes outside the vma is sketched below)
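
(A minimal sketch of the "keep it outside the vma" option: a per-context interval tree keyed by address range. Only the interval_tree_* helpers are real kernel APIs; hint_node and the two functions are hypothetical, and splitting/merging of overlapping ranges is omitted.)

    #include <linux/interval_tree.h>
    #include <linux/slab.h>

    struct hint_node {
            struct interval_tree_node it;  /* it.start / it.last, inclusive */
            int advice;                    /* e.g. an hmadvise() policy value */
    };

    static int hint_store_set(struct rb_root_cached *root, unsigned long start,
                              unsigned long last, int advice)
    {
            struct hint_node *h = kzalloc(sizeof(*h), GFP_KERNEL);

            if (!h)
                    return -ENOMEM;
            h->it.start = start;
            h->it.last = last;
            h->advice = advice;
            interval_tree_insert(&h->it, root);
            return 0;
    }

    static int hint_store_lookup(struct rb_root_cached *root, unsigned long addr)
    {
            struct interval_tree_node *it =
                    interval_tree_iter_first(root, addr, addr);

            return it ? container_of(it, struct hint_node, it)->advice : 0;
    }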

Oak


> 
> -Weixi
> 
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Thursday, November 30, 2023 4:28 PM
> To: Zeng, Oak <oak.zeng@intel.com>; Christian König
> <christian.koenig@amd.com>; zhuweixi <weixi.zhu@huawei.com>; linux-
> mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org;
> Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel
> Vetter <daniel@ffwll.ch>
> Cc: intel-gvt-dev@lists.freedesktop.org; rcampbell@nvidia.com;
> mhairgrove@nvidia.com; jgg@nvidia.com; weixi.zhu@openeuler.sh;
> jhubbard@nvidia.com; intel-gfx@lists.freedesktop.org; apopple@nvidia.com;
> Xinhui.Pan@amd.com; amd-gfx@lists.freedesktop.org;
> tvrtko.ursulin@linux.intel.com; ogabbay@kernel.org; jglisse@redhat.com; dri-
> devel@lists.freedesktop.org; ziy@nvidia.com; Vivi, Rodrigo
> <rodrigo.vivi@intel.com>; alexander.deucher@amd.com; leonro@nvidia.com;
> Felix.Kuehling@amd.com; Wang, Zhi A <zhi.a.wang@intel.com>;
> mgorman@suse.de
> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
> management) for external memory devices
> 
> Hi Oak,
> 
> yeah, #4 is indeed a really good point and I think Felix will agree to that as well.
> 
> HMM is basically still missing a way to advise device attributes for the CPU
> address space. Both migration strategy as well as device specific information (like
> cache preferences) fall into this category.
> 
> Since there is a device specific component in those attributes as well I think
> device specific IOCTLs still make sense to update them, but HMM should offer
> the functionality to manage and store those information.
> 
> Split and merge of VMAs only become a problem if you attach those information
> to VMAs, if you keep them completely separate than that doesn't become an
> issue either. The down side of this approach is that you don't get automatically
> extending attribute ranges for growing VMAs for example.
> 
> Regards,
> Christian.
> 
> Am 29.11.23 um 23:23 schrieb Zeng, Oak:
> > Hi Weixi,
> >
> > Even though Christian has listed reasons rejecting this proposal (yes they are
> very reasonable to me), I would open my mind and further explore the possibility
> here. Since the current GPU driver uses a hmm based implementation (AMD and
> NV has done this; At Intel we are catching up), I want to explore how much we
> can benefit from the proposed approach and how your approach can solve some
> pain points of our development. So basically what I am questioning here is: what
> is the advantage of your approach against hmm.
> >
> > To implement a UVM (unified virtual address space b/t cpu and gpu device),
> with hmm, driver essentially need to implement below functions:
> >
> > 1. device page table update. Your approach requires the same because
> > this is device specific codes
> >
> > 2. Some migration functions to migrate memory b/t system memory and GPU
> local memory. My understanding is, even though you generalized this a bit, such
> as modified cpu page fault path, provided "general" gm_dev_fault handler... but
> device driver still need to provide migration functions because migration
> functions have to be device specific (i.e., using device dma/copy engine for
> performance purpose). Right?
> >
> > 3. GPU physical memory management, this part is now in drm/buddy, shared
> by all drivers. I think with your approach, driver still need to provide callback
> functions to allocate/free physical pages. Right? Or do you let linux core mm
> buddy manage device memory directly?
> >
> > 4. madvise/hints/virtual address range management. This has been pain point
> for us. Right now device driver has to maintain certain virtual address range data
> structure to maintain hints and other virtual address range based memory
> attributes. Driver need to sync with linux vma. Driver need to explicitly deal with
> range split/merging... HMM doesn't provide support in this area. Your approach
> seems cleaner/simpler to me...
> >
> >
> > So in above, I have examined the some key factors of a gpu UVM memory
> manager. I think for #1 and #2, hmm has provide pretty good abstraction/tools
> for address space mirroring and migration helpers. For #3, since we have a
> common drm/buddy layer, I don't think it is a big problem for driver writer now.
> >
> > I do see #4 is something you solved more beautifully, requires new system call
> though.
> >
> > Oak
Alistair Popple Dec. 1, 2023, 6:01 a.m. UTC | #13
zhuweixi <weixi.zhu@huawei.com> writes:

> Glad to know that there is a common demand for a new syscall like
> hmadvise(). I expect it would also be useful for homogeneous NUMA
> cases. Credits to cudaMemAdvise() API which brought this idea to
> GMEM's design.

It's not clear to me that this would need to be a new syscall. Scanning
the patches it looks like you're adding a NUMA node anyway, so the
existing interfaces (e.g. madvise) with their various options
(MPOL_PREFERRED/PREFERRED_MANY) and
set_mempolicy/set_mempolicy_home_node() could potentially cover this for
both NUMA and hNUMA nodes. The main advantage here would be providing a
common userspace interface for setting these kinds of hints.
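For illustration, this is roughly what the existing interfaces already
allow from userspace today, without a new syscall (a minimal sketch; it
assumes the device memory is online as NUMA node 'nid', and the helper
names are made up):

  #include <numaif.h>      /* mbind(), MPOL_PREFERRED, MPOL_MF_MOVE */
  #include <sys/mman.h>    /* madvise(), MADV_DONTNEED */

  /* Prefer node 'nid' for [addr, addr + len) and migrate what is already
   * mapped there; 'nid' could just as well name a device (h)NUMA node. */
  static int prefer_node(void *addr, size_t len, int nid)
  {
          unsigned long nodemask = 1UL << nid;

          return mbind(addr, len, MPOL_PREFERRED, &nodemask,
                       8 * sizeof(nodemask) + 1, MPOL_MF_MOVE);
  }

  /* Drop the backing pages again, e.g. to keep device memory free. */
  static int drop_range(void *addr, size_t len)
  {
          return madvise(addr, len, MADV_DONTNEED);
  }

What the proposed hmadvise() adds on top of this is essentially a
per-node prefetch hint with a migration guarantee, which the existing
policy modes do not express.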

> To answer @Oak's questions about GMEM vs. HMM,
>
> Here is the major difference:
>   GMEM's main target is to stop drivers from reinventing MM code,
> while HMM/MMU notifiers provide a compatible struct page solution and
> a coordination mechanism for existing device driver MMs that requires
> adding extra code to interact with CPU MM.
>
> A straightforward qualitative result for the main target: after
> integrating Huawei's Ascend NPU driver with GMEM's interface, 30,000
> lines of MM code were cut, leaving <100 lines invoking GMEM interface
> and 3,700 lines implementing vendor-specific functions. Some code from
> the 3,700 lines should be further moved to GMEM as a generalized
> feature like device memory oversubscription, but not included in this
> RFC patch yet.

I think it would be helpful if you could be a bit more specific about
what functionality the current HMM/migrate_vma/MMU notifier interfaces
are missing such that every driver has to implement it in a common way,
because I'm not convinced we can't either improve those interfaces to
provide what's needed or add specific components (e.g. a physical page
allocator) instead of a whole new framework.

> A list of high-level differences: 
>   1. With HMM/MMU notifiers, drivers need to first implement a full MM subsystem. With GMEM, drivers can reuse Linux's core MM.

What do the common bits of this full MM subsystem look like?
Fundamentally the existing HMM functionality can already make use of
Linux core MM to manage page tables and migrate pages, and everything
else seems pretty device specific (i.e. the actual copying of data,
programming of MMUs, TLBs, etc.).
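For readers less familiar with it, the per-driver pattern in question is
the mirroring loop documented in Documentation/mm/hmm.rst, roughly as
below (a condensed sketch; dpt_lock and dev_update_ptes() stand in for
the driver's own page-table lock and update code, and the caller is
assumed to hold a reference on the mm):

  /* Needs <linux/hmm.h> and <linux/mmu_notifier.h>. */
  static int mirror_range(struct mmu_interval_notifier *mni,
                          unsigned long start, unsigned long end,
                          unsigned long *pfns)
  {
          struct hmm_range range = {
                  .notifier      = mni,
                  .start         = start,
                  .end           = end,
                  .hmm_pfns      = pfns,
                  .default_flags = HMM_PFN_REQ_FAULT,
          };
          int ret;

  again:
          range.notifier_seq = mmu_interval_read_begin(mni);
          mmap_read_lock(mni->mm);
          ret = hmm_range_fault(&range);   /* faults CPU pages as needed */
          mmap_read_unlock(mni->mm);
          if (ret == -EBUSY)
                  goto again;              /* raced with an invalidation */
          if (ret)
                  return ret;

          mutex_lock(&dpt_lock);
          if (mmu_interval_read_retry(mni, range.notifier_seq)) {
                  mutex_unlock(&dpt_lock);
                  goto again;
          }
          dev_update_ptes(pfns, start, end);   /* device-specific part */
          mutex_unlock(&dpt_lock);
          return 0;
  }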

I can see that there would be scope to have, say, a generic memory
allocator, which I vaguely recall discussing in relation to
DEVICE_PRIVATE pages in the past, but @Oak suggests something close
already exists (drm/buddy).

Potentially, I suppose, there is VA allocation that might be common
across devices. However, I have not had any experience working with
devices whose VA requirements differ enough from the CPU's to matter. If
they are that different, I'm not convinced it would be easy to have a
common implementation anyway.

>   2. HMM encodes device mapping information in the CPU arch-dependent
> PTEs, while GMEM proposes an abstraction layer in vm_object. Since
> GMEM's approach further decouples the arch-related stuff, drivers do
> not need to implement separate code for X86/ARM and etc.

I'm not following this. At present all HMM encodes in CPU PTEs is the
fact that a page has been migrated to the device and what permissions it
has. I'm not aware of needing to treat X86 and ARM differently here, for
example. Are you saying you want somewhere to store other bits attached
to a particular VA?

>   3. MMU notifiers register hooks at certain core MM events, while
> GMEM declares basic functions and internally invokes them. GMEM
> requires less from the driver side -- no need to understand what core
> MM behaves at certain MMU events. GMEM also expects fewer bugs than
> MMU notifiers: implementing basic operations with standard
> declarations vs. implementing whatever random device MM logic in MMU
> notifiers.

How is this proposal any different though? From what I can see it
replaces MMU notifier callbacks with TLB invalidation callbacks, but
that is essentially what MMU notifier callbacks are anyway. The "random
device MM logic" should just be clearing device TLBs. What other MM
logic has to be implemented in the MMU notifier callbacks that is the
same between devices?
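In other words, the callback a driver registers today typically boils
down to something like the following (a sketch; struct dev_ctx, dpt_lock
and dev_clear_ptes_and_flush_tlb() are placeholders for driver state,
the driver's page-table lock and the actual device TLB shootdown):

  static bool dev_invalidate(struct mmu_interval_notifier *mni,
                             const struct mmu_notifier_range *range,
                             unsigned long cur_seq)
  {
          struct dev_ctx *ctx = container_of(mni, struct dev_ctx, notifier);

          if (!mmu_notifier_range_blockable(range))
                  return false;

          mutex_lock(&ctx->dpt_lock);
          mmu_interval_set_seq(mni, cur_seq);
          dev_clear_ptes_and_flush_tlb(ctx, range->start, range->end);
          mutex_unlock(&ctx->dpt_lock);
          return true;
  }

  static const struct mmu_interval_notifier_ops dev_mni_ops = {
          .invalidate = dev_invalidate,
  };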

>   4. GMEM plans to support a more lightweight physical memory
> management. The discussion about this part can be found in my cover
> letter. The question is whether struct page should be compatible
> (directly use HMM's ZONE_DEVICE solution) or a trimmed, smaller struct
> page that satisfies generalized demands from accelerators is more
> preferrable?

What is wrong with the current ZONE_DEVICE solution? You mention the
size of struct page, but that is already being worked on through the
conversion to folios. Admittedly, higher-order HMM ZONE_DEVICE folios
are not currently supported, but that is something I'm actively working
on at the moment.
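For context, "the current ZONE_DEVICE solution" amounts to registering
the device memory so that core MM gets a struct page per device page,
roughly as below (a sketch; 'res', dev_page_free() and
dev_migrate_to_ram() are driver-specific placeholders, the latter being
what runs when the CPU faults on a DEVICE_PRIVATE page):

  static const struct dev_pagemap_ops dev_pgmap_ops = {
          .page_free      = dev_page_free,
          .migrate_to_ram = dev_migrate_to_ram,
  };

  static int dev_register_vram(struct device *dev, struct resource *res)
  {
          struct dev_pagemap *pgmap;
          void *addr;

          pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
          if (!pgmap)
                  return -ENOMEM;

          pgmap->type        = MEMORY_DEVICE_PRIVATE;
          pgmap->range.start = res->start;
          pgmap->range.end   = res->end;
          pgmap->nr_range    = 1;
          pgmap->ops         = &dev_pgmap_ops;
          pgmap->owner       = dev;  /* matched against dev_private_owner */

          addr = devm_memremap_pages(dev, pgmap);
          return IS_ERR(addr) ? PTR_ERR(addr) : 0;
  }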

>   5. GMEM has been demonstrated to allow device memory
> oversubscription (a GMEM-based 32GB NPU card can run a GPT model
> oversubscribing 500GB host DDR), while drivers using HMM/MMU notifier
> must implement this logic one by one. I will submit this part in a
> future RFC patch.

I guess that would need to be part of the physical page allocator, right?

> I want to reiterate that GMEM's shared address space support is a
> bonus result, not a main contribution... It was done because it was
> not difficult to implement internal CPU-device coordination mechanism
> when core MM is extended by GMEM to support devices.
>
> -Weixi
Alistair Popple Dec. 1, 2023, 6:11 a.m. UTC | #14
"Zeng, Oak" <oak.zeng@intel.com> writes:

> See inline comments
>
>> -----Original Message-----
>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>> zhuweixi
>> Sent: Thursday, November 30, 2023 5:48 AM
>> To: Christian König <ckoenig.leichtzumerken@gmail.com>; Zeng, Oak
>> <oak.zeng@intel.com>; Christian König <christian.koenig@amd.com>; linux-
>> mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org;
>> Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel
>> Vetter <daniel@ffwll.ch>
>> Cc: tvrtko.ursulin@linux.intel.com; rcampbell@nvidia.com; apopple@nvidia.com;
>> ziy@nvidia.com; weixi.zhu@openeuler.sh; jhubbard@nvidia.com; intel-
>> gfx@lists.freedesktop.org; mhairgrove@nvidia.com; Wang, Zhi A
>> <zhi.a.wang@intel.com>; Xinhui.Pan@amd.com; amd-gfx@lists.freedesktop.org;
>> jglisse@redhat.com; dri-devel@lists.freedesktop.org; jgg@nvidia.com; Vivi,
>> Rodrigo <rodrigo.vivi@intel.com>; alexander.deucher@amd.com;
>> Felix.Kuehling@amd.com; intel-gvt-dev@lists.freedesktop.org;
>> ogabbay@kernel.org; leonro@nvidia.com; mgorman@suse.de
>> Subject: RE: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>> management) for external memory devices
>> 
>> Glad to know that there is a common demand for a new syscall like hmadvise(). I
>> expect it would also be useful for homogeneous NUMA cases. Credits to
>> cudaMemAdvise() API which brought this idea to GMEM's design.
>> 
>> To answer @Oak's questions about GMEM vs. HMM,
>> 
>> Here is the major difference:
>>   GMEM's main target is to stop drivers from reinventing MM code, while
>> HMM/MMU notifiers provide a compatible struct page solution and a
>> coordination mechanism for existing device driver MMs that requires adding
>> extra code to interact with CPU MM.
>> 
>> A straightforward qualitative result for the main target: after integrating Huawei's
>> Ascend NPU driver with GMEM's interface, 30,000 lines of MM code were cut,
>> leaving <100 lines invoking GMEM interface and 3,700 lines implementing vendor-
>> specific functions. Some code from the 3,700 lines should be further moved to
>> GMEM as a generalized feature like device memory oversubscription, but not
>> included in this RFC patch yet.
>> 
>> A list of high-level differences:
>>   1. With HMM/MMU notifiers, drivers need to first implement a full MM
>> subsystem. With GMEM, drivers can reuse Linux's core MM.
>
> A full mm subsystem essentially has below functions:
>
> Physical memory management: neither your approach nor the hmm-based
> solution provides device physical memory management. You mentioned you
> have a plan, but at least for now the driver needs to manage device
> physical memory.
>
> Virtual address space management: both approaches leverage linux core mm (vma) for this.
>
> Data eviction, migration: with hmm, the driver needs to implement
> this. It is not clear whether gmem has this function. I guess even if
> gmem has it, it might be a slow cpu data copy, compared to a modern
> gpu's fast data copy engine.
>
> Device page table update, va-pa mapping: I think it is the driver's responsibility in both approaches.
>
> So from the point of view of reusing core MM, I don't see a big
> difference. Maybe you did it more elegantly. I think it is very
> possible that with your approach the driver can be simpler, with less
> code.
>
>> 
>>   2. HMM encodes device mapping information in the CPU arch-dependent PTEs,
>> while GMEM proposes an abstraction layer in vm_object. Since GMEM's
>> approach further decouples the arch-related stuff, drivers do not need to
>> implement separate code for X86/ARM and etc.
>
> I don't understand this... with hmm, when a virtual address range's
> backing store is in device memory, the cpu pte is encoded to point to
> device memory. The device page table is also encoded to point to the
> same device memory location. But since device memory is not accessible
> to the CPU (DEVICE_PRIVATE), when the cpu accesses this virtual
> address there is a cpu page fault. Device mapping info is still in the
> device page table, not in cpu ptes.
>
> I do not see with hmm why the driver needs to implement x86/arm
> code... the driver only takes care of the device page table. Hmm/core
> mm takes care of the cpu page table, right?

I see our replies have crossed, but that is my understanding as well.

>> 
>>   3. MMU notifiers register hooks at certain core MM events, while GMEM
>> declares basic functions and internally invokes them. GMEM requires less from
>> the driver side -- no need to understand what core MM behaves at certain MMU
>> events. GMEM also expects fewer bugs than MMU notifiers: implementing basic
>> operations with standard declarations vs. implementing whatever random device
>> MM logic in MMU notifiers.
>
> This seems true to me. I feel the mmu notifier thing, especially the
> synchronization/lock design (those sequence numbers, interacting with
> the driver lock and the mmap lock), is very complicated. I indeed
> spent time understanding the specification documented in hmm.rst...

No argument there, but I think that's something we could look at
providing an improved interface for. I don't think it needs a whole new
subsystem to fix. Probably just a version of hmm_range_fault() that
takes the lock and sets up an MMU notifier itself.
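Something like the following, purely as a strawman (no such helper
exists today; the name and signature are made up here):

  /*
   * Hypothetical: wrap the mmu_interval_read_begin()/mmap_read_lock()/
   * hmm_range_fault()/-EBUSY retry dance in one call. The caller would
   * only be left with the final mmu_interval_read_retry() check against
   * *out_seq under its own page-table lock.
   */
  int hmm_range_fault_sync(struct mmu_interval_notifier *mni,
                           unsigned long start, unsigned long end,
                           unsigned long *hmm_pfns,
                           unsigned long req_flags,
                           unsigned long *out_seq);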

I do think there is value in getting notified when core MM programs new
PTEs, though, as it would avoid expensive device faults. That's
something there is currently no way of doing.

> Your approach seems better.
>
>> 
>>   4. GMEM plans to support a more lightweight physical memory management.
>> The discussion about this part can be found in my cover letter. The question is
>> whether struct page should be compatible (directly use HMM's ZONE_DEVICE
>> solution) or a trimmed, smaller struct page that satisfies generalized demands
>> from accelerators is more preferrable?
>> 
>>   5. GMEM has been demonstrated to allow device memory oversubscription (a
>> GMEM-based 32GB NPU card can run a GPT model oversubscribing 500GB host
>> DDR), while drivers using HMM/MMU notifier must implement this logic one by
>> one. I will submit this part in a future RFC patch.
>
> When device memory is oversubscribed, do you call a driver callback
> function to evict device memory to system memory? Or just a cpu copy?
> Copying with the device's fast copy engine is faster.
>
> I can see that even though with both approaches we need to implement a
> driver copy function, with your approach the driver logic can be
> simplified. With today's drm/ttm, I do see that the logic in the
> memory eviction area is very complicated. Those eviction fences (some
> call them suspend fences), dma-fence enable signalling... very
> complicated to me.
>
> Essentially evicting device memory to system memory is no different
> from evicting system memory to disk... so if your approach can
> leverage some linux core mm eviction logic, I do see it can simplify
> things here...
>
>> 
>> I want to reiterate that GMEM's shared address space support is a bonus result,
>> not a main contribution... It was done because it was not difficult to implement
>> internal CPU-device coordination mechanism when core MM is extended by
>> GMEM to support devices.
>
> Besides memory eviction/oversubscription, there are a few other pain points when I use hmm:
>
> 1) hmm doesn't support file-backed memory, so it is hard to share
> memory b/t processes in a gpu environment. You mentioned you have a
> plan... How hard is it to support file-backed memory in your approach?
> 2) virtual address range based memory attributes/hints: with hmadvise,
> where do you save the memory attribute of a virtual address range? Do
> you need to extend vm_area_struct to save it? With hmm, we have to
> maintain such information in the driver. This ends up with pretty
> complicated logic to split/merge those address ranges. I know core mm
> has similar logic to split/merge vmas...
>
> Oak
>
>
>> 
>> -Weixi
>> 
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>> Sent: Thursday, November 30, 2023 4:28 PM
>> To: Zeng, Oak <oak.zeng@intel.com>; Christian König
>> <christian.koenig@amd.com>; zhuweixi <weixi.zhu@huawei.com>; linux-
>> mm@kvack.org; linux-kernel@vger.kernel.org; akpm@linux-foundation.org;
>> Danilo Krummrich <dakr@redhat.com>; Dave Airlie <airlied@redhat.com>; Daniel
>> Vetter <daniel@ffwll.ch>
>> Cc: intel-gvt-dev@lists.freedesktop.org; rcampbell@nvidia.com;
>> mhairgrove@nvidia.com; jgg@nvidia.com; weixi.zhu@openeuler.sh;
>> jhubbard@nvidia.com; intel-gfx@lists.freedesktop.org; apopple@nvidia.com;
>> Xinhui.Pan@amd.com; amd-gfx@lists.freedesktop.org;
>> tvrtko.ursulin@linux.intel.com; ogabbay@kernel.org; jglisse@redhat.com; dri-
>> devel@lists.freedesktop.org; ziy@nvidia.com; Vivi, Rodrigo
>> <rodrigo.vivi@intel.com>; alexander.deucher@amd.com; leonro@nvidia.com;
>> Felix.Kuehling@amd.com; Wang, Zhi A <zhi.a.wang@intel.com>;
>> mgorman@suse.de
>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>> management) for external memory devices
>> 
>> Hi Oak,
>> 
>> yeah, #4 is indeed a really good point and I think Felix will agree to that as well.
>> 
>> HMM is basically still missing a way to advise device attributes for the CPU
>> address space. Both migration strategy as well as device specific information (like
>> cache preferences) fall into this category.
>> 
>> Since there is a device specific component in those attributes as well I think
>> device specific IOCTLs still make sense to update them, but HMM should offer
>> the functionality to manage and store those information.
>> 
>> Split and merge of VMAs only become a problem if you attach those information
>> to VMAs, if you keep them completely separate than that doesn't become an
>> issue either. The down side of this approach is that you don't get automatically
>> extending attribute ranges for growing VMAs for example.
>> 
>> Regards,
>> Christian.
>> 
>> Am 29.11.23 um 23:23 schrieb Zeng, Oak:
>> > Hi Weixi,
>> >
>> > Even though Christian has listed reasons rejecting this proposal (yes they are
>> very reasonable to me), I would open my mind and further explore the possibility
>> here. Since the current GPU driver uses a hmm based implementation (AMD and
>> NV has done this; At Intel we are catching up), I want to explore how much we
>> can benefit from the proposed approach and how your approach can solve some
>> pain points of our development. So basically what I am questioning here is: what
>> is the advantage of your approach against hmm.
>> >
>> > To implement a UVM (unified virtual address space b/t cpu and gpu device),
>> with hmm, driver essentially need to implement below functions:
>> >
>> > 1. device page table update. Your approach requires the same because
>> > this is device specific codes
>> >
>> > 2. Some migration functions to migrate memory b/t system memory and GPU
>> local memory. My understanding is, even though you generalized this a bit, such
>> as modified cpu page fault path, provided "general" gm_dev_fault handler... but
>> device driver still need to provide migration functions because migration
>> functions have to be device specific (i.e., using device dma/copy engine for
>> performance purpose). Right?
>> >
>> > 3. GPU physical memory management, this part is now in drm/buddy, shared
>> by all drivers. I think with your approach, driver still need to provide callback
>> functions to allocate/free physical pages. Right? Or do you let linux core mm
>> buddy manage device memory directly?
>> >
>> > 4. madvise/hints/virtual address range management. This has been pain point
>> for us. Right now device driver has to maintain certain virtual address range data
>> structure to maintain hints and other virtual address range based memory
>> attributes. Driver need to sync with linux vma. Driver need to explicitly deal with
>> range split/merging... HMM doesn't provide support in this area. Your approach
>> seems cleaner/simpler to me...
>> >
>> >
>> > So in above, I have examined the some key factors of a gpu UVM memory
>> manager. I think for #1 and #2, hmm has provide pretty good abstraction/tools
>> for address space mirroring and migration helpers. For #3, since we have a
>> common drm/buddy layer, I don't think it is a big problem for driver writer now.
>> >
>> > I do see #4 is something you solved more beautifully, requires new system call
>> though.
>> >
>> > Oak
>> >
>> >
David Hildenbrand Dec. 1, 2023, 9:29 a.m. UTC | #15
On 01.12.23 03:44, zhuweixi wrote:
> Thanks!

I hope you understood that that was a joke :)

> I am planning to present GMEM in Linux MM Alignment Sessions so I can collect more input from the mm developers.

Sounds good. But please try inviting key HMM/driver developers as well.

Most of the core-mm folks attending that meeting are not that familiar 
with these concepts and they are usually not happy about:

(1) More core-MM complexity for things that can be neatly handled in
     separate subsystems with the existing infrastructure already.

(2) One new way of doing things while the other things remain in place.

(3) New MMAP flags. Usually you have a hard time getting this in.
     Sometimes, there are other ways (e.g., special-purpose file-
     systems).

(4) Changing controversial core-mm design decisions to handle corner
     cases.
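
For readers less familiar with point (3), a rough userspace-level sketch of the two routes being contrasted. MAP_PEER_SHARED is only the flag proposed in this series (the numeric value below is a placeholder, not an upstream define), and /dev/peer_mem is a made-up device node used purely for illustration.

#include <fcntl.h>
#include <sys/mman.h>

#ifndef MAP_PEER_SHARED
#define MAP_PEER_SHARED 0x8000000	/* placeholder for the proposed flag */
#endif

int main(void)
{
	size_t len = 2UL << 20;

	/* (a) RFC route: anonymous memory with a new core-MM mmap() flag. */
	void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_PEER_SHARED, -1, 0);

	/* (b) Alternative route: mmap() a special-purpose file exposed by
	 * the driver, keeping the new semantics out of core MM. */
	int fd = open("/dev/peer_mem", O_RDWR);
	void *b = (fd >= 0) ? mmap(NULL, len, PROT_READ | PROT_WRITE,
				   MAP_SHARED, fd, 0) : MAP_FAILED;

	(void)a; (void)b;
	return 0;
}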
Christian König Dec. 1, 2023, 1:16 p.m. UTC | #16
Am 01.12.23 um 06:48 schrieb Zeng, Oak:
> [SNIP]
>>    3. MMU notifiers register hooks at certain core MM events, while GMEM
>> declares basic functions and internally invokes them. GMEM requires less from
>> the driver side -- no need to understand how core MM behaves at certain MMU
>> events. GMEM also expects fewer bugs than MMU notifiers: implementing basic
>> operations with standard declarations vs. implementing whatever random device
>> MM logic in MMU notifiers.
> This seems true to me. I feel the mmu notifier thing, especially the synchronization/lock design (those sequence numbers, interacting with the driver lock, and the mmap lock), is very complicated. I indeed spent time understanding the specification documented in hmm.rst...
>
> Your approach seems better.

I have to agree on that as well. HMM/MMU notifiers were developed with 
exposing MM functionality in mind rather than with fulfilling driver 
requirements.

But this did not originate in HMM/MMU notifiers; rather, it was a 
requirement not to change the CPU side of the MM code too much.

So if you can get the acknowledgement to make changes to the CPU side 
of the MM code to better handle device driver requirements, then I'm 
totally in favor of this.

It's just that I don't think the approach of starting with a new 
framework/idea will help with that. Instead, try to improve the 
existing functionality.
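
For concreteness, a minimal sketch of the MMU-notifier pattern being compared here, using only the existing kernel API. Everything prefixed my_ is made up; the struct gm_mmu callbacks named in the comment are taken from the cover letter's description, with their exact signatures left as an assumption.

#include <linux/mmu_notifier.h>

struct my_dev_ctx {
	struct mmu_notifier notifier;
	/* device page-table state, locks, ... */
};

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *range)
{
	struct my_dev_ctx *ctx = container_of(mn, struct my_dev_ctx, notifier);

	/*
	 * Driver-specific: unmap [range->start, range->end) on the device
	 * and flush the device TLB before core MM changes the CPU mapping.
	 */
	(void)ctx;
	return 0;
}

static const struct mmu_notifier_ops my_notifier_ops = {
	.invalidate_range_start = my_invalidate_range_start,
};

int my_dev_ctx_attach(struct my_dev_ctx *ctx, struct mm_struct *mm)
{
	/*
	 * Under the GMEM proposal the driver would instead fill in the
	 * struct gm_mmu callbacks (peer_map(), peer_unmap(), tlb_invl(),
	 * pmap_*(), ...) and never react to core-MM events directly.
	 */
	ctx->notifier.ops = &my_notifier_ops;
	return mmu_notifier_register(&ctx->notifier, mm);
}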

>>    5. GMEM has been demonstrated to allow device memory oversubscription (a
>> GMEM-based 32GB NPU card can run a GPT model oversubscribing 500GB host
>> DDR), while drivers using HMM/MMU notifier must implement this logic one by
>> one. I will submit this part in a future RFC patch.
> When device memory is oversubscribed, do you call a driver callback function to evict device memory to system memory? Or just a CPU copy? Copying with the device's fast copy engine is faster.
>
> I can see that even though with both approaches we need to implement a driver copy function, with your approach the driver logic can be simplified. With today's drm/ttm, I do see the logic in the memory eviction area is very complicated. Those eviction fences (some call them suspend fences), dma-fence enable signalling... it is very complicated to me.
>
> Essentially, evicting device memory to system memory is nothing different from evicting system memory to disk... so if your approach can leverage some Linux core mm eviction logic, I do see it can simplify things here...

We actually already do this in TTM as well through the MM shrinkers.

It's just that it's an intentional design decision to make the whole 
thing asynchronous, using dma_fence etc... That's why you have this 
complexity in there.
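
As a rough illustration of the shrinker hook mentioned above (not TTM's real shrinker, which evicts asynchronously via dma_fence): just the two callbacks that tie device-memory eviction into core MM reclaim. Everything prefixed my_ is made up, and the registration step is left out because the shrinker registration API has changed across recent kernel versions.

#include <linux/atomic.h>
#include <linux/minmax.h>
#include <linux/shrinker.h>

static atomic_long_t my_pool_evictable;	/* device pages that could be copied out */

static unsigned long my_pool_count(struct shrinker *shrink,
				   struct shrink_control *sc)
{
	/* Tell reclaim how many objects we could free if asked. */
	return atomic_long_read(&my_pool_evictable);
}

static unsigned long my_pool_scan(struct shrinker *shrink,
				  struct shrink_control *sc)
{
	/*
	 * Copy up to sc->nr_to_scan device pages back to system memory
	 * (ideally with the device copy engine) and free them; returning
	 * the number actually freed lets reclaim make progress.
	 */
	unsigned long freed = min_t(unsigned long, sc->nr_to_scan,
				    atomic_long_read(&my_pool_evictable));

	atomic_long_sub(freed, &my_pool_evictable);
	return freed ? freed : SHRINK_STOP;
}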

>> I want to reiterate that GMEM's shared address space support is a bonus result,
>> not a main contribution... It was done because it was not difficult to implement
>> internal CPU-device coordination mechanism when core MM is extended by
>> GMEM to support devices.
> Besides memory eviction/oversubscription, there are a few other pain points when I use hmm:
>
> 1) hmm doesn't support file-backed memory, so it is hard to share memory b/t processes in a gpu environment. You mentioned you have a plan... How hard is it to support file-backed memory in your approach?

As hard as it is to support it through HMM. That's what I meant when I said 
this approach doesn't integrate well: as far as I know, the problem isn't 
inside HMM or any other solution but rather in the file system layer.

Regards,
Christian.
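
A minimal sketch of the "keep per-range attributes separate from the VMA" idea that comes up in the quoted exchange below, using a maple tree as the range store. struct range_attr and all other names here are made up for illustration; this is not code from the GMEM patches or from any driver, and a real version would also free the entries it replaces.

#include <linux/maple_tree.h>
#include <linux/slab.h>

struct range_attr {
	int preferred_nid;	/* e.g. preferred (device) NUMA node */
	unsigned int flags;	/* caching preference, prefetch policy, ... */
};

static DEFINE_MTREE(attr_tree);

static int attr_set(unsigned long start, unsigned long end,
		    int preferred_nid, unsigned int flags)
{
	struct range_attr *a = kzalloc(sizeof(*a), GFP_KERNEL);

	if (!a)
		return -ENOMEM;
	a->preferred_nid = preferred_nid;
	a->flags = flags;

	/*
	 * Storing a range overwrites whatever was there; the tree splits
	 * existing entries at the boundaries, so the driver never has to
	 * mirror the VMA split/merge logic itself.
	 */
	return mtree_store_range(&attr_tree, start, end - 1, a, GFP_KERNEL);
}

static struct range_attr *attr_lookup(unsigned long addr)
{
	return mtree_load(&attr_tree, addr);
}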

> 2) virtual address range based memory attributes/hints: with hmadvise, where do you save the memory attribute of a virtual address range? Do you need to extend vm_area_struct to save it? With hmm, we have to maintain such information in the driver. This ends up with pretty complicated logic to split/merge those address ranges. I know core mm has similar logic to split/merge vmas...
>
> Oak
>
>
>> -Weixi
>>
>> -----Original Message-----
>> From: Christian König<ckoenig.leichtzumerken@gmail.com>
>> Sent: Thursday, November 30, 2023 4:28 PM
>> To: Zeng, Oak<oak.zeng@intel.com>; Christian König
>> <christian.koenig@amd.com>; zhuweixi<weixi.zhu@huawei.com>; linux-
>> mm@kvack.org;linux-kernel@vger.kernel.org;akpm@linux-foundation.org;
>> Danilo Krummrich<dakr@redhat.com>; Dave Airlie<airlied@redhat.com>; Daniel
>> Vetter<daniel@ffwll.ch>
>> Cc:intel-gvt-dev@lists.freedesktop.org;rcampbell@nvidia.com;
>> mhairgrove@nvidia.com;jgg@nvidia.com;weixi.zhu@openeuler.sh;
>> jhubbard@nvidia.com;intel-gfx@lists.freedesktop.org;apopple@nvidia.com;
>> Xinhui.Pan@amd.com;amd-gfx@lists.freedesktop.org;
>> tvrtko.ursulin@linux.intel.com;ogabbay@kernel.org;jglisse@redhat.com; dri-
>> devel@lists.freedesktop.org;ziy@nvidia.com; Vivi, Rodrigo
>> <rodrigo.vivi@intel.com>;alexander.deucher@amd.com;leonro@nvidia.com;
>> Felix.Kuehling@amd.com; Wang, Zhi A<zhi.a.wang@intel.com>;
>> mgorman@suse.de
>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>> management) for external memory devices
>>
>> Hi Oak,
>>
>> yeah, #4 is indeed a really good point and I think Felix will agree to that as well.
>>
>> HMM is basically still missing a way to advise device attributes for the CPU
>> address space. Both migration strategy as well as device specific information (like
>> cache preferences) fall into this category.
>>
>> Since there is a device specific component in those attributes as well I think
>> device specific IOCTLs still make sense to update them, but HMM should offer
>> the functionality to manage and store that information.
>>
>> Split and merge of VMAs only becomes a problem if you attach that information
>> to VMAs; if you keep it completely separate then that doesn't become an
>> issue either. The downside of this approach is that you don't get automatically
>> extending attribute ranges for growing VMAs for example.
>>
>> Regards,
>> Christian.
>>
>> Am 29.11.23 um 23:23 schrieb Zeng, Oak:
>>> Hi Weixi,
>>>
>>> Even though Christian has listed reasons rejecting this proposal (yes they are
>> very reasonable to me), I would open my mind and further explore the possibility
>> here. Since the current GPU driver uses a hmm based implementation (AMD and
>> NV has done this; At Intel we are catching up), I want to explore how much we
>> can benefit from the proposed approach and how your approach can solve some
>> pain points of our development. So basically what I am questioning here is: what
>> is the advantage of your approach against hmm.
>>> To implement a UVM (unified virtual address space b/t cpu and gpu device),
>> with hmm, driver essentially need to implement below functions:
>>> 1. device page table update. Your approach requires the same because
>>> this is device specific codes
>>>
>>> 2. Some migration functions to migrate memory b/t system memory and GPU
>> local memory. My understanding is, even though you generalized this a bit, such
>> as modified cpu page fault path, provided "general" gm_dev_fault handler... but
>> device driver still need to provide migration functions because migration
>> functions have to be device specific (i.e., using device dma/copy engine for
>> performance purpose). Right?
>>> 3. GPU physical memory management, this part is now in drm/buddy, shared
>> by all drivers. I think with your approach, driver still need to provide callback
>> functions to allocate/free physical pages. Right? Or do you let linux core mm
>> buddy manage device memory directly?
>>> 4. madvise/hints/virtual address range management. This has been pain point
>> for us. Right now device driver has to maintain certain virtual address range data
>> structure to maintain hints and other virtual address range based memory
>> attributes. Driver need to sync with linux vma. Driver need to explicitly deal with
>> range split/merging... HMM doesn't provide support in this area. Your approach
>> seems cleaner/simpler to me...
>>>
>>> So in above, I have examined the some key factors of a gpu UVM memory
>> manager. I think for #1 and #2, hmm has provide pretty good abstraction/tools
>> for address space mirroring and migration helpers. For #3, since we have a
>> common drm/buddy layer, I don't think it is a big problem for driver writer now.
>>> I do see #4 is something you solved more beautifully, requires new system call
>> though.
>>> Oak
Philipp Stanner Dec. 1, 2023, 9:28 p.m. UTC | #17
On Fri, 2023-12-01 at 02:37 +0000, zhuweixi wrote:
> From your argument on KVM I can see that the biggest miscommunication
> between us is that you believed that GMEM wanted to share the whole
> address space. No, it is not the case. GMEM is only providing
> coordination via certain mmap() calls. So you are raising a case
> supporting GMEM again -- passthrough part of the CPU addresses space
> instead of passthrough the whole CPU address space, is exactly what
> GMEM can do. On the other side, the IOMMU SVA feature wildly binds
> the whole address space -- since the hardware feature is to directly
> share the whole CPU page table.
> 
> "We really should never ever encourage people to bind their device
> address space to the CPU address space. This is a very special use
> case and limits the driver design to only this use case.
> We have exercised this approach to a rather extreme degree with KFD
> and I can clearly say that doing this was a really big mistake.
> As far as I can see you are about to repeat that mistake and even
> encourage others to do so as well."
> 
> -- The behavior of internally "attach device context to mm_struct" in
> GMEM is ultimately a different approach to coordinate CPU and
> devices. I want to replace MMU notifiers with this approach because I
> want to protect core MM from random interactions with external driver
> MMs. Both GMEM and MMU notifiers are binding device contexts to the
> CPU context, not putting them in the same address space. If someone
> is against GMEM's approach for binding CPU and device context, then
> someone should be against MMU notifiers as well.
> 
> Currently, from our discussion I think I received two messages:
>         1. The original AMDKFD design was rejected because of
> inserting vendor-specific stuff to the generic core MM.
>         2. The rejection from #1 led to your opinion that anyone
> cannot mix device and core MM together.

That's precisely not what Christian wrote:

"KFD was meant to be a vendor agnostic framework, very similar to what 
you propose here.

It's just that it was seen as vendor specific because nobody else 
actually wanted to design their drivers this way."


It may be that the original discussion about AMDKFD could hint at #1,
but the one here certainly does not ;)


P.

> 
> I think #1 really encouraged me that GMEM could help the AMDKFD
> driver. However I am also confused that why GMEM must be compared
> with a vendor-specific driver. AMDKFD was only considering a very
> special use case: AMD GPUs using AMD IOMMU. 
> However, GMEM is trying to consider all generalized cases of memory
> devices. The device can be Nvidia's GPU and Huawei's NPU that use
> their own MMUs, or AMD/Intel GPUs that use IOMMUs, or other hundreds
> of new accelerator vendors.
> 
> -Weixi
> 
> -----Original Message-----
> From: Christian König <christian.koenig@amd.com> 
> Sent: Thursday, November 30, 2023 9:05 PM
> To: zhuweixi <weixi.zhu@huawei.com>; Dave Airlie <airlied@gmail.com>
> Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org;
> akpm@linux-foundation.org; weixi.zhu@openeuler.sh; mgorman@suse.de;
> jglisse@redhat.com; rcampbell@nvidia.com; jhubbard@nvidia.com;
> apopple@nvidia.com; mhairgrove@nvidia.com; ziy@nvidia.com;
> alexander.deucher@amd.com; Xinhui.Pan@amd.com;
> amd-gfx@lists.freedesktop.org; Felix.Kuehling@amd.com;
> ogabbay@kernel.org; dri-devel@lists.freedesktop.org; jgg@nvidia.com;
> leonro@nvidia.com; zhenyuw@linux.intel.com; zhi.a.wang@intel.com;
> intel-gvt-dev@lists.freedesktop.org; intel-gfx@lists.freedesktop.org;
> jani.nikula@linux.intel.com; joonas.lahtinen@linux.intel.com;
> rodrigo.vivi@intel.com; tvrtko.ursulin@linux.intel.com; Danilo
> Krummrich <dakr@redhat.com>; Daniel Vetter <daniel@ffwll.ch>; Zeng,
> Oak <oak.zeng@intel.com>
> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
> management) for external memory devices
> 
> Am 30.11.23 um 08:22 schrieb zhuweixi:
> > Add @Oak to the KFD discussion. I will reply separately elaborating
> > your questions on GMEM's difference from HMM/MMU notifiers.
> > 
> > Christian, thanks for pointing me to that AMDKFD discussion. I have
> > read the discussion around the AMDKFD skeleton patch and found the
> > previous discussion in the following URLs:
> > https://lore.kernel.org/dri-devel/1405028848-5660-1-git-send-email-oded.gabbay@amd.com/#r
> > https://lore.kernel.org/dri-devel/20140711154231.GB1870@gmail.com/
> > 
> > I believe AMDKFD's original patch was rejected mostly because of
> > inserting vendor-specific stuff to the generic core MM.  Jérôme has
> > clearly stated this issue in the second URL. If the code is vendor-
> > specific then it has no place in core MM, period.
> > 
> > But why does that vendor-specific solution relate to a generalized
> > solution like GMEM? The initial AMDKFD patch doesn't work for
> > Nvidia or Intel.
> 
> KFD was meant to be a vendor agnostic framework, very similar to what
> you propose here.
> 
> It's just that it was seen as vendor specific because nobody else
> actually wanted to design their drivers this way.
> 
> > 
> > In fact I think the rejection of the initial AMDKFD patch supports
> > GMEM's idea -- there could have been a simpler AMDKFD
> > implementation if the core MM was extended by GMEM. Also, after 9
> > years, there are so many other companies building their
> > accelerators over the past few years, especially now the GPT-family
> > has made a much bigger success. Don't we want to advance Linux's
> > core MM for more friendly and generalized support for the upcoming
> > new vendors?
> 
> Well exactly that's the big point: Absolutely not!
> 
> We really should never ever encourage people to bind their device
> address space to the CPU address space. This is a very special use
> case and limits the driver design to only this use case.
> 
> We have exercised this approach to a rather extreme degree with KFD
> and I can clearly say that doing this was a really big mistake.
> 
> As far as I can see you are about to repeat that mistake and even
> encourage others to do so as well.
> 
> > Now answering Christian's design concerns:
> > 
> > 1. "There are cases that do not want to share CPU address space"
> > Maybe, but I am not fully convinced. The current case we can find
> > is when a NIC utilizes IOMMU for security. For this case, GMEM
> > implemented a generalized VMA support and tested it with NICs using
> > both Intel-IOMMU/Arm-SMMU. This cut 600 LoC of IOVA management code
> > from the IOMMU driver, but it is still not included in this RFC
> > patch -- I cannot find other cases demanding this isolation. The
> > isolation is also unnecessary -- the NIC can enable the IOMMU SVM
> > feature to share the CPU address space. As for KVM, it is
> > essentially a host process that utilizes two different MMUs within
> > the same address space, so it fits GMEM's design...
> 
> Maybe I don't completely follow here how you want to save LoC for the
> IOMMU implementation of NICs, but at least for the PASID/PRI support
> AMD has just recently gone in exactly the opposite direction:
> 
> commit 5a0b11a180a9b82b4437a4be1cf73530053f139b
> Author: Vasant Hegde <vasant.hegde@amd.com>
> Date:   Fri Oct 6 09:57:02 2023 +0000
> 
>      iommu/amd: Remove iommu_v2 module
> 
>      AMD GPU driver which was the only in-kernel user of iommu_v2
> module
>      removed dependency on iommu_v2 module.
> 
>      Also we are working on adding SVA support in AMD IOMMU driver.
> Device
>      drivers are expected to use common SVA framework to enable
> device
>      PASID/PRI features.
> 
>      Removing iommu_v2 module and then adding SVA simplifies the
> development.
>      Hence remove iommu_v2 module.
> 
> As I wrote before this IOMMU V2 driver was basically binding the CPU
> address space to IOMMU devices using the PASID. For an example see
> function amd_iommu_bind_pasid().
> 
> This turned out to be not as useful as we hoped it would be.
> Essentially the use cases where you want to give a device access to
> the whole address space of a process are extremely limited. That's
> why we are removing it and switching over to a separate SVA
> implementation which doesn't depend on the CPU address space.
> 
> 
> But the virtualization use case I mentioned is completely independent
> of the IOMMU. In KVM/XEN/etc. there is a functionality called native
> context; basically what this means is that instead of passing through a
> complete device isolated by the IOMMU, only specific kernel
> functionalities are exposed to the guest operating system through
> QEMU.
> 
> See here for an example how OpenGL is implemented on top of this: 
> https://docs.mesa3d.org/drivers/virgl.html
> 
> This is actually using the separation between device memory
> management and CPU memory management and is basically a killer
> argument why those two topics should be separated. Otherwise it's
> impossible for QEMU to actually handle multiple independent device
> memory address spaces inside a single CPU memory address space.
> 
> > 2. "This does not integrate well with the filesystem layer in
> > Linux..."
> > To be honest, not using a logical page table for anonymous memory
> > is why Linux THP fails compared with FreeBSD's superpage, but I am
> > not going to elaborate on it here. But yes, I am looking at
> > merging struct vm_object->logical_page_table with struct
> > address_space->i_pages. This will provide natural support for
> > devices oversubscribing both host DRAM and disks. As explained in
> > my cover letter, struct vm_object borrows FreeBSD's VM design -- it
> > provides a unified abstraction layer for anonymous, file-backed
> > memory and etc.
> 
> I'm not that deep into this stuff, so leaving this to the experts on
> FreeBSD.
> 
> > 3. "Requirements to CPU address space management and device address
> > space management are just massively different. For example huge and
> > giant pages are a must have for modern devices..."
> > I think you are asking two questions. First, is VA space a problem?
> 
> No, this is about something completely different.
> 
> > GMEM assumes that device VA space should be covered by CPU VA space
> > (sorry i386), ...
> [SNIP]
> 
> I'm removing this because you were talking about something different
> than what I meant.
> 
> I will try to explain the background with an example outside of machine
> learning and compute since this framework should be applicable to
> every use case and not be limited to those. Otherwise Linux would
> sooner or later just be applicable to only those use cases.
> 
> So let's take a look at how modern games use a GPU for example. On
> startup a rather large part of the GPU address space is allocated,
> for example 64GiB. Then the necessary resources (images, texture,
> vertices, shaders etc..) are loaded into separate buffer objects.
> 
> Those resources are then mapped into the allocated address space on a
> page-by-page basis. So you basically don't have large VMAs which cover one
> resource, but rather the page tables are used as a remapping table
> into the available resources. This increases the number of virtual
> mappings drastically; it's kind of comparable to how an anon_vma works
> inside a VMA on Linux.
> 
> Those mappings also are not set up at start and then used throughout
> the whole lifetime of the process, but rather done very dynamically,
> sometimes resulting in thousands of mapping operations per second.
> 
> In addition to that, devices have page table features which CPUs don't
> have. These range from support for partially resident textures to
> flags for how caching and dynamic color space compression are handled.
> 
> So the mappings contain tons of device specific information and it's
> most likely not even possible to handle all of this with a device
> independent mmap() call.
> 
> > 4. "The argument that a shared memory management leads to less bugs
> > has also absolutely not be proven true. Instead we literally spend
> > month if not years hunting down bugs which resulted from
> > interaction between CPU and devices."
> > This is another case supporting GMEM. Don't developers want to let
> > GMEM handle the CPU-device interaction so that they can save
> > months of debugging cost?
> 
> No, we already have HMM for that.
> 
> Regards,
> Christian.
> 
> > 
> > PS, hmadvise() is based on the idea of Nvidia's cudaMemAdvise()
> > which provides abundant and useful memory policies. HMM extended
> > mbind() instead.
> > 
> > -Weixi
> > 
> > -----Original Message-----
> > From: Christian König <christian.koenig@amd.com>
> > Sent: Wednesday, November 29, 2023 11:22 PM
> > To: zhuweixi <weixi.zhu@huawei.com>; Dave Airlie
> > <airlied@gmail.com>
> > Cc: linux-mm@kvack.org; linux-kernel@vger.kernel.org; 
> > akpm@linux-foundation.org; weixi.zhu@openeuler.sh; mgorman@suse.de;
> > jglisse@redhat.com; rcampbell@nvidia.com; jhubbard@nvidia.com; 
> > apopple@nvidia.com; mhairgrove@nvidia.com; ziy@nvidia.com; 
> > alexander.deucher@amd.com; Xinhui.Pan@amd.com; 
> > amd-gfx@lists.freedesktop.org; Felix.Kuehling@amd.com; 
> > ogabbay@kernel.org; dri-devel@lists.freedesktop.org;
> > jgg@nvidia.com; 
> > leonro@nvidia.com; zhenyuw@linux.intel.com; zhi.a.wang@intel.com; 
> > intel-gvt-dev@lists.freedesktop.org;
> > intel-gfx@lists.freedesktop.org; 
> > jani.nikula@linux.intel.com; joonas.lahtinen@linux.intel.com; 
> > rodrigo.vivi@intel.com; tvrtko.ursulin@linux.intel.com
> > Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory 
> > management) for external memory devices
> > 
> > Am 29.11.23 um 09:27 schrieb zhuweixi:
> > > Glad to hear that more sharable code is desirable.
> > > IMHO, for a common MM subsystem, it is more beneficial for GMEM
> > > to 
> > > extend core MM instead of building a separate one.
> > > 
> > > As stated in the beginning of my RFC letter, MM systems are large
> > > and 
> > > similar. Even a sophisticated one like Linux MM that has evolved
> > > over 
> > > decades still suffers from an increasing number of bugs[1]. So, 
> > > directly extending core MM to support devices not only avoids
> > > opening 
> > > a new box of bugs, but also allows the community to concentrate
> > > on 
> > > maintaining one single MM system. On the other side, GMEM does no
> > > harm to core MM if a CPU process is not attached to device
> > > contexts.
> > > 
> > > @Christian, could you provide more information on what AMD
> > > proposed 
> > > with KFD and why it was rejected?
> > Well, this is going to be a longer explanation.
> > 
> > The combination of KFD and HMM is based essentially on the same
> > idea as this code here. Even the initial KFD implementation was
> > very similar in the sense that it added device contexts to
> > mm_struct and tried to manage GPU/acceleration MM the same way as
> > CPU MM. In other words it was basically identical to your
> > gm_dev_create() and gm_mmu approach.
> > 
> > As mentioned before this initial proposal was rejected, for more
> > background see the discussion around "amdkfd: Add amdkfd skeleton
> > driver" on the dri-devel mailing list between 2013 and 2014. You
> > need to dig up the whole discussion from the mailing list, but
> > summarizing it the general feeling was that it would be a mistake
> > to tie device drivers too close to CPU memory management (and stable
> > UAPI) without validating that this is really the right thing to do.
> > 
> > So instead of the original implementation KFD has gone upstream
> > with a much less invasive approach where device contexts are just
> > looked up on demand for each mm_struct. Felix can probably provide
> > some pointers to the implementation.
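
The structural difference can be sketched roughly as follows (illustrative declarations only, neither actual GMEM nor actual KFD code):

  /* Illustrative only; neither GMEM nor KFD code. */
  struct mm_struct;               /* CPU memory-management context */
  struct gm_as;                   /* device address-space state (GMEM naming) */

  /* (a) Embedding device state in core MM, as in GMEM and the early KFD: */
  struct mm_struct_with_devices {
          /* ... core MM fields ... */
          struct gm_as *gm_as;    /* device contexts hang off the mm_struct */
  };

  /* (b) Upstream KFD-style shape: device state stays in the driver and is
   * looked up on demand, keyed by the mm_struct it belongs to. */
  struct driver_process {
          struct mm_struct *mm;   /* lookup key */
          /* ... driver-private device MM state ... */
  };

  struct driver_process *driver_process_lookup(struct mm_struct *mm);
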
> > 
> > On the initially supported hardware the KFD used the PCIe ATC
> > feature to allow routing of memory accesses directly into the
> > associated CPU process address space; later on we switched to an
> > MMU notifier/HMM based approach to give similar functionality to
> > the userspace stack on top of it for devices which don't support
> > ATC. The ATC path was just recently completely removed and we are
> > now only using MMU notifiers/HMM.
> > 
> > HMM tried to add similar functionality to what you propose with the
> > mmap() flag and hmadvise() call. The hmadvise() extension actually
> > looks so similar to the HMM proposal that I would expect that it
> > is actually based on that code.
> > 
> > All this turned out to have some major design issues.
> > 
> > First of all you have a rather large group of use cases where you
> > don't want your device to mirror the address space of your process.
> > Just think of things like QEMU, KVM, XEN, in general virtualization
> > and container handling. Linux has the mantra that everything is a
> > file and if it's not a file it should be a file and when you tie
> > device memory management into CPU memory management you are pretty
> > much violating exactly that.
> > 
> > Second this doesn't integrate well with the filesystem layer in
> > Linux.
> > For example we do have struct pages for HMM exposed device memory,
> > but 
> > for I/O we still migrate this back to system memory because of (for
> > example) the page lock requirements around writeback.
> > 
> > Then third it turned out that the requirements for CPU address space
> > management and device address space management are just massively
> > different. For example huge and giant pages are a must-have for
> > modern devices, while on the CPU side we are barely switching over to
> > folios now to add similar functionality.
> > 
> > The argument that a shared memory management leads to fewer bugs has
> > also absolutely not been proven true. Instead we have literally
> > spent months if not years hunting down bugs which resulted from
> > interaction between CPU and devices.
> > ...
> > 
> > There are a couple of more things on this contra side to that
> > approach, but I think that would just make this mail unnecessary
> > long.
> > 
> > To sum it up from over a decade of experience working in this area
> > I can just say that CPU and device memory management should
> > absolutely *NOT* be mixed. We had those ideas multiple times
> > before, but they either failed because they didn't integrate well
> > with the core OS or because the hardware support was lagging behind the
> > actual requirements.
> > 
> > What can be done and where I completely agree with Dave is that
> > having common components which provide device drivers with the
> > necessary functionality to manage their device address space is a
> > really good idea.
> > Danilo is for example working on a GPUVM component to have common
> > virtual address space management and I'm at least sometimes working
> > on MMU notifier/HMM improvements.
> > 
> > Providing SVM functionality to your userspace stack is still a
> > really good idea, but it should be done with MMU notifiers and
> > components which are separate from your CPU memory management instead
> > of tying it directly to the CPU address space.
> > 
> > Regards,
> > Christian.
> > 
> > > [1] Huang, Jian, Moinuddin K. Qureshi, and Karsten Schwan. "An
> > > evolutionary study of linux memory management for fun and
> > > profit." 2016 USENIX Annual Technical Conference (USENIX ATC 16).
> > > 2016.
> > > 
> > > Thanks,
> > > Weixi
> > > 
> > > -----Original Message-----
> > > From: Dave Airlie <airlied@gmail.com>
> > > Sent: Wednesday, November 29, 2023 1:15 PM
> > > To: Christian König <christian.koenig@amd.com>
> > > Cc: zhuweixi <weixi.zhu@huawei.com>; linux-mm@kvack.org; 
> > > linux-kernel@vger.kernel.org; akpm@linux-foundation.org; 
> > > weixi.zhu@openeuler.sh; mgorman@suse.de; jglisse@redhat.com; 
> > > rcampbell@nvidia.com; jhubbard@nvidia.com; apopple@nvidia.com; 
> > > mhairgrove@nvidia.com; ziy@nvidia.com; alexander.deucher@amd.com;
> > > Xinhui.Pan@amd.com; amd-gfx@lists.freedesktop.org; 
> > > Felix.Kuehling@amd.com; ogabbay@kernel.org; 
> > > dri-devel@lists.freedesktop.org; jgg@nvidia.com;
> > > leonro@nvidia.com; 
> > > zhenyuw@linux.intel.com; zhi.a.wang@intel.com; 
> > > intel-gvt-dev@lists.freedesktop.org;
> > > intel-gfx@lists.freedesktop.org; 
> > > jani.nikula@linux.intel.com; joonas.lahtinen@linux.intel.com; 
> > > rodrigo.vivi@intel.com; tvrtko.ursulin@linux.intel.com
> > > Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
> > > management) for external memory devices
> > > 
> > > On Tue, 28 Nov 2023 at 23:07, Christian König
> > > <christian.koenig@amd.com> wrote:
> > > > On 28.11.23 at 13:50, Weixi Zhu wrote:
> > > > > The problem:
> > > > > 
> > > > > Accelerator driver developers are forced to reinvent external
> > > > > MM 
> > > > > subsystems case by case, because Linux core MM only considers
> > > > > host memory resources.
> > > > > These reinvented MM subsystems have similar orders of
> > > > > magnitude of 
> > > > > LoC as Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has
> > > > > 14K and 
> > > > > Huawei NPU has 30K. Meanwhile, more and more vendors are 
> > > > > implementing their own accelerators, e.g. Microsoft's Maia
> > > > > 100. At 
> > > > > the same time, application-level developers suffer from poor 
> > > > > programmability -- they must consider parallel address spaces
> > > > > and 
> > > > > be careful about the limited device DRAM capacity. This can
> > > > > be 
> > > > > alleviated if a malloc()-ed virtual address can be shared by
> > > > > the 
> > > > > accelerator, or the abundant host DRAM can further
> > > > > transparently backup the device local memory.
> > > > > 
> > > > > These external MM systems share similar mechanisms except for
> > > > > the 
> > > > > hardware-dependent part, so reinventing them is effectively 
> > > > > introducing redundant code (14K~70K for each case). Such 
> > > > > developing/maintaining is not cheap. Furthermore, to share a 
> > > > > malloc()-ed virtual address, device drivers need to deeply
> > > > > interact 
> > > > > with Linux MM via low-level MM APIs, e.g. MMU notifiers/HMM.
> > > > > This 
> > > > > raises the bar for driver development, since developers must 
> > > > > understand how Linux MM works. Further, it creates code
> > > > > maintenance 
> > > > > problems -- any changes to Linux MM potentially require
> > > > > coordinated changes to accelerator drivers using low-level MM
> > > > > APIs.
> > > > > 
> > > > > Putting a cache-coherent bus between host and device will not
> > > > > make 
> > > > > these external MM subsystems disappear. For example, a 
> > > > > throughput-oriented accelerator will not tolerate executing
> > > > > heavy 
> > > > > memory access workload with a host MMU/IOMMU via a remote
> > > > > bus.
> > > > > Therefore, devices will still have their own MMU and pick a
> > > > > simpler 
> > > > > page table format for lower address translation overhead,
> > > > > requiring external MM subsystems.
> > > > > 
> > > > > --------------------
> > > > > 
> > > > > What GMEM (Generalized Memory Management [1]) does:
> > > > > 
> > > > > GMEM extends Linux MM to share its machine-independent MM
> > > > > code. 
> > > > > Only high-level interface is provided for device drivers.
> > > > > This 
> > > > > prevents accelerator drivers from reinventing the wheel, but
> > > > > relies 
> > > > > on drivers to implement their hardware-dependent functions
> > > > > declared 
> > > > > by GMEM. GMEM's key interface include gm_dev_create(), 
> > > > > gm_as_create(),
> > > > > gm_as_attach() and gm_dev_register_physmem(). Here briefly
> > > > > describe 
> > > > > how a device driver utilizes them:
> > > > > 1. At boot time, call gm_dev_create() and registers the
> > > > > implementation of
> > > > >       hardware-dependent functions as declared in struct
> > > > > gm_mmu.
> > > > >         - If the device has local DRAM, call
> > > > > gm_dev_register_physmem() to
> > > > >           register available physical addresses.
> > > > > 2. When a device context is initialized (e.g. triggered by
> > > > > ioctl), check if
> > > > >       the current CPU process has been attached to a gmem
> > > > > address space
> > > > >       (struct gm_as). If not, call gm_as_create() and point
> > > > > current->mm->gm_as
> > > > >       to it.
> > > > > 3. Call gm_as_attach() to attach the device context to a gmem
> > > > > address space.
> > > > > 4. Invoke gm_dev_fault() to resolve a page fault or prepare
> > > > > data before
> > > > >       device computation happens.
> > > > > 
> > > > > GMEM has changed the following assumptions in Linux MM:
> > > > >      1. An mm_struct not only handle a single CPU context,
> > > > > but may also handle
> > > > >         external memory contexts encapsulated as gm_context
> > > > > listed in
> > > > >         mm->gm_as. An external memory context can include a
> > > > > few or all of the
> > > > >         following parts: an external MMU (that requires TLB
> > > > > invalidation), an
> > > > >         external page table (that requires PTE manipulation)
> > > > > and external DRAM
> > > > >         (that requires physical memory management).
> > > > Well that is pretty much exactly what AMD has already proposed
> > > > with 
> > > > KFD and was rejected for rather good reasons.
> > > > > MMU functions
> > > > > The MMU functions peer_map() and peer_unmap() overlap other 
> > > > > functions, leaving a question if the MMU functions should be 
> > > > > decoupled as more basic operations. Decoupling them could 
> > > > > potentially prevent device drivers coalescing these basic
> > > > > steps 
> > > > > within a single host-device communication operation, while
> > > > > coupling 
> > > > > them makes it more difficult for device drivers to utilize
> > > > > GMEM interface.
> > > > Well to be honest all of this sounds like history to me. We
> > > > have 
> > > > already seen the same basic approach in KFD, HMM and to some
> > > > extent in TTM as well.
> > > > 
> > > > And all of them more or less failed. Why should this here be
> > > > different?
> > > Any info we have on why this has failed to work in the past would be
> > > useful to provide. This is one of those cases where we may not have
> > > documented the bad ideas, so future developers don't realize they are
> > > bad.
> > > 
> > > I do think we would want more common code in this area, but I would
> > > think we'd have it more on the driver infrastructure side, than in
> > > the core mm.
> > > 
> > > Dave.
>
Alistair Popple Dec. 3, 2023, 11:32 p.m. UTC | #18
Christian König <christian.koenig@amd.com> writes:

> On 01.12.23 at 06:48, Zeng, Oak wrote:
>> [SNIP]

>> Besides memory eviction/oversubscription, there are a few other pain points when I use hmm:
>>
>> 1) hmm doesn't support file-backed memory, so it is hard to share
> memory between processes in a gpu environment. You mentioned you have a
> plan... How hard is it to support file-backed in your approach?
>
> As hard as it is to support it through HMM. That's what I meant when I
> said this approach doesn't integrate well; as far as I know the problem
> isn't inside HMM or any other solution but rather in the file system
> layer.

In what way does HMM not support file-backed memory? I was under the
impression that at least hmm_range_fault() does.

 - Alistair

> Regards,
> Christian.
>
>> 2) virtual address range based memory attributes/hints: with hmadvise,
> where do you save the memory attribute of a virtual address range? Do
> you need to extend vm_area_struct to save it? With hmm, we have to
> maintain such information in the driver. This ends up with pretty
> complicated logic to split/merge those address ranges. I know core mm
> has similar logic to split/merge vmas...
>>
>> Oak
>>
>>
>>> -Weixi
>>>
>>> -----Original Message-----
>>> From: Christian König<ckoenig.leichtzumerken@gmail.com>
>>> Sent: Thursday, November 30, 2023 4:28 PM
>>> To: Zeng, Oak<oak.zeng@intel.com>; Christian König
>>> <christian.koenig@amd.com>; zhuweixi<weixi.zhu@huawei.com>; linux-
>>> mm@kvack.org;linux-kernel@vger.kernel.org;akpm@linux-foundation.org;
>>> Danilo Krummrich<dakr@redhat.com>; Dave Airlie<airlied@redhat.com>; Daniel
>>> Vetter<daniel@ffwll.ch>
>>> Cc:intel-gvt-dev@lists.freedesktop.org;rcampbell@nvidia.com;
>>> mhairgrove@nvidia.com;jgg@nvidia.com;weixi.zhu@openeuler.sh;
>>> jhubbard@nvidia.com;intel-gfx@lists.freedesktop.org;apopple@nvidia.com;
>>> Xinhui.Pan@amd.com;amd-gfx@lists.freedesktop.org;
>>> tvrtko.ursulin@linux.intel.com;ogabbay@kernel.org;jglisse@redhat.com; dri-
>>> devel@lists.freedesktop.org;ziy@nvidia.com; Vivi, Rodrigo
>>> <rodrigo.vivi@intel.com>;alexander.deucher@amd.com;leonro@nvidia.com;
>>> Felix.Kuehling@amd.com; Wang, Zhi A<zhi.a.wang@intel.com>;
>>> mgorman@suse.de
>>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>>> management) for external memory devices
>>>
>>> Hi Oak,
>>>
>>> yeah, #4 is indeed a really good point and I think Felix will agree to that as well.
>>>
>>> HMM is basically still missing a way to advise device attributes for the CPU
>>> address space. Both migration strategy as well as device specific information (like
>>> cache preferences) fall into this category.
>>>
>>> Since there is a device specific component in those attributes as well I think
>>> device specific IOCTLs still make sense to update them, but HMM should offer
>>> the functionality to manage and store that information.
>>>
>>> Split and merge of VMAs only becomes a problem if you attach that information
>>> to VMAs; if you keep them completely separate then it doesn't become an
>>> issue either. The downside of this approach is that you don't get automatically
>>> extending attribute ranges for growing VMAs, for example.
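
As a toy illustration of keeping per-range hints in a store that is completely separate from the VMAs (the names and the flat-array layout below are made up for this sketch, not an existing kernel or driver API):

  #include <stddef.h>
  #include <stdint.h>

  struct hint_range {
          uint64_t start;         /* inclusive */
          uint64_t end;           /* exclusive */
          int      hint;          /* e.g. preferred device/NUMA node */
  };

  #define MAX_RANGES 64
  static struct hint_range ranges[MAX_RANGES];
  static size_t nr_ranges;

  /* Record a hint for [start, end); later entries simply shadow earlier
   * ones, so VMA split/merge never has to touch this table. */
  static int hint_store(uint64_t start, uint64_t end, int hint)
  {
          if (nr_ranges == MAX_RANGES)
                  return -1;
          ranges[nr_ranges++] = (struct hint_range){ start, end, hint };
          return 0;
  }

  /* Return the most recently stored hint covering addr, or -1 if none. */
  static int hint_lookup(uint64_t addr)
  {
          size_t i;

          for (i = nr_ranges; i-- > 0; )
                  if (addr >= ranges[i].start && addr < ranges[i].end)
                          return ranges[i].hint;
          return -1;
  }

A real implementation would use an interval tree or similar, but the point is only that attribute ranges can grow, shrink and overlap without ever forcing a VMA split or merge.
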
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 29.11.23 at 23:23, Zeng, Oak wrote:
>>>> Hi Weixi,
>>>>
>>>> Even though Christian has listed reasons for rejecting this proposal (yes, they are
>>> very reasonable to me), I would keep an open mind and further explore the possibility
>>> here. Since current GPU drivers use an hmm-based implementation (AMD and
>>> NV have done this; at Intel we are catching up), I want to explore how much we
>>> can benefit from the proposed approach and how your approach can solve some
>>> pain points of our development. So basically what I am questioning here is: what
>>> is the advantage of your approach over hmm?
>>>> To implement a UVM (unified virtual address space between cpu and gpu device),
>>> with hmm, the driver essentially needs to implement the functions below:
>>>> 1. device page table update. Your approach requires the same because
>>>> this is device specific code.
>>>>
>>>> 2. Some migration functions to migrate memory between system memory and GPU
>>> local memory. My understanding is, even though you generalized this a bit (such
>>> as the modified cpu page fault path and the provided "general" gm_dev_fault
>>> handler), the device driver still needs to provide migration functions, because
>>> migration functions have to be device specific (i.e., using the device dma/copy
>>> engine for performance purposes). Right?
>>>> 3. GPU physical memory management, this part is now in drm/buddy, shared
>>> by all drivers. I think with your approach, the driver still needs to provide callback
>>> functions to allocate/free physical pages. Right? Or do you let linux core mm
>>> buddy manage device memory directly?
>>>> 4. madvise/hints/virtual address range management. This has been a pain point
>>> for us. Right now the device driver has to maintain certain virtual address range
>>> data structures to maintain hints and other virtual address range based memory
>>> attributes. The driver needs to sync with linux vmas and explicitly deal with
>>> range split/merging... HMM doesn't provide support in this area. Your approach
>>> seems cleaner/simpler to me...
>>>>
>>>> So in the above, I have examined some key factors of a gpu UVM memory
>>> manager. I think for #1 and #2, hmm provides pretty good abstractions/tools
>>> for address space mirroring and migration helpers. For #3, since we have a
>>> common drm/buddy layer, I don't think it is a big problem for driver writers now.
>>>> I do see #4 as something you solved more beautifully, though it requires a new
>>> system call.
>>>> Oak
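
For reference, points #1 and #2 above usually boil down, on the HMM side, to a loop like the condensed sketch below, adapted from the in-kernel HMM documentation (Documentation/mm/hmm.rst); the driver-side locking and the actual device page-table update are stubbed out as comments:

  #include <linux/errno.h>
  #include <linux/hmm.h>
  #include <linux/mm.h>
  #include <linux/mmu_notifier.h>

  /* pfns[] must have one slot per 4k page in [start, end) */
  static int driver_populate_range(struct mmu_interval_notifier *notifier,
                                   struct mm_struct *mm,
                                   unsigned long start, unsigned long end,
                                   unsigned long *pfns)
  {
          struct hmm_range range = {
                  .notifier       = notifier,
                  .start          = start,
                  .end            = end,
                  .hmm_pfns       = pfns,
                  .default_flags  = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
          };
          int ret;

  again:
          range.notifier_seq = mmu_interval_read_begin(notifier);
          mmap_read_lock(mm);
          ret = hmm_range_fault(&range);  /* faults in CPU pages, fills pfns[] */
          mmap_read_unlock(mm);
          if (ret) {
                  if (ret == -EBUSY)
                          goto again;
                  return ret;
          }

          /* driver_lock(); serializes against the invalidate() callback */
          if (mmu_interval_read_retry(notifier, range.notifier_seq)) {
                  /* driver_unlock(); */
                  goto again;
          }
          /* #1: translate pfns[] into device PTEs and update the device page table */
          /* #2: optionally migrate pages to device memory before mapping them */
          /* driver_unlock(); */
          return 0;
  }
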
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: dri-devel<dri-devel-bounces@lists.freedesktop.org>  On Behalf
>>>>> Of Christian König
>>>>> Sent: Tuesday, November 28, 2023 8:09 AM
>>>>> To: Weixi Zhu<weixi.zhu@huawei.com>;linux-mm@kvack.org; linux-
>>>>> kernel@vger.kernel.org;akpm@linux-foundation.org; Danilo Krummrich
>>>>> <dakr@redhat.com>; Dave Airlie<airlied@redhat.com>; Daniel Vetter
>>>>> <daniel@ffwll.ch>
>>>>> Cc:dri-devel@lists.freedesktop.org;leonro@nvidia.com;
>>>>> apopple@nvidia.com;amd-gfx@lists.freedesktop.org;mgorman@suse.de;
>>>>> ziy@nvidia.com; Wang, Zhi A<zhi.a.wang@intel.com>;
>>>>> rcampbell@nvidia.com;jgg@nvidia.com;weixi.zhu@openeuler.sh;
>>>>> jhubbard@nvidia.com;intel-gfx@lists.freedesktop.org;
>>>>> mhairgrove@nvidia.com;jglisse@redhat.com; Vivi, Rodrigo
>>>>> <rodrigo.vivi@intel.com>;intel-gvt-dev@lists.freedesktop.org;
>>>>> tvrtko.ursulin@linux.intel.com;Felix.Kuehling@amd.com;
>>>>> Xinhui.Pan@amd.com;alexander.deucher@amd.com;ogabbay@kernel.org
>>>>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>>>>> management) for external memory devices
>>>>>
>>>>> Adding a few missing important people to the explicit to list.
>>>>>
>>>>> On 28.11.23 at 13:50, Weixi Zhu wrote:
>>>>>> The problem:
>>>>>>
>>>>>> Accelerator driver developers are forced to reinvent external MM
>>>>>> subsystems case by case, because Linux core MM only considers host
>>> memory resources.
>>>>>> These reinvented MM subsystems have similar orders of magnitude of
>>>>>> LoC as Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has 14K and
>>>>>> Huawei NPU
>>>>> has
>>>>>> 30K. Meanwhile, more and more vendors are implementing their own
>>>>>> accelerators, e.g. Microsoft's Maia 100. At the same time,
>>>>>> application-level developers suffer from poor programmability --
>>>>>> they must consider parallel address spaces and be careful about the
>>>>>> limited device DRAM capacity. This can be alleviated if a
>>>>>> malloc()-ed virtual address can be shared by the accelerator, or the
>>>>>> abundant host DRAM can further transparently backup the device local
>>> memory.
>>>>>> These external MM systems share similar mechanisms except for the
>>>>>> hardware-dependent part, so reinventing them is effectively
>>>>>> introducing redundant code (14K~70K for each case). Such
>>>>>> developing/maintaining is not cheap. Furthermore, to share a
>>>>>> malloc()-ed virtual address, device drivers need to deeply interact
>>>>>> with Linux MM via low-level MM APIs, e.g. MMU notifiers/HMM. This
>>>>>> raises the bar for driver development, since developers must
>>>>>> understand how Linux MM works. Further, it creates code maintenance
>>>>>> problems -- any changes to Linux MM potentially require coordinated
>>> changes to accelerator drivers using low-level MM APIs.
>>>>>> Putting a cache-coherent bus between host and device will not make
>>>>>> these external MM subsystems disappear. For example, a
>>>>>> throughput-oriented accelerator will not tolerate executing heavy
>>>>>> memory access workload with a host MMU/IOMMU via a remote bus.
>>>>>> Therefore, devices will still have their own MMU and pick a simpler
>>>>>> page table format for lower address translation overhead, requiring external
>>> MM subsystems.
>>>>>> --------------------
>>>>>>
>>>>>> What GMEM (Generalized Memory Management [1]) does:
>>>>>>
>>>>>> GMEM extends Linux MM to share its machine-independent MM code. Only
>>>>>> high-level interface is provided for device drivers. This prevents
>>>>>> accelerator drivers from reinventing the wheel, but relies on
>>>>>> drivers to implement their hardware-dependent functions declared by
>>>>>> GMEM. GMEM's
>>>>> key
>>>>>> interface include gm_dev_create(), gm_as_create(), gm_as_attach()
>>>>>> and gm_dev_register_physmem(). Here briefly describe how a device
>>>>>> driver utilizes them:
>>>>>> 1. At boot time, call gm_dev_create() and registers the implementation of
>>>>>>       hardware-dependent functions as declared in struct gm_mmu.
>>>>>>         - If the device has local DRAM, call gm_dev_register_physmem() to
>>>>>>           register available physical addresses.
>>>>>> 2. When a device context is initialized (e.g. triggered by ioctl), check if
>>>>>>       the current CPU process has been attached to a gmem address space
>>>>>>       (struct gm_as). If not, call gm_as_create() and point current->mm->gm_as
>>>>>>       to it.
>>>>>> 3. Call gm_as_attach() to attach the device context to a gmem address space.
>>>>>> 4. Invoke gm_dev_fault() to resolve a page fault or prepare data before
>>>>>>       device computation happens.
>>>>>>
>>>>>> GMEM has changed the following assumptions in Linux MM:
>>>>>>      1. An mm_struct not only handle a single CPU context, but may also handle
>>>>>>         external memory contexts encapsulated as gm_context listed in
>>>>>>         mm->gm_as. An external memory context can include a few or all of the
>>>>>>         following parts: an external MMU (that requires TLB invalidation), an
>>>>>>         external page table (that requires PTE manipulation) and external DRAM
>>>>>>         (that requires physical memory management).
>>>>>>      2. Faulting a MAP_PRIVATE VMA with no CPU PTE found does not
>>> necessarily
>>>>>>         mean that a zero-filled physical page should be mapped. The virtual
>>>>>>         page may have been mapped to an external memory device.
>>>>>>      3. Unmapping a page may include sending device TLB invalidation (even if
>>>>>>         its MMU shares CPU page table) and manipulating device PTEs.
>>>>>>
>>>>>> --------------------
>>>>>>
>>>>>> Semantics of new syscalls:
>>>>>>
>>>>>> 1. mmap(..., MAP_PRIVATE | MAP_PEER_SHARED)
>>>>>>        Allocate virtual address that is shared between the CPU and all
>>>>>>        attached devices. Data is guaranteed to be coherent whenever the
>>>>>>        address is accessed by either CPU or any attached device. If the device
>>>>>>        does not support page faults, then the device driver is responsible for
>>>>>>        faulting memory before data gets accessed. By default, the CPU DRAM
>>>>>>        can be used as a swap backup for the device local memory.
>>>>>> 2. hmadvise(NUMA_id, va_start, size, memory_hint)
>>>>>>        Issuing memory hint for a given VMA. This extends traditional madvise()
>>>>>>        syscall with an extra argument so that programmers have better control
>>>>>>        with heterogeneous devices registered as NUMA nodes. One
>>>>>> useful
>>>>> memory
>>>>>>        hint could be MADV_PREFETCH, which guarantees that the physical data
>>> of
>>>>>>        the given VMA [VA, VA+size) is migrated to NUMA node #id. Another
>>>>>>        useful memory hint is MADV_DONTNEED. This is helpful to increase
>>> device
>>>>>>        memory utilization. It is worth considering extending the existing
>>>>>>        madvise() syscall with one additional argument.
>>>>>>
>>>>>> --------------------
>>>>>>
>>>>>> Implementation details
>>>>>>
>>>>>> 1. New VMA flag: MAP_PEER_SHARED
>>>>>>
>>>>>> This new flag helps isolate the GMEM feature, so that common processes
>>>>>> with no device attached do not need to maintain any logical page
>>>>>> table. It can be deleted if the extra overhead from GMEM is acceptable.
>>>>>>
>>>>>> 2. MMU functions
>>>>>> The device driver must implement the MMU functions declared in
>>>>>> struct gm_mmu.
>>>>>>
>>>>>> VA functions: peer_va_alloc_fixed(), peer_va_free()
>>>>>>
>>>>>> They are used to negotiate a common available VMA between a host
>>>>>> process and a device process at the mmap() time. This is because
>>>>>> some accelerators like Intel Xeon Phi or Huawei's Ascend NPU have
>>>>>> their acceleration tasks executed within a device CPU process
>>>>>> context. Some accelerators may also choose a different format of
>>>>>> virtual address space.
>>>>>>
>>>>>> PA functions: alloc_page(), free_page(), prepare_page()
>>>>>>
>>>>>> Alloc_page() and free_page() are used to allocate and free device
>>>>>> physical pages. Prepare_page() is used to zero-fill or DMA the data
>>>>>> of a physical page. These functions were removed from the submitted
>>>>>> patch, since GMEM does not need to invoke them when testing Huawei's
>>>>>> NPU accelerator. The
>>>>> NPU
>>>>>> accelerator has an OS running in the device that manages the device
>>>>>> physical memory. However, even for such a device it is better for
>>>>>> the host to directly manage device physical memory, which saves
>>>>>> device HBM and avoids synchronizing management status between the host
>>> and device.
>>>>>> Page-table functions:
>>>>>> pmap_create()/destroy()/enter()/release()/protect()
>>>>>>
>>>>>> They are used to create and destroy device page tables, install and
>>>>>> uninstall page table entries and to change the protection of page
>>>>>> table entries.
>>>>>>
>>>>>> TLB-invalidation functions: tlb_invl(), tlb_invl_coalesced()
>>>>>>
>>>>>> They are used to invalidate the TLB entries of a given range of VA
>>>>>> or invalidate a given list of VMAs.
>>>>>>
>>>>>> Wrapper functions: peer_map() and peer_unmap()
>>>>>>
>>>>>> These two functions are used to create or destroy a device mapping
>>>>>> which could include allocating physical memory and copying data.
>>>>>> They effectively wrap the PA functions, Page-table functions and
>>>>>> TLB-invalidation functions. Implementing these steps together allows
>>>>>> devices to optimize the communication cost between host and device.
>>>>>> However, it requires the device driver to correctly order these steps.
>>>>>>
>>>>>> 3. Tracking logical mappings:
>>>>>>
>>>>>> Each process starts maintaining an xarray in
>>>>>> mm->vm_obj->logical_page_table at the first time a host process
>>>>>> calls mmap(MAP_PRIVATE |
>>>>> MAP_PEER_SHARED).
>>>>>> When a virtual page gets touched, its mapping status is created and
>>>>>> stored in struct gm_mapping. The logical page table is utilized to
>>>>>> query the struct gm_mapping given a virtual address. GMEM extends
>>>>>> Linux MM to
>>>>> update
>>>>>> and lookup these logical mappings. For example, in the patch set we
>>>>>> modify the page fault path to additionally check the logical
>>>>>> mapping of MAP_PEER_SHARED VMAs and identify if a device page should
>>> be migrated.
>>>>>> Similarly, if the device driver wants to resolve a device page fault
>>>>>> or prefetch data, the driver should call gm_dev_fault(). This
>>>>>> function examines the mapping status and determines whether the
>>>>>> device driver should migrate a CPU page to device or install a zero-filled
>>> device page.
>>>>>> The logical mapping abstraction enhances the extensibility of Linux
>>>>>> core MM (a virtual page may be mapped to a device physical page
>>>>>> without any CPU PTE installed). The current implementation is not
>>>>>> complete, since it only focused on anonymous VMAs with
>>>>>> MAP_PEER_SHARED flag. The future plan of logical page table is to
>>>>>> provide a generic abstraction layer that support common anonymous
>>>>>> memory (I am looking at you, transparent huge pages)
>>>>> and
>>>>>> file-backed memory.
>>>>>>
>>>>>> --------------------
>>>>>>
>>>>>> Use cases
>>>>>>
>>>>>> GMEM has been tested over Huawei's NPU (neural process unit) device
>>> driver.
>>>>>> The original NPU device driver has approximately 30,000 lines of
>>>>>> code for memory management. On the contrary, the GMEM-based one has
>>>>>> less than 30 lines of code calling GMEM API, with approximately
>>>>>> 3,700 lines of code implementing the MMU functions. This effectively
>>>>>> saves over 26,200 lines of MM code for one driver. Therefore,
>>>>>> developers from accelerator vendors, including Nvidia, AMD, Intel
>>>>>> and other companies are welcome to discuss if GMEM could be helpful.
>>>>>>
>>>>>> Using GMEM-based driver, it is possible to write a C-style
>>>>>> accelerator code with malloc(), whose underlying mmap() syscall
>>>>>> should include MAP_PEER_SHARED according to current GMEM
>>>>>> implementation. Importantly,
>>>>> GMEM
>>>>>> guarantees a coherent view of memory between the host and all
>>>>>> attached devices. This means that any data written by the CPU or any
>>>>>> attached accelerator can be seen by the next memory load instruction
>>>>>> issued by any attached accelerator or the CPU. Furthermore, the NPU
>>>>>> device was able to oversubscribe memory by swapping memory to host
>>>>>> DDR. Note that this
>>>>> memory
>>>>>> oversubscription mechanism can be universal if the physical memory
>>>>>> management is provided by GMEM. Other potential use cases of GMEM
>>>>>> could include the IOMMU driver, KVM and RDMA drivers, as long as the
>>>>>> device needs to manage external memory resources like VMAs, MMUs or
>>> local DRAMs.
>>>>>> --------------------
>>>>>>
>>>>>> Discussion
>>>>>>
>>>>>> Physical memory management
>>>>>> Most accelerators require the host OS to manage device DRAM. Even
>>>>>> accelerators capable of running an OS inside the driver can benefit
>>>>>> from it, since it helps avoid synchronizing management status
>>>>>> between the host and device. In Linux OSS EU summit 2023, Hannes
>>>>>> Reinecke from SUSE Labs suggested that people are concerned with the
>>>>>> memory consumption of struct page (which considers all generic
>>>>>> scenarios for the kernel). This leads to a possible solution that,
>>>>>> instead of reusing Linux struct page and ZONE_DEVICE mechanism, GMEM
>>>>>> can implement an isolated buddy allocator
>>>>> for
>>>>>> the device to instantiate and register. The isolation is useful
>>>>>> because device DRAM physical address space is independent.
>>>>>> Furthermore, the isolated buddy allocator can utilize a customized
>>>>>> struct page that consumes less memory. It is worth discussing if
>>>>>> accelerator vendors desire this solution.
>>>>>>
>>>>>> MMU functions
>>>>>> The MMU functions peer_map() and peer_unmap() overlap other
>>>>>> functions, leaving a question if the MMU functions should be
>>>>>> decoupled as more basic operations. Decoupling them could
>>>>>> potentially prevent device drivers coalescing these basic steps
>>>>>> within a single host-device communication operation, while coupling
>>>>>> them makes it more difficult for device drivers to utilize GMEM interface.
>>>>>>
>>>>>> The idea of GMEM was originated from Weixi's PhD study with Prof.
>>>>>> Scott Rixner and Prof. Alan L. Cox at Rice University.
>>>>>>
>>>>>> [1]https://arxiv.org/abs/2310.12554.
>>>>>>
>>>>>> Weixi Zhu (6):
>>>>>>      mm/gmem: add heterogeneous NUMA node
>>>>>>      mm/gmem: add arch-independent abstraction to track address mapping
>>>>>>        status
>>>>>>      mm/gmem: add GMEM (Generalized Memory Management) interface for
>>>>>>        external accelerators
>>>>>>      mm/gmem: add new syscall hmadvise() to issue memory hints for
>>>>>>        heterogeneous NUMA nodes
>>>>>>      mm/gmem: resolve VMA conflicts for attached peer devices
>>>>>>      mm/gmem: extending Linux core MM to support unified virtual address
>>>>>>        space
>>>>>>
>>>>>>     arch/arm64/include/asm/unistd.h         |   2 +-
>>>>>>     arch/arm64/include/asm/unistd32.h       |   2 +
>>>>>>     drivers/base/node.c                     |   6 +
>>>>>>     fs/proc/task_mmu.c                      |   3 +
>>>>>>     include/linux/gmem.h                    | 368 ++++++++++++
>>>>>>     include/linux/mm.h                      |   8 +
>>>>>>     include/linux/mm_types.h                |   5 +
>>>>>>     include/linux/nodemask.h                |  10 +
>>>>>>     include/uapi/asm-generic/mman-common.h  |   4 +
>>>>>>     include/uapi/asm-generic/unistd.h       |   5 +-
>>>>>>     init/main.c                             |   2 +
>>>>>>     kernel/fork.c                           |   5 +
>>>>>>     kernel/sys_ni.c                         |   2 +
>>>>>>     mm/Kconfig                              |  14 +
>>>>>>     mm/Makefile                             |   1 +
>>>>>>     mm/gmem.c                               | 746 ++++++++++++++++++++++++
>>>>>>     mm/huge_memory.c                        |  85 ++-
>>>>>>     mm/memory.c                             |  42 +-
>>>>>>     mm/mempolicy.c                          |   4 +
>>>>>>     mm/mmap.c                               |  40 +-
>>>>>>     mm/oom_kill.c                           |   2 +
>>>>>>     mm/page_alloc.c                         |   3 +
>>>>>>     mm/vm_object.c                          | 309 ++++++++++
>>>>>>     tools/include/uapi/asm-generic/unistd.h |   5 +-
>>>>>>     24 files changed, 1654 insertions(+), 19 deletions(-)
>>>>>>     create mode 100644 include/linux/gmem.h
>>>>>>     create mode 100644 mm/gmem.c
>>>>>>     create mode 100644 mm/vm_object.c
>>>>>>
Christian König Dec. 4, 2023, 9:35 a.m. UTC | #19
On 04.12.23 at 00:32, Alistair Popple wrote:
> Christian König <christian.koenig@amd.com> writes:
>
>> Am 01.12.23 um 06:48 schrieb Zeng, Oak:
>>> [SNIP]
>>> Besides memory eviction/oversubscription, there are a few other pain points when I use hmm:
>>>
>>> 1) hmm doesn't support file-backed memory, so it is hard to share
>> memory between processes in a gpu environment. You mentioned you have a
>> plan... How hard is it to support file-backed in your approach?
>>
>> As hard as it is to support it through HMM. That's what I meant when I
>> said this approach doesn't integrate well; as far as I know the problem
>> isn't inside HMM or any other solution but rather in the file system
>> layer.
> In what way does HMM not support file-backed memory? I was under the
> impression that at least hmm_range_fault() does.

Oh, well file-backed memory is indeed supported by HMM. IIRC KFD 
actually allows this for the SVM implementation.

It's just that the way the file system layer (for example) does 
writeback absolutely doesn't fit well with how GPUs and other 
acceleration devices work.

The general assumption in the kernel seems to be that page faults and 
preemption are extremely cheap. So things like copy on write are used 
quite extensively.

For a CPU this basically means you just need one context switch into the 
kernel to get the new address of a page into your PTEs on write, 
while for acceleration devices this always requires a complete CPU round 
trip for each initial write access to a 4k page. The performance impact 
is just horrible.
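
A small userspace toy that makes the premise visible: after fork(), every first write to an inherited anonymous page takes a copy-on-write fault (roughly one per 4k page, fewer with THP). On a CPU each of those is a cheap kernel entry; the argument above is that a device touching the same pages instead pays a full CPU round trip per fault:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/resource.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
          const size_t sz = 64 << 20;             /* 64 MiB of anonymous memory */
          char *buf = malloc(sz);

          if (!buf)
                  return 1;
          memset(buf, 1, sz);                     /* populate before fork() */

          if (fork() == 0) {                      /* child: writes trigger CoW */
                  struct rusage before, after;

                  getrusage(RUSAGE_SELF, &before);
                  memset(buf, 2, sz);             /* ~one minor fault per page */
                  getrusage(RUSAGE_SELF, &after);
                  printf("CoW minor faults in child: %ld\n",
                         after.ru_minflt - before.ru_minflt);
                  _exit(0);
          }
          wait(NULL);
          free(buf);
          return 0;
  }
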

Regards,
Christian.





