
[v1,00/12] MEMORY_DEVICE_COHERENT for CPU-accessible coherent device memory

Message ID 20211012171247.2861-1-alex.sierra@amd.com (mailing list archive)

Message

Sierra Guiza, Alejandro (Alex) Oct. 12, 2021, 5:12 p.m. UTC
This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
owned by a device that can be mapped into CPU page tables like
MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE.
With MEMORY_DEVICE_COHERENT, we isolate the new memory type from other
subsystems as far as possible, though there are some small changes to
other subsystems such as filesystem DAX, to handle the new memory type
appropriately.

We use ZONE_DEVICE for this instead of NUMA so that the amdgpu
allocator can manage it without conflicting with core mm for non-unified
memory use cases.

How it works: The system BIOS advertises the GPU device memory (aka VRAM)
as SPM (special purpose memory) in the UEFI system address map.
The amdgpu driver registers the memory with devmap as
MEMORY_DEVICE_COHERENT using devm_memremap_pages.
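
For illustration, a minimal sketch of that registration step follows. This is
not the actual amdgpu code: the example_* names and the res_start/res_size
parameters are placeholders, MEMORY_DEVICE_COHERENT is the pgmap type added by
patch 1, and everything else is the existing dev_pagemap/devm_memremap_pages
interface.

#include <linux/memremap.h>
#include <linux/mm.h>
#include <linux/err.h>

static void example_page_free(struct page *page)
{
        /* Return the backing VRAM page to the driver's own allocator. */
}

static vm_fault_t example_migrate_to_ram(struct vm_fault *vmf)
{
        /* Called on a CPU fault if the page must be moved back to system RAM. */
        return 0;
}

static const struct dev_pagemap_ops example_pagemap_ops = {
        .page_free      = example_page_free,
        .migrate_to_ram = example_migrate_to_ram,
};

/* res_start/res_size: the SPM range the BIOS reserved for this GPU's VRAM. */
static int example_register_coherent_vram(struct device *dev,
                                          struct dev_pagemap *pgmap,
                                          u64 res_start, u64 res_size)
{
        void *addr;

        pgmap->type        = MEMORY_DEVICE_COHERENT;
        pgmap->range.start = res_start;
        pgmap->range.end   = res_start + res_size - 1;
        pgmap->nr_range    = 1;
        pgmap->ops         = &example_pagemap_ops;
        pgmap->owner       = dev;  /* lets migrate_vma select only our pages */

        /* Create ZONE_DEVICE struct pages covering the VRAM range. */
        addr = devm_memremap_pages(dev, pgmap);
        return PTR_ERR_OR_ZERO(addr);
}

The resulting pages are ordinary ZONE_DEVICE pages, which is what allows the
core mm to map them into CPU page tables and to migrate them.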

The initial user for this hardware page migration capability will be
the Frontier supercomputer project. Our nodes in the lab have .5 TB of
system memory plus 256 GB of device memory split across 4 GPUs, all in
the same coherent address space. Page migration is expected to improve
application efficiency significantly. We will report empirical results
as they become available.
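
To make the driver-side migration path concrete, here is a simplified sketch
of moving one VMA range of device pages back to system memory with the
existing migrate_vma_setup()/migrate_vma_pages()/migrate_vma_finalize() calls.
The example_* naming is invented, the real code in kfd_migrate.c adds DMA
copies, error handling and huge page support, and the device-coherent source
selection flag is the one added by patch 2, so its exact name may differ from
what is shown here.

#include <linux/migrate.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/slab.h>

static int example_migrate_range_to_ram(struct vm_area_struct *vma,
                                        unsigned long start, unsigned long end,
                                        void *pgmap_owner)
{
        unsigned long npages = (end - start) >> PAGE_SHIFT;
        unsigned long *src, *dst;
        struct migrate_vma args = {};
        unsigned long i;
        int ret = -ENOMEM;

        src = kvcalloc(npages, sizeof(*src), GFP_KERNEL);
        dst = kvcalloc(npages, sizeof(*dst), GFP_KERNEL);
        if (!src || !dst)
                goto out;

        args.vma         = vma;
        args.start       = start;
        args.end         = end;
        args.src         = src;
        args.dst         = dst;
        args.pgmap_owner = pgmap_owner;
        /* Collect only pages resident in this driver's coherent device memory. */
        args.flags       = MIGRATE_VMA_SELECT_DEVICE_COHERENT;

        ret = migrate_vma_setup(&args);
        if (ret)
                goto out;

        for (i = 0; i < npages; i++) {
                struct page *spage = migrate_pfn_to_page(src[i]);
                struct page *dpage;

                if (!spage || !(src[i] & MIGRATE_PFN_MIGRATE))
                        continue;

                dpage = alloc_page_vma(GFP_HIGHUSER, vma,
                                       start + (i << PAGE_SHIFT));
                if (!dpage)
                        continue;

                lock_page(dpage);
                /* A real driver would use the GPU's DMA engine here. */
                copy_highpage(dpage, spage);
                /* Some older kernels also OR in MIGRATE_PFN_LOCKED here. */
                dst[i] = migrate_pfn(page_to_pfn(dpage));
        }

        migrate_vma_pages(&args);
        migrate_vma_finalize(&args);
out:
        kvfree(src);
        kvfree(dst);
        return ret;
}

Migration in the opposite direction (system memory to VRAM) follows the same
pattern with MIGRATE_VMA_SELECT_SYSTEM as the source selection flag.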

This includes patches originally by Ralph Campbell to change ZONE_DEVICE
reference counting as requested in previous reviews of this patch series
(see https://patchwork.freedesktop.org/series/90706/). We extended
hmm_test to cover migration of MEMORY_DEVICE_COHERENT. This patch set
builds on HMM and our SVM memory manager already merged in 5.14.
We would like to complete review and merge this migration patchset for
5.16.

Alex Sierra (10):
  mm: add zone device coherent type memory support
  mm: add device coherent vma selection for memory migration
  drm/amdkfd: ref count init for device pages
  drm/amdkfd: add SPM support for SVM
  drm/amdkfd: coherent type as sys mem on migration to ram
  lib: test_hmm add ioctl to get zone device type
  lib: test_hmm add module param for zone device type
  lib: add support for device coherent type in test_hmm
  tools: update hmm-test to support device coherent type
  tools: update test_hmm script to support SP config

Ralph Campbell (2):
  ext4/xfs: add page refcount helper
  mm: remove extra ZONE_DEVICE struct page refcount

 arch/powerpc/kvm/book3s_hv_uvmem.c       |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  40 ++--
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |   2 +-
 fs/dax.c                                 |   8 +-
 fs/ext4/inode.c                          |   5 +-
 fs/fuse/dax.c                            |   4 +-
 fs/xfs/xfs_file.c                        |   4 +-
 include/linux/dax.h                      |  10 +
 include/linux/memremap.h                 |  15 +-
 include/linux/migrate.h                  |   1 +
 include/linux/mm.h                       |  19 +-
 lib/test_hmm.c                           | 276 +++++++++++++++++------
 lib/test_hmm_uapi.h                      |  20 +-
 mm/internal.h                            |   8 +
 mm/memcontrol.c                          |  12 +-
 mm/memory-failure.c                      |   6 +-
 mm/memremap.c                            |  71 ++----
 mm/migrate.c                             |  33 +--
 mm/page_alloc.c                          |   3 +
 mm/swap.c                                |  45 +---
 tools/testing/selftests/vm/hmm-tests.c   | 137 +++++++++--
 tools/testing/selftests/vm/test_hmm.sh   |  20 +-
 22 files changed, 490 insertions(+), 251 deletions(-)

Comments

Andrew Morton Oct. 12, 2021, 6:39 p.m. UTC | #1
On Tue, 12 Oct 2021 12:12:35 -0500 Alex Sierra <alex.sierra@amd.com> wrote:

> This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
> owned by a device that can be mapped into CPU page tables like
> MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE.
> With MEMORY_DEVICE_COHERENT, we isolate the new memory type from other
> subsystems as far as possible, though there are some small changes to
> other subsystems such as filesystem DAX, to handle the new memory type
> appropriately.
> 
> We use ZONE_DEVICE for this instead of NUMA so that the amdgpu
> allocator can manage it without conflicting with core mm for non-unified
> memory use cases.
> 
> How it works: The system BIOS advertises the GPU device memory (aka VRAM)
> as SPM (special purpose memory) in the UEFI system address map.
> The amdgpu driver registers the memory with devmap as
> MEMORY_DEVICE_COHERENT using devm_memremap_pages.
> 
> The initial user for this hardware page migration capability will be
> the Frontier supercomputer project.

To what other uses will this infrastructure be put?

Because I must ask: if this feature is for one single computer which
presumably has a custom kernel, why add it to mainline Linux?

> Our nodes in the lab have .5 TB of
> system memory plus 256 GB of device memory split across 4 GPUs, all in
> the same coherent address space. Page migration is expected to improve
> application efficiency significantly. We will report empirical results
> as they become available.
> 
> This includes patches originally by Ralph Campbell to change ZONE_DEVICE
> reference counting as requested in previous reviews of this patch series
> (see https://patchwork.freedesktop.org/series/90706/). We extended
> hmm_test to cover migration of MEMORY_DEVICE_COHERENT. This patch set
> builds on HMM and our SVM memory manager already merged in 5.14.
> We would like to complete review and merge this migration patchset for
> 5.16.
Jason Gunthorpe Oct. 12, 2021, 6:56 p.m. UTC | #2
On Tue, Oct 12, 2021 at 11:39:57AM -0700, Andrew Morton wrote:
> On Tue, 12 Oct 2021 12:12:35 -0500 Alex Sierra <alex.sierra@amd.com> wrote:
> 
> > This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
> > owned by a device that can be mapped into CPU page tables like
> > MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE.
> > With MEMORY_DEVICE_COHERENT, we isolate the new memory type from other
> > subsystems as far as possible, though there are some small changes to
> > other subsystems such as filesystem DAX, to handle the new memory type
> > appropriately.
> > 
> > We use ZONE_DEVICE for this instead of NUMA so that the amdgpu
> > allocator can manage it without conflicting with core mm for non-unified
> > memory use cases.
> > 
> > How it works: The system BIOS advertises the GPU device memory (aka VRAM)
> > as SPM (special purpose memory) in the UEFI system address map.
> > The amdgpu driver registers the memory with devmap as
> > MEMORY_DEVICE_COHERENT using devm_memremap_pages.
> > 
> > The initial user for this hardware page migration capability will be
> > the Frontier supercomputer project.
> 
> To what other uses will this infrastructure be put?
> 
> Because I must ask: if this feature is for one single computer which
> presumably has a custom kernel, why add it to mainline Linux?

Well, it certainly isn't just "one single computer". Overall I know of
about, hmm, ~10 *datacenters* worth of installations that are using
similar technology underpinnings.

"Frontier" is the code name for a specific installation but as the
technology is proven out there will be many copies made of that same
approach.

The previous program "Summit" was done with NVIDIA GPUs and PowerPC
CPUs and also included a very similar capability. I think this is a
good sign that this coherently attached accelerator will continue to
be a theme in computing going forward. IIRC this was done using out of
tree kernel patches and NUMA localities.

Specifically with CXL now being standardized and on a path to ubiquity
I think we will see an explosion in deployments of coherently attached
accelerator memory. This is the high end trickling down to wider
usage.

I strongly think many CXL accelerators are going to want to manage
their on-accelerator memory in this way as it makes universal sense to
want to carefully manage memory access locality to optimize for
performance.

Jason
Felix Kuehling Oct. 12, 2021, 7 p.m. UTC | #3
On 2021-10-12 at 2:39 p.m., Andrew Morton wrote:
> On Tue, 12 Oct 2021 12:12:35 -0500 Alex Sierra <alex.sierra@amd.com> wrote:
>
>> This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
>> owned by a device that can be mapped into CPU page tables like
>> MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE.
>> With MEMORY_DEVICE_COHERENT, we isolate the new memory type from other
>> subsystems as far as possible, though there are some small changes to
>> other subsystems such as filesystem DAX, to handle the new memory type
>> appropriately.
>>
>> We use ZONE_DEVICE for this instead of NUMA so that the amdgpu
>> allocator can manage it without conflicting with core mm for non-unified
>> memory use cases.
>>
>> How it works: The system BIOS advertises the GPU device memory (aka VRAM)
>> as SPM (special purpose memory) in the UEFI system address map.
>> The amdgpu driver registers the memory with devmap as
>> MEMORY_DEVICE_COHERENT using devm_memremap_pages.
>>
>> The initial user for this hardware page migration capability will be
>> the Frontier supercomputer project.
> To what other uses will this infrastructure be put?
>
> Because I must ask: if this feature is for one single computer which
> presumably has a custom kernel, why add it to mainline Linux?

I'm not sure this will be the only system with this architecture. This
is only the first one I know of. I hope it's not a one-off, after all
the work we did on it. ;)

The Linux kernel on this system is based on SLES. We are working with
SUSE on backporting patches needed for this system. However, those
patches need to be upstream first.

DEVICE_PUBLIC was removed because it had no users. We're trying to add
it (or something like it) back because we now have a use case for it.

Regards,
  Felix


>
>> Our nodes in the lab have .5 TB of
>> system memory plus 256 GB of device memory split across 4 GPUs, all in
>> the same coherent address space. Page migration is expected to improve
>> application efficiency significantly. We will report empirical results
>> as they become available.
>>
>> This includes patches originally by Ralph Campbell to change ZONE_DEVICE
>> reference counting as requested in previous reviews of this patch series
>> (see https://patchwork.freedesktop.org/series/90706/). We extended
>> hmm_test to cover migration of MEMORY_DEVICE_COHERENT. This patch set
>> builds on HMM and our SVM memory manager already merged in 5.14.
>> We would like to complete review and merge this migration patchset for
>> 5.16.
Andrew Morton Oct. 12, 2021, 7:03 p.m. UTC | #4
On Tue, 12 Oct 2021 15:56:29 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:

> > To what other uses will this infrastructure be put?
> > 
> > Because I must ask: if this feature is for one single computer which
> > presumably has a custom kernel, why add it to mainline Linux?
> 
> Well, it certainly isn't just "one single computer". Overall I know of
> about, hmm, ~10 *datacenters* worth of installations that are using
> similar technology underpinnings.
> 
> "Frontier" is the code name for a specific installation but as the
> technology is proven out there will be many copies made of that same
> approach.
> 
> The previous program "Summit" was done with NVIDIA GPUs and PowerPC
> CPUs and also included a very similar capability. I think this is a
> good sign that this coherently attached accelerator will continue to
> be a theme in computing going forward. IIRC this was done using out of
> tree kernel patches and NUMA localities.
> 
> Specifically with CXL now being standardized and on a path to ubiquity
> I think we will see an explosion in deployments of coherently attached
> accelerator memory. This is the high end trickling down to wider
> usage.
> 
> I strongly think many CXL accelerators are going to want to manage
> their on-accelerator memory in this way as it makes universal sense to
> want to carefully manage memory access locality to optimize for
> performance.

Thanks.  Can we please get something like the above into the [0/n]
changelog?  Along with any other high-level info which is relevant?

It's rather important.  "why should I review this", "why should we
merge this", etc.
Matthew Wilcox (Oracle) Oct. 12, 2021, 7:11 p.m. UTC | #5
On Tue, Oct 12, 2021 at 11:39:57AM -0700, Andrew Morton wrote:
> Because I must ask: if this feature is for one single computer which
> presumably has a custom kernel, why add it to mainline Linux?

I think in particular patch 2 deserves to be merged because it removes
a ton of cruft from every call to put_page() (at least if you're using
a distro config).  It makes me nervous, but I think it's the right
thing to do.  It may well need more fixups after it has been merged,
but that's life.
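
(For context, the special case in question looks roughly like this, simplified
from include/linux/mm.h around v5.14, so details may differ slightly. Patch 2
makes ZONE_DEVICE pages use the normal refcounting path, freed at refcount 0
rather than 1, which is what lets this branch go away.)

static inline void put_page(struct page *page)
{
        page = compound_head(page);

        /*
         * Devmap-managed pages must catch the refcount transition from
         * 2 to 1: a refcount of 1 means the page is free and the owning
         * driver has to be notified through pgmap->ops->page_free().
         */
        if (page_is_devmap_managed(page)) {
                put_devmap_managed_page(page);
                return;
        }

        if (put_page_testzero(page))
                __put_page(page);
}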
Felix Kuehling Oct. 12, 2021, 8:24 p.m. UTC | #6
On 2021-10-12 at 3:11 p.m., Matthew Wilcox wrote:
> On Tue, Oct 12, 2021 at 11:39:57AM -0700, Andrew Morton wrote:
>> Because I must ask: if this feature is for one single computer which
>> presumably has a custom kernel, why add it to mainline Linux?
> I think in particular patch 2 deserves to be merged because it removes
> a ton of cruft from every call to put_page() (at least if you're using
> a distro config).  It makes me nervous, but I think it's the right
> thing to do.  It may well need more fixups after it has been merged,
> but that's life.

Maybe we should split the first two patches into a separate series, and
get it merged first, while the more controversial stuff is still under
review?

Thanks,
  Felix
Darrick J. Wong Oct. 12, 2021, 8:44 p.m. UTC | #7
On Tue, Oct 12, 2021 at 04:24:25PM -0400, Felix Kuehling wrote:
> 
> On 2021-10-12 at 3:11 p.m., Matthew Wilcox wrote:
> > On Tue, Oct 12, 2021 at 11:39:57AM -0700, Andrew Morton wrote:
> >> Because I must ask: if this feature is for one single computer which
> >> presumably has a custom kernel, why add it to mainline Linux?
> > I think in particular patch 2 deserves to be merged because it removes
> > a ton of cruft from every call to put_page() (at least if you're using
> > a distro config).  It makes me nervous, but I think it's the right
> > thing to do.  It may well need more fixups after it has been merged,
> > but that's life.
> 
> Maybe we should split the first two patches into a separate series, and
> get it merged first, while the more controversial stuff is still under
> review?

Yes, please.  I've seen that first patch several times already. :)

--D

> Thanks,
>   Felix
> 
>
Felix Kuehling Oct. 12, 2021, 11:04 p.m. UTC | #8
On 2021-10-12 at 3:03 p.m., Andrew Morton wrote:
> On Tue, 12 Oct 2021 15:56:29 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>>> To what other uses will this infrastructure be put?
>>>
>>> Because I must ask: if this feature is for one single computer which
>>> presumably has a custom kernel, why add it to mainline Linux?
>> Well, it certainly isn't just "one single computer". Overall I know of
>> about, hmm, ~10 *datacenters* worth of installations that are using
>> similar technology underpinnings.
>>
>> "Frontier" is the code name for a specific installation but as the
>> technology is proven out there will be many copies made of that same
>> approach.
>>
>> The previous program "Summit" was done with NVIDIA GPUs and PowerPC
>> CPUs and also included a very similar capability. I think this is a
>> good sign that this coherently attached accelerator will continue to
>> be a theme in computing going forward. IIRC this was done using out of
>> tree kernel patches and NUMA localities.
>>
>> Specifically with CXL now being standardized and on a path to ubiquity
>> I think we will see an explosion in deployments of coherently attached
>> accelerator memory. This is the high end trickling down to wider
>> usage.
>>
>> I strongly think many CXL accelerators are going to want to manage
>> their on-accelerator memory in this way as it makes universal sense to
>> want to carefully manage memory access locality to optimize for
>> performance.
> Thanks.  Can we please get something like the above into the [0/n]
> changelog?  Along with any other high-level info which is relevant?
>
> It's rather important.  "why should I review this", "why should we
> merge this", etc.

Using Jason's input, I suggest adding this text for the next revision of
the cover letter:

DEVICE_PRIVATE memory emulates coherence between CPU and the device by
migrating data back and forth. An application that accesses the same
page (or huge page) from CPU and device concurrently can cause many
migrations, each involving device cache flushes, page table updates and
page faults on the CPU or device.

In contrast, DEVICE_COHERENT enables truly concurrent CPU and device
access to ZONE_DEVICE pages by taking advantage of HW coherence
protocols.

As a historical reference point, the Summit supercomputer implemented
such a coherent memory architecture with NVidia GPUs and PowerPC CPUs.

The initial user for the DEVICE_COHERENT memory type will be the AMD GPU
driver on the Frontier supercomputer. CXL standardizes a coherent
peripheral interconnect, leading to more mainstream systems and devices
with that capability.

Best regards,
  Felix
Daniel Vetter Oct. 13, 2021, 1:34 p.m. UTC | #9
On Tue, Oct 12, 2021 at 03:56:29PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 12, 2021 at 11:39:57AM -0700, Andrew Morton wrote:
> > On Tue, 12 Oct 2021 12:12:35 -0500 Alex Sierra <alex.sierra@amd.com> wrote:
> > 
> > > This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
> > > owned by a device that can be mapped into CPU page tables like
> > > MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE.
> > > With MEMORY_DEVICE_COHERENT, we isolate the new memory type from other
> > > subsystems as far as possible, though there are some small changes to
> > > other subsystems such as filesystem DAX, to handle the new memory type
> > > appropriately.
> > > 
> > > We use ZONE_DEVICE for this instead of NUMA so that the amdgpu
> > > allocator can manage it without conflicting with core mm for non-unified
> > > memory use cases.
> > > 
> > > How it works: The system BIOS advertises the GPU device memory (aka VRAM)
> > > as SPM (special purpose memory) in the UEFI system address map.
> > > The amdgpu driver registers the memory with devmap as
> > > MEMORY_DEVICE_COHERENT using devm_memremap_pages.
> > > 
> > > The initial user for this hardware page migration capability will be
> > > the Frontier supercomputer project.
> > 
> > To what other uses will this infrastructure be put?
> > 
> > Because I must ask: if this feature is for one single computer which
> > presumably has a custom kernel, why add it to mainline Linux?
> 
> Well, it certainly isn't just "one single computer". Overall I know of
> about, hmm, ~10 *datacenters* worth of installations that are using
> similar technology underpinnings.
> 
> "Frontier" is the code name for a specific installation but as the
> technology is proven out there will be many copies made of that same
> approach.
> 
> The previous program "Summit" was done with NVIDIA GPUs and PowerPC
> CPUs and also included a very similar capability. I think this is a
> good sign that this coherently attached accelerator will continue to
> be a theme in computing going forward. IIRC this was done using out of
> tree kernel patches and NUMA localities.
> 
> Specifically with CXL now being standardized and on a path to ubiquity
> I think we will see an explosion in deployments of coherently attached
> accelerator memory. This is the high end trickling down to wider
> usage.
> 
> I strongly think many CXL accelerators are going to want to manage
> their on-accelerator memory in this way as it makes universal sense to
> want to carefully manage memory access locality to optimize for
> performance.

Yeah with CXL this will be used by a lot more drivers/devices, not
even including nvidia's blob.

I guess you'll want to make sure you get an ack on this from the CXL
folks, so that we don't end up with a mess.
-Daniel