[RFC,0/6] Allow file-backed or shared device private pages

Message ID cover.24b48fced909fe1414e83b58aa468d4393dd06de.1742099301.git-series.apopple@nvidia.com (mailing list archive)

Message

Alistair Popple March 16, 2025, 4:29 a.m. UTC
To simplify the initial implementation, device private pages were restricted to
being used only for private anonymous memory. This avoided having to deal with
issues related to shared and/or file-backed pages early on.

This series lifts that restriction by allowing ZONE_DEVICE private pages to
exist in the pagecache. As the CPU cannot directly access these pages, special
care needs to be taken when looking them up in the pagecache. This series
solves the problem by always migrating such pages back from device memory when
they are looked up in the pagecache. This is similar to how device private
pages work for anonymous memory, where a CPU fault on the device memory always
triggers a migration back to CPU system memory.

Initially this series only allows read-only migration, because the call to
migrate pages back will always reload the data from backing storage. It then
introduces a callback that drivers may implement to copy back any modified
data as required.

Drivers are expected to call set_page_dirty() when copying data back to ensure
it hits the backing store.

This series is an early draft implementation - in particular error handling
is not dealt with and I'm not sure that the management of PTE write bits is
entirely correct. Much more testing of all the various filesystem corner cases
is also required. The aim of this series is to get early feedback on the overall
concept of putting device private pages in the pagecache before fleshing out the
implementation further.

Signed-off-by: Alistair Popple <apopple@nvidia.com>

Alistair Popple (6):
  mm/migrate_device.c: Don't read dirty bit of non-present PTEs
  mm/migrate: Support file-backed pages with migrate_vma
  mm: Allow device private pages to exist in page cache
  mm: Implement writeback for shared device private pages
  selftests/hmm: Add file-backed migration tests
  nouveau: Add SVM support for migrating file-backed pages to the GPU

 drivers/gpu/drm/nouveau/nouveau_dmem.c |  24 ++-
 include/linux/memremap.h               |   2 +-
 include/linux/migrate.h                |   6 +-
 lib/test_hmm.c                         |  27 ++-
 mm/filemap.c                           |  41 ++++-
 mm/memory.c                            |   9 +-
 mm/memremap.c                          |   1 +-
 mm/migrate.c                           |  42 ++--
 mm/migrate_device.c                    | 114 +++++++++++-
 mm/rmap.c                              |   2 +-
 tools/testing/selftests/mm/hmm-tests.c | 252 +++++++++++++++++++++++++-
 11 files changed, 489 insertions(+), 31 deletions(-)

base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319

Comments

Christoph Hellwig March 17, 2025, 6:04 a.m. UTC | #1
On Sun, Mar 16, 2025 at 03:29:23PM +1100, Alistair Popple wrote:
> This series lifts that restriction by allowing ZONE_DEVICE private pages to
> exist in the pagecache.

You'd better provide a really good argument for why we'd even want
to do that.  So far this cover letter fails to do that.
Matthew Wilcox March 26, 2025, 2:14 a.m. UTC | #2
On Sun, Mar 16, 2025 at 11:04:07PM -0700, Christoph Hellwig wrote:
> On Sun, Mar 16, 2025 at 03:29:23PM +1100, Alistair Popple wrote:
> > This series lifts that restriction by allowing ZONE_DEVICE private pages to
> > exist in the pagecache.
> 
> You'd better provide a really good argument for why we'd even want
> to do that.  So far this cover letter fails to do that.

Alistair and I discussed this during his session at LSFMM today.
Here's what I think we agreed to.

The use case is a file containing a potentially very large data set.
Some phases of processing that data set are best done on the GPU, other
phases on the CPU.  We agreed that shared writable mmap was not actually
needed (it might need to be supported for correctness, but it's not a
performance requirement).

So, there's no need to put DEVICE_PRIVATE pages in the page cache.
Instead the GPU will take a copy of the page(s).  We agreed that there
will have to be some indication (probably a folio flag?) that the GPU has
or may have a copy of (some of) the folio so that it can be invalidated
if the page is removed due to truncation / eviction.

Alistair, let me know if that's not what you think we agreed to ;-)
Alistair Popple March 27, 2025, 2:49 p.m. UTC | #3
On Wed, Mar 26, 2025 at 02:14:59AM +0000, Matthew Wilcox wrote:
> On Sun, Mar 16, 2025 at 11:04:07PM -0700, Christoph Hellwig wrote:
> > On Sun, Mar 16, 2025 at 03:29:23PM +1100, Alistair Popple wrote:
> > > This series lifts that restriction by allowing ZONE_DEVICE private pages to
> > > exist in the pagecache.
> > 
> > You'd better provide a really good argument for why we'd even want
> > to do that.  So far this cover letter fails to do that.
> 
> Alistair and I discussed this during his session at LSFMM today.
> Here's what I think we agreed to.

Thanks for writing up this summary.

> 
> The use case is a file containing a potentially very large data set.
> Some phases of processing that data set are best done on the GPU, other
> phases on the CPU.  We agreed that shared writable mmap was not actually
> needed (it might need to be supported for correctness, but it's not a
> performance requirement).

Right. I agree we don't currently have a good use case for writeback, so the
next revision will definitely only support read-only access.

> So, there's no need to put DEVICE_PRIVATE pages in the page cache.
> Instead the GPU will take a copy of the page(s).  We agreed that there
> will have to be some indication (probably a folio flag?) that the GPU has
> or may have a copy of (some of) the folio so that it can be invalidated
> if the page is removed due to truncation / eviction.
>
> Alistair, let me know if that's not what you think we agreed to ;-)

That all looks about right. I think the flag/indication is a good idea and is
probably the best solution, but I will need to write the code to truly convince
myself of that :-)
Matthew Wilcox March 27, 2025, 4:47 p.m. UTC | #4
On Thu, Mar 27, 2025 at 07:49:47AM -0700, Alistair Popple wrote:
> On Wed, Mar 26, 2025 at 02:14:59AM +0000, Matthew Wilcox wrote:
> > So, there's no need to put DEVICE_PRIVATE pages in the page cache.
> > Instead the GPU will take a copy of the page(s).  We agreed that there
> > will have to be some indication (probably a folio flag?) that the GPU has
> > or may have a copy of (some of) the folio so that it can be invalidated
> > if the page is removed due to truncation / eviction.
> >
> > Alistair, let me know if that's not what you think we agreed to ;-)
> 
> That all looks about right. I think the flag/indication is a good idea and is
> probably the best solution, but I will need to write the code to truly convince
> myself of that :-)

It might end up making more sense to make it a per-VMA flag or a
per-inode flag, but that's probably something you're in a better
position to determine than I am.