mbox series

[RFC,0/3] MAP_POPULATE for device memory

Message ID 20220306053211.135762-1-jarkko@kernel.org (mailing list archive)
Headers show
Series MAP_POPULATE for device memory | expand

Message

Jarkko Sakkinen March 6, 2022, 5:32 a.m. UTC
For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow
to use that for initializing the device memory by providing a new callback
f_ops->populate() for the purpose.

SGX patches are provided to show the callback in context.

An obvious alternative is a ioctl but it is less elegant and requires
two syscalls (mmap + ioctl) per memory range, instead of just one
(mmap).

Jarkko Sakkinen (3):
  mm: Add f_ops->populate()
  x86/sgx: Export sgx_encl_page_alloc()
  x86/sgx: Implement EAUG population with MAP_POPULATE

 arch/mips/kernel/vdso.c                    |   2 +-
 arch/x86/kernel/cpu/sgx/driver.c           | 129 +++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/encl.c             |  38 ++++++
 arch/x86/kernel/cpu/sgx/encl.h             |   3 +
 arch/x86/kernel/cpu/sgx/ioctl.c            |  38 ------
 drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c |   2 +-
 fs/coda/file.c                             |   2 +-
 fs/overlayfs/file.c                        |   2 +-
 include/linux/fs.h                         |  12 +-
 include/linux/mm.h                         |   2 +-
 ipc/shm.c                                  |   2 +-
 mm/mmap.c                                  |  10 +-
 mm/nommu.c                                 |   4 +-
 13 files changed, 193 insertions(+), 53 deletions(-)

Comments

David Laight March 6, 2022, 8:30 a.m. UTC | #1
From: Jarkko Sakkinen
> Sent: 06 March 2022 05:32
> 
> For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow
> to use that for initializing the device memory by providing a new callback
> f_ops->populate() for the purpose.
> 
> SGX patches are provided to show the callback in context.
> 
> An obvious alternative is a ioctl but it is less elegant and requires
> two syscalls (mmap + ioctl) per memory range, instead of just one
> (mmap).

Is this all about trying to stop the vm_operations_struct.fault()
function being called?

It is pretty easy to ensure the mappings are setup in the driver's
mmap() function.
Then the fault() function can just return -VM_FAULT_SIGBUS;

If it is actually device memory you just need to call vm_iomap_memory()
That quite nicely mmap()s PCIe memory space into a user process.

Mapping driver memory is slightly more difficult.
For buffers allocated with dma_alloc_coherent() you can
probably use dma_mmap_coherent().
But I have a loop calling remap_pfn_range() because the
buffer area is made of multiple 16kB kernel buffers that
need to be mapped to contiguous user pages.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Matthew Wilcox (Oracle) March 6, 2022, 11:33 a.m. UTC | #2
On Sun, Mar 06, 2022 at 07:32:04AM +0200, Jarkko Sakkinen wrote:
> For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow
> to use that for initializing the device memory by providing a new callback
> f_ops->populate() for the purpose.

As I said, NAK.
Jarkko Sakkinen March 6, 2022, 4:52 p.m. UTC | #3
On Sun, Mar 06, 2022 at 08:30:14AM +0000, David Laight wrote:
> From: Jarkko Sakkinen
> > Sent: 06 March 2022 05:32
> > 
> > For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow
> > to use that for initializing the device memory by providing a new callback
> > f_ops->populate() for the purpose.
> > 
> > SGX patches are provided to show the callback in context.
> > 
> > An obvious alternative is a ioctl but it is less elegant and requires
> > two syscalls (mmap + ioctl) per memory range, instead of just one
> > (mmap).
> 
> Is this all about trying to stop the vm_operations_struct.fault()
> function being called?

In SGX protected memory is actually encrypted normal memory and CPU access
control semantics (marked as reserved, e.g. struct page's).

In SGX you need call ENCLS[EAUG] outside the protected memory to add new
pages to the protected memory. Then when CPU is executing inside this
protected memory, also known as enclaves, it commits the memory as part of
the enclave either with ENCLU[EACCEPT] or ENCLU[EACCEPTCOPY].

So the point is not prevent page faults but to prepare the memory for
pending state so that the enclave can then accept them without round-trips,
and in some cases thus improve performance (in our case in enarx.dev
platform that we are developing).

In fact, #PF handler in SGX driver in the current SGX2 patch set also does
EAUG on-demand. Optimal is to have both routes available. And said, this
can be of course also implemented as ioctl.

BR, Jarkko
Christoph Hellwig March 7, 2022, 7:48 a.m. UTC | #4
On Sun, Mar 06, 2022 at 11:33:02AM +0000, Matthew Wilcox wrote:
> On Sun, Mar 06, 2022 at 07:32:04AM +0200, Jarkko Sakkinen wrote:
> > For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow
> > to use that for initializing the device memory by providing a new callback
> > f_ops->populate() for the purpose.
> 
> As I said, NAK.

Agreed.  This is an amazingly bad interface.
David Hildenbrand March 7, 2022, 10:12 a.m. UTC | #5
On 06.03.22 06:32, Jarkko Sakkinen wrote:
> For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow
> to use that for initializing the device memory by providing a new callback
> f_ops->populate() for the purpose.
> 
> SGX patches are provided to show the callback in context.
> 
> An obvious alternative is a ioctl but it is less elegant and requires
> two syscalls (mmap + ioctl) per memory range, instead of just one
> (mmap).

What about extending MADV_POPULATE_READ | MADV_POPULATE_WRITE to support
VM_IO | VM_PFNMAP (as well?) ?
Jarkko Sakkinen March 7, 2022, 1:29 p.m. UTC | #6
On Sun, Mar 06, 2022 at 11:48:26PM -0800, Christoph Hellwig wrote:
> On Sun, Mar 06, 2022 at 11:33:02AM +0000, Matthew Wilcox wrote:
> > On Sun, Mar 06, 2022 at 07:32:04AM +0200, Jarkko Sakkinen wrote:
> > > For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow
> > > to use that for initializing the device memory by providing a new callback
> > > f_ops->populate() for the purpose.
> > 
> > As I said, NAK.
> 
> Agreed.  This is an amazingly bad interface.

So what would you suggest to sort out the issue? I'm happy to go with
ioctl if nothing else is acceptable.

BR, Jarkko
Jarkko Sakkinen March 7, 2022, 2:22 p.m. UTC | #7
On Mon, Mar 07, 2022 at 11:12:44AM +0100, David Hildenbrand wrote:
> On 06.03.22 06:32, Jarkko Sakkinen wrote:
> > For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow
> > to use that for initializing the device memory by providing a new callback
> > f_ops->populate() for the purpose.
> > 
> > SGX patches are provided to show the callback in context.
> > 
> > An obvious alternative is a ioctl but it is less elegant and requires
> > two syscalls (mmap + ioctl) per memory range, instead of just one
> > (mmap).
> 
> What about extending MADV_POPULATE_READ | MADV_POPULATE_WRITE to support
> VM_IO | VM_PFNMAP (as well?) ?

What would be a proper point to bind that behaviour? For mmap/mprotect it'd
be probably populate_vma_page_range() because that would span both mmap()
and mprotect() (Dave's suggestion in this thread).

For MAP_POPULATE I did not have hard proof to show that it would be used
by other drivers but for madvice() you can find at least a few ioctl
based implementations:

$ git grep -e madv --and \( -e ioc \)  drivers/
drivers/gpu/drm/i915/gem/i915_gem_ioctls.h:int i915_gem_madvise_ioctl(struct drm_device *dev, void *data,
drivers/gpu/drm/i915/i915_driver.c:     DRM_IOCTL_DEF_DRV(I915_GEM_MADVISE, i915_gem_madvise_ioctl, DRM_RENDER_ALLOW),
drivers/gpu/drm/i915/i915_gem.c:i915_gem_madvise_ioctl(struct drm_device *dev, void *data,
drivers/gpu/drm/msm/msm_drv.c:static int msm_ioctl_gem_madvise(struct drm_device *dev, void *data,
drivers/gpu/drm/msm/msm_drv.c:  DRM_IOCTL_DEF_DRV(MSM_GEM_MADVISE,  msm_ioctl_gem_madvise,  DRM_RENDER_ALLOW),
drivers/gpu/drm/panfrost/panfrost_drv.c:static int panfrost_ioctl_madvise(struct drm_device *dev, void *data,
drivers/gpu/drm/vc4/vc4_drv.c:  DRM_IOCTL_DEF_DRV(VC4_GEM_MADVISE, vc4_gem_madvise_ioctl, DRM_RENDER_ALLOW),
drivers/gpu/drm/vc4/vc4_drv.h:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data,
drivers/gpu/drm/vc4/vc4_gem.c:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data,

IMHO this also provides supportive claim for MAP_POPULATE, and yeah, I
agree that to be consistent implementation, both madvice() and MAP_POPULATE
should work.

> -- 
> Thanks,
> 
> David / dhildenb

BR, Jarkko
David Hildenbrand March 7, 2022, 2:33 p.m. UTC | #8
On 07.03.22 15:22, Jarkko Sakkinen wrote:
> On Mon, Mar 07, 2022 at 11:12:44AM +0100, David Hildenbrand wrote:
>> On 06.03.22 06:32, Jarkko Sakkinen wrote:
>>> For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow
>>> to use that for initializing the device memory by providing a new callback
>>> f_ops->populate() for the purpose.
>>>
>>> SGX patches are provided to show the callback in context.
>>>
>>> An obvious alternative is a ioctl but it is less elegant and requires
>>> two syscalls (mmap + ioctl) per memory range, instead of just one
>>> (mmap).
>>
>> What about extending MADV_POPULATE_READ | MADV_POPULATE_WRITE to support
>> VM_IO | VM_PFNMAP (as well?) ?
> 
> What would be a proper point to bind that behaviour? For mmap/mprotect it'd
> be probably populate_vma_page_range() because that would span both mmap()
> and mprotect() (Dave's suggestion in this thread).

MADV_POPULATE_* ends up in faultin_vma_page_range(), right next to
populate_vma_page_range(). So it might require a similar way to hook
into the driver I guess.

> 
> For MAP_POPULATE I did not have hard proof to show that it would be used
> by other drivers but for madvice() you can find at least a few ioctl
> based implementations:
> 
> $ git grep -e madv --and \( -e ioc \)  drivers/
> drivers/gpu/drm/i915/gem/i915_gem_ioctls.h:int i915_gem_madvise_ioctl(struct drm_device *dev, void *data,
> drivers/gpu/drm/i915/i915_driver.c:     DRM_IOCTL_DEF_DRV(I915_GEM_MADVISE, i915_gem_madvise_ioctl, DRM_RENDER_ALLOW),
> drivers/gpu/drm/i915/i915_gem.c:i915_gem_madvise_ioctl(struct drm_device *dev, void *data,
> drivers/gpu/drm/msm/msm_drv.c:static int msm_ioctl_gem_madvise(struct drm_device *dev, void *data,
> drivers/gpu/drm/msm/msm_drv.c:  DRM_IOCTL_DEF_DRV(MSM_GEM_MADVISE,  msm_ioctl_gem_madvise,  DRM_RENDER_ALLOW),
> drivers/gpu/drm/panfrost/panfrost_drv.c:static int panfrost_ioctl_madvise(struct drm_device *dev, void *data,
> drivers/gpu/drm/vc4/vc4_drv.c:  DRM_IOCTL_DEF_DRV(VC4_GEM_MADVISE, vc4_gem_madvise_ioctl, DRM_RENDER_ALLOW),
> drivers/gpu/drm/vc4/vc4_drv.h:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data,
> drivers/gpu/drm/vc4/vc4_gem.c:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data,
> 
> IMHO this also provides supportive claim for MAP_POPULATE, and yeah, I
> agree that to be consistent implementation, both madvice() and MAP_POPULATE
> should work.

MADV_POPULATE_WRITE + MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE is one way to
dynamically manage memory consumption inside a sparse memory mapping
(preallocate/populate via MADV_POPULATE_WRITE, discard via
MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE).  Extending that whole mechanism to
deal with VM_IO | VM_PFNMAP mappings as well could be interesting.

At least I herd about some ideas where we might want to dynamically
expose memory to a VM (via virtio-mem) inside a sparse memory mapping,
and the memory in that sparse memory mapping is provided from a
dedicated memory pool managed by a device driver -- not just using
ordinary anonymous/file/hugetlb memory as we do right now.

Now, this is certainly stuff for the future, I just wanted to mention it.
Jarkko Sakkinen March 7, 2022, 3:49 p.m. UTC | #9
On Mon, Mar 07, 2022 at 03:33:52PM +0100, David Hildenbrand wrote:
> On 07.03.22 15:22, Jarkko Sakkinen wrote:
> > On Mon, Mar 07, 2022 at 11:12:44AM +0100, David Hildenbrand wrote:
> >> On 06.03.22 06:32, Jarkko Sakkinen wrote:
> >>> For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow
> >>> to use that for initializing the device memory by providing a new callback
> >>> f_ops->populate() for the purpose.
> >>>
> >>> SGX patches are provided to show the callback in context.
> >>>
> >>> An obvious alternative is a ioctl but it is less elegant and requires
> >>> two syscalls (mmap + ioctl) per memory range, instead of just one
> >>> (mmap).
> >>
> >> What about extending MADV_POPULATE_READ | MADV_POPULATE_WRITE to support
> >> VM_IO | VM_PFNMAP (as well?) ?
> > 
> > What would be a proper point to bind that behaviour? For mmap/mprotect it'd
> > be probably populate_vma_page_range() because that would span both mmap()
> > and mprotect() (Dave's suggestion in this thread).
> 
> MADV_POPULATE_* ends up in faultin_vma_page_range(), right next to
> populate_vma_page_range(). So it might require a similar way to hook
> into the driver I guess.
> 
> > 
> > For MAP_POPULATE I did not have hard proof to show that it would be used
> > by other drivers but for madvice() you can find at least a few ioctl
> > based implementations:
> > 
> > $ git grep -e madv --and \( -e ioc \)  drivers/
> > drivers/gpu/drm/i915/gem/i915_gem_ioctls.h:int i915_gem_madvise_ioctl(struct drm_device *dev, void *data,
> > drivers/gpu/drm/i915/i915_driver.c:     DRM_IOCTL_DEF_DRV(I915_GEM_MADVISE, i915_gem_madvise_ioctl, DRM_RENDER_ALLOW),
> > drivers/gpu/drm/i915/i915_gem.c:i915_gem_madvise_ioctl(struct drm_device *dev, void *data,
> > drivers/gpu/drm/msm/msm_drv.c:static int msm_ioctl_gem_madvise(struct drm_device *dev, void *data,
> > drivers/gpu/drm/msm/msm_drv.c:  DRM_IOCTL_DEF_DRV(MSM_GEM_MADVISE,  msm_ioctl_gem_madvise,  DRM_RENDER_ALLOW),
> > drivers/gpu/drm/panfrost/panfrost_drv.c:static int panfrost_ioctl_madvise(struct drm_device *dev, void *data,
> > drivers/gpu/drm/vc4/vc4_drv.c:  DRM_IOCTL_DEF_DRV(VC4_GEM_MADVISE, vc4_gem_madvise_ioctl, DRM_RENDER_ALLOW),
> > drivers/gpu/drm/vc4/vc4_drv.h:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data,
> > drivers/gpu/drm/vc4/vc4_gem.c:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data,
> > 
> > IMHO this also provides supportive claim for MAP_POPULATE, and yeah, I
> > agree that to be consistent implementation, both madvice() and MAP_POPULATE
> > should work.
> 
> MADV_POPULATE_WRITE + MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE is one way to
> dynamically manage memory consumption inside a sparse memory mapping
> (preallocate/populate via MADV_POPULATE_WRITE, discard via
> MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE).  Extending that whole mechanism to
> deal with VM_IO | VM_PFNMAP mappings as well could be interesting.
> 
> At least I herd about some ideas where we might want to dynamically
> expose memory to a VM (via virtio-mem) inside a sparse memory mapping,
> and the memory in that sparse memory mapping is provided from a
> dedicated memory pool managed by a device driver -- not just using
> ordinary anonymous/file/hugetlb memory as we do right now.
> 
> Now, this is certainly stuff for the future, I just wanted to mention it.

For SGX purposes I'm now studying the possibly to use ra_state to get
idea where do "prefetching" (EAUG's) in batches, as it is something
that would not require any intrusive changes to mm but thank you for
sharing this. Looking into implementing this properly is the 2nd option,
if that does not work out.

> -- 
> Thanks,
> 
> David / dhildenb

BR, Jarkko
Christoph Hellwig March 7, 2022, 3:56 p.m. UTC | #10
On Mon, Mar 07, 2022 at 03:29:35PM +0200, Jarkko Sakkinen wrote:
> So what would you suggest to sort out the issue? I'm happy to go with
> ioctl if nothing else is acceptable.

PLenty of drivers treat all mmaps as if MAP_POPULATE was specified,
typically by using (io_)remap_pfn_range.  If there any reason to only
optionally have the pre-fault semantics for sgx?  If not this should
be really simple.  And if we have a real need for it to be optional
we'll just need to find a sane way to pass that information to ->mmap.
Jarkko Sakkinen March 7, 2022, 3:58 p.m. UTC | #11
On Mon, Mar 07, 2022 at 07:56:53AM -0800, Christoph Hellwig wrote:
> On Mon, Mar 07, 2022 at 03:29:35PM +0200, Jarkko Sakkinen wrote:
> > So what would you suggest to sort out the issue? I'm happy to go with
> > ioctl if nothing else is acceptable.
> 
> PLenty of drivers treat all mmaps as if MAP_POPULATE was specified,
> typically by using (io_)remap_pfn_range.  If there any reason to only
> optionally have the pre-fault semantics for sgx?  If not this should
> be really simple.  And if we have a real need for it to be optional
> we'll just need to find a sane way to pass that information to ->mmap.

Dave, what if mmap() would just unconditionally EAUG after initialization?

It's an option, yes.

BR, Jarkko
David Laight March 7, 2022, 10:11 p.m. UTC | #12
From: Christoph Hellwig
> Sent: 07 March 2022 15:57
> 
> On Mon, Mar 07, 2022 at 03:29:35PM +0200, Jarkko Sakkinen wrote:
> > So what would you suggest to sort out the issue? I'm happy to go with
> > ioctl if nothing else is acceptable.
> 
> PLenty of drivers treat all mmaps as if MAP_POPULATE was specified,
> typically by using (io_)remap_pfn_range.  If there any reason to only
> optionally have the pre-fault semantics for sgx?  If not this should
> be really simple.  And if we have a real need for it to be optional
> we'll just need to find a sane way to pass that information to ->mmap.

Is there any space in vma->vm_flags ?

That would be better than an extra argument or function.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Jarkko Sakkinen March 8, 2022, 10:10 a.m. UTC | #13
On Mon, Mar 07, 2022 at 10:11:19PM +0000, David Laight wrote:
> From: Christoph Hellwig
> > Sent: 07 March 2022 15:57
> > 
> > On Mon, Mar 07, 2022 at 03:29:35PM +0200, Jarkko Sakkinen wrote:
> > > So what would you suggest to sort out the issue? I'm happy to go with
> > > ioctl if nothing else is acceptable.
> > 
> > PLenty of drivers treat all mmaps as if MAP_POPULATE was specified,
> > typically by using (io_)remap_pfn_range.  If there any reason to only
> > optionally have the pre-fault semantics for sgx?  If not this should
> > be really simple.  And if we have a real need for it to be optional
> > we'll just need to find a sane way to pass that information to ->mmap.
> 
> Is there any space in vma->vm_flags ?
> 
> That would be better than an extra argument or function.

It's very dense but I'll give a shot for callback route based on Dave's
comments in this thread. I.e. use it as filter inside __mm_populate() and
populate_vma_page_range().

For Enarx, which we are implementing being able to use MAP_POPULATE and get
the full range EAUG'd would be best way to optimize the performance of wasm
JIT (Enarx is a wasm run-time capable of running inside an SGX enclave, AMD
SEV-SNP VM etc.). More so than any predictor (ra_state, madvice etc.) inside
#PF handler, which have been suggested in this thread.

After some research on how we implement user space, I'd rather keep the #PF
handler working on a single page (EAUG a single page) and have either ioctl
or MAP_POPULATE to do the batch fill.

We can still "not trust the user space" i.e. the populate does not have to
guarantee to do the full length since the #PF handler will then fill the
holes. This was one concern in this thread but it is not hard to address.

BR, Jarkko