
[RFC,0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd

Message ID 20241105164549.154700-1-shivankg@amd.com (mailing list archive)


Shivank Garg Nov. 5, 2024, 4:45 p.m. UTC
This patch series introduces the fbind() syscall to add NUMA memory
policy support to KVM guest_memfd, allowing VMMs to configure memory
placement for guest memory. It addresses the current limitation that
guest_memfd allocations ignore NUMA policies, which can hurt the
performance of memory-locality-sensitive workloads.

Currently, while mbind() provides NUMA policy support for userspace
applications, it cannot be used with guest_memfd because the memory is not
mapped to userspace. This results in guest memory being allocated without
regard to host NUMA topology, even when specific policies and node
preferences are given on the QEMU command line.

The fbind() syscall is particularly useful for SEV-SNP guests, whose
private memory is backed by guest_memfd and never mapped into VMM
userspace, so mbind() cannot apply.

Following suggestions from LPC and review comments [1], I switched from an
IOCTL-based approach [2] to fbind() [3]. The fbind() approach is preferred
because it provides generic NUMA policy configuration that works with any
fd, rather than being tied to guest_memfd.

Testing:
QEMU tree: https://github.com/AMDESE/qemu/tree/guest_memfd_fbind_NUMA
Based on kernel tree: https://github.com/AMDESE/linux/tree/snp-host-latest
  (branch: snp-host-latest, commit: 85ef1ac)

Example command to run a SEV-SNP guest bound to NUMA Node 0 of the host:
$ qemu-system-x86_64 \
    -enable-kvm \
    ... \
    -machine memory-encryption=sev0,vmport=off \
    -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1 \
    -numa node,nodeid=0,memdev=ram0,cpus=0-15 \
    -object memory-backend-memfd,id=ram0,policy=bind,host-nodes=0,size=1024M,share=true,prealloc=false

[1]: https://lore.kernel.org/linux-mm/ZvEga7srKhympQBt@intel.com/
[2]: https://lore.kernel.org/linux-mm/20240919094438.10987-1-shivankg@amd.com
[3]: https://lore.kernel.org/kvm/ZOjpIL0SFH+E3Dj4@google.com/


Shivank Garg (3):
  Introduce fbind syscall
  KVM: guest_memfd: Pass file pointer instead of inode in guest_memfd
    APIs
  KVM: guest_memfd: Enforce NUMA mempolicy if available

Shivansh Dhiman (1):
  mm: Add mempolicy support to the filemap layer

 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/fs.h                     |  3 ++
 include/linux/mempolicy.h              |  3 ++
 include/linux/pagemap.h                | 40 ++++++++++++++++
 include/linux/syscalls.h               |  3 ++
 include/uapi/asm-generic/unistd.h      |  5 +-
 kernel/sys_ni.c                        |  1 +
 mm/Makefile                            |  2 +-
 mm/fbind.c                             | 49 +++++++++++++++++++
 mm/filemap.c                           | 30 ++++++++++--
 mm/mempolicy.c                         | 57 ++++++++++++++++++++++
 virt/kvm/guest_memfd.c                 | 66 +++++++++++++++++++++-----
 13 files changed, 242 insertions(+), 19 deletions(-)
 create mode 100644 mm/fbind.c

Comments

Matthew Wilcox (Oracle) Nov. 5, 2024, 6:55 p.m. UTC | #1
On Tue, Nov 05, 2024 at 04:45:45PM +0000, Shivank Garg wrote:
> This patch series introduces fbind() syscall to support NUMA memory
> policies for KVM guest_memfd, allowing VMMs to configure memory placement
> for guest memory. This addresses the current limitation where guest_memfd
> allocations ignore NUMA policies, potentially impacting performance of
> memory-locality-sensitive workloads.

Why does guest_memfd ignore numa policies?  The pagecache doesn't,
eg in vma_alloc_folio_noprof().
Shivank Garg Nov. 7, 2024, 8:54 a.m. UTC | #2
Hi Matthew,

On 11/6/2024 12:25 AM, Matthew Wilcox wrote:
> On Tue, Nov 05, 2024 at 04:45:45PM +0000, Shivank Garg wrote:
>> This patch series introduces fbind() syscall to support NUMA memory
>> policies for KVM guest_memfd, allowing VMMs to configure memory placement
>> for guest memory. This addresses the current limitation where guest_memfd
>> allocations ignore NUMA policies, potentially impacting performance of
>> memory-locality-sensitive workloads.
> 
> Why does guest_memfd ignore numa policies?  The pagecache doesn't,
> eg in vma_alloc_folio_noprof().
guest_memfd memory is not mapped through VMAs, so there is no VMA to store
the policy information in, and it can't use vma_alloc_folio_noprof(), which
fetches the mpol from the VMA.

The folio allocation path from guest_memfd typically looks like this...

kvm_gmem_get_folio
  filemap_grab_folio
    __filemap_get_folio
      filemap_alloc_folio
        __folio_alloc_node_noprof
          -> goes to the buddy allocator

Hence, I am trying to add a version of filemap_alloc_folio() that takes an mpol.
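Roughly, the helper I have in mind would look like this (an untested
sketch, not a buildable kernel snippet; the folio_alloc_mpol() name and
the exact signature here are assumptions based on this RFC):

```c
/*
 * Sketch only: an mpol-aware variant of filemap_alloc_folio().
 * If the caller supplies a policy, allocate through the
 * mempolicy-aware path; otherwise fall back to the default behavior.
 */
static inline struct folio *filemap_alloc_folio_mpol(gfp_t gfp,
		unsigned int order, struct mempolicy *mpol)
{
	if (mpol)
		return folio_alloc_mpol(gfp, order, mpol,
					NO_INTERLEAVE_INDEX, numa_node_id());
	return filemap_alloc_folio(gfp, order);
}
```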


Thanks,
Shivank
Matthew Wilcox (Oracle) Nov. 7, 2024, 3:10 p.m. UTC | #3
On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
> The folio allocation path from guest_memfd typically looks like this...
> 
> kvm_gmem_get_folio
>   filemap_grab_folio
>     __filemap_get_folio
>       filemap_alloc_folio
>         __folio_alloc_node_noprof
>           -> goes to the buddy allocator
> 
> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.

It only takes that path if cpuset_do_page_mem_spread() is true.  Is the
real problem that you're trying to solve that cpusets are being used
incorrectly?

Backing up, it seems like you want to make a change to the page cache,
you've had a long discussion with people who aren't the page cache
maintainer, and you all understand the pros and cons of everything,
and here you are dumping a solution on me without talking to me, even
though I was at Plumbers, you didn't find me to tell me I needed to go
to your talk.

So you haven't explained a damned thing to me, and I'm annoyed at you.
Do better.  Starting with your cover letter.
Shivank Garg Nov. 8, 2024, 9:21 a.m. UTC | #4
On 11/7/2024 8:40 PM, Matthew Wilcox wrote:
> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>> The folio allocation path from guest_memfd typically looks like this...
>>
>> kvm_gmem_get_folio
>>   filemap_grab_folio
>>     __filemap_get_folio
>>       filemap_alloc_folio
>>         __folio_alloc_node_noprof
>>           -> goes to the buddy allocator
>>
>> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
> 
> It only takes that path if cpuset_do_page_mem_spread() is true.  Is the
> real problem that you're trying to solve that cpusets are being used
> incorrectly?
> 
> Backing up, it seems like you want to make a change to the page cache,
> you've had a long discussion with people who aren't the page cache
> maintainer, and you all understand the pros and cons of everything,
> and here you are dumping a solution on me without talking to me, even
> though I was at Plumbers, you didn't find me to tell me I needed to go
> to your talk.
> 
> So you haven't explained a damned thing to me, and I'm annoyed at you.
> Do better.  Starting with your cover letter.

Hi Matthew,

I apologize for the misunderstanding and for not providing adequate context.

To clarify:
- You may recall this work from its earlier iteration as an
  IOCTL-based approach, where you provided valuable review comments [1].
- I was not physically present at LPC. The discussion happened through
  the mailing list [2] and lobby discussion with my colleagues who visited
  Vienna.
- Based on feedback, particularly regarding the suggestion to consider
  fbind() as a more generic solution, we shifted to the current approach.

I posted this as *RFC* specifically to gather feedback on the feasibility of
this approach and to ensure I'm heading in the right direction.

Would you be willing to help me understand:
1. What additional information would be helpful to you and other reviewers?
2. How can cpusets be used correctly to fix this? (Your point about
   cpuset_do_page_mem_spread() is interesting, and I'll investigate it
   more thoroughly.)

I'll work on improving the cover letter to better explain the problem space
and proposed solution.

Thank you for the valuable feedback.

[1] https://lore.kernel.org/linux-mm/ZuimLtrpv1dXczf5@casper.infradead.org
[2] https://lore.kernel.org/linux-mm/ZvEga7srKhympQBt@intel.com

Best regards,
Shivank
Paolo Bonzini Nov. 8, 2024, 5:31 p.m. UTC | #5
On 11/7/24 16:10, Matthew Wilcox wrote:
> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>> The folio allocation path from guest_memfd typically looks like this...
>>
>> kvm_gmem_get_folio
>>    filemap_grab_folio
>>      __filemap_get_folio
>>        filemap_alloc_folio
>>          __folio_alloc_node_noprof
>>            -> goes to the buddy allocator
>>
>> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
> 
> It only takes that path if cpuset_do_page_mem_spread() is true.  Is the
> real problem that you're trying to solve that cpusets are being used
> incorrectly?

If it's false it's not very different, it goes to alloc_pages_noprof(). 
Then it respects the process's policy, but the policy is not 
customizable without mucking with state that is global to the process.

Taking a step back: the problem is that a VM can be configured to have 
multiple guest-side NUMA nodes, each of which will pick memory from the 
right NUMA node in the host.  Without a per-file operation it's not 
possible to do this on guest_memfd.  The discussion was whether to use 
ioctl() or a new system call.  The discussion ended with the idea of 
posting a *proposal* asking for *comments* as to whether the system call 
would be useful in general beyond KVM.

Commenting on the system call itself I am not sure I like the 
file_operations entry, though I understand that it's the simplest way to 
implement this in an RFC series.  It's a bit surprising that fbind() is 
a total no-op for everything except KVM's guest_memfd.

Maybe whatever you pass to fbind() could be stored in the struct file *, 
and used as the default when creating VMAs; as if every mmap() was 
followed by an mbind(), except that it also does the right thing with 
MAP_POPULATE for example.  Or maybe that's a horrible idea?
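Concretely, something like this (a completely hypothetical sketch; there
is no f_mpolicy field in struct file upstream, and all names here are
invented):

```c
/*
 * Hypothetical: fbind() stores the policy on the struct file, e.g.
 *     file->f_mpolicy = new_policy;    // invented field
 * and at mmap() time the stored policy becomes the VMA default, as if
 * every mmap() of the file were followed by an mbind():
 */
static void apply_file_policy(struct file *file, struct vm_area_struct *vma)
{
	if (file->f_mpolicy && !vma->vm_policy)
		vma->vm_policy = mpol_dup(file->f_mpolicy);
}
```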

Adding linux-api to get input; original thread is at
https://lore.kernel.org/kvm/20241105164549.154700-1-shivankg@amd.com/.

Paolo

> Backing up, it seems like you want to make a change to the page cache,
> you've had a long discussion with people who aren't the page cache
> maintainer, and you all understand the pros and cons of everything,
> and here you are dumping a solution on me without talking to me, even
> though I was at Plumbers, you didn't find me to tell me I needed to go
> to your talk.
> 
> So you haven't explained a damned thing to me, and I'm annoyed at you.
> Do better.  Starting with your cover letter.
>
Vlastimil Babka Nov. 11, 2024, 11:02 a.m. UTC | #6
On 11/8/24 18:31, Paolo Bonzini wrote:
> On 11/7/24 16:10, Matthew Wilcox wrote:
>> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>>> The folio allocation path from guest_memfd typically looks like this...
>>>
>>> kvm_gmem_get_folio
>>>    filemap_grab_folio
>>>      __filemap_get_folio
>>>        filemap_alloc_folio
>>>          __folio_alloc_node_noprof
>>>            -> goes to the buddy allocator
>>>
>>> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
>> 
>> It only takes that path if cpuset_do_page_mem_spread() is true.  Is the
>> real problem that you're trying to solve that cpusets are being used
>> incorrectly?
> 
> If it's false it's not very different, it goes to alloc_pages_noprof(). 
> Then it respects the process's policy, but the policy is not 
> customizable without mucking with state that is global to the process.
> 
> Taking a step back: the problem is that a VM can be configured to have 
> multiple guest-side NUMA nodes, each of which will pick memory from the 
> right NUMA node in the host.  Without a per-file operation it's not 
> possible to do this on guest_memfd.  The discussion was whether to use 
> ioctl() or a new system call.  The discussion ended with the idea of 
> posting a *proposal* asking for *comments* as to whether the system call 
> would be useful in general beyond KVM.
> 
> Commenting on the system call itself I am not sure I like the 
> file_operations entry, though I understand that it's the simplest way to 
> implement this in an RFC series.  It's a bit surprising that fbind() is 
> a total no-op for everything except KVM's guest_memfd.
> 
> Maybe whatever you pass to fbind() could be stored in the struct file *, 
> and used as the default when creating VMAs; as if every mmap() was 
> followed by an mbind(), except that it also does the right thing with 
> MAP_POPULATE for example.  Or maybe that's a horrible idea?

mbind() manpage has this:

       The  specified  policy  will  be  ignored  for  any MAP_SHARED
mappings in the specified memory range.  Rather the pages will be allocated
according to the memory policy of the thread that caused the page to be
allocated. Again, this may not be the thread that called mbind().

So that seems like we're not very keen on having one user of a file set a
policy that would affect other users of the file?

Now the next paragraph of the manpage says that shmem is different, and
guest_memfd is more like shmem than a regular file.

My conclusion from that is that fbind() might be too broad and we don't want
this for actual filesystem-backed files? And if it's limited to guest_memfd,
it shouldn't be an fbind()?

> Adding linux-api to get input; original thread is at
> https://lore.kernel.org/kvm/20241105164549.154700-1-shivankg@amd.com/.
> 
> Paolo
> 
>> Backing up, it seems like you want to make a change to the page cache,
>> you've had a long discussion with people who aren't the page cache
>> maintainer, and you all understand the pros and cons of everything,
>> and here you are dumping a solution on me without talking to me, even
>> though I was at Plumbers, you didn't find me to tell me I needed to go
>> to your talk.
>> 
>> So you haven't explained a damned thing to me, and I'm annoyed at you.
>> Do better.  Starting with your cover letter.
>> 
> 
>
David Hildenbrand Nov. 11, 2024, 10:14 p.m. UTC | #7
On 11.11.24 12:02, Vlastimil Babka wrote:
> On 11/8/24 18:31, Paolo Bonzini wrote:
>> On 11/7/24 16:10, Matthew Wilcox wrote:
>>> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>>>> The folio allocation path from guest_memfd typically looks like this...
>>>>
>>>> kvm_gmem_get_folio
>>>>     filemap_grab_folio
>>>>       __filemap_get_folio
>>>>         filemap_alloc_folio
>>>>           __folio_alloc_node_noprof
>>>>             -> goes to the buddy allocator
>>>>
>>>> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
>>>
>>> It only takes that path if cpuset_do_page_mem_spread() is true.  Is the
>>> real problem that you're trying to solve that cpusets are being used
>>> incorrectly?
>>
>> If it's false it's not very different, it goes to alloc_pages_noprof().
>> Then it respects the process's policy, but the policy is not
>> customizable without mucking with state that is global to the process.
>>
>> Taking a step back: the problem is that a VM can be configured to have
>> multiple guest-side NUMA nodes, each of which will pick memory from the
>> right NUMA node in the host.  Without a per-file operation it's not
>> possible to do this on guest_memfd.  The discussion was whether to use
>> ioctl() or a new system call.  The discussion ended with the idea of
>> posting a *proposal* asking for *comments* as to whether the system call
>> would be useful in general beyond KVM.
>>
>> Commenting on the system call itself I am not sure I like the
>> file_operations entry, though I understand that it's the simplest way to
>> implement this in an RFC series.  It's a bit surprising that fbind() is
>> a total no-op for everything except KVM's guest_memfd.
>>
>> Maybe whatever you pass to fbind() could be stored in the struct file *,
>> and used as the default when creating VMAs; as if every mmap() was
>> followed by an mbind(), except that it also does the right thing with
>> MAP_POPULATE for example.  Or maybe that's a horrible idea?
> 
> mbind() manpage has this:
> 
>         The  specified  policy  will  be  ignored  for  any MAP_SHARED
> mappings in the specified memory range.  Rather the pages will be allocated
> according to the memory policy of the thread that caused the page to be
> allocated. Again, this may not be the thread that called mbind().

I recall discussing that a couple of times in the context of QEMU. I 
have some faint recollection that the manpage is a bit imprecise:

IIRC, hugetlb also ends up using the VMA policy for MAP_SHARED mappings 
during faults (huge_node()->get_vma_policy()) -- but in contrast to 
shmem, it doesn't end up becoming the "shared" policy for the file, used 
when accessed through other VMAs.

> 
> So that seems like we're not very keen on having one user of a file set a
> policy that would affect other users of the file?

For VMs in QEMU we really want to configure the policy once in the main 
process and have all other processes (e.g., vhost-user) not worry about 
that when they mmap() guest memory.

With shmem this works by "shared policy" design (below). For hugetlb, we 
rely on the fact that mbind()+MADV_POPULATE_WRITE allows us to 
preallocate NUMA-aware. So with hugetlb we really preallocate all guest 
RAM to guarantee the NUMA placement.

It would not be the worst idea to have a clean interface to configure 
file-range policies instead of having this weird shmem mbind() behavior 
and the hugetlb hack.

Having that said, other filesystem are rarely used for backing VMs, at 
least in combination with NUMA. So nobody really cared that much for now.

Maybe fbind() would primarily only be useful for in-memory filesystems 
(shmem/hugetlb/guest_memfd).

> 
> Now the next paragraph of the manpage says that shmem is different, and
> guest_memfd is more like shmem than a regular file.
> 
> My conclusion from that is that fbind() might be too broad and we don't want
> this for actual filesystem-backed files? And if it's limited to guest_memfd,
> it shouldn't be an fbind()?

I was just once again diving into how mbind() on shmem is handled. And 
in fact, mbind() will call vma->set_policy() to update the per 
file-range policy. I wonder why we didn't do the same for hugetlb ... 
but of course, hugetlb must be special in any possible way.

Not saying it's the best idea, but as we are talking about mmap support 
of guest_memfd (only allowing to fault in shared/faultable pages), one 
*could* look into implementing mbind()+vma->set_policy() for guest_memfd 
similar to how shmem handles it.

It would require a (temporary) dummy VMA in the worst case (all private 
memory).

It sounds a bit weird, though, to require a VMA to configure this. 
But at least it's similar to what shmem does ...