Message ID: 20220310140911.50924-5-chao.p.peng@linux.intel.com
State: New
Series: KVM: mm: fd-based approach for supporting KVM guest private memory
On Thu, Mar 10, 2022, Chao Peng wrote:
> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
> memory behaves like longterm pinned pages and thus should be accounted to
> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  mm/shmem.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 7b43e274c9a2..ae46fb96494b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
>  				   pgoff_t start, pgoff_t end)
>  {
> -#ifdef CONFIG_MEMFILE_NOTIFIER
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>
> +#ifdef CONFIG_MEMFILE_NOTIFIER
>  	start = max(start, folio->index);
>  	end = min(end, folio->index + folio_nr_pages(folio));
>
>  	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
>  #endif
> +
> +	if (info->xflags & SHM_F_INACCESSIBLE)
> +		atomic64_sub(end - start, &current->mm->pinned_vm);

As Vishal's to-be-posted selftest discovered, this is broken, as current->mm
may be NULL. Or it may be a completely different mm: AFAICT there's nothing
that prevents a different process from punching a hole in the shmem backing.

I don't see a sane way of tracking this in the backing store unless the inode
is associated with a single mm when it's created, and that opens up a giant
can of worms, e.g. what happens with the accounting if the creating process
goes away?

I think the correct approach is to not do the locking automatically for
SHM_F_INACCESSIBLE, and instead require userspace to do
shmctl(.., SHM_LOCK, ...) if userspace knows the consumers don't support
migrate/swap. That'd require wrapping migrate_page() and then wiring up
notifier hooks for migrate/swap, but IMO that's a good thing to get sorted
out sooner rather than later. KVM isn't planning on supporting migrate/swap
for TDX or SNP, but supporting at least migrate for a software-only
implementation a la pKVM should be relatively straightforward. On the
notifiee side, KVM can terminate the VM if it gets an unexpected
migrate/swap, e.g. so that TDX/SEV VMs don't die later with exceptions
and/or data corruption (pre-SNP SEV guests) in the guest.

Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the
swap, so maybe it's just the page migration path that needs to be updated?
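For concreteness, a minimal sketch of what such notifier hooks could look
like. The invalidate callback mirrors the posted series; the migrate callback
is an assumption added purely for illustration and is not part of any posted
patch:

#include <linux/list.h>
#include <linux/types.h>

struct memfile_notifier;

struct memfile_notifier_ops {
	/* as in the posted series: invalidate a range of the file */
	void (*invalidate)(struct memfile_notifier *notifier,
			   pgoff_t start, pgoff_t end);
	/*
	 * Hypothetical addition: give the consumer (e.g. KVM) a chance to
	 * reject or react to migration/swap of a range, so a TDX/SEV
	 * backend could terminate the VM instead of corrupting it later.
	 */
	int (*migrate)(struct memfile_notifier *notifier,
		       pgoff_t start, pgoff_t end);
};

struct memfile_notifier {
	struct list_head list;
	struct memfile_notifier_ops *ops;
};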
On Thu, Apr 7, 2022, at 9:05 AM, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
>> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
>> memory behaves like longterm pinned pages and thus should be accounted to
>> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
>>
>> [...]
>
> As Vishal's to-be-posted selftest discovered, this is broken, as
> current->mm may be NULL. Or it may be a completely different mm: AFAICT
> there's nothing that prevents a different process from punching a hole in
> the shmem backing.

How about just not charging the mm in the first place? There's precedent:
ramfs and hugetlbfs (at least sometimes; I've lost track of the current
status).

In any case, for an administrator to try to assemble the various rlimits
into a coherent policy is, and always has been, quite messy. ISTM cgroup
limits, which can actually add across processes usefully, are much better.

So, aside from the fact that these fds aren't in a filesystem and are thus
available by default, I'm not convinced that this accounting is useful or
necessary.

Maybe we could just have some switch that is required to enable creation of
private memory in the first place, and anyone who flips that switch without
configuring cgroups is subject to DoS.
On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
> > memory behaves like longterm pinned pages and thus should be accounted to
> > mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
> >
> > [...]
>
> As Vishal's to-be-posted selftest discovered, this is broken, as current->mm
> may be NULL. Or it may be a completely different mm: AFAICT there's nothing
> that prevents a different process from punching a hole in the shmem backing.
>
> I don't see a sane way of tracking this in the backing store unless the
> inode is associated with a single mm when it's created, and that opens up a
> giant can of worms, e.g. what happens with the accounting if the creating
> process goes away?

Yes, I realized this.

> I think the correct approach is to not do the locking automatically for
> SHM_F_INACCESSIBLE, and instead require userspace to do
> shmctl(.., SHM_LOCK, ...) if userspace knows the consumers don't support
> migrate/swap. That'd require wrapping migrate_page() and then wiring up
> notifier hooks for migrate/swap, but IMO that's a good thing to get sorted
> out sooner rather than later. KVM isn't planning on supporting migrate/swap
> for TDX or SNP, but supporting at least migrate for a software-only
> implementation a la pKVM should be relatively straightforward. On the
> notifiee side, KVM can terminate the VM if it gets an unexpected
> migrate/swap, e.g. so that TDX/SEV VMs don't die later with exceptions
> and/or data corruption (pre-SNP SEV guests) in the guest.

SHM_LOCK sounds like a good match.

Thanks,
Chao

> Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the
> swap, so maybe it's just the page migration path that needs to be updated?
On Thu, Apr 07, 2022, Andy Lutomirski wrote:
> On Thu, Apr 7, 2022, at 9:05 AM, Sean Christopherson wrote:
> > On Thu, Mar 10, 2022, Chao Peng wrote:
> >> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
> >> memory behaves like longterm pinned pages and thus should be accounted to
> >> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
> >>
> >> [...]
> >
> > As Vishal's to-be-posted selftest discovered, this is broken, as
> > current->mm may be NULL. Or it may be a completely different mm: AFAICT
> > there's nothing that prevents a different process from punching a hole in
> > the shmem backing.
>
> How about just not charging the mm in the first place? There's precedent:
> ramfs and hugetlbfs (at least sometimes; I've lost track of the current
> status).
>
> In any case, for an administrator to try to assemble the various rlimits
> into a coherent policy is, and always has been, quite messy. ISTM cgroup
> limits, which can actually add across processes usefully, are much better.
>
> So, aside from the fact that these fds aren't in a filesystem and are thus
> available by default, I'm not convinced that this accounting is useful or
> necessary.
>
> Maybe we could just have some switch that is required to enable creation of
> private memory in the first place, and anyone who flips that switch without
> configuring cgroups is subject to DoS.

I personally have no objection to that, and I'm 99% certain Google doesn't
rely on RLIMIT_MEMLOCK.
On 08.04.22 19:56, Sean Christopherson wrote:
> On Thu, Apr 07, 2022, Andy Lutomirski wrote:
>>
>> [...]
>>
>> Maybe we could just have some switch that is required to enable creation
>> of private memory in the first place, and anyone who flips that switch
>> without configuring cgroups is subject to DoS.
>
> I personally have no objection to that, and I'm 99% certain Google doesn't
> rely on RLIMIT_MEMLOCK.

It's unacceptable for distributions to have random unprivileged users be
able to allocate an unlimited amount of unmovable memory. And any kind of
these "switches" won't help a thing, because the distribution will have to
enable them either way.

I raised in the past that accounting might be challenging, so it's no
surprise that something popped up now.

RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the past
already with secretmem, it's not 100% that good of a fit (unmovable is worse
than mlocked). But it gets the job done for now at least.

So I'm open to alternatives for limiting the amount of unmovable memory we
might allocate for user space, and then we could convert secretmem as well.
Random switches are not an option.
On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the
> swap, so maybe it's just the page migration path that needs to be updated?

My early version prevented migration with -ENOTSUPP for
address_space_operations::migratepage().

What's wrong with that approach?
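A rough sketch of that approach, assuming the series' SHM_F_INACCESSIBLE
flag; shmem_migrate_page() is an invented wrapper name, since shmem otherwise
wires .migratepage to migrate_page() directly:

#include <linux/migrate.h>
#include <linux/shmem_fs.h>

static int shmem_migrate_page(struct address_space *mapping,
			      struct page *newpage, struct page *page,
			      enum migrate_mode mode)
{
	struct shmem_inode_info *info = SHMEM_I(mapping->host);

	/* Inaccessible shmem must not be moved behind the consumer's back. */
	if (info->xflags & SHM_F_INACCESSIBLE)
		return -ENOTSUPP;

	return migrate_page(mapping, newpage, page, mode);
}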
On Fri, Apr 08, 2022 at 09:02:54PM +0800, Chao Peng wrote:
> > I think the correct approach is to not do the locking automatically for
> > SHM_F_INACCESSIBLE, and instead require userspace to do
> > shmctl(.., SHM_LOCK, ...) if userspace knows the consumers don't support
> > migrate/swap.
> >
> > [...]
>
> SHM_LOCK sounds like a good match.

Emm, no. shmctl(2) and SHM_LOCK are a SysV IPC thing. I don't see how they
fit here.
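To make the mismatch concrete: SHM_LOCK operates on SysV segment ids from
shmget(2), not on file descriptors, so there is nothing a memfd could be
passed to. A minimal userspace example of what SHM_LOCK actually applies to:

#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

static int lock_sysv_segment(size_t len)
{
	/* SHM_LOCK takes a SysV IPC id from shmget(), not an fd. */
	int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);

	if (id < 0)
		return -1;
	return shmctl(id, SHM_LOCK, NULL);
}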
On Mon, 11 Apr 2022, Kirill A. Shutemov wrote:
> On Fri, Apr 08, 2022 at 09:02:54PM +0800, Chao Peng wrote:
> > > I think the correct approach is to not do the locking automatically for
> > > SHM_F_INACCESSIBLE, and instead require userspace to do
> > > shmctl(.., SHM_LOCK, ...) if userspace knows the consumers don't
> > > support migrate/swap.
> > >
> > > [...]
> >
> > SHM_LOCK sounds like a good match.
>
> Emm, no. shmctl(2) and SHM_LOCK are a SysV IPC thing. I don't see how they
> fit here.

I am still struggling to formulate a constructive response on
MFD_INACCESSIBLE in general: but before doing so, let me jump in here to say
that I'm firmly on the side of SHM_LOCK being the right model, though
admittedly not through userspace calling shmctl(2). Please refer to our last
year's posting "[PATCH 10/16] tmpfs: fcntl(fd, F_MEM_LOCK) to memlock a
tmpfs file" for the example of how Shakeel did it then (though only a small
part of that would be needed for this case):
https://lore.kernel.org/linux-mm/54e03798-d836-ae64-f41-4a1d46bc115b@google.com/

And until such time as swapping is enabled, this memlock accounting would be
necessarily entailed by "MFD_INACCESSIBLE", or however that turns out to be
implemented: not something that we could trust userspace to call separately.

Hugh
On Mon, Apr 11, 2022 at 06:32:33PM +0300, Kirill A. Shutemov wrote:
> On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> > Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting
> > the swap, so maybe it's just the page migration path that needs to be
> > updated?
>
> My early version prevented migration with -ENOTSUPP for
> address_space_operations::migratepage().
>
> What's wrong with that approach?

I previously thought migratepage would not be called since we already marked
the pages as UNMOVABLE; is that not correct?

Thanks,
Chao
On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:

> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the past
> already with secretmem, it's not 100% that good of a fit (unmovable is
> worse than mlocked). But it gets the job done for now at least.

No, it doesn't. There are too many different interpretations of how MEMLOCK
is supposed to work, e.g.:

VFIO accounts per-process, so hostile users can just fork to go past it.

RDMA is per-process but uses a different counter, so you can double up.

iouring is per-user and uses a 3rd counter, so it can triple up on the above
two.

> So I'm open to alternatives for limiting the amount of unmovable memory we
> might allocate for user space, and then we could convert secretmem as well.

I think it has to be cgroup based considering where we are now :\

Jason
On Tue, Apr 12, 2022 at 09:39:25PM +0800, Chao Peng wrote:
> On Mon, Apr 11, 2022 at 06:32:33PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Apr 07, 2022 at 04:05:36PM +0000, Sean Christopherson wrote:
> > > Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting
> > > the swap, so maybe it's just the page migration path that needs to be
> > > updated?
> >
> > My early version prevented migration with -ENOTSUPP for
> > address_space_operations::migratepage().
> >
> > What's wrong with that approach?
>
> I previously thought migratepage would not be called since we already
> marked the pages as UNMOVABLE; is that not correct?

Do you mean missing __GFP_MOVABLE? I can be wrong, but I don't see that it
directly affects whether the page is migratable. It is a hint to the page
allocator to group unmovable pages into separate page blocks and improve
availability of higher-order pages this way. The page allocator tries to
allocate unmovable pages from page blocks that already have unmovable pages.
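For illustration (not code from the series): dropping __GFP_MOVABLE from a
mapping's gfp mask only changes pageblock placement; it does not by itself
stop migrate_pages() from moving the page. The helper name here is invented:

#include <linux/pagemap.h>

/*
 * Illustrative sketch: allocate a shmem mapping's pages from unmovable
 * pageblocks. GFP_HIGHUSER_MOVABLE & ~__GFP_MOVABLE is GFP_HIGHUSER;
 * this is a placement hint to the page allocator, nothing more.
 */
static void shmem_mark_unmovable(struct inode *inode)
{
	mapping_set_gfp_mask(inode->i_mapping,
			     GFP_HIGHUSER_MOVABLE & ~__GFP_MOVABLE);
}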
On Tue, Apr 12, 2022, at 7:36 AM, Jason Gunthorpe wrote:
> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
>
>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the past
>> already with secretmem, it's not 100% that good of a fit (unmovable is
>> worse than mlocked). But it gets the job done for now at least.
>
> No, it doesn't. There are too many different interpretations of how MEMLOCK
> is supposed to work, e.g.:
>
> VFIO accounts per-process, so hostile users can just fork to go past it.
>
> RDMA is per-process but uses a different counter, so you can double up.
>
> iouring is per-user and uses a 3rd counter, so it can triple up on the
> above two.
>
>> So I'm open to alternatives for limiting the amount of unmovable memory we
>> might allocate for user space, and then we could convert secretmem as well.
>
> I think it has to be cgroup based considering where we are now :\

So this is another situation where the actual backend (TDX, SEV, pKVM, pure
software) makes a difference -- depending on exactly what backend we're
using, the memory may not be unmovable. It might even be swappable (in the
potentially distant future).

Anyway, here's a concrete proposal, with a bit of handwaving:

We add new cgroup limits:

 memory.unmoveable
 memory.locked

These can be set to an actual number or they can be set to the special value
ROOT_CAP. If they're set to ROOT_CAP, then anyone in the cgroup with
capable(CAP_SYS_RESOURCE) (i.e. the global capability) can allocate unmovable
or locked memory with this (and potentially other) new APIs. If it's 0, then
they can't. If it's another value, then the memory can be allocated, charged
to the cgroup, up to the limit, with no particular capability needed.

The default at boot is ROOT_CAP. Anyone who wants to configure it differently
is free to do so. This avoids introducing a DoS, makes it easy to run tests
without configuring cgroups, and lets serious users set up their cgroups.

Nothing is charged per mm.

To make this fully sensible, we need to know what the backend is for the
private memory before allocating any, so that we can charge it accordingly.
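A loose sketch of what that charging could look like; every identifier below
(the unmovable counter, the ROOT_CAP sentinel, the helper name) is invented
for illustration, as no such cgroup interface exists:

#include <linux/capability.h>
#include <linux/page_counter.h>

#define MEMCG_LIMIT_ROOT_CAP	(~0UL)	/* hypothetical sentinel value */

/* Hypothetical per-cgroup state for the proposed memory.unmoveable. */
struct memcg_unmovable_state {
	struct page_counter counter;
	unsigned long limit;
};

static int charge_unmovable(struct memcg_unmovable_state *state,
			    unsigned long npages)
{
	struct page_counter *fail;

	/* ROOT_CAP: no accounting, but CAP_SYS_RESOURCE is required. */
	if (READ_ONCE(state->limit) == MEMCG_LIMIT_ROOT_CAP)
		return capable(CAP_SYS_RESOURCE) ? 0 : -EPERM;

	/* Otherwise charge against the configured limit. */
	if (page_counter_try_charge(&state->counter, npages, &fail))
		return 0;

	return -ENOMEM;
}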
On Tue, Apr 12, 2022 at 10:28:21PM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 12, 2022 at 09:39:25PM +0800, Chao Peng wrote:
> > On Mon, Apr 11, 2022 at 06:32:33PM +0300, Kirill A. Shutemov wrote:
> > > My early version prevented migration with -ENOTSUPP for
> > > address_space_operations::migratepage().
> > >
> > > What's wrong with that approach?
> >
> > I previously thought migratepage would not be called since we already
> > marked the pages as UNMOVABLE; is that not correct?
>
> Do you mean missing __GFP_MOVABLE?

Yes.

> I can be wrong, but I don't see that it directly affects whether the page
> is migratable. It is a hint to the page allocator to group unmovable pages
> into separate page blocks and improve availability of higher-order pages
> this way. The page allocator tries to allocate unmovable pages from page
> blocks that already have unmovable pages.

OK, thanks.

Chao
On 12.04.22 16:36, Jason Gunthorpe wrote:
> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
>
>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the past
>> already with secretmem, it's not 100% that good of a fit (unmovable is
>> worse than mlocked). But it gets the job done for now at least.
>
> No, it doesn't. There are too many different interpretations of how MEMLOCK
> is supposed to work, e.g.:
>
> VFIO accounts per-process, so hostile users can just fork to go past it.
>
> RDMA is per-process but uses a different counter, so you can double up.
>
> iouring is per-user and uses a 3rd counter, so it can triple up on the
> above two.

Thanks for that summary, very helpful.

>> So I'm open to alternatives for limiting the amount of unmovable memory we
>> might allocate for user space, and then we could convert secretmem as well.
>
> I think it has to be cgroup based considering where we are now :\

Most probably. I think the important lessons we learned are that

* mlocked != unmovable.
* RLIMIT_MEMLOCK should most probably never have been abused for unmovable
  memory (especially long-term pinning).
> So this is another situation where the actual backend (TDX, SEV, pKVM,
> pure software) makes a difference -- depending on exactly what backend
> we're using, the memory may not be unmovable. It might even be swappable
> (in the potentially distant future).

Right. And on a system without swap we don't particularly care about mlock,
but we might (in most cases) care about fragmentation with unmovable memory.

> Anyway, here's a concrete proposal, with a bit of handwaving:

Thanks for investing some brainpower.

> We add new cgroup limits:
>
>  memory.unmoveable
>  memory.locked
>
> These can be set to an actual number or they can be set to the special
> value ROOT_CAP. If they're set to ROOT_CAP, then anyone in the cgroup with
> capable(CAP_SYS_RESOURCE) (i.e. the global capability) can allocate
> unmovable or locked memory with this (and potentially other) new APIs. If
> it's 0, then they can't. If it's another value, then the memory can be
> allocated, charged to the cgroup, up to the limit, with no particular
> capability needed.
>
> The default at boot is ROOT_CAP. Anyone who wants to configure it
> differently is free to do so. This avoids introducing a DoS, makes it easy
> to run tests without configuring cgroups, and lets serious users set up
> their cgroups.

I wonder what the implications are for existing user space.

Assume we want to move page pinning (rdma, vfio, io_uring, ...) to the new
model. How can we be sure that

a) we don't break existing user space, and
b) we don't open the doors unnoticed for the admin to go crazy on unmovable
   memory?

Any ideas?

> Nothing is charged per mm.
>
> To make this fully sensible, we need to know what the backend is for the
> private memory before allocating any, so that we can charge it accordingly.

Right, the support for migration and/or swap defines how to account.
On Wed, Apr 13, 2022 at 06:24:56PM +0200, David Hildenbrand wrote:
> On 12.04.22 16:36, Jason Gunthorpe wrote:
> > On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
> >
> >> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the
> >> past already with secretmem, it's not 100% that good of a fit (unmovable
> >> is worse than mlocked). But it gets the job done for now at least.
> >
> > No, it doesn't. There are too many different interpretations of how
> > MEMLOCK is supposed to work, e.g.:
> >
> > VFIO accounts per-process, so hostile users can just fork to go past it.
> >
> > RDMA is per-process but uses a different counter, so you can double up.
> >
> > iouring is per-user and uses a 3rd counter, so it can triple up on the
> > above two.
>
> Thanks for that summary, very helpful.

I kicked off a big discussion when I suggested changing vfio to use the same
scheme as io_uring.

We may still end up trying it, but the major concern is that libvirt sets
RLIMIT_MEMLOCK, and if we touch anything here (including fixing RDMA, or
anything really) it becomes a uAPI break for libvirt..

> Most probably. I think the important lessons we learned are that
>
> * mlocked != unmovable.
> * RLIMIT_MEMLOCK should most probably never have been abused for unmovable
>   memory (especially long-term pinning).

The trouble is I'm not sure how anything can correctly/meaningfully set a
limit.

Consider qemu, where we might have 3 different things all pinning the same
page (rdma, iouring, vfio): should the cgroup give 3x the limit? What use is
that really?

IMHO there are only two meaningful scenarios: either you are unpriv and
limited to a very small number for your user/cgroup, or you are priv and you
can do whatever you want.

The idea that we can fine tune this to exactly the right amount for a
workload does not seem realistic and ends up exporting internal kernel
decisions into a uAPI..

Jason
On 13.04.22 19:52, Jason Gunthorpe wrote:
> On Wed, Apr 13, 2022 at 06:24:56PM +0200, David Hildenbrand wrote:
>
> [...]
>
> I kicked off a big discussion when I suggested changing vfio to use the
> same scheme as io_uring.
>
> We may still end up trying it, but the major concern is that libvirt sets
> RLIMIT_MEMLOCK, and if we touch anything here (including fixing RDMA, or
> anything really) it becomes a uAPI break for libvirt..

Okay, so we have to introduce a second mechanism, not use RLIMIT_MEMLOCK for
new unmovable memory, and then eventually phase out RLIMIT_MEMLOCK usage for
existing unmovable memory consumers (which, as you say, will be difficult).

> The trouble is I'm not sure how anything can correctly/meaningfully set a
> limit.
>
> Consider qemu, where we might have 3 different things all pinning the same
> page (rdma, iouring, vfio): should the cgroup give 3x the limit? What use
> is that really?

I think you're tackling a related problem: we double-account
unmovable/mlocked memory because there is no way to track that a page is
already pinned by the same user/cgroup/whatsoever. Not easy to solve.

The problem also becomes interesting if iouring with fixed buffers doesn't
work on guest RAM, but on some other QEMU buffers.

> IMHO there are only two meaningful scenarios: either you are unpriv and
> limited to a very small number for your user/cgroup, or you are priv and
> you can do whatever you want.
>
> The idea that we can fine tune this to exactly the right amount for a
> workload does not seem realistic and ends up exporting internal kernel
> decisions into a uAPI..

IMHO, there are three use cases:

* An app that conditionally uses selected mechanisms that end up requiring
  unmovable, long-term allocations: secretmem, iouring, rdma. We want some
  sane, small default. Apps have a backup path in case any such mechanism
  fails because we're out of allowed unmovable resources.

* An app that relies on selected mechanisms that end up requiring unmovable,
  long-term allocations, e.g., vfio with known memory consumption, such as
  the VM size. It's fairly easy to come up with the right value.

* An app that relies on multiple mechanisms that end up requiring unmovable,
  long-term allocations: QEMU with rdma, iouring, vfio, ... I agree that
  coming up with something good is problematic.

Then, there are privileged/unprivileged apps. There might be admins that
just don't care. There might be admins that even want to set some limit
instead of configuring "unlimited" for QEMU.

Long story short, it should be an admin choice what to configure, especially:

* What the default is for random apps
* What the maximum is for selected apps
* Which apps don't have a maximum
diff --git a/mm/shmem.c b/mm/shmem.c
index 7b43e274c9a2..ae46fb96494b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
 static void notify_invalidate_page(struct inode *inode, struct folio *folio,
 				   pgoff_t start, pgoff_t end)
 {
-#ifdef CONFIG_MEMFILE_NOTIFIER
 	struct shmem_inode_info *info = SHMEM_I(inode);
 
+#ifdef CONFIG_MEMFILE_NOTIFIER
 	start = max(start, folio->index);
 	end = min(end, folio->index + folio_nr_pages(folio));
 
 	memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
 #endif
+
+	if (info->xflags & SHM_F_INACCESSIBLE)
+		atomic64_sub(end - start, &current->mm->pinned_vm);
 }
 
 /*
@@ -2680,6 +2683,20 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+static bool memlock_limited(unsigned long npages)
+{
+	unsigned long lock_limit;
+	unsigned long pinned;
+
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	pinned = atomic64_add_return(npages, &current->mm->pinned_vm);
+	if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
+		atomic64_sub(npages, &current->mm->pinned_vm);
+		return true;
+	}
+	return false;
+}
+
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			    loff_t len)
 {
@@ -2753,6 +2770,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		goto out;
 	}
 
+	if ((info->xflags & SHM_F_INACCESSIBLE) &&
+	    memlock_limited(end - start)) {
+		error = -ENOMEM;
+		goto out;
+	}
+
 	shmem_falloc.waitq = NULL;
 	shmem_falloc.start = start;
 	shmem_falloc.next = start;
Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
memory behaves like longterm pinned pages and thus should be accounted to
mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 mm/shmem.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)