[v8,2/9] mmap: make mlock_future_check() global

Message ID	20201110151444.20662-3-rppt@kernel.org (mailing list archive)
State	New
Headers
Return-Path: <linux-kselftest-owner@kernel.org>
From: Mike Rapoport <rppt@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
 Andy Lutomirski <luto@kernel.org>, Arnd Bergmann <arnd@arndb.de>,
 Borislav Petkov <bp@alien8.de>, Catalin Marinas <catalin.marinas@arm.com>,
 Christopher Lameter <cl@linux.com>, Dan Williams <dan.j.williams@intel.com>,
 Dave Hansen <dave.hansen@linux.intel.com>,
 David Hildenbrand <david@redhat.com>,
 Elena Reshetova <elena.reshetova@intel.com>, "H. Peter Anvin" <hpa@zytor.com>,
 Ingo Molnar <mingo@redhat.com>, James Bottomley <jejb@linux.ibm.com>,
 "Kirill A. Shutemov" <kirill@shutemov.name>,
 Matthew Wilcox <willy@infradead.org>, Mark Rutland <mark.rutland@arm.com>,
 Mike Rapoport <rppt@linux.ibm.com>, Mike Rapoport <rppt@kernel.org>,
 Michael Kerrisk <mtk.manpages@gmail.com>,
 Palmer Dabbelt <palmer@dabbelt.com>,
 Paul Walmsley <paul.walmsley@sifive.com>,
 Peter Zijlstra <peterz@infradead.org>,
 Rick Edgecombe <rick.p.edgecombe@intel.com>, Shuah Khan <shuah@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Tycho Andersen <tycho@tycho.ws>,
 Will Deacon <will@kernel.org>, linux-api@vger.kernel.org,
 linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
 linux-nvdimm@lists.01.org, linux-riscv@lists.infradead.org, x86@kernel.org
Subject: [PATCH v8 2/9] mmap: make mlock_future_check() global
Date: Tue, 10 Nov 2020 17:14:37 +0200
Message-Id: <20201110151444.20662-3-rppt@kernel.org>
In-Reply-To: <20201110151444.20662-1-rppt@kernel.org>
References: <20201110151444.20662-1-rppt@kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk
Series	mm: introduce memfd_secret system call to create "secret" memory areas
[v8,0/9] mm: introduce memfd_secret system call to create "secret" memory areas
[v8,1/9] mm: add definition of PMD_PAGE_ORDER
[v8,2/9] mmap: make mlock_future_check() global
[v8,3/9] set_memory: allow set_direct_map_*_noflush() for multiple pages
[v8,4/9] mm: introduce memfd_secret system call to create "secret" memory areas
[v8,5/9] secretmem: use PMD-size pages to amortize direct map fragmentation
[v8,6/9] secretmem: add memcg accounting
[v8,7/9] PM: hibernate: disable when there are active secretmem users
[v8,8/9] arch, mm: wire up memfd_secret system call were relevant
[v8,9/9] secretmem: test: add basic selftest for memfd_secret(2)

Mike Rapoport Nov. 10, 2020, 3:14 p.m. UTC

From: Mike Rapoport <rppt@linux.ibm.com>

Make mlock_future_check() visible outside mm/mmap.c. It will be used by
the upcoming secret memory implementation.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 mm/internal.h | 3 +++
 mm/mmap.c     | 5 ++---
 2 files changed, 5 insertions(+), 3 deletions(-)
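
For readers without the tree at hand, the check being exported can be
modelled in userspace roughly as follows. This is a sketch, not the kernel
code: the function body is not part of this diff, so the logic is an
assumption based on the "locked" and "lock_limit" locals and on
RLIMIT_MEMLOCK semantics. struct model_mm, MODEL_PAGE_SHIFT,
MODEL_VM_LOCKED and model_mlock_future_check() are hypothetical names
introduced only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12            /* assumption: 4 KiB pages */
#define MODEL_VM_LOCKED  0x00002000UL  /* stand-in for the kernel's VM_LOCKED */

/* Hypothetical stand-in for the few pieces of state the check needs. */
struct model_mm {
	unsigned long locked_vm;     /* pages already accounted as locked */
	unsigned long memlock_limit; /* RLIMIT_MEMLOCK, in bytes */
	bool can_ipc_lock;           /* models capable(CAP_IPC_LOCK) */
};

/*
 * Returns 0 if locking "len" more bytes stays within the limit (or the
 * mapping is not locked at all), -EAGAIN otherwise.
 */
int model_mlock_future_check(const struct model_mm *mm, unsigned long flags,
			     unsigned long len)
{
	unsigned long locked, lock_limit;

	assert(mm != NULL);

	/* Only mappings that will be locked are charged. */
	if (!(flags & MODEL_VM_LOCKED))
		return 0;

	/* Compare in pages: new request plus what is already locked. */
	locked = (len >> MODEL_PAGE_SHIFT) + mm->locked_vm;
	lock_limit = mm->memlock_limit >> MODEL_PAGE_SHIFT;

	if (locked > lock_limit && !mm->can_ipc_lock)
		return -EAGAIN;
	return 0;
}
```

With 10 pages already locked and a 16-page limit, a 6-page request passes
and a 7-page request is refused unless the caller is privileged.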

David Hildenbrand Nov. 10, 2020, 5:17 p.m. UTC | #1

On 10.11.20 16:14, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> It will be used by the upcoming secret memory implementation.
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>   mm/internal.h | 3 +++
>   mm/mmap.c     | 5 ++---
>   2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index c43ccdddb0f6..ae146a260b14 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
>   extern void mlock_vma_page(struct page *page);
>   extern unsigned int munlock_vma_page(struct page *page);
>   
> +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
> +			      unsigned long len);
> +
>   /*
>    * Clear the page's PageMlocked().  This can be useful in a situation where
>    * we want to unconditionally remove a page from the pagecache -- e.g.,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 61f72b09d990..c481f088bd50 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1348,9 +1348,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
>   	return hint;
>   }
>   
> -static inline int mlock_future_check(struct mm_struct *mm,
> -				     unsigned long flags,
> -				     unsigned long len)
> +int mlock_future_check(struct mm_struct *mm, unsigned long flags,
> +		       unsigned long len)
>   {
>   	unsigned long locked, lock_limit;
>   
> 

So, an interesting question is if you actually want to charge secretmem 
pages against mlock now, or if you want a dedicated secretmem cgroup 
controller instead?

Mike Rapoport Nov. 10, 2020, 6:06 p.m. UTC | #2

On Tue, Nov 10, 2020 at 06:17:26PM +0100, David Hildenbrand wrote:
> On 10.11.20 16:14, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > It will be used by the upcoming secret memory implementation.
> > 
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> >   mm/internal.h | 3 +++
> >   mm/mmap.c     | 5 ++---
> >   2 files changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/internal.h b/mm/internal.h
> > index c43ccdddb0f6..ae146a260b14 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
> >   extern void mlock_vma_page(struct page *page);
> >   extern unsigned int munlock_vma_page(struct page *page);
> > +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
> > +			      unsigned long len);
> > +
> >   /*
> >    * Clear the page's PageMlocked().  This can be useful in a situation where
> >    * we want to unconditionally remove a page from the pagecache -- e.g.,
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 61f72b09d990..c481f088bd50 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1348,9 +1348,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
> >   	return hint;
> >   }
> > -static inline int mlock_future_check(struct mm_struct *mm,
> > -				     unsigned long flags,
> > -				     unsigned long len)
> > +int mlock_future_check(struct mm_struct *mm, unsigned long flags,
> > +		       unsigned long len)
> >   {
> >   	unsigned long locked, lock_limit;
> > 
> 
> So, an interesting question is if you actually want to charge secretmem
> pages against mlock now, or if you want a dedicated secretmem cgroup
> controller instead?

Well, with the current implementation there are three knobs an
administrator can use to limit secretmem usage: mlock, memcg and a
kernel parameter.

The kernel parameter puts a global upper limit on secretmem usage,
memcg accounts all secretmem allocations, including the unused memory in
the large-page cache, and mlock allows a per-task limit for secretmem
mappings, well, like mlock does.

I didn't consider a dedicated cgroup, as it seems we already have enough
existing knobs and a new one would be unnecessary.

> -- 
> Thanks,
> 
> David / dhildenb
>

David Hildenbrand Nov. 12, 2020, 4:22 p.m. UTC | #3

On 10.11.20 19:06, Mike Rapoport wrote:
> On Tue, Nov 10, 2020 at 06:17:26PM +0100, David Hildenbrand wrote:
>> On 10.11.20 16:14, Mike Rapoport wrote:
>>> From: Mike Rapoport <rppt@linux.ibm.com>
>>>
>>> It will be used by the upcoming secret memory implementation.
>>>
>>> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
>>> ---
>>>    mm/internal.h | 3 +++
>>>    mm/mmap.c     | 5 ++---
>>>    2 files changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index c43ccdddb0f6..ae146a260b14 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
>>>    extern void mlock_vma_page(struct page *page);
>>>    extern unsigned int munlock_vma_page(struct page *page);
>>> +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
>>> +			      unsigned long len);
>>> +
>>>    /*
>>>     * Clear the page's PageMlocked().  This can be useful in a situation where
>>>     * we want to unconditionally remove a page from the pagecache -- e.g.,
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index 61f72b09d990..c481f088bd50 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -1348,9 +1348,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
>>>    	return hint;
>>>    }
>>> -static inline int mlock_future_check(struct mm_struct *mm,
>>> -				     unsigned long flags,
>>> -				     unsigned long len)
>>> +int mlock_future_check(struct mm_struct *mm, unsigned long flags,
>>> +		       unsigned long len)
>>>    {
>>>    	unsigned long locked, lock_limit;
>>>
>>
>> So, an interesting question is if you actually want to charge secretmem
>> pages against mlock now, or if you want a dedicated secretmem cgroup
>> controller instead?
> 
> Well, with the current implementation there are three limits an
> administrator can use to control secretmem limits: mlock, memcg and
> kernel parameter.
> 
> The kernel parameter puts a global upper limit for secretmem usage,
> memcg accounts all secretmem allocations, including the unused memory in
> large pages caching and mlock allows per task limit for secretmem
> mappings, well, like mlock does.
> 
> I didn't consider a dedicated cgroup, as it seems we already have enough
> existing knobs and a new one would be unnecessary.

To me it feels like the mlock() limit is a wrong fit for secretmem. But 
maybe there are other cases of using the mlock() limit without actually 
doing mlock() that I am not aware of (most probably :) )?

I mean, my concern is not earth shattering, this can be reworked later. 
As I said, it just feels wrong.

Mike Rapoport Nov. 12, 2020, 7:08 p.m. UTC | #4

On Thu, Nov 12, 2020 at 05:22:00PM +0100, David Hildenbrand wrote:
> On 10.11.20 19:06, Mike Rapoport wrote:
> > On Tue, Nov 10, 2020 at 06:17:26PM +0100, David Hildenbrand wrote:
> > > On 10.11.20 16:14, Mike Rapoport wrote:
> > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > > 
> > > > It will be used by the upcoming secret memory implementation.
> > > > 
> > > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > > > ---
> > > >    mm/internal.h | 3 +++
> > > >    mm/mmap.c     | 5 ++---
> > > >    2 files changed, 5 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > index c43ccdddb0f6..ae146a260b14 100644
> > > > --- a/mm/internal.h
> > > > +++ b/mm/internal.h
> > > > @@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
> > > >    extern void mlock_vma_page(struct page *page);
> > > >    extern unsigned int munlock_vma_page(struct page *page);
> > > > +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
> > > > +			      unsigned long len);
> > > > +
> > > >    /*
> > > >     * Clear the page's PageMlocked().  This can be useful in a situation where
> > > >     * we want to unconditionally remove a page from the pagecache -- e.g.,
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 61f72b09d990..c481f088bd50 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -1348,9 +1348,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
> > > >    	return hint;
> > > >    }
> > > > -static inline int mlock_future_check(struct mm_struct *mm,
> > > > -				     unsigned long flags,
> > > > -				     unsigned long len)
> > > > +int mlock_future_check(struct mm_struct *mm, unsigned long flags,
> > > > +		       unsigned long len)
> > > >    {
> > > >    	unsigned long locked, lock_limit;
> > > > 
> > > 
> > > So, an interesting question is if you actually want to charge secretmem
> > > pages against mlock now, or if you want a dedicated secretmem cgroup
> > > controller instead?
> > 
> > Well, with the current implementation there are three limits an
> > administrator can use to control secretmem limits: mlock, memcg and
> > kernel parameter.
> > 
> > The kernel parameter puts a global upper limit for secretmem usage,
> > memcg accounts all secretmem allocations, including the unused memory in
> > large pages caching and mlock allows per task limit for secretmem
> > mappings, well, like mlock does.
> > 
> > I didn't consider a dedicated cgroup, as it seems we already have enough
> > existing knobs and a new one would be unnecessary.
> 
> To me it feels like the mlock() limit is a wrong fit for secretmem. But
> maybe there are other cases of using the mlock() limit without actually
> doing mlock() that I am not aware of (most probably :) )?

Secretmem does not explicitly call mlock(), but it does what mlock()
does and a bit more. Citing mlock(2):

  mlock(),  mlock2(),  and  mlockall()  lock  part  or all of the calling
  process's virtual address space into RAM, preventing that  memory  from
  being paged to the swap area.

So, given that secretmem pages are not swappable, I think that
RLIMIT_MEMLOCK is appropriate here.
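
The mlock(2) behaviour cited above can be observed from userspace with a
small sketch like the one below. It only uses the plain mlock()/munlock()
calls and says nothing about how secretmem itself charges the limit;
try_lock_one_page() is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Try to lock a single page into RAM. Returns 0 on success or the errno
 * value reported by mlock() on failure (for example when the request
 * would exceed RLIMIT_MEMLOCK for an unprivileged process).
 */
int try_lock_one_page(void)
{
	long page = sysconf(_SC_PAGESIZE);
	void *buf = NULL;
	int err = 0;

	assert(page > 0);

	/* mlock() works on whole pages, so use a page-aligned buffer. */
	if (posix_memalign(&buf, (size_t)page, (size_t)page) != 0)
		return ENOMEM;

	if (mlock(buf, (size_t)page) != 0)
		err = errno;              /* remember why locking failed */
	else
		munlock(buf, (size_t)page); /* drop the lock before freeing */

	free(buf);
	return err;
}
```

Running the helper under "ulimit -l 0" as an unprivileged user is one way to
see the limit being enforced.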

> I mean, my concern is not earth shattering, this can be reworked later. As I
> said, it just feels wrong.
> 
> -- 
> Thanks,
> 
> David / dhildenb
>

David Hildenbrand Nov. 12, 2020, 8:15 p.m. UTC | #5

> On 12.11.2020 at 20:08, Mike Rapoport <rppt@kernel.org> wrote:
> 
> On Thu, Nov 12, 2020 at 05:22:00PM +0100, David Hildenbrand wrote:
>>> On 10.11.20 19:06, Mike Rapoport wrote:
>>> On Tue, Nov 10, 2020 at 06:17:26PM +0100, David Hildenbrand wrote:
>>>> On 10.11.20 16:14, Mike Rapoport wrote:
>>>>> From: Mike Rapoport <rppt@linux.ibm.com>
>>>>> 
>>>>> It will be used by the upcoming secret memory implementation.
>>>>> 
>>>>> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
>>>>> ---
>>>>>   mm/internal.h | 3 +++
>>>>>   mm/mmap.c     | 5 ++---
>>>>>   2 files changed, 5 insertions(+), 3 deletions(-)
>>>>> 
>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>> index c43ccdddb0f6..ae146a260b14 100644
>>>>> --- a/mm/internal.h
>>>>> +++ b/mm/internal.h
>>>>> @@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
>>>>>   extern void mlock_vma_page(struct page *page);
>>>>>   extern unsigned int munlock_vma_page(struct page *page);
>>>>> +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
>>>>> +                  unsigned long len);
>>>>> +
>>>>>   /*
>>>>>    * Clear the page's PageMlocked().  This can be useful in a situation where
>>>>>    * we want to unconditionally remove a page from the pagecache -- e.g.,
>>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>>> index 61f72b09d990..c481f088bd50 100644
>>>>> --- a/mm/mmap.c
>>>>> +++ b/mm/mmap.c
>>>>> @@ -1348,9 +1348,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
>>>>>       return hint;
>>>>>   }
>>>>> -static inline int mlock_future_check(struct mm_struct *mm,
>>>>> -                     unsigned long flags,
>>>>> -                     unsigned long len)
>>>>> +int mlock_future_check(struct mm_struct *mm, unsigned long flags,
>>>>> +               unsigned long len)
>>>>>   {
>>>>>       unsigned long locked, lock_limit;
>>>>> 
>>>> 
>>>> So, an interesting question is if you actually want to charge secretmem
>>>> pages against mlock now, or if you want a dedicated secretmem cgroup
>>>> controller instead?
>>> 
>>> Well, with the current implementation there are three limits an
>>> administrator can use to control secretmem limits: mlock, memcg and
>>> kernel parameter.
>>> 
>>> The kernel parameter puts a global upper limit for secretmem usage,
>>> memcg accounts all secretmem allocations, including the unused memory in
>>> large pages caching and mlock allows per task limit for secretmem
>>> mappings, well, like mlock does.
>>> 
>>> I didn't consider a dedicated cgroup, as it seems we already have enough
>>> existing knobs and a new one would be unnecessary.
>> 
>> To me it feels like the mlock() limit is a wrong fit for secretmem. But
>> maybe there are other cases of using the mlock() limit without actually
>> doing mlock() that I am not aware of (most probably :) )?
> 
> Secretmem does not explicitly calls to mlock() but it does what mlock()
> does and a bit more. Citing mlock(2):
> 
>  mlock(),  mlock2(),  and  mlockall()  lock  part  or all of the calling
>  process's virtual address space into RAM, preventing that  memory  from
>  being paged to the swap area.
> 
> So, based on that secretmem pages are not swappable, I think that
> RLIMIT_MEMLOCK is appropriate here.
> 

The man page explicitly lists the mlock() system calls. E.g., we also
don't account for gigantic pages, which might be allocated from CMA and
are not swappable.



>> I mean, my concern is not earth shattering, this can be reworked later. As I
>> said, it just feels wrong.
>> 
>> -- 
>> Thanks,
>> 
>> David / dhildenb
>> 
> 
> -- 
> Sincerely yours,
> Mike.
>

Mike Rapoport Nov. 15, 2020, 8:26 a.m. UTC | #6

On Thu, Nov 12, 2020 at 09:15:18PM +0100, David Hildenbrand wrote:
> 
> > On 12.11.2020 at 20:08, Mike Rapoport <rppt@kernel.org> wrote:
> > 
> > On Thu, Nov 12, 2020 at 05:22:00PM +0100, David Hildenbrand wrote:
> >>> On 10.11.20 19:06, Mike Rapoport wrote:
> >>> On Tue, Nov 10, 2020 at 06:17:26PM +0100, David Hildenbrand wrote:
> >>>> On 10.11.20 16:14, Mike Rapoport wrote:
> >>>>> From: Mike Rapoport <rppt@linux.ibm.com>
> >>>>> 
> >>>>> It will be used by the upcoming secret memory implementation.
> >>>>> 
> >>>>> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> >>>>> ---
> >>>>>   mm/internal.h | 3 +++
> >>>>>   mm/mmap.c     | 5 ++---
> >>>>>   2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>> 
> >>>>> diff --git a/mm/internal.h b/mm/internal.h
> >>>>> index c43ccdddb0f6..ae146a260b14 100644
> >>>>> --- a/mm/internal.h
> >>>>> +++ b/mm/internal.h
> >>>>> @@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
> >>>>>   extern void mlock_vma_page(struct page *page);
> >>>>>   extern unsigned int munlock_vma_page(struct page *page);
> >>>>> +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
> >>>>> +                  unsigned long len);
> >>>>> +
> >>>>>   /*
> >>>>>    * Clear the page's PageMlocked().  This can be useful in a situation where
> >>>>>    * we want to unconditionally remove a page from the pagecache -- e.g.,
> >>>>> diff --git a/mm/mmap.c b/mm/mmap.c
> >>>>> index 61f72b09d990..c481f088bd50 100644
> >>>>> --- a/mm/mmap.c
> >>>>> +++ b/mm/mmap.c
> >>>>> @@ -1348,9 +1348,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
> >>>>>       return hint;
> >>>>>   }
> >>>>> -static inline int mlock_future_check(struct mm_struct *mm,
> >>>>> -                     unsigned long flags,
> >>>>> -                     unsigned long len)
> >>>>> +int mlock_future_check(struct mm_struct *mm, unsigned long flags,
> >>>>> +               unsigned long len)
> >>>>>   {
> >>>>>       unsigned long locked, lock_limit;
> >>>>> 
> >>>> 
> >>>> So, an interesting question is if you actually want to charge secretmem
> >>>> pages against mlock now, or if you want a dedicated secretmem cgroup
> >>>> controller instead?
> >>> 
> >>> Well, with the current implementation there are three limits an
> >>> administrator can use to control secretmem limits: mlock, memcg and
> >>> kernel parameter.
> >>> 
> >>> The kernel parameter puts a global upper limit for secretmem usage,
> >>> memcg accounts all secretmem allocations, including the unused memory in
> >>> large pages caching and mlock allows per task limit for secretmem
> >>> mappings, well, like mlock does.
> >>> 
> >>> I didn't consider a dedicated cgroup, as it seems we already have enough
> >>> existing knobs and a new one would be unnecessary.
> >> 
> >> To me it feels like the mlock() limit is a wrong fit for secretmem. But
> >> maybe there are other cases of using the mlock() limit without actually
> >> doing mlock() that I am not aware of (most probably :) )?
> > 
> > Secretmem does not explicitly calls to mlock() but it does what mlock()
> > does and a bit more. Citing mlock(2):
> > 
> >  mlock(),  mlock2(),  and  mlockall()  lock  part  or all of the calling
> >  process's virtual address space into RAM, preventing that  memory  from
> >  being paged to the swap area.
> > 
> > So, based on that secretmem pages are not swappable, I think that
> > RLIMIT_MEMLOCK is appropriate here.
> > 
> 
> The page explicitly lists mlock() system calls.

Well, it's mlock() man page, isn't it? ;-)

My thinking was that since secretmem does what mlock() does wrt
swappability, it should at least obey the same limit, i.e.
RLIMIT_MEMLOCK.

> E.g., we also don't
> account for gigantic pages - which might be allocated from CMA and are
> not swappable.
 
Do you mean gigantic pages in hugetlbfs?
It seems to me that hugetlbfs accounting is a completely different
story.

> >> I mean, my concern is not earth shattering, this can be reworked later. As I
> >> said, it just feels wrong.
> >> 
> >> -- 
> >> Thanks,
> >> 
> >> David / dhildenb
> >> 
> > 
> > -- 
> > Sincerely yours,
> > Mike.
> > 
>

David Hildenbrand Nov. 17, 2020, 3:09 p.m. UTC | #7

On 15.11.20 09:26, Mike Rapoport wrote:
> On Thu, Nov 12, 2020 at 09:15:18PM +0100, David Hildenbrand wrote:
>>
>>> On 12.11.2020 at 20:08, Mike Rapoport <rppt@kernel.org> wrote:
>>>
>>> On Thu, Nov 12, 2020 at 05:22:00PM +0100, David Hildenbrand wrote:
>>>>> On 10.11.20 19:06, Mike Rapoport wrote:
>>>>> On Tue, Nov 10, 2020 at 06:17:26PM +0100, David Hildenbrand wrote:
>>>>>> On 10.11.20 16:14, Mike Rapoport wrote:
>>>>>>> From: Mike Rapoport <rppt@linux.ibm.com>
>>>>>>>
>>>>>>> It will be used by the upcoming secret memory implementation.
>>>>>>>
>>>>>>> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
>>>>>>> ---
>>>>>>>    mm/internal.h | 3 +++
>>>>>>>    mm/mmap.c     | 5 ++---
>>>>>>>    2 files changed, 5 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>>>> index c43ccdddb0f6..ae146a260b14 100644
>>>>>>> --- a/mm/internal.h
>>>>>>> +++ b/mm/internal.h
>>>>>>> @@ -348,6 +348,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
>>>>>>>    extern void mlock_vma_page(struct page *page);
>>>>>>>    extern unsigned int munlock_vma_page(struct page *page);
>>>>>>> +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
>>>>>>> +                  unsigned long len);
>>>>>>> +
>>>>>>>    /*
>>>>>>>     * Clear the page's PageMlocked().  This can be useful in a situation where
>>>>>>>     * we want to unconditionally remove a page from the pagecache -- e.g.,
>>>>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>>>>> index 61f72b09d990..c481f088bd50 100644
>>>>>>> --- a/mm/mmap.c
>>>>>>> +++ b/mm/mmap.c
>>>>>>> @@ -1348,9 +1348,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
>>>>>>>        return hint;
>>>>>>>    }
>>>>>>> -static inline int mlock_future_check(struct mm_struct *mm,
>>>>>>> -                     unsigned long flags,
>>>>>>> -                     unsigned long len)
>>>>>>> +int mlock_future_check(struct mm_struct *mm, unsigned long flags,
>>>>>>> +               unsigned long len)
>>>>>>>    {
>>>>>>>        unsigned long locked, lock_limit;
>>>>>>>
>>>>>>
>>>>>> So, an interesting question is if you actually want to charge secretmem
>>>>>> pages against mlock now, or if you want a dedicated secretmem cgroup
>>>>>> controller instead?
>>>>>
>>>>> Well, with the current implementation there are three limits an
>>>>> administrator can use to control secretmem limits: mlock, memcg and
>>>>> kernel parameter.
>>>>>
>>>>> The kernel parameter puts a global upper limit for secretmem usage,
>>>>> memcg accounts all secretmem allocations, including the unused memory in
>>>>> large pages caching and mlock allows per task limit for secretmem
>>>>> mappings, well, like mlock does.
>>>>>
>>>>> I didn't consider a dedicated cgroup, as it seems we already have enough
>>>>> existing knobs and a new one would be unnecessary.
>>>>
>>>> To me it feels like the mlock() limit is a wrong fit for secretmem. But
>>>> maybe there are other cases of using the mlock() limit without actually
>>>> doing mlock() that I am not aware of (most probably :) )?
>>>
>>> Secretmem does not explicitly calls to mlock() but it does what mlock()
>>> does and a bit more. Citing mlock(2):
>>>
>>>   mlock(),  mlock2(),  and  mlockall()  lock  part  or all of the calling
>>>   process's virtual address space into RAM, preventing that  memory  from
>>>   being paged to the swap area.
>>>
>>> So, based on that secretmem pages are not swappable, I think that
>>> RLIMIT_MEMLOCK is appropriate here.
>>>
>>
>> The page explicitly lists mlock() system calls.
> 
> Well, it's mlock() man page, isn't it? ;-)

;)

> 
> My thinking was that since secretmem does what mlock() does wrt
> swapability, it should at least obey the same limit, i.e.
> RLIMIT_MEMLOCK.

Right, but at least currently, it behaves like any other CMA allocation
(IIRC they are all unmovable and, therefore, not swappable). In the
future, if pages become movable (but not swappable), I guess it might
make more sense. I assume we never ever want to swap secretmem.

"man getrlimit" states for RLIMIT_MEMLOCK:

"This is the maximum number of bytes of memory that may be
  locked into RAM.  [...] This limit affects
  mlock(2), mlockall(2), and the mmap(2) MAP_LOCKED operation.
  Since Linux 2.6.9, it also affects the shmctl(2) SHM_LOCK
  operation [...]"

So that place has to be updated as well I guess? Otherwise this might 
come as a surprise for users.
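
As a point of reference for the limit discussed here, the current
RLIMIT_MEMLOCK values can be read with getrlimit(2). A minimal sketch,
where read_memlock_limit() is a hypothetical helper name:

```c
#include <sys/resource.h>

/*
 * Read the soft and hard RLIMIT_MEMLOCK values, in bytes.
 * Returns 0 on success and -1 if getrlimit() fails.
 * Either value may be RLIM_INFINITY.
 */
int read_memlock_limit(rlim_t *soft, rlim_t *hard)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;

	*soft = rl.rlim_cur; /* limit enforced for the calling process */
	*hard = rl.rlim_max; /* ceiling for raising the soft limit */
	return 0;
}
```

The soft value is the one an unprivileged process runs into, which is
the number users would need to know about if secretmem is charged
against it.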

> 
>> E.g., we also don't
>> account for gigantic pages - which might be allocated from CMA and are
>> not swappable.
>   
> Do you mean gigantic pages in hugetlbfs?

Yes

> It seems to me that hugetlbfs accounting is a completely different
> story.

I'd say it is right now comparable to secretmem, which is why I thought
similar accounting would make sense.

Mike Rapoport Nov. 17, 2020, 3:58 p.m. UTC | #8

On Tue, Nov 17, 2020 at 04:09:39PM +0100, David Hildenbrand wrote:
> On 15.11.20 09:26, Mike Rapoport wrote:
> > On Thu, Nov 12, 2020 at 09:15:18PM +0100, David Hildenbrand wrote:

...

> > My thinking was that since secretmem does what mlock() does wrt
> > swapability, it should at least obey the same limit, i.e.
> > RLIMIT_MEMLOCK.
> 
> Right, but at least currently, it behaves like any other CMA allocation
> (IIRC they are all unmovable and, therefore, not swappable). In the future,
> if pages would be movable (but not swappable), I guess it might makes more
> sense. I assume we never ever want to swap secretmem.
> 
> "man getrlimit" states for RLIMIT_MEMLOCK:
> 
> "This is the maximum number of bytes of memory that may be
>  locked into RAM.  [...] This limit affects
>  mlock(2), mlockall(2), and the mmap(2) MAP_LOCKED operation.
>  Since Linux 2.6.9, it also affects the shmctl(2) SHM_LOCK
>  operation [...]"
> 
> So that place has to be updated as well I guess? Otherwise this might come
> as a surprise for users.

Sure.

> > 
> > > E.g., we also don't
> > > account for gigantic pages - which might be allocated from CMA and are
> > > not swappable.
> > Do you mean gigantic pages in hugetlbfs?
> 
> Yes
> 
> > It seems to me that hugetlbfs accounting is a completely different
> > story.
> 
> I'd say it is right now comparable to secretmem - which is why I though
> similar accounting would make sense.

IMHO, using RLIMIT_MEMLOCK and memcg is more straightforward than
a custom cgroup.

And if we see a need for an additional mechanism, we can always add it.
 
> -- 
> Thanks,
> 
> David / dhildenb
> 
>
