[v7,01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd

Message ID	20220706082016.2603916-2-chao.p.peng@linux.intel.com (mailing list archive)
State	New
Headers	show Return-Path: <linux-kselftest-owner@kernel.org> From: Chao Peng <chao.p.peng@linux.intel.com> To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org Cc: Paolo Bonzini <pbonzini@redhat.com>, Jonathan Corbet <corbet@lwn.net>, Sean Christopherson <seanjc@google.com>, Vitaly Kuznetsov <vkuznets@redhat.com>, Wanpeng Li <wanpengli@tencent.com>, Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>, Hugh Dickins <hughd@google.com>, Jeff Layton <jlayton@kernel.org>, "J . Bruce Fields" <bfields@fieldses.org>, Andrew Morton <akpm@linux-foundation.org>, Shuah Khan <shuah@kernel.org>, Mike Rapoport <rppt@kernel.org>, Steven Price <steven.price@arm.com>, "Maciej S . Szmigiero" <mail@maciej.szmigiero.name>, Vlastimil Babka <vbabka@suse.cz>, Vishal Annapurve <vannapurve@google.com>, Yu Zhang <yu.c.zhang@linux.intel.com>, Chao Peng <chao.p.peng@linux.intel.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret <qperret@google.com>, Michael Roth <michael.roth@amd.com>, mhocko@suse.com, Muchun Song <songmuchun@bytedance.com> Subject: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd Date: Wed, 6 Jul 2022 16:20:03 +0800 Message-Id: <20220706082016.2603916-2-chao.p.peng@linux.intel.com> In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	KVM: mm: fd-based approach for supporting KVM guest private memory \| expand [v7,00/14] KVM: mm: fd-based approach for supporting KVM guest private memory [v7,01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd [v7,02/14] selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE [v7,03/14] mm: Introduce memfile_notifier [v7,04/14] mm/shmem: Support memfile_notifier [v7,05/14] mm/memfd: Introduce MFD_INACCESSIBLE flag [v7,06/14] KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS [v7,07/14] KVM: Use gfn instead of hva for mmu_notifier_retry [v7,08/14] KVM: Rename mmu_notifier_* [v7,09/14] KVM: Extend the memslot to support fd-based private memory [v7,10/14] KVM: Add KVM_EXIT_MEMORY_FAULT exit [v7,11/14] KVM: Register/unregister the guest private memory regions [v7,12/14] KVM: Handle page fault for private memory [v7,13/14] KVM: Enable and expose KVM_MEM_PRIVATE [v7,14/14] memfd_create.2: Describe MFD_INACCESSIBLE flag

Chao Peng July 6, 2022, 8:20 a.m. UTC

Normally, a write to unallocated space of a file or the hole of a sparse
file automatically causes space allocation, for memfd, this equals to
memory allocation. This new seal prevents such automatically allocating,
either this is from a direct write() or a write on the previously
mmap-ed area. The seal does not prevent fallocate() so an explicit
fallocate() can still cause allocating and can be used to reserve
memory.

This is used to prevent unintentional allocation from userspace on a
stray or careless write and any intentional allocation should use an
explicit fallocate(). One of the main usecases is to avoid memory double
allocation for confidential computing usage where we use two memfds to
back guest memory and at a single point only one memfd is alive and we
want to prevent memory allocation for the other memfd which may have
been mmap-ed previously. More discussion can be found at:

  https://lkml.org/lkml/2022/6/14/1255

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/uapi/linux/fcntl.h |  1 +
 mm/memfd.c                 |  3 ++-
 mm/shmem.c                 | 16 ++++++++++++++--
 3 files changed, 17 insertions(+), 3 deletions(-)

David Hildenbrand July 21, 2022, 9:44 a.m. UTC | #1

On 06.07.22 10:20, Chao Peng wrote:
> Normally, a write to unallocated space of a file or the hole of a sparse
> file automatically causes space allocation, for memfd, this equals to
> memory allocation. This new seal prevents such automatically allocating,
> either this is from a direct write() or a write on the previously
> mmap-ed area. The seal does not prevent fallocate() so an explicit
> fallocate() can still cause allocating and can be used to reserve
> memory.
> 
> This is used to prevent unintentional allocation from userspace on a
> stray or careless write and any intentional allocation should use an
> explicit fallocate(). One of the main usecases is to avoid memory double
> allocation for confidential computing usage where we use two memfds to
> back guest memory and at a single point only one memfd is alive and we
> want to prevent memory allocation for the other memfd which may have
> been mmap-ed previously. More discussion can be found at:
> 
>   https://lkml.org/lkml/2022/6/14/1255
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/uapi/linux/fcntl.h |  1 +
>  mm/memfd.c                 |  3 ++-
>  mm/shmem.c                 | 16 ++++++++++++++--
>  3 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 2f86b2ad6d7e..98bdabc8e309 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -43,6 +43,7 @@
>  #define F_SEAL_GROW	0x0004	/* prevent file from growing */
>  #define F_SEAL_WRITE	0x0008	/* prevent writes */
>  #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
> +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */

Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
shared zeropage, so you'll simply allocate a new page via read() or on
read faults.


Also, I *think* you can place pages via userfaultfd into shmem. Not sure
if that would count "auto alloc", but it would certainly bypass fallocate().

David Hildenbrand July 21, 2022, 9:50 a.m. UTC | #2

On 21.07.22 11:44, David Hildenbrand wrote:
> On 06.07.22 10:20, Chao Peng wrote:
>> Normally, a write to unallocated space of a file or the hole of a sparse
>> file automatically causes space allocation, for memfd, this equals to
>> memory allocation. This new seal prevents such automatically allocating,
>> either this is from a direct write() or a write on the previously
>> mmap-ed area. The seal does not prevent fallocate() so an explicit
>> fallocate() can still cause allocating and can be used to reserve
>> memory.
>>
>> This is used to prevent unintentional allocation from userspace on a
>> stray or careless write and any intentional allocation should use an
>> explicit fallocate(). One of the main usecases is to avoid memory double
>> allocation for confidential computing usage where we use two memfds to
>> back guest memory and at a single point only one memfd is alive and we
>> want to prevent memory allocation for the other memfd which may have
>> been mmap-ed previously. More discussion can be found at:
>>
>>   https://lkml.org/lkml/2022/6/14/1255
>>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> ---
>>  include/uapi/linux/fcntl.h |  1 +
>>  mm/memfd.c                 |  3 ++-
>>  mm/shmem.c                 | 16 ++++++++++++++--
>>  3 files changed, 17 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
>> index 2f86b2ad6d7e..98bdabc8e309 100644
>> --- a/include/uapi/linux/fcntl.h
>> +++ b/include/uapi/linux/fcntl.h
>> @@ -43,6 +43,7 @@
>>  #define F_SEAL_GROW	0x0004	/* prevent file from growing */
>>  #define F_SEAL_WRITE	0x0008	/* prevent writes */
>>  #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
>> +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
> 
> Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
> shared zeropage, so you'll simply allocate a new page via read() or on
> read faults.

Correction: on read() we don't allocate a fresh page. But on read faults
we would. So this comment here needs clarification.

> 
> 
> Also, I *think* you can place pages via userfaultfd into shmem. Not sure
> if that would count "auto alloc", but it would certainly bypass fallocate().
>

Gupta, Pankaj July 21, 2022, 10:27 a.m. UTC | #3

>> Normally, a write to unallocated space of a file or the hole of a sparse
>> file automatically causes space allocation, for memfd, this equals to
>> memory allocation. This new seal prevents such automatically allocating,
>> either this is from a direct write() or a write on the previously
>> mmap-ed area. The seal does not prevent fallocate() so an explicit
>> fallocate() can still cause allocating and can be used to reserve
>> memory.
>>
>> This is used to prevent unintentional allocation from userspace on a
>> stray or careless write and any intentional allocation should use an
>> explicit fallocate(). One of the main usecases is to avoid memory double
>> allocation for confidential computing usage where we use two memfds to
>> back guest memory and at a single point only one memfd is alive and we
>> want to prevent memory allocation for the other memfd which may have
>> been mmap-ed previously. More discussion can be found at:
>>
>>    https://lkml.org/lkml/2022/6/14/1255
>>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> ---
>>   include/uapi/linux/fcntl.h |  1 +
>>   mm/memfd.c                 |  3 ++-
>>   mm/shmem.c                 | 16 ++++++++++++++--
>>   3 files changed, 17 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
>> index 2f86b2ad6d7e..98bdabc8e309 100644
>> --- a/include/uapi/linux/fcntl.h
>> +++ b/include/uapi/linux/fcntl.h
>> @@ -43,6 +43,7 @@
>>   #define F_SEAL_GROW	0x0004	/* prevent file from growing */
>>   #define F_SEAL_WRITE	0x0008	/* prevent writes */
>>   #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
>> +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
> 
> Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
> shared zeropage, so you'll simply allocate a new page via read() or on
> read faults.
> 
> 
> Also, I *think* you can place pages via userfaultfd into shmem. Not sure
> if that would count "auto alloc", but it would certainly bypass fallocate().

I was also thinking this at the same time, but for different reason:

"Want to populate private preboot memory with firmware payload", so was 
thinking userfaulftd could be an option as direct writes are restricted?

Thanks,
Pankaj

Sean Christopherson July 21, 2022, 3:05 p.m. UTC | #4

On Thu, Jul 21, 2022, David Hildenbrand wrote:
> On 21.07.22 11:44, David Hildenbrand wrote:
> > On 06.07.22 10:20, Chao Peng wrote:
> >> Normally, a write to unallocated space of a file or the hole of a sparse
> >> file automatically causes space allocation, for memfd, this equals to
> >> memory allocation. This new seal prevents such automatically allocating,
> >> either this is from a direct write() or a write on the previously
> >> mmap-ed area. The seal does not prevent fallocate() so an explicit
> >> fallocate() can still cause allocating and can be used to reserve
> >> memory.
> >>
> >> This is used to prevent unintentional allocation from userspace on a
> >> stray or careless write and any intentional allocation should use an
> >> explicit fallocate(). One of the main usecases is to avoid memory double
> >> allocation for confidential computing usage where we use two memfds to
> >> back guest memory and at a single point only one memfd is alive and we
> >> want to prevent memory allocation for the other memfd which may have
> >> been mmap-ed previously. More discussion can be found at:
> >>
> >>   https://lkml.org/lkml/2022/6/14/1255
> >>
> >> Suggested-by: Sean Christopherson <seanjc@google.com>
> >> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> >> ---
> >>  include/uapi/linux/fcntl.h |  1 +
> >>  mm/memfd.c                 |  3 ++-
> >>  mm/shmem.c                 | 16 ++++++++++++++--
> >>  3 files changed, 17 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> >> index 2f86b2ad6d7e..98bdabc8e309 100644
> >> --- a/include/uapi/linux/fcntl.h
> >> +++ b/include/uapi/linux/fcntl.h
> >> @@ -43,6 +43,7 @@
> >>  #define F_SEAL_GROW	0x0004	/* prevent file from growing */
> >>  #define F_SEAL_WRITE	0x0008	/* prevent writes */
> >>  #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
> >> +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
> > 
> > Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
> > shared zeropage, so you'll simply allocate a new page via read() or on
> > read faults.
> 
> Correction: on read() we don't allocate a fresh page. But on read faults
> we would. So this comment here needs clarification.

Not just the comment, the code too.  The intent of F_SEAL_AUTO_ALLOCATE is very
much to block _all_ implicit allocations (or maybe just fault-based allocations
if "implicit" is too broad of a description).

Chao Peng July 25, 2022, 1:42 p.m. UTC | #5

On Thu, Jul 21, 2022 at 11:44:11AM +0200, David Hildenbrand wrote:
> On 06.07.22 10:20, Chao Peng wrote:
> > Normally, a write to unallocated space of a file or the hole of a sparse
> > file automatically causes space allocation, for memfd, this equals to
> > memory allocation. This new seal prevents such automatically allocating,
> > either this is from a direct write() or a write on the previously
> > mmap-ed area. The seal does not prevent fallocate() so an explicit
> > fallocate() can still cause allocating and can be used to reserve
> > memory.
> > 
> > This is used to prevent unintentional allocation from userspace on a
> > stray or careless write and any intentional allocation should use an
> > explicit fallocate(). One of the main usecases is to avoid memory double
> > allocation for confidential computing usage where we use two memfds to
> > back guest memory and at a single point only one memfd is alive and we
> > want to prevent memory allocation for the other memfd which may have
> > been mmap-ed previously. More discussion can be found at:
> > 
> >   https://lkml.org/lkml/2022/6/14/1255
> > 
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/uapi/linux/fcntl.h |  1 +
> >  mm/memfd.c                 |  3 ++-
> >  mm/shmem.c                 | 16 ++++++++++++++--
> >  3 files changed, 17 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > index 2f86b2ad6d7e..98bdabc8e309 100644
> > --- a/include/uapi/linux/fcntl.h
> > +++ b/include/uapi/linux/fcntl.h
> > @@ -43,6 +43,7 @@
> >  #define F_SEAL_GROW	0x0004	/* prevent file from growing */
> >  #define F_SEAL_WRITE	0x0008	/* prevent writes */
> >  #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
> > +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
> 
> Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
> shared zeropage, so you'll simply allocate a new page via read() or on
> read faults.

Right, it also prevents read faults.

> 
> 
> Also, I *think* you can place pages via userfaultfd into shmem. Not sure
> if that would count "auto alloc", but it would certainly bypass fallocate().

Userfaultfd sounds interesting, will further investigate it. But a rough
look sounds it only faults to usrspace for write/read fault, not
write()? Also sounds it operates on vma and userfaultfd_register() takes
mmap_lock which is what we want to avoid for frequent
register/unregister during private/shared memory conversion.

Chao
> 
> -- 
> Thanks,
> 
> David / dhildenb

Chao Peng July 25, 2022, 1:46 p.m. UTC | #6

On Thu, Jul 21, 2022 at 03:05:09PM +0000, Sean Christopherson wrote:
> On Thu, Jul 21, 2022, David Hildenbrand wrote:
> > On 21.07.22 11:44, David Hildenbrand wrote:
> > > On 06.07.22 10:20, Chao Peng wrote:
> > >> Normally, a write to unallocated space of a file or the hole of a sparse
> > >> file automatically causes space allocation, for memfd, this equals to
> > >> memory allocation. This new seal prevents such automatically allocating,
> > >> either this is from a direct write() or a write on the previously
> > >> mmap-ed area. The seal does not prevent fallocate() so an explicit
> > >> fallocate() can still cause allocating and can be used to reserve
> > >> memory.
> > >>
> > >> This is used to prevent unintentional allocation from userspace on a
> > >> stray or careless write and any intentional allocation should use an
> > >> explicit fallocate(). One of the main usecases is to avoid memory double
> > >> allocation for confidential computing usage where we use two memfds to
> > >> back guest memory and at a single point only one memfd is alive and we
> > >> want to prevent memory allocation for the other memfd which may have
> > >> been mmap-ed previously. More discussion can be found at:
> > >>
> > >>   https://lkml.org/lkml/2022/6/14/1255
> > >>
> > >> Suggested-by: Sean Christopherson <seanjc@google.com>
> > >> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > >> ---
> > >>  include/uapi/linux/fcntl.h |  1 +
> > >>  mm/memfd.c                 |  3 ++-
> > >>  mm/shmem.c                 | 16 ++++++++++++++--
> > >>  3 files changed, 17 insertions(+), 3 deletions(-)
> > >>
> > >> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > >> index 2f86b2ad6d7e..98bdabc8e309 100644
> > >> --- a/include/uapi/linux/fcntl.h
> > >> +++ b/include/uapi/linux/fcntl.h
> > >> @@ -43,6 +43,7 @@
> > >>  #define F_SEAL_GROW	0x0004	/* prevent file from growing */
> > >>  #define F_SEAL_WRITE	0x0008	/* prevent writes */
> > >>  #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
> > >> +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
> > > 
> > > Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
> > > shared zeropage, so you'll simply allocate a new page via read() or on
> > > read faults.
> > 
> > Correction: on read() we don't allocate a fresh page. But on read faults
> > we would. So this comment here needs clarification.
> 
> Not just the comment, the code too.  The intent of F_SEAL_AUTO_ALLOCATE is very
> much to block _all_ implicit allocations (or maybe just fault-based allocations
> if "implicit" is too broad of a description).

So maybe still your initial suggestion F_SEAL_FAULT_ALLOCATIONS? One
reason I don't like it is the write() ioctl also cause allocation and we
want to prevent it.

Chao

Chao Peng July 25, 2022, 1:54 p.m. UTC | #7

On Thu, Jul 21, 2022 at 12:27:03PM +0200, Gupta, Pankaj wrote:
> 
> > > Normally, a write to unallocated space of a file or the hole of a sparse
> > > file automatically causes space allocation, for memfd, this equals to
> > > memory allocation. This new seal prevents such automatically allocating,
> > > either this is from a direct write() or a write on the previously
> > > mmap-ed area. The seal does not prevent fallocate() so an explicit
> > > fallocate() can still cause allocating and can be used to reserve
> > > memory.
> > > 
> > > This is used to prevent unintentional allocation from userspace on a
> > > stray or careless write and any intentional allocation should use an
> > > explicit fallocate(). One of the main usecases is to avoid memory double
> > > allocation for confidential computing usage where we use two memfds to
> > > back guest memory and at a single point only one memfd is alive and we
> > > want to prevent memory allocation for the other memfd which may have
> > > been mmap-ed previously. More discussion can be found at:
> > > 
> > >    https://lkml.org/lkml/2022/6/14/1255
> > > 
> > > Suggested-by: Sean Christopherson <seanjc@google.com>
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > ---
> > >   include/uapi/linux/fcntl.h |  1 +
> > >   mm/memfd.c                 |  3 ++-
> > >   mm/shmem.c                 | 16 ++++++++++++++--
> > >   3 files changed, 17 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > > index 2f86b2ad6d7e..98bdabc8e309 100644
> > > --- a/include/uapi/linux/fcntl.h
> > > +++ b/include/uapi/linux/fcntl.h
> > > @@ -43,6 +43,7 @@
> > >   #define F_SEAL_GROW	0x0004	/* prevent file from growing */
> > >   #define F_SEAL_WRITE	0x0008	/* prevent writes */
> > >   #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
> > > +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
> > 
> > Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
> > shared zeropage, so you'll simply allocate a new page via read() or on
> > read faults.
> > 
> > 
> > Also, I *think* you can place pages via userfaultfd into shmem. Not sure
> > if that would count "auto alloc", but it would certainly bypass fallocate().
> 
> I was also thinking this at the same time, but for different reason:
> 
> "Want to populate private preboot memory with firmware payload", so was
> thinking userfaulftd could be an option as direct writes are restricted?

If that can be a side effect, I definitely glad to see it, though I'm
still not clear how userfaultfd can be particularly helpful for that.

Chao
> 
> Thanks,
> Pankaj
> 
> 
> 
>

Gupta, Pankaj July 25, 2022, 2:49 p.m. UTC | #8

>>>> Normally, a write to unallocated space of a file or the hole of a sparse
>>>> file automatically causes space allocation, for memfd, this equals to
>>>> memory allocation. This new seal prevents such automatically allocating,
>>>> either this is from a direct write() or a write on the previously
>>>> mmap-ed area. The seal does not prevent fallocate() so an explicit
>>>> fallocate() can still cause allocating and can be used to reserve
>>>> memory.
>>>>
>>>> This is used to prevent unintentional allocation from userspace on a
>>>> stray or careless write and any intentional allocation should use an
>>>> explicit fallocate(). One of the main usecases is to avoid memory double
>>>> allocation for confidential computing usage where we use two memfds to
>>>> back guest memory and at a single point only one memfd is alive and we
>>>> want to prevent memory allocation for the other memfd which may have
>>>> been mmap-ed previously. More discussion can be found at:
>>>>
>>>>     https://lkml.org/lkml/2022/6/14/1255
>>>>
>>>> Suggested-by: Sean Christopherson <seanjc@google.com>
>>>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>>>> ---
>>>>    include/uapi/linux/fcntl.h |  1 +
>>>>    mm/memfd.c                 |  3 ++-
>>>>    mm/shmem.c                 | 16 ++++++++++++++--
>>>>    3 files changed, 17 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
>>>> index 2f86b2ad6d7e..98bdabc8e309 100644
>>>> --- a/include/uapi/linux/fcntl.h
>>>> +++ b/include/uapi/linux/fcntl.h
>>>> @@ -43,6 +43,7 @@
>>>>    #define F_SEAL_GROW	0x0004	/* prevent file from growing */
>>>>    #define F_SEAL_WRITE	0x0008	/* prevent writes */
>>>>    #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
>>>> +#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
>>>
>>> Why only "on writes" and not "on reads". IIRC, shmem doesn't support the
>>> shared zeropage, so you'll simply allocate a new page via read() or on
>>> read faults.
>>>
>>>
>>> Also, I *think* you can place pages via userfaultfd into shmem. Not sure
>>> if that would count "auto alloc", but it would certainly bypass fallocate().
>>
>> I was also thinking this at the same time, but for different reason:
>>
>> "Want to populate private preboot memory with firmware payload", so was
>> thinking userfaulftd could be an option as direct writes are restricted?
> 
> If that can be a side effect, I definitely glad to see it, though I'm
> still not clear how userfaultfd can be particularly helpful for that.

Was thinking if we can use userfaultfd to monitor the pagefault on 
virtual firmware memory range and use to populate the private memory.

Not sure if it is a side effect. Was just theoretically thinking (for 
now kept the idea aside as these enhancements can be worked later).

Thanks,
Pankaj

Paolo Bonzini Aug. 5, 2022, 5:55 p.m. UTC | #9

On 7/21/22 11:44, David Hildenbrand wrote:
> 
> Also, I*think*  you can place pages via userfaultfd into shmem. Not
> sure if that would count "auto alloc", but it would certainly bypass
> fallocate().

Yeah, userfaultfd_register would probably have to forbid this for 
F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for 
this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then 
userfault_register would do something like 
memfile_node_get_flags(vma->vm_file) and check the result.

This means moving this patch later, after "mm: Introduce memfile_notifier".

Thanks,

Paolo

David Hildenbrand Aug. 5, 2022, 6:06 p.m. UTC | #10

On 05.08.22 19:55, Paolo Bonzini wrote:
> On 7/21/22 11:44, David Hildenbrand wrote:
>>
>> Also, I*think*  you can place pages via userfaultfd into shmem. Not
>> sure if that would count "auto alloc", but it would certainly bypass
>> fallocate().
> 
> Yeah, userfaultfd_register would probably have to forbid this for 
> F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for 
> this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then 
> userfault_register would do something like 
> memfile_node_get_flags(vma->vm_file) and check the result.

An alternative is to simply have the shmem allocation fail in a similar
way. Maybe it does already, I haven't checked (don't think so).

Chao Peng Aug. 10, 2022, 9:38 a.m. UTC | #11

On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote:
> On 7/21/22 11:44, David Hildenbrand wrote:
> > 
> > Also, I*think*  you can place pages via userfaultfd into shmem. Not
> > sure if that would count "auto alloc", but it would certainly bypass
> > fallocate().
> 
> Yeah, userfaultfd_register would probably have to forbid this for
> F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for this,
> adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then userfault_register
> would do something like memfile_node_get_flags(vma->vm_file) and check the
> result.

Then we need change userfault_register uAPI for a new property flag.
Userspace should still the decision-maker for this flag.

> 
> This means moving this patch later, after "mm: Introduce memfile_notifier".

Yes, it makes sense now.

Chao
> 
> Thanks,
> 
> Paolo

Chao Peng Aug. 10, 2022, 9:40 a.m. UTC | #12

On Fri, Aug 05, 2022 at 08:06:03PM +0200, David Hildenbrand wrote:
> On 05.08.22 19:55, Paolo Bonzini wrote:
> > On 7/21/22 11:44, David Hildenbrand wrote:
> >>
> >> Also, I*think*  you can place pages via userfaultfd into shmem. Not
> >> sure if that would count "auto alloc", but it would certainly bypass
> >> fallocate().
> > 
> > Yeah, userfaultfd_register would probably have to forbid this for 
> > F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for 
> > this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then 
> > userfault_register would do something like 
> > memfile_node_get_flags(vma->vm_file) and check the result.
> 
> An alternative is to simply have the shmem allocation fail in a similar
> way. Maybe it does already, I haven't checked (don't think so).

This sounds a better option. We don't need uAPI changes for
userfault_register uAPI but I guess we will still need a KVM uAPI,
either on the memslot or on the whole VM since Roth said this feature
should be optional because some usages may want to disable it for
performance reason. For details please see discussion:
  https://lkml.org/lkml/2022/6/23/1905

Chao
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb

Kirill A. Shutemov Aug. 17, 2022, 11:41 p.m. UTC | #13

On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote:
> On 7/21/22 11:44, David Hildenbrand wrote:
> > 
> > Also, I*think*  you can place pages via userfaultfd into shmem. Not
> > sure if that would count "auto alloc", but it would certainly bypass
> > fallocate().
> 
> Yeah, userfaultfd_register would probably have to forbid this for
> F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for this,
> adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then userfault_register
> would do something like memfile_node_get_flags(vma->vm_file) and check the
> result.

I donno, memory allocation with userfaultfd looks pretty intentional to
me. Why would F_SEAL_AUTO_ALLOCATE prevent it?

Maybe we would need it in the future for post-copy migration or something?

Or existing practises around userfaultfd touch memory randomly and
therefore incompatible with F_SEAL_AUTO_ALLOCATE intent?

Note, that userfaultfd is only relevant for shared memory as it requires
VMA which we don't have for MFD_INACCESSIBLE.

Paolo Bonzini Aug. 18, 2022, 9:09 a.m. UTC | #14

On 8/18/22 01:41, Kirill A. Shutemov wrote:
> Note, that userfaultfd is only relevant for shared memory as it requires
> VMA which we don't have for MFD_INACCESSIBLE.

Oh, you're right!  So yeah, looks like userfaultfd is not a problem.

Paolo

David Hildenbrand Aug. 23, 2022, 7:36 a.m. UTC | #15

On 18.08.22 01:41, Kirill A. Shutemov wrote:
> On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote:
>> On 7/21/22 11:44, David Hildenbrand wrote:
>>>
>>> Also, I*think*  you can place pages via userfaultfd into shmem. Not
>>> sure if that would count "auto alloc", but it would certainly bypass
>>> fallocate().
>>
>> Yeah, userfaultfd_register would probably have to forbid this for
>> F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for this,
>> adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then userfault_register
>> would do something like memfile_node_get_flags(vma->vm_file) and check the
>> result.
> 
> I donno, memory allocation with userfaultfd looks pretty intentional to
> me. Why would F_SEAL_AUTO_ALLOCATE prevent it?
> 

Can't we say the same about a write()?

> Maybe we would need it in the future for post-copy migration or something?
> 
> Or existing practises around userfaultfd touch memory randomly and
> therefore incompatible with F_SEAL_AUTO_ALLOCATE intent?
> 
> Note, that userfaultfd is only relevant for shared memory as it requires
> VMA which we don't have for MFD_INACCESSIBLE.

This feature (F_SEAL_AUTO_ALLOCATE) is independent of all the lovely
encrypted VM stuff, so it doesn't matter how it relates to MFD_INACCESSIBLE.

Chao Peng Aug. 24, 2022, 10:20 a.m. UTC | #16

On Tue, Aug 23, 2022 at 09:36:57AM +0200, David Hildenbrand wrote:
> On 18.08.22 01:41, Kirill A. Shutemov wrote:
> > On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote:
> >> On 7/21/22 11:44, David Hildenbrand wrote:
> >>>
> >>> Also, I*think*  you can place pages via userfaultfd into shmem. Not
> >>> sure if that would count "auto alloc", but it would certainly bypass
> >>> fallocate().
> >>
> >> Yeah, userfaultfd_register would probably have to forbid this for
> >> F_SEAL_AUTO_ALLOCATE vmas.  Maybe the memfile_node can be reused for this,
> >> adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags?  Then userfault_register
> >> would do something like memfile_node_get_flags(vma->vm_file) and check the
> >> result.
> > 
> > I donno, memory allocation with userfaultfd looks pretty intentional to
> > me. Why would F_SEAL_AUTO_ALLOCATE prevent it?
> > 
> 
> Can't we say the same about a write()?
> 
> > Maybe we would need it in the future for post-copy migration or something?
> > 
> > Or existing practises around userfaultfd touch memory randomly and
> > therefore incompatible with F_SEAL_AUTO_ALLOCATE intent?
> > 
> > Note, that userfaultfd is only relevant for shared memory as it requires
> > VMA which we don't have for MFD_INACCESSIBLE.
> 
> This feature (F_SEAL_AUTO_ALLOCATE) is independent of all the lovely
> encrypted VM stuff, so it doesn't matter how it relates to MFD_INACCESSIBLE.

Right, this patch is for normal user accssible fd. In KVM this flag is
expected to be set on the shared part of the memslot, while all other
patches in this series are for private part of the memslot.

Private memory doesn't have this need because it's totally inaccissible
from userspace so no chance for userspace to write to the fd and cause
allocation by accident. While for shared memory, malicious/buggy guest
OS may cause userspace to write to any range of the shared fd and cause
memory allocation, even that range should the private memory not the
shared memory be visible to guest OS.

Chao
> 
> -- 
> Thanks,
> 
> David / dhildenb
>

Fuad Tabba Aug. 26, 2022, 3:19 p.m. UTC | #17

Hi Chao,

On Wed, Jul 6, 2022 at 9:25 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> Normally, a write to unallocated space of a file or the hole of a sparse
> file automatically causes space allocation, for memfd, this equals to
> memory allocation. This new seal prevents such automatically allocating,
> either this is from a direct write() or a write on the previously
> mmap-ed area. The seal does not prevent fallocate() so an explicit
> fallocate() can still cause allocating and can be used to reserve
> memory.
>
> This is used to prevent unintentional allocation from userspace on a
> stray or careless write and any intentional allocation should use an
> explicit fallocate(). One of the main usecases is to avoid memory double
> allocation for confidential computing usage where we use two memfds to
> back guest memory and at a single point only one memfd is alive and we
> want to prevent memory allocation for the other memfd which may have
> been mmap-ed previously. More discussion can be found at:
>
>   https://lkml.org/lkml/2022/6/14/1255
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> ---
>  include/uapi/linux/fcntl.h |  1 +
>  mm/memfd.c                 |  3 ++-
>  mm/shmem.c                 | 16 ++++++++++++++--
>  3 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 2f86b2ad6d7e..98bdabc8e309 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -43,6 +43,7 @@
>  #define F_SEAL_GROW    0x0004  /* prevent file from growing */
>  #define F_SEAL_WRITE   0x0008  /* prevent writes */
>  #define F_SEAL_FUTURE_WRITE    0x0010  /* prevent future writes while mapped */
> +#define F_SEAL_AUTO_ALLOCATE   0x0020  /* prevent allocation for writes */

I think this should also be added to tools/include/uapi/linux/fcntl.h

Cheers,
/fuad


>  /* (1U << 31) is reserved for signed error codes */
>
>  /*
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 08f5f8304746..2afd898798e4 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
>                      F_SEAL_SHRINK | \
>                      F_SEAL_GROW | \
>                      F_SEAL_WRITE | \
> -                    F_SEAL_FUTURE_WRITE)
> +                    F_SEAL_FUTURE_WRITE | \
> +                    F_SEAL_AUTO_ALLOCATE)
>
>  static int memfd_add_seals(struct file *file, unsigned int seals)
>  {
> diff --git a/mm/shmem.c b/mm/shmem.c
> index a6f565308133..6c8aef15a17d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2051,6 +2051,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>         struct vm_area_struct *vma = vmf->vma;
>         struct inode *inode = file_inode(vma->vm_file);
>         gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
> +       struct shmem_inode_info *info = SHMEM_I(inode);
> +       enum sgp_type sgp;
>         int err;
>         vm_fault_t ret = VM_FAULT_LOCKED;
>
> @@ -2113,7 +2115,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>                 spin_unlock(&inode->i_lock);
>         }
>
> -       err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
> +       if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
> +               sgp = SGP_NOALLOC;
> +       else
> +               sgp = SGP_CACHE;
> +
> +       err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
>                                   gfp, vma, vmf, &ret);
>         if (err)
>                 return vmf_error(err);
> @@ -2459,6 +2466,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>         struct inode *inode = mapping->host;
>         struct shmem_inode_info *info = SHMEM_I(inode);
>         pgoff_t index = pos >> PAGE_SHIFT;
> +       enum sgp_type sgp;
>         int ret = 0;
>
>         /* i_rwsem is held by caller */
> @@ -2470,7 +2478,11 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>                         return -EPERM;
>         }
>
> -       ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> +       if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
> +               sgp = SGP_NOALLOC;
> +       else
> +               sgp = SGP_WRITE;
> +       ret = shmem_getpage(inode, index, pagep, sgp);
>
>         if (ret)
>                 return ret;
> --
> 2.25.1
>

Chao Peng Aug. 29, 2022, 3:18 p.m. UTC | #18

On Fri, Aug 26, 2022 at 04:19:32PM +0100, Fuad Tabba wrote:
> Hi Chao,
> 
> On Wed, Jul 6, 2022 at 9:25 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > Normally, a write to unallocated space of a file or the hole of a sparse
> > file automatically causes space allocation, for memfd, this equals to
> > memory allocation. This new seal prevents such automatically allocating,
> > either this is from a direct write() or a write on the previously
> > mmap-ed area. The seal does not prevent fallocate() so an explicit
> > fallocate() can still cause allocating and can be used to reserve
> > memory.
> >
> > This is used to prevent unintentional allocation from userspace on a
> > stray or careless write and any intentional allocation should use an
> > explicit fallocate(). One of the main usecases is to avoid memory double
> > allocation for confidential computing usage where we use two memfds to
> > back guest memory and at a single point only one memfd is alive and we
> > want to prevent memory allocation for the other memfd which may have
> > been mmap-ed previously. More discussion can be found at:
> >
> >   https://lkml.org/lkml/2022/6/14/1255
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > ---
> >  include/uapi/linux/fcntl.h |  1 +
> >  mm/memfd.c                 |  3 ++-
> >  mm/shmem.c                 | 16 ++++++++++++++--
> >  3 files changed, 17 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > index 2f86b2ad6d7e..98bdabc8e309 100644
> > --- a/include/uapi/linux/fcntl.h
> > +++ b/include/uapi/linux/fcntl.h
> > @@ -43,6 +43,7 @@
> >  #define F_SEAL_GROW    0x0004  /* prevent file from growing */
> >  #define F_SEAL_WRITE   0x0008  /* prevent writes */
> >  #define F_SEAL_FUTURE_WRITE    0x0010  /* prevent future writes while mapped */
> > +#define F_SEAL_AUTO_ALLOCATE   0x0020  /* prevent allocation for writes */
> 
> I think this should also be added to tools/include/uapi/linux/fcntl.h

Yes, thanks.

Chao
> 
> Cheers,
> /fuad
> 
> 
> >  /* (1U << 31) is reserved for signed error codes */
> >
> >  /*
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 08f5f8304746..2afd898798e4 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
> >                      F_SEAL_SHRINK | \
> >                      F_SEAL_GROW | \
> >                      F_SEAL_WRITE | \
> > -                    F_SEAL_FUTURE_WRITE)
> > +                    F_SEAL_FUTURE_WRITE | \
> > +                    F_SEAL_AUTO_ALLOCATE)
> >
> >  static int memfd_add_seals(struct file *file, unsigned int seals)
> >  {
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index a6f565308133..6c8aef15a17d 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2051,6 +2051,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
> >         struct vm_area_struct *vma = vmf->vma;
> >         struct inode *inode = file_inode(vma->vm_file);
> >         gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
> > +       struct shmem_inode_info *info = SHMEM_I(inode);
> > +       enum sgp_type sgp;
> >         int err;
> >         vm_fault_t ret = VM_FAULT_LOCKED;
> >
> > @@ -2113,7 +2115,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
> >                 spin_unlock(&inode->i_lock);
> >         }
> >
> > -       err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
> > +       if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
> > +               sgp = SGP_NOALLOC;
> > +       else
> > +               sgp = SGP_CACHE;
> > +
> > +       err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
> >                                   gfp, vma, vmf, &ret);
> >         if (err)
> >                 return vmf_error(err);
> > @@ -2459,6 +2466,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> >         struct inode *inode = mapping->host;
> >         struct shmem_inode_info *info = SHMEM_I(inode);
> >         pgoff_t index = pos >> PAGE_SHIFT;
> > +       enum sgp_type sgp;
> >         int ret = 0;
> >
> >         /* i_rwsem is held by caller */
> > @@ -2470,7 +2478,11 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> >                         return -EPERM;
> >         }
> >
> > -       ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> > +       if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
> > +               sgp = SGP_NOALLOC;
> > +       else
> > +               sgp = SGP_WRITE;
> > +       ret = shmem_getpage(inode, index, pagep, sgp);
> >
> >         if (ret)
> >                 return ret;
> > --
> > 2.25.1
> >

[v7,01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd

Commit Message

Comments

Patch