Message ID: 20220120160822.666778608@infradead.org (mailing list archive)
State: New
Series: sched: User Managed Concurrency Groups
> On Jan 20, 2022, at 7:55 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Add a guarantee for Anon pages that pin_user_page*() ensures the
> user-mapping of these pages stay preserved. In order to ensure this
> all rmap users have been audited:
>
>  vmscan:	already fails eviction due to page_maybe_dma_pinned()
>
>  migrate:	migration will fail on pinned pages due to
>		expected_page_refs() not matching, however that is
>		*after* try_to_migrate() has already destroyed the
>		user mapping of these pages. Add an early exit for
>		this case.
>
>  numa-balance:	as per the above, pinned pages cannot be migrated,
>		however numa balancing scanning will happily PROT_NONE
>		them to get usage information on these pages. Avoid
>		this for pinned pages.
>
> None of the other rmap users (damon,page-idle,mlock,..) unmap the
> page, they mostly just muck about with reference,dirty flags etc.
>
> This same guarantee cannot be provided for Shared (file) pages due to
> dirty page tracking.
>
> [ snip ]
>
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -106,6 +106,12 @@ static unsigned long change_pte_range(st
> 			continue;
>
> 		/*
> +		 * Can't migrate pinned pages, avoid touching them.
> +		 */
> +		if (page_maybe_dma_pinned(page))
> +			continue;
> +
> +		/*

I have a similar problem with userfaultfd changing protection for
DMA-pinned pages. For userfaultfd it is important to know how many
pages were actually modified.

I am working on a vectored UFFDIO_WRITEPROTECTV that aborts once
a pinned page is encountered, but also returns the number of pages
that were properly protected. I still need to do some work to
send patches for that as it requires further changes (to return
the number of pages that were handled).

But for the matter of your patch, is it possible to make this
test generic (not migration specific) and rely on a new flag in
cp_flags? I can of course make this change later if you prefer it
this way.
On 20.01.22 16:55, Peter Zijlstra wrote:
> Add a guarantee for Anon pages that pin_user_page*() ensures the
> user-mapping of these pages stay preserved. In order to ensure this
> all rmap users have been audited:
>
>  vmscan:	already fails eviction due to page_maybe_dma_pinned()
>
>  migrate:	migration will fail on pinned pages due to
>		expected_page_refs() not matching, however that is
>		*after* try_to_migrate() has already destroyed the
>		user mapping of these pages. Add an early exit for
>		this case.
>
>  numa-balance:	as per the above, pinned pages cannot be migrated,
>		however numa balancing scanning will happily PROT_NONE
>		them to get usage information on these pages. Avoid
>		this for pinned pages.

page_maybe_dma_pinned() can race with GUP-fast without
mm->write_protect_seq. This is a real problem for vmscan() with
concurrent GUP-fast as it can result in R/O mappings of pinned pages and
GUP will lose synchronicity to the page table on write faults due to
wrong COW.

If you're just using it as an optimization, that should work just fine.

I assume all migration will freeze the refcount and consequently bail
out at that point. In that case, LGTM.
On Thu, Jan 20, 2022 at 07:25:08PM +0100, David Hildenbrand wrote:
> On 20.01.22 16:55, Peter Zijlstra wrote:
> > Add a guarantee for Anon pages that pin_user_page*() ensures the
> > user-mapping of these pages stay preserved. In order to ensure this
> > all rmap users have been audited:
> >
> >  vmscan:	already fails eviction due to page_maybe_dma_pinned()
> >
> >  migrate:	migration will fail on pinned pages due to
> >		expected_page_refs() not matching, however that is
> >		*after* try_to_migrate() has already destroyed the
> >		user mapping of these pages. Add an early exit for
> >		this case.
> >
> >  numa-balance:	as per the above, pinned pages cannot be migrated,
> >		however numa balancing scanning will happily PROT_NONE
> >		them to get usage information on these pages. Avoid
> >		this for pinned pages.
>
> page_maybe_dma_pinned() can race with GUP-fast without
> mm->write_protect_seq. This is a real problem for vmscan() with
> concurrent GUP-fast as it can result in R/O mappings of pinned pages and
> GUP will lose synchronicity to the page table on write faults due to
> wrong COW.

Urgh, so yeah, that might be a problem. Follow up code uses it like
this:

+/*
+ * Pinning a page inhibits rmap based unmap for Anon pages. Doing a load
+ * through the user mapping ensures the user mapping exists.
+ */
+#define umcg_pin_and_load(_self, _pagep, _member)				\
+({										\
+	__label__ __out;							\
+	int __ret = -EFAULT;							\
+										\
+	if (pin_user_pages_fast((unsigned long)(_self), 1, 0, &(_pagep)) != 1)	\
+		goto __out;							\
+										\
+	if (!PageAnon(_pagep) ||						\
+	    get_user(_member, &(_self)->_member)) {				\
+		unpin_user_page(_pagep);					\
+		goto __out;							\
+	}									\
+	__ret = 0;								\
+__out:	__ret;									\
+})

And after that hard assumes (on the penalty of SIGKILL) that direct user
access works. Specifically it does RmW ops on it. So I suppose I'd
better upgrade that load to a RmW at the very least.

But is that sufficient? Let me go find that race you mention...
On Thu, Jan 20, 2022 at 10:03:44AM -0800, Nadav Amit wrote:
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -106,6 +106,12 @@ static unsigned long change_pte_range(st
> > 			continue;
> >
> > 		/*
> > +		 * Can't migrate pinned pages, avoid touching them.
> > +		 */
> > +		if (page_maybe_dma_pinned(page))
> > +			continue;
> > +
> > +		/*
>
> I have a similar problem with userfaultfd changing protection for
> DMA-pinned pages. For userfaultfd it is important to know how many
> pages were actually modified.
>
> I am working on a vectored UFFDIO_WRITEPROTECTV that aborts once
> a pinned page is encountered, but also returns the number of pages
> that were properly protected. I still need to do some work to
> send patches for that as it requires further changes (to return
> the number of pages that were handled).
>
> But for the matter of your patch, is it possible to make this
> test generic (not migration specific) and rely on a new flag in
> cp_flags? I can of course make this change later if you prefer it
> this way.

I have no objection to making it apply more widely, but I'm currently
only interested in the rmap users. If userspace does mprotect() on its
own pages, it gets to keep whatever pieces are the result of that.
On 21.01.22 08:51, Peter Zijlstra wrote:
> On Thu, Jan 20, 2022 at 07:25:08PM +0100, David Hildenbrand wrote:
>> On 20.01.22 16:55, Peter Zijlstra wrote:
>>> Add a guarantee for Anon pages that pin_user_page*() ensures the
>>> user-mapping of these pages stay preserved. In order to ensure this
>>> all rmap users have been audited:
>>>
>>>  vmscan:	already fails eviction due to page_maybe_dma_pinned()
>>>
>>>  migrate:	migration will fail on pinned pages due to
>>>		expected_page_refs() not matching, however that is
>>>		*after* try_to_migrate() has already destroyed the
>>>		user mapping of these pages. Add an early exit for
>>>		this case.
>>>
>>>  numa-balance:	as per the above, pinned pages cannot be migrated,
>>>		however numa balancing scanning will happily PROT_NONE
>>>		them to get usage information on these pages. Avoid
>>>		this for pinned pages.
>>
>> page_maybe_dma_pinned() can race with GUP-fast without
>> mm->write_protect_seq. This is a real problem for vmscan() with
>> concurrent GUP-fast as it can result in R/O mappings of pinned pages and
>> GUP will lose synchronicity to the page table on write faults due to
>> wrong COW.
>
> Urgh, so yeah, that might be a problem. Follow up code uses it like
> this:
>
> +/*
> + * Pinning a page inhibits rmap based unmap for Anon pages. Doing a load
> + * through the user mapping ensures the user mapping exists.
> + */
> +#define umcg_pin_and_load(_self, _pagep, _member)				\
> +({										\
> +	__label__ __out;							\
> +	int __ret = -EFAULT;							\
> +										\
> +	if (pin_user_pages_fast((unsigned long)(_self), 1, 0, &(_pagep)) != 1)	\
> +		goto __out;							\
> +										\
> +	if (!PageAnon(_pagep) ||						\
> +	    get_user(_member, &(_self)->_member)) {				\
> +		unpin_user_page(_pagep);					\
> +		goto __out;							\
> +	}									\
> +	__ret = 0;								\
> +__out:	__ret;									\
> +})
>
> And after that hard assumes (on the penalty of SIGKILL) that direct user
> access works. Specifically it does RmW ops on it. So I suppose I'd
> better upgrade that load to a RmW at the very least.
>
> But is that sufficient? Let me go find that race you mention...

It's described in [1] under point 3. After we put the page into the
swapcache, it's still mapped into the page tables, where GUP can find
it. Only after that, we try to unmap the page (placing swap entries).
So it's racy.

Note also point 2. in [1], which is related to O_DIRECT that does
currently not yet use FOLL_PIN but uses FOLL_GET.

[1] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
On Fri, Jan 21, 2022 at 08:51:57AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 20, 2022 at 07:25:08PM +0100, David Hildenbrand wrote:
> > On 20.01.22 16:55, Peter Zijlstra wrote:
> > > Add a guarantee for Anon pages that pin_user_page*() ensures the
> > > user-mapping of these pages stay preserved. In order to ensure this
> > > all rmap users have been audited:
> > >
> > >  vmscan:	already fails eviction due to page_maybe_dma_pinned()
> > >
> > >  migrate:	migration will fail on pinned pages due to
> > >		expected_page_refs() not matching, however that is
> > >		*after* try_to_migrate() has already destroyed the
> > >		user mapping of these pages. Add an early exit for
> > >		this case.
> > >
> > >  numa-balance:	as per the above, pinned pages cannot be migrated,
> > >		however numa balancing scanning will happily PROT_NONE
> > >		them to get usage information on these pages. Avoid
> > >		this for pinned pages.
> >
> > page_maybe_dma_pinned() can race with GUP-fast without
> > mm->write_protect_seq. This is a real problem for vmscan() with
> > concurrent GUP-fast as it can result in R/O mappings of pinned pages and
> > GUP will lose synchronicity to the page table on write faults due to
> > wrong COW.
>
> Urgh, so yeah, that might be a problem. Follow up code uses it like
> this:
>
> +/*
> + * Pinning a page inhibits rmap based unmap for Anon pages. Doing a load
> + * through the user mapping ensures the user mapping exists.
> + */
> +#define umcg_pin_and_load(_self, _pagep, _member)				\
> +({										\
> +	__label__ __out;							\
> +	int __ret = -EFAULT;							\
> +										\
> +	if (pin_user_pages_fast((unsigned long)(_self), 1, 0, &(_pagep)) != 1)	\
> +		goto __out;							\
> +										\
> +	if (!PageAnon(_pagep) ||						\
> +	    get_user(_member, &(_self)->_member)) {				\
> +		unpin_user_page(_pagep);					\
> +		goto __out;							\
> +	}									\
> +	__ret = 0;								\
> +__out:	__ret;									\
> +})
>
> And after that hard assumes (on the penalty of SIGKILL) that direct user
> access works. Specifically it does RmW ops on it. So I suppose I'd
> better upgrade that load to a RmW at the very least.
>
> But is that sufficient? Let me go find that race you mention...

OK, so copy_page_range() vs lockless_pages_from_mm(). Since I use
FOLL_PIN that should be sorted, it'll fall back to the slow path and use
mmap_sem and serialize against the fork().

(Also, can I express my hate for __gup_longterm_unlocked(), that
function name is utter garbage)

However, I'm not quite sure what fork() does with pages that have a pin.
There's been a number of GUP vs fork() problems over the years, but I'm
afraid I have lost track of that and I can't quickly find anything in
the code..

Naively, a page that has async DMA activity should not be CoW'ed, or if
it is, care must be taken to ensure the original pages stays in the
original process, but I realize that's somewhat hard.

Let me dig in a bit more.
On 21.01.22 09:59, Peter Zijlstra wrote:
> On Fri, Jan 21, 2022 at 08:51:57AM +0100, Peter Zijlstra wrote:
>> On Thu, Jan 20, 2022 at 07:25:08PM +0100, David Hildenbrand wrote:
>>> On 20.01.22 16:55, Peter Zijlstra wrote:
>>>> Add a guarantee for Anon pages that pin_user_page*() ensures the
>>>> user-mapping of these pages stay preserved. In order to ensure this
>>>> all rmap users have been audited:
>>>>
>>>>  vmscan:	already fails eviction due to page_maybe_dma_pinned()
>>>>
>>>>  migrate:	migration will fail on pinned pages due to
>>>>		expected_page_refs() not matching, however that is
>>>>		*after* try_to_migrate() has already destroyed the
>>>>		user mapping of these pages. Add an early exit for
>>>>		this case.
>>>>
>>>>  numa-balance:	as per the above, pinned pages cannot be migrated,
>>>>		however numa balancing scanning will happily PROT_NONE
>>>>		them to get usage information on these pages. Avoid
>>>>		this for pinned pages.
>>>
>>> page_maybe_dma_pinned() can race with GUP-fast without
>>> mm->write_protect_seq. This is a real problem for vmscan() with
>>> concurrent GUP-fast as it can result in R/O mappings of pinned pages and
>>> GUP will lose synchronicity to the page table on write faults due to
>>> wrong COW.
>>
>> Urgh, so yeah, that might be a problem. Follow up code uses it like
>> this:
>>
>> +/*
>> + * Pinning a page inhibits rmap based unmap for Anon pages. Doing a load
>> + * through the user mapping ensures the user mapping exists.
>> + */
>> +#define umcg_pin_and_load(_self, _pagep, _member)				\
>> +({										\
>> +	__label__ __out;							\
>> +	int __ret = -EFAULT;							\
>> +										\
>> +	if (pin_user_pages_fast((unsigned long)(_self), 1, 0, &(_pagep)) != 1)	\
>> +		goto __out;							\
>> +										\
>> +	if (!PageAnon(_pagep) ||						\
>> +	    get_user(_member, &(_self)->_member)) {				\
>> +		unpin_user_page(_pagep);					\
>> +		goto __out;							\
>> +	}									\
>> +	__ret = 0;								\
>> +__out:	__ret;									\
>> +})
>>
>> And after that hard assumes (on the penalty of SIGKILL) that direct user
>> access works. Specifically it does RmW ops on it. So I suppose I'd
>> better upgrade that load to a RmW at the very least.
>>
>> But is that sufficient? Let me go find that race you mention...
>
> OK, so copy_page_range() vs lockless_pages_from_mm(). Since I use
> FOLL_PIN that should be sorted, it'll fall back the slow path and use
> mmap_sem and serialize against the fork().
>
> (Also, can I express my hate for __gup_longterm_unlocked(), that
> function name is utter garbage)

Absolutely, the "_unlocked_" also caused a lot of confusion on my end in
the past.

>
> However, I'm not quite sure what fork() does with pages that have a pin.

We COW the anon pages always, and we protect against concurrent GUP
using the
* mmap_lock in exclusive mode for ordinary GUP
* mm->write_protect_seq for GUP-fast

>
> Naively, a page that has async DMA activity should not be CoW'ed, or if
> it is, care must be taken to ensure the original pages stays in the
> original process, but I realize that's somewhat hard.

That's precisely what I'm working on fixing ... and yes, it's hard.

Let me know if you need any other information, I've spent way too much
time on this than I ever planned.
On Fri, Jan 21, 2022 at 10:04:45AM +0100, David Hildenbrand wrote:
> On 21.01.22 09:59, Peter Zijlstra wrote:
> > However, I'm not quite sure what fork() does with pages that have a pin.
>
> We COW the anon pages always, and we protect against concurrent GUP
> using the
> * mmap_lock in exclusive mode for ordinary GUP
> * mm->write_protect_seq for GUP-fast

Right, but neither the mmap_sem nor the write_protect_seq help anything
at all vs already extant page pins.

But I just found copy_present_page()'s page_needs_cow_for_dma(), which I
think deals with exactly that case, it avoids doing CoW on pinned pages
and instead feeds the child a full copy while keeping the pinned page in
the original process.

> > Naively, a page that has async DMA activity should not be CoW'ed, or if
> > it is, care must be taken to ensure the original pages stays in the
> > original process, but I realize that's somewhat hard.
>
> That's precisely what I'm working on fixing ... and yes, it's hard.
>
> Let me know if you need any other information, I've spent way too much
> time on this than I ever panned.

So let me try and get this right:

 - GUP post-fork breaks CoW for FOLL_WRITE/FOLL_PIN, without either
   there's a problem where one task might observe changes by another.

 - GUP pre-fork prevents CoW and does a full copy.

And that all mostly works, except for a fair amount of 'fun' cases?
On 21.01.22 12:40, Peter Zijlstra wrote:
> On Fri, Jan 21, 2022 at 10:04:45AM +0100, David Hildenbrand wrote:
>> On 21.01.22 09:59, Peter Zijlstra wrote:
>
>>> However, I'm not quite sure what fork() does with pages that have a pin.
>>
>> We COW the anon pages always, and we protect against concurrent GUP
>> using the
>> * mmap_lock in exclusive mode for ordinary GUP
>> * mm->write_protect_seq for GUP-fast
>
> Right, but neither the mmap_sem nor the write_protect_seq help anything
> at all vs already extant page pins.
>
> But I just found copy_present_page()'s page_needs_cow_for_dma(), which I
> think deals with exactly that case, it avoids doing CoW on pinned pages
> and instead feeds the child a full copy while keeping the pinned page in
> the original process.

Yes, page_needs_cow_for_dma() is the magic bit. The locking I explained
keeps its output "reliable".

>
>>> Naively, a page that has async DMA activity should not be CoW'ed, or if
>>> it is, care must be taken to ensure the original pages stays in the
>>> original process, but I realize that's somewhat hard.
>>
>> That's precisely what I'm working on fixing ... and yes, it's hard.
>>
>> Let me know if you need any other information, I've spent way too much
>> time on this than I ever panned.
>
> So let me try and get this right:
>
>  - GUP post-fork breaks CoW for FOLL_WRITE/FOLL_PIN, without either
>    there's a problem where one task might observe changes by another.
>
>  - GUP pre-fork prevents CoW and does a full copy.

Yes, pretty much.

>
> And that all mostly works, except for a fair amount of 'fun' cases?

I'd say some obviously broken cases, some racy cases, some fun cases :)

We have three main cases. And usually, trying to tackle one triggers
another.

(1) Missed CoW

If the child R/O pins and unmaps the page, the parent might miss to CoW
and reuse the page. Security issue. One CVE in that area is currently
still applicable for THP (well, and hugetlb).

(2) Unnecessary CoW

We CoW instead of reusing the page, but there are no relevant pins, so
it's unnecessary.

(3) Wrong CoW

We CoW a page that has relevant pins, losing synchronicity between GUP
and the page tables.

The "criticality" is (1), (3), (2).

(3) can currently get triggered by anything that can map a pinned page
R/O. The racy case is what I described about the swapcache. Other broken
cases are mprotect() and friends (we cannot differentiate between R/O
and R/W pins ...).
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1472,7 +1472,15 @@ int migrate_pages(struct list_head *from
 			nr_subpages = thp_nr_pages(page);
 			cond_resched();
 
-			if (PageHuge(page))
+			/*
+			 * If the page has a pin then expected_page_refs() will
+			 * not match and the whole migration will fail later
+			 * anyway, fail early and preserve the mappings.
+			 */
+			if (page_maybe_dma_pinned(page))
+				rc = -EAGAIN;
+
+			else if (PageHuge(page))
 				rc = unmap_and_move_huge_page(get_new_page,
 						put_new_page, private, page,
 						pass > 2, mode, reason,
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -106,6 +106,12 @@ static unsigned long change_pte_range(st
 				continue;
 
 			/*
+			 * Can't migrate pinned pages, avoid touching them.
+			 */
+			if (page_maybe_dma_pinned(page))
+				continue;
+
+			/*
 			 * Don't mess with PTEs if page is already on the node
 			 * a single-threaded process is running on.
 			 */
Add a guarantee for Anon pages that pin_user_page*() ensures the
user-mapping of these pages stay preserved. In order to ensure this
all rmap users have been audited:

 vmscan:	already fails eviction due to page_maybe_dma_pinned()

 migrate:	migration will fail on pinned pages due to
		expected_page_refs() not matching, however that is
		*after* try_to_migrate() has already destroyed the
		user mapping of these pages. Add an early exit for
		this case.

 numa-balance:	as per the above, pinned pages cannot be migrated,
		however numa balancing scanning will happily PROT_NONE
		them to get usage information on these pages. Avoid
		this for pinned pages.

None of the other rmap users (damon,page-idle,mlock,..) unmap the
page, they mostly just muck about with reference,dirty flags etc.

This same guarantee cannot be provided for Shared (file) pages due to
dirty page tracking.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 mm/migrate.c  | 10 +++++++++-
 mm/mprotect.c |  6 ++++++
 2 files changed, 15 insertions(+), 1 deletion(-)