Message ID: 20210110004435.26382-1-aarcange@redhat.com (mailing list archive)
Series: mm: restore full accuracy in COW page reuse
On Sat, Jan 9, 2021 at 4:44 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Once we agree that COW page reuse requires full accuracy, [...]

You have completely and utterly ignored every single argument against that. Instead, you just continue to push your agenda.

The thing is, GUP works fine. COW works fine. The thing that is broken is clear_refs.

Yet you try to now "fix" the two fine cases, because you don't want to fix clear_refs.

What part of "clear_refs is the _least_ important of the three cases" are you not willing to understand?

            Linus
On Sat, Jan 9, 2021 at 4:55 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> What part of "clear_refs is the _least_ important of the three cases"
> are you not willing to understand?

In fact, I couldn't even turn on that code with my normal config, because it depends on CONFIG_CHECKPOINT_RESTORE that I didn't even have enabled.

IOW, that code is some special-case stuff, and instead of messing up the rest of the VM, it should be made to conform to all the normal VM rules and requirements.

Here's two patches to basically start doing that.

The first one is the same one I already sent out earlier, fixing the locking. And yes, it can be improved upon, but before improving on it, let's _fix_ the code.

The second is a trivial "oh, look, I can see that the page is pinned, soft-dirty cannot work so don't do it then". Again, it can be improved upon, most particularly by doing the same (simple) tests for the hugepage case too, which I didn't do.

Note: I have not a single actual user of this code that I can test with, so this is all ENTIRELY untested. IOW, I am in no way claiming that these patches are perfect and correct, and the only way to do things.

But what I _am_ claiming is that this clear_refs code (and the UFFD code) is of secondary importance, and instead of messing up the core VM, we should fix these special cases to not do bad things.

It really is that simple.

And no, I didn't make the UFFDIO_WRITEPROTECT code take the mmap_sem for writing. For whoever wants to look at that, it's mwriteprotect_range() in mm/userfaultfd.c and the fix is literally to turn the read-lock (and unlock) into a write-lock (and unlock).

            Linus
On Sat, Jan 9, 2021 at 5:19 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> And no, I didn't make the UFFDIO_WRITEPROTECT code take the mmap_sem
> for writing. For whoever wants to look at that, it's
> mwriteprotect_range() in mm/userfaultfd.c and the fix is literally to
> turn the read-lock (and unlock) into a write-lock (and unlock).

Oh, and if it wasn't obvious, we'll have to debate what to do with trying to mprotect() a pinned page. Do we just ignore the pinned page (the way my clear_refs patch did)? Or do we turn it into -EBUSY? Or what?

So it's not *just* the locking that needs to be fixed. But just take a look at that suggested clear_refs patch of mine - it sure isn't complicated.

            Linus
Hello Linus,

On Sat, Jan 09, 2021 at 05:19:51PM -0800, Linus Torvalds wrote:
> +#define is_cow_mapping(flags) (((flags) & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
> +
> +static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
> +{
> +	struct page *page;
> +
> +	if (!is_cow_mapping(vma->vm_flags))
> +		return false;
> +	if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
> +		return false;
> +	page = vm_normal_page(vma, addr, pte);
> +	if (!page)
> +		return false;
> +	if (page_mapcount(page) != 1)
> +		return false;
> +	return page_maybe_dma_pinned(page);
> +}

I just don't see the simplification coming from 09854ba94c6aad7886996bfbee2530b3d8a7f4f4. Instead of checking page_mapcount above as an optimization, to me it looks much simpler to check it in a single place, in do_wp_page, which in addition to optimizing away the superfluous copy, would optimize away the above complexity as well.

And I won't comment on whether it's actually safe to skip random pages or not. All I know is that for mprotect and uffd-wp, the above approach definitely wouldn't work.

In addition I dislike has_pinned and FOLL_PIN.

has_pinned  450 include/linux/mm_types.h	 * for instance during page table copying for fork().
has_pinned 1021 kernel/fork.c		atomic64_set(&mm->pinned_vm, 0);
has_pinned 1239 mm/gup.c		atomic_set(&mm->has_pinned, 1);
has_pinned 2579 mm/gup.c		atomic_set(&current->mm->has_pinned, 1);
has_pinned 1099 mm/huge_memory.c	atomic_read(&src_mm->has_pinned) &&
has_pinned 1213 mm/huge_memory.c	atomic_read(&src_mm->has_pinned) &&
has_pinned  819 mm/memory.c		if (likely(!atomic_read(&src_mm->has_pinned)))

Are we spending 32 bits of an atomic_t in mm_struct just to call atomic_set(1) on it? Why isn't it an MMF_HAS_PINNED bit that can already be set atomically under mmap_read_lock too? There are bits left free there; we haven't run out yet to justify wasting another 31 bits. I hope I'm overlooking something.
The existence of FOLL_LONGTERM is good and makes a difference at times for writeback if it's on a MAP_SHARED, or it makes a difference during GUP to do a page migration before taking the pin, but for the whole rest of the VM it's irrelevant whether the pin is long or short term. So I'm also concerned by what Jason mentioned about long term pins being treated differently within the VM. That to me looks as fundamentally flawed as introducing inaccuracies in do_wp_page, which, even ignoring the performance implications of the inaccuracy, simply prevents doing some useful things.

I obviously agree that all common workloads with no GUP pins, no clear_refs and no uffd are way more important to keep optimal, but I haven't seen a single benchmark showing the benefit of not taking the PG_lock before a page copy, or any other runtime benefit coming from page_count in do_wp_page. To the contrary, now I see additional branches in fork and in various other paths.

The only thing, again, where I see page_count providing a tangible benefit is the vmsplice attack from child to parent, but that has not been fully fixed, and if page_count is added to fix it in all COW faults, it'll introduce extra inefficiency to the very common, important workloads, not only to the special GUP/clear_refs/uffd-wp workloads, as your patch above shows.

Thanks,
Andrea
On Sat, Jan 09, 2021 at 05:37:09PM -0800, Linus Torvalds wrote:
> On Sat, Jan 9, 2021 at 5:19 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And no, I didn't make the UFFDIO_WRITEPROTECT code take the mmap_sem
> > for writing. For whoever wants to look at that, it's
> > mwriteprotect_range() in mm/userfaultfd.c and the fix is literally to
> > turn the read-lock (and unlock) into a write-lock (and unlock).
>
> Oh, and if it wasn't obvious, we'll have to debate what to do with
> trying to mprotect() a pinned page. Do we just ignore the pinned page
> (the way my clear_refs patch did)? Or do we turn it into -EBUSY? Or
> what?

Agreed, I assume mprotect would have the same effect. mprotect in parallel with a read or recvmsg may be undefined, so I didn't bring it up, but it was pretty clear. The moment the write bit is cleared (no matter why and by whom) and the PG lock released, if there's any GUP pin, GUP currently loses synchrony.

In any case I intended to help exercise the new page_count logic with the testcase, possibly to make it behave better somehow, no matter how. I admit I'm also wondering myself about the exact semantics of O_DIRECT with clear_refs or uffd-wp tracking, but the point is that losing reads and getting unexpected data in the page still doesn't look like good behavior, and it had to be at least checked.

To me, ultimately, the useful use case that becomes impossible with page_count isn't even clear_refs nor uffd-wp. The useful case that I can see zero fundamental flaws in is RDMA or some other device computing in pure readonly DMA on the data while a program runs normally and produces it. It could even be a framebuffer that doesn't care about coherency. You may want to occasionally wrprotect the memory under a readonly long term GUP pin for consistency, even against bugs of the program itself. Why should wrprotecting make the device lose synchrony? And what kind of performance do we gain for the normal useful cases by breaking the special case? Is there a benchmark showing it?

> So it's not *just* the locking that needs to be fixed. But just take a
> look at that suggested clear_refs patch of mine - it sure isn't
> complicated.

If we can skip the wrprotection it's fairly easy, I fully agree. Even then it still looks more difficult than using page_mapcount in do_wp_page in my view, so I also don't see the simplification. And overall the amount of kernel code had a net increase as a result.

Thanks,
Andrea
On Sat, Jan 9, 2021 at 6:51 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> I just don't see the simplification coming from
> 09854ba94c6aad7886996bfbee2530b3d8a7f4f4. Instead of checking
> page_mapcount above as an optimization, to me it looks much simpler to
> check it in a single place, in do_wp_page, that in addition of
> optimizing away the superfluous copy, would optimize away the above
> complexity as well.

Here's the difference:

 (a) in COW, "page_mapcount()" is pure and utter garbage, and has zero meaning.

Why? Because MAPCOUNT DOES NOT MATTER FOR COW.

COW is about "I'm about to write to this page, and that means I need an _exclusive_ page so that I don't write to a page that somebody else is using". Can you admit that fundamental fact?

Notice how "page_mapcount()" has absolutely NOTHING to do with "exclusive page". There are lots of other ways the page can be used aside from mapcount. The page cache may have a reference to the page. Somebody that did a GUP may have a reference to the page.

So what actually matters at COW time? The only thing that matters is "am I the exclusive owner". And guess what? We have a count of page references. It's "page_count()". That's *EXACTLY* the thing that says "are there maybe other references to this page".

In other words, COW needs to use page_count(). It really is that easy. End of story.

So, given that, why do I then do

> +	if (page_mapcount(page) != 1)
> +		return false;

in my patch, when I just told you that "page_mapcount()" is irrelevant for COW?

Guess what? The above isn't about COW. The above isn't about whether we need to do a copy in order to be able to write to the page without anybody else being affected by it.

No, at fork time, and at this clear_refs time, the question is entirely different. The question is not "Do I have exclusive access to the page", but it is "Did I _already_ make sure that I have exclusive access to the page because I pinned it"?

See how different the question is?

Because *if* you have done a pinned COW for some direct-IO read, and *if* that page is dirty, then you know it's mapped only in your address space. You're basically doing the _reverse_ of the COW test, and asking yourself "is this my own private pinned page"?

And then it's actually perfectly sane to do a check that says "obviously, if somebody else has this page mapped, then that's not the case".

See?

For COW, "page_mapcount()" is pure and utter garbage, and entirely meaningless. How many places it's mapped in doesn't matter. You may have to COW even if it's only mapped in your address space (page cache, GUP, whatever).

But for "did I already make this exclusive", then it's actually meaningful to say "is it mapped somewhere else". We know it has other users - that "page_may_be_pinned()" in fact *guarantees* that it has other users. But we're double-checking that the other users aren't other mappings.

That said, I did just realize that that "page_mapcount()" check is actually pointless. Because we do have a simpler one. Instead of checking whether all those references that made us go "page_might_be_pinned()" aren't other mappings, the simple check for "pte_writable()" would already have told us that we had already done the COW.

So you are actually right that the page_mapcount() test in my patch is not the best way to check for this. By the time we see "page_may_be_pinned()", we might as well just say "Oh, it's a private mapping and the pte is already writable, so we know we were the exclusive mapper, because COW and fork() already guarantee that".

> And I won't comment if it's actually safe to skip random pages or
> not. All I know is for mprotect and uffd-wp, definitely the above
> approach wouldn't work.

Why do you say that? You say "definitely", but I think you're full of it.

The fact is, if you have a pinned page, why wouldn't we say "you can't turn it read-only"? It's pinned in the VM address space - and it's pinned writable. Simple and clear semantics. You can *remove* it, but you can't change the pinning.

            Linus
On Sat, Jan 9, 2021 at 7:51 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> COW is about "I'm about to write to this page, and that means I need
> an _exclusive_ page so that I don't write to a page that somebody else
> is using".

So this kind of fundamentally explains why I hate the games we used to play wrt page_mapcount(): they were fundamentally fragile. I _much_ prefer just having the rule that we use page_count(), which the above simple and straightforward single sentence explains 100%.

This gets back to the fact that, especially considering how we've had subtle bugs here (the "wrong-way COW" issue has existed since literally the first GUP ever, so it goes back decades), I want the core VM rules to be things that can be explained basically from simple "first principles".

And the reason I argue for the current direction that I'm pushing is exactly that the above is a very simple "first principle" for why COW exists. If the rule for COW is simply "I will always COW if it's not clear that I'm the exclusive user", then COW itself is very simple to think about.

The other rule I want to stress is that COW is common, and that argues against the model we used to have of "let's lock the page to make sure that everything else is stable". That model was garbage anyway, since page locking doesn't even guarantee any stability wrt exclusive use in the first place (ie GUP being another example), but it's why I truly detested the old model that depended so much on the page lock to serialize things.

So if you start off with the rule that "I will always COW unless I can trivially see I'm the only owner", then I think we have really made for a really clear and unambiguous rule.

And remember: COW is only an issue for private mappings. So pretty much BY DEFINITION, doing a COW is always safe for all normal circumstances.

Now, this is where it does get subtle: that "all normal circumstances" part. The one special case is a cache-coherent GUP.

It's arguable whether "pinned" should matter or not, and it would obviously be better if "pinned" simply didn't matter at all (and the only issue with any long-term pinning would simply be about resource counting).

The current approach I'm advocating is "coherency means that it must have been writable", and then the way to solve the whole "Oh, it's shared with something else" is to simply never accept making it read-only, because BY DEFINITION it's not _really_ read-only (because we know we've created that other alias of the virtual address that is *not* controlled by the page table protection bits).

Notice how this is all both conceptually fairly simple (ie I can explain the rules in plain English without really making any complex argument) and it is arguably internally fairly self-consistent (ie the whole notion of "oh, there's another thing that has write access to that page but doesn't go through the page table, so trying to make it read-only in the page tables is a nonsensical operation").

Are the end results wrt something like soft-dirty a bit odd? Not really. If you do soft-dirty, such a GUP-shared page would simply always show up as dirty. That's still consistent with the rules. If somebody else may be writing to it because of GUP, that page really *isn't* clean, and us marking it read-only would be just lying about things.

I'm admittedly not very happy about mprotect() itself, though. It's actually ok to do the mprotect(PROT_READ) and turn the page read-only: that will also disable COW itself (because a page fault will now be a SIGSEGV, not a COW). But if you then make it writable again with mprotect(PROT_WRITE), you *have* lost the WP bit, and you'll COW on a write, and lose the coherency.

Now, I'm willing to just say: "if you do page pinning, and then do mprotect(PROT_READ), and then do mprotect(PROT_WRITE) and then write to the page, you really do get to keep both broken pieces". IOW, I'm perfectly happy to just say you get what you deserve.

But I'd also be perfectly happy to make the whole "I'm the exclusive user" logic a bit more extensive. Right now it's basically _purely_ page_count(), and the other part of "I'm the exclusive owner" is that the RW bit in the page table is simply not clear. That makes things really easy for COW: it just won't happen in the first place if you broke the "obviously exclusive" rule with GUP.

But we _could_ do something slightly smarter. But "page_mapcount()" is not that "slightly smarter" thing, because we already know it's broken wrt lots of other uses (GUP, page cache, whatever).

Just having a bit in the page flags for "I already made this exclusive, and fork() will keep it so" is I feel the best option. In a way, "page is writable" right now _is_ that bit. By definition, if you have a writable page in an anonymous mapping, that is an exclusive user.

But because "writable" has these interactions with other operations, it would be better if it was a harder bit than that "maybe_pinned()", though. It would be lovely if a regular non-pinning write-GUP just always set it, for example.

"maybe_pinned()" is good enough for the fork() case, which is the one that matters for long-term pinning. But it's admittedly not perfect.

            Linus
On Sun, Jan 10, 2021 at 11:30:57AM -0800, Linus Torvalds wrote:
> Notice how this is all both conceptually fairly simple (ie I can
> explain the rules in plain English without really making any complex
> argument) and it is arguably internally fairly self-consistent (ie the
> whole notion of "oh, there's another thing that has write access that
> page but doesn't go through the page table, so trying to make it
> read-only in the page tables is a nonsensical operation").

Yes exactly!

> Are the end results wrt something like soft-dirty a bit odd? Not
> really. If you do soft-dirty, such a GUP-shared page would simply
> always show up as dirty. That's still consistent with the rules. If
> somebody else may be writing to it because of GUP, that page really
> *isn't* clean, and us marking it read-only would be just lying about
> things.
>
> I'm admittedly not very happy about mprotect() itself, though. It's
> actually ok to do the mprotect(PROT_READ) and turn the page read-only:
> that will also disable COW itself (because a page fault will now be a
> SIGSEGV, not a COW).

I would say even mprotect should not set write protect on a page under DMA; it seems like the sort of thing that will trip up other parts of the mm that might sensibly assume read-only actually means read-only, not "probably read-only but there might be DMA writes anyhow".

So copy the page before write protecting it? Then the explicit mprotect is one of the system calls that can cause de-coherence of the DMA - like munmap. If we had reliable pinned detection I'd say to fail the mprotect..

And I think you are right about clear refs too. Clear refs must be aware of FOLL_LONGTERM and handle it properly - which includes not write protecting such a page.

Jason
On 1/10/21 11:30 AM, Linus Torvalds wrote:
> Just having a bit in the page flags for "I already made this
> exclusive, and fork() will keep it so" is I feel the best option. In a
> way, "page is writable" right now _is_ that bit. By definition, if you
> have a writable page in an anonymous mapping, that is an exclusive
> user.
>
> But because "writable" has these interactions with other operations,
> it would be better if it was a harder bit than that "maybe_pinned()",
> though. It would be lovely if a regular non-pinning write-GUP just
> always set it, for example.
>
> "maybe_pinned()" is good enough for the fork() case, which is the one
> that matters for long-term pinning. But it's admittedly not perfect.

There is at least one way to improve this part of it--maybe.

IMHO, a lot of the bits in page _refcount are still being wasted (even after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that there are many callers of gup/pup per page. If anyone points out that that is wrong, then the rest of this falls apart, but...if we were to make a rule that "only a very few FOLL_GET or FOLL_PIN pins are ever simultaneously allowed on a given page", then several things become possible:

1) "maybe dma pinned" could become "very likely dma-pinned", by allowing perhaps 23 bits for normal page refcounts (leaving just 8 bits for dma pins), thus all but ensuring no aliasing between normal refcounts and dma pins. The name change below to "likely" is not actually necessary, I'm just using it to make the point clear:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..28f7f64e888f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1241,7 +1241,7 @@ static inline void put_page(struct page *page)
  * get_user_pages and page_mkclean and other calls that race to set up page
  * table entries.
  */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define GUP_PIN_COUNTING_BIAS (1U << 23)

 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
@@ -1274,7 +1274,7 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
  * @Return: True, if it is likely that the page has been "dma-pinned".
  *          False, if the page is definitely not dma-pinned.
  */
-static inline bool page_maybe_dma_pinned(struct page *page)
+static inline bool page_likely_dma_pinned(struct page *page)
 {
 	if (hpage_pincount_available(page))
 		return compound_pincount(page) > 0;

2) Doing (1) allows, arguably, failing mprotect calls if page_likely_dma_pinned() returns true, because the level of confidence is much higher now.

3) An additional counter, for normal gup (FOLL_GET) is possible: divide up the top 8 bits into two separate counters of just 4 bits each. Only allow either FOLL_GET or FOLL_PIN (and btw, I'm now mentally equating FOLL_PIN with FOLL_LONGTERM), not both, on a given page. Like this:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..ce5af27e3057 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1241,7 +1241,8 @@ static inline void put_page(struct page *page)
  * get_user_pages and page_mkclean and other calls that race to set up page
  * table entries.
  */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define DMA_PIN_COUNTING_BIAS (1U << 23)
+#define GUP_PIN_COUNTING_BIAS (1U << 27)

 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,

So:

FOLL_PIN: would use DMA_PIN_COUNTING_BIAS to increment page refcount. These are long term pins for dma.

FOLL_GET: would use GUP_PIN_COUNTING_BIAS to increment page refcount. These are not long term pins.

Doing (3) would require yet another release call variant: unpin_user_pages() would need to take a flag to say which refcount to release: FOLL_GET or FOLL_PIN.

However, I think that's relatively easy (compared to the original effort of teasing apart generic put_page() calls, into unpin_user_pages() calls). We already have all the unpin_user_pages() calls in place, and so it's "merely" a matter of adding a flag to 74 call sites.

The real question is whether we can get away with supporting only a very low count of FOLL_PIN and FOLL_GET pages. Can we?

thanks,
On Sun, Jan 10, 2021 at 11:26:57PM -0800, John Hubbard wrote:
> IMHO, a lot of the bits in page _refcount are still being wasted (even
> after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that
> there are many callers of gup/pup per page. If anyone points out that
> that is wrong, then the rest of this falls apart, but...if we were to
> make a rule that "only a very few FOLL_GET or FOLL_PIN pins are ever
> simultaneously allowed on a given page", then several things become
> possible:

There's "the normal case" and then there's "the attacker case" where someone's deliberately trying to wrap page->_refcount. There are lots of interesting games people can play with an anon page, like stuffing it into (lots of) pipes, forking lots of children, starting lots of O_DIRECT I/O against it to a FUSE filesystem that's deliberately engineered to be slow.

We have some protection against that, but I'm not 100% sure it's working, and making it easier to increase the refcount in large chunks makes it more likely that we would defeat that protection.
On Sat, Jan 09, 2021 at 09:51:14PM -0500, Andrea Arcangeli wrote:
> Are we spending 32bit in mm_struct atomic_t just to call atomic_set(1)
> on it? Why isn't it a MMF_HAS_PINNED that already can be set
> atomically under mmap_read_lock too? There's bit left free there, we
> didn't run out yet to justify wasting another 31 bits. I hope I'm
> overlooking something.

It needs to be atomic because it is set under gup fast, with no mmap lock. Peter and myself did not find another place to put this that was already atomic.

> The existence of FOLL_LONGTERM is good and makes a difference at times
> for writeback if it's on a MAP_SHARED, or it makes difference during
> GUP to do a page_migrate before taking the pin, but for the whole rest
> of the VM it's irrelevant if it's long or short term, so I'm also
> concerned from what Jason mentioned about long term pins being treated
> differently within the VM.

Why? They are different. Write protect doesn't stop modification of the data. How is that not a relevant and real difference?

Jason
On Sun, Jan 10, 2021 at 11:26:57PM -0800, John Hubbard wrote:
> So:
>
> FOLL_PIN: would use DMA_PIN_COUNTING_BIAS to increment page refcount.
> These are long term pins for dma.
>
> FOLL_GET: would use GUP_PIN_COUNTING_BIAS to increment page refcount.
> These are not long term pins.

Do we have any places yet that care about the long-term-ness? The only place long term would matter is if something is waiting for the page ref count to drop.

Jason
On Mon 11-01-21 12:05:49, Jason Gunthorpe wrote:
> On Sun, Jan 10, 2021 at 11:26:57PM -0800, John Hubbard wrote:
> > So:
> >
> > FOLL_PIN: would use DMA_PIN_COUNTING_BIAS to increment page refcount.
> > These are long term pins for dma.
> >
> > FOLL_GET: would use GUP_PIN_COUNTING_BIAS to increment page refcount.
> > These are not long term pins.
>
> Do we have any places yet that care about the long-term-ness?

I was hoping to use that information to distinguish ephemeral migration failures due to page references from long term pins. The latter would be a hard failure for the migration.
On Sun, Jan 10, 2021 at 11:27 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> There is at least one way to improve this part of it--maybe.

It's problematic..

> IMHO, a lot of the bits in page _refcount are still being wasted (even
> after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that
> there are many callers of gup/pup per page.

It may be unlikely under real loads.

But we've actually had overflow issues on this because rather than real loads you can do attack loads (ie "lots of processes, lots of pipe file descriptors, lots of vmsplice() operations on the same page").

We had to literally add that conditional "try_get_page()" that protects against overflow..

            Linus
On Mon, Jan 11, 2021 at 11:19 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Sun, Jan 10, 2021 at 11:27 PM John Hubbard <jhubbard@nvidia.com> wrote:
> > IMHO, a lot of the bits in page _refcount are still being wasted (even
> > after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that
> > there are many callers of gup/pup per page.
>
> It may be unlikely under real loads.
>
> But we've actually had overflow issues on this because rather than
> real loads you can do attack loads (ie "lots of processes, lots of
> pipe file descriptors, lots of vmsplice() operations on the same
> page").
>
> We had to literally add that conditional "try_get_page()" that
> protects against overflow..

Actually, what I think might be a better model is to actually strengthen the rules even more, and get rid of GUP_PIN_COUNTING_BIAS entirely.

What we could do is just make a few clear rules explicit (most of which we already basically hold to). Starting from that basic

 (a) Anonymous pages are made writable (ie COW) only when they have a page_count() of 1

That very simple rule then automatically results in the corollary

 (b) a writable page in a COW mapping always starts out reachable _only_ from the page tables

and now we could have a couple of really simple new rules:

 (c) we never ever make a writable page in a COW mapping read-only _unless_ it has a page_count() of 1

 (d) we never create a swap cache page out of a writable COW mapping page

Now, if you combine these rules, the whole need for the GUP_PIN_COUNTING_BIAS basically goes away.

Why? Because we know that the _only_ thing that can elevate the refcount of a writable COW page is GUP - we'll just make sure nothing else touches it.

The whole "optimistic page references through page cache" etc are complete non-issues, because the whole point is that we already know it's not a page cache page. There is simply no other way to reach that page than through GUP.

Ergo: any writable pte in a COW mapping that has a page with a page_count() > 1 is pinned by definition, and thus our page_maybe_dma_pinned(page) could remove that "maybe" part, and simply check for page_count(page) > 1 (although the rule would be that this is only valid for a cow_mapping pte, and only while holding the page table lock! So maybe it would be good to pass in the vma and have an assert for that lock too).

And the thing is, none of the above rules are complicated.

The only new one would be the requirement that you cannot add a page to the swap cache unless it is read-only in the page tables. That may be the biggest hurdle here.

The way we handle swap cache is that we *first* add it to the swap cache, and then we do a "try_to_unmap()" on it. So we currently don't actually try to walk the page tables until we have already done that swap cache thing.

But I do think that the only major problem spot is that shrink_page_list() -> add_to_swap() case, and I think we could make add_to_swap() just do the rmap walk and turn it read-only first.

(And it's worth pointing out that I'm only talking about regular non-huge pages above; the rules for splitting hugepages may impact those cases differently, I didn't really even try to think about those cases.)

But that add_to_swap() case might make it too painful. It _would_ simplify our rules wrt anonymous mappings enormously, though.

            Linus
On Mon, Jan 11, 2021 at 2:18 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Actually, what I think might be a better model is to actually
> strengthen the rules even more, and get rid of GUP_PIN_COUNTING_BIAS
> entirely.
>
> What we could do is just make a few clear rules explicit (most of
> which we already basically hold to). Starting from that basic
>
>  (a) Anonymous pages are made writable (ie COW) only when they have a
> page_count() of 1

Seems reasonable to me.

> That very simple rule then automatically results in the corollary
>
>  (b) a writable page in a COW mapping always starts out reachable
> _only_ from the page tables

Seems reasonable. I guess that if the COW is triggered by GUP, then it starts out reachable only from the page tables but then becomes reachable through GUP very soon thereafter.

> and now we could have a couple of really simple new rules:
>
>  (c) we never ever make a writable page in a COW mapping read-only
> _unless_ it has a page_count() of 1

I don't love this. Having mprotect() fail in a multithreaded process because another thread happens to be doing a short-lived IO seems like it may result in annoying intermittent bugs.

As I understand it, the issue is that the way we determine that we need to COW a COWable page is that we see that it's read-only. It would be nice if we could separately track "the VMA allows writes" and "this PTE points to a page that is private to the owning VMA", but maybe there's no bit available for the latter other than looking at RO vs RW directly.

>  (d) we never create a swap cache page out of a writable COW mapping page
>
> Now, if you combine these rules, the whole need for the
> GUP_PIN_COUNTING_BIAS basically goes away.
>
> Why? Because we know that the _only_ thing that can elevate the
> refcount of a writable COW page is GUP - we'll just make sure nothing
> else touches it.

How common is !FOLL_WRITE GUP? We could potentially say that a short-term !FOLL_WRITE GUP is permitted on an RO COW page and that a subsequent COW on the page will wait for the GUP to go away. This might be too big a can of worms for the benefit it would provide, though.

--Andy
On Mon, Jan 11, 2021 at 02:18:13PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2021 at 11:19 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Sun, Jan 10, 2021 at 11:27 PM John Hubbard <jhubbard@nvidia.com> wrote:
> > > IMHO, a lot of the bits in page _refcount are still being wasted (even
> > > after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that
> > > there are many callers of gup/pup per page.
> >
> > It may be unlikely under real loads.
> >
> > But we've actually had overflow issues on this because rather than
> > real loads you can do attack loads (ie "lots of processes, lots of
> > pipe file descriptors, lots of vmsplice() operations on the same
> > page".
> >
> > We had to literally add that conditional "try_get_page()" that
> > protects against overflow..
>
> Actually, what I think might be a better model is to actually
> strengthen the rules even more, and get rid of GUP_PIN_COUNTING_BIAS
> entirely.
>
> What we could do is just make a few clear rules explicit (most of
> which we already basically hold to). Starting from that basic
>
>  (a) Anonymous pages are made writable (ie COW) only when they have a
>      page_count() of 1
>
> That very simple rule then automatically results in the corollary
>
>  (b) a writable page in a COW mapping always starts out reachable
>      _only_ from the page tables
>
> and now we could have a couple of really simple new rules:
>
>  (c) we never ever make a writable page in a COW mapping read-only
>      _unless_ it has a page_count() of 1

This breaks mprotect(R_ONLY); I do not think we want to do that. It might break security schemes for userspace applications which expect mprotect to make CPU mappings read-only.

Maybe an alternative would be to copy the page on mprotect for pages that do not have a page_count of 1? But that makes me uneasy about short-lived GUPs (direct IO racing with an mprotect, or maybe simply even page migration) versus unbounded ones (like RDMA).
Also, I want to make sure I properly understand what happens on fork() on a COW mapping for a page that has a page_count > 1: we copy the page instead of write-protecting the page?

I believe it would be better here to write-protect the page on the CPU but forbid the child to reuse the page, i.e., if the child ever inherits the page (parent unmapped the page, for instance) it will have to make a copy, and the GUP reference (taken before the fork) might linger on a page that is no longer associated with any VM. This way we keep fast fork.

Jérôme
On Mon, Jan 11, 2021 at 02:18:13PM -0800, Linus Torvalds wrote:
> The whole "optimistic page references through page cache" etc are
> complete non-issues, because the whole point is that we already know
> it's not a page cache page. There is simply no other way to reach that
> page than through GUP.

The thing about the speculative page cache references is that they can temporarily bump a refcount on a page which _used_ to be in the page cache and has now been reallocated as some other kind of page.

Now, this is obviously rare, so if it's only a performance question, it'll be fine. If there's a correctness issue with copying pages that would otherwise not have been copied, then it's a problem.
On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox <willy@infradead.org> wrote: > > The thing about the speculative page cache references is that they can > temporarily bump a refcount on a page which _used_ to be in the page > cache and has now been reallocated as some other kind of page. Right you are. Yeah, scratch the "we could make it absolute on 1", and we do need the PIN count elevation. Linus
On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> The thing about the speculative page cache references is that they can
> temporarily bump a refcount on a page which _used_ to be in the page
> cache and has now been reallocated as some other kind of page.

Oh, and thinking about this made me think we might actually have a serious bug here, and it has nothing what-so-ever to do with COW, GUP, or even the page count itself.

It's unlikely enough that I think it's mostly theoretical, but tell me I'm wrong.

PLEASE tell me I'm wrong:

CPU1 does page_cache_get_speculative under RCU lock

CPU2 frees and re-uses the page

        CPU1                                    CPU2
        ----                                    ----

        page = xas_load(&xas);
        if (!page_cache_get_speculative(page))
            goto repeat;
        .. succeeds ..
                                                remove page from XA
                                                release page
                                                reuse for something else
        .. and then re-check ..
        if (unlikely(page != xas_reload(&xas))) {
            put_page(page);
            goto repeat;
        }

ok, the above all looks fine. We got the speculative ref, but then we noticed that it's not valid any more, so we put it again. All good, right?

Wrong.

What if that "reuse for something else" was actually really quick, and both allocated and released it?

That still sounds good, right? Yes, now the "put_page()" will be the one that _actually_ releases the page, but we're still fine, right?

Very very wrong.

The "reuse for something else" on CPU2 might have gotten not an order-0 page, but a *high-order* page. So it allocated (and then immediately free'd) maybe an order-2 allocation with _four_ pages, and the re-use happened when we had coalesced the buddy pages.

But when we release the page on CPU1, we will release just _one_ page, and the other three pages will be lost forever.

IOW, we restored the page count perfectly fine, but we screwed up the page sizes and buddy information.

Ok, so the above is so unlikely from a timing standpoint that I don't think it ever happens, but I don't see why it couldn't happen in theory.
Please somebody tell me I'm missing some clever thing we do to make sure this can actually not happen.. Linus
On 13.01.21 04:31, Linus Torvalds wrote: > On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox <willy@infradead.org> wrote: >> >> The thing about the speculative page cache references is that they can >> temporarily bump a refcount on a page which _used_ to be in the page >> cache and has now been reallocated as some other kind of page. > > Oh, and thinking about this made me think we might actually have a > serious bug here, and it has nothing what-so-ever to do with COW, GUP, > or even the page count itself. > > It's unlikely enough that I think it's mostly theoretical, but tell me > I'm wrong. > > PLEASE tell me I'm wrong: > > CPU1 does page_cache_get_speculative under RCU lock > > CPU2 frees and re-uses the page > > CPU1 CPU2 > ---- ---- > > page = xas_load(&xas); > if (!page_cache_get_speculative(page)) > goto repeat; > .. succeeds .. > > remove page from XA > release page > reuse for something else > > .. and then re-check .. > if (unlikely(page != xas_reload(&xas))) { > put_page(page); > goto repeat; > } > > ok, the above all looks fine. We got the speculative ref, but then we > noticed that its' not valid any more, so we put it again. All good, > right? > > Wrong. > > What if that "reuse for something else" was actually really quick, and > both allocated and released it? > > That still sounds good, right? Yes, now the "put_page()" will be the > one that _actually_ releases the page, but we're still fine, right? > > Very very wrong. > > The "reuse for something else" on CPU2 might have gotten not an > order-0 page, but a *high-order* page. So it allocated (and then > immediately free'd) maybe an order-2 allocation with _four_ pages, and > the re-use happened when we had coalesced the buddy pages. > > But when we release the page on CPU1, we will release just _one_ page, > and the other three pages will be lost forever. > > IOW, we restored the page count perfectly fine, but we screwed up the > page sizes and buddy information. 
>
> Ok, so the above is so unlikely from a timing standpoint that I don't
> think it ever happens, but I don't see why it couldn't happen in
> theory.
>
> Please somebody tell me I'm missing some clever thing we do to make
> sure this can actually not happen..

Wasn't that tackled by the latest (not merged, AFAIK) __free_pages() changes?

I'm only able to come up with the doc update, not with the original fix/change:

https://lkml.kernel.org/r/20201027025523.3235-1-willy@infradead.org
On 13.01.21 09:52, David Hildenbrand wrote: > On 13.01.21 04:31, Linus Torvalds wrote: >> On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox <willy@infradead.org> wrote: >>> >>> The thing about the speculative page cache references is that they can >>> temporarily bump a refcount on a page which _used_ to be in the page >>> cache and has now been reallocated as some other kind of page. >> >> Oh, and thinking about this made me think we might actually have a >> serious bug here, and it has nothing what-so-ever to do with COW, GUP, >> or even the page count itself. >> >> It's unlikely enough that I think it's mostly theoretical, but tell me >> I'm wrong. >> >> PLEASE tell me I'm wrong: >> >> CPU1 does page_cache_get_speculative under RCU lock >> >> CPU2 frees and re-uses the page >> >> CPU1 CPU2 >> ---- ---- >> >> page = xas_load(&xas); >> if (!page_cache_get_speculative(page)) >> goto repeat; >> .. succeeds .. >> >> remove page from XA >> release page >> reuse for something else >> >> .. and then re-check .. >> if (unlikely(page != xas_reload(&xas))) { >> put_page(page); >> goto repeat; >> } >> >> ok, the above all looks fine. We got the speculative ref, but then we >> noticed that its' not valid any more, so we put it again. All good, >> right? >> >> Wrong. >> >> What if that "reuse for something else" was actually really quick, and >> both allocated and released it? >> >> That still sounds good, right? Yes, now the "put_page()" will be the >> one that _actually_ releases the page, but we're still fine, right? >> >> Very very wrong. >> >> The "reuse for something else" on CPU2 might have gotten not an >> order-0 page, but a *high-order* page. So it allocated (and then >> immediately free'd) maybe an order-2 allocation with _four_ pages, and >> the re-use happened when we had coalesced the buddy pages. >> >> But when we release the page on CPU1, we will release just _one_ page, >> and the other three pages will be lost forever. 
>> >> IOW, we restored the page count perfectly fine, but we screwed up the >> page sizes and buddy information. >> >> Ok, so the above is so unlikely from a timing standpoint that I don't >> think it ever happens, but I don't see why it couldn't happen in >> theory. >> >> Please somebody tell me I'm missing some clever thing we do to make >> sure this can actually not happen.. > > Wasn't that tackled by latest (not merged AFAIKs) __free_pages() changes? > > I'm only able to come up with the doc update, not with the oroginal > fix/change > > https://lkml.kernel.org/r/20201027025523.3235-1-willy@infradead.org > Sorry, found it, it's in v5.10 commit e320d3012d25b1fb5f3df4edb7bd44a1c362ec10 Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Tue Oct 13 16:56:04 2020 -0700 mm/page_alloc.c: fix freeing non-compound pages and commit 7f194fbb2dd75e9346b305b8902e177b423b1062 Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Mon Dec 14 19:11:09 2020 -0800 mm/page_alloc: add __free_pages() documentation is in v5.11-rc1
On Tue, Jan 12, 2021 at 07:31:07PM -0800, Linus Torvalds wrote: > On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > The thing about the speculative page cache references is that they can > > temporarily bump a refcount on a page which _used_ to be in the page > > cache and has now been reallocated as some other kind of page. > > Oh, and thinking about this made me think we might actually have a > serious bug here, and it has nothing what-so-ever to do with COW, GUP, > or even the page count itself. > > It's unlikely enough that I think it's mostly theoretical, but tell me > I'm wrong. > > PLEASE tell me I'm wrong: > > CPU1 does page_cache_get_speculative under RCU lock > > CPU2 frees and re-uses the page > > CPU1 CPU2 > ---- ---- > > page = xas_load(&xas); > if (!page_cache_get_speculative(page)) > goto repeat; > .. succeeds .. > > remove page from XA > release page > reuse for something else How can it be reused if CPU1 hold reference to it? > > .. and then re-check .. > if (unlikely(page != xas_reload(&xas))) { > put_page(page); > goto repeat; > } >
On Wed, Jan 13, 2021 at 03:32:32PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jan 12, 2021 at 07:31:07PM -0800, Linus Torvalds wrote:
> > On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > The thing about the speculative page cache references is that they can
> > > temporarily bump a refcount on a page which _used_ to be in the page
> > > cache and has now been reallocated as some other kind of page.
> >
> > Oh, and thinking about this made me think we might actually have a
> > serious bug here, and it has nothing what-so-ever to do with COW, GUP,
> > or even the page count itself.
> >
> > It's unlikely enough that I think it's mostly theoretical, but tell me
> > I'm wrong.
> >
> > PLEASE tell me I'm wrong:
> >
> > CPU1 does page_cache_get_speculative under RCU lock
> >
> > CPU2 frees and re-uses the page
> >
> >         CPU1                                    CPU2
> >         ----                                    ----
> >
> >         page = xas_load(&xas);
> >         if (!page_cache_get_speculative(page))
> >             goto repeat;
> >         .. succeeds ..
> >                                                 remove page from XA
> >                                                 release page
> >                                                 reuse for something else
> >         .. and then re-check ..
> >         if (unlikely(page != xas_reload(&xas))) {
> >             put_page(page);
> >             goto repeat;
> >         }
>
> How can it be reused if CPU1 hold reference to it?

Yes, Linus mis-stated it:

    page = xas_load(&xas);
                                                remove page from XA
                                                release page
                                                reuse for something else
    if (!page_cache_get_speculative(page))
        goto repeat;
    if (unlikely(page != xas_reload(&xas))) {
        put_page(page);

... but as David pointed out, I fixed this in e320d3012d25
On Wed, Jan 13, 2021 at 4:56 AM Matthew Wilcox <willy@infradead.org> wrote: > > Yes, Linus mis-stated it: Yeah, I got the order wrong. > ... but as David pointed out, I fixed this in e320d3012d25 .. and I must have seen it, but not really internalized it. And now that I look at it more closely, I'm actually surprised that other than the magic "speculative first page" case we don't seem to use page reference counting for non-order-0 pages (which would break that hack horribly). Linus
On Sun, Jan 10, 2021 at 11:30:57AM -0800, Linus Torvalds wrote:
> So if you start off with the rule that "I will always COW unless I can
> trivially see I'm the only owner", then I think we have really made
> for a really clear and unambiguous rule.

I must confess that's the major reason that when I saw the COW simplification patch I felt it was great. Not only because it's an easier way to understand all of this, but also because it helped unbreak uffd-wp at that moment by dropping the patch of "COW for read gups" instead of introducing more things to fix stuff, like the FOLL_BREAK_COW idea [1].

But it's a pity that only now, after I read all the threads... I found we might start to lose things for the beauty the COW patch brought. It turns out the simplicity is just not for free; there is always a cost. The complexity around GUP has always existed, probably because GUP is hard itself! It's just a matter of where the complexity resides: either in do_wp_page(), or in the rest of the code in some other way, e.g., a complicated version of page_trans_huge_map_swapcount().

For example, in the future we'll have no way to wr-protect any pinned pages, no matter for soft-dirty or uffd-wp or anything else: it'll be illegal by definition! That reasoning is not extremely easy to arrive at by instinct, I'd say... "DMA-pinned pages could be written after all by the device", that's true, but how about read DMA pins? Oh, read DMA pins do not exist, because if we tried a read DMA pin it would break... but is that a strong enough reason?

We probably want to fix mprotect() too with pinned pages, because even if mprotect(READ) is fine, then in mprotect(READ|WRITE) we'll need to make sure the write bit is solid if the page is pinned (because now we'll assume all pinned pages should always have the write bit set in their ptes), or another alternative could be that we'll just fail some mprotect() calls upon pinned pages. Limitations just start to pile up, it seems to me.
We also unveiled the vmsplice issue with THP, because now on COW we don't have symmetric behavior for small/huge pages any more: for small pages, we'll do the page_count() check to make the copy decision, while we're still using page_mapcount() for THPs. Do we finally need to convert do_huge_pmd_wp_page() to also use the same logic as the small pages? The risk in front is not clear to me, though.

GUP(write=0) will be extremely special from now on, simply because a read won't trigger COW... I must confess that, by instinct, GUP(write=0) should simply mean "I want to get this page, but for reading". Then I realized it's not extremely obvious to me why it should be treated totally differently from a write GUP on this matter.

Due to my lack of memory-management experience, I only just started to notice that a huge amount of the work previously done hard on maintaining page_mapcount() was simply to make sure GUP pages keep their intuitive meaning, which is: when we GUP a page, it'll always be coherent with the page we've got in the process page table. Now that instinct will go away too, because the GUPed page could now be a different page than the one that sits in the pgtable.

Then I remembered something very nice about the COW simplification change: as the test case reported [2], there is a huge boost of 31.4% after COW simplification. Today I went back and tried to understand why, because I suddenly found I cannot fully understand it - I thought it was majorly because of the page lock, but maybe not, since the page lock is actually a fine-granularity lock in this test that covers 1G of mem.

The test was carried out on a host with 192G mem + 104 cores; what it did was simply:

./usemem --runtime 300 -n 104 --prealloc --prefault 959297984

./usemem is the memory workload that does memory accesses; it fork()s into 104 processes with a shared 1G mem region for those accesses, so after it's all done it could use up to 104G of pages after all the COWs.
When it runs, COW happens very frequently across merely all the cores, triggering the do_wp_page() path. However, 104 processes won't easily collapse on the same 4K page at the same time to contend the page lock.

I just noticed it may simply be the overhead of reuse_swap_page(). But how heavy would reuse_swap_page() be? It'll be heavier if THP is the case, because page_trans_huge_map_swapcount() has a complicated loop over those small pages; then I found that indeed the test was carried out with THP enabled and also on by default:

CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y

IIUC that'll make reuse_swap_page() very heavy, I think. Because initially the 100+ processes share some THPs, the first writes on the THPs will split the THPs, then reuse_swap_page() will always take the slowest path, looping over each small page for each small-page COW write.

So that 31.4% number got "shrunk" a bit after I noticed all of this - this is already a specific test which merely does COW and nothing else. It'll be hard to tell how much performance gain we can get from this simplification of COW on other real-life workloads.

Not to mention that removing reuse_swap_page() seems to also delay swap recycling (which is something we'll need to do sooner or later; so if COW got faster, something else got slower, e.g. the page reclaim logic), and I noticed Ying tried to partly recover it [3]. It's not clear to me where the best place to do the recycling is, but that sounds like a different problem.

The two major benefits to me of the current COW simplification are simplicity of logic and performance gains. But they start to fade away with the above. Not really that much, but big enough to make me start questioning myself on whether it's the best approach from a purely technical pov...

Then I also noticed that the other path of FOLL_BREAK_COW [1] might be able to fix all of the above frustrations, as mentioned by others.
It's still complicated; I really don't like that part. But again, it just seems to be a matter of where the complexity resides - there's just no way to avoid that complexity. What I'm afraid of is that the complexity is there by nature and we can't change it. I'm also afraid that if we go the current way it'll be even more complicated in the end. So it seems wise to think more about the direction, since we're at a crossroads, before starting to fix all the todos, e.g. soft-dirty and mprotect against pinned pages, and all the rest of the things we'd need to fix with COW page reuse under the current solution.

I feel extremely sorry to have dumped my brain like this, especially considering the above could be completely garbage, so I wasted the time of a lot of people reading it or simply scanning it over... However, I expressed it just in case it could help in some form.

Thanks,

[1] https://lore.kernel.org/lkml/20200811183950.10603-1-peterx@redhat.com/
[2] https://lore.kernel.org/lkml/20200914024321.GG26874@shao2-debian/
[3] https://lore.kernel.org/lkml/20210113024241.179113-1-ying.huang@intel.com/
On 10.01.21 01:44, Andrea Arcangeli wrote: > Hello Andrew and everyone, > > Once we agree that COW page reuse requires full accuracy, the next > step is to re-apply 17839856fd588f4ab6b789f482ed3ffd7c403e1f and to > return going in that direction. After stumbling over the heated discussion around this, I wanted to understand the details and the different opinions. I tried to summarize in my simple words (bear with me) what happened and how I think we can proceed from here. Maybe that helps. ==== What happened: 1) We simplified handling of faults on write-protected pages (page table entries): we changed the logic when we can reuse a page ("simply unprotecting it"), and when we have to copy it instead (COW). The essence of the simplification is, that we only reuse a page if we are the only single user of the page, meaning page_count(page) == 1, and the page is mapped into a single process (page_mapcount(page) == 1); otherwise we copy it. Simple. 2) The old code was complicated and there are GUP (e.g., RDMA, VFIO) cases that were broken in various ways in the old code already: most prominently fork(). As one example, it would have been possible for mprotect(READ) memory to still get modified by GUP users like RDMA. Write protection (AFAIU via any mechanism) after GUP pinned a page was not effective; the page was not copied. 3) Speculative pagecache reference can temporarily bump up the page_count(page), resulting in false positives. We could see page_count(page) > 1, although we're the single instance that actually uses a page. In the simplified code, we might copy a page although not necessary (I cannot tell how often that actually happens). 4) clear_refs(4) ("measure approximately how much memory a process is using"), uffd-wp (let's call it "lightweight write-protection, handling the actual fault in user space"), and mprotect(READ) all write-protect page table entries to generate faults on next write access. 
With the simplified code, we will COW whenever we find the page_count(page) > 1. The simplification seemed to regress clear_refs and uffdio-wp code (AFAIU in case of uffd-wp, it results in memory corruption). But looks like we can mostly fix it by adding more extensive locking. 5) Mechanisms like GUP (AFAIU including Direct I/O) also takes references on pages, increasing page_count(). With the simplification, we might now end up copying a page, although there is "somewhat" only a single user/"process" involved. One example is RDMA: if we read memory using RDMA and mprotect(READ) such memory, we might end up copying the underlying page on the next write: suddenly, RDMA is disconnected and will no longer read what is getting written. Not to mention, we consume more memory. AFAIU, other examples include direct I/O (e.g., write() with O_DIRECT). AFAIU, a more extreme case is probably VFIO: A VM with VFIO (e.g., passthrough of a PCI device) can essentially be corrupted by "echo 4 > /proc/[pid]/clear_refs". 6) While some people think it is okay to break GUP further, as it is already broken in various other ways, other people think this is changing something that used to work (AFAIU a user-visible change) with little benefit. 7) There is no easy way to detect if a page really was pinned: we might have false positives. Further, there is no way to distinguish if it was pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking we most probably would need more counters, which we cannot fit into struct page. (AFAIU, for huge pages it's easier). However, AFAIU, even being able to detect if (and how) a page was pinned would not completely help to solve the puzzle. 8) We have a vmsplice security issue that has to be fixed by touching the code in question. A forked child process can read memory content of its parent, which was modified by the parent after fork. AFAIU, the fix will further lock us in into the direction of the code we are heading. 
9) The simplification is part of v5.10, which is an LTS release. AFAIU, that one needs fixing, too.

I see the following possible directions we can head:

A) Keep the simplification. Try fixing the fallout. Keep the GUP cases broken, or make mprotect() fail when detecting such a scenario; AFAIU, both are user-visible changes.

B) Keep the simplification. Try fixing the fallout. Fix GUP cases that used to work; AFAIU fixing this is the hard/impossible part, and is undesired by some people.

C) Revert the simplification for now. Go back to the drawing board and use what we learned to come up with a simplification that (all?) people are happy with.

D) Revert the simplification: turns out the code could not get simplified to this extent. We learned a lot, though.

======

Please let me know in case I messed up anything and/or missed important points.
On Fri, Jan 15, 2021 at 09:59:23AM +0100, David Hildenbrand wrote:
> AFAIU, a more extreme case is probably VFIO: A VM with VFIO (e.g.,
> passthrough of a PCI device) can essentially be corrupted by "echo 4 >
> /proc/[pid]/clear_refs".

I've been told that when doing migration with RDMA, the VM's memory also ends up pinned, and then it does the stuff of #4. So it deliberately does clear_refs(4) on RDMA-pinned memory and requires no COW. This is now a real-world uABI break, unfortunately.

> 7) There is no easy way to detect if a page really was pinned: we might
> have false positives. Further, there is no way to distinguish if it was
> pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking
> we most probably would need more counters, which we cannot fit into
> struct page. (AFAIU, for huge pages it's easier).

I think this is the real issue. We can only store so much information, so we have to decide which things work and which things are broken. So far nobody has presented a way to record everything, at least..

> However, AFAIU, even being able to detect if (and how) a page was pinned
> would not completely help to solve the puzzle.

At least for COW reuse, if we assign labels to every page user, and imagine we can track everything, I think we get this list:

 - # of ptes referencing the page (mapcount?)
 - # of page * pointer references that don't touch data (ie the
   speculative page cache ref)
 - # of DMA/CPU readers
 - # of DMA/CPU writers
 - # of long term data accesses
 - # of other reader/writers (specifically process incoherent
   reader/writers, not "DMA with the CPU" like vmsplice/iouring)

Maybe there are more? This is what I've understood so far from this thread.
Today's kernel makes the COW reuse decision as:

  # ptes == 1 &&
  # refs == 0 &&
  # DMA readers == 0 &&
  # DMA writers == 0 &&
  # of longterm == 0 &&
  # other reader/writers == 0

(in essence this is what _refcount == 1 is saying, I think)

From a GUP perspective I think the useful property is "a physical page under GUP is not indirectly removed from the mm_struct that pinned it". This is the idea that the process CPU page table and the ongoing DMA remain synchronized. This is a generalized statement from the clear_refs(4) and fork() regressions.

Therefore, COW should not copy a page just because it is under GUP; it breaks the idea directly. We've also said speculative #refs should not cause COW. Removing both of those gets us to the COW reuse decision as:

  # ptes == 1 &&
  # other reader/writers == 0

And I think where Linus is coming from is that '# ptes' (eg mapcount) alone is not right because there are other relevant reader/writers too. (I'm not sure what these are, has someone pointed at one?)

So, we have 64 bits for _refcount and _mapcount, and we currently encode things as:

 - # ptes                      (_mapcount)
 - # page pointers +           (low bits of _refcount)
   # DMA readers + writers +
   # other reader/writers +
   # ptes                      (we incr both _mapcount and _refcount?)
 - # long term data accesses   (high bits of _refcount)

If we move '# other reader/writers' to _mapcount (maybe with a shift), does it help?

We also talked about GUP as meaning wrprotect == 0, but we could also change that to the idea that GUP means COW will always re-use, eg '#ptes == 1 && # other reader/writers == 0'. This gives some definition to what mprotect(PROT_READ) means for pages under DMA (though I still think PROT_READ of pages under DMA write is weird).

> 8) We have a vmsplice security issue that has to be fixed by touching
> the code in question. A forked child process can read memory content of
> its parent, which was modified by the parent after fork. AFAIU, the fix
> will further lock us in into the direction of the code we are heading.
No, vmsplice is just wrong. vmsplice has to do FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE for read-only access to pages if userspace controls the duration of the pin.

There are other bad bugs, like permanently locking DAX/CMA/ZONE_MIGRATE memory, if the above pattern is not used. There was some debate over alternatives, but for a backport security fix it has to be the above, AFAIK.

Jason
>> 7) There is no easy way to detect if a page really was pinned: we might
>> have false positives. Further, there is no way to distinguish if it was
>> pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking
>> we most probably would need more counters, which we cannot fit into
>> struct page. (AFAIU, for huge pages it's easier).
>
> I think this is the real issue. We can only store so much information,
> so we have to decide which things work and which things are broken. So
> far someone hasn't presented a way to record everything at least..

I do wonder how many (especially long-term) GUP readers/writers we have to expect and, especially, support for a single base page. Do we have a rough estimate?

With RDMA, I would assume we only need a single one (e.g., one RDMA device; I'm pretty sure I'm wrong, that sounds too easy).
With VFIO I guess we need one for each VFIO container (~ in the worst case one for each passthrough device).
With direct I/O, vmsplice and other GUP users?? No idea.

If we could somehow put a limit on the #GUPs we support, and fail further GUPs (e.g., -EAGAIN?) once a limit is reached, we could partition the refcount into something like (assume a max of #15 GUP READ and #15 GUP R/W, which is most probably a horribly bad choice):

[ GUP READ ][ GUP R/W ][ ordinary ]
 31 ... 28   27 ... 24  23 .... 0

But due to saturation handling in "ordinary", we would lose a further 2 bits (AFAIU), leaving us "only" 22 bits for "ordinary". Now, I have no idea how many bits we actually need in practice.

Maybe we need fewer bits for GUP READ, because most users want GUP R/W? No idea.

Just wild ideas. Most probably that has already been discussed, and most probably people figured that it's impossible :)
On Fri, Jan 15, 2021 at 08:46:48PM +0100, David Hildenbrand wrote: > Just wild ideas. Most probably that has already been discussed, and most > probably people figured that it's impossible :) No, I think it is all fair topics. There is no API reason for any of this to be limited, but in practice, I doubt there is more than small 10's Huge pages complicate things of course Jason
On 1/15/21 11:46 AM, David Hildenbrand wrote: >>> 7) There is no easy way to detect if a page really was pinned: we might >>> have false positives. Further, there is no way to distinguish if it was >>> pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking >>> we most probably would need more counters, which we cannot fit into >>> struct page. (AFAIU, for huge pages it's easier). >> >> I think this is the real issue. We can only store so much information, >> so we have to decide which things work and which things are broken. So >> far someone hasn't presented a way to record everything at least.. > > I do wonder how many (especially long-term) GUP readers/writers we have > to expect, and especially, support for a single base page. Do we have a > rough estimate? > > With RDMA, I would assume we only need a single one (e.g., once RDMA > device; I'm pretty sure I'm wrong, sounds too easy). > With VFIO I guess we need one for each VFIO container (~ in the worst > case one for each passthrough device). > With direct I/O, vmsplice and other GUP users ?? No idea. > > If we could somehow put a limit on the #GUP we support, and fail further > GUP (e.g., -EAGAIN?) once a limit is reached, we could partition the > refcount into something like (assume max #15 GUP READ and #15 GUP R/W, > which is most probably a horribly bad choice) > > [ GUP READ ][ GUP R/W ] [ ordinary ] > 31 ... 28 27 ... 24 23 .... 0 > > But due to saturate handling in "ordinary", we would lose further 2 bits > (AFAIU), leaving us "only" 22 bits for "ordinary". Now, I have no idea > how many bits we actually need in practice. > > Maybe we need less for GUP READ, because most users want GUP R/W? No idea. > > Just wild ideas. Most probably that has already been discussed, and most > probably people figured that it's impossible :) > I proposed this exact idea a few days ago [1]. It's remarkable that we both picked nearly identical values for the layout! 
:)

But as the responses show, security problems prevent pursuing that approach.

[1] https://lore.kernel.org/r/45806a5a-65c2-67ce-fc92-dc8c2144d766@nvidia.com

thanks,
On 16.01.21 04:40, John Hubbard wrote: > On 1/15/21 11:46 AM, David Hildenbrand wrote: >>>> 7) There is no easy way to detect if a page really was pinned: we might >>>> have false positives. Further, there is no way to distinguish if it was >>>> pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking >>>> we most probably would need more counters, which we cannot fit into >>>> struct page. (AFAIU, for huge pages it's easier). >>> >>> I think this is the real issue. We can only store so much information, >>> so we have to decide which things work and which things are broken. So >>> far someone hasn't presented a way to record everything at least.. >> >> I do wonder how many (especially long-term) GUP readers/writers we have >> to expect, and especially, support for a single base page. Do we have a >> rough estimate? >> >> With RDMA, I would assume we only need a single one (e.g., once RDMA >> device; I'm pretty sure I'm wrong, sounds too easy). >> With VFIO I guess we need one for each VFIO container (~ in the worst >> case one for each passthrough device). >> With direct I/O, vmsplice and other GUP users ?? No idea. >> >> If we could somehow put a limit on the #GUP we support, and fail further >> GUP (e.g., -EAGAIN?) once a limit is reached, we could partition the >> refcount into something like (assume max #15 GUP READ and #15 GUP R/W, >> which is most probably a horribly bad choice) >> >> [ GUP READ ][ GUP R/W ] [ ordinary ] >> 31 ... 28 27 ... 24 23 .... 0 >> >> But due to saturate handling in "ordinary", we would lose further 2 bits >> (AFAIU), leaving us "only" 22 bits for "ordinary". Now, I have no idea >> how many bits we actually need in practice. >> >> Maybe we need less for GUP READ, because most users want GUP R/W? No idea. >> >> Just wild ideas. Most probably that has already been discussed, and most >> probably people figured that it's impossible :) >> > > I proposed this exact idea a few days ago [1]. 
> It's remarkable that we both
> picked nearly identical values for the layout! :)

Heh! Somehow I missed that. But well, there were *a lot* of mails :)

>
> But as the responses show, security problems prevent pursuing that approach.

It still feels kind of wrong to waste valuable space in the memmap. In
an ideal world (well, one that still only allows for a 64 byte memmap
:) ), we would:

1) Partition the refcount into separate fields that cannot overflow into
each other, similar to my example above, but maybe add even more fields.

2) Reject attempts that would result in an overflow to everything except
the "ordinary" field (e.g., GUP fields in my example above).

3) Put an upper limit on the "ordinary" field that we ever expect for
sane workloads (e.g., 10 bits). In addition, reserve some bits (like the
saturate logic) that we handle as a "red zone".

4) For the "ordinary" field, as soon as we enter the red zone, we know
we have an attack going on. We continue on paths that we cannot fail
(e.g., get_page()) but eventually try stopping the attacker(s). AFAIU,
the attacker(s) are something (e.g., one or multiple processes) that has
direct access to the page in their address space. Of course, the more
paths we can reject, the better.

Now, we would:

a) Have to know what sane upper limits on the "ordinary" field are. I
have no idea which values we can expect. Attacker vs. sane workload.

b) Need a way to identify the attacker(s). In the simplest case, this is
a single process. In the hard case, this involves many processes.

c) Need a way to stop the attacker(s). Doing that out of random context
is problematic. Last resort is doing this asynchronously from another
thread, which leaves more time for the attacker to do harm.

Of course, the problem gets more involved as soon as we might have a
malicious child process that uses a page from a well-behaving parent
process for the attack. Imagine we kill relevant processes: we might end
up killing someone who's not responsible.
And even if we don't kill, but instead reject try_get_page(), we might
degrade the well-behaving parent process, AFAICS. Alternatives to
killing the process might be unmapping the problematic page from the
address space.

Reminds me a little of handling memory errors for a page, eventually
killing all users of that page. mm/memory-failure.c:kill_procs().

Complicated problem :)
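Points 3) and 4) above — a sane upper limit plus a "red zone" on the ordinary field — can also be sketched as a toy model. Again, this is not kernel code; the limits, the struct, and the `fake_*` names are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical "red zone" on the ordinary refcount field: sane
 * workloads stay at or below ORDINARY_SANE_MAX; anything above it
 * is treated as a likely refcount attack. get_page()-style paths
 * cannot fail, so they keep counting (saturating at a hard cap)
 * but raise a flag that an asynchronous handler could act on
 * (e.g., unmapping the page or stopping the attacker).
 */
#define ORDINARY_SANE_MAX	(1u << 10)	/* ~10 bits for sane use */
#define ORDINARY_HARD_MAX	(1u << 12)	/* saturation point      */

struct fake_page {
	uint32_t ordinary;
	bool under_attack;
};

/* Non-failing reference grab: saturates instead of overflowing. */
static void fake_get_page(struct fake_page *page)
{
	if (page->ordinary < ORDINARY_HARD_MAX)
		page->ordinary++;
	if (page->ordinary > ORDINARY_SANE_MAX)
		page->under_attack = true;	/* defer actual handling */
}

/* Failable paths can reject outright once the red zone is entered. */
static int fake_try_get_page(struct fake_page *page)
{
	if (page->ordinary >= ORDINARY_SANE_MAX)
		return -1;
	page->ordinary++;
	return 0;
}
```

The hard part the mail describes — identifying and stopping the attacker once `under_attack` fires — is exactly what this sketch defers to "another thread".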