Message ID: 20210110004435.26382-1-aarcange@redhat.com (mailing list archive)
Series: mm: restore full accuracy in COW page reuse
On Sat, Jan 9, 2021 at 4:44 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Once we agree that COW page reuse requires full accuracy, [...]

You have completely and utterly ignored every single argument against that. Instead, you just continue to push your agenda.

The thing is, GUP works fine. COW works fine. The thing that is broken is clear_refs.

Yet you try to now "fix" the two fine cases, because you don't want to fix clear_refs.

What part of "clear_refs is the _least_ important of the three cases" are you not willing to understand?

            Linus
On Sat, Jan 9, 2021 at 4:55 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> What part of "clear_refs is the _least_ important of the three cases"
> are you not willing to understand?

In fact, I couldn't even turn on that code with my normal config, because it depends on CONFIG_CHECKPOINT_RESTORE that I didn't even have enabled.

IOW, that code is some special-case stuff, and instead of messing up the rest of the VM, it should be made to conform to all the normal VM rules and requirements.

Here's two patches to basically start doing that.

The first one is the same one I already sent out earlier, fixing the locking. And yes, it can be improved upon, but before improving on it, let's _fix_ the code.

The second is a trivial "oh, look, I can see that the page is pinned, soft-dirty cannot work so don't do it then". Again, it can be improved upon, most particularly by doing the same (simple) tests for the hugepage case too, which I didn't do.

Note: I have not a single actual user of this code that I can test with, so this is all ENTIRELY untested. IOW, I am in no way claiming that these patches are perfect and correct, and the only way to do things.

But what I _am_ claiming is that this clear_refs code (and the UFFD code) is of secondary importance, and instead of messing up the core VM, we should fix these special cases to not do bad things.

It really is that simple.

And no, I didn't make the UFFDIO_WRITEPROTECT code take the mmap_sem for writing. For whoever wants to look at that, it's mwriteprotect_range() in mm/userfaultfd.c and the fix is literally to turn the read-lock (and unlock) into a write-lock (and unlock).

            Linus
On Sat, Jan 9, 2021 at 5:19 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> And no, I didn't make the UFFDIO_WRITEPROTECT code take the mmap_sem
> for writing. For whoever wants to look at that, it's
> mwriteprotect_range() in mm/userfaultfd.c and the fix is literally to
> turn the read-lock (and unlock) into a write-lock (and unlock).

Oh, and if it wasn't obvious, we'll have to debate what to do with trying to mprotect() a pinned page. Do we just ignore the pinned page (the way my clear_refs patch did)? Or do we turn it into -EBUSY? Or what?

So it's not *just* the locking that needs to be fixed. But just take a look at that suggested clear_refs patch of mine - it sure isn't complicated.

            Linus
Hello Linus,

On Sat, Jan 09, 2021 at 05:19:51PM -0800, Linus Torvalds wrote:
> +#define is_cow_mapping(flags) (((flags) & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
> +
> +static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
> +{
> +	struct page *page;
> +
> +	if (!is_cow_mapping(vma->vm_flags))
> +		return false;
> +	if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
> +		return false;
> +	page = vm_normal_page(vma, addr, pte);
> +	if (!page)
> +		return false;
> +	if (page_mapcount(page) != 1)
> +		return false;
> +	return page_maybe_dma_pinned(page);
> +}

I just don't see the simplification coming from 09854ba94c6aad7886996bfbee2530b3d8a7f4f4. Instead of checking page_mapcount above as an optimization, to me it looks much simpler to check it in a single place, in do_wp_page, which in addition to optimizing away the superfluous copy, would optimize away the above complexity as well.

And I won't comment on whether it's actually safe to skip random pages or not. All I know is that for mprotect and uffd-wp, the above approach definitely wouldn't work.

In addition I dislike has_pinned and FOLL_PIN.

has_pinned  450 include/linux/mm_types.h	 * for instance during page table copying for fork().
has_pinned 1021 kernel/fork.c		atomic64_set(&mm->pinned_vm, 0);
has_pinned 1239 mm/gup.c		atomic_set(&mm->has_pinned, 1);
has_pinned 2579 mm/gup.c		atomic_set(&current->mm->has_pinned, 1);
has_pinned 1099 mm/huge_memory.c	atomic_read(&src_mm->has_pinned) &&
has_pinned 1213 mm/huge_memory.c	atomic_read(&src_mm->has_pinned) &&
has_pinned  819 mm/memory.c		if (likely(!atomic_read(&src_mm->has_pinned)))

Are we spending 32 bits of an atomic_t in mm_struct just to call atomic_set(1) on it? Why isn't it an MMF_HAS_PINNED bit that can already be set atomically under mmap_read_lock too? There are bits left free there; we haven't run out yet to justify wasting another 31 bits. I hope I'm overlooking something.
The existence of FOLL_LONGTERM is good and makes a difference at times for writeback if it's on a MAP_SHARED, or it makes a difference during GUP to do a page migration before taking the pin, but for the whole rest of the VM it's irrelevant whether the pin is long or short term. So I'm also concerned by what Jason mentioned about long term pins being treated differently within the VM. That to me looks as fundamentally flawed as introducing inaccuracies in do_wp_page, which, even ignoring the performance implications of the inaccuracy, simply prevents doing some useful things.

I obviously agree that all common workloads with no GUP pins, no clear_refs and no uffd are way more important to keep optimal, but I haven't seen a single benchmark showing the benefit of not taking the PG_lock before a page copy, or any other runtime benefit coming from page_count in do_wp_page. To the contrary, now I see additional branches in fork and in various other paths.

The only thing, again, where I see page_count providing a tangible benefit is the vmsplice attack from child to parent, but that has not been fully fixed, and if page_count is added to fix it in all COW faults, it'll introduce extra inefficiency to the very common, important workloads, not only to the special GUP/clear_refs/uffd-wp workloads, as your patch above shows.

Thanks,
Andrea
On Sat, Jan 09, 2021 at 05:37:09PM -0800, Linus Torvalds wrote:
> On Sat, Jan 9, 2021 at 5:19 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And no, I didn't make the UFFDIO_WRITEPROTECT code take the mmap_sem
> > for writing. For whoever wants to look at that, it's
> > mwriteprotect_range() in mm/userfaultfd.c and the fix is literally to
> > turn the read-lock (and unlock) into a write-lock (and unlock).
>
> Oh, and if it wasn't obvious, we'll have to debate what to do with
> trying to mprotect() a pinned page. Do we just ignore the pinned page
> (the way my clear_refs patch did)? Or do we turn it into -EBUSY? Or
> what?

Agreed, I assume mprotect would have the same effect. mprotect in parallel with a read or recvmsg may be undefined, so I didn't bring it up, but it was pretty clear. The moment the write bit is cleared (no matter why and by whom) and the PG lock released, if there's any GUP pin, GUP currently loses synchrony.

In any case I intended to help exercise the new page_count logic with the testcase, possibly to make it behave better somehow, no matter how. I admit I'm also wondering myself about the exact semantics of O_DIRECT with clear_refs or uffd-wp tracking, but the point is that losing reads and getting unexpected data in the page still doesn't look like good behavior, and it had to be at least checked.

To me, ultimately, the useful use case that becomes impossible with page_count isn't even clear_refs nor uffd-wp. The useful case that I can see zero fundamental flaws in is RDMA or some other device computing in pure readonly DMA on the data while a program runs normally and produces it. It could even be a framebuffer that doesn't care about coherency. You may want to occasionally wrprotect the memory under a readonly long term GUP pin for consistency, even against bugs of the program itself. Why should wrprotecting make the device lose synchrony? And what kind of performance do we gain for the normal useful cases by breaking the special case? Is there a benchmark showing it?

> So it's not *just* the locking that needs to be fixed. But just take a
> look at that suggested clear_refs patch of mine - it sure isn't
> complicated.

If we can skip the wrprotection it's fairly easy, I fully agree. Even then it still looks more difficult than using page_mapcount in do_wp_page in my view, so I also don't see the simplification. And overall the amount of kernel code had a net increase as a result.

Thanks,
Andrea
On Sat, Jan 9, 2021 at 6:51 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> I just don't see the simplification coming from
> 09854ba94c6aad7886996bfbee2530b3d8a7f4f4. Instead of checking
> page_mapcount above as an optimization, to me it looks much simpler to
> check it in a single place, in do_wp_page, that in addition of
> optimizing away the superfluous copy, would optimize away the above
> complexity as well.

Here's the difference:

 (a) in COW, "page_mapcount()" is pure and utter garbage, and has zero meaning.

Why? Because MAPCOUNT DOES NOT MATTER FOR COW.

COW is about "I'm about to write to this page, and that means I need an _exclusive_ page so that I don't write to a page that somebody else is using". Can you admit that fundamental fact?

Notice how "page_mapcount()" has absolutely NOTHING to do with "exclusive page". There are lots of other ways the page can be used aside from mapcount. The page cache may have a reference to the page. Somebody that did a GUP may have a reference to the page.

So what actually matters at COW time? The only thing that matters is "am I the exclusive owner". And guess what? We have a count of page references. It's "page_count()". That's *EXACTLY* the thing that says "are there maybe other references to this page".

In other words, COW needs to use page_count(). It really is that easy. End of story.

So, given that, why do I then do

> +	if (page_mapcount(page) != 1)
> +		return false;

in my patch, when I just told you that "page_mapcount()" is irrelevant for COW?

Guess what? The above isn't about COW. The above isn't about whether we need to do a copy in order to be able to write to the page without anybody else being affected by it.

No, at fork time, and at this clear_refs time, the question is entirely different. The question is not "Do I have exclusive access to the page", but it is "Did I _already_ make sure that I have exclusive access to the page because I pinned it"?

See how different the question is?

Because *if* you have done a pinned COW for some direct-IO read, and *if* that page is dirty, then you know it's mapped only in your address space. You're basically doing the _reverse_ of the COW test, and asking yourself "is this my own private pinned page"?

And then it's actually perfectly sane to do a check that says "obviously, if somebody else has this page mapped, then that's not the case".

See?

For COW, "page_mapcount()" is pure and utter garbage, and entirely meaningless. How many places it's mapped in doesn't matter. You may have to COW even if it's only mapped in your address space (page cache, GUP, whatever).

But for "did I already make this exclusive", then it's actually meaningful to say "is it mapped somewhere else". We know it has other users - that "page_may_be_pinned()" in fact *guarantees* that it has other users. But we're double-checking that the other users aren't other mappings.

That said, I did just realize that that "page_mapcount()" check is actually pointless. Because we do have a simpler one. Instead of checking whether all those references that made us go "page_might_be_pinned()" aren't other mappings, the simple check for "pte_writable()" would already have told us that we had already done the COW.

So you are actually right that the page_mapcount() test in my patch is not the best way to check for this. By the time we see "page_may_be_pinned()", we might as well just say "Oh, it's a private mapping and the pte is already writable, so we know we were the exclusive mapper, because COW and fork() already guarantee that".

> And I won't comment if it's actually safe to skip random pages or
> not. All I know is for mprotect and uffd-wp, definitely the above
> approach wouldn't work.

Why do you say that? You say "definitely", but I think you're full of it.

The fact is, if you have a pinned page, why wouldn't we say "you can't turn it read-only"? It's pinned in the VM address space - and it's pinned writable. Simple and clear semantics. You can *remove* it, but you can't change the pinning.

            Linus
On Sat, Jan 9, 2021 at 7:51 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> COW is about "I'm about to write to this page, and that means I need
> an _exclusive_ page so that I don't write to a page that somebody else
> is using".

So this kind of fundamentally explains why I hate the games we used to play wrt page_mapcount(): they were fundamentally fragile. I _much_ prefer just having the rule that we use page_count(), which the above simple and straightforward single sentence explains 100%.

This gets back to the fact that, especially considering how we've had subtle bugs here (the "wrong-way COW" issue has existed since literally the first GUP ever, so it goes back decades), I want the core VM rules to be things that can be explained basically from simple "first principles".

And the reason I argue for the current direction that I'm pushing is exactly that the above is a very simple "first principle" for why COW exists. If the rule for COW is simply "I will always COW if it's not clear that I'm the exclusive user", then COW itself is very simple to think about.

The other rule I want to stress is that COW is common, and that argues against the model we used to have of "let's lock the page to make sure that everything else is stable". That model was garbage anyway, since page locking doesn't even guarantee any stability wrt exclusive use in the first place (ie GUP being another example), but it's why I truly detested the old model that depended so much on the page lock to serialize things.

So if you start off with the rule that "I will always COW unless I can trivially see I'm the only owner", then I think we have really made for a really clear and unambiguous rule.

And remember: COW is only an issue for private mappings. So pretty much BY DEFINITION, doing a COW is always safe for all normal circumstances.

Now, this is where it does get subtle: that "all normal circumstances" part. The one special case is a cache-coherent GUP.

It's arguable whether "pinned" should matter or not, and it would obviously be better if "pinned" simply didn't matter at all (and the only issue with any long-term pinning would simply be about resource counting).

The current approach I'm advocating is "coherency means that it must have been writable", and then the way to solve the whole "Oh, it's shared with something else" is to simply never accept making it read-only, because BY DEFINITION it's not _really_ read-only (because we know we've created that other alias of the virtual address that is *not* controlled by the page table protection bits).

Notice how this is all both conceptually fairly simple (ie I can explain the rules in plain English without really making any complex argument) and it is arguably internally fairly self-consistent (ie the whole notion of "oh, there's another thing that has write access to that page but doesn't go through the page table, so trying to make it read-only in the page tables is a nonsensical operation").

Are the end results wrt something like soft-dirty a bit odd? Not really. If you do soft-dirty, such a GUP-shared page would simply always show up as dirty. That's still consistent with the rules. If somebody else may be writing to it because of GUP, that page really *isn't* clean, and us marking it read-only would be just lying about things.

I'm admittedly not very happy about mprotect() itself, though. It's actually ok to do the mprotect(PROT_READ) and turn the page read-only: that will also disable COW itself (because a page fault will now be a SIGSEGV, not a COW). But if you then make it writable again with mprotect(PROT_WRITE), you *have* lost the WP bit, and you'll COW on a write, and lose the coherency.

Now, I'm willing to just say: "if you do page pinning, and then do mprotect(PROT_READ), and then do mprotect(PROT_WRITE) and then write to the page, you really do get to keep both broken pieces". IOW, I'm perfectly happy to just say you get what you deserve.

But I'd also be perfectly happy to make the whole "I'm the exclusive user" logic a bit more extensive. Right now it's basically _purely_ page_count(), and the other part of "I'm the exclusive owner" is that the RW bit in the page table is simply not clear. That makes things really easy for COW: it just won't happen in the first place if you broke the "obviously exclusive" rule with GUP.

But we _could_ do something slightly smarter. But "page_mapcount()" is not that "slightly smarter" thing, because we already know it's broken wrt lots of other uses (GUP, page cache, whatever).

Just having a bit in the page flags for "I already made this exclusive, and fork() will keep it so" is I feel the best option. In a way, "page is writable" right now _is_ that bit. By definition, if you have a writable page in an anonymous mapping, that is an exclusive user.

But because "writable" has these interactions with other operations, it would be better if it was a harder bit than that "maybe_pinned()", though. It would be lovely if a regular non-pinning write-GUP just always set it, for example.

"maybe_pinned()" is good enough for the fork() case, which is the one that matters for long-term pinning. But it's admittedly not perfect.

            Linus
On Sun, Jan 10, 2021 at 11:30:57AM -0800, Linus Torvalds wrote:
> Notice how this is all both conceptually fairly simple (ie I can
> explain the rules in plain English without really making any complex
> argument) and it is arguably internally fairly self-consistent (ie the
> whole notion of "oh, there's another thing that has write access that
> page but doesn't go through the page table, so trying to make it
> read-only in the page tables is a nonsensical operation").

Yes exactly!

> Are the end results wrt something like soft-dirty a bit odd? Not
> really. If you do soft-dirty, such a GUP-shared page would simply
> always show up as dirty. That's still consistent with the rules. If
> somebody else may be writing to it because of GUP, that page really
> *isn't* clean, and us marking it read-only would be just lying about
> things.
>
> I'm admittedly not very happy about mprotect() itself, though. It's
> actually ok to do the mprotect(PROT_READ) and turn the page read-only:
> that will also disable COW itself (because a page fault will now be a
> SIGSEGV, not a COW).

I would say even mprotect should not set write protect on a page under DMA; it seems like the sort of thing that will trip up other parts of the mm that might sensibly assume read-only actually means read-only, not "probably read-only but there might be DMA writes anyhow".

So copy the page before write protecting it? Then the explicit mprotect is one of the system calls that can cause de-coherence of the DMA - like munmap. If we had reliable pinned detection I'd say to fail the mprotect..

And I think you are right about clear refs too. Clear refs must be aware of FOLL_LONGTERM and handle it properly - which includes not write protecting such a page.

Jason
On 1/10/21 11:30 AM, Linus Torvalds wrote:
> Just having a bit in the page flags for "I already made this
> exclusive, and fork() will keep it so" is I feel the best option. In a
> way, "page is writable" right now _is_ that bit. By definition, if you
> have a writable page in an anonymous mapping, that is an exclusive
> user.
>
> But because "writable" has these interactions with other operations,
> it would be better if it was a harder bit than that "maybe_pinned()",
> though. It would be lovely if a regular non-pinning write-GUP just
> always set it, for example.
>
> "maybe_pinned()" is good enough for the fork() case, which is the one
> that matters for long-term pinning. But it's admittedly not perfect.

There is at least one way to improve this part of it--maybe.

IMHO, a lot of the bits in page _refcount are still being wasted (even after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that there are many callers of gup/pup per page. If anyone points out that that is wrong, then the rest of this falls apart, but...if we were to make a rule that "only a very few FOLL_GET or FOLL_PIN pins are ever simultaneously allowed on a given page", then several things become possible:

1) "maybe dma pinned" could become "very likely dma-pinned", by allowing perhaps 23 bits for normal page refcounts (leaving just 8 bits for dma pins), thus all but ensuring no aliasing between normal refcounts and dma pins. The name change below to "likely" is not actually necessary, I'm just using it to make the point clear:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..28f7f64e888f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1241,7 +1241,7 @@ static inline void put_page(struct page *page)
  * get_user_pages and page_mkclean and other calls that race to set up page
  * table entries.
  */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define GUP_PIN_COUNTING_BIAS (1U << 23)

 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
@@ -1274,7 +1274,7 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
  * @Return: True, if it is likely that the page has been "dma-pinned".
  *          False, if the page is definitely not dma-pinned.
  */
-static inline bool page_maybe_dma_pinned(struct page *page)
+static inline bool page_likely_dma_pinned(struct page *page)
 {
 	if (hpage_pincount_available(page))
 		return compound_pincount(page) > 0;

2) Doing (1) allows, arguably, failing mprotect calls if page_likely_dma_pinned() returns true, because the level of confidence is much higher now.

3) An additional counter, for normal gup (FOLL_GET) is possible: divide up the top 8 bits into two separate counters of just 4 bits each. Only allow either FOLL_GET or FOLL_PIN (and btw, I'm now mentally equating FOLL_PIN with FOLL_LONGTERM), not both, on a given page. Like this:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..ce5af27e3057 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1241,7 +1241,8 @@ static inline void put_page(struct page *page)
  * get_user_pages and page_mkclean and other calls that race to set up page
  * table entries.
  */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define DMA_PIN_COUNTING_BIAS (1U << 23)
+#define GUP_PIN_COUNTING_BIAS (1U << 27)

 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,

So:

FOLL_PIN: would use DMA_PIN_COUNTING_BIAS to increment page refcount. These are long term pins for dma.

FOLL_GET: would use GUP_PIN_COUNTING_BIAS to increment page refcount. These are not long term pins.

Doing (3) would require yet another release call variant: unpin_user_pages() would need to take a flag to say which refcount to release: FOLL_GET or FOLL_PIN.

However, I think that's relatively easy (compared to the original effort of teasing apart generic put_page() calls, into unpin_user_pages() calls). We already have all the unpin_user_pages() calls in place, and so it's "merely" a matter of adding a flag to 74 call sites.

The real question is whether we can get away with supporting only a very low count of FOLL_PIN and FOLL_GET pages. Can we?

thanks,
On Sun, Jan 10, 2021 at 11:26:57PM -0800, John Hubbard wrote:
> IMHO, a lot of the bits in page _refcount are still being wasted (even
> after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that
> there are many callers of gup/pup per page. If anyone points out that
> that is wrong, then the rest of this falls apart, but...if we were to
> make a rule that "only a very few FOLL_GET or FOLL_PIN pins are ever
> simultaneously allowed on a given page", then several things become
> possible:

There's "the normal case" and then there's "the attacker case" where someone's deliberately trying to wrap page->_refcount. There are lots of interesting games people can play with an anon page, like stuffing it into (lots of) pipes, forking lots of children, starting lots of O_DIRECT I/O against it to a FUSE filesystem that's deliberately engineered to be slow.

We have some protection against that, but I'm not 100% sure it's working, and making it easier to increase the refcount in large chunks makes it more likely that we would defeat that protection.
On Sat, Jan 09, 2021 at 09:51:14PM -0500, Andrea Arcangeli wrote:
> Are we spending 32bit in mm_struct atomic_t just to call atomic_set(1)
> on it? Why isn't it a MMF_HAS_PINNED that already can be set
> atomically under mmap_read_lock too? There's bit left free there, we
> didn't run out yet to justify wasting another 31 bits. I hope I'm
> overlooking something.

It needs to be atomic because it is set under gup fast, with no mmap lock. Peter and myself did not find another place to put this that was already atomic.

> The existence of FOLL_LONGTERM is good and makes a difference at times
> for writeback if it's on a MAP_SHARED, or it makes difference during
> GUP to do a page_migrate before taking the pin, but for the whole rest
> of the VM it's irrelevant if it's long or short term, so I'm also
> concerned from what Jason mentioned about long term pins being treated
> differently within the VM.

Why? They are different. Write protect doesn't stop modification of the data. How is that not a relevant and real difference?

Jason
On Sun, Jan 10, 2021 at 11:26:57PM -0800, John Hubbard wrote:
> So:
>
> FOLL_PIN: would use DMA_PIN_COUNTING_BIAS to increment page refcount.
> These are long term pins for dma.
>
> FOLL_GET: would use GUP_PIN_COUNTING_BIAS to increment page refcount.
> These are not long term pins.

Do we have any places yet that care about the long-term-ness? The only place long term would matter is if something is waiting for the page ref count to drop.

Jason
On Mon 11-01-21 12:05:49, Jason Gunthorpe wrote:
> On Sun, Jan 10, 2021 at 11:26:57PM -0800, John Hubbard wrote:
> > So:
> >
> > FOLL_PIN: would use DMA_PIN_COUNTING_BIAS to increment page refcount.
> > These are long term pins for dma.
> >
> > FOLL_GET: would use GUP_PIN_COUNTING_BIAS to increment page refcount.
> > These are not long term pins.
>
> Do we have any places yet that care about the long-term-ness?

I was hoping to use that information to distinguish ephemeral migration failures due to page references from long term pins. The latter would be a hard failure for the migration.
On Sun, Jan 10, 2021 at 11:27 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> There is at least one way to improve this part of it--maybe.

It's problematic..

> IMHO, a lot of the bits in page _refcount are still being wasted (even
> after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that
> there are many callers of gup/pup per page.

It may be unlikely under real loads.

But we've actually had overflow issues on this because rather than real loads you can do attack loads (ie "lots of processes, lots of pipe file descriptors, lots of vmsplice() operations on the same page").

We had to literally add that conditional "try_get_page()" that protects against overflow..

            Linus
On Mon, Jan 11, 2021 at 11:19 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Sun, Jan 10, 2021 at 11:27 PM John Hubbard <jhubbard@nvidia.com> wrote:
> > IMHO, a lot of the bits in page _refcount are still being wasted (even
> > after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that
> > there are many callers of gup/pup per page.
>
> It may be unlikely under real loads.
>
> But we've actually had overflow issues on this because rather than
> real loads you can do attack loads (ie "lots of processes, lots of
> pipe file descriptors, lots of vmsplice() operations on the same
> page").
>
> We had to literally add that conditional "try_get_page()" that
> protects against overflow..

Actually, what I think might be a better model is to actually strengthen the rules even more, and get rid of GUP_PIN_COUNTING_BIAS entirely.

What we could do is just make a few clear rules explicit (most of which we already basically hold to). Starting from that basic

 (a) Anonymous pages are made writable (ie COW) only when they have a page_count() of 1

That very simple rule then automatically results in the corollary

 (b) a writable page in a COW mapping always starts out reachable _only_ from the page tables

and now we could have a couple of really simple new rules:

 (c) we never ever make a writable page in a COW mapping read-only _unless_ it has a page_count() of 1

 (d) we never create a swap cache page out of a writable COW mapping page

Now, if you combine these rules, the whole need for the GUP_PIN_COUNTING_BIAS basically goes away.

Why? Because we know that the _only_ thing that can elevate the refcount of a writable COW page is GUP - we'll just make sure nothing else touches it.

The whole "optimistic page references through page cache" etc are complete non-issues, because the whole point is that we already know it's not a page cache page. There is simply no other way to reach that page than through GUP.

Ergo: any writable pte in a COW mapping that has a page with a page_count() > 1 is pinned by definition, and thus our page_maybe_dma_pinned(page) could remove that "maybe" part, and simply check for page_count(page) > 1 (although the rule would be that this is only valid for a cow_mapping pte, and only while holding the page table lock! So maybe it would be good to pass in the vma and have an assert for that lock too).

And the thing is, none of the above rules are complicated.

The only new one would be the requirement that you cannot add a page to the swap cache unless it is read-only in the page tables. That may be the biggest hurdle here.

The way we handle swap cache is that we *first* add it to the swap cache, and then we do a "try_to_unmap()" on it. So we currently don't actually try to walk the page tables until we have already done that swap cache thing.

But I do think that the only major problem spot is that shrink_page_list() -> add_to_swap() case, and I think we could make add_to_swap() just do the rmap walk and turn it read-only first.

(And it's worth pointing out that I'm only talking about regular non-huge pages above; the rules for splitting hugepages may impact those cases differently, I didn't really even try to think about those cases.)

But that add_to_swap() case might make it too painful. It _would_ simplify our rules wrt anonymous mappings enormously, though.

            Linus
On Mon, Jan 11, 2021 at 2:18 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Actually, what I think might be a better model is to actually
> strengthen the rules even more, and get rid of GUP_PIN_COUNTING_BIAS
> entirely.
>
> What we could do is just make a few clear rules explicit (most of
> which we already basically hold to). Starting from that basic
>
>  (a) Anonymous pages are made writable (ie COW) only when they have a
> page_count() of 1

Seems reasonable to me.

> That very simple rule then automatically results in the corollary
>
>  (b) a writable page in a COW mapping always starts out reachable
> _only_ from the page tables

Seems reasonable. I guess that if the COW is triggered by GUP, then it starts out reachable only from the page tables but then becomes reachable through GUP very soon thereafter.

> and now we could have a couple of really simple new rules:
>
>  (c) we never ever make a writable page in a COW mapping read-only
> _unless_ it has a page_count() of 1

I don't love this. Having mprotect() fail in a multithreaded process because another thread happens to be doing a short-lived IO seems like it may result in annoying intermittent bugs.

As I understand it, the issue is that the way we determine that we need to COW a COWable page is that we see that it's read-only. It would be nice if we could separately track "the VMA allows writes" and "this PTE points to a page that is private to the owning VMA", but maybe there's no bit available for the latter other than looking at RO vs RW directly.

>  (d) we never create a swap cache page out of a writable COW mapping page
>
> Now, if you combine these rules, the whole need for the
> GUP_PIN_COUNTING_BIAS basically goes away.
>
> Why? Because we know that the _only_ thing that can elevate the
> refcount of a writable COW page is GUP - we'll just make sure nothing
> else touches it.

How common is !FOLL_WRITE GUP? We could potentially say that a short-term !FOLL_WRITE GUP is permitted on an RO COW page and that a subsequent COW on the page will wait for the GUP to go away. This might be too big a can of worms for the benefit it would provide, though.

--Andy
On Mon, Jan 11, 2021 at 02:18:13PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2021 at 11:19 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Sun, Jan 10, 2021 at 11:27 PM John Hubbard <jhubbard@nvidia.com> wrote:
> > > IMHO, a lot of the bits in page _refcount are still being wasted (even
> > > after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that
> > > there are many callers of gup/pup per page.
> >
> > It may be unlikely under real loads.
> >
> > But we've actually had overflow issues on this because rather than
> > real loads you can do attack loads (ie "lots of processes, lots of
> > pipe file descriptors, lots of vmsplice() operations on the same
> > page".
> >
> > We had to literally add that conditional "try_get_page()" that
> > protects against overflow..
>
> Actually, what I think might be a better model is to actually
> strengthen the rules even more, and get rid of GUP_PIN_COUNTING_BIAS
> entirely.
>
> What we could do is just make a few clear rules explicit (most of
> which we already basically hold to). Starting from that basic
>
>  (a) Anonymous pages are made writable (ie COW) only when they have a
>      page_count() of 1
>
> That very simple rule then automatically results in the corollary
>
>  (b) a writable page in a COW mapping always starts out reachable
>      _only_ from the page tables
>
> and now we could have a couple of really simple new rules:
>
>  (c) we never ever make a writable page in a COW mapping read-only
>      _unless_ it has a page_count() of 1

This breaks mprotect(R_ONLY); I do not think we want to do that. It might break security schemes for userspace applications which expect mprotect to make CPU mappings read-only.

Maybe an alternative would be to copy the page on mprotect for pages that do not have a page_count of 1? But that makes me uneasy about short-lived GUPs (direct IO racing with an mprotect, or maybe simply even page migration) versus unbounded ones (like RDMA).
Also, I want to make sure I properly understand what happens on fork() on a COW mapping for a page that has a page_count > 1: we copy the page instead of write-protecting the page?

I believe it would be better here to write-protect the page on the CPU but forbid the child to reuse the page, i.e., if the child ever inherits the page (parent unmapped the page, for instance) it will have to make a copy, and the GUP reference (taken before the fork) might linger on a page that is no longer associated with any VM. This way we keep fast fork.

Jérôme
On Mon, Jan 11, 2021 at 02:18:13PM -0800, Linus Torvalds wrote:
> The whole "optimistic page references through page cache" etc are
> complete non-issues, because the whole point is that we already know
> it's not a page cache page. There is simply no other way to reach that
> page than through GUP.

The thing about the speculative page cache references is that they can temporarily bump a refcount on a page which _used_ to be in the page cache and has now been reallocated as some other kind of page.

Now, this is obviously rare, so if it's only a performance question, it'll be fine. If there's a correctness issue with copying pages that would otherwise not have been copied, then it's a problem.
On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox <willy@infradead.org> wrote: > > The thing about the speculative page cache references is that they can > temporarily bump a refcount on a page which _used_ to be in the page > cache and has now been reallocated as some other kind of page. Right you are. Yeah, scratch the "we could make it absolute on 1", and we do need the PIN count elevation. Linus
On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> The thing about the speculative page cache references is that they can
> temporarily bump a refcount on a page which _used_ to be in the page
> cache and has now been reallocated as some other kind of page.

Oh, and thinking about this made me think we might actually have a serious bug here, and it has nothing what-so-ever to do with COW, GUP, or even the page count itself.

It's unlikely enough that I think it's mostly theoretical, but tell me I'm wrong.

PLEASE tell me I'm wrong:

CPU1 does page_cache_get_speculative under RCU lock

CPU2 frees and re-uses the page

        CPU1                                    CPU2
        ----                                    ----

        page = xas_load(&xas);
        if (!page_cache_get_speculative(page))
            goto repeat;
        .. succeeds ..
                                                remove page from XA
                                                release page
                                                reuse for something else
        .. and then re-check ..
        if (unlikely(page != xas_reload(&xas))) {
            put_page(page);
            goto repeat;
        }

ok, the above all looks fine. We got the speculative ref, but then we noticed that it's not valid any more, so we put it again. All good, right?

Wrong.

What if that "reuse for something else" was actually really quick, and both allocated and released it?

That still sounds good, right? Yes, now the "put_page()" will be the one that _actually_ releases the page, but we're still fine, right?

Very very wrong.

The "reuse for something else" on CPU2 might have gotten not an order-0 page, but a *high-order* page. So it allocated (and then immediately free'd) maybe an order-2 allocation with _four_ pages, and the re-use happened when we had coalesced the buddy pages.

But when we release the page on CPU1, we will release just _one_ page, and the other three pages will be lost forever.

IOW, we restored the page count perfectly fine, but we screwed up the page sizes and buddy information.

Ok, so the above is so unlikely from a timing standpoint that I don't think it ever happens, but I don't see why it couldn't happen in theory.
Please somebody tell me I'm missing some clever thing we do to make sure this can actually not happen.. Linus
On 13.01.21 04:31, Linus Torvalds wrote: > On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox <willy@infradead.org> wrote: >> >> The thing about the speculative page cache references is that they can >> temporarily bump a refcount on a page which _used_ to be in the page >> cache and has now been reallocated as some other kind of page. > > Oh, and thinking about this made me think we might actually have a > serious bug here, and it has nothing what-so-ever to do with COW, GUP, > or even the page count itself. > > It's unlikely enough that I think it's mostly theoretical, but tell me > I'm wrong. > > PLEASE tell me I'm wrong: > > CPU1 does page_cache_get_speculative under RCU lock > > CPU2 frees and re-uses the page > > CPU1 CPU2 > ---- ---- > > page = xas_load(&xas); > if (!page_cache_get_speculative(page)) > goto repeat; > .. succeeds .. > > remove page from XA > release page > reuse for something else > > .. and then re-check .. > if (unlikely(page != xas_reload(&xas))) { > put_page(page); > goto repeat; > } > > ok, the above all looks fine. We got the speculative ref, but then we > noticed that its' not valid any more, so we put it again. All good, > right? > > Wrong. > > What if that "reuse for something else" was actually really quick, and > both allocated and released it? > > That still sounds good, right? Yes, now the "put_page()" will be the > one that _actually_ releases the page, but we're still fine, right? > > Very very wrong. > > The "reuse for something else" on CPU2 might have gotten not an > order-0 page, but a *high-order* page. So it allocated (and then > immediately free'd) maybe an order-2 allocation with _four_ pages, and > the re-use happened when we had coalesced the buddy pages. > > But when we release the page on CPU1, we will release just _one_ page, > and the other three pages will be lost forever. > > IOW, we restored the page count perfectly fine, but we screwed up the > page sizes and buddy information. 
>
> Ok, so the above is so unlikely from a timing standpoint that I don't
> think it ever happens, but I don't see why it couldn't happen in
> theory.
>
> Please somebody tell me I'm missing some clever thing we do to make
> sure this can actually not happen..

Wasn't that tackled by the latest (not merged, AFAIK) __free_pages() changes?

I'm only able to come up with the doc update, not with the original fix/change:

https://lkml.kernel.org/r/20201027025523.3235-1-willy@infradead.org
On 13.01.21 09:52, David Hildenbrand wrote: > On 13.01.21 04:31, Linus Torvalds wrote: >> On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox <willy@infradead.org> wrote: >>> >>> The thing about the speculative page cache references is that they can >>> temporarily bump a refcount on a page which _used_ to be in the page >>> cache and has now been reallocated as some other kind of page. >> >> Oh, and thinking about this made me think we might actually have a >> serious bug here, and it has nothing what-so-ever to do with COW, GUP, >> or even the page count itself. >> >> It's unlikely enough that I think it's mostly theoretical, but tell me >> I'm wrong. >> >> PLEASE tell me I'm wrong: >> >> CPU1 does page_cache_get_speculative under RCU lock >> >> CPU2 frees and re-uses the page >> >> CPU1 CPU2 >> ---- ---- >> >> page = xas_load(&xas); >> if (!page_cache_get_speculative(page)) >> goto repeat; >> .. succeeds .. >> >> remove page from XA >> release page >> reuse for something else >> >> .. and then re-check .. >> if (unlikely(page != xas_reload(&xas))) { >> put_page(page); >> goto repeat; >> } >> >> ok, the above all looks fine. We got the speculative ref, but then we >> noticed that its' not valid any more, so we put it again. All good, >> right? >> >> Wrong. >> >> What if that "reuse for something else" was actually really quick, and >> both allocated and released it? >> >> That still sounds good, right? Yes, now the "put_page()" will be the >> one that _actually_ releases the page, but we're still fine, right? >> >> Very very wrong. >> >> The "reuse for something else" on CPU2 might have gotten not an >> order-0 page, but a *high-order* page. So it allocated (and then >> immediately free'd) maybe an order-2 allocation with _four_ pages, and >> the re-use happened when we had coalesced the buddy pages. >> >> But when we release the page on CPU1, we will release just _one_ page, >> and the other three pages will be lost forever. 
>> >> IOW, we restored the page count perfectly fine, but we screwed up the >> page sizes and buddy information. >> >> Ok, so the above is so unlikely from a timing standpoint that I don't >> think it ever happens, but I don't see why it couldn't happen in >> theory. >> >> Please somebody tell me I'm missing some clever thing we do to make >> sure this can actually not happen.. > > Wasn't that tackled by latest (not merged AFAIKs) __free_pages() changes? > > I'm only able to come up with the doc update, not with the oroginal > fix/change > > https://lkml.kernel.org/r/20201027025523.3235-1-willy@infradead.org > Sorry, found it, it's in v5.10 commit e320d3012d25b1fb5f3df4edb7bd44a1c362ec10 Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Tue Oct 13 16:56:04 2020 -0700 mm/page_alloc.c: fix freeing non-compound pages and commit 7f194fbb2dd75e9346b305b8902e177b423b1062 Author: Matthew Wilcox (Oracle) <willy@infradead.org> Date: Mon Dec 14 19:11:09 2020 -0800 mm/page_alloc: add __free_pages() documentation is in v5.11-rc1
On Tue, Jan 12, 2021 at 07:31:07PM -0800, Linus Torvalds wrote: > On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > The thing about the speculative page cache references is that they can > > temporarily bump a refcount on a page which _used_ to be in the page > > cache and has now been reallocated as some other kind of page. > > Oh, and thinking about this made me think we might actually have a > serious bug here, and it has nothing what-so-ever to do with COW, GUP, > or even the page count itself. > > It's unlikely enough that I think it's mostly theoretical, but tell me > I'm wrong. > > PLEASE tell me I'm wrong: > > CPU1 does page_cache_get_speculative under RCU lock > > CPU2 frees and re-uses the page > > CPU1 CPU2 > ---- ---- > > page = xas_load(&xas); > if (!page_cache_get_speculative(page)) > goto repeat; > .. succeeds .. > > remove page from XA > release page > reuse for something else How can it be reused if CPU1 hold reference to it? > > .. and then re-check .. > if (unlikely(page != xas_reload(&xas))) { > put_page(page); > goto repeat; > } >
On Wed, Jan 13, 2021 at 03:32:32PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jan 12, 2021 at 07:31:07PM -0800, Linus Torvalds wrote:
> > On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > The thing about the speculative page cache references is that they can
> > > temporarily bump a refcount on a page which _used_ to be in the page
> > > cache and has now been reallocated as some other kind of page.
> >
> > Oh, and thinking about this made me think we might actually have a
> > serious bug here, and it has nothing what-so-ever to do with COW, GUP,
> > or even the page count itself.
> >
> > It's unlikely enough that I think it's mostly theoretical, but tell me
> > I'm wrong.
> >
> > PLEASE tell me I'm wrong:
> >
> > CPU1 does page_cache_get_speculative under RCU lock
> >
> > CPU2 frees and re-uses the page
> >
> >         CPU1                                    CPU2
> >         ----                                    ----
> >
> >         page = xas_load(&xas);
> >         if (!page_cache_get_speculative(page))
> >             goto repeat;
> >         .. succeeds ..
> >                                                 remove page from XA
> >                                                 release page
> >                                                 reuse for something else
> >         .. and then re-check ..
> >         if (unlikely(page != xas_reload(&xas))) {
> >             put_page(page);
> >             goto repeat;
> >         }
>
> How can it be reused if CPU1 hold reference to it?

Yes, Linus mis-stated it:

    page = xas_load(&xas);
                                                remove page from XA
                                                release page
                                                reuse for something else
    if (!page_cache_get_speculative(page))
        goto repeat;
    if (unlikely(page != xas_reload(&xas))) {
        put_page(page);

... but as David pointed out, I fixed this in e320d3012d25
On Wed, Jan 13, 2021 at 4:56 AM Matthew Wilcox <willy@infradead.org> wrote: > > Yes, Linus mis-stated it: Yeah, I got the order wrong. > ... but as David pointed out, I fixed this in e320d3012d25 .. and I must have seen it, but not really internalized it. And now that I look at it more closely, I'm actually surprised that other than the magic "speculative first page" case we don't seem to use page reference counting for non-order-0 pages (which would break that hack horribly). Linus
On Sun, Jan 10, 2021 at 11:30:57AM -0800, Linus Torvalds wrote:
> So if you start off with the rule that "I will always COW unless I can
> trivially see I'm the only owner", then I think we have really made
> for a really clear and unambiguous rule.

I must confess that's the major reason that when I saw the COW simplification patch I felt it was great. Not only because it's an easier way to understand all of this, but also because it helped unbreak uffd-wp at that moment by dropping the patch of "COW for read gups" instead of introducing more things to fix stuff, like the FOLL_BREAK_COW idea [1].

But it's a pity that only now, after I read all the threads... I found we might start to lose things for the beauty the COW patch brought. It turns out the simplicity is just not for free; there is always a cost. The complexity around GUP has always existed, probably because GUP is hard itself! It's just a matter of where the complexity resides: either in do_wp_page(), or in the rest of the code in some other way, e.g., a complicated version of page_trans_huge_map_swapcount().

For example, in the future we'll have no way to wr-protect any pinned pages, no matter for soft-dirty or uffd-wp or anything else: it'll be illegal by definition! That reasoning is not extremely easy to arrive at by instinct, I'd say... "DMA-pinned pages could be written after all by the device", that's true, but how about read DMA pins? Oh, read DMA pins do not exist, because if we tried a read DMA pin it would break... but is that a strong enough reason?

We probably want to fix mprotect() too with pinned pages, because even if mprotect(READ) is fine, then in mprotect(READ|WRITE) we'll need to make sure the write bit is solid if the page is pinned (because now we'll assume all pinned pages should always have the write bit set in their ptes), or another alternative could be that we'll just fail some mprotect() calls upon pinned pages. Limitations just start to pile up, it seems to me.
We also unveiled the vmsplice issue with THP, because now on COW we don't have symmetric behavior for small/huge pages any more: for small pages, we'll do the page_count() check to make the copy decision, while we're still using page_mapcount() for THPs. Do we finally need to convert do_huge_pmd_wp_page() to also use the same logic as the small pages? The risk in front is not clear to me, though.

GUP(write=0) will be extremely special from now on, simply because a read won't trigger COW... I must confess that, by instinct, GUP(write=0) should simply mean "I want to get this page, but for reading". Then I realized it's not extremely obvious to me why it should be treated totally differently from a write GUP on this matter.

Due to my lack of memory-management experience, I only just started to notice that a huge amount of the work previously done hard on maintaining page_mapcount() was simply to make sure GUP pages keep their intuitive meaning, which is: when we GUP a page, it'll always be coherent with the page we've got in the process page table. Now that instinct will go away too, because the GUPed page could now be a different page than the one that sits in the pgtable.

Then I remembered something very nice about the COW simplification change: as the test case reported [2], there is a huge boost of 31.4% after COW simplification. Today I went back and tried to understand why, because I suddenly found I cannot fully understand it - I thought it was majorly because of the page lock, but maybe not, since the page lock is actually a fine-granularity lock in this test that covers 1G of mem.

The test was carried out on a host with 192G mem + 104 cores; what it did was simply:

./usemem --runtime 300 -n 104 --prealloc --prefault 959297984

./usemem is the memory workload that does memory accesses; it fork()s into 104 processes with a shared 1G mem region for those accesses, so after it's all done it could use up to 104G of pages after all the COWs.
When it runs, COW happens very frequently across merely all the cores, triggering the do_wp_page() path. However, 104 processes won't easily collapse on the same 4K page at the same time to contend the page lock.

I just noticed it may simply be the overhead of reuse_swap_page(). But how heavy would reuse_swap_page() be? It'll be heavier if THP is the case, because page_trans_huge_map_swapcount() has a complicated loop over those small pages; then I found that indeed the test was carried out with THP enabled and also on by default:

CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y

IIUC that'll make reuse_swap_page() very heavy, I think. Because initially the 100+ processes share some THPs, the first writes on the THPs will split the THPs, then reuse_swap_page() will always take the slowest path, looping over each small page for each small-page COW write.

So that 31.4% number got "shrunk" a bit after I noticed all of this - this is already a specific test which merely does COW and nothing else. It'll be hard to tell how much performance gain we can get from this simplification of COW on other real-life workloads.

Not to mention that removing reuse_swap_page() seems to also delay swap recycling (which is something we'll need to do sooner or later; so if COW got faster, something else got slower, e.g. the page reclaim logic), and I noticed Ying tried to partly recover it [3]. It's not clear to me where the best place to do the recycling is, but that sounds like a different problem.

The two major benefits to me of the current COW simplification are simplicity of logic and performance gains. But they start to fade away with the above. Not really that much, but big enough to make me start questioning myself on whether it's the best approach from a purely technical pov...

Then I also noticed that the other path of FOLL_BREAK_COW [1] might be able to fix all of the above frustrations, as mentioned by others.
It's still complicated; I really don't like that part. But again, it just seems to be a matter of where the complexity resides - there's just no way to avoid that complexity. What I'm afraid of is that the complexity is there by nature and we can't change it. I'm also afraid that if we go the current way it'll be even more complicated in the end. So it seems wise to think more about the direction, since we're at a crossroads, before starting to fix all the todos, e.g. soft-dirty and mprotect against pinned pages, and all the rest of the things we'd need to fix with COW page reuse under the current solution.

I feel extremely sorry to have dumped my brain like this, especially considering the above could be completely garbage, so I wasted the time of a lot of people reading it or simply scanning it over... However, I expressed it just in case it could help in some form.

Thanks,

[1] https://lore.kernel.org/lkml/20200811183950.10603-1-peterx@redhat.com/
[2] https://lore.kernel.org/lkml/20200914024321.GG26874@shao2-debian/
[3] https://lore.kernel.org/lkml/20210113024241.179113-1-ying.huang@intel.com/
On 10.01.21 01:44, Andrea Arcangeli wrote: > Hello Andrew and everyone, > > Once we agree that COW page reuse requires full accuracy, the next > step is to re-apply 17839856fd588f4ab6b789f482ed3ffd7c403e1f and to > return going in that direction. After stumbling over the heated discussion around this, I wanted to understand the details and the different opinions. I tried to summarize in my simple words (bear with me) what happened and how I think we can proceed from here. Maybe that helps. ==== What happened: 1) We simplified handling of faults on write-protected pages (page table entries): we changed the logic when we can reuse a page ("simply unprotecting it"), and when we have to copy it instead (COW). The essence of the simplification is, that we only reuse a page if we are the only single user of the page, meaning page_count(page) == 1, and the page is mapped into a single process (page_mapcount(page) == 1); otherwise we copy it. Simple. 2) The old code was complicated and there are GUP (e.g., RDMA, VFIO) cases that were broken in various ways in the old code already: most prominently fork(). As one example, it would have been possible for mprotect(READ) memory to still get modified by GUP users like RDMA. Write protection (AFAIU via any mechanism) after GUP pinned a page was not effective; the page was not copied. 3) Speculative pagecache reference can temporarily bump up the page_count(page), resulting in false positives. We could see page_count(page) > 1, although we're the single instance that actually uses a page. In the simplified code, we might copy a page although not necessary (I cannot tell how often that actually happens). 4) clear_refs(4) ("measure approximately how much memory a process is using"), uffd-wp (let's call it "lightweight write-protection, handling the actual fault in user space"), and mprotect(READ) all write-protect page table entries to generate faults on next write access. 
With the simplified code, we will COW whenever we find the page_count(page) > 1. The simplification seemed to regress clear_refs and uffdio-wp code (AFAIU in case of uffd-wp, it results in memory corruption). But looks like we can mostly fix it by adding more extensive locking. 5) Mechanisms like GUP (AFAIU including Direct I/O) also takes references on pages, increasing page_count(). With the simplification, we might now end up copying a page, although there is "somewhat" only a single user/"process" involved. One example is RDMA: if we read memory using RDMA and mprotect(READ) such memory, we might end up copying the underlying page on the next write: suddenly, RDMA is disconnected and will no longer read what is getting written. Not to mention, we consume more memory. AFAIU, other examples include direct I/O (e.g., write() with O_DIRECT). AFAIU, a more extreme case is probably VFIO: A VM with VFIO (e.g., passthrough of a PCI device) can essentially be corrupted by "echo 4 > /proc/[pid]/clear_refs". 6) While some people think it is okay to break GUP further, as it is already broken in various other ways, other people think this is changing something that used to work (AFAIU a user-visible change) with little benefit. 7) There is no easy way to detect if a page really was pinned: we might have false positives. Further, there is no way to distinguish if it was pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking we most probably would need more counters, which we cannot fit into struct page. (AFAIU, for huge pages it's easier). However, AFAIU, even being able to detect if (and how) a page was pinned would not completely help to solve the puzzle. 8) We have a vmsplice security issue that has to be fixed by touching the code in question. A forked child process can read memory content of its parent, which was modified by the parent after fork. AFAIU, the fix will further lock us in into the direction of the code we are heading. 
9) The simplification is part of v5.10, which is an LTS release. AFAIU, that one needs fixing, too.

I see the following possible directions we can head:

A) Keep the simplification. Try fixing the fallout. Keep the GUP cases broken, or make mprotect() fail when detecting such a scenario; AFAIU, both are user-visible changes.

B) Keep the simplification. Try fixing the fallout. Fix GUP cases that used to work; AFAIU fixing this is the hard/impossible part, and is undesired by some people.

C) Revert the simplification for now. Go back to the drawing board and use what we learned to come up with a simplification that (all?) people are happy with.

D) Revert the simplification: turns out the code could not get simplified to this extent. We learned a lot, though.

======

Please let me know in case I messed up anything and/or missed important points.
On Fri, Jan 15, 2021 at 09:59:23AM +0100, David Hildenbrand wrote:
> AFAIU, a more extreme case is probably VFIO: A VM with VFIO (e.g.,
> passthrough of a PCI device) can essentially be corrupted by "echo 4 >
> /proc/[pid]/clear_refs".

I've been told that when doing migration with RDMA, the VM's memory also ends up pinned, and then it does the stuff of #4. So it deliberately does clear_refs(4) on RDMA-pinned memory and requires no COW. This is now a real-world uABI break, unfortunately.

> 7) There is no easy way to detect if a page really was pinned: we might
> have false positives. Further, there is no way to distinguish if it was
> pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking
> we most probably would need more counters, which we cannot fit into
> struct page. (AFAIU, for huge pages it's easier).

I think this is the real issue. We can only store so much information, so we have to decide which things work and which things are broken. So far nobody has presented a way to record everything, at least..

> However, AFAIU, even being able to detect if (and how) a page was pinned
> would not completely help to solve the puzzle.

At least for COW reuse, if we assign labels to every page user, and imagine we can track everything, I think we get this list:

 - # of ptes referencing the page (mapcount?)
 - # of page * pointer references that don't touch data (ie the
   speculative page cache ref)
 - # of DMA/CPU readers
 - # of DMA/CPU writers
 - # of long term data accesses
 - # of other reader/writers (specifically process incoherent
   reader/writers, not "DMA with the CPU" like vmsplice/iouring)

Maybe there are more? This is what I've understood so far from this thread.
Today's kernel makes the COW reuse decision as:

  # ptes == 1 &&
  # refs == 0 &&
  # DMA readers == 0 &&
  # DMA writers == 0 &&
  # of longterm == 0 &&
  # other reader/writers == 0

(in essence this is what _refcount == 1 is saying, I think)

From a GUP perspective I think the useful property is "a physical page under GUP is not indirectly removed from the mm_struct that pinned it". This is the idea that the process CPU page table and the ongoing DMA remain synchronized. This is a generalized statement from the clear_refs(4) and fork() regressions.

Therefore, COW should not copy a page just because it is under GUP; it breaks the idea directly. We've also said speculative #refs should not cause COW. Removing both of those gets us to the COW reuse decision as:

  # ptes == 1 &&
  # other reader/writers == 0

And I think where Linus is coming from is that '# ptes' (eg mapcount) alone is not right because there are other relevant reader/writers too. (I'm not sure what these are, has someone pointed at one?)

So, we have 64 bits for _refcount and _mapcount, and we currently encode things as:

 - # ptes                      (_mapcount)
 - # page pointers +           (low bits of _refcount)
   # DMA readers + writers +
   # other reader/writers +
   # ptes                      (we incr both _mapcount and _refcount?)
 - # long term data accesses   (high bits of _refcount)

If we move '# other reader/writers' to _mapcount (maybe with a shift), does it help?

We also talked about GUP as meaning wrprotect == 0, but we could also change that to the idea that GUP means COW will always re-use, eg '#ptes == 1 && # other reader/writers == 0'. This gives some definition to what mprotect(PROT_READ) means for pages under DMA (though I still think PROT_READ of pages under DMA write is weird).

> 8) We have a vmsplice security issue that has to be fixed by touching
> the code in question. A forked child process can read memory content of
> its parent, which was modified by the parent after fork. AFAIU, the fix
> will further lock us in into the direction of the code we are heading.
No, vmsplice is just wrong. vmsplice has to do FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE for read-only access to pages if userspace controls the duration of the pin.

There are other bad bugs, like permanently locking DAX/CMA/ZONE_MIGRATE memory, if the above pattern is not used. There was some debate over alternatives, but for a backport security fix it has to be the above, AFAIK.

Jason
>> 7) There is no easy way to detect if a page really was pinned: we might
>> have false positives. Further, there is no way to distinguish if it was
>> pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking
>> we most probably would need more counters, which we cannot fit into
>> struct page. (AFAIU, for huge pages it's easier).
>
> I think this is the real issue. We can only store so much information,
> so we have to decide which things work and which things are broken. So
> far someone hasn't presented a way to record everything at least..

I do wonder how many (especially long-term) GUP readers/writers we have to expect and, especially, support for a single base page. Do we have a rough estimate?

With RDMA, I would assume we only need a single one (e.g., one RDMA device; I'm pretty sure I'm wrong, that sounds too easy).
With VFIO I guess we need one for each VFIO container (~ in the worst case one for each passthrough device).
With direct I/O, vmsplice and other GUP users?? No idea.

If we could somehow put a limit on the #GUPs we support, and fail further GUPs (e.g., -EAGAIN?) once a limit is reached, we could partition the refcount into something like (assume a max of #15 GUP READ and #15 GUP R/W, which is most probably a horribly bad choice):

[ GUP READ ][ GUP R/W ][ ordinary ]
 31 ... 28   27 ... 24  23 .... 0

But due to saturation handling in "ordinary", we would lose a further 2 bits (AFAIU), leaving us "only" 22 bits for "ordinary". Now, I have no idea how many bits we actually need in practice.

Maybe we need fewer bits for GUP READ, because most users want GUP R/W? No idea.

Just wild ideas. Most probably that has already been discussed, and most probably people figured that it's impossible :)
On Fri, Jan 15, 2021 at 08:46:48PM +0100, David Hildenbrand wrote: > Just wild ideas. Most probably that has already been discussed, and most > probably people figured that it's impossible :) No, I think it is all fair topics. There is no API reason for any of this to be limited, but in practice, I doubt there is more than small 10's Huge pages complicate things of course Jason
On 1/15/21 11:46 AM, David Hildenbrand wrote: >>> 7) There is no easy way to detect if a page really was pinned: we might >>> have false positives. Further, there is no way to distinguish if it was >>> pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking >>> we most probably would need more counters, which we cannot fit into >>> struct page. (AFAIU, for huge pages it's easier). >> >> I think this is the real issue. We can only store so much information, >> so we have to decide which things work and which things are broken. So >> far someone hasn't presented a way to record everything at least.. > > I do wonder how many (especially long-term) GUP readers/writers we have > to expect, and especially, support for a single base page. Do we have a > rough estimate? > > With RDMA, I would assume we only need a single one (e.g., once RDMA > device; I'm pretty sure I'm wrong, sounds too easy). > With VFIO I guess we need one for each VFIO container (~ in the worst > case one for each passthrough device). > With direct I/O, vmsplice and other GUP users ?? No idea. > > If we could somehow put a limit on the #GUP we support, and fail further > GUP (e.g., -EAGAIN?) once a limit is reached, we could partition the > refcount into something like (assume max #15 GUP READ and #15 GUP R/W, > which is most probably a horribly bad choice) > > [ GUP READ ][ GUP R/W ] [ ordinary ] > 31 ... 28 27 ... 24 23 .... 0 > > But due to saturate handling in "ordinary", we would lose further 2 bits > (AFAIU), leaving us "only" 22 bits for "ordinary". Now, I have no idea > how many bits we actually need in practice. > > Maybe we need less for GUP READ, because most users want GUP R/W? No idea. > > Just wild ideas. Most probably that has already been discussed, and most > probably people figured that it's impossible :) > I proposed this exact idea a few days ago [1]. It's remarkable that we both picked nearly identical values for the layout! 
:)

But as the responses show, security problems prevent pursuing that approach.

[1] https://lore.kernel.org/r/45806a5a-65c2-67ce-fc92-dc8c2144d766@nvidia.com

thanks,
On 16.01.21 04:40, John Hubbard wrote: > On 1/15/21 11:46 AM, David Hildenbrand wrote: >>>> 7) There is no easy way to detect if a page really was pinned: we might >>>> have false positives. Further, there is no way to distinguish if it was >>>> pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking >>>> we most probably would need more counters, which we cannot fit into >>>> struct page. (AFAIU, for huge pages it's easier). >>> >>> I think this is the real issue. We can only store so much information, >>> so we have to decide which things work and which things are broken. So >>> far someone hasn't presented a way to record everything at least.. >> >> I do wonder how many (especially long-term) GUP readers/writers we have >> to expect, and especially, support for a single base page. Do we have a >> rough estimate? >> >> With RDMA, I would assume we only need a single one (e.g., once RDMA >> device; I'm pretty sure I'm wrong, sounds too easy). >> With VFIO I guess we need one for each VFIO container (~ in the worst >> case one for each passthrough device). >> With direct I/O, vmsplice and other GUP users ?? No idea. >> >> If we could somehow put a limit on the #GUP we support, and fail further >> GUP (e.g., -EAGAIN?) once a limit is reached, we could partition the >> refcount into something like (assume max #15 GUP READ and #15 GUP R/W, >> which is most probably a horribly bad choice) >> >> [ GUP READ ][ GUP R/W ] [ ordinary ] >> 31 ... 28 27 ... 24 23 .... 0 >> >> But due to saturate handling in "ordinary", we would lose further 2 bits >> (AFAIU), leaving us "only" 22 bits for "ordinary". Now, I have no idea >> how many bits we actually need in practice. >> >> Maybe we need less for GUP READ, because most users want GUP R/W? No idea. >> >> Just wild ideas. Most probably that has already been discussed, and most >> probably people figured that it's impossible :) >> > > I proposed this exact idea a few days ago [1]. 
> It's remarkable that we both
> picked nearly identical values for the layout! :)

Heh! Somehow I missed that. But well, there were *a lot* of mails :)

>
> But as the responses show, security problems prevent pursuing that approach.

It still feels kind of wrong to waste valuable space in the memmap. In
an ideal world (well, one that still only allows for a 64 byte memmap
:) ), we would:

1) Partition the refcount into separate fields that cannot overflow into
each other, similar to my example above, but maybe add even more fields.

2) Reject attempts that would result in an overflow to everything except
the "ordinary" field (e.g., GUP fields in my example above).

3) Put an upper limit on the "ordinary" field that we ever expect for
sane workloads (e.g., 10 bits). In addition, reserve some bits (like the
saturate logic) that we handle as a "red zone".

4) For the "ordinary" field, as soon as we enter the red zone, we know
we have an attack going on. We continue on paths that we cannot fail
(e.g., get_page()) but eventually try stopping the attacker(s). AFAIU,
the attacker(s) are something (e.g., one or multiple processes) that has
direct access to the page in their address space. Of course, the more
paths we can reject, the better.

Now, we would:

a) Have to know what sane upper limits on the "ordinary" field are. I
have no idea which values we can expect. Attacker vs. sane workload.

b) Need a way to identify the attacker(s). In the simplest case, this is
a single process. In the hard case, this involves many processes.

c) Need a way to stop the attacker(s). Doing that out of random context
is problematic. Last resort is doing this asynchronously from another
thread, which leaves more time for the attacker to do harm.

Of course, the problem gets more involved as soon as we might have a
malicious child process that uses a page from a well-behaving parent
process for the attack. Imagine we kill relevant processes: we might end
up killing someone who's not responsible.
And even if we don't kill, but instead reject try_get_page(), we might
degrade the well-behaving parent process, AFAICS. Alternatives to
killing the process might be unmapping the problematic page from the
address space.

Reminds me a little of handling memory errors for a page, eventually
killing all users of that page. mm/memory-failure.c:kill_procs().

Complicated problem :)
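Points 3) and 4) above — a sane upper limit plus a "red zone" on the ordinary field — can also be sketched as a toy model. Again, this is not kernel code; the limits, the struct, and the `fake_*` names are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical "red zone" on the ordinary refcount field: sane
 * workloads stay at or below ORDINARY_SANE_MAX; anything above it
 * is treated as a likely refcount attack. get_page()-style paths
 * cannot fail, so they keep counting (saturating at a hard cap)
 * but raise a flag that an asynchronous handler could act on
 * (e.g., unmapping the page or stopping the attacker).
 */
#define ORDINARY_SANE_MAX	(1u << 10)	/* ~10 bits for sane use */
#define ORDINARY_HARD_MAX	(1u << 12)	/* saturation point      */

struct fake_page {
	uint32_t ordinary;
	bool under_attack;
};

/* Non-failing reference grab: saturates instead of overflowing. */
static void fake_get_page(struct fake_page *page)
{
	if (page->ordinary < ORDINARY_HARD_MAX)
		page->ordinary++;
	if (page->ordinary > ORDINARY_SANE_MAX)
		page->under_attack = true;	/* defer actual handling */
}

/* Failable paths can reject outright once the red zone is entered. */
static int fake_try_get_page(struct fake_page *page)
{
	if (page->ordinary >= ORDINARY_SANE_MAX)
		return -1;
	page->ordinary++;
	return 0;
}
```

The hard part the mail describes — identifying and stopping the attacker once `under_attack` fires — is exactly what this sketch defers to "another thread".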