Message ID: 20210107200402.31095-1-aarcange@redhat.com (mailing list archive)
Series: page_count can't be used to decide when wp_page_copy
On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote:
> vmsplice syscall API is insecure allowing long term GUP PINs without
> privilege.

Lots of places are relying on pin_user_pages long term pins of memory, and cannot be converted to notifiers.

I don't think it is reasonable to just declare that insecure and require privileges; it is a huge ABI break.

FWIW, vhost tries to use notifiers as a replacement for GUP, and I think it ended up quite strange and complicated. It is hard to maintain performance when every access to the pages needs to hold some protection against parallel invalidation.

Jason
On Thu, Jan 7, 2021 at 12:25 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > Lots of places are relying on pin_user_pages long term pins of memory, > and cannot be converted to notifiers. > > I don't think it is reasonable to just declare that insecure and > requires privileges, it is a huge ABI break. Also, I think GUP (and pin_user_pages() as a special case) is a lot more important and more commonly used than UFFD. Which is really why I think this needs to be fixed by just fixing UFFD to take the write lock. I think Andrea is blinded by his own love for UFFDIO: when I do a debian codesearch for UFFDIO_WRITEPROTECT, all it finds is the kernel and strace (and the qemu copies of the kernel headers). Does the debian code search cover everything? Obviously not. But if you cannot find A SINGLE USE of that thing in the Debian code search, then that is sure a sign of _something_. Linus
On Thu, Jan 7, 2021 at 12:32 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Which is really why I think this needs to be fixed by just fixing UFFD > to take the write lock. Side note, and not really related to UFFD, but the mmap_sem in general: I was at one point actually hoping that we could make the mmap_sem a spinlock, or at least make the rule be that we never do any IO under it. At which point a write lock hopefully really shouldn't be such a huge deal. The main source of IO under the mmap lock was traditionally the page faults obviously needing to read the data in, but we now try to handle that with the whole notion of page fault restart instead. But I'm 100% sure we don't do as good a job of it as we _could_ do, and there are probably a lot of other cases where we end up doing IO under the mmap lock simply because we can and nobody has looked at it very much. So if taking the mmap_sem for writing is a huge deal - because it ends up serializing with IO by people who take it for reading - I think that is something that might be worth really looking into. For example, right now I think we (still) only do the page fault retry once - and the second time if the page still isn't available, we'll actually wait with the mmap_sem held. That goes back to the very original page fault retry logic, when I was worried that some infinite retry would cause busy-waiting because somebody didn't do the proper "drop mmap_sem, then wait, then return retry". And if that actually causes problems, maybe we should just make sure to fix it? remove that FAULT_FLAG_TRIED bit entirely, and make the rule be that we always drop the mmap_sem and retry? Similarly, if there are users that don't set FAULT_FLAG_ALLOW_RETRY at all (because they don't have the logic to check if it's a re-try and re-do the mmap_sem etc), maybe we can just fix them. I think all the architectures do it properly in their page fault paths (I think Peter Xu converted them all - no?), but maybe there are cases of GUP that don't have it. Or maybe there is something else that I just didn't notice, where we end up having bad latencies on the mmap_sem. I think those would very much be worth fixing, so that if UFFDIO_WRITEPROTECT taking the mmapo_sem for writing causes problems, we can _fix_ those problems. But I think it's entirely wrong to treat UFFDIO_WRITEPROTECT as specially as Andrea seems to want to treat it. Particularly with absolutely zero use cases to back it up. Linus
On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote:
>
> > vmsplice syscall API is insecure allowing long term GUP PINs without
> > privilege.
>
> Lots of places are relying on pin_user_pages long term pins of memory,
> and cannot be converted to notifiers.
>
> I don't think it is reasonable to just declare that insecure and
> require privileges; it is a huge ABI break.

Where's that ABI? Are there specs or a code example in kernel besides vmsplice itself?

I don't see how it's possible to consider long term GUP pins completely unprivileged if not using mmu notifier. vmsplice doesn't even account them in rlimit (it cannot, because it cannot identify all put_pages either).

Long term GUP pins not using mmu notifier and not accounted in rlimit are an order of magnitude more VM-intrusive than mlock.

The reason it's worse than mlock, even if we ignore all the performance features they break (including numa bindings, which mlock doesn't risk breaking), comes because you can unmap the memory after taking those rlimit-unaccounted GUP pins. So the OOM killer won't even have a chance to see the GUP pins coming.

So it can't be that mlock has to be privileged but unconstrained, unaccounted long term GUP pins as in vmsplice are ok to stay unprivileged.

Now io_uring does account the GUP pins in the mlock rlimit, but after the vma is unmapped it'd still cause the same confusion to the OOM killer, and in addition the assumption that each GUP pin costs 4k is also flawed. However the io_uring model can use the mmu notifier without slowdown to the fast paths, so it's not going to cause any ABI break to fix it.

Or to see it another way, it'd be fine to declare all mlock rlimits are obsolete and memcg is the only way to constrain RAM usage, but then mlock should stop being privileged, because mlock is a lesser concern and it won't risk confusing the OOM killer at least. The good thing is the GUP pins won't escape memcg accounting, but that accounting also doesn't come entirely free.

> FWIW, vhost tries to use notifiers as a replacement for GUP, and I
> think it ended up quite strange and complicated. It is hard to
> maintain performance when every access to the pages needs to hold some
> protection against parallel invalidation.

And that's fine, this is all about whether it should require a one liner change to add the username to the realtime group in /etc/group or not.

You're focusing on your use case, but we have to put things in the perspective of how all these changes started. The whole zygote issue wouldn't even register if the child had the exact same credentials of the parent. Problem is the child dropped privileges and went with a luser id, which clearly cannot ptrace the parent, and so if long term unprivileged GUP pins are gone from the equation, what remains that the child can do is purely theoretical even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f.

NOTE: I'm all for fixing the COW for good, but vmsplice, or any long term GUP pin which is absolutely required to make such an attack practical, looks like the real low hanging fruit here to fix. However fixing it so that clear_refs becomes fundamentally incompatible with mmu notifier users unless they all convert to pure !FOLL_GET GUPs, let alone long term GUP pins not using mmu notifier, doesn't look great.

For vmsplice that new break-COW is the fix, because it happens in the other process. For every legit long term GUP, where the break-COW happens in the single and only process, it's silent MM corruption.
Thanks, Andrea
On Thu, Jan 07, 2021 at 12:32:09PM -0800, Linus Torvalds wrote: > I think Andrea is blinded by his own love for UFFDIO: when I do a > debian codesearch for UFFDIO_WRITEPROTECT, all it finds is the kernel > and strace (and the qemu copies of the kernel headers). For the record, I feel obliged to reiterate I'm thinking purely in clear_refs terms here. It'd be great if we can only focus on clear_refs_write and nothing else. Like I said earlier, whatever way clear_refs/softdirty copes with do_wp_page, uffd-wp can do the identical thing so, uffd-wp is effectively irrelevant in this whole discussion. Clear-refs/softdirty predates uffd-wp by several years too. Thanks, Andrea
On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote: > I think those would very much be worth fixing, so that if > UFFDIO_WRITEPROTECT taking the mmapo_sem for writing causes problems, > we can _fix_ those problems. > > But I think it's entirely wrong to treat UFFDIO_WRITEPROTECT as > specially as Andrea seems to want to treat it. Particularly with > absolutely zero use cases to back it up. Again for the record: there's nothing at all special in UFFDIO_WRITEPROTECT in this respect. If you could stop mentioning UFFDIO_WRITEPROTECT and only focus on softdirty/clear_refs, maybe you wouldn't think my judgment is biased towards clear_refs/softdirty too. You can imagine the side effects of page_count doing a COW erroneously, as corollary of the fact that KSM won't ever allow to merge two pages if one of them is under GUP pin. Why is that? Thanks, Andrea
On Thu, Jan 7, 2021 at 2:03 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> If you could stop mentioning UFFDIO_WRITEPROTECT and only focus on
> softdirty/clear_refs, maybe you wouldn't think my judgment is biased
> towards clear_refs/softdirty too.

So I think we can agree that even that softdirty case we can just handle by "don't do that then".

If a page is pinned, the dirty bit of it makes no sense, because it might be dirtied completely asynchronously by the pinner.

So I think none of the softdirty stuff should touch pinned pages. I think it falls solidly under that "don't do it then".

Just skipping over them in clear_soft_dirty[_pmd]() doesn't look that hard, does it?

Linus
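For illustration, a minimal sketch of what "just skip pinned pages" could look like, loosely modelled on clear_soft_dirty() in fs/proc/task_mmu.c and assuming page_maybe_dma_pinned() as the test; this is not what the thread settled on, and the real helper goes through ptep_modify_prot_start/commit rather than a bare set_pte_at():

/*
 * Sketch only: leave the pte of a possibly-pinned page alone instead of
 * wrprotecting it for soft-dirty tracking. page_maybe_dma_pinned() can
 * false-positive on heavily shared pages, which is acceptable here since
 * the worst case is merely not clearing soft-dirty for that page.
 */
static void clear_soft_dirty_sketch(struct vm_area_struct *vma,
				    unsigned long addr, pte_t *pte)
{
	pte_t ptent = *pte;
	struct page *page;

	if (!pte_present(ptent))
		return;

	page = vm_normal_page(vma, addr, ptent);
	if (page && page_maybe_dma_pinned(page))
		return;		/* pinned: skip, per the "don't do that then" rule */

	ptent = pte_wrprotect(ptent);
	ptent = pte_clear_soft_dirty(ptent);
	set_pte_at(vma->vm_mm, addr, pte, ptent);
}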
On Thu, Jan 07, 2021 at 02:17:50PM -0800, Linus Torvalds wrote:
> So I think we can agree that even that softdirty case we can just
> handle by "don't do that then".

Absolutely. The question is, if somebody was happily running clear_refs with an RDMA attached to the process, by the time they update and reboot they'll currently find out the hard way, with silent mm corruption.

So I was obliged to report this issue and the fact there was a very strong reason why page_count was not used there; it's even documented explicitly in the source:

 * [..] however we only use
 * page_trans_huge_mapcount() in the copy-on-write faults where we
 * need full accuracy to avoid breaking page pinning, [..]

I didn't entirely forget the comment; in fact I reiterated it also in Message-ID: <20200527212005.GC31990@redhat.com> on May 27 2020, since I recalled there was a very explicit reason why we had to use page_mapcount in do_wp_page and deliver full accuracy.

Now I cannot prove there's any such user that will break, but we'll find those with a year or more of delay. Even the tlb flushing deferral that caused clear_refs_write to also corrupt userland memory and literally lose userland writes, even without any special secondary MMU hardware being attached to the memory, took 6 months to find.

> if a page is pinned, the dirty bit of it makes no sense, because it
> might be dirtied completely asynchronously by the pinner.
>
> So I think none of the softdirty stuff should touch pinned pages. I
> think it falls solidly under that "don't do it then".
>
> Just skipping over them in clear_soft_dirty[_pmd]() doesn't look that
> hard, does it?

1) How do you know again if it's not a speculative lookup or an O_DIRECT that made them look pinned?

2) If it's a hugepage, a 4k pin will make soft dirty then skip 511 dirty bits by mistake, including those pages that weren't pinned and that userland is still writing through the transhuge pmd. Until v4.x we had a page pin counter for each subpage so the above wouldn't have happened, but not anymore since it was simplified and improved; now the page_count is pretty inefficient there, even if it'd be possible to make it safe.

3) The GUP(write=0) may be just reading from RAM and sending to the other end, so clear_refs may currently have tracked CPU writes perfectly fine, with no interference whatsoever from the secondary MMU working readonly on the memory.

Thanks, Andrea
On Thu, Jan 07, 2021 at 04:45:33PM -0500, Andrea Arcangeli wrote: > On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote: > > On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote: > > > > > vmsplice syscall API is insecure allowing long term GUP PINs without > > > privilege. > > > > Lots of places are relying on pin_user_pages long term pins of memory, > > and cannot be converted to notifiers. > > > > I don't think it is reasonable to just declare that insecure and > > requires privileges, it is a huge ABI break. > > Where's that ABI? Are there specs or a code example in kernel besides > vmsplice itself? If I understand you right, you are trying to say that the 193 pin_user_pages() callers cannot exist as unpriv any more? The majority cannot be converted to notifiers because they are DMA based. Every one of those is an ABI for something, and does not expect extra privilege to function. It would be a major breaking change to have pin_user_pages require some cap. > The whole zygote issue wouldn't even register if the child had the > exact same credentials of the parent. Problem is the child dropped > privileges and went with a luser id, that clearly cannot ptrace the > parent, and so if long term unprivileged GUP pins are gone from the > equation, what remains that the child can do is purely theoretical > even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f. Sorry, I'm not sure I've found a good explanation how ptrace and GUP are interacting to become a security problem. 17839 makes sense to me, and read-only GUP has been avoided by places like RDMA and others for a very long time because of these issues, adding the same idea to the core code looks OK. The semantics we discussed during the COW on fork thread for pin user pages were, more or less, that once pinned a page should not be silently removed from the mm it is currently in by COW or otherwise in the kernel. So maybe ptrace should not be COW'ing pinned pages at all, as that is exactly the same kind of silent corruption fork was causing. Jason
On Fri, Jan 08, 2021 at 09:36:49AM -0400, Jason Gunthorpe wrote: > On Thu, Jan 07, 2021 at 04:45:33PM -0500, Andrea Arcangeli wrote: > > On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote: > > > On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote: > > > > > > > vmsplice syscall API is insecure allowing long term GUP PINs without ^^^^^^^^^ > > > > privilege. > > > > > > Lots of places are relying on pin_user_pages long term pins of memory, > > > and cannot be converted to notifiers. > > > > > > I don't think it is reasonable to just declare that insecure and > > > requires privileges, it is a huge ABI break. > > > > Where's that ABI? Are there specs or a code example in kernel besides > > vmsplice itself? > > If I understand you right, you are trying to say that the 193 > pin_user_pages() callers cannot exist as unpriv any more? 193, 1k 1m or their number in general, won't just make them safe... > The majority cannot be converted to notifiers because they are DMA > based. Every one of those is an ABI for something, and does not expect > extra privilege to function. It would be a major breaking change to > have pin_user_pages require some cap. ... what makes them safe is to be transient GUP pin and not long term. Please note the "long term" in the underlined line. O_DIRECT is perfectly ok to be unprivileged obviously. The VM can wait, eventually it goes away. Even a swapout is not an instant event and can be hold off by any number of other things besides a transient GUP pin. It can be hold off by PG_lock just to make an example. mlock however is long term, persistent, vmsplice takes persistent and can pin way too much memory for each mm, that doesn't feel safe. The more places doing stuff like that, the more likely one causes a safety issue, not the other way around it in fact. > > The whole zygote issue wouldn't even register if the child had the > > exact same credentials of the parent. Problem is the child dropped > > privileges and went with a luser id, that clearly cannot ptrace the > > parent, and so if long term unprivileged GUP pins are gone from the > > equation, what remains that the child can do is purely theoretical > > even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f. > > Sorry, I'm not sure I've found a good explanation how ptrace and GUP > are interacting to become a security problem. ptrace is not involved. What I meant by mentioning ptrace, is that if the child can ptrace the parent, then it doesn't matter if it can also do the below, so the security concern is zero in such case. With O_DIRECT or any transient pin you will never munmap while O_DIRECT is in flight, if you munmap it's undefined what happens in such case anyway. It is a theoretical security issue made practical by vmsplice API that allows to enlarge the window to years of time (not guaranteed milliseconds), to wait for the parent to trigger the wp_page_reuse. Remove vmsplice and the security issue in theory remains, but removed vmsplice it becomes irrelevant statistically speaking in practice. io_uring has similar concern but it can use mmu notifier, so it can totally fix it and be 100% safe from this. The scheduler disclosure date was 2020-08-25 so I can freely explain the case that motivated all these changes. 
case A)

	if !fork() {
		// in child
		mmap one page
		vmsplice takes gup pin long term on such page
		munmap one page
		// mapcount == 1 (parent mm)
		// page_count == 2 (gup in child, and parent mm)
	} else {
		parent writes to the page
		// mapcount == 1, wp_page_reuse
	}

parent did a COW with mapcount == 1 so the parent will take over a page that is still GUP pinned in the child. That's the security issue, because in this case the GUP pin is malicious.

Now imagine this case B)

	mmap one page
	RDMA or any secondary MMU takes a long term GUP pin
	munmap one page
	// mapcount == 1 (parent mm)
	// page_count == 2 (gup in RDMA, and parent mm)

How can the VM tell between the two different cases? It can't.

The current page_count in do_wp_page treats both cases the same, and because page_count is 2 in both cases, it calls wp_page_copy in both cases, breaking-COW in both cases.

However, you know full well in the second case it is a feature and not a bug that wp_page_reuse is called instead, and in fact it has to be called or it's a bug (and that's the bug page_count in do_wp_page introduces).

So page_count in do_wp_page is breaking all valid users, to take care of the purely theoretical security issue that isn't a practical concern if only vmsplice is secured at least as well as mlock.

page_count in do_wp_page is fundamentally flawed for all long term GUP pins done by secondary MMUs attached to the memory.

The fix in 17839856fd588f4ab6b789f482ed3ffd7c403e1f had to work by triggering a GUP(write=1) that would break-COW while vmsplice runs, in turn fully resolving the security concern, but without breaking your very important case B.

> 17839 makes sense to me, and read-only GUP has been avoided by places
> like RDMA and others for a very long time because of these issues,
> adding the same idea to the core code looks OK.

Yes, I acked 17839856fd588f4ab6b789f482ed3ffd7c403e1f since it looked the cleanest solution to take care of the purely theoretical security issue (purely theoretical after vmsplice is taken care of).

I planned today to look at what exactly didn't work in 17839856fd588f4ab6b789f482ed3ffd7c403e1f that may have required the move to 09854ba94c6aad7886996bfbee2530b3d8a7f4f4; it was a huge email thread and I was too busy with urgent work at the time.

> The semantics we discussed during the COW on fork thread for pin user
> pages were, more or less, that once pinned a page should not be
> silently removed from the mm it is currently in by COW or otherwise in
> the kernel.

I don't get what you mean here. Could you elaborate?

> So maybe ptrace should not be COW'ing pinned pages at all, as that is
> exactly the same kind of silent corruption fork was causing.

ptrace isn't involved, details above.

Could you also elaborate on whether fork started corrupting with 17839856fd588f4ab6b789f482ed3ffd7c403e1f applied? In which commit exactly did the corruption start?

In general fork(), unless you copy all GUP pinned pages and don't wrprotect them in fork(), must be handled by blocking all writes on the RDMA region in the parent, then you fork, and only after the child did the exec are you allowed to unblock the writes in the parent that holds the GUP long term pins.

I don't see a notable difference from page_count or mapcount in do_wp_page in this respect: only copying in fork() if the page is pinned, like was also proposed here https://lkml.kernel.org/r/20090311165833.GI27823@random.random, will also prevent having to block the writes until exec is run, though.
FWIW I obviously agree in copying in fork any pinned page, but that was supposed to be an orthogonal improvement and it wasn't supposed to fix a recent regression (the fork vs thread vs gup race always existed, and the need of stopping writes in between fork and exec also). Thanks, Andrea
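To make the ordering in case A concrete, here is a minimal, self-contained userspace sketch of the scenario. It is an illustration only, not Jann Horn's actual reproducer: the sleeps crudely approximate the required ordering, a 4096-byte page is assumed, error handling is omitted, and whether the leak is actually observable depends on which COW reuse policy the running kernel uses (17839856fd58 closes it, a pure mapcount==1 reuse does not).

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	char *data = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int p[2];

	memset(data, 'A', 4096);	/* fault the page in before fork */
	pipe(p);

	if (fork() == 0) {
		struct iovec iov = { .iov_base = data, .iov_len = 4096 };
		char buf[4096];

		/* long term GUP pin on the shared-COW page, then drop the mapping */
		vmsplice(p[1], &iov, 1, 0);
		munmap(data, 4096);
		/* mapcount == 1 (parent only), page_count still elevated by the pipe */

		sleep(2);			/* wait for the parent's "private" write */
		read(p[0], buf, sizeof(buf));	/* with mapcount-based reuse: leaks it */
		printf("child read: %.32s\n", buf);
		_exit(0);
	}

	sleep(1);			/* let the child pin and munmap first */
	strcpy(data, "PARENT SECRET");	/* wp fault: reuse vs copy decides the leak */
	wait(NULL);
	return 0;
}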
On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote:
> > The majority cannot be converted to notifiers because they are DMA
> > based. Every one of those is an ABI for something, and does not expect
> > extra privilege to function. It would be a major breaking change to
> > have pin_user_pages require some cap.
>
> ... what makes them safe is to be transient GUP pin and not long
> term.
>
> Please note the "long term" in the underlined line.

Many of them are long term, though only 50 or so have been marked specifically with FOLL_LONGTERM. I don't see how we can make such a major ABI break.

Looking at it, vmsplice() is simply wrong. A long term page pin must use pin_user_pages(), and either

  FOLL_LONGTERM|FOLL_WRITE             (write mode)
  FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE  (read mode)

ie it must COW and it must reject cases that are not longterm safe, like DAX and CMA and so on.

These are the well established rules, vmsplice does not get a pass simply because it is using the CPU to memory copy as its "DMA".

> speaking in practice. io_uring has similar concern but it can use mmu
> notifier, so it can totally fix it and be 100% safe from this.

IIRC io_uring does use FOLL_LONGTERM and FOLL_WRITE..

> The scheduler disclosure date was 2020-08-25 so I can freely explain
> the case that motivated all these changes.
>
> case A)
>
> 	if !fork() {
> 		// in child
> 		mmap one page
> 		vmsplice takes gup pin long term on such page
> 		munmap one page
> 		// mapcount == 1 (parent mm)
> 		// page_count == 2 (gup in child, and parent mm)
> 	} else {
> 		parent writes to the page
> 		// mapcount == 1, wp_page_reuse
> 	}
>
> parent did a COW with mapcount == 1 so the parent will take over a
> page that is still GUP pinned in the child.

Sorry, I missed something, how does mmaping a fresh new page in the child impact the parent?

I guess the issue is not to mmap but to GUP a shared page in a way that doesn't trigger COW during GUP and then munmap that page so a future parent COW does re-use, leaking access.

It seems enforcing FOLL_WRITE to always COW on GUP closes this, right? This is what all correct FOLL_LONGTERM users do today, it is required for many other reasons beyond this interesting security issue.

> However, you know full well in the second case it is a feature and not
> a bug, that wp_page_reuse is called instead, and in fact it has to be
> called or it's a bug (and that's the bug page_count in do_wp_page
> introduces).

What I was trying to explain below is that I think we agreed that a page under active FOLL_LONGTERM pin *can not* be write protected.

Establishing the FOLL_LONGTERM pin (for read or write) must *always* break the write protection, and the VM *cannot* later establish a new write protection on that page while the pin is active.

Indeed, it is complete nonsense to try and write protect a page that has active DMA write activity! Changing the CPU page protection bits will not stop any DMA! Doing so will inevitably become a security problem with an attack similar to what you described.

So this is what was done during fork() - fork will no longer write protect pages under FOLL_LONGTERM to make them COWable, instead it will copy them at fork time.

Any other place doing write protect must also follow these same rules.

I wasn't aware this could be used to create a security problem, but it does make sense. Write protect really must mean writes to the memory must stop, and that is fundamentally incompatible with active DMA.

Thus write protect of pages under DMA must be forbidden, as a matter of security.

Jason
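As a concrete illustration of the idiom spelled out above, a hedged driver-side sketch; pin_user_buffer() and its error handling are illustrative, not lifted from any in-tree driver:

/*
 * Sketch of the long term pin idiom: FOLL_LONGTERM|FOLL_WRITE when the
 * device will write the buffer, and FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE
 * even for device-read-only buffers, so GUP breaks COW up front and the
 * pinned page can never later be "taken over" by a COW reuse.
 */
static int pin_user_buffer(unsigned long uaddr, int npages,
			   bool device_writes, struct page **pages)
{
	unsigned int gup_flags = FOLL_LONGTERM | FOLL_WRITE;
	int got;

	if (!device_writes)
		gup_flags |= FOLL_FORCE;	/* "read mode" still forces the COW break */

	got = pin_user_pages_fast(uaddr, npages, gup_flags, pages);
	if (got < 0)
		return got;
	if (got != npages) {
		unpin_user_pages(pages, got);	/* every pin needs a matching unpin */
		return -EFAULT;
	}
	return 0;
}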
On Fri, Jan 8, 2021 at 10:19 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote: > > > The majority cannot be converted to notifiers because they are DMA > > > based. Every one of those is an ABI for something, and does not expect > > > extra privilege to function. It would be a major breaking change to > > > have pin_user_pages require some cap. > > > > ... what makes them safe is to be transient GUP pin and not long > > term. > > > > Please note the "long term" in the underlined line. > > Many of them are long term, though only 50 or so have been marked > specifically with FOLL_LONGTERM. I don't see how we can make such a > major ABI break. > > Looking at it, vmsplice() is simply wrong. A long term page pin must > use pin_user_pages(), and either FOLL_LONGTERM|FOLL_WRITE (write mode) > FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE (read mode) Can we just remove vmsplice() support? We could make it do a normal copy, thereby getting rid of a fair amount of nastiness and potential attacks. Even ignoring issues relating to the length of time that the vmsplice reference is alive, we also have whatever problems could be caused by a malicious or misguided user vmsplice()ing some memory and then modifying it. --Andy
On Fri, Jan 8, 2021 at 10:31 AM Andy Lutomirski <luto@kernel.org> wrote: > > Can we just remove vmsplice() support? We could make it do a normal > copy, thereby getting rid of a fair amount of nastiness and potential > attacks. Even ignoring issues relating to the length of time that the > vmsplice reference is alive, we also have whatever problems could be > caused by a malicious or misguided user vmsplice()ing some memory and > then modifying it. Well, that "misguided user" is kind of the point, originally. That's what zero-copying is all about. But we could certainly remove it in favor of copying, because zero-copy has seldom really been a huge advantage in practice outside of benchmarks. That said, I continue to not buy into Andrea's argument that page_count() is wrong. Instead, the argument is: (1) COW can never happen "too much": the definition of a private mapping is that you have your own copy of the data. (2) the one counter case I feel is valid is page pinning when used for a special "pseudo-shared memory" thing and that's basically what FOLL_GUP does. So _regardless_ of any vmsplice issues, I actually think that those two basic rules should be our guiding principle. And the corollary to (2) is that COW must absolutely NEVER re-use too little. And that _was_ the bug with vmsplice, in that it allowed re-use that it shouldn't have. Linus
On Fri, Jan 8, 2021 at 10:19 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > Sorry, I missed something, how does mmaping a fresh new page in the > child impact the parent? > > I guess the issue is not to mmap but to GUP a shared page No. It has nothing to do with a shared page. The problem with the COW in the child is that the parent now BELIEVES that it has a private copy (because page_mapcount() was 1), but it doesn't really. But because the parent *thought* it had a private copy of the page, when the _parent_ did a write, it would cause the page COW logic to go "you have exclusive access to the page, so I'll just make it writable". The parent then writes whatever private data to that page. That page is still in the system as a vmsplice'd page, and the child can now read that private data that was _supposed_ to be exclusive to the parent, but wasn't. And the thing is, blaming vmsplice() is entirely wrong. The exact same thing used to be able to happen with any GUP case, vmsplice() was just the simplest way to cause that non-mapped page access. But any GUP could do it, with the child basically fooling the parent into revealing data. Note that Zygote itself is in no way special from a technical standpoint, and this can happen after any random fork(). The only real difference is that in all *traditional* UNIX cases, this "child can see the parent's data with trickery before execve()" situation simply doesn't *matter*. In traditional fork() situations, the parent and the child are really the same program, and if you don't trust the child, then you don't trust the parent either. The Android Zygote case isn't _technically_ any different. But the difference is that because the whole idea with Zygote is to pre-map the JIT stuff for the child, you are in this special situation where the parent doesn't actually trust the child. See? No _technical_ difference. Exact same scenario as for any random fork() with GUP and COW going the wrong way. It just normally doesn't _matter_. And see above: because this is not really specific to vmsplice() (apart from that just being the easiest model), the _original_ fix for this was just "GUP will break COW early" commit: 17839856fd58 ("gup: document and work around "COW can break either way" issue") which is very straightforward: if you do a GUP lookup, you force that GUP to do the COW for you, so that nobody can then fool another process to think that it has a private page that can be re-used, but it really has a second reference to it. Because whoever took the "sneaky" GUP reference had to get their _own_ private copy first. But while that approach was very simple and very targeted (and I don't think it's wrong per se), it then caused other problems. In fact, it caused other problems for pretty much all the same cases that the current model causes problems for: all the odd special cases that do weird things to the VM. And because these problems were so odd, the alternate solution - and the thing I'm really pushing for - is to make the _core_ VM rules very simple and straightforward, and then the odd special cases have to live with those simple and straightforward rules. And the most core of those rules is that "page_mapcount()" fundamenally doesn't matter, because there are other references to pages that are all equally valid. Thinking that a page being "mapped" makes is special is wrong, as exemplified by any GUP case (but also as exemplified by the page cache or the swap cache, which were always a source of _other_ special cases for the COW code). 
So if you accept that notion of "page_mapcount() is meaningless" being a truism (which Andrea obviously doesn't), then the logical extension of that is the set of rules I outlined in my reply to Andy:

 (a) COW can never happen "too much", and "page_count()" is the fundamental "somebody has a reference to this page"

 (b) page pinning and any other "this needs to be coherent" ends up being a special per-page "shared memory" case

That special "shared memory page" thing in (b) is then the rule that when we pin a page, we make sure it's writable, and stays writable, so that COW never breaks the association.

That's then the thing that causes problems for anybody who wants to write-protect stuff.

Linus
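Read as code, rule (a) amounts to roughly the following reuse test in the write-protect fault path; a simplified sketch in the spirit of commit 09854ba94c6a, not the literal mainline function:

/*
 * Sketch: reuse the page in place only when this mapping holds the sole
 * reference. Any extra reference - a GUP pin, the swap cache, a pipe
 * holding the page, O_DIRECT in flight - forces wp_page_copy() instead.
 * page_mapcount() deliberately plays no part in the decision.
 */
static bool can_reuse_for_write_fault(struct page *page)
{
	if (!trylock_page(page))
		return false;			/* somebody else is busy with the page */

	if (PageKsm(page) || page_count(page) != 1) {
		unlock_page(page);
		return false;			/* extra references: must copy */
	}

	unlock_page(page);
	return true;				/* sole owner: wp_page_reuse() */
}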
On Fri, Jan 08, 2021 at 02:19:45PM -0400, Jason Gunthorpe wrote: > On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote: > > > The majority cannot be converted to notifiers because they are DMA > > > based. Every one of those is an ABI for something, and does not expect > > > extra privilege to function. It would be a major breaking change to > > > have pin_user_pages require some cap. > > > > ... what makes them safe is to be transient GUP pin and not long > > term. > > > > Please note the "long term" in the underlined line. > > Many of them are long term, though only 50 or so have been marked > specifically with FOLL_LONGTERM. I don't see how we can make such a > major ABI break. io_uring is one of those indeed and I already flagged it. This isn't a black and white issue, kernel memory is also pinned but it's not in movable pageblocks... How do you tell the VM in GUP to migrate memory to a non movable pageblock before pinning it? Because that's what it should do to create less breakage. For example iommu obviously need to be privileged, if your argument that it's enough to use the right API to take long term pins unconstrained, that's not the case. Pins are pins and prevent moving or freeing the memory, their effect is the same and again worse than mlock on many levels. (then I know on preempt-rt should behave like a pin, and that's fine, you disable all features for such purpose there) io_uring is fine in comparison to vmpslice but still not perfect, because it does the RLIMIT_MEMLOCK accounting but unfortunately, is tangibly unreliable since a pin can cost 2m or 1G (now 1G is basically privileged so it doesn't hurt to get the accounting wrong in such case, but it's still technically mixing counting apples as oranges). Maybe io_uring could keep not doing mmu notifier, I'd need more investigation to be sure, but what's the point of keeping it VM-breaking when it doesn't need to? Is io_uring required to setup the ring at high frequency? > Looking at it, vmsplice() is simply wrong. A long term page pin must > use pin_user_pages(), and either FOLL_LONGTERM|FOLL_WRITE (write mode) > FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE (read mode) > > ie it must COW and it must reject cases that are not longterm safe, > like DAX and CMA and so on. > > These are the well established rules, vmsplice does not get a pass Where are the established rules written down? pin_user_pages.rst doesn't even make a mention of FOLL_FORCE or FOLL_WRITE at all, mm.h same thing. In any case, the extra flags required in FOLL_LONGTERM should be implied by FOLL_LONGTERM itself, once it enters the gup code, because it's not cool having to FOLL_WRITE in all drivers for a GUP(write=0), let alone having to specify FOLL_FORCE for just a read. But this is going offtopic. > simply because it is using the CPU to memory copy as its "DMA". vmsplice can't find all put_pages that may release the pages when the pipe is read, or it'd be at least be able to do the unreliable RLIMIT_MEMLOCK accounting. I'm glad we agree vmsplice is unsafe. The PR for the seccomp filter is open so if you don't mind, I'll link your review as confirmation. > > speaking in practice. io_uring has similar concern but it can use mmu > > notifier, so it can totally fix it and be 100% safe from this. > > IIRC io_uring does use FOLL_LONGTERM and FOLL_WRITE.. Right it's one of those 50. FOLL_WRITE won't magically allow the memory to be swapped or migrated. 
To make another example, a single unprivileged pin on the movable zone can break memhotunplug unless you use the mmu notifier. Every other advanced feature falls apart.

So again, if an unprivileged syscall allows a very limited number of pages, maybe it also checks if it got a THP or a gigapage from the pin, it sets its own limit, maybe again it's not a big concern.

vmsplice currently with zero privilege allows this:

 2  0 1074432 9589344  13548 1321860    4    0     4   172 2052  9997  5  2 93  0  0
	-> vmsplice reproducer started here
 1  0 1074432 8538184  13548 1325820    0    0     0     0 1973  8838  4  3 93  0  0
 1  0 1074432 8538436  13548 1325524    0    0     0     0 1730  8168  4  2 94  0  0
 1  0 1074432 8539096  13556 1321880    0    0     0    72 1811  8640  3  2 95  0  0
 0  0 1074432 8539348  13564 1322028    0    0     0    36 1936  8684  4  2 95  0  0
	-> vmsplice killed here
 1  0 1074432 9586720  13564 1322248    0    0     0     0 1893  8514  4  2 94  0  0

That's ~1G that goes away for each task, and I didn't even check if it's all THP pages getting in there; the rss is 3MB despite 1G being taken down in GUP pins with zero privilege:

 1512 pts/25   S      0:00      0     0 133147  3044  0.0 ./vmsplice

Again memcg is robust so it's not a concern for the host, the attack remains contained in the per-memcg OOM killer. It'd only reach the host OOM killer logic if the host itself does the accounting wrong and runs out of memory, which can be enforced not to happen.

> > The scheduler disclosure date was 2020-08-25 so I can freely explain
> > the case that motivated all these changes.
> >
> > case A)
> >
> > 	if !fork() {
> > 		// in child
> > 		mmap one page
> > 		vmsplice takes gup pin long term on such page
> > 		munmap one page
> > 		// mapcount == 1 (parent mm)
> > 		// page_count == 2 (gup in child, and parent mm)
> > 	} else {
> > 		parent writes to the page
> > 		// mapcount == 1, wp_page_reuse
> > 	}
> >
> > parent did a COW with mapcount == 1 so the parent will take over a
> > page that is still GUP pinned in the child.
>
> Sorry, I missed something, how does mmaping a fresh new page in the
> child impact the parent?

Apologies... of course the "mmap" line had to be moved before fork.

> I guess the issue is not to mmap but to GUP a shared page in a way
> that doesn't trigger COW during GUP and then munmap that page so a
> future parent COW does re-use, leaking access.

Right. Jann reported that the writes of the parent are then readable by reading the pipe, even a year later.

> It seems enforcing FOLL_WRITE to always COW on GUP closes this, right?

Exactly, it was supposed to do that. And I don't mean in the caller with FOLL_WRITE/write=1 explicitly set in vmsplice, I mean with 17839856fd588f4ab6b789f482ed3ffd7c403e1f which looked great to me as a solution for it.

> This is what all correct FOLL_LONGTERM users do today, it is required
> for many other reasons beyond this interesting security issue.

Exactly. Except this also applies to O_DIRECT, not just FOLL_LONGTERM, in theory. And only in theory. Any transient GUP pin, no matter which fancy API you use to take it, is enough to open the window for the above attack, not just FOLL_LONGTERM. However only unprivileged long term GUP pins can make this race reproducible.

So this has to be fixed in the GUP core too, as it was supposed to be fixed for a while reliably (and it's not fixed anymore on current upstream if taking the GUP pin on a THP).
For those with the reproducer for the bug fixed in 17839856fd588f4ab6b789f482ed3ffd7c403e1f, here's the patch to apply to reproduce it once again on v5.11:

--- vmsplice.c	2020-05-28 03:03:26.760303487 -0400
+++ vmsplice-v5.11.c	2021-01-08 17:28:37.028747370 -0500
@@ -24 +24 @@
- struct iovec iov = {.iov_base = data, .iov_len = 0x1000 };
+ struct iovec iov = {.iov_base = data, .iov_len = 2*1024*1024 };
@@ -26 +26 @@
- SYSCHK(munmap(data, 0x1000));
+ SYSCHK(munmap(data, 2*1024*1024));
@@ -28,2 +28,2 @@
- char buf[0x1000];
- SYSCHK(read(pipe_fds[0], buf, 0x1000));
+ char buf[2*1024*1024];
+ SYSCHK(read(pipe_fds[0], buf, 2*1024*1024));
@@ -34 +34 @@
- if (posix_memalign(&data, 0x1000, 0x1000))
+ if (posix_memalign(&data, 2*1024*1024, 2*1024*1024))
@@ -35,0 +36,2 @@
+ if (madvise(data, 2*1024*1024, MADV_HUGEPAGE))
+   errx(1, "madvise()");

$ /tmp/x
read string from child: THIS IS SECRET

I exploited it just to be sure I didn't miss something in the source review of the THP code. So I hope after all this discussion I could at least provide one single piece of useful information, if nothing else.

> > However, you know full well in the second case it is a feature and not
> > a bug, that wp_page_reuse is called instead, and in fact it has to be
> > called or it's a bug (and that's the bug page_count in do_wp_page
> > introduces).
>
> What I was trying to explain below is that I think we agreed that a page
> under active FOLL_LONGTERM pin *can not* be write protected.
>
> Establishing the FOLL_LONGTERM pin (for read or write) must *always*
> break the write protection, and the VM *cannot* later establish a new
> write protection on that page while the pin is active.
>
> Indeed, it is complete nonsense to try and write protect a page that
> has active DMA write activity! Changing the CPU page protection bits
> will not stop any DMA! Doing so will inevitably become a security
> problem with an attack similar to what you described.
>
> So this is what was done during fork() - fork will no longer write
> protect pages under FOLL_LONGTERM to make them COWable, instead it
> will copy them at fork time.
>
> Any other place doing write protect must also follow these same
> rules.
>
> I wasn't aware this could be used to create a security problem, but it
> does make sense. Write protect really must mean writes to the memory
> must stop, and that is fundamentally incompatible with active DMA.
>
> Thus write protect of pages under DMA must be forbidden, as a matter
> of security.

You're thinking of your use case only. Thinking a long term GUP pin is read-write DMA is very reductive.

There doesn't need to be DMA at all.

KVM and a shadow MMU can attach to the RAM in readonly totally fine. And if it writes, it'll write not through the PCI bus, but still with CPU access.

In fact Peter did an awesome work by writing the dirty ring for the KVM shadow MMU, and some vmx also provides page modification logging on some CPUs. So we already have all the dirty tracking that protects the shadow pagetable:

https://kvmforum2020.sched.com/event/eE4R/kvm-dirty-ring-a-new-approach-to-logging-peter-xu-red-hat

So it's completely normal that you could plug that with clear_refs, wrprotecting the linux pagetable while a KVM mapping exists that absolutely must not go out of sync.

Nothing at all can go wrong, unless wp_copy_page suddenly makes the secondary MMU go out of sync the moment you wrprotect the page with clear_refs.
You don't even need readonly access from DMA for the above to make sense, the above makes perfect sense even with the secondary MMU and primary MMU all writing at the same time and it must not break. Overall a design where the only safety of a secondary MMU from going out of sync comes from the wrprotection not happening looks weak. Ultimately, what do we really gain from all this breakage? Where are the do_wp_page benchmarks comparing 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 against b7333b58f358f38d90d78e00c1ee5dec82df10ad ? Link? Definitely there's no benchmark in the git log justifying this sudden breakage on so many levels that even re-opened the old zygote bug as shown above. Thanks, Andrea
On Fri, Jan 08, 2021 at 10:31:24AM -0800, Andy Lutomirski wrote: > Can we just remove vmsplice() support? We could make it do a normal The single case I've seen vmsplice used so far, that was really cool is localhost live migration of qemu. However despite really cool, it wasn't merged in the end, and I don't recall exactly why. There are even more efficient (but slightly more complex) ways to do that than vmsplice: using MAP_SHARED gigapages or MAP_SHARED tmpfs with THP opted-in in the tmpfs mount, as guest physical memory instead of anon memory and finding a way not having it cleared by kexec, so you can also upgrade the host kernel and not just qemu... is a way more optimal way to PIN and move all pages through the pipe and still having to pay a superfluous copy on destination. My guess why it's not popular, and I may be completely wrong on this since I basically never used vmsplice (other than to proof of concept DoS my phone to verify the long term GUP pin exploit works), is that vmsplice is a more efficient, but not the most efficient option. Exactly like in the live migration in place, it's always more efficient to share a tmpfs THP backed region and have true zero copy, than going through a pipe that still does one copy at the receiving end. It may also be simpler and it's not dependent on F_SETPIPE_SIZE obscure tunings. So in the end it's still too slow for apps that requires maximum performance, and not worth the extra work for those that don't. I love vmsplice conceptually, just I'd rather prefer an luser cannot run it. > copy, thereby getting rid of a fair amount of nastiness and potential > attacks. Even ignoring issues relating to the length of time that the > vmsplice reference is alive, we also have whatever problems could be > caused by a malicious or misguided user vmsplice()ing some memory and > then modifying it. Sorry to ask but I'm curious, what also goes wrong if the user modifies memory under GUP pin from vmsplice? That's not obvious to see. Thanks, Andrea
On Fri, Jan 08, 2021 at 05:43:56PM -0500, Andrea Arcangeli wrote: > On Fri, Jan 08, 2021 at 02:19:45PM -0400, Jason Gunthorpe wrote: > > On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote: > > > > The majority cannot be converted to notifiers because they are DMA > > > > based. Every one of those is an ABI for something, and does not expect > > > > extra privilege to function. It would be a major breaking change to > > > > have pin_user_pages require some cap. > > > > > > ... what makes them safe is to be transient GUP pin and not long > > > term. > > > > > > Please note the "long term" in the underlined line. > > > > Many of them are long term, though only 50 or so have been marked > > specifically with FOLL_LONGTERM. I don't see how we can make such a > > major ABI break. > > io_uring is one of those indeed and I already flagged it. > > This isn't a black and white issue, kernel memory is also pinned but > it's not in movable pageblocks... How do you tell the VM in GUP to > migrate memory to a non movable pageblock before pinning it? Because > that's what it should do to create less breakage. There is already a patch series floating about to do exactly that for FOLL_LONGTERM pins based on the existing code in GUP for CMA migration > For example iommu obviously need to be privileged, if your argument > that it's enough to use the right API to take long term pins > unconstrained, that's not the case. Pins are pins and prevent moving > or freeing the memory, their effect is the same and again worse than > mlock on many levels. The ship sailed on this a decade ago, it is completely infeasible to go back now, it would completely break widely used things like GPU, RDMA and more. > Maybe io_uring could keep not doing mmu notifier, I'd need more > investigation to be sure, but what's the point of keeping it > VM-breaking when it doesn't need to? Is io_uring required to setup the > ring at high frequency? If we want to have a high speed copy_from_user like thing that is not based on page pins but on mmu notifiers, then we should make that infrastructure and the various places that need it should use common code. At least vhost and io_uring are good candidates. Otherwise, we are pretending that they are DMA and using the DMA centric pin_user_pages() interface, which we still have to support and make work. > In any case, the extra flags required in FOLL_LONGTERM should be > implied by FOLL_LONGTERM itself, once it enters the gup code, because > it's not cool having to FOLL_WRITE in all drivers for a GUP(write=0), > let alone having to specify FOLL_FORCE for just a read. But this is > going offtopic. We really should revise this, I've been thinking for a while we need to internalize into gup.c the FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM idiom at least.. > > simply because it is using the CPU to memory copy as its "DMA". > > vmsplice can't find all put_pages that may release the pages when the > pipe is read, or it'd be at least be able to do the unreliable > RLIMIT_MEMLOCK accounting. Yikes! So it can't even use pin_user_pages FOLL_LONGTERM properly because that requires unpin_user_pages(), which means finding all the unpin sites too :\ > I'm glad we agree vmsplice is unsafe. The PR for the seccomp filter is > open so if you don't mind, I'll link your review as confirmation. OK > To make another example a single unprivileged pin on the movable zone, > can break memhotunplug unless you use the mmu notifier. Every other > advanced feature falls apart. 
As above, FOLL_LONGTERM will someday migrate from movable zones.

The fact that people keep adding MM features that are incompatible with FOLL_LONGTERM is troublesome. However, the people who want hot-pluggable DIMMs don't get to veto the people who want RDMA, GPU and so on out of the kernel. (or vice versa)

It seems we will end up with a MM where some workloads will be incompatible with some MM features.

> So again, if an unprivileged syscall allows a very limited number of
> pages, maybe it also checks if it got a THP or a gigapage from the
> pin, it sets its own limit, maybe again it's not a big concern.

We also don't do a good job uniformly tracking rlimit/etc. I'd ideally like to see that in the core code too. Someone once tried that a bit, but we couldn't get agreement on what the right thing was because different drivers do different things. Sigh.

> Any transient GUP pin no matter which fancy API you use to take it, is
> enough to open the window for the above attack, not just FOLL_LONGTERM.

Yes, that is interesting. We've always known that the FOLL_LONGTERM special machinery is technically needed for O_DIRECT and basically all other cases for coherence, but till now I hadn't heard of a security argument. It does make sense :(

> For those with the reproducer for the bug fixed in
> 17839856fd588f4ab6b789f482ed3ffd7c403e1f here's the patch to apply to
> reproduce it once again on v5.11:

So this is still at least because vmsplice is buggy to use plain get_user_pages() for its long term usage, and buggy to not use the FOLL_FORCE|FOLL_WRITE idiom for read :\

A small patch to make vmsplice set those flags on its gup would at least robustly close this immediate security problem, without whatever side effects caused the revert of the commit forcing that in GUP itself.

> You're thinking of your use case only.

I'm thinking about the rules to make pin_user_pages(FOLL_LONGTERM) sane and working, yes. It is an API we have that is used widely, and really needs a solid definition. This idea that we can just throw it out completely is a no-go to me.

There are other similar APIs, like normal GUP, hmm_range_fault, and so on, but these are different things, with different rules.

> Thinking a long term GUP pin is read-write DMA is very reductive.
>
> There doesn't need to be DMA at all.
>
> KVM and a shadow MMU can attach to the RAM in readonly totally
> fine. And if it writes, it'll write not through the PCI bus, but
> still with CPU access.

That is not gup FOLL_LONGTERM, that is mmu notifiers..

mmu notifier users who are using hmm_range_fault() do not ever take any page references when doing their work; that seems like the right approach for a shadow mmu?

> Nothing at all can go wrong, unless wp_copy_page suddenly makes the
> secondary MMU go out of sync the moment you wrprotect the page with
> clear_refs.

To be honest, I've read most of this discussion, and the prior one, between you and Linus carefully, but I still don't understand what clear_refs is about or how KVM's use of mmu notifiers got broken. This is probably because I'm only a little familiar with those areas :\

Is it actually broken or just inefficient? If wp_copy_page is happening more often than it should, the secondary mmu should still fully track that?

> Overall a design where the only safety of a secondary MMU from going
> out of sync comes from the wrprotection not happening looks weak.

To be clear, here I am only talking about pin_user_pages.
We now have logic to tell if a page is under pin_user_pages(FOLL_LONGTERM) or not, and that is what is driving the copy on fork logic. secondary-mmu drivers using mmu notifier should not trigger this logic and should not restrict write protect. > Ultimately, what do we really gain from all this breakage? Well, the clean definition of pin_user_pages(FOLL_LONGTERM) is very positive for DMA drivers working in that area. Jason
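The "copy on fork" logic referred to above can be sketched roughly as follows; this is a simplification in the spirit of the v5.10-era copy_present_page()/mm->has_pinned work, and the helper name is illustrative:

/*
 * Sketch: while duplicating a private anon pte at fork(), a page that
 * may be under a FOLL_LONGTERM pin is copied for the child immediately
 * instead of being wrprotected for COW in both mms, so the pinned page
 * is never silently write protected behind the pinner's back.
 */
static bool fork_must_copy_now(struct mm_struct *src_mm,
			       struct vm_area_struct *src_vma,
			       struct page *page)
{
	if (src_vma->vm_flags & VM_SHARED)
		return false;			/* shared mappings are not COWed anyway */
	if (!atomic_read(&src_mm->has_pinned))
		return false;			/* this mm never pinned anything */
	return page_maybe_dma_pinned(page);	/* maybe pinned: copy, don't wrprotect */
}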
Hello Jason,

On Fri, Jan 08, 2021 at 08:42:55PM -0400, Jason Gunthorpe wrote:
> There is already a patch series floating about to do exactly that for
> FOLL_LONGTERM pins based on the existing code in GUP for CMA migration

Sounds great.

> The ship sailed on this a decade ago, it is completely infeasible to
> go back now, it would completely break widely used things like GPU,
> RDMA and more.

For all those that aren't using mmu notifier and that rely solely on page pins, they still require privilege, except they do so through /dev/ permissions.

Just the fact there's no capability check in the read/write/ioctl doesn't mean those device inodes can be opened by any luser: the fact the kernel allows it doesn't mean the /dev/ permission does too. The same applies to /dev/kvm too, not just PCI device drivers.

Device drivers that you need to open in /dev/ before you can take a GUP pin require whole different checks than syscalls like vmsplice and io_uring that are universally available.

The very same GUP long term pinning kernel code can be perfectly safe to use without any permission check for a device driver of an iommu in /dev/, but completely unsafe for a syscall.

> If we want to have a high speed copy_from_user like thing that is not
> based on page pins but on mmu notifiers, then we should make that
> infrastructure and the various places that need it should use common
> code. At least vhost and io_uring are good candidates.

Actually the mmu notifier doesn't strictly require pins, it only requires GUP. All users tend to use FOLL_GET just as a safety precaution (I already tried to optimize away the two atomics per GUP, but we were NAKed by the KVM maintainer who didn't want to take the risk; I would have, but it's a fair point indeed, obviously it's safer with the pin plus the mmu notifier, two is safer than one).

I'm not sure how any copy-user could obviate a secondary MMU mapping; mappings and copies are mutually exclusive. Any copy would be breaking memory coherency in this environment.

> Otherwise, we are pretending that they are DMA and using the DMA
> centric pin_user_pages() interface, which we still have to support and
> make work.

vhost and io_uring would be pure software constructs, but there are hardware users of the GUP pin that don't use any DMA.

The long term GUP pin is not only about PCI devices doing DMA. KVM is not ever using any DMA, despite taking terabytes worth of very long term GUP pins.

> > In any case, the extra flags required in FOLL_LONGTERM should be
> > implied by FOLL_LONGTERM itself, once it enters the gup code, because
> > it's not cool having to FOLL_WRITE in all drivers for a GUP(write=0),
> > let alone having to specify FOLL_FORCE for just a read. But this is
> > going offtopic.
>
> We really should revise this, I've been thinking for a while we need
> to internalize into gup.c the FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM
> idiom at least..

100% agreed.

> > > simply because it is using the CPU to memory copy as its "DMA".
> >
> > vmsplice can't find all put_pages that may release the pages when the
> > pipe is read, or it'd at least be able to do the unreliable
> > RLIMIT_MEMLOCK accounting.
>
> Yikes! So it can't even use pin_user_pages FOLL_LONGTERM properly
> because that requires unpin_user_pages(), which means finding all the
> unpin sites too :\

Exactly.

> > To make another example a single unprivileged pin on the movable zone,
> > can break memhotunplug unless you use the mmu notifier. Every other
> > advanced feature falls apart.
> > As above FOLL_LONGTERM will someday migrate from movable zones. Something like: 1) migrate from movable zones contextually to GUP 2) be accounted on the compound_order not on the number of GUP (io_uring needs fixing here) 3) maybe account not only in rlimit, but also expose the total worth of GUP pins in page_order units (not pins) to the OOM killer to be added to the rss (will double count though). Maybe 3 is overkill but without it, OOM killer won't even see those GUP pin coming, so if not done it's still kind of unsafe, if done it'll risk double count. Even then a GUP pin, still prevents optimization, it can't converge in the right NUMA node the io ring just to make an example, but that's a secondary performance concern. The primary concern with the mmu notifier in io_uring is the take_all_locks latency. Longlived apps like qemu would be fine with mmu notifier. The main question is also if there's any short-lived latency io_uring usage... that wouldn't fly with take_all_locks. The problem with the mmu notifier as an universal solution, for example is that it can't wait for I/O completion of O_DIRECT since it has no clue where the put_page is to wait for it, otherwise we could avoid even the FOLL_GET for O_DIRECT and guarantee the I/O has to be completed before paging or anything can unmap the page under I/O from the pagetable. Even if we could reliably identify all the put_page of transient pins reliably, it would need to be always on. Currently we go the extra mile to require zero exclusive cachelines when it's unregistered and that makes the registering a latency outlier. > The fact that people keep adding MM features that are incompatible > with FOLL_LONGTERM is troublesome. Ehm in my view it's actually FOLL_LONGTERM without ability to use the mmu notifier that is troublesome :). It's funny how we look at the two opposite sides of the same coin. I'm sure there will be devices doing that will for a very long time, but they don't need to be perfect, the current handling is satisfactory, and we can do a best effort to improve things are described above but it's not critical. > However, the people who want hot-pluggable DIMMS don't get to veto the > people who want RDMA, GPU and so on out of the kernel. (or vice versa) > > It seems we will end up with a MM where some work loads will be > incompatible with some MM features. I see the incompatibility you describe as problem we have today, in the present, and that will fade with time. Reminds me when we had >4G of RAM and 32bit devices doing DMA. How many 32bit devices are there now? We're not talking here about any random PCI device, we're talking here about very special and very advanced devices that need to have "long term" GUP pins in order to operate, not the usual nvme/gigabit device where GUP pins are never long term. > We also don't do a good job uniformly tracking rmlimit/etc. I'd > ideally like to see that in the core code too. Someone once tried that > a bit but we couldn't ge agreement what the right thing was because > different drivers do different things. Sigh. Consolidating would be great I agree. > > Any transient GUP pin no matter which fancy API you use to take it, is > > enough to open the window for the above attack, not just FOLL_LONGERM. > > Yes, that is interesting. We've always known that the FOLL_LONGTERM > special machinery is techincally needed for O_DIRECT and basically all > other cases for coherence, but till now I hand't heard of a security > argument. 
It does make sense :(

The security argument is really specific to the case described, and ideally whatever fix we do to close all windows would cover all of O_DIRECT too.

> > For those with the reproducer for the bug fixed in 17839856fd588f4ab6b789f482ed3ffd7c403e1f here's the patch to apply to reproduce it on v5.11 once again:
>
> So this is still at least because vmsplice is buggy to use plain get_user_pages() for its long term usage, and buggy to not use the FOLL_FORCE|FOLL_WRITE idiom for read :\
>
> A small patch to make vmsplice set those flags on its gup would at least robustly close this immediate security problem without whatever side effects caused the revert of the commit forcing that in GUP itself.

Exactly: if we fix vmsplice, we close the biggest window, and what remains is so small it shouldn't be practical. We still have to close all the windows eventually though.

> > You're thinking of your use case only.
>
> I'm thinking about the rules to make pin_user_pages(FOLL_LONGTERM) sane and working, yes. It is an API we have that is used widely, and really needs a solid definition. This idea we can just throw it out completely is a no-go to me.
>
> There are other similar APIs, like normal GUP, hmm_range_fault, and so

hmm depends on the mmu notifier, so there's no VM interference there.

> on, but these are different things, with different rules.

I'm not suggesting we throw out anything. It's like when you had a 32bit device: you did bounce buffers. If you have a CPU without an MMU, you have to deal with MMU=n. How many Linux VM features can you use with MMU=n? Is mlock accounting even required with Linux built with MMU=n? (I'd be shocked if that can actually build, but still.) You have to live with the limitations the hardware delivers.

vmsplice and io_uring have no such limitation and zero hardware constraints, so they have not a single valid justification, unlike device drivers; in addition, their access cannot be controlled through /dev/ permissions as happens regularly for all device drivers.

> > Thinking long term GUP pin is read-write DMA is very reductive.
> >
> > There doesn't need to be DMA at all.
> >
> > KVM and a shadow MMU can attach to the RAM in readonly totally fine. And if it writes, it'll write not through the PCI bus, still with the CPU access.
>
> That is not gup FOLL_LONGTERM, that is mmu notifiers..

Correct, although KVM initially used the equivalent of FOLL_LONGTERM back then. Then KVM became the primary mmu notifier user of course. The only difference between FOLL_LONGTERM and the mmu notifier is whether the hardware is capable of handling it. There is no real difference other than that.

> mmu notifier users who are using hmm_range_fault() do not ever take any page references when doing their work, that seems like the right approach, for a shadow mmu?

They can all do it like HMM, or you can take the FOLL_GET as long as you remember the put_page. Jerome also intended to optimize the KVM fault like that, but as said above, we were NAKed on that attempt. Whether the pin is there or not makes zero semantic difference: it's purely an optimization when there is no pin, and a bugcheck safety feature when there is. By the time it could make a runtime difference whether the pin is there or not, put_page has been called already.

> > Nothing at all can go wrong, unless wp_copy_page suddenly makes the secondary MMU go out of sync the moment you wrprotect the page with clear_refs.
>
> To be honest, I've read most of this discussion, and the prior one, between you and Linus carefully, but I still don't understand what clear_refs is about or how KVM's use of mmu notifiers got broken. This is probably because I'm only a little familiar with those areas :\

KVM's use of the mmu notifier is not related to this.

clear_refs can simply wrprotect the pages of any process. echo .. >/proc/self/clear_refs, then you check /proc/self/pagemap looking for soft dirty (or something like that).

The point is that if you do echo ... >/proc/self/clear_refs on your pid, and that pid has any FOLL_LONGTERM on its mm, it'll just cause your device driver to go out of sync with the mm. It'll see the old pages, from before the spurious COWs. The CPU will use new pages (the spurious COWs).

> Is it actually broken or just inefficient? If wp_copy_page is going more often than it should the secondary mmu should still fully track that?

It's about the DMA going out of sync and losing its view of the mm.

In addition the TLB flush broke with the mmu_read_lock, but that can be fixed somehow. The TLB flush, still only because of the spurious COWs, now has to cope with the fact that there can be a spurious wp_page_copy right after wrprotecting a read-write page. Before, that couldn't happen: fork() couldn't run since it takes the mmap_write_lock, so if the pte was writable and transitioned to non-writable it meant it was an exclusive page and it was guaranteed to be re-used, so the stale TLB would keep writing in place.

The stale TLB is the exact equivalent of your FOLL_LONGTERM, except it's the window the CPU has on the old page, while FOLL_LONGTERM is the window the PCI device has on the old page. The spurious COW is what makes the TLB and the PCI device go out of sync, reading and writing to the old page while the CPU moved on to a new page. The issue is really similar.

> To be clear, here I am only talking about pin_user_pages. We now have logic to tell if a page is under pin_user_pages(FOLL_LONGTERM) or not, and that is what is driving the copy on fork logic.

fork() wrprotects and, like every other wrprotect, was just falling into the above scenario.

> secondary-mmu drivers using mmu notifier should not trigger this logic and should not restrict write protect.

That's a great point. I didn't consider that the mmu notifier will invalidate the secondary MMU and ultimately issue a new GUP after the wp_copy_page to keep it in sync.

The funny thing that doesn't make sense is that wp_copy_page will only be invoked because of the pin KVM left on the page for that extra safety I was talking about earlier. Are we forced to drop all the page pins to be able to wrprotect the memory without being flooded by immediate COWs?

So the ultimate breaking point is FOLL_LONGTERM with no mmu notifier going out of sync on a wrprotect, which can happen if the device is doing read-only access long term.

I quote you from earlier: "A long term page pin must use pin_user_pages(), and either FOLL_LONGTERM|FOLL_WRITE (write mode) or FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE (read mode)".

You clearly contemplate the existence of a read mode, long term. That is also completely compatible with wrprotection. Why should we pick a model that forbids this from working? What do we get back from it? I only see unnecessary risk and inefficiency coming back from it.

> > Ultimately, what do we really gain from all this breakage?
>
> Well, the clean definition of pin_user_pages(FOLL_LONGTERM) is very positive for DMA drivers working in that area.
I was referring to page_count in do_wp_page, not pin_user_pages, sorry for the confusion.

Thanks,
Andrea
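For readers unfamiliar with the clear_refs/pagemap interface Andrea describes above, here is a minimal userspace sketch of the sequence (error handling mostly omitted). Writing "4" to clear_refs is the documented soft-dirty clear, which is what write-protects every PTE of the process; the soft-dirty state is then visible as bit 55 of each /proc/self/pagemap entry. The 4K page size is an assumption; a real program would use sysconf(_SC_PAGESIZE).

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PAGE_SIZE_ASSUMED 4096UL        /* assumption: 4K pages */
#define PM_SOFT_DIRTY (1ULL << 55)      /* pagemap soft-dirty bit */

/* Clear the soft-dirty bits: this is the operation that wrprotects the PTEs */
static void clear_soft_dirty(void)
{
	int fd = open("/proc/self/clear_refs", O_WRONLY);

	if (fd >= 0) {
		write(fd, "4", 1);
		close(fd);
	}
}

/* Read the pagemap entry for addr and test its soft-dirty bit */
static int is_soft_dirty(void *addr)
{
	uint64_t ent = 0;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return -1;
	pread(fd, &ent, sizeof(ent),
	      ((uintptr_t)addr / PAGE_SIZE_ASSUMED) * sizeof(ent));
	close(fd);
	return !!(ent & PM_SOFT_DIRTY);
}

The attack scenario discussed in the thread is exactly this wrprotect happening while some other party still holds a long term GUP pin on the same pages.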
On Fri, 8 Jan 2021 14:19:45 -0400 Jason Gunthorpe wrote: > > What I was trying to explain below, is I think we agreed that a page > under active FOLL_LONGTERM pin *can not* be write protected. > > Establishing the FOLL_LONGTERM pin (for read or write) must *always* > break the write protection and the VM *cannot* later establish a new > write protection on that page while the pin is active. > > Indeed, it is complete nonsense to try and write protect a page that > has active DMA write activity! Changing the CPU page protection bits > will not stop any DMA! Doing so will inevitably become a security > problem with an attack similar to what you described. > > So this is what was done during fork() - fork will no longer write > protect pages under FOLL_LONGTERM to make them COWable, instead it > will copy them at fork time. Is it, in a step forward, unlikely for DMA write activity to happen during page copy at fork? > > Any other place doing write protect must also follow these same > rules. > > I wasn't aware this could be used to create a security problem, but it > does make sense. write protect really must mean writes to the memory > must stop and that is fundementally incompatible with active DMA. > > Thus write protect of pages under DMA must be forbidden, as a matter > of security. > > Jason
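For reference, the fork()-time decision described in the quoted text looks roughly like the sketch below, loosely modeled on what mm/memory.c does around copy_present_pte() in kernels of this era. The helper name is made up, the checks are simplified, and the has_pinned hint is era-specific (it was later replaced by an mm flag); this is an illustration, not the actual kernel code.

#include <linux/mm.h>

static bool fork_needs_early_copy(struct mm_struct *src_mm,
				  struct vm_area_struct *src_vma,
				  struct page *page)
{
	/* Only private COW mappings are write protected at fork */
	if (!is_cow_mapping(src_vma->vm_flags))
		return false;

	/* Cheap hint: has this mm ever taken a FOLL_PIN at all? */
	if (!atomic_read(&src_mm->has_pinned))
		return false;

	/* Refcount based heuristic: the page may be under a DMA pin */
	if (!page_maybe_dma_pinned(page))
		return false;

	/* Pinned: copy the page for the child now instead of wrprotecting it */
	return true;
}

Because the check is a heuristic on the refcount, it can copy more than strictly necessary, but it never leaves a pinned page write protected, which is the property Jason argues for above.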
> On Jan 8, 2021, at 3:34 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > On Fri, Jan 08, 2021 at 10:31:24AM -0800, Andy Lutomirski wrote: >> Can we just remove vmsplice() support? We could make it do a normal > >> copy, thereby getting rid of a fair amount of nastiness and potential >> attacks. Even ignoring issues relating to the length of time that the >> vmsplice reference is alive, we also have whatever problems could be >> caused by a malicious or misguided user vmsplice()ing some memory and >> then modifying it. > > Sorry to ask but I'm curious, what also goes wrong if the user > modifies memory under GUP pin from vmsplice? That's not obvious to > see. It breaks the otherwise true rule that the data in pipe buffers is immutable. Even just quoting the manpage: SPLICE_F_GIFT The user pages are a gift to the kernel. The application may not modify this memory ever, otherwise the page cache and on- disk data may differ. That's no good. I can also imagine use cases in which modified vmsplice() pages that end up in various parts of the network stack could be problematic. For example, if you can arrange for TCP or, worse, TLS to transmit data and then retransmit modified data, you might get odd results. In the latter case, some security properties of TLS might be broken. --Andy
On Sat, Jan 9, 2021 at 11:03 AM Andy Lutomirski <luto@kernel.org> wrote: > > > > > Sorry to ask but I'm curious, what also goes wrong if the user > > modifies memory under GUP pin from vmsplice? That's not obvious to > > see. > > It breaks the otherwise true rule that the data in pipe buffers is > immutable. Note that this continued harping on vmsplice() is entirely misguided. Anything using GUP has the same issues. This really has nothing to do with vmsplice() per se. In many ways, vmsplice() might be the least of your issues, because it's fairly easy to just limit that for untrusted use. And no, that does not mean "we should make vmsplice root-only" kind of limiting. There are no security issues in any normal situation. Again, it's mainly about things that don't trust each other _despite_ running in similar contexts as far as the kernel is concerned. IOW, exactly that "zygote" kind of situation. If you are a JIT (whether Zygote or a web browser), you basically need to limit the things the untrusted JIT'ed code can do. And that limiting may include vmsplice(). But note the "include" part of "include vmsplice()". Any other GUP user really does have the same issues, it may just be less obvious and have very different timings (or depend on access to devices etc). Absolutely nothing cares about "data in pipe buffers changing" in any other case. You can already write any data you want to a pipe, it doesn't matter if it changes after the write or not. (In many ways, "data in the page cache" is a *much* more difficult issue for the kernel, and it's fundamental to any shared mmap. It's much more difficult because that data is obviously very much also accessible for DMA etc for writeout, and if you have something like "checksums are calculated separately and non-atomically from the actual DMA accesses", you will potentially get checksum errors where the actual disk contents don't match your separately calculated checksums until the _next_ write. This can actually be a feature - seeing "further modifications were concurrent to the write" - but most people end up considering it a bug). Linus
On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote: > Side note, and not really related to UFFD, but the mmap_sem in > general: I was at one point actually hoping that we could make the > mmap_sem a spinlock, or at least make the rule be that we never do any > IO under it. At which point a write lock hopefully really shouldn't be > such a huge deal. There's a (small) group of us working towards that. It has some prerequisites, but where we're hoping to go currently: - Replace the vma rbtree with a b-tree protected with a spinlock - Page faults walk the b-tree under RCU, like peterz/laurent's SPF patchset - If we need to do I/O, take a refcount on the VMA After that, we can gradually move things out from mmap_sem protection to just the vma tree spinlock, or whatever makes sense for them. In a very real way the mmap_sem is the MM layer's BKL.
On Sat, Jan 9, 2021 at 11:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote: > > Side note, and not really related to UFFD, but the mmap_sem in > > general: I was at one point actually hoping that we could make the > > mmap_sem a spinlock, or at least make the rule be that we never do any > > IO under it. At which point a write lock hopefully really shouldn't be > > such a huge deal. > > There's a (small) group of us working towards that. It has some > prerequisites, but where we're hoping to go currently: > > - Replace the vma rbtree with a b-tree protected with a spinlock > - Page faults walk the b-tree under RCU, like peterz/laurent's SPF patchset > - If we need to do I/O, take a refcount on the VMA > > After that, we can gradually move things out from mmap_sem protection > to just the vma tree spinlock, or whatever makes sense for them. In a > very real way the mmap_sem is the MM layer's BKL. Well, we could do the "no IO" part first, and keep the semaphore part. Some people actually prefer a semaphore to a spinlock, because it doesn't end up causing preemption issues. As long as you don't do IO (or memory allocations) under a semaphore (ok, in this case it's a rwsem, same difference), it might even be preferable to keep it as a semaphore rather than as a spinlock. So it doesn't necessarily have to go all the way - we _could_ just try something like "when taking the mmap_sem, set a thread flag" and then have a "warn if doing allocations or IO under that flag". And since this is about performance, not some hard requirement, it might not even matter if we catch all cases. If we fix it so that any regular load on most normal filesystems never see the warning, we'd already be golden. Of course, I think we've had issues with rw_sems for _other_ reasons. Waiman actually removed the reader optimistic spinning because it caused bad interactions with mixed reader-writer loads. So rwsemapores may end up not working as well as spinlocks if the common situation is "just wait a bit, you'll get it". Linus
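A hypothetical sketch of the debug aid Linus floats above: remember in the task that mmap_lock is held, and warn if the block layer (or a reclaim-capable allocation) is reached with it held. The PF_HOLDING_MMAP_LOCK flag, its numeric value, and the hook placement are all invented for illustration; nothing like this is in the tree, and a real version would need a free PF_ bit plus handling of nesting and of locks on other mm's.

#include <linux/sched.h>
#include <linux/mmap_lock.h>

#define PF_HOLDING_MMAP_LOCK	0x10000000	/* placeholder value, collides with real PF_ bits */

static inline void mmap_read_lock_traced(struct mm_struct *mm)
{
	mmap_read_lock(mm);
	current->flags |= PF_HOLDING_MMAP_LOCK;
}

static inline void mmap_read_unlock_traced(struct mm_struct *mm)
{
	current->flags &= ~PF_HOLDING_MMAP_LOCK;
	mmap_read_unlock(mm);
}

/* Would be called from e.g. submit_bio() or the allocator slow path */
static inline void warn_on_io_under_mmap_lock(void)
{
	WARN_ON_ONCE(current->flags & PF_HOLDING_MMAP_LOCK);
}

As Linus notes, such a check does not have to catch every case to be useful: keeping the common filesystem paths warning-free would already remove most of the long mmap_lock hold times.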
On Fri, Jan 08, 2021 at 09:50:08PM -0500, Andrea Arcangeli wrote: > For all those that aren't using mmu notifier and that rely solely on > page pins, they still require privilege, except they do through /dev/ > permissions. It is normal that the dev nodes are a+rw so it doesn't really require privilege in any real sense. > Actually the mmu notifier doesn't strictly require pins, it only > requires GUP. All users tend to use FOLL_GET just as a safety > precaution (I already tried to optimize away the two atomics per GUP, > but we were naked by the KVM maintainer that didn't want to take the > risk, I would have, but it's a fair point indeed, obviously it's safer > with the pin plus the mmu notifier, two is safer than one). I'm not sure what holding the pin will do to reduce risk? If you get into a situation where you are stuffing a page into the SMMU that is not in the CPU's MMU then everything is lost. Holding a pin while carrying a page from the CPU page table to the SMMU just ensures that page isn't freed until it is installed, but once installed you are back to being broken. > I'm not sure how any copy-user could obviate a secondary MMU mapping, > mappings and copies are mutually exclusive. Any copy would be breaking > memory coherency in this environment. Because most places need to copy from user to stable kernel memory before processing data under user control. You can't just cast a user controlled pointer to a kstruct and use it - that is very likely a security bug. Still, the general version is something like kmap: map = user_map_setup(user_ptr, length) kptr = user_map_enter(map) [use kptr] user_map_leave(map, kptr) And inside it could use mmu notifiers, or gup, or whatever. user_map_setup() would register the notifier and user_map_enter() would validate the cache'd page pointer and block cached invalidation until user_map_leave(). > The primary concern with the mmu notifier in io_uring is the > take_all_locks latency. Just enabling mmu_notifier takes a performance hit on the entire process too, it is not such a simple decision.. We'd need benchmarks against a database or scientific application to see how negative the notifier actually becomes. > The problem with the mmu notifier as an universal solution, for > example is that it can't wait for I/O completion of O_DIRECT since it > has no clue where the put_page is to wait for it, otherwise we could > avoid even the FOLL_GET for O_DIRECT and guarantee the I/O has to be > completed before paging or anything can unmap the page under I/O from > the pagetable. GPU is already doing something like this, waiting in a notifier invalidate callback for DMA to finish before allowing invalidate to complete. It is horrendously complicated and I'm not sure blocking invalidate for a long time is actually much better for the MM.. > I see the incompatibility you describe as problem we have today, in > the present, and that will fade with time. > > Reminds me when we had >4G of RAM and 32bit devices doing DMA. How > many 32bit devices are there now? I'm not so sure anymore. A few years ago OpenCAPI and PCI PRI seemed like good things, but now with experience they carry pretty bad performance hits to use them. Lots of places are skipping them. CXL offers another chance at this, so we'll see again in another 5 years or so if it works out. It is not any easy problem to solve from a HW perspective. 
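The user_map_{setup,enter,leave}() sketch above could, for example, be backed by the interval notifier API instead of a long term pin. In the rough sketch below, the mmu_interval_* calls are the real API from include/linux/mmu_notifier.h, while everything named user_map_* is hypothetical. A real implementation also needs a driver lock serializing mmu_interval_set_seq() against the read_retry() check; that per-access protection against parallel invalidation is exactly the overhead being debated here.

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/mmu_notifier.h>

struct user_map {
	struct mmu_interval_notifier mni;
};

static bool user_map_invalidate(struct mmu_interval_notifier *mni,
				const struct mmu_notifier_range *range,
				unsigned long cur_seq)
{
	/* Mark any access racing between read_begin and read_retry as stale */
	mmu_interval_set_seq(mni, cur_seq);
	return true;
}

static const struct mmu_interval_notifier_ops user_map_ops = {
	.invalidate = user_map_invalidate,
};

/* setup: register for invalidations on [uaddr, uaddr + len) */
static int user_map_setup(struct user_map *map, unsigned long uaddr,
			  unsigned long len)
{
	return mmu_interval_notifier_insert(&map->mni, current->mm,
					    uaddr, len, &user_map_ops);
}

/* enter: take only a transient reference plus a sequence count */
static struct page *user_map_enter(struct user_map *map, unsigned long uaddr,
				   unsigned long *seq)
{
	struct page *page;

	*seq = mmu_interval_read_begin(&map->mni);
	if (get_user_pages_fast(uaddr & PAGE_MASK, 1, 0, &page) != 1)
		return NULL;
	return page;
}

/* leave: drop the reference; the caller must redo the access if this returns true */
static bool user_map_leave(struct user_map *map, struct page *page,
			   unsigned long seq)
{
	put_page(page);
	return mmu_interval_read_retry(&map->mni, seq);
}

The page reference here is only held across a single access, so the VM stays free to COW, migrate or unmap the page between accesses, at the cost of the retry loop around every access.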
> We're not talking here about any random PCI device, we're talking here about very special and very advanced devices that need to have "long term" GUP pins in order to operate, not the usual nvme/gigabit device where GUP pins are never long term.

Beyond RDMA, netdev's XDP uses FOLL_LONGTERM, so do various video devices, and lots of things related to virtualization like vfio, vdpa and vhost. I think it is a bit defeatist to say it doesn't matter. If anything, as time goes on it seems to be growing, not shrinking.

> The point is that if you do echo ... >/proc/self/clear_refs on your pid, and that pid has any FOLL_LONGTERM on its mm, it'll just cause your device driver to go out of sync with the mm. It'll see the old pages, from before the spurious COWs. The CPU will use new pages (the spurious COWs).

But if you do that then clear_refs isn't going to work the way you thought either - this first needs some explanation of how clear_refs is supposed to work when DMA WRITE is active on the page.

I'd certainly say causing a loss of synchrony is not acceptable, so if we keep Linus's version of COW then clear_refs has to not write protect pages under DMA.

> > secondary-mmu drivers using mmu notifier should not trigger this logic and should not restrict write protect.
>
> That's a great point. I didn't consider that the mmu notifier will invalidate the secondary MMU and ultimately issue a new GUP after the wp_copy_page to keep it in sync.

It had better, or mmu notifiers are broken, right?

> The funny thing that doesn't make sense is that wp_copy_page will only be invoked because of the pin KVM left on the page for that extra safety I was talking about earlier.

Yes, with the COW change, if kvm cares about this inefficiency it should not take the unnecessary pin.

> You clearly contemplate the existence of a read mode, long term. That is also completely compatible with wrprotection.

We talked about a read mode, but we didn't flesh it out. It is not unconditionally compatible with wrprotect - most likely you still can't write protect a page under READ DMA, because when you eventually take the COW there will be ambiguous situations that will break the synchrony.

Jason
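For reference, the pin_user_pages(FOLL_LONGTERM) idiom quoted in this exchange looks roughly like the sketch below as a driver would spell it today. The helper names are made up; the flag combinations are the ones named in the thread, with FOLL_FORCE|FOLL_WRITE used for a long term "read mode" pin so that COW is broken up front.

#include <linux/mm.h>

static int driver_pin_user_buffer(unsigned long uaddr, int npages,
				  bool write, struct page **pages)
{
	unsigned int gup_flags = FOLL_LONGTERM;

	if (write)
		gup_flags |= FOLL_WRITE;		/* write mode */
	else
		gup_flags |= FOLL_FORCE | FOLL_WRITE;	/* read mode */

	return pin_user_pages_fast(uaddr, npages, gup_flags, pages);
}

static void driver_unpin_user_buffer(struct page **pages, int npages)
{
	/* every pin_user_pages*() call must be paired with unpin_user_pages() */
	unpin_user_pages(pages, npages);
}

The requirement to find every unpin site is exactly what makes converting vmsplice to this interface awkward, as discussed earlier in the thread.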
On Sat, Jan 09, 2021 at 11:49:58AM +0800, Hillf Danton wrote: > On Fri, 8 Jan 2021 14:19:45 -0400 Jason Gunthorpe wrote: > > > > What I was trying to explain below, is I think we agreed that a page > > under active FOLL_LONGTERM pin *can not* be write protected. > > > > Establishing the FOLL_LONGTERM pin (for read or write) must *always* > > break the write protection and the VM *cannot* later establish a new > > write protection on that page while the pin is active. > > > > Indeed, it is complete nonsense to try and write protect a page that > > has active DMA write activity! Changing the CPU page protection bits > > will not stop any DMA! Doing so will inevitably become a security > > problem with an attack similar to what you described. > > > > So this is what was done during fork() - fork will no longer write > > protect pages under FOLL_LONGTERM to make them COWable, instead it > > will copy them at fork time. > > Is it, in a step forward, unlikely for DMA write activity to happen > during page copy at fork? I'm not sure it matters, it is not that much different than CPU write activity concurrent to fork(). fork() will capture some point in time - if the application cares that this data is coherent during fork() then it has to deliberately cause coherence somehow. DMA just has fewer options for the application to create the coherency because of data tearing during the page copy. Jason
On Fri, Jan 08, 2021 at 08:42:55PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 08, 2021 at 05:43:56PM -0500, Andrea Arcangeli wrote:
> > On Fri, Jan 08, 2021 at 02:19:45PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote:
> > > > > The majority cannot be converted to notifiers because they are DMA based. Every one of those is an ABI for something, and does not expect extra privilege to function. It would be a major breaking change to have pin_user_pages require some cap.
> > > >
> > > > ... what makes them safe is to be transient GUP pin and not long term.
> > > >
> > > > Please note the "long term" in the underlined line.
> > >
> > > Many of them are long term, though only 50 or so have been marked specifically with FOLL_LONGTERM. I don't see how we can make such a major ABI break.
> >
> > io_uring is one of those indeed and I already flagged it.
> >
> > This isn't a black and white issue, kernel memory is also pinned but it's not in movable pageblocks... How do you tell the VM in GUP to migrate memory to a non movable pageblock before pinning it? Because that's what it should do to create less breakage.
>
> There is already a patch series floating about to do exactly that for FOLL_LONGTERM pins based on the existing code in GUP for CMA migration
>
> > For example iommu obviously need to be privileged, if your argument that it's enough to use the right API to take long term pins unconstrained, that's not the case. Pins are pins and prevent moving or freeing the memory, their effect is the same and again worse than mlock on many levels.
>
> The ship sailed on this a decade ago, it is completely infeasible to go back now, it would completely break widely used things like GPU, RDMA and more.
>

I am late to this, but GPUs should not be used as an excuse for GUP. GUP is a broken model and the way GPUs use GUP is less broken than RDMA. In a GPU driver, the GUP contract with userspace is that the data the GPU can access is a snapshot of what the process memory was at the time you asked for the GUP. The process can start using different pages right after. There is no constant coherency contract (ie the CPU and GPU can be working on different pages).

If you want coherency, ie to always have the CPU and GPU work on the same page, then you need to use the mmu notifier and avoid pinning pages. Anything that does not abide by the mmu notifier is broken and cannot be fixed.

Cheers,
Jérôme
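A skeleton of the coherent, non-pinning model being described: the driver mirrors the CPU page tables and tears its own mappings down from the notifier callback, so the VM remains free to COW, migrate or unmap the pages. In the sketch below, struct dev_mirror and the dev_* helpers are hypothetical driver code; the mmu_notifier_* pieces are the real API.

#include <linux/mmu_notifier.h>

struct dev_mirror {
	struct mmu_notifier mn;
	/* device page table state, fences, etc. */
};

/* Hypothetical driver hooks, declared here only so the sketch is complete */
void dev_unmap_range(struct dev_mirror *m, unsigned long start, unsigned long end);
void dev_wait_for_idle(struct dev_mirror *m);

static int dev_invalidate_range_start(struct mmu_notifier *mn,
				      const struct mmu_notifier_range *range)
{
	struct dev_mirror *m = container_of(mn, struct dev_mirror, mn);

	/* Unmap the range from the device and wait for in-flight access */
	dev_unmap_range(m, range->start, range->end);
	dev_wait_for_idle(m);
	return 0;
}

static const struct mmu_notifier_ops dev_mmu_notifier_ops = {
	.invalidate_range_start = dev_invalidate_range_start,
};

static int dev_mirror_register(struct dev_mirror *m, struct mm_struct *mm)
{
	m->mn.ops = &dev_mmu_notifier_ops;
	/* pays the take_all_locks cost mentioned earlier, once, at register time */
	return mmu_notifier_register(&m->mn, mm);
}

The wait inside the invalidate callback is the part Jason criticizes in the next message: it puts the kernel in the path of every invalidation and can block the MM for as long as the device takes to quiesce.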
On Wed, Jan 13, 2021 at 04:56:38PM -0500, Jerome Glisse wrote:

> is a broken model and the way GPUs use GUP is less broken than RDMA. In a GPU driver, the GUP contract with userspace is that the data the GPU can access is a snapshot of what the process memory was at the time you asked for the GUP. The process can start using different pages right after. There is no constant coherency contract (ie the CPU and GPU can be working on different pages).

Look at the habana labs "totally not a GPU" driver, it doesn't work that way: GPU compute operations do want coherency.

The mmu notifier hackery some of the other GPU drivers use to get coherency requires putting the kernel between every single work submission, and has all kinds of wonky issues and limitations - I think it is a net worse approach than GUP, honestly.

Jason
On Wed, Jan 13, 2021 at 07:39:36PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 13, 2021 at 04:56:38PM -0500, Jerome Glisse wrote:
>
> > is a broken model and the way GPUs use GUP is less broken than RDMA. In a GPU driver, the GUP contract with userspace is that the data the GPU can access is a snapshot of what the process memory was at the time you asked for the GUP. The process can start using different pages right after. There is no constant coherency contract (ie the CPU and GPU can be working on different pages).
>
> Look at the habana labs "totally not a GPU" driver, it doesn't work that way: GPU compute operations do want coherency.
>
> The mmu notifier hackery some of the other GPU drivers use to get coherency requires putting the kernel between every single work submission, and has all kinds of wonky issues and limitations - I think it is a net worse approach than GUP, honestly.

Yes, what GPU drivers do today with GUP is wrong, but it is only used for texture upload/download. So that is a very limited scope (amdkfd being an exception here).

Yes also to the fact that waiting on a GPU fence from the mmu notifier callback is bad. We are thinking about how to solve this.

But what does matter is that the hardware is moving in the right direction and we will no longer need GUP. So GUP is dying out in GPU drivers.

Cheers,
Jérôme
On Sat 09-01-21 11:46:46, Linus Torvalds wrote: > On Sat, Jan 9, 2021 at 11:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote: > > > Side note, and not really related to UFFD, but the mmap_sem in > > > general: I was at one point actually hoping that we could make the > > > mmap_sem a spinlock, or at least make the rule be that we never do any > > > IO under it. At which point a write lock hopefully really shouldn't be > > > such a huge deal. > > > > There's a (small) group of us working towards that. It has some > > prerequisites, but where we're hoping to go currently: > > > > - Replace the vma rbtree with a b-tree protected with a spinlock > > - Page faults walk the b-tree under RCU, like peterz/laurent's SPF patchset > > - If we need to do I/O, take a refcount on the VMA > > > > After that, we can gradually move things out from mmap_sem protection > > to just the vma tree spinlock, or whatever makes sense for them. In a > > very real way the mmap_sem is the MM layer's BKL. > > Well, we could do the "no IO" part first, and keep the semaphore part. > > Some people actually prefer a semaphore to a spinlock, because it > doesn't end up causing preemption issues. > > As long as you don't do IO (or memory allocations) under a semaphore > (ok, in this case it's a rwsem, same difference), it might even be > preferable to keep it as a semaphore rather than as a spinlock. > > So it doesn't necessarily have to go all the way - we _could_ just try > something like "when taking the mmap_sem, set a thread flag" and then > have a "warn if doing allocations or IO under that flag". > > And since this is about performance, not some hard requirement, it > might not even matter if we catch all cases. If we fix it so that any > regular load on most normal filesystems never see the warning, we'd > already be golden. Honestly, I'd *love* if a filesystem can be guaranteed that ->fault and ->mkwrite callbacks do not happen under mmap_sem (or if at least fs would be free to drop mmap_sem if it finds the page is not already cached / prepared for writing). Because for filesystems the locking of page fault is really painful as the lock ordering wrt mmap_sem is exactly oposite compared to read / write path (read & write path must be designed so that mmap_sem can be taken inside it to copy user data, fault path may be all happening under mmap_sem). As a result this has been a long term source of deadlocks, stale data exposure issues, and filesystem corruption issues due to insufficient locking for multiple filesystems. But when I was looking at what it would take to achieve this several years ago, fixing all GUP users to deal with mmap_sem being dropped during a fault was a gigantic task because there were users of GUP relying on mmap_sem being held for large code sections around the GUP call... Honza
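To illustrate the lock ordering inversion Jan describes, here is a rough sketch of the two paths. The *_stub helpers stand in for real filesystem locking and write-back code; they are not actual kernel API, and real filesystems are considerably more involved.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/uio.h>

/* Hypothetical stand-ins for filesystem internals */
ssize_t fs_perform_write_stub(struct iov_iter *from);
void fs_lock_for_fault_stub(struct vm_fault *vmf);
void fs_unlock_for_fault_stub(struct vm_fault *vmf);

/* write(2) path: the filesystem lock is taken first, mmap_lock may be taken inside */
static ssize_t fs_write_sketch(struct inode *inode, struct iov_iter *from)
{
	ssize_t ret;

	inode_lock(inode);
	/* copying the user buffer may page fault and take mmap_lock inside */
	ret = fs_perform_write_stub(from);
	inode_unlock(inode);
	return ret;
}

/* fault path: mmap_lock is already held by the fault handler, fs locks nest inside */
static vm_fault_t fs_fault_sketch(struct vm_fault *vmf)
{
	vm_fault_t ret;

	fs_lock_for_fault_stub(vmf);	/* the inverted ordering Jan points at */
	ret = filemap_fault(vmf);
	fs_unlock_for_fault_stub(vmf);
	return ret;
}

The same two classes of locks end up nested in opposite orders on the two paths, which is why filesystems would benefit so much from being allowed to drop mmap_lock in ->fault and ->page_mkwrite, and why the GUP callers that rely on mmap_lock staying held make that conversion hard.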