Message ID | 20220808073232.8808-1-david@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW | expand |
On 08.08.22 09:32, David Hildenbrand wrote: > Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know > that FOLL_FORCE can be possibly dangerous, especially if there are races > that can be exploited by user space. > > Right now, it would be sufficient to have some code that sets a PTE of > a R/O-mapped shared page dirty, in order for it to erroneously become > writable by FOLL_FORCE. The implications of setting a write-protected PTE > dirty might not be immediately obvious to everyone. > > And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set > pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map > a shmem page R/O while marking the pte dirty. This can be used by > unprivileged user space to modify tmpfs/shmem file content even if the user > does not have write permissions to the file -- Dirty COW restricted to > tmpfs/shmem (CVE-2022-2590). > > To fix such security issues for good, the insight is that we really only > need that fancy retry logic (FOLL_COW) for COW mappings that are not > writable (!VM_WRITE). And in a COW mapping, we really only broke COW if > we have an exclusive anonymous page mapped. If we have something else > mapped, or the mapped anonymous page might be shared (!PageAnonExclusive), > we have to trigger a write fault to break COW. If we don't find an > exclusive anonymous page when we retry, we have to trigger COW breaking > once again because something intervened. > > Let's move away from this mandatory-retry + dirty handling and rely on > our PageAnonExclusive() flag for making a similar decision, to use the > same COW logic as in other kernel parts here as well. In case we stumble > over a PTE in a COW mapping that does not map an exclusive anonymous page, > COW was not properly broken and we have to trigger a fake write-fault to > break COW. > > Just like we do in can_change_pte_writable() added via > commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive > anonymous pages when changing protection") and commit 76aefad628aa > ("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take > care of softdirty and uffd-wp manually. > > For example, a write() via /proc/self/mem to a uffd-wp-protected range has > to fail instead of silently granting write access and bypassing the > userspace fault handler. Note that FOLL_FORCE is not only used for debug > access, but also triggered by applications without debug intentions, for > example, when pinning pages via RDMA. > > This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are > affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR. > > Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So > let's just get rid of it. I have to add here: "Thanks to Nadav Amit for pointing out that the pte_dirty() check in FOLL_FORCE code is problematic and might be exploitable."
I'm still reading through this, but STOP DOING THIS On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: > > + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); Using BUG_ON() for debugging is simply not ok. And saying "but it's just a VM_BUG_ON()" does not change *anything*. At least Fedora enables that unconditionally for normal people, it is not some kind of "only VM people do this". Really. BUG_ON() IS NOT FOR DEBUGGING. Stop it. Now. If you have a condition that must not happen, you either write that condition into the code, or - if you are convinced it cannot happen - you make it a WARN_ON_ONCE() so that people can report it to you. The BUG_ON() will just make the machine die. And for the facebooks and googles of the world, the WARN_ON() will be sufficient. Linus
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: > > For example, a write() via /proc/self/mem to a uffd-wp-protected range has > to fail instead of silently granting write access and bypassing the > userspace fault handler. Note that FOLL_FORCE is not only used for debug > access, but also triggered by applications without debug intentions, for > example, when pinning pages via RDMA. So this made me go "Whaa?" I didn't even realize that the media drivers and rdma used FOLL_FORCE. That's just completely bogus. Why do they do that? It seems to be completely bogus, and seems to have no actual valid reason for it. Looking through the history, it goes back to the original code submission in 2006, and doesn't have a mention of why. I think the original reason was that the code didn't have pinning, so it used "do a write" as a pin mechanism - even for reads. IOW, I think the non-ptrace use of FOLL_FORCE should just be removed. Linus
On 09.08.22 20:27, Linus Torvalds wrote: > I'm still reading through this, but > > STOP DOING THIS > > On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: >> >> + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); > > Using BUG_ON() for debugging is simply not ok. > > And saying "but it's just a VM_BUG_ON()" does not change *anything*. > At least Fedora enables that unconditionally for normal people, it is > not some kind of "only VM people do this". > > Really. BUG_ON() IS NOT FOR DEBUGGING. I totally agree with BUG_ON ... but if I get talked to in all-caps on a Thursday evening and feel like I just touched the forbidden fruit, I have to ask for details about VM_BUG_ON nowadays. VM_BUG_ON is only active with CONFIG_DEBUG_VM. ... which indicated some kind of debugging at least to me. I *know* that Fedora enables it and I *know* that this will make Fedora crash. I know why Fedora enables this debug option, but it somewhat destorys the whole purpose of VM_BUG_ON kind of nowadays? For this case, this condition will never trigger and I consider it much more a hint to the reader that we can rest assured that this condition holds. And on production systems, it will get optimized out. Should we forbid any new usage of VM_BUG_ON just like we mostly do with BUG_ON? > > Stop it. Now. > > If you have a condition that must not happen, you either write that > condition into the code, or - if you are convinced it cannot happen - > you make it a WARN_ON_ONCE() so that people can report it to you. I can just turn that into a WARN_ON_ONCE() or even a VM_WARN_ON_ONCE().
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: > > For example, a write() via /proc/self/mem to a uffd-wp-protected range has > to fail instead of silently granting write access and bypassing the > userspace fault handler. This, btw, just makes me go "uffd-wp is broken garbage" once more. It also makes me go "if uffd-wp can disallow ptrace writes, then why doesn't regular write protect do it"? IOW, I don't think the patch is wrong (apart from the VM_BUG_ON's that absolutely must go away), but I get the strong feelign that we instead should try to get rid of FOLL_FORCE entirely. If some other user action can stop FOLL_FORCE anyway, then why do we support it at all? Linus
On Tue, Aug 09, 2022 at 11:40:50AM -0700, Linus Torvalds wrote: > On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: > > > > For example, a write() via /proc/self/mem to a uffd-wp-protected range has > > to fail instead of silently granting write access and bypassing the > > userspace fault handler. Note that FOLL_FORCE is not only used for debug > > access, but also triggered by applications without debug intentions, for > > example, when pinning pages via RDMA. > > So this made me go "Whaa?" > > I didn't even realize that the media drivers and rdma used FOLL_FORCE. > > That's just completely bogus. > > Why do they do that? > > It seems to be completely bogus, and seems to have no actual valid > reason for it. Looking through the history, it goes back to the > original code submission in 2006, and doesn't have a mention of why. It is because of all this madness with COW. Lets say an app does: buf = mmap(MAP_PRIVATE) rdma_pin_for_read(buf) buf[0] = 1 Then the store to buf[0] will COW the page and the pin will become decoherent. The purpose of the FORCE is to force COW to happen early so this is avoided. Sadly there are real apps that end up working this way, eg because they are using buffer in .data or something. I've hoped David's new work on this provided some kind of path to a proper solution, as the need is very similar to all the other places where we need to ensure there is no possiblity of future COW. So, these usage can be interpreted as a FOLL flag we don't have - some kind of (FOLL_EXCLUSIVE | FOLL_READ) to match the PG_anon_exclusive sort of idea. Jason
On 09.08.22 20:48, Jason Gunthorpe wrote: > On Tue, Aug 09, 2022 at 11:40:50AM -0700, Linus Torvalds wrote: >> On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: >>> >>> For example, a write() via /proc/self/mem to a uffd-wp-protected range has >>> to fail instead of silently granting write access and bypassing the >>> userspace fault handler. Note that FOLL_FORCE is not only used for debug >>> access, but also triggered by applications without debug intentions, for >>> example, when pinning pages via RDMA. >> >> So this made me go "Whaa?" >> >> I didn't even realize that the media drivers and rdma used FOLL_FORCE. >> >> That's just completely bogus. >> >> Why do they do that? >> >> It seems to be completely bogus, and seems to have no actual valid >> reason for it. Looking through the history, it goes back to the >> original code submission in 2006, and doesn't have a mention of why. > > It is because of all this madness with COW. > > Lets say an app does: > > buf = mmap(MAP_PRIVATE) > rdma_pin_for_read(buf) > buf[0] = 1 > > Then the store to buf[0] will COW the page and the pin will become > decoherent. > > The purpose of the FORCE is to force COW to happen early so this is > avoided. > > Sadly there are real apps that end up working this way, eg because > they are using buffer in .data or something. > > I've hoped David's new work on this provided some kind of path to a > proper solution, as the need is very similar to all the other places > where we need to ensure there is no possiblity of future COW. > > So, these usage can be interpreted as a FOLL flag we don't have - some > kind of (FOLL_EXCLUSIVE | FOLL_READ) to match the PG_anon_exclusive > sort of idea. Thanks Jason for the explanation. I have patches in the works to no longer use FOLL_FORCE|FOLL_WRITE for taking a reliable longerterm R/O pin in a MAP_PRIVATE mapping. The patches are mostly done (and comparably simple), I merely deferred sending them out because I stumbled over this issue first. Some ugly corner cases of MAP_SHARED remain, but for most prominent use cases, my upcoming patches should allow us to just have longterm R/O pins working as expected.
On Tue, Aug 9, 2022 at 11:45 AM David Hildenbrand <david@redhat.com> wrote: > > I totally agree with BUG_ON ... but if I get talked to in all-caps on a > Thursday evening and feel like I just touched the forbidden fruit, I > have to ask for details about VM_BUG_ON nowadays. > > VM_BUG_ON is only active with CONFIG_DEBUG_VM. ... which indicated some > kind of debugging at least to me. I *know* that Fedora enables it and I > *know* that this will make Fedora crash. No. VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally no different, the only difference is "we can make the code smaller because these are less important". The only possible case where BUG_ON can validly be used is "I have some fundamental data corruption and cannot possibly return an error". This kind of "I don't think this can happen" is _never_ an excuse for it. Honestly, 99% of our existing BUG_ON() ones are completely bogus, and left-over debug code that wasn't removed because they never triggered. I've several times considered just using a coccinelle script to remove every single BUG_ON() (and VM_BUG_ON()) as simply bogus. Because they are pure noise. I just tried to find a valid BUG_ON() that would make me go "yeah, that's actually worth it", and couldn't really find one. Yeah, there are several ones in the scheduler that make me go "ok, if that triggers, the machine is dead anyway", so in that sense there are certainly BUG_ON()s that don't _hurt_. But as a very good approximation, the rule is "absolutely no new BUG_ON() calls _ever_". Because I really cannot see a single case where "proper error handling and WARN_ON_ONCE()" isn't the right thing. Now, that said, there is one very valid sub-form of BUG_ON(): BUILD_BUG_ON() is absolutely 100% fine. Linus
On Tue, Aug 9, 2022 at 11:48 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > It is because of all this madness with COW. Yes, yes, but we have the proper long-term pinning now with PG_anon_exclusive, and it actually gets the pinning right not just over COW, but even over a fork - which that early write never did. David, I thought all of that got properly merged? Is there something still missing? Linus
On Tue, Aug 09, 2022 at 11:59:45AM -0700, Linus Torvalds wrote: > But as a very good approximation, the rule is "absolutely no new > BUG_ON() calls _ever_". Because I really cannot see a single case > where "proper error handling and WARN_ON_ONCE()" isn't the right > thing. Parallel to this discussion I've had ones where people more or less say Since BUG_ON crashes the machine and Linus says that crashing the machine is bad, WARN_ON will also crash the machine if you set the panic_on_warn parameter, so it is also bad, thus we shouldn't use anything. I've generally maintained that people who set the panic_on_warn *want* these crashes, because that is the entire point of it. So we should use WARN_ON with an error recovery for "can't happen" assertions like these. I think it is what you are saying here. Jason
On 09.08.22 20:48, Linus Torvalds wrote: > On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: >> >> For example, a write() via /proc/self/mem to a uffd-wp-protected range has >> to fail instead of silently granting write access and bypassing the >> userspace fault handler. > > This, btw, just makes me go "uffd-wp is broken garbage" once more. > > It also makes me go "if uffd-wp can disallow ptrace writes, then why > doesn't regular write protect do it"? I remember that it's not just uffd-wp, it's also ordinary userfaultfd if we have no page mapped, because we'd have to drop the mmap lock in order to notify user space about the fault and wait for a resolution. IIUC, we cannot tell what exactly user-space will do as a response to a user write fault here (for example, QEMU VM snapshots have to copy page content away such that the VM snapshot remains consistent and we won't corrupt the snapshot), so we have to back off and fail the GUP. I'd say, for ptrace that's even the right thing to do because one might deadlock waiting on the user space thread that handles faults ... but that's a little off-topic to this fix here. I'm just trying to keep the semantics unchanged, as weird as they might be. > > IOW, I don't think the patch is wrong (apart from the VM_BUG_ON's that > absolutely must go away), but I get the strong feelign that we instead > should try to get rid of FOLL_FORCE entirely. I can resend v2 soonish, taking care of the VM_BUG_ON as you requested if there are no other comments. > > If some other user action can stop FOLL_FORCE anyway, then why do we > support it at all? My humble opinion is that debugging userfaultfd-managed memory is a corner case and that we can hopefully stop using FOLL_FORCE outside of debugging context soon. Having that said, I do enjoy having the uffd and uffd-wp feature available in user space (especially in QEMU). I don't always enjoy having to handle such corner cases in the kernel.
On 09.08.22 21:07, Linus Torvalds wrote: > On Tue, Aug 9, 2022 at 11:48 AM Jason Gunthorpe <jgg@nvidia.com> wrote: >> >> It is because of all this madness with COW. > > Yes, yes, but we have the proper long-term pinning now with > PG_anon_exclusive, and it actually gets the pinning right not just > over COW, but even over a fork - which that early write never did. > > David, I thought all of that got properly merged? Is there something > still missing? The only thing to get R/O longterm pins in MAP_PRIVATE correct that's missing is that we have to break COW when taking a R/O longterm pin when *not* finding an anon page inside a private mapping. Regarding anon pages I am not aware of issues (due to PG_anon_exclusive). If anybody here wants to stare at a webpage, the following commit explains the rough idea for MAP_PRIVATE: https://github.com/davidhildenbrand/linux/commit/cd7989fb76d2513c86f01e6f7a74415eee5d3150 Once we have that in place, we can mostly get rid of FOLL_FORCE|FOLL_WRITE for R/O longterm pins. There are some corner cases though that need some additional thought which i am still working on. FS-handled COW in MAP_SHARED mappings is just nasty (hello DAX). (the wrong use of FOLL_GET instead of FOLL_PIN for O_DIRECT and friends still persists, but that's a different thing to handle and it's only problematic with concurrent fork() IIRC)
On Tue, Aug 9, 2022 at 12:09 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > Since BUG_ON crashes the machine and Linus says that crashing the > machine is bad, WARN_ON will also crash the machine if you set the > panic_on_warn parameter, so it is also bad, thus we shouldn't use > anything. If you set 'panic_on_warn' you get to keep both pieces when something breaks. The thing is, there are people who *do* want to stop immediately when something goes wrong in the kernel. Anybody doing large-scale virtualization presumably has all the infrastructure to get debug info out of the virtual environment. And people who run controlled loads in big server machine setups and have a MIS department to manage said machines typically also prefer for a machine to just crash over continuing. So in those situations, a dead machine is still a dead machine, but you get the information out, and panic_on_warn is fine, because panic and reboot is fine. And yes, that's actually a fairly common case. Things like syzkaller etc *wants* to abort on the first warning, because that's kind of the point. But while that kind of virtualized automation machinery is very very common, and is a big deal, it's by no means the only deal, and the most important thing to the point where nothing else matters. And if you are *not* in a farm, and if you are *not* using virtualization, a dead machine is literally a useless brick. Nobody has serial lines on individual machines any more. In most cases, the hardware literally doesn't even exist any more. So in that situation, you really cannot afford to take the approach of "just kill the machine". If you are on a laptop and are doing power management code, you generally cannot do that in a virtual environment, and you already have enough problems with suspend and resume being hard to debug, without people also going "oh, let's just BUG_ON() and kill the machine". Because the other side of that "we have a lot of machine farms doing automated testing" is that those machine farms do not generally find a lot of the exciting cases. Almost every single merge window, I end up having to bisect and report an oops or a WARN_ON(), because I actually run on real hardware. And said problem was never seen in linux-next. So we have two very different cases: the "virtual machine with good logging where a dead machine is fine" - use 'panic_on_warn'. And the actual real hardware with real drivers, running real loads by users. Both are valid. But the second case means that BUG_ON() is basically _never_ valid. Linus
On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: > So I've read through the patch several times, and it seems fine, but this function (and the pmd version of it) just read oddly to me. > +static inline bool can_follow_write_pte(pte_t pte, struct page *page, > + struct vm_area_struct *vma, > + unsigned int flags) > +{ > + if (pte_write(pte)) > + return true; > + if (!(flags & FOLL_FORCE)) > + return false; > + > + /* > + * See check_vma_flags(): only COW mappings need that special > + * "force" handling when they lack VM_WRITE. > + */ > + if (vma->vm_flags & VM_WRITE) > + return false; > + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); So apart from the VM_BUG_ON(), this code just looks really strange - even despite the comment. Just conceptually, the whole "if it's writable, return that you cannot follow it for a write" just looks so very very strange. That doesn't make the code _wrong_, but considering how many times this has had subtle bugs, let's not write code that looks strange. So I would suggest that to protect against future bugs, we try to make it be fairly clear and straightforward, and maybe even a bit overly protective. For example, let's kill the "shared mapping that you don't have write permissions to" very explicitly and without any subtle code at all. The vm_flags tests are cheap and easy, and we could very easily just add some core ones to make any mistakes much less critical. Now, making that 'is_cow_mapping()' check explicit at the very top of this would already go a long way: /* FOLL_FORCE for writability only affects COW mappings */ if (!is_cow_mapping(vma->vm_flags)) return false; but I'd actually go even further: in this case that "is_cow_mapping()" helper to some degree actually hides what is going on. So I'd actually prefer for that function to be written something like /* If the pte is writable, we can write to the page */ if (pte_write(pte)) return true; /* Maybe FOLL_FORCE is set to override it? */ if (flags & FOLL_FORCE) return false; /* But FOLL_FORCE has no effect on shared mappings */ if (vma->vm_flags & MAP_SHARED) return false; /* .. or read-only private ones */ if (!(vma->vm_flags & MAP_MAYWRITE)) return false; /* .. or already writable ones that just need to take a write fault */ if (vma->vm_flags & MAP_WRITE) return false; and the two first vm_flags tests above are basically doing tat "is_cow_mapping()", and maybe we could even have a comment to that effect, but wouldn't it be nice to just write it out that way? And after you've written it out like the above, now that if (!page || !PageAnon(page) || !PageAnonExclusive(page)) return false; makes you pretty safe from a data sharing perspective: it's most definitely not a shared page at that point. So if you write it that way, the only remaining issues are the magic special soft-dirty and uffd ones, but at that point it's purely about the semantics of those features, no longer about any possible "oh, we fooled some shared page to be writable". And I think the above is fairly legible without any subtle cases, and the one-liner comments make it all fairly clear that it's testing. Is any of this in any _technical_ way different from what your patch did? No. It's literally just rewriting it to be a bit more explicit in what it is doing, I think, and it makes that odd "it's not writable if VM_WRITE is set" case a bit more explicit. Hmm? Linus
On 09.08.22 22:00, Linus Torvalds wrote: > On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: >> > > So I've read through the patch several times, and it seems fine, but > this function (and the pmd version of it) just read oddly to me. > >> +static inline bool can_follow_write_pte(pte_t pte, struct page *page, >> + struct vm_area_struct *vma, >> + unsigned int flags) >> +{ >> + if (pte_write(pte)) >> + return true; >> + if (!(flags & FOLL_FORCE)) >> + return false; >> + >> + /* >> + * See check_vma_flags(): only COW mappings need that special >> + * "force" handling when they lack VM_WRITE. >> + */ >> + if (vma->vm_flags & VM_WRITE) >> + return false; >> + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); > > So apart from the VM_BUG_ON(), this code just looks really strange - > even despite the comment. Just conceptually, the whole "if it's > writable, return that you cannot follow it for a write" just looks so > very very strange. > > That doesn't make the code _wrong_, but considering how many times > this has had subtle bugs, let's not write code that looks strange. > > So I would suggest that to protect against future bugs, we try to make > it be fairly clear and straightforward, and maybe even a bit overly > protective. > > For example, let's kill the "shared mapping that you don't have write > permissions to" very explicitly and without any subtle code at all. > The vm_flags tests are cheap and easy, and we could very easily just > add some core ones to make any mistakes much less critical. > > Now, making that 'is_cow_mapping()' check explicit at the very top of > this would already go a long way: > > /* FOLL_FORCE for writability only affects COW mappings */ > if (!is_cow_mapping(vma->vm_flags)) > return false; I actually put the is_cow_mapping() mapping check in there because check_vma_flags() should make sure that we cannot possibly end up here in that case. But we can spell it out with comments, doesn't hurt. > > but I'd actually go even further: in this case that "is_cow_mapping()" > helper to some degree actually hides what is going on. > > So I'd actually prefer for that function to be written something like > > /* If the pte is writable, we can write to the page */ > if (pte_write(pte)) > return true; > > /* Maybe FOLL_FORCE is set to override it? */ > if (flags & FOLL_FORCE) > return false; > > /* But FOLL_FORCE has no effect on shared mappings */ > if (vma->vm_flags & MAP_SHARED) > return false; > > /* .. or read-only private ones */ > if (!(vma->vm_flags & MAP_MAYWRITE)) > return false; > > /* .. or already writable ones that just need to take a write fault */ > if (vma->vm_flags & MAP_WRITE) > return false; > > and the two first vm_flags tests above are basically doing tat > "is_cow_mapping()", and maybe we could even have a comment to that > effect, but wouldn't it be nice to just write it out that way? > > And after you've written it out like the above, now that > > if (!page || !PageAnon(page) || !PageAnonExclusive(page)) > return false; > > makes you pretty safe from a data sharing perspective: it's most > definitely not a shared page at that point. > > So if you write it that way, the only remaining issues are the magic > special soft-dirty and uffd ones, but at that point it's purely about > the semantics of those features, no longer about any possible "oh, we > fooled some shared page to be writable". > > And I think the above is fairly legible without any subtle cases, and > the one-liner comments make it all fairly clear that it's testing. > > Is any of this in any _technical_ way different from what your patch > did? No. It's literally just rewriting it to be a bit more explicit in > what it is doing, I think, and it makes that odd "it's not writable if > VM_WRITE is set" case a bit more explicit. > > Hmm? No strong opinion. I'm happy as long as it's fixed, and the fix is robust.
On 09.08.22 22:00, Linus Torvalds wrote: > On Mon, Aug 8, 2022 at 12:32 AM David Hildenbrand <david@redhat.com> wrote: >> > > So I've read through the patch several times, and it seems fine, but > this function (and the pmd version of it) just read oddly to me. > >> +static inline bool can_follow_write_pte(pte_t pte, struct page *page, >> + struct vm_area_struct *vma, >> + unsigned int flags) >> +{ >> + if (pte_write(pte)) >> + return true; >> + if (!(flags & FOLL_FORCE)) >> + return false; >> + >> + /* >> + * See check_vma_flags(): only COW mappings need that special >> + * "force" handling when they lack VM_WRITE. >> + */ >> + if (vma->vm_flags & VM_WRITE) >> + return false; >> + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); > > So apart from the VM_BUG_ON(), this code just looks really strange - > even despite the comment. Just conceptually, the whole "if it's > writable, return that you cannot follow it for a write" just looks so > very very strange. > > That doesn't make the code _wrong_, but considering how many times > this has had subtle bugs, let's not write code that looks strange. > > So I would suggest that to protect against future bugs, we try to make > it be fairly clear and straightforward, and maybe even a bit overly > protective. > > For example, let's kill the "shared mapping that you don't have write > permissions to" very explicitly and without any subtle code at all. > The vm_flags tests are cheap and easy, and we could very easily just > add some core ones to make any mistakes much less critical. > > Now, making that 'is_cow_mapping()' check explicit at the very top of > this would already go a long way: > > /* FOLL_FORCE for writability only affects COW mappings */ > if (!is_cow_mapping(vma->vm_flags)) > return false; > > but I'd actually go even further: in this case that "is_cow_mapping()" > helper to some degree actually hides what is going on. > > So I'd actually prefer for that function to be written something like > > /* If the pte is writable, we can write to the page */ > if (pte_write(pte)) > return true; > > /* Maybe FOLL_FORCE is set to override it? */ > if (flags & FOLL_FORCE) > return false; > > /* But FOLL_FORCE has no effect on shared mappings */ > if (vma->vm_flags & MAP_SHARED) > return false; I'd actually rather check for MAP_MAYSHARE here, which is even stronger. Thoughts?
On Tue, Aug 9, 2022 at 1:07 PM David Hildenbrand <david@redhat.com> wrote: > > > /* But FOLL_FORCE has no effect on shared mappings */ > > if (vma->vm_flags & MAP_SHARED) > > return false; > > I'd actually rather check for MAP_MAYSHARE here, which is even stronger. > Thoughts? Hmm. Adding the test for both is basically free (all those vm_flags checks end up being a bit mask and compare), so no objections. For some reason I though VM_SHARED and VM_MAYSHARE end up always being the same bits, and it was a mistake to make them two bits in the first place (unlike the read/write/exec bits that can are about mprotect), But as there are two bits, I'm sure somebody ends up touching one and not the other. So yeah, I don't see any downside to just checking both bits. [ That said, is_cow_mapping() only checks VM_SHARED, so if they are ever different, that's a potential source of confusion ] Linus
On 09.08.22 22:14, Linus Torvalds wrote: > On Tue, Aug 9, 2022 at 1:07 PM David Hildenbrand <david@redhat.com> wrote: >> >>> /* But FOLL_FORCE has no effect on shared mappings */ >>> if (vma->vm_flags & MAP_SHARED) >>> return false; >> >> I'd actually rather check for MAP_MAYSHARE here, which is even stronger. >> Thoughts? > > Hmm. Adding the test for both is basically free (all those vm_flags > checks end up being a bit mask and compare), so no objections. > > For some reason I though VM_SHARED and VM_MAYSHARE end up always being > the same bits, and it was a mistake to make them two bits in the first > place (unlike the read/write/exec bits that can are about mprotect), > > But as there are two bits, I'm sure somebody ends up touching one and > not the other. > > So yeah, I don't see any downside to just checking both bits. > > [ That said, is_cow_mapping() only checks VM_SHARED, so if they are > ever different, that's a potential source of confusion ] IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file mappings we only set VM_SHARED if the file allows for writes (and we can set VM_MAYWRITE or even VM_WRITE). [don't ask me why, it's a steady source of confusion] That's why is_cow_mapping() works even due to the weird MAP_SHARED vs. VM_MAYSHARE logic. I'd actually only check for VM_MAYSHARE here, which implies MAP_SHARED. If someone would ever mess that up, I guess we'd be in bigger trouble. But whatever you prefer.
On Tue, Aug 9, 2022 at 1:14 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > But as there are two bits, I'm sure somebody ends up touching one and > not the other. Ugh. The nommu code certainly does odd things here. That just looks wrong. And the hugetlb code does something horrible too, but never *sets* it, so it looks like some odd subset of VM_SHARED. Linus
On 09.08.22 22:20, Linus Torvalds wrote: > On Tue, Aug 9, 2022 at 1:14 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> But as there are two bits, I'm sure somebody ends up touching one and >> not the other. > > Ugh. The nommu code certainly does odd things here. That just looks wrong. > > And the hugetlb code does something horrible too, but never *sets* it, > so it looks like some odd subset of VM_SHARED. mm/mmap.c:do_mmap() contains the magic bit switch (flags & MAP_TYPE) { case MAP_SHARED: ... case MAP_SHARED_VALIDATE: ... vm_flags |= VM_SHARED | VM_MAYSHARE; if (!(file->f_mode & FMODE_WRITE)) vm_flags &= ~(VM_MAYWRITE | VM_SHARED); fallthrough; VM_SHARED semantics are confusing.
On Tue, Aug 9, 2022 at 1:20 PM David Hildenbrand <david@redhat.com> wrote: > > IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file > mappings we only set VM_SHARED if the file allows for writes Heh. This is a horrific hack, and probably should go away. Yeah, we have that if (!(file->f_mode & FMODE_WRITE)) vm_flags &= ~(VM_MAYWRITE | VM_SHARED); but I think that's _entirely_ historical. Long long ago, in a galaxy far away, we didn't handle shared mmap() very well. In fact, we used to not handle it at all. But nntpd would use write() to update the spool file, adn them read it through a shared mmap. And since our mmap() *was* coherent with people doing write() system calls, but didn't handle actual dirty shared mmap, what Linux used to do was to just say "Oh, you want a read-only shared file mmap? I can do that - I'll just downgrade it to a read-only _private_ mapping, and it actually ends up with the same semantics". And here we are, 30 years later, and it still does that, but it leaves the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a shared mapping. Linus
On Tue, Aug 9, 2022 at 1:30 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And here we are, 30 years later, and it still does that, but it leaves > the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a > shared mapping. .. thinking about it, we end up still having some things that this helps. For example, because we clear the VM_SHARED flags for read-only shared mappings, they don't end up going through mapping_{un}map_writable(), and don't update i_mmap_writable, and don't cause issues with mapping_deny_writable() or mapping_writably_mapped(). So it ends up actually having random small semantic details due to those almost three decades of history. I'm sure there are other odd pieces like that. Linus
On 09.08.22 22:30, Linus Torvalds wrote: > On Tue, Aug 9, 2022 at 1:20 PM David Hildenbrand <david@redhat.com> wrote: >> >> IIUC VM_MAYSHARE is always set in a MAP_SHARED mapping, but for file >> mappings we only set VM_SHARED if the file allows for writes > > Heh. > > This is a horrific hack, and probably should go away. > > Yeah, we have that > > if (!(file->f_mode & FMODE_WRITE)) > vm_flags &= ~(VM_MAYWRITE | VM_SHARED); > > > but I think that's _entirely_ historical. > > Long long ago, in a galaxy far away, we didn't handle shared mmap() > very well. In fact, we used to not handle it at all. > > But nntpd would use write() to update the spool file, adn them read it > through a shared mmap. > > And since our mmap() *was* coherent with people doing write() system > calls, but didn't handle actual dirty shared mmap, what Linux used to > do was to just say "Oh, you want a read-only shared file mmap? I can > do that - I'll just downgrade it to a read-only _private_ mapping, and > it actually ends up with the same semantics". > > And here we are, 30 years later, and it still does that, but it leaves > the VM_MAYSHARE flag so that /proc/<pid>/maps can show that it's a > shared mapping. I was suspecting that this code is full of legacy :) What would make sense to me is to just have VM_SHARED and make it correspond to MAP_SHARED, that would at least confuse me less. Once I have some spare cycles I'll see how easy that might be to achieve.
From: Jason Gunthorpe > Sent: 09 August 2022 20:08 > > On Tue, Aug 09, 2022 at 11:59:45AM -0700, Linus Torvalds wrote: > > > But as a very good approximation, the rule is "absolutely no new > > BUG_ON() calls _ever_". Because I really cannot see a single case > > where "proper error handling and WARN_ON_ONCE()" isn't the right > > thing. > > Parallel to this discussion I've had ones where people more or less > say > > Since BUG_ON crashes the machine and Linus says that crashing the > machine is bad, WARN_ON will also crash the machine if you set the > panic_on_warn parameter, so it is also bad, thus we shouldn't use > anything. > > I've generally maintained that people who set the panic_on_warn *want* > these crashes, because that is the entire point of it. So we should > use WARN_ON with an error recovery for "can't happen" assertions like > these. I think it is what you are saying here. They don't necessarily want the crashes, it is more the people who built the distribution think they want the crashes. I have had issues with a customer system (with our drivers) randomly locking up. Someone had decided that 'PANIC_ON_OOPS' was a good idea but hadn't enabled anything to actually take the dump. So instead of a diagnosable problem (and a 'doh' moment) you get several weeks of head scratching and a very annoyed user. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 18e01474cf6b..2222ed598112 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2885,7 +2885,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, #define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */ #define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */ #define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */ -#define FOLL_COW 0x4000 /* internal GUP flag */ #define FOLL_ANON 0x8000 /* don't do file mappings */ #define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */ #define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */ diff --git a/mm/gup.c b/mm/gup.c index 732825157430..7a0b207f566f 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -478,14 +478,34 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, return -EEXIST; } -/* - * FOLL_FORCE can write to even unwritable pte's, but only - * after we've gone through a COW cycle and they are dirty. - */ -static inline bool can_follow_write_pte(pte_t pte, unsigned int flags) -{ - return pte_write(pte) || - ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte)); +/* FOLL_FORCE can write to even unwritable PTEs in COW mappings. */ +static inline bool can_follow_write_pte(pte_t pte, struct page *page, + struct vm_area_struct *vma, + unsigned int flags) +{ + if (pte_write(pte)) + return true; + if (!(flags & FOLL_FORCE)) + return false; + + /* + * See check_vma_flags(): only COW mappings need that special + * "force" handling when they lack VM_WRITE. + */ + if (vma->vm_flags & VM_WRITE) + return false; + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); + + /* + * See can_change_pte_writable(): we broke COW and could map the page + * writable if we have an exclusive anonymous page and a write-fault + * isn't require for other reasons. + */ + if (!page || !PageAnon(page) || !PageAnonExclusive(page)) + return false; + if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte)) + return false; + return !userfaultfd_pte_wp(vma, pte); } static struct page *follow_page_pte(struct vm_area_struct *vma, @@ -528,12 +548,19 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, } if ((flags & FOLL_NUMA) && pte_protnone(pte)) goto no_page; - if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) { - pte_unmap_unlock(ptep, ptl); - return NULL; - } page = vm_normal_page(vma, address, pte); + + /* + * We only care about anon pages in can_follow_write_pte() and don't + * have to worry about pte_devmap() because they are never anon. + */ + if ((flags & FOLL_WRITE) && + !can_follow_write_pte(pte, page, vma, flags)) { + page = NULL; + goto out; + } + if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) { /* * Only return device mapping pages in the FOLL_GET or FOLL_PIN @@ -986,17 +1013,6 @@ static int faultin_page(struct vm_area_struct *vma, return -EBUSY; } - /* - * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when - * necessary, even if maybe_mkwrite decided not to set pte_write. We - * can thus safely do subsequent page lookups as if they were reads. - * But only do so when looping for pte_write is futile: in some cases - * userspace may also be wanting to write to the gotten user page, - * which a read fault here might prevent (a readonly page might get - * reCOWed by userspace write). - */ - if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE)) - *flags |= FOLL_COW; return 0; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8a7c1b344abe..352b5220e95e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1040,12 +1040,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, assert_spin_locked(pmd_lockptr(mm, pmd)); - /* - * When we COW a devmap PMD entry, we split it into PTEs, so we should - * not be in this function with `flags & FOLL_COW` set. - */ - WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set"); - /* FOLL_GET and FOLL_PIN are mutually exclusive. */ if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) == (FOLL_PIN | FOLL_GET))) @@ -1395,14 +1389,23 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) return VM_FAULT_FALLBACK; } -/* - * FOLL_FORCE can write to even unwritable pmd's, but only - * after we've gone through a COW cycle and they are dirty. - */ -static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags) +/* See can_follow_write_pte() on FOLL_FORCE details. */ +static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page, + struct vm_area_struct *vma, + unsigned int flags) { - return pmd_write(pmd) || - ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd)); + if (pmd_write(pmd)) + return true; + if (!(flags & FOLL_FORCE)) + return false; + if (vma->vm_flags & VM_WRITE) + return false; + VM_BUG_ON(!is_cow_mapping(vma->vm_flags)); + if (!page || !PageAnon(page) || !PageAnonExclusive(page)) + return false; + if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd)) + return false; + return !userfaultfd_huge_pmd_wp(vma, pmd); } struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, @@ -1411,12 +1414,16 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, unsigned int flags) { struct mm_struct *mm = vma->vm_mm; - struct page *page = NULL; + struct page *page; assert_spin_locked(pmd_lockptr(mm, pmd)); - if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags)) - goto out; + page = pmd_page(*pmd); + VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page); + + if ((flags & FOLL_WRITE) && + !can_follow_write_pmd(*pmd, page, vma, flags)) + return NULL; /* Avoid dumping huge zero page */ if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd)) @@ -1424,10 +1431,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, /* Full NUMA hinting faults to serialise migration in fault paths */ if ((flags & FOLL_NUMA) && pmd_protnone(*pmd)) - goto out; - - page = pmd_page(*pmd); - VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page); + return NULL; if (!pmd_write(*pmd) && gup_must_unshare(flags, page)) return ERR_PTR(-EMLINK); @@ -1444,7 +1448,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT; VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page); -out: return page; }
Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know that FOLL_FORCE can be possibly dangerous, especially if there are races that can be exploited by user space. Right now, it would be sufficient to have some code that sets a PTE of a R/O-mapped shared page dirty, in order for it to erroneously become writable by FOLL_FORCE. The implications of setting a write-protected PTE dirty might not be immediately obvious to everyone. And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map a shmem page R/O while marking the pte dirty. This can be used by unprivileged user space to modify tmpfs/shmem file content even if the user does not have write permissions to the file -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590). To fix such security issues for good, the insight is that we really only need that fancy retry logic (FOLL_COW) for COW mappings that are not writable (!VM_WRITE). And in a COW mapping, we really only broke COW if we have an exclusive anonymous page mapped. If we have something else mapped, or the mapped anonymous page might be shared (!PageAnonExclusive), we have to trigger a write fault to break COW. If we don't find an exclusive anonymous page when we retry, we have to trigger COW breaking once again because something intervened. Let's move away from this mandatory-retry + dirty handling and rely on our PageAnonExclusive() flag for making a similar decision, to use the same COW logic as in other kernel parts here as well. In case we stumble over a PTE in a COW mapping that does not map an exclusive anonymous page, COW was not properly broken and we have to trigger a fake write-fault to break COW. Just like we do in can_change_pte_writable() added via commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection") and commit 76aefad628aa ("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take care of softdirty and uffd-wp manually. For example, a write() via /proc/self/mem to a uffd-wp-protected range has to fail instead of silently granting write access and bypassing the userspace fault handler. Note that FOLL_FORCE is not only used for debug access, but also triggered by applications without debug intentions, for example, when pinning pages via RDMA. This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR. Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So let's just get rid of it. Note 1: We don't check for the PTE being dirty because it doesn't matter for making a "was COWed" decision anymore, and whoever modifies the page has to set the page dirty either way. Note 2: Kernels before extended uffd-wp support and before PageAnonExclusive (< 5.19) can simply revert the problematic commit instead and be safe regarding UFFDIO_CONTINUE. A backport to v5.19 requires minor adjustments due to lack of vma_soft_dirty_enabled(). Fixes: 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte") Cc: <stable@vger.kernel.org> # 5.16+ Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: David Hildenbrand <david@redhat.com> --- Against upstream from yesterday instead of v5.19 because I wanted to reference the mprotect commit IDs and can_change_pte_writable(), and I wanted to directly use vma_soft_dirty_enabled(). I have a working reproducer that I'll post to oss-security in one week. Of course, that reproducer no longer triggers with that commit and my ptrace testing indicated that FOLL_FORCE seems to continue working as expected. --- include/linux/mm.h | 1 - mm/gup.c | 62 +++++++++++++++++++++++++++++----------------- mm/huge_memory.c | 45 +++++++++++++++++---------------- 3 files changed, 63 insertions(+), 45 deletions(-) base-commit: 1612c382ffbdf1f673caec76502b1c00e6d35363