Message ID | c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [v2] mm/gup: disallow GUP writing to file-backed mappings by default | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
I'm pretty sure DIRECT I/O reads that write into file backed mappings are out there in the wild. So while I wish we had never allowed this, the exercise seems futile and instead we need to work on supporting this usecase, with the FOLL_PIN infrastructure being a big step toward that.
On Mon, Apr 24, 2023 at 02:43:56AM -0700, Christoph Hellwig wrote: > I'm pretty sure DIRECT I/O reads that write into file backed mappings > are out there in the wild. > > So while I wish we had never allowed this, the exercise seems futile and > instead we need to work on supporting this usecase, with the FOLL_PIN > infrastructure being a big step toward that. It's not entirely futile, there's at least one specific use case, which is io_uring which is currently open coding an equivalent check themselves. By introducing this change we prevent them from having to do so and provide a means by which other callers who implicitly need this do not have to do so either. In addition, this change frees up a blocked patch series intending to clean up GUP which should help open the door to further improvements across the system. So I would argue certainly not futile. In addition, I think it's useful to explicitly document that this is a broken case and, through use of the flag, highlight places which are problematic (although perhaps not exhaustively). I know Jason is keen on fixing this at a fundamental level and this flag is ultimately his suggestion, so it certainly doesn't stand in the way of this work moving forward.
On Mon, Apr 24, 2023 at 11:17:55AM +0100, Lorenzo Stoakes wrote: > On Mon, Apr 24, 2023 at 02:43:56AM -0700, Christoph Hellwig wrote: > > I'm pretty sure DIRECT I/O reads that write into file backed mappings > > are out there in the wild. I wonder if that is really the case? I know people tried this with RDMA and it didn't get very far before testing uncovered data corruption and kernel crashes.. Maybe O_DIRECT has a much smaller race window so people can get away with it? > I know Jason is keen on fixing this at a fundamental level and this flag is > ultimately his suggestion, so it certainly doesn't stand in the way of this > work moving forward. Yeah, the point is to close it off, because while we wish it was fixed properly, it isn't. We are still who knows how far away from it. In the mean time this is a fairly simple way to oops the kernel, especially with cases like io_uring and RDMA. So, I view it as a security problem. My general dislike was that io_uring protected itself from the security problem and we left all the rest of the GUP users out to dry. So, my suggestion was to mark the places where we want to allow this, eg O_DIRECT, and block everwhere else. Lorenzo, I would significantly par back the list you have. I also suggest we force block it at some kernel lockdown level.. Alternatively, perhaps we abuse FOLL_LONGTERM and prevent it from working with filebacked pages since, I think, the ease of triggering a bug goes up the longer the pages are pinned. Jason
On Mon, Apr 24, 2023 at 09:28:07AM -0300, Jason Gunthorpe wrote: > On Mon, Apr 24, 2023 at 11:17:55AM +0100, Lorenzo Stoakes wrote: > > On Mon, Apr 24, 2023 at 02:43:56AM -0700, Christoph Hellwig wrote: > > > I'm pretty sure DIRECT I/O reads that write into file backed mappings > > > are out there in the wild. > > I wonder if that is really the case? I know people tried this with > RDMA and it didn't get very far before testing uncovered data > corruption and kernel crashes.. Maybe O_DIRECT has a much smaller race > window so people can get away with it? It absolutely causes all kinds of issues even with O_DIRECT. I'd be all for trying to disallow it as it simplies a lot of things, but I fear it's not going to stick. > So, my suggestion was to mark the places where we want to allow this, > eg O_DIRECT, and block everwhere else. Lorenzo, I would significantly > par back the list you have. I think an opt-in is a good idea no matter how many places end up needing it. I'd prefer a less dramatic name and a better explanation on why it should only be set when needed. > I also suggest we force block it at some kernel lockdown level.. > > Alternatively, perhaps we abuse FOLL_LONGTERM and prevent it from > working with filebacked pages since, I think, the ease of triggering a > bug goes up the longer the pages are pinned. FOLL_LONGTERM on file backed pages is a nightmare. If you think you can get away with prohibiting it for RDMA, and KVM doesn't need it I'd be all for not allowing that at all.
On Mon, Apr 24, 2023 at 09:28:07AM -0300, Jason Gunthorpe wrote: > On Mon, Apr 24, 2023 at 11:17:55AM +0100, Lorenzo Stoakes wrote: > > On Mon, Apr 24, 2023 at 02:43:56AM -0700, Christoph Hellwig wrote: > > > I'm pretty sure DIRECT I/O reads that write into file backed mappings > > > are out there in the wild. > > I wonder if that is really the case? I know people tried this with > RDMA and it didn't get very far before testing uncovered data > corruption and kernel crashes.. Maybe O_DIRECT has a much smaller race > window so people can get away with it? > > > I know Jason is keen on fixing this at a fundamental level and this flag is > > ultimately his suggestion, so it certainly doesn't stand in the way of this > > work moving forward. > > Yeah, the point is to close it off, because while we wish it was > fixed properly, it isn't. We are still who knows how far away from it. > > In the mean time this is a fairly simple way to oops the kernel, > especially with cases like io_uring and RDMA. So, I view it as a > security problem. > > My general dislike was that io_uring protected itself from the > security problem and we left all the rest of the GUP users out to dry. > > So, my suggestion was to mark the places where we want to allow this, > eg O_DIRECT, and block everwhere else. Lorenzo, I would significantly > par back the list you have. I was being fairly conservative in that list, though we certainly need to set the flag for /proc/$pid/mem and ptrace to avoid breaking this functionality (I observed breakpoints breaking without it which obviously is a no go :). I'm not sure if there's a more general way we could check for this though? A perhaps slightly unpleasant solution might be to not enforce this when FOLL_FORCE is specified which is mostly a ptrace + friends thing then we could drop all those exceptions. I wouldn't be totally opposed to dropping it for RDMA too, because I suspect accessing file-backed mappings for that is pretty iffy. Do you have a sense of which in the list you feel could be pared back? > > I also suggest we force block it at some kernel lockdown level.. > > Alternatively, perhaps we abuse FOLL_LONGTERM and prevent it from > working with filebacked pages since, I think, the ease of triggering a > bug goes up the longer the pages are pinned. > This would solve the io_uring case and it is certainly more of a concern when the pin is intended to be kept around, though it feels a bit icky as a non-FOLL_LONGTERM pin could surely be problematic too? > Jason
On Mon, Apr 24, 2023 at 05:38:44AM -0700, Christoph Hellwig wrote: > > I wonder if that is really the case? I know people tried this with > > RDMA and it didn't get very far before testing uncovered data > > corruption and kernel crashes.. Maybe O_DIRECT has a much smaller race > > window so people can get away with it? > > It absolutely causes all kinds of issues even with O_DIRECT. I'd be > all for trying to disallow it as it simplies a lot of things, but I fear > it's not going to stick. I suggest the kernel lockdown mode again. If people want to do unsafe things they can boot in a lessor lockdown mode, we've disabled several kernel features this way already. lockdown integrity sounds appropriate for this kind of bug. Maybe we can start to get some data on who is actually using it. > > Alternatively, perhaps we abuse FOLL_LONGTERM and prevent it from > > working with filebacked pages since, I think, the ease of triggering a > > bug goes up the longer the pages are pinned. > > FOLL_LONGTERM on file backed pages is a nightmare. If you think you > can get away with prohibiting it for RDMA, and KVM doesn't need it > I'd be all for not allowing that at all. Yes, I think we can and should do this. Jason
On Mon, Apr 24, 2023 at 01:38:49PM +0100, Lorenzo Stoakes wrote: > I was being fairly conservative in that list, though we certainly need to > set the flag for /proc/$pid/mem and ptrace to avoid breaking this > functionality (I observed breakpoints breaking without it which obviously > is a no go :). I'm not sure if there's a more general way we could check > for this though? More broadly we should make sure these usages of GUP safe somehow so that it can reliably write to those types of pages without breaking the current FS contract.. I forget exactly, but IIRC, don't you have to hold some kind of page spinlock while writing to the page memory? So, users that do this, or can be fixed to do this, can get file backed pages. It suggests that a flag name is more like FOLL_CALLER_USES_FILE_WRITE_LOCKING > I wouldn't be totally opposed to dropping it for RDMA too, because I > suspect accessing file-backed mappings for that is pretty iffy. > > Do you have a sense of which in the list you feel could be pared back? Anything using FOLL_LONGTERM should not set the flag, GUP should even block the combination. And we need to have in mind that the flag indicates the code is buggy, so if you set it then we should understand how is that caller expected to be fixed. Jason
On Mon, Apr 24, 2023 at 10:39:25AM -0300, Jason Gunthorpe wrote: > On Mon, Apr 24, 2023 at 01:38:49PM +0100, Lorenzo Stoakes wrote: > > > I was being fairly conservative in that list, though we certainly need to > > set the flag for /proc/$pid/mem and ptrace to avoid breaking this > > functionality (I observed breakpoints breaking without it which obviously > > is a no go :). I'm not sure if there's a more general way we could check > > for this though? > > More broadly we should make sure these usages of GUP safe somehow so > that it can reliably write to those types of pages without breaking > the current FS contract.. > > I forget exactly, but IIRC, don't you have to hold some kind of page > spinlock while writing to the page memory? > I think perhaps you're thinking of the mm->mmap_lock? Which will be held for the FOLL_GET cases and simply prevent the VMA from disappearing below us but not do much else. > So, users that do this, or can be fixed to do this, can get file > backed pages. It suggests that a flag name is more like > FOLL_CALLER_USES_FILE_WRITE_LOCKING > As stated above, I'm not sure what locking you're referring to, but seems to me that FOLL_GET already implies what you're thinking? I wonder whether we should do this check purely for FOLL_PIN to be honest? As this indicates medium to long-term access without mmap_lock held. This would exclude the /proc/$pid/mem and ptrace paths which use gup_remote(). That and a very specific use of uprobes are the only places that use FOLL_GET in this instance and each of them are careful in any case to handle setting the dirty page flag. All PUP cases that do not specify FOLL_LONGTERM also do this, so we could atually go so far as to reduce the patch to simply performing the vma_wants_writenotify() check if (FOLL_PIN | FOLL_LONGTERM) is specified, which covers the io_uring case. Alternatively if we wanted to be safer, we could add a FOLL_ALLOW_FILE_PIN that is checked on FOLL_PIN and ignored on FOLL_LONGTERM? > > I wouldn't be totally opposed to dropping it for RDMA too, because I > > suspect accessing file-backed mappings for that is pretty iffy. > > > > Do you have a sense of which in the list you feel could be pared back? > > Anything using FOLL_LONGTERM should not set the flag, GUP should even > block the combination. OK > > And we need to have in mind that the flag indicates the code is > buggy, so if you set it then we should understand how is that caller > expected to be fixed. > > Jason I think we are working towards a much simpler solution in any case!
On Mon, Apr 24, 2023 at 03:29:57PM +0100, Lorenzo Stoakes wrote: > On Mon, Apr 24, 2023 at 10:39:25AM -0300, Jason Gunthorpe wrote: > > On Mon, Apr 24, 2023 at 01:38:49PM +0100, Lorenzo Stoakes wrote: > > > > > I was being fairly conservative in that list, though we certainly need to > > > set the flag for /proc/$pid/mem and ptrace to avoid breaking this > > > functionality (I observed breakpoints breaking without it which obviously > > > is a no go :). I'm not sure if there's a more general way we could check > > > for this though? > > > > More broadly we should make sure these usages of GUP safe somehow so > > that it can reliably write to those types of pages without breaking > > the current FS contract.. > > > > I forget exactly, but IIRC, don't you have to hold some kind of page > > spinlock while writing to the page memory? > > > > I think perhaps you're thinking of the mm->mmap_lock? Which will be held > for the FOLL_GET cases and simply prevent the VMA from disappearing below > us but not do much else. No not mmap_lock, I want to say there is a per-page lock that interacts with the write protect, or at worst this needs to use the page table spinlocks. > I wonder whether we should do this check purely for FOLL_PIN to be honest? > As this indicates medium to long-term access without mmap_lock held. This > would exclude the /proc/$pid/mem and ptrace paths which use gup_remote(). Everything is buggy. FOLL_PIN is part of a someday solution to solve it. > That and a very specific use of uprobes are the only places that use > FOLL_GET in this instance and each of them are careful in any case to > handle setting the dirty page flag. That is actually the bug :) Broadly the bug is to make a page dirty without holding the right locks to actually dirty it. Jason
On Mon, Apr 24, 2023 at 02:36:47PM -0300, Jason Gunthorpe wrote: > On Mon, Apr 24, 2023 at 03:29:57PM +0100, Lorenzo Stoakes wrote: > > On Mon, Apr 24, 2023 at 10:39:25AM -0300, Jason Gunthorpe wrote: > > > On Mon, Apr 24, 2023 at 01:38:49PM +0100, Lorenzo Stoakes wrote: > > > > > > > I was being fairly conservative in that list, though we certainly need to > > > > set the flag for /proc/$pid/mem and ptrace to avoid breaking this > > > > functionality (I observed breakpoints breaking without it which obviously > > > > is a no go :). I'm not sure if there's a more general way we could check > > > > for this though? > > > > > > More broadly we should make sure these usages of GUP safe somehow so > > > that it can reliably write to those types of pages without breaking > > > the current FS contract.. > > > > > > I forget exactly, but IIRC, don't you have to hold some kind of page > > > spinlock while writing to the page memory? > > > > > > > I think perhaps you're thinking of the mm->mmap_lock? Which will be held > > for the FOLL_GET cases and simply prevent the VMA from disappearing below > > us but not do much else. > > No not mmap_lock, I want to say there is a per-page lock that > interacts with the write protect, or at worst this needs to use the > page table spinlocks. > > > I wonder whether we should do this check purely for FOLL_PIN to be honest? > > As this indicates medium to long-term access without mmap_lock held. This > > would exclude the /proc/$pid/mem and ptrace paths which use gup_remote(). > > Everything is buggy. FOLL_PIN is part of a someday solution to solve > it. > > > That and a very specific use of uprobes are the only places that use > > FOLL_GET in this instance and each of them are careful in any case to > > handle setting the dirty page flag. > > That is actually the bug :) Broadly the bug is to make a page dirty > without holding the right locks to actually dirty it. > > Jason OK I guess you mean the folio lock :) Well there is unpin_user_pages_dirty_lock() and unpin_user_page_range_dirty_lock() and also set_page_dirty_lock() (used by __access_remote_vm()) which should avoid this. Also __access_remote_vm() which all the ptrace and /proc/$pid/mem use does set_page_dirty_lock() and only after the user actually writes to it (and with FOLL_FORCE of course). None of these are correctly telling a write notify filesystem about the change, however. We definitely need to keep ptrace and /proc/$pid/mem functioning correctly, and I given the privilege levels required I don't think there's a security issue there? I do think the mkwrite/write notify file system check is the correct one as these are the only ones to whom the page being dirty would matter to. So perhaps we can move forward with:- - Use mkwrite check rather than shmem/hugetlb. - ALWAYS enforce not write notify file system if FOLL_LONGTERM (that removes a lot of the changes here). - If FOLL_FORCE, then allow this to override the check. This is required for /proc/$pid/mem and ptrace and is a privileged operation anyway, so can not cause a security issue. - Add a FOLL_WILL_UNPIN_DIRTY flag to indicate that the caller will actually do so (required for process_vm_access cases). Alternatively we could implement something very cautious and opt-in, like a FOLL_CHECK_SANITY flag? (starting to feel like I need one of those myself :)
On Mon, Apr 24, 2023 at 07:22:03PM +0100, Lorenzo Stoakes wrote: > OK I guess you mean the folio lock :) Well there is > unpin_user_pages_dirty_lock() and unpin_user_page_range_dirty_lock() and > also set_page_dirty_lock() (used by __access_remote_vm()) which should > avoid this. It has been a while, but IIRC, these are all basically racy, the comment in front of set_page_dirty_lock() even says it is racy.. The race is that a FS cleans a page and thinks it cannot become dirty, and then it becomes dirty - and all variations of that.. Looking around a bit, I suppose what I'd expect to see is a sequence sort of like what do_page_mkwrite() does: /* Synchronize with the FS and get the page locked */ ret = vmf->vma->vm_ops->page_mkwrite(vmf); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) return ret; if (unlikely(!(ret & VM_FAULT_LOCKED))) { lock_page(page); if (!page->mapping) { unlock_page(page); return 0; /* retry */ } ret |= VM_FAULT_LOCKED; } else VM_BUG_ON_PAGE(!PageLocked(page), page); /* Write to the page with the CPU */ va = kmap_local_atomic(page); memcpy(va, ....); kunmap_local_atomic(page); /* Tell the FS and unlock it. */ set_page_dirty(page); unlock_page(page); I don't know if this is is exactly right, but it seems closerish So maybe some kind of GUP interfaces that returns single locked pages is the right direction? IDK Or maybe we just need to make a memcpy primitive that works while holding the PTLs? > We definitely need to keep ptrace and /proc/$pid/mem functioning correctly, > and I given the privilege levels required I don't think there's a security > issue there? Even root is not allowed to trigger data corruption or oops inside the kernel. Jason
On Mon, Apr 24, 2023 at 03:54:48PM -0300, Jason Gunthorpe wrote: > On Mon, Apr 24, 2023 at 07:22:03PM +0100, Lorenzo Stoakes wrote: > > > OK I guess you mean the folio lock :) Well there is > > unpin_user_pages_dirty_lock() and unpin_user_page_range_dirty_lock() and > > also set_page_dirty_lock() (used by __access_remote_vm()) which should > > avoid this. > > It has been a while, but IIRC, these are all basically racy, the > comment in front of set_page_dirty_lock() even says it is racy.. > > The race is that a FS cleans a page and thinks it cannot become dirty, > and then it becomes dirty - and all variations of that.. > > Looking around a bit, I suppose what I'd expect to see is a sequence > sort of like what do_page_mkwrite() does: > > /* Synchronize with the FS and get the page locked */ > ret = vmf->vma->vm_ops->page_mkwrite(vmf); > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) > return ret; > if (unlikely(!(ret & VM_FAULT_LOCKED))) { > lock_page(page); > if (!page->mapping) { > unlock_page(page); > return 0; /* retry */ > } > ret |= VM_FAULT_LOCKED; > } else > VM_BUG_ON_PAGE(!PageLocked(page), page); > > /* Write to the page with the CPU */ > va = kmap_local_atomic(page); > memcpy(va, ....); > kunmap_local_atomic(page); > > /* Tell the FS and unlock it. */ > set_page_dirty(page); > unlock_page(page); > > I don't know if this is is exactly right, but it seems closerish > > So maybe some kind of GUP interfaces that returns single locked pages > is the right direction? IDK > > Or maybe we just need to make a memcpy primitive that works while > holding the PTLs? > I think this patch suggestion has scope crept from 'incremental improvement' to 'major rework of GUP' at this point. Also surely you'd want to obtain the PTL of all mappings to a file? This seems really unworkable and I don't think holding a folio lock over a long period is sensible either. > > We definitely need to keep ptrace and /proc/$pid/mem functioning correctly, > > and I given the privilege levels required I don't think there's a security > > issue there? > > Even root is not allowed to trigger data corruption or oops inside the > kernel. > > Jason Of course, but isn't this supposed to be an incremental fix? It feels a little contradictory to want to introduce a flag intentionally to try to highlight brokenness then to not accept any solution that doesn't solve that brokenness. In any case, I feel that this patch isn't going to go anywhere as-is, it's insufficiently large to solve the problem as a whole (I think that's a bigger problem we can return to later), and there appears to be no taste for an incremental improvement, even from the suggester :) As a result, I suggest we take the cautious route in order to unstick the vmas patch series - introduce an OPT-IN flag which allows the check to be made, and update io_uring to use this. That way it defers the larger discussion around this improvement, avoids breaking anything, provides some basis in code for this check and is a net, incremental small and digestible improvement.
On Mon, Apr 24, 2023 at 08:18:33PM +0100, Lorenzo Stoakes wrote: > I think this patch suggestion has scope crept from 'incremental > improvement' to 'major rework of GUP' at this point. I don't really expect to you clean up all the callers - but we are trying to understand what is actually wrong here to come up with the right FOLL_ names and overall strategy. Leave behind a comment, for instance. I don't think anyone has really thought about the ptrace users too much till now, we were all thinking about DMA use cases, it shows we still have some areas that need attention. > Also surely you'd want to obtain the PTL of all mappings to a file? No, just one is fine. If you do the memcpy under a single PTL that points at a writable copy of the page then everything is trivially fine because it is very similar to what the CPU itself would do, which is fine by definition.. Jason
On Mon, Apr 24, 2023 at 07:53:26PM -0300, Jason Gunthorpe wrote: > On Mon, Apr 24, 2023 at 08:18:33PM +0100, Lorenzo Stoakes wrote: > > > I think this patch suggestion has scope crept from 'incremental > > improvement' to 'major rework of GUP' at this point. > > I don't really expect to you clean up all the callers - but we are > trying to understand what is actually wrong here to come up with the > right FOLL_ names and overall strategy. Leave behind a comment, for > instance. > Right, but you are suggesting introducing a whole new GUP interface holding the right locks etc. which is really scope-creeping from the original intent. I'm not disagreeing that we need an interface that can return things in a state where the dirtying can be done correctly, I just don't think _this_ patch series is the place for it. > I don't think anyone has really thought about the ptrace users too > much till now, we were all thinking about DMA use cases, it shows we > still have some areas that need attention. I do like to feel that my recent glut of GUP activity, even if noisy and frustrating, has at least helped give some insights into usage and semantics :) > > > Also surely you'd want to obtain the PTL of all mappings to a file? > > No, just one is fine. If you do the memcpy under a single PTL that > points at a writable copy of the page then everything is trivially > fine because it is very similar to what the CPU itself would do, which > is fine by definition.. > > Jason Except you dirty a page that is mapped elsewhere that thought everything was cleaned and... not sure the PTLs really help you much? Anyway I feel we're digressing into the broader discussion which needs to be had, but not when trying to unstick the vmas series :) I am going to put forward an opt-in variant of this change that explicitly checks whether any VMA in the range requires dirty page tracking, if not failing the GUP operation. This can then form the basis of the opt-OUT variant (it'll be the same check code right?) and help provide a basis for the additional work that clearly needs to be done. It will also replace the open-coded VMA check in io_uring so has utility and justification just from that. If we want to be more adventerous the opt-in variant could default to on for FOLL_LONGTERM too, but that discussion can be had over on that patch series.
On Tue, Apr 25, 2023 at 12:03:34AM +0100, Lorenzo Stoakes wrote: > Except you dirty a page that is mapped elsewhere that thought everything > was cleaned and... not sure the PTLs really help you much? If we have a writable PTE then while the PTE's PTL is held it is impossible for a FS to make the page clean as any cleaning action has to also take the PTL to make the PTE non-present or non-writable. > If we want to be more adventerous the opt-in variant could default to on > for FOLL_LONGTERM too, but that discussion can be had over on that patch > series. I think you should at least do this too to explain why io_uring code is moving into common code.. Jason
On Mon, Apr 24, 2023 at 08:17:11PM -0300, Jason Gunthorpe wrote: > On Tue, Apr 25, 2023 at 12:03:34AM +0100, Lorenzo Stoakes wrote: > > > Except you dirty a page that is mapped elsewhere that thought everything > > was cleaned and... not sure the PTLs really help you much? > > If we have a writable PTE then while the PTE's PTL is held it is impossible > for a FS to make the page clean as any cleaning action has to also > take the PTL to make the PTE non-present or non-writable. > That's a very good point! Passing things back with a spinlock held feels pretty icky though, and obviously a no-go for a FOLL_PIN. Perhaps for a FOLL_GET this would be workable. > > If we want to be more adventerous the opt-in variant could default to on > > for FOLL_LONGTERM too, but that discussion can be had over on that patch > > series. > > I think you should at least do this too to explain why io_uring code > is moving into common code.. > OK, I'll respin this as a v3 of this series then since we'll be defaulting FOLL_LONGTERM at least (for which there seems to be broad consensus), but also permit this flag to be set manually and set it for io_uring. > Jason
On Tue, Apr 25, 2023 at 12:26:25AM +0100, Lorenzo Stoakes wrote: > On Mon, Apr 24, 2023 at 08:17:11PM -0300, Jason Gunthorpe wrote: > > On Tue, Apr 25, 2023 at 12:03:34AM +0100, Lorenzo Stoakes wrote: > > > > > Except you dirty a page that is mapped elsewhere that thought everything > > > was cleaned and... not sure the PTLs really help you much? > > > > If we have a writable PTE then while the PTE's PTL is held it is impossible > > for a FS to make the page clean as any cleaning action has to also > > take the PTL to make the PTE non-present or non-writable. > > > > That's a very good point! Passing things back with a spinlock held feels > pretty icky though, and obviously a no-go for a FOLL_PIN. Perhaps for a > FOLL_GET this would be workable. I didn't look closely at the ptrace code but maybe it would work to lock the folio and pass back a locked folio. Interacting with the PTLs to make the lock reliable. It is the logical inverse of the code I pointed to for inserting a folio into the page table. (but I've never looked at the folio lock or how the FSs use it, so don't belive me on this) Another interesting idea would be to use mm/pagewalk.c to implement the memory copy fully under the PTLs. Jason
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c index f693bc753b6b..b9019dad8008 100644 --- a/drivers/infiniband/hw/qib/qib_user_pages.c +++ b/drivers/infiniband/hw/qib/qib_user_pages.c @@ -110,7 +110,8 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages, for (got = 0; got < num_pages; got += ret) { ret = pin_user_pages(start_page + got * PAGE_SIZE, num_pages - got, - FOLL_LONGTERM | FOLL_WRITE, + FOLL_LONGTERM | FOLL_WRITE | + FOLL_ALLOW_BROKEN_FILE_MAPPING, p + got, NULL); if (ret < 0) { mmap_read_unlock(current->mm); diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c index 2a5cac2658ec..33cf79b248a9 100644 --- a/drivers/infiniband/hw/usnic/usnic_uiom.c +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c @@ -85,7 +85,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable, int dmasync, struct usnic_uiom_reg *uiomr) { struct list_head *chunk_list = &uiomr->chunk_list; - unsigned int gup_flags = FOLL_LONGTERM; + unsigned int gup_flags = FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING; struct page **page_list; struct scatterlist *sg; struct usnic_uiom_chunk *chunk; diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c index f51ab2ccf151..bc3e8c0898e5 100644 --- a/drivers/infiniband/sw/siw/siw_mem.c +++ b/drivers/infiniband/sw/siw/siw_mem.c @@ -368,7 +368,8 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable) struct mm_struct *mm_s; u64 first_page_va; unsigned long mlock_limit; - unsigned int foll_flags = FOLL_LONGTERM; + unsigned int foll_flags = + FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING; int num_pages, num_chunks, i, rv = 0; if (!can_do_mlock()) diff --git a/fs/proc/base.c b/fs/proc/base.c index 96a6a08c8235..3e3f5ea9849f 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -855,7 +855,8 @@ static ssize_t mem_rw(struct file *file, char __user *buf, if (!mmget_not_zero(mm)) goto free; - flags = FOLL_FORCE | (write ? FOLL_WRITE : 0); + flags = FOLL_FORCE | FOLL_ALLOW_BROKEN_FILE_MAPPING | + (write ? FOLL_WRITE : 0); while (count > 0) { size_t this_len = min_t(size_t, count, PAGE_SIZE); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 3fc9e680f174..e76637b4c78f 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1185,6 +1185,14 @@ enum { FOLL_PCI_P2PDMA = 1 << 10, /* allow interrupts from generic signals */ FOLL_INTERRUPTIBLE = 1 << 11, + /* + * By default we disallow write access to known broken file-backed + * memory mappings (i.e. anything other than hugetlb/shmem + * mappings). Some code may rely upon being able to access this + * regardless for legacy reasons, thus we provide a flag to indicate + * this. + */ + FOLL_ALLOW_BROKEN_FILE_MAPPING = 1 << 12, /* See also internal only FOLL flags in mm/internal.h */ }; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 59887c69d54c..ec330d3b0218 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -373,7 +373,8 @@ __update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d) return -EINVAL; ret = get_user_pages_remote(mm, vaddr, 1, - FOLL_WRITE, &page, &vma, NULL); + FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING, + &page, &vma, NULL); if (unlikely(ret <= 0)) { /* * We are asking for 1 page. If get_user_pages_remote() fails, diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 0786450074c1..db5022b21b8e 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -58,7 +58,8 @@ int ptrace_access_vm(struct task_struct *tsk, unsigned long addr, return 0; } - ret = __access_remote_vm(mm, addr, buf, len, gup_flags); + ret = __access_remote_vm(mm, addr, buf, len, + gup_flags | FOLL_ALLOW_BROKEN_FILE_MAPPING); mmput(mm); return ret; diff --git a/mm/gup.c b/mm/gup.c index 1f72a717232b..68d5570c0bae 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -959,16 +959,46 @@ static int faultin_page(struct vm_area_struct *vma, return 0; } +/* + * Writing to file-backed mappings using GUP is a fundamentally broken operation + * as kernel write access to GUP mappings may not adhere to the semantics + * expected by a file system. + * + * In most instances we disallow this broken behaviour, however there are some + * exceptions to this enforced here. + */ +static inline bool can_write_file_mapping(struct vm_area_struct *vma, + unsigned long gup_flags) +{ + struct file *file = vma->vm_file; + + /* If we aren't pinning then no problematic write can occur. */ + if (!(gup_flags & (FOLL_GET | FOLL_PIN))) + return true; + + /* Special mappings should pose no problem. */ + if (!file) + return true; + + /* Has the caller explicitly indicated this case is acceptable? */ + if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING) + return true; + + /* shmem and hugetlb mappings do not have problematic semantics. */ + return vma_is_shmem(vma) || is_file_hugepages(file); +} + static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) { vm_flags_t vm_flags = vma->vm_flags; int write = (gup_flags & FOLL_WRITE); int foreign = (gup_flags & FOLL_REMOTE); + bool vma_anon = vma_is_anonymous(vma); if (vm_flags & (VM_IO | VM_PFNMAP)) return -EFAULT; - if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma)) + if ((gup_flags & FOLL_ANON) && !vma_anon) return -EFAULT; if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma)) @@ -978,6 +1008,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) return -EFAULT; if (write) { + if (!vma_anon && + WARN_ON_ONCE(!can_write_file_mapping(vma, gup_flags))) + return -EFAULT; + if (!(vm_flags & VM_WRITE)) { if (!(gup_flags & FOLL_FORCE)) return -EFAULT; diff --git a/mm/memory.c b/mm/memory.c index 146bb94764f8..e3d535991548 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5683,7 +5683,8 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr, if (!mm) return 0; - ret = __access_remote_vm(mm, addr, buf, len, gup_flags); + ret = __access_remote_vm(mm, addr, buf, len, + gup_flags | FOLL_ALLOW_BROKEN_FILE_MAPPING); mmput(mm); diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c index 78dfaf9e8990..ef126c08e89c 100644 --- a/mm/process_vm_access.c +++ b/mm/process_vm_access.c @@ -81,7 +81,7 @@ static int process_vm_rw_single_vec(unsigned long addr, ssize_t rc = 0; unsigned long max_pages_per_loop = PVM_MAX_KMALLOC_PAGES / sizeof(struct pages *); - unsigned int flags = 0; + unsigned int flags = FOLL_ALLOW_BROKEN_FILE_MAPPING; /* Work out address and page range required */ if (len == 0) diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index 02207e852d79..b93cfcaccb0d 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -93,7 +93,7 @@ void xdp_put_umem(struct xdp_umem *umem, bool defer_cleanup) static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address) { - unsigned int gup_flags = FOLL_WRITE; + unsigned int gup_flags = FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING; long npgs; int err;
It isn't safe to write to file-backed mappings as GUP does not ensure that the semantics associated with such a write are performed correctly, for instance file systems which rely upon write-notify will not be correctly notified. There are exceptions to this - shmem and hugetlb mappings pose no such concern so we do permit this operation in these cases. In addition, if no pinning takes place (neither FOLL_GET nor FOLL_PIN is specified and neither flags gets implicitly set) then no writing can occur so we do not perform the check in this instance. This is an important exception, as populate_vma_page_range() invokes __get_user_pages() in this way (and thus so does __mm_populate(), used by MAP_POPULATE mmap() and mlock() invocations). There are GUP users within the kernel that do nevertheless rely upon this behaviour, so we introduce the FOLL_ALLOW_BROKEN_FILE_MAPPING flag to explicitly permit this kind of GUP access. This is required in order to not break userspace in instances where the uAPI might permit file-mapped addresses - a number of RDMA users require this for instance, as do the process_vm_[read/write]v() system calls, /proc/$pid/mem, ptrace and SDT uprobes. Each of these callers have been updated to use this flag. Making this change is an important step towards a more reliable GUP, and explicitly indicates which callers might encounter issues moving forward. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> --- v2: - Add accidentally excluded ptrace_access_vm() use of FOLL_ALLOW_BROKEN_FILE_MAPPING. - Tweak commit message. v1: https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/ drivers/infiniband/hw/qib/qib_user_pages.c | 3 +- drivers/infiniband/hw/usnic/usnic_uiom.c | 2 +- drivers/infiniband/sw/siw/siw_mem.c | 3 +- fs/proc/base.c | 3 +- include/linux/mm_types.h | 8 +++++ kernel/events/uprobes.c | 3 +- kernel/ptrace.c | 3 +- mm/gup.c | 36 +++++++++++++++++++++- mm/memory.c | 3 +- mm/process_vm_access.c | 2 +- net/xdp/xdp_umem.c | 2 +- 11 files changed, 58 insertions(+), 10 deletions(-) -- 2.40.0