diff mbox series

[v2] mm/gup: disallow GUP writing to file-backed mappings by default

Message ID c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [v2] mm/gup: disallow GUP writing to file-backed mappings by default | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Lorenzo Stoakes April 24, 2023, 7:43 a.m. UTC
It isn't safe to write to file-backed mappings as GUP does not ensure that
the semantics associated with such a write are performed correctly, for
instance file systems which rely upon write-notify will not be correctly
notified.

There are exceptions to this - shmem and hugetlb mappings pose no such
concern so we do permit this operation in these cases.

In addition, if no pinning takes place (neither FOLL_GET nor FOLL_PIN is
specified and neither flags gets implicitly set) then no writing can occur
so we do not perform the check in this instance.

This is an important exception, as populate_vma_page_range() invokes
__get_user_pages() in this way (and thus so does __mm_populate(), used by
MAP_POPULATE mmap() and mlock() invocations).

There are GUP users within the kernel that do nevertheless rely upon this
behaviour, so we introduce the FOLL_ALLOW_BROKEN_FILE_MAPPING flag to
explicitly permit this kind of GUP access.

This is required in order to not break userspace in instances where the
uAPI might permit file-mapped addresses - a number of RDMA users require
this for instance, as do the process_vm_[read/write]v() system calls,
/proc/$pid/mem, ptrace and SDT uprobes. Each of these callers have been
updated to use this flag.

Making this change is an important step towards a more reliable GUP, and
explicitly indicates which callers might encounter issues moving forward.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
v2:
- Add accidentally excluded ptrace_access_vm() use of
  FOLL_ALLOW_BROKEN_FILE_MAPPING.
- Tweak commit message.

v1:
https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/

 drivers/infiniband/hw/qib/qib_user_pages.c |  3 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c   |  2 +-
 drivers/infiniband/sw/siw/siw_mem.c        |  3 +-
 fs/proc/base.c                             |  3 +-
 include/linux/mm_types.h                   |  8 +++++
 kernel/events/uprobes.c                    |  3 +-
 kernel/ptrace.c                            |  3 +-
 mm/gup.c                                   | 36 +++++++++++++++++++++-
 mm/memory.c                                |  3 +-
 mm/process_vm_access.c                     |  2 +-
 net/xdp/xdp_umem.c                         |  2 +-
 11 files changed, 58 insertions(+), 10 deletions(-)

--
2.40.0

Comments

Christoph Hellwig April 24, 2023, 9:43 a.m. UTC | #1
I'm pretty sure DIRECT I/O reads that write into file backed mappings
are out there in the wild.

So while I wish we had never allowed this, the exercise seems futile and
instead we need to work on supporting this usecase, with the FOLL_PIN
infrastructure being a big step toward that.
Lorenzo Stoakes April 24, 2023, 10:17 a.m. UTC | #2
On Mon, Apr 24, 2023 at 02:43:56AM -0700, Christoph Hellwig wrote:
> I'm pretty sure DIRECT I/O reads that write into file backed mappings
> are out there in the wild.
>
> So while I wish we had never allowed this, the exercise seems futile and
> instead we need to work on supporting this usecase, with the FOLL_PIN
> infrastructure being a big step toward that.

It's not entirely futile, there's at least one specific use case, which is
io_uring which is currently open coding an equivalent check themselves. By
introducing this change we prevent them from having to do so and provide a
means by which other callers who implicitly need this do not have to do so
either.

In addition, this change frees up a blocked patch series intending to clean
up GUP which should help open the door to further improvements across the
system.

So I would argue certainly not futile.

In addition, I think it's useful to explicitly document that this is a
broken case and, through use of the flag, highlight places which are
problematic (although perhaps not exhaustively).

I know Jason is keen on fixing this at a fundamental level and this flag is
ultimately his suggestion, so it certainly doesn't stand in the way of this
work moving forward.
Jason Gunthorpe April 24, 2023, 12:28 p.m. UTC | #3
On Mon, Apr 24, 2023 at 11:17:55AM +0100, Lorenzo Stoakes wrote:
> On Mon, Apr 24, 2023 at 02:43:56AM -0700, Christoph Hellwig wrote:
> > I'm pretty sure DIRECT I/O reads that write into file backed mappings
> > are out there in the wild.

I wonder if that is really the case? I know people tried this with
RDMA and it didn't get very far before testing uncovered data
corruption and kernel crashes.. Maybe O_DIRECT has a much smaller race
window so people can get away with it?

> I know Jason is keen on fixing this at a fundamental level and this flag is
> ultimately his suggestion, so it certainly doesn't stand in the way of this
> work moving forward.

Yeah, the point is to close it off, because while we wish it was
fixed properly, it isn't. We are still who knows how far away from it.

In the mean time this is a fairly simple way to oops the kernel,
especially with cases like io_uring and RDMA. So, I view it as a
security problem.

My general dislike was that io_uring protected itself from the
security problem and we left all the rest of the GUP users out to dry.

So, my suggestion was to mark the places where we want to allow this,
eg O_DIRECT, and block everwhere else. Lorenzo, I would significantly
par back the list you have.

I also suggest we force block it at some kernel lockdown level..

Alternatively, perhaps we abuse FOLL_LONGTERM and prevent it from
working with filebacked pages since, I think, the ease of triggering a
bug goes up the longer the pages are pinned.

Jason
Christoph Hellwig April 24, 2023, 12:38 p.m. UTC | #4
On Mon, Apr 24, 2023 at 09:28:07AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 24, 2023 at 11:17:55AM +0100, Lorenzo Stoakes wrote:
> > On Mon, Apr 24, 2023 at 02:43:56AM -0700, Christoph Hellwig wrote:
> > > I'm pretty sure DIRECT I/O reads that write into file backed mappings
> > > are out there in the wild.
> 
> I wonder if that is really the case? I know people tried this with
> RDMA and it didn't get very far before testing uncovered data
> corruption and kernel crashes.. Maybe O_DIRECT has a much smaller race
> window so people can get away with it?

It absolutely causes all kinds of issues even with O_DIRECT.  I'd be
all for trying to disallow it as it simplies a lot of things, but I fear
it's not going to stick.

> So, my suggestion was to mark the places where we want to allow this,
> eg O_DIRECT, and block everwhere else. Lorenzo, I would significantly
> par back the list you have.

I think an opt-in is a good idea no matter how many places end up
needing it.  I'd prefer a less dramatic name and a better explanation
on why it should only be set when needed.

> I also suggest we force block it at some kernel lockdown level..
> 
> Alternatively, perhaps we abuse FOLL_LONGTERM and prevent it from
> working with filebacked pages since, I think, the ease of triggering a
> bug goes up the longer the pages are pinned.

FOLL_LONGTERM on file backed pages is a nightmare.  If you think you
can get away with prohibiting it for RDMA, and KVM doesn't need it
I'd be all for not allowing that at all.
Lorenzo Stoakes April 24, 2023, 12:38 p.m. UTC | #5
On Mon, Apr 24, 2023 at 09:28:07AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 24, 2023 at 11:17:55AM +0100, Lorenzo Stoakes wrote:
> > On Mon, Apr 24, 2023 at 02:43:56AM -0700, Christoph Hellwig wrote:
> > > I'm pretty sure DIRECT I/O reads that write into file backed mappings
> > > are out there in the wild.
>
> I wonder if that is really the case? I know people tried this with
> RDMA and it didn't get very far before testing uncovered data
> corruption and kernel crashes.. Maybe O_DIRECT has a much smaller race
> window so people can get away with it?
>
> > I know Jason is keen on fixing this at a fundamental level and this flag is
> > ultimately his suggestion, so it certainly doesn't stand in the way of this
> > work moving forward.
>
> Yeah, the point is to close it off, because while we wish it was
> fixed properly, it isn't. We are still who knows how far away from it.
>
> In the mean time this is a fairly simple way to oops the kernel,
> especially with cases like io_uring and RDMA. So, I view it as a
> security problem.
>
> My general dislike was that io_uring protected itself from the
> security problem and we left all the rest of the GUP users out to dry.
>
> So, my suggestion was to mark the places where we want to allow this,
> eg O_DIRECT, and block everwhere else. Lorenzo, I would significantly
> par back the list you have.

I was being fairly conservative in that list, though we certainly need to
set the flag for /proc/$pid/mem and ptrace to avoid breaking this
functionality (I observed breakpoints breaking without it which obviously
is a no go :). I'm not sure if there's a more general way we could check
for this though?

A perhaps slightly unpleasant solution might be to not enforce this when
FOLL_FORCE is specified which is mostly a ptrace + friends thing then we
could drop all those exceptions.

I wouldn't be totally opposed to dropping it for RDMA too, because I
suspect accessing file-backed mappings for that is pretty iffy.

Do you have a sense of which in the list you feel could be pared back?

>
> I also suggest we force block it at some kernel lockdown level..
>
> Alternatively, perhaps we abuse FOLL_LONGTERM and prevent it from
> working with filebacked pages since, I think, the ease of triggering a
> bug goes up the longer the pages are pinned.
>

This would solve the io_uring case and it is certainly more of a concern
when the pin is intended to be kept around, though it feels a bit icky as a
non-FOLL_LONGTERM pin could surely be problematic too?

> Jason
Jason Gunthorpe April 24, 2023, 1:26 p.m. UTC | #6
On Mon, Apr 24, 2023 at 05:38:44AM -0700, Christoph Hellwig wrote:

> > I wonder if that is really the case? I know people tried this with
> > RDMA and it didn't get very far before testing uncovered data
> > corruption and kernel crashes.. Maybe O_DIRECT has a much smaller race
> > window so people can get away with it?
> 
> It absolutely causes all kinds of issues even with O_DIRECT.  I'd be
> all for trying to disallow it as it simplies a lot of things, but I fear
> it's not going to stick.

I suggest the kernel lockdown mode again. If people want to do unsafe
things they can boot in a lessor lockdown mode, we've disabled several
kernel features this way already. lockdown integrity sounds
appropriate for this kind of bug.

Maybe we can start to get some data on who is actually using it.

> > Alternatively, perhaps we abuse FOLL_LONGTERM and prevent it from
> > working with filebacked pages since, I think, the ease of triggering a
> > bug goes up the longer the pages are pinned.
> 
> FOLL_LONGTERM on file backed pages is a nightmare.  If you think you
> can get away with prohibiting it for RDMA, and KVM doesn't need it
> I'd be all for not allowing that at all.

Yes, I think we can and should do this.

Jason
Jason Gunthorpe April 24, 2023, 1:39 p.m. UTC | #7
On Mon, Apr 24, 2023 at 01:38:49PM +0100, Lorenzo Stoakes wrote:

> I was being fairly conservative in that list, though we certainly need to
> set the flag for /proc/$pid/mem and ptrace to avoid breaking this
> functionality (I observed breakpoints breaking without it which obviously
> is a no go :). I'm not sure if there's a more general way we could check
> for this though?

More broadly we should make sure these usages of GUP safe somehow so
that it can reliably write to those types of pages without breaking
the current FS contract..

I forget exactly, but IIRC, don't you have to hold some kind of page
spinlock while writing to the page memory?

So, users that do this, or can be fixed to do this, can get file
backed pages. It suggests that a flag name is more like
FOLL_CALLER_USES_FILE_WRITE_LOCKING

> I wouldn't be totally opposed to dropping it for RDMA too, because I
> suspect accessing file-backed mappings for that is pretty iffy.
> 
> Do you have a sense of which in the list you feel could be pared back?

Anything using FOLL_LONGTERM should not set the flag, GUP should even
block the combination.

And we need to have in mind that the flag indicates the code is
buggy, so if you set it then we should understand how is that caller
expected to be fixed.

Jason
Lorenzo Stoakes April 24, 2023, 2:29 p.m. UTC | #8
On Mon, Apr 24, 2023 at 10:39:25AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 24, 2023 at 01:38:49PM +0100, Lorenzo Stoakes wrote:
>
> > I was being fairly conservative in that list, though we certainly need to
> > set the flag for /proc/$pid/mem and ptrace to avoid breaking this
> > functionality (I observed breakpoints breaking without it which obviously
> > is a no go :). I'm not sure if there's a more general way we could check
> > for this though?
>
> More broadly we should make sure these usages of GUP safe somehow so
> that it can reliably write to those types of pages without breaking
> the current FS contract..
>
> I forget exactly, but IIRC, don't you have to hold some kind of page
> spinlock while writing to the page memory?
>

I think perhaps you're thinking of the mm->mmap_lock? Which will be held
for the FOLL_GET cases and simply prevent the VMA from disappearing below
us but not do much else.

> So, users that do this, or can be fixed to do this, can get file
> backed pages. It suggests that a flag name is more like
> FOLL_CALLER_USES_FILE_WRITE_LOCKING
>

As stated above, I'm not sure what locking you're referring to, but seems
to me that FOLL_GET already implies what you're thinking?

I wonder whether we should do this check purely for FOLL_PIN to be honest?
As this indicates medium to long-term access without mmap_lock held. This
would exclude the /proc/$pid/mem and ptrace paths which use gup_remote().

That and a very specific use of uprobes are the only places that use
FOLL_GET in this instance and each of them are careful in any case to
handle setting the dirty page flag.

All PUP cases that do not specify FOLL_LONGTERM also do this, so we could
atually go so far as to reduce the patch to simply performing the
vma_wants_writenotify() check if (FOLL_PIN | FOLL_LONGTERM) is specified,
which covers the io_uring case.

Alternatively if we wanted to be safer, we could add a FOLL_ALLOW_FILE_PIN
that is checked on FOLL_PIN and ignored on FOLL_LONGTERM?

> > I wouldn't be totally opposed to dropping it for RDMA too, because I
> > suspect accessing file-backed mappings for that is pretty iffy.
> >
> > Do you have a sense of which in the list you feel could be pared back?
>
> Anything using FOLL_LONGTERM should not set the flag, GUP should even
> block the combination.

OK

>
> And we need to have in mind that the flag indicates the code is
> buggy, so if you set it then we should understand how is that caller
> expected to be fixed.
>
> Jason

I think we are working towards a much simpler solution in any case!
Jason Gunthorpe April 24, 2023, 5:36 p.m. UTC | #9
On Mon, Apr 24, 2023 at 03:29:57PM +0100, Lorenzo Stoakes wrote:
> On Mon, Apr 24, 2023 at 10:39:25AM -0300, Jason Gunthorpe wrote:
> > On Mon, Apr 24, 2023 at 01:38:49PM +0100, Lorenzo Stoakes wrote:
> >
> > > I was being fairly conservative in that list, though we certainly need to
> > > set the flag for /proc/$pid/mem and ptrace to avoid breaking this
> > > functionality (I observed breakpoints breaking without it which obviously
> > > is a no go :). I'm not sure if there's a more general way we could check
> > > for this though?
> >
> > More broadly we should make sure these usages of GUP safe somehow so
> > that it can reliably write to those types of pages without breaking
> > the current FS contract..
> >
> > I forget exactly, but IIRC, don't you have to hold some kind of page
> > spinlock while writing to the page memory?
> >
> 
> I think perhaps you're thinking of the mm->mmap_lock? Which will be held
> for the FOLL_GET cases and simply prevent the VMA from disappearing below
> us but not do much else.

No not mmap_lock, I want to say there is a per-page lock that
interacts with the write protect, or at worst this needs to use the
page table spinlocks.

> I wonder whether we should do this check purely for FOLL_PIN to be honest?
> As this indicates medium to long-term access without mmap_lock held. This
> would exclude the /proc/$pid/mem and ptrace paths which use gup_remote().

Everything is buggy. FOLL_PIN is part of a someday solution to solve
it.

> That and a very specific use of uprobes are the only places that use
> FOLL_GET in this instance and each of them are careful in any case to
> handle setting the dirty page flag.

That is actually the bug :) Broadly the bug is to make a page dirty
without holding the right locks to actually dirty it.

Jason
Lorenzo Stoakes April 24, 2023, 6:22 p.m. UTC | #10
On Mon, Apr 24, 2023 at 02:36:47PM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 24, 2023 at 03:29:57PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Apr 24, 2023 at 10:39:25AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Apr 24, 2023 at 01:38:49PM +0100, Lorenzo Stoakes wrote:
> > >
> > > > I was being fairly conservative in that list, though we certainly need to
> > > > set the flag for /proc/$pid/mem and ptrace to avoid breaking this
> > > > functionality (I observed breakpoints breaking without it which obviously
> > > > is a no go :). I'm not sure if there's a more general way we could check
> > > > for this though?
> > >
> > > More broadly we should make sure these usages of GUP safe somehow so
> > > that it can reliably write to those types of pages without breaking
> > > the current FS contract..
> > >
> > > I forget exactly, but IIRC, don't you have to hold some kind of page
> > > spinlock while writing to the page memory?
> > >
> >
> > I think perhaps you're thinking of the mm->mmap_lock? Which will be held
> > for the FOLL_GET cases and simply prevent the VMA from disappearing below
> > us but not do much else.
>
> No not mmap_lock, I want to say there is a per-page lock that
> interacts with the write protect, or at worst this needs to use the
> page table spinlocks.
>
> > I wonder whether we should do this check purely for FOLL_PIN to be honest?
> > As this indicates medium to long-term access without mmap_lock held. This
> > would exclude the /proc/$pid/mem and ptrace paths which use gup_remote().
>
> Everything is buggy. FOLL_PIN is part of a someday solution to solve
> it.
>
> > That and a very specific use of uprobes are the only places that use
> > FOLL_GET in this instance and each of them are careful in any case to
> > handle setting the dirty page flag.
>
> That is actually the bug :) Broadly the bug is to make a page dirty
> without holding the right locks to actually dirty it.
>
> Jason

OK I guess you mean the folio lock :) Well there is
unpin_user_pages_dirty_lock() and unpin_user_page_range_dirty_lock() and
also set_page_dirty_lock() (used by __access_remote_vm()) which should
avoid this.

Also __access_remote_vm() which all the ptrace and /proc/$pid/mem use does
set_page_dirty_lock() and only after the user actually writes to it (and
with FOLL_FORCE of course).

None of these are correctly telling a write notify filesystem about the
change, however.

We definitely need to keep ptrace and /proc/$pid/mem functioning correctly,
and I given the privilege levels required I don't think there's a security
issue there?

I do think the mkwrite/write notify file system check is the correct one as
these are the only ones to whom the page being dirty would matter to.

So perhaps we can move forward with:-

- Use mkwrite check rather than shmem/hugetlb.
- ALWAYS enforce not write notify file system if FOLL_LONGTERM (that
  removes a lot of the changes here).
- If FOLL_FORCE, then allow this to override the check. This is required
  for /proc/$pid/mem and ptrace and is a privileged operation anyway, so
  can not cause a security issue.
- Add a FOLL_WILL_UNPIN_DIRTY flag to indicate that the caller will
  actually do so (required for process_vm_access cases).

Alternatively we could implement something very cautious and opt-in, like a
FOLL_CHECK_SANITY flag? (starting to feel like I need one of those myself :)
Jason Gunthorpe April 24, 2023, 6:54 p.m. UTC | #11
On Mon, Apr 24, 2023 at 07:22:03PM +0100, Lorenzo Stoakes wrote:

> OK I guess you mean the folio lock :) Well there is
> unpin_user_pages_dirty_lock() and unpin_user_page_range_dirty_lock() and
> also set_page_dirty_lock() (used by __access_remote_vm()) which should
> avoid this.

It has been a while, but IIRC, these are all basically racy, the
comment in front of set_page_dirty_lock() even says it is racy..

The race is that a FS cleans a page and thinks it cannot become dirty,
and then it becomes dirty - and all variations of that..

Looking around a bit, I suppose what I'd expect to see is a sequence
sort of like what do_page_mkwrite() does:

        /* Synchronize with the FS and get the page locked */
     	ret = vmf->vma->vm_ops->page_mkwrite(vmf);
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
		return ret;
	if (unlikely(!(ret & VM_FAULT_LOCKED))) {
		lock_page(page);
		if (!page->mapping) {
			unlock_page(page);
			return 0; /* retry */
		}
		ret |= VM_FAULT_LOCKED;
	} else
		VM_BUG_ON_PAGE(!PageLocked(page), page);

	/* Write to the page with the CPU */
	va = kmap_local_atomic(page);
	memcpy(va, ....);
	kunmap_local_atomic(page);

	/* Tell the FS and unlock it. */
	set_page_dirty(page);	
	unlock_page(page);

I don't know if this is is exactly right, but it seems closerish

So maybe some kind of GUP interfaces that returns single locked pages
is the right direction? IDK

Or maybe we just need to make a memcpy primitive that works while
holding the PTLs?

> We definitely need to keep ptrace and /proc/$pid/mem functioning correctly,
> and I given the privilege levels required I don't think there's a security
> issue there?

Even root is not allowed to trigger data corruption or oops inside the
kernel.

Jason
Lorenzo Stoakes April 24, 2023, 7:18 p.m. UTC | #12
On Mon, Apr 24, 2023 at 03:54:48PM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 24, 2023 at 07:22:03PM +0100, Lorenzo Stoakes wrote:
>
> > OK I guess you mean the folio lock :) Well there is
> > unpin_user_pages_dirty_lock() and unpin_user_page_range_dirty_lock() and
> > also set_page_dirty_lock() (used by __access_remote_vm()) which should
> > avoid this.
>
> It has been a while, but IIRC, these are all basically racy, the
> comment in front of set_page_dirty_lock() even says it is racy..
>
> The race is that a FS cleans a page and thinks it cannot become dirty,
> and then it becomes dirty - and all variations of that..
>
> Looking around a bit, I suppose what I'd expect to see is a sequence
> sort of like what do_page_mkwrite() does:
>
>         /* Synchronize with the FS and get the page locked */
>      	ret = vmf->vma->vm_ops->page_mkwrite(vmf);
> 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
> 		return ret;
> 	if (unlikely(!(ret & VM_FAULT_LOCKED))) {
> 		lock_page(page);
> 		if (!page->mapping) {
> 			unlock_page(page);
> 			return 0; /* retry */
> 		}
> 		ret |= VM_FAULT_LOCKED;
> 	} else
> 		VM_BUG_ON_PAGE(!PageLocked(page), page);
>
> 	/* Write to the page with the CPU */
> 	va = kmap_local_atomic(page);
> 	memcpy(va, ....);
> 	kunmap_local_atomic(page);
>
> 	/* Tell the FS and unlock it. */
> 	set_page_dirty(page);
> 	unlock_page(page);
>
> I don't know if this is is exactly right, but it seems closerish
>
> So maybe some kind of GUP interfaces that returns single locked pages
> is the right direction? IDK
>
> Or maybe we just need to make a memcpy primitive that works while
> holding the PTLs?
>

I think this patch suggestion has scope crept from 'incremental
improvement' to 'major rework of GUP' at this point. Also surely you'd want
to obtain the PTL of all mappings to a file? This seems really unworkable
and I don't think holding a folio lock over a long period is sensible
either.

> > We definitely need to keep ptrace and /proc/$pid/mem functioning correctly,
> > and I given the privilege levels required I don't think there's a security
> > issue there?
>
> Even root is not allowed to trigger data corruption or oops inside the
> kernel.
>
> Jason

Of course, but isn't this supposed to be an incremental fix? It feels a
little contradictory to want to introduce a flag intentionally to try to
highlight brokenness then to not accept any solution that doesn't solve
that brokenness.

In any case, I feel that this patch isn't going to go anywhere as-is, it's
insufficiently large to solve the problem as a whole (I think that's a
bigger problem we can return to later), and there appears to be no taste
for an incremental improvement, even from the suggester :)

As a result, I suggest we take the cautious route in order to unstick the
vmas patch series - introduce an OPT-IN flag which allows the check to be
made, and update io_uring to use this.

That way it defers the larger discussion around this improvement, avoids
breaking anything, provides some basis in code for this check and is a net,
incremental small and digestible improvement.
Jason Gunthorpe April 24, 2023, 10:53 p.m. UTC | #13
On Mon, Apr 24, 2023 at 08:18:33PM +0100, Lorenzo Stoakes wrote:

> I think this patch suggestion has scope crept from 'incremental
> improvement' to 'major rework of GUP' at this point. 

I don't really expect to you clean up all the callers - but we are
trying to understand what is actually wrong here to come up with the
right FOLL_ names and overall strategy. Leave behind a comment, for
instance.

I don't think anyone has really thought about the ptrace users too
much till now, we were all thinking about DMA use cases, it shows we
still have some areas that need attention.

> Also surely you'd want to obtain the PTL of all mappings to a file?

No, just one is fine. If you do the memcpy under a single PTL that
points at a writable copy of the page then everything is trivially
fine because it is very similar to what the CPU itself would do, which
is fine by definition..

Jason
Lorenzo Stoakes April 24, 2023, 11:03 p.m. UTC | #14
On Mon, Apr 24, 2023 at 07:53:26PM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 24, 2023 at 08:18:33PM +0100, Lorenzo Stoakes wrote:
>
> > I think this patch suggestion has scope crept from 'incremental
> > improvement' to 'major rework of GUP' at this point.
>
> I don't really expect to you clean up all the callers - but we are
> trying to understand what is actually wrong here to come up with the
> right FOLL_ names and overall strategy. Leave behind a comment, for
> instance.
>

Right, but you are suggesting introducing a whole new GUP interface holding
the right locks etc. which is really scope-creeping from the original
intent.

I'm not disagreeing that we need an interface that can return things in a
state where the dirtying can be done correctly, I just don't think _this_
patch series is the place for it.

> I don't think anyone has really thought about the ptrace users too
> much till now, we were all thinking about DMA use cases, it shows we
> still have some areas that need attention.

I do like to feel that my recent glut of GUP activity, even if noisy and
frustrating, has at least helped give some insights into usage and
semantics :)

>
> > Also surely you'd want to obtain the PTL of all mappings to a file?
>
> No, just one is fine. If you do the memcpy under a single PTL that
> points at a writable copy of the page then everything is trivially
> fine because it is very similar to what the CPU itself would do, which
> is fine by definition..
>
> Jason

Except you dirty a page that is mapped elsewhere that thought everything
was cleaned and... not sure the PTLs really help you much?

Anyway I feel we're digressing into the broader discussion which needs to
be had, but not when trying to unstick the vmas series :)

I am going to put forward an opt-in variant of this change that explicitly
checks whether any VMA in the range requires dirty page tracking, if not
failing the GUP operation.

This can then form the basis of the opt-OUT variant (it'll be the same
check code right?) and help provide a basis for the additional work that
clearly needs to be done.

It will also replace the open-coded VMA check in io_uring so has utility
and justification just from that.

If we want to be more adventerous the opt-in variant could default to on
for FOLL_LONGTERM too, but that discussion can be had over on that patch
series.
Jason Gunthorpe April 24, 2023, 11:17 p.m. UTC | #15
On Tue, Apr 25, 2023 at 12:03:34AM +0100, Lorenzo Stoakes wrote:

> Except you dirty a page that is mapped elsewhere that thought everything
> was cleaned and... not sure the PTLs really help you much?

If we have a writable PTE then while the PTE's PTL is held it is impossible
for a FS to make the page clean as any cleaning action has to also
take the PTL to make the PTE non-present or non-writable.

> If we want to be more adventerous the opt-in variant could default to on
> for FOLL_LONGTERM too, but that discussion can be had over on that patch
> series.

I think you should at least do this too to explain why io_uring code
is moving into common code..

Jason
Lorenzo Stoakes April 24, 2023, 11:26 p.m. UTC | #16
On Mon, Apr 24, 2023 at 08:17:11PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 25, 2023 at 12:03:34AM +0100, Lorenzo Stoakes wrote:
>
> > Except you dirty a page that is mapped elsewhere that thought everything
> > was cleaned and... not sure the PTLs really help you much?
>
> If we have a writable PTE then while the PTE's PTL is held it is impossible
> for a FS to make the page clean as any cleaning action has to also
> take the PTL to make the PTE non-present or non-writable.
>

That's a very good point! Passing things back with a spinlock held feels
pretty icky though, and obviously a no-go for a FOLL_PIN. Perhaps for a
FOLL_GET this would be workable.

> > If we want to be more adventerous the opt-in variant could default to on
> > for FOLL_LONGTERM too, but that discussion can be had over on that patch
> > series.
>
> I think you should at least do this too to explain why io_uring code
> is moving into common code..
>

OK, I'll respin this as a v3 of this series then since we'll be defaulting
FOLL_LONGTERM at least (for which there seems to be broad consensus), but
also permit this flag to be set manually and set it for io_uring.

> Jason
Jason Gunthorpe April 24, 2023, 11:30 p.m. UTC | #17
On Tue, Apr 25, 2023 at 12:26:25AM +0100, Lorenzo Stoakes wrote:
> On Mon, Apr 24, 2023 at 08:17:11PM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 25, 2023 at 12:03:34AM +0100, Lorenzo Stoakes wrote:
> >
> > > Except you dirty a page that is mapped elsewhere that thought everything
> > > was cleaned and... not sure the PTLs really help you much?
> >
> > If we have a writable PTE then while the PTE's PTL is held it is impossible
> > for a FS to make the page clean as any cleaning action has to also
> > take the PTL to make the PTE non-present or non-writable.
> >
> 
> That's a very good point! Passing things back with a spinlock held feels
> pretty icky though, and obviously a no-go for a FOLL_PIN. Perhaps for a
> FOLL_GET this would be workable.

I didn't look closely at the ptrace code but maybe it would work to
lock the folio and pass back a locked folio. Interacting with the PTLs
to make the lock reliable. It is the logical inverse of the code I
pointed to for inserting a folio into the page table. (but I've never
looked at the folio lock or how the FSs use it, so don't belive me on
this)

Another interesting idea would be to use mm/pagewalk.c to implement
the memory copy fully under the PTLs.

Jason
diff mbox series

Patch

diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index f693bc753b6b..b9019dad8008 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -110,7 +110,8 @@  int qib_get_user_pages(unsigned long start_page, size_t num_pages,
 	for (got = 0; got < num_pages; got += ret) {
 		ret = pin_user_pages(start_page + got * PAGE_SIZE,
 				     num_pages - got,
-				     FOLL_LONGTERM | FOLL_WRITE,
+				     FOLL_LONGTERM | FOLL_WRITE |
+				     FOLL_ALLOW_BROKEN_FILE_MAPPING,
 				     p + got, NULL);
 		if (ret < 0) {
 			mmap_read_unlock(current->mm);
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 2a5cac2658ec..33cf79b248a9 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -85,7 +85,7 @@  static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 				int dmasync, struct usnic_uiom_reg *uiomr)
 {
 	struct list_head *chunk_list = &uiomr->chunk_list;
-	unsigned int gup_flags = FOLL_LONGTERM;
+	unsigned int gup_flags = FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING;
 	struct page **page_list;
 	struct scatterlist *sg;
 	struct usnic_uiom_chunk *chunk;
diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
index f51ab2ccf151..bc3e8c0898e5 100644
--- a/drivers/infiniband/sw/siw/siw_mem.c
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -368,7 +368,8 @@  struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
 	struct mm_struct *mm_s;
 	u64 first_page_va;
 	unsigned long mlock_limit;
-	unsigned int foll_flags = FOLL_LONGTERM;
+	unsigned int foll_flags =
+		FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING;
 	int num_pages, num_chunks, i, rv = 0;

 	if (!can_do_mlock())
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 96a6a08c8235..3e3f5ea9849f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -855,7 +855,8 @@  static ssize_t mem_rw(struct file *file, char __user *buf,
 	if (!mmget_not_zero(mm))
 		goto free;

-	flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
+	flags = FOLL_FORCE | FOLL_ALLOW_BROKEN_FILE_MAPPING |
+		(write ? FOLL_WRITE : 0);

 	while (count > 0) {
 		size_t this_len = min_t(size_t, count, PAGE_SIZE);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3fc9e680f174..e76637b4c78f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1185,6 +1185,14 @@  enum {
 	FOLL_PCI_P2PDMA = 1 << 10,
 	/* allow interrupts from generic signals */
 	FOLL_INTERRUPTIBLE = 1 << 11,
+	/*
+	 * By default we disallow write access to known broken file-backed
+	 * memory mappings (i.e. anything other than hugetlb/shmem
+	 * mappings). Some code may rely upon being able to access this
+	 * regardless for legacy reasons, thus we provide a flag to indicate
+	 * this.
+	 */
+	FOLL_ALLOW_BROKEN_FILE_MAPPING = 1 << 12,

 	/* See also internal only FOLL flags in mm/internal.h */
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 59887c69d54c..ec330d3b0218 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -373,7 +373,8 @@  __update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
 		return -EINVAL;

 	ret = get_user_pages_remote(mm, vaddr, 1,
-			FOLL_WRITE, &page, &vma, NULL);
+				    FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING,
+				    &page, &vma, NULL);
 	if (unlikely(ret <= 0)) {
 		/*
 		 * We are asking for 1 page. If get_user_pages_remote() fails,
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 0786450074c1..db5022b21b8e 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -58,7 +58,8 @@  int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
 		return 0;
 	}

-	ret = __access_remote_vm(mm, addr, buf, len, gup_flags);
+	ret = __access_remote_vm(mm, addr, buf, len,
+				 gup_flags | FOLL_ALLOW_BROKEN_FILE_MAPPING);
 	mmput(mm);

 	return ret;
diff --git a/mm/gup.c b/mm/gup.c
index 1f72a717232b..68d5570c0bae 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -959,16 +959,46 @@  static int faultin_page(struct vm_area_struct *vma,
 	return 0;
 }

+/*
+ * Writing to file-backed mappings using GUP is a fundamentally broken operation
+ * as kernel write access to GUP mappings may not adhere to the semantics
+ * expected by a file system.
+ *
+ * In most instances we disallow this broken behaviour, however there are some
+ * exceptions to this enforced here.
+ */
+static inline bool can_write_file_mapping(struct vm_area_struct *vma,
+					  unsigned long gup_flags)
+{
+	struct file *file = vma->vm_file;
+
+	/* If we aren't pinning then no problematic write can occur. */
+	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
+		return true;
+
+	/* Special mappings should pose no problem. */
+	if (!file)
+		return true;
+
+	/* Has the caller explicitly indicated this case is acceptable? */
+	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING)
+		return true;
+
+	/* shmem and hugetlb mappings do not have problematic semantics. */
+	return vma_is_shmem(vma) || is_file_hugepages(file);
+}
+
 static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 {
 	vm_flags_t vm_flags = vma->vm_flags;
 	int write = (gup_flags & FOLL_WRITE);
 	int foreign = (gup_flags & FOLL_REMOTE);
+	bool vma_anon = vma_is_anonymous(vma);

 	if (vm_flags & (VM_IO | VM_PFNMAP))
 		return -EFAULT;

-	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
+	if ((gup_flags & FOLL_ANON) && !vma_anon)
 		return -EFAULT;

 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
@@ -978,6 +1008,10 @@  static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 		return -EFAULT;

 	if (write) {
+		if (!vma_anon &&
+		    WARN_ON_ONCE(!can_write_file_mapping(vma, gup_flags)))
+			return -EFAULT;
+
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
 				return -EFAULT;
diff --git a/mm/memory.c b/mm/memory.c
index 146bb94764f8..e3d535991548 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5683,7 +5683,8 @@  int access_process_vm(struct task_struct *tsk, unsigned long addr,
 	if (!mm)
 		return 0;

-	ret = __access_remote_vm(mm, addr, buf, len, gup_flags);
+	ret = __access_remote_vm(mm, addr, buf, len,
+				 gup_flags | FOLL_ALLOW_BROKEN_FILE_MAPPING);

 	mmput(mm);

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 78dfaf9e8990..ef126c08e89c 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -81,7 +81,7 @@  static int process_vm_rw_single_vec(unsigned long addr,
 	ssize_t rc = 0;
 	unsigned long max_pages_per_loop = PVM_MAX_KMALLOC_PAGES
 		/ sizeof(struct pages *);
-	unsigned int flags = 0;
+	unsigned int flags = FOLL_ALLOW_BROKEN_FILE_MAPPING;

 	/* Work out address and page range required */
 	if (len == 0)
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 02207e852d79..b93cfcaccb0d 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -93,7 +93,7 @@  void xdp_put_umem(struct xdp_umem *umem, bool defer_cleanup)

 static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
 {
-	unsigned int gup_flags = FOLL_WRITE;
+	unsigned int gup_flags = FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING;
 	long npgs;
 	int err;