diff mbox series

mm/gup: disallow GUP writing to file-backed mappings by default

Message ID f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com (mailing list archive)
State New
Headers show
Series mm/gup: disallow GUP writing to file-backed mappings by default | expand

Commit Message

Lorenzo Stoakes April 22, 2023, 1:37 p.m. UTC
It isn't safe to write to file-backed mappings as GUP does not ensure that
the semantics associated with such a write are performed correctly, for
instance filesystems which rely upon write-notify will not be correctly
notified.

There are exceptions to this - shmem and hugetlb mappings are (in effect)
anonymous mappings by other names so we do permit this operation in these
cases.

In addition, if no pinning takes place (neither FOLL_GET nor FOLL_PIN is
specified and neither flags gets implicitly set) then no writing can occur
so we do not perform the check in this instance.

This is an important exception, as populate_vma_page_range() invokes
__get_user_pages() in this way (and thus so does __mm_populate(), used by
MAP_POPULATE mmap() and mlock() invocations).

There are GUP users within the kernel that do nevertheless rely upon this
behaviour, so we introduce the FOLL_ALLOW_BROKEN_FILE_MAPPING flag to
explicitly permit this kind of GUP access.

This is required in order to not break userspace in instances where the
uAPI might permit file-mapped addresses - a number of RDMA users require
this for instance, as do the process_vm_[read/write]v() system calls,
/proc/$pid/mem, ptrace and SDT uprobes. Each of these callers have been
updated to use this flag.

Making this change is an important step towards a more reliable GUP, and
explicitly indicates which callers might encouter issues moving forward.

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 drivers/infiniband/hw/qib/qib_user_pages.c |  3 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c   |  2 +-
 drivers/infiniband/sw/siw/siw_mem.c        |  3 +-
 fs/proc/base.c                             |  3 +-
 include/linux/mm_types.h                   |  8 +++++
 kernel/events/uprobes.c                    |  3 +-
 mm/gup.c                                   | 36 +++++++++++++++++++++-
 mm/memory.c                                |  3 +-
 mm/process_vm_access.c                     |  2 +-
 net/xdp/xdp_umem.c                         |  2 +-
 10 files changed, 56 insertions(+), 9 deletions(-)

Comments

Hillf Danton April 23, 2023, 12:46 a.m. UTC | #1
On 22 Apr 2023 14:37:05 +0100 Lorenzo Stoakes <lstoakes@gmail.com>
> @@ -110,7 +110,8 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
>  	for (got = 0; got < num_pages; got += ret) {
>  		ret = pin_user_pages(start_page + got * PAGE_SIZE,
>  				     num_pages - got,
> -				     FOLL_LONGTERM | FOLL_WRITE,
> +				     FOLL_LONGTERM | FOLL_WRITE |
> +				     FOLL_ALLOW_BROKEN_FILE_MAPPING,
>  				     p + got, NULL);
>  		if (ret < 0) {
>  			mmap_read_unlock(current->mm);

[...]

> +/*
> + * Writing to file-backed mappings using GUP is a fundamentally broken operation
> + * as kernel write access to GUP mappings may not adhere to the semantics
> + * expected by a file system.
> + *
> + * In most instances we disallow this broken behaviour, however there are some
> + * exceptions to this enforced here.
> + */
> +static inline bool can_write_file_mapping(struct vm_area_struct *vma,
> +					  unsigned long gup_flags)
> +{
> +	struct file *file = vma->vm_file;
> +
> +	/* If we aren't pinning then no problematic write can occur. */
> +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> +		return true;
> +
> +	/* Special mappings should pose no problem. */
> +	if (!file)
> +		return true;
> +
> +	/* Has the caller explicitly indicated this case is acceptable? */
> +	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING)
> +		return true;

Does the caller mean that PUP for DMA is no longer breaking writeback?
> +
> +	/* shmem and hugetlb mappings do not have problematic semantics. */
> +	return vma_is_shmem(vma) || is_file_hugepages(file);
> +}
Simon Horman April 23, 2023, 7:32 p.m. UTC | #2
On Sat, Apr 22, 2023 at 02:37:05PM +0100, Lorenzo Stoakes wrote:
> It isn't safe to write to file-backed mappings as GUP does not ensure that
> the semantics associated with such a write are performed correctly, for
> instance filesystems which rely upon write-notify will not be correctly
> notified.
> 
> There are exceptions to this - shmem and hugetlb mappings are (in effect)
> anonymous mappings by other names so we do permit this operation in these
> cases.
> 
> In addition, if no pinning takes place (neither FOLL_GET nor FOLL_PIN is
> specified and neither flags gets implicitly set) then no writing can occur
> so we do not perform the check in this instance.
> 
> This is an important exception, as populate_vma_page_range() invokes
> __get_user_pages() in this way (and thus so does __mm_populate(), used by
> MAP_POPULATE mmap() and mlock() invocations).
> 
> There are GUP users within the kernel that do nevertheless rely upon this
> behaviour, so we introduce the FOLL_ALLOW_BROKEN_FILE_MAPPING flag to
> explicitly permit this kind of GUP access.
> 
> This is required in order to not break userspace in instances where the
> uAPI might permit file-mapped addresses - a number of RDMA users require
> this for instance, as do the process_vm_[read/write]v() system calls,
> /proc/$pid/mem, ptrace and SDT uprobes. Each of these callers have been
> updated to use this flag.
> 
> Making this change is an important step towards a more reliable GUP, and
> explicitly indicates which callers might encouter issues moving forward.

nit: s/encouter/encounter/
Lorenzo Stoakes April 23, 2023, 7:40 p.m. UTC | #3
On Sun, Apr 23, 2023 at 09:32:01PM +0200, Simon Horman wrote:
> On Sat, Apr 22, 2023 at 02:37:05PM +0100, Lorenzo Stoakes wrote:
> > It isn't safe to write to file-backed mappings as GUP does not ensure that
> > the semantics associated with such a write are performed correctly, for
> > instance filesystems which rely upon write-notify will not be correctly
> > notified.
> >
> > There are exceptions to this - shmem and hugetlb mappings are (in effect)
> > anonymous mappings by other names so we do permit this operation in these
> > cases.
> >
> > In addition, if no pinning takes place (neither FOLL_GET nor FOLL_PIN is
> > specified and neither flags gets implicitly set) then no writing can occur
> > so we do not perform the check in this instance.
> >
> > This is an important exception, as populate_vma_page_range() invokes
> > __get_user_pages() in this way (and thus so does __mm_populate(), used by
> > MAP_POPULATE mmap() and mlock() invocations).
> >
> > There are GUP users within the kernel that do nevertheless rely upon this
> > behaviour, so we introduce the FOLL_ALLOW_BROKEN_FILE_MAPPING flag to
> > explicitly permit this kind of GUP access.
> >
> > This is required in order to not break userspace in instances where the
> > uAPI might permit file-mapped addresses - a number of RDMA users require
> > this for instance, as do the process_vm_[read/write]v() system calls,
> > /proc/$pid/mem, ptrace and SDT uprobes. Each of these callers have been
> > updated to use this flag.
> >
> > Making this change is an important step towards a more reliable GUP, and
> > explicitly indicates which callers might encouter issues moving forward.
>
> nit: s/encouter/encounter/
>

Ack, I always seem to leave at least one or two easter egg spelling
mistakes in :)

Will fix up on next respin (in unlikely event of no further review,
hopefully Andrew would pick up!)
Simon Horman April 23, 2023, 8:02 p.m. UTC | #4
On Sun, Apr 23, 2023 at 08:40:53PM +0100, Lorenzo Stoakes wrote:
> On Sun, Apr 23, 2023 at 09:32:01PM +0200, Simon Horman wrote:
> > On Sat, Apr 22, 2023 at 02:37:05PM +0100, Lorenzo Stoakes wrote:
> > > It isn't safe to write to file-backed mappings as GUP does not ensure that
> > > the semantics associated with such a write are performed correctly, for
> > > instance filesystems which rely upon write-notify will not be correctly
> > > notified.
> > >
> > > There are exceptions to this - shmem and hugetlb mappings are (in effect)
> > > anonymous mappings by other names so we do permit this operation in these
> > > cases.
> > >
> > > In addition, if no pinning takes place (neither FOLL_GET nor FOLL_PIN is
> > > specified and neither flags gets implicitly set) then no writing can occur
> > > so we do not perform the check in this instance.
> > >
> > > This is an important exception, as populate_vma_page_range() invokes
> > > __get_user_pages() in this way (and thus so does __mm_populate(), used by
> > > MAP_POPULATE mmap() and mlock() invocations).
> > >
> > > There are GUP users within the kernel that do nevertheless rely upon this
> > > behaviour, so we introduce the FOLL_ALLOW_BROKEN_FILE_MAPPING flag to
> > > explicitly permit this kind of GUP access.
> > >
> > > This is required in order to not break userspace in instances where the
> > > uAPI might permit file-mapped addresses - a number of RDMA users require
> > > this for instance, as do the process_vm_[read/write]v() system calls,
> > > /proc/$pid/mem, ptrace and SDT uprobes. Each of these callers have been
> > > updated to use this flag.
> > >
> > > Making this change is an important step towards a more reliable GUP, and
> > > explicitly indicates which callers might encouter issues moving forward.
> >
> > nit: s/encouter/encounter/
> >
> 
> Ack, I always seem to leave at least one or two easter egg spelling
> mistakes in :)

:)

> Will fix up on next respin (in unlikely event of no further review,
> hopefully Andrew would pick up!)
>
Dave Chinner April 23, 2023, 10:29 p.m. UTC | #5
On Sat, Apr 22, 2023 at 02:37:05PM +0100, Lorenzo Stoakes wrote:
> +/*
> + * Writing to file-backed mappings using GUP is a fundamentally broken operation
> + * as kernel write access to GUP mappings may not adhere to the semantics
> + * expected by a file system.
> + *
> + * In most instances we disallow this broken behaviour, however there are some
> + * exceptions to this enforced here.
> + */
> +static inline bool can_write_file_mapping(struct vm_area_struct *vma,
> +					  unsigned long gup_flags)
> +{
> +	struct file *file = vma->vm_file;
> +
> +	/* If we aren't pinning then no problematic write can occur. */
> +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> +		return true;
> +
> +	/* Special mappings should pose no problem. */
> +	if (!file)
> +		return true;

Ok...

> +
> +	/* Has the caller explicitly indicated this case is acceptable? */
> +	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING)
> +		return true;
> +
> +	/* shmem and hugetlb mappings do not have problematic semantics. */
> +	return vma_is_shmem(vma) || is_file_hugepages(file);
> +}

This looks backwards. We only want the override to occur when the
target won't otherwise allow it. i.e.  This should be:

	if (vma_is_shmem(vma))
		return true;
	if (is_file_hugepages(vma)
		return true;

	/*
	 * Issue a warning only if we are allowing a write to a mapping
	 * that does not support what we are attempting to do functionality.
	 */
	if (WARN_ON_ONCE(gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING))
		return true;
	return false;

i.e. we only want the warning to fire when the override is
triggered - indicating that the caller is actually using a file
mapping in a broken way, not when it is being used on
file/filesystem that actually supports file mappings in this way.

>  static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>  {
>  	vm_flags_t vm_flags = vma->vm_flags;
>  	int write = (gup_flags & FOLL_WRITE);
>  	int foreign = (gup_flags & FOLL_REMOTE);
> +	bool vma_anon = vma_is_anonymous(vma);
>  
>  	if (vm_flags & (VM_IO | VM_PFNMAP))
>  		return -EFAULT;
>  
> -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
> +	if ((gup_flags & FOLL_ANON) && !vma_anon)
>  		return -EFAULT;
>  
>  	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> @@ -978,6 +1008,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>  		return -EFAULT;
>  
>  	if (write) {
> +		if (!vma_anon &&
> +		    WARN_ON_ONCE(!can_write_file_mapping(vma, gup_flags)))
> +			return -EFAULT;

Yeah, the warning definitely belongs in the check function when the
override triggers allow broken behaviour to proceed, not when we
disallow a write fault because the underlying file/filesystem does
not support the operation being attempted.

-Dave.
Lorenzo Stoakes April 23, 2023, 10:56 p.m. UTC | #6
On Mon, Apr 24, 2023 at 08:29:41AM +1000, Dave Chinner wrote:
> On Sat, Apr 22, 2023 at 02:37:05PM +0100, Lorenzo Stoakes wrote:
> > +/*
> > + * Writing to file-backed mappings using GUP is a fundamentally broken operation
> > + * as kernel write access to GUP mappings may not adhere to the semantics
> > + * expected by a file system.
> > + *
> > + * In most instances we disallow this broken behaviour, however there are some
> > + * exceptions to this enforced here.
> > + */
> > +static inline bool can_write_file_mapping(struct vm_area_struct *vma,
> > +					  unsigned long gup_flags)
> > +{
> > +	struct file *file = vma->vm_file;
> > +
> > +	/* If we aren't pinning then no problematic write can occur. */
> > +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> > +		return true;
> > +
> > +	/* Special mappings should pose no problem. */
> > +	if (!file)
> > +		return true;
>
> Ok...
>
> > +
> > +	/* Has the caller explicitly indicated this case is acceptable? */
> > +	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING)
> > +		return true;
> > +
> > +	/* shmem and hugetlb mappings do not have problematic semantics. */
> > +	return vma_is_shmem(vma) || is_file_hugepages(file);
> > +}
>
> This looks backwards. We only want the override to occur when the
> target won't otherwise allow it. i.e.  This should be:
>
> 	if (vma_is_shmem(vma))
> 		return true;
> 	if (is_file_hugepages(vma)
> 		return true;
>
> 	/*
> 	 * Issue a warning only if we are allowing a write to a mapping
> 	 * that does not support what we are attempting to do functionality.
> 	 */
> 	if (WARN_ON_ONCE(gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING))
> 		return true;
> 	return false;
>
> i.e. we only want the warning to fire when the override is
> triggered - indicating that the caller is actually using a file
> mapping in a broken way, not when it is being used on
> file/filesystem that actually supports file mappings in this way.
>
> >  static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> >  {
> >  	vm_flags_t vm_flags = vma->vm_flags;
> >  	int write = (gup_flags & FOLL_WRITE);
> >  	int foreign = (gup_flags & FOLL_REMOTE);
> > +	bool vma_anon = vma_is_anonymous(vma);
> >
> >  	if (vm_flags & (VM_IO | VM_PFNMAP))
> >  		return -EFAULT;
> >
> > -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
> > +	if ((gup_flags & FOLL_ANON) && !vma_anon)
> >  		return -EFAULT;
> >
> >  	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> > @@ -978,6 +1008,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> >  		return -EFAULT;
> >
> >  	if (write) {
> > +		if (!vma_anon &&
> > +		    WARN_ON_ONCE(!can_write_file_mapping(vma, gup_flags)))
> > +			return -EFAULT;
>
> Yeah, the warning definitely belongs in the check function when the
> override triggers allow broken behaviour to proceed, not when we
> disallow a write fault because the underlying file/filesystem does
> not support the operation being attempted.

I disagree for two reasons:-

1. There are places in the kernel that rely on this broken behaviour, most
   notably ptrace (and /proc/$pid/mem), but also the other places where you
   can see I've added this flag. I'm not sure spamming warnings for
   ordinary cases would be useful.

2. The purpose of putting a warning here is to catch any case I might have
   missed where broken behaviour is required, but now disallowed, because it
   might actually be hard for a GUP user to track down that this is why the
   GUP is no longer functioning (since all they'll see is an -EFAULT).

This warned upon check should in reality not occur, because it implies the
GUP user is trying to do something broken and is _not_ explicitly telling
GUP that it knows it's doing it and can live with the consequences. And on
that basis, is worthy of a warning so we know we have to go put this flag
in that place (and know it is a source of problematic GUP usage), or fix
the caller.

An example case is placing breakpoints in gdb, without the flag being set
for /proc/$pid/mem this will just fail. Raising a kernel warning when a
user places a breakpoint seems... unhelpful :)

>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
Mika Penttilä April 24, 2023, 3:41 a.m. UTC | #7
Hi,


On 22.4.2023 16.37, Lorenzo Stoakes wrote:
> It isn't safe to write to file-backed mappings as GUP does not ensure that
> the semantics associated with such a write are performed correctly, for
> instance filesystems which rely upon write-notify will not be correctly
> notified.
> 
> There are exceptions to this - shmem and hugetlb mappings are (in effect)
> anonymous mappings by other names so we do permit this operation in these
> cases.
> 
> In addition, if no pinning takes place (neither FOLL_GET nor FOLL_PIN is
> specified and neither flags gets implicitly set) then no writing can occur
> so we do not perform the check in this instance.
> 
> This is an important exception, as populate_vma_page_range() invokes
> __get_user_pages() in this way (and thus so does __mm_populate(), used by
> MAP_POPULATE mmap() and mlock() invocations).
> 
> There are GUP users within the kernel that do nevertheless rely upon this
> behaviour, so we introduce the FOLL_ALLOW_BROKEN_FILE_MAPPING flag to
> explicitly permit this kind of GUP access.
> 
> This is required in order to not break userspace in instances where the
> uAPI might permit file-mapped addresses - a number of RDMA users require
> this for instance, as do the process_vm_[read/write]v() system calls,
> /proc/$pid/mem, ptrace and SDT uprobes. Each of these callers have been
> updated to use this flag.
> 
> Making this change is an important step towards a more reliable GUP, and
> explicitly indicates which callers might encouter issues moving forward.
> 
> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> ---
>   drivers/infiniband/hw/qib/qib_user_pages.c |  3 +-
>   drivers/infiniband/hw/usnic/usnic_uiom.c   |  2 +-
>   drivers/infiniband/sw/siw/siw_mem.c        |  3 +-
>   fs/proc/base.c                             |  3 +-
>   include/linux/mm_types.h                   |  8 +++++
>   kernel/events/uprobes.c                    |  3 +-
>   mm/gup.c                                   | 36 +++++++++++++++++++++-
>   mm/memory.c                                |  3 +-
>   mm/process_vm_access.c                     |  2 +-
>   net/xdp/xdp_umem.c                         |  2 +-
>   10 files changed, 56 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
> index f693bc753b6b..b9019dad8008 100644
> --- a/drivers/infiniband/hw/qib/qib_user_pages.c
> +++ b/drivers/infiniband/hw/qib/qib_user_pages.c
> @@ -110,7 +110,8 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
>   	for (got = 0; got < num_pages; got += ret) {
>   		ret = pin_user_pages(start_page + got * PAGE_SIZE,
>   				     num_pages - got,
> -				     FOLL_LONGTERM | FOLL_WRITE,
> +				     FOLL_LONGTERM | FOLL_WRITE |
> +				     FOLL_ALLOW_BROKEN_FILE_MAPPING,
>   				     p + got, NULL);
>   		if (ret < 0) {
>   			mmap_read_unlock(current->mm);
> diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
> index 2a5cac2658ec..33cf79b248a9 100644
> --- a/drivers/infiniband/hw/usnic/usnic_uiom.c
> +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
> @@ -85,7 +85,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
>   				int dmasync, struct usnic_uiom_reg *uiomr)
>   {
>   	struct list_head *chunk_list = &uiomr->chunk_list;
> -	unsigned int gup_flags = FOLL_LONGTERM;
> +	unsigned int gup_flags = FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING;
>   	struct page **page_list;
>   	struct scatterlist *sg;
>   	struct usnic_uiom_chunk *chunk;
> diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
> index f51ab2ccf151..bc3e8c0898e5 100644
> --- a/drivers/infiniband/sw/siw/siw_mem.c
> +++ b/drivers/infiniband/sw/siw/siw_mem.c
> @@ -368,7 +368,8 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
>   	struct mm_struct *mm_s;
>   	u64 first_page_va;
>   	unsigned long mlock_limit;
> -	unsigned int foll_flags = FOLL_LONGTERM;
> +	unsigned int foll_flags =
> +		FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING;
>   	int num_pages, num_chunks, i, rv = 0;
>   
>   	if (!can_do_mlock())
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 96a6a08c8235..3e3f5ea9849f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -855,7 +855,8 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
>   	if (!mmget_not_zero(mm))
>   		goto free;
>   
> -	flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
> +	flags = FOLL_FORCE | FOLL_ALLOW_BROKEN_FILE_MAPPING |
> +		(write ? FOLL_WRITE : 0);
>   
>   	while (count > 0) {
>   		size_t this_len = min_t(size_t, count, PAGE_SIZE);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3fc9e680f174..e76637b4c78f 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1185,6 +1185,14 @@ enum {
>   	FOLL_PCI_P2PDMA = 1 << 10,
>   	/* allow interrupts from generic signals */
>   	FOLL_INTERRUPTIBLE = 1 << 11,
> +	/*
> +	 * By default we disallow write access to known broken file-backed
> +	 * memory mappings (i.e. anything other than hugetlb/shmem
> +	 * mappings). Some code may rely upon being able to access this
> +	 * regardless for legacy reasons, thus we provide a flag to indicate
> +	 * this.
> +	 */
> +	FOLL_ALLOW_BROKEN_FILE_MAPPING = 1 << 12,
>   
>   	/* See also internal only FOLL flags in mm/internal.h */
>   };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 59887c69d54c..ec330d3b0218 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -373,7 +373,8 @@ __update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
>   		return -EINVAL;
>   
>   	ret = get_user_pages_remote(mm, vaddr, 1,
> -			FOLL_WRITE, &page, &vma, NULL);
> +				    FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING,
> +				    &page, &vma, NULL);
>   	if (unlikely(ret <= 0)) {
>   		/*
>   		 * We are asking for 1 page. If get_user_pages_remote() fails,
> diff --git a/mm/gup.c b/mm/gup.c
> index 1f72a717232b..68d5570c0bae 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -959,16 +959,46 @@ static int faultin_page(struct vm_area_struct *vma,
>   	return 0;
>   }
>   
> +/*
> + * Writing to file-backed mappings using GUP is a fundamentally broken operation
> + * as kernel write access to GUP mappings may not adhere to the semantics
> + * expected by a file system.
> + *
> + * In most instances we disallow this broken behaviour, however there are some
> + * exceptions to this enforced here.
> + */
> +static inline bool can_write_file_mapping(struct vm_area_struct *vma,
> +					  unsigned long gup_flags)
> +{
> +	struct file *file = vma->vm_file;
> +
> +	/* If we aren't pinning then no problematic write can occur. */
> +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> +		return true;
> +
> +	/* Special mappings should pose no problem. */
> +	if (!file)
> +		return true;
> +
> +	/* Has the caller explicitly indicated this case is acceptable? */
> +	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING)
> +		return true;
> +
> +	/* shmem and hugetlb mappings do not have problematic semantics. */
> +	return vma_is_shmem(vma) || is_file_hugepages(file);
> +}
> +
>   static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>   {
>   	vm_flags_t vm_flags = vma->vm_flags;
>   	int write = (gup_flags & FOLL_WRITE);
>   	int foreign = (gup_flags & FOLL_REMOTE);
> +	bool vma_anon = vma_is_anonymous(vma);
>   
>   	if (vm_flags & (VM_IO | VM_PFNMAP))
>   		return -EFAULT;
>   
> -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
> +	if ((gup_flags & FOLL_ANON) && !vma_anon)
>   		return -EFAULT;
>   
>   	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> @@ -978,6 +1008,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>   		return -EFAULT;
>   
>   	if (write) {
> +		if (!vma_anon &&
> +		    WARN_ON_ONCE(!can_write_file_mapping(vma, gup_flags)))
> +			return -EFAULT;
> +
>   		if (!(vm_flags & VM_WRITE)) {
>   			if (!(gup_flags & FOLL_FORCE))
>   				return -EFAULT;
> diff --git a/mm/memory.c b/mm/memory.c
> index 146bb94764f8..e3d535991548 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5683,7 +5683,8 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr,
>   	if (!mm)
>   		return 0;
>   
> -	ret = __access_remote_vm(mm, addr, buf, len, gup_flags);
> +	ret = __access_remote_vm(mm, addr, buf, len,
> +				 gup_flags | FOLL_ALLOW_BROKEN_FILE_MAPPING);
>   
>   	mmput(mm);
>   
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 78dfaf9e8990..ef126c08e89c 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -81,7 +81,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
>   	ssize_t rc = 0;
>   	unsigned long max_pages_per_loop = PVM_MAX_KMALLOC_PAGES
>   		/ sizeof(struct pages *);
> -	unsigned int flags = 0;
> +	unsigned int flags = FOLL_ALLOW_BROKEN_FILE_MAPPING;
>   
>   	/* Work out address and page range required */
>   	if (len == 0)
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 02207e852d79..b93cfcaccb0d 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -93,7 +93,7 @@ void xdp_put_umem(struct xdp_umem *umem, bool defer_cleanup)
>   
>   static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
>   {
> -	unsigned int gup_flags = FOLL_WRITE;
> +	unsigned int gup_flags = FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING;
>   	long npgs;
>   	int err;
>   

Not sure about this in general, but seemss at least ptrace 
(ptrace_access_vm()) seems to be broken here..


--Mika
Lorenzo Stoakes April 24, 2023, 6:51 a.m. UTC | #8
On Mon, Apr 24, 2023 at 06:41:38AM +0300, Mika Penttilä wrote:
>
> Hi,
>
>
> On 22.4.2023 16.37, Lorenzo Stoakes wrote:
> > It isn't safe to write to file-backed mappings as GUP does not ensure that
> > the semantics associated with such a write are performed correctly, for
> > instance filesystems which rely upon write-notify will not be correctly
> > notified.
> >
> > There are exceptions to this - shmem and hugetlb mappings are (in effect)
> > anonymous mappings by other names so we do permit this operation in these
> > cases.
> >
> > In addition, if no pinning takes place (neither FOLL_GET nor FOLL_PIN is
> > specified and neither flags gets implicitly set) then no writing can occur
> > so we do not perform the check in this instance.
> >
> > This is an important exception, as populate_vma_page_range() invokes
> > __get_user_pages() in this way (and thus so does __mm_populate(), used by
> > MAP_POPULATE mmap() and mlock() invocations).
> >
> > There are GUP users within the kernel that do nevertheless rely upon this
> > behaviour, so we introduce the FOLL_ALLOW_BROKEN_FILE_MAPPING flag to
> > explicitly permit this kind of GUP access.
> >
> > This is required in order to not break userspace in instances where the
> > uAPI might permit file-mapped addresses - a number of RDMA users require
> > this for instance, as do the process_vm_[read/write]v() system calls,
> > /proc/$pid/mem, ptrace and SDT uprobes. Each of these callers have been
> > updated to use this flag.
> >
> > Making this change is an important step towards a more reliable GUP, and
> > explicitly indicates which callers might encouter issues moving forward.
> >
> > Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> > ---
> >   drivers/infiniband/hw/qib/qib_user_pages.c |  3 +-
> >   drivers/infiniband/hw/usnic/usnic_uiom.c   |  2 +-
> >   drivers/infiniband/sw/siw/siw_mem.c        |  3 +-
> >   fs/proc/base.c                             |  3 +-
> >   include/linux/mm_types.h                   |  8 +++++
> >   kernel/events/uprobes.c                    |  3 +-
> >   mm/gup.c                                   | 36 +++++++++++++++++++++-
> >   mm/memory.c                                |  3 +-
> >   mm/process_vm_access.c                     |  2 +-
> >   net/xdp/xdp_umem.c                         |  2 +-
> >   10 files changed, 56 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
> > index f693bc753b6b..b9019dad8008 100644
> > --- a/drivers/infiniband/hw/qib/qib_user_pages.c
> > +++ b/drivers/infiniband/hw/qib/qib_user_pages.c
> > @@ -110,7 +110,8 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
> >   	for (got = 0; got < num_pages; got += ret) {
> >   		ret = pin_user_pages(start_page + got * PAGE_SIZE,
> >   				     num_pages - got,
> > -				     FOLL_LONGTERM | FOLL_WRITE,
> > +				     FOLL_LONGTERM | FOLL_WRITE |
> > +				     FOLL_ALLOW_BROKEN_FILE_MAPPING,
> >   				     p + got, NULL);
> >   		if (ret < 0) {
> >   			mmap_read_unlock(current->mm);
> > diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
> > index 2a5cac2658ec..33cf79b248a9 100644
> > --- a/drivers/infiniband/hw/usnic/usnic_uiom.c
> > +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
> > @@ -85,7 +85,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
> >   				int dmasync, struct usnic_uiom_reg *uiomr)
> >   {
> >   	struct list_head *chunk_list = &uiomr->chunk_list;
> > -	unsigned int gup_flags = FOLL_LONGTERM;
> > +	unsigned int gup_flags = FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING;
> >   	struct page **page_list;
> >   	struct scatterlist *sg;
> >   	struct usnic_uiom_chunk *chunk;
> > diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
> > index f51ab2ccf151..bc3e8c0898e5 100644
> > --- a/drivers/infiniband/sw/siw/siw_mem.c
> > +++ b/drivers/infiniband/sw/siw/siw_mem.c
> > @@ -368,7 +368,8 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
> >   	struct mm_struct *mm_s;
> >   	u64 first_page_va;
> >   	unsigned long mlock_limit;
> > -	unsigned int foll_flags = FOLL_LONGTERM;
> > +	unsigned int foll_flags =
> > +		FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING;
> >   	int num_pages, num_chunks, i, rv = 0;
> >   	if (!can_do_mlock())
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 96a6a08c8235..3e3f5ea9849f 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -855,7 +855,8 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
> >   	if (!mmget_not_zero(mm))
> >   		goto free;
> > -	flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
> > +	flags = FOLL_FORCE | FOLL_ALLOW_BROKEN_FILE_MAPPING |
> > +		(write ? FOLL_WRITE : 0);
> >   	while (count > 0) {
> >   		size_t this_len = min_t(size_t, count, PAGE_SIZE);
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 3fc9e680f174..e76637b4c78f 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -1185,6 +1185,14 @@ enum {
> >   	FOLL_PCI_P2PDMA = 1 << 10,
> >   	/* allow interrupts from generic signals */
> >   	FOLL_INTERRUPTIBLE = 1 << 11,
> > +	/*
> > +	 * By default we disallow write access to known broken file-backed
> > +	 * memory mappings (i.e. anything other than hugetlb/shmem
> > +	 * mappings). Some code may rely upon being able to access this
> > +	 * regardless for legacy reasons, thus we provide a flag to indicate
> > +	 * this.
> > +	 */
> > +	FOLL_ALLOW_BROKEN_FILE_MAPPING = 1 << 12,
> >   	/* See also internal only FOLL flags in mm/internal.h */
> >   };
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 59887c69d54c..ec330d3b0218 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -373,7 +373,8 @@ __update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
> >   		return -EINVAL;
> >   	ret = get_user_pages_remote(mm, vaddr, 1,
> > -			FOLL_WRITE, &page, &vma, NULL);
> > +				    FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING,
> > +				    &page, &vma, NULL);
> >   	if (unlikely(ret <= 0)) {
> >   		/*
> >   		 * We are asking for 1 page. If get_user_pages_remote() fails,
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 1f72a717232b..68d5570c0bae 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -959,16 +959,46 @@ static int faultin_page(struct vm_area_struct *vma,
> >   	return 0;
> >   }
> > +/*
> > + * Writing to file-backed mappings using GUP is a fundamentally broken operation
> > + * as kernel write access to GUP mappings may not adhere to the semantics
> > + * expected by a file system.
> > + *
> > + * In most instances we disallow this broken behaviour, however there are some
> > + * exceptions to this enforced here.
> > + */
> > +static inline bool can_write_file_mapping(struct vm_area_struct *vma,
> > +					  unsigned long gup_flags)
> > +{
> > +	struct file *file = vma->vm_file;
> > +
> > +	/* If we aren't pinning then no problematic write can occur. */
> > +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> > +		return true;
> > +
> > +	/* Special mappings should pose no problem. */
> > +	if (!file)
> > +		return true;
> > +
> > +	/* Has the caller explicitly indicated this case is acceptable? */
> > +	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING)
> > +		return true;
> > +
> > +	/* shmem and hugetlb mappings do not have problematic semantics. */
> > +	return vma_is_shmem(vma) || is_file_hugepages(file);
> > +}
> > +
> >   static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> >   {
> >   	vm_flags_t vm_flags = vma->vm_flags;
> >   	int write = (gup_flags & FOLL_WRITE);
> >   	int foreign = (gup_flags & FOLL_REMOTE);
> > +	bool vma_anon = vma_is_anonymous(vma);
> >   	if (vm_flags & (VM_IO | VM_PFNMAP))
> >   		return -EFAULT;
> > -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
> > +	if ((gup_flags & FOLL_ANON) && !vma_anon)
> >   		return -EFAULT;
> >   	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> > @@ -978,6 +1008,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> >   		return -EFAULT;
> >   	if (write) {
> > +		if (!vma_anon &&
> > +		    WARN_ON_ONCE(!can_write_file_mapping(vma, gup_flags)))
> > +			return -EFAULT;
> > +
> >   		if (!(vm_flags & VM_WRITE)) {
> >   			if (!(gup_flags & FOLL_FORCE))
> >   				return -EFAULT;
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 146bb94764f8..e3d535991548 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5683,7 +5683,8 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr,
> >   	if (!mm)
> >   		return 0;
> > -	ret = __access_remote_vm(mm, addr, buf, len, gup_flags);
> > +	ret = __access_remote_vm(mm, addr, buf, len,
> > +				 gup_flags | FOLL_ALLOW_BROKEN_FILE_MAPPING);
> >   	mmput(mm);
> > diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> > index 78dfaf9e8990..ef126c08e89c 100644
> > --- a/mm/process_vm_access.c
> > +++ b/mm/process_vm_access.c
> > @@ -81,7 +81,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
> >   	ssize_t rc = 0;
> >   	unsigned long max_pages_per_loop = PVM_MAX_KMALLOC_PAGES
> >   		/ sizeof(struct pages *);
> > -	unsigned int flags = 0;
> > +	unsigned int flags = FOLL_ALLOW_BROKEN_FILE_MAPPING;
> >   	/* Work out address and page range required */
> >   	if (len == 0)
> > diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> > index 02207e852d79..b93cfcaccb0d 100644
> > --- a/net/xdp/xdp_umem.c
> > +++ b/net/xdp/xdp_umem.c
> > @@ -93,7 +93,7 @@ void xdp_put_umem(struct xdp_umem *umem, bool defer_cleanup)
> >   static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
> >   {
> > -	unsigned int gup_flags = FOLL_WRITE;
> > +	unsigned int gup_flags = FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING;
> >   	long npgs;
> >   	int err;
>
> Not sure about this in general, but seemss at least ptrace
> (ptrace_access_vm()) seems to be broken here..

Ah thanks, that was an oversight as it uses __access_remote_vm() rather
than access_process_vm(). I had carefully examined both (and all other GUP
callers) but in supplying the flag for the latter I in typically squeezy
brained human fashion forgot to also do so for the former, will respin
accordingly.

>
>
> --Mika
>
>
Kirill A. Shutemov April 24, 2023, 12:02 p.m. UTC | #9
On Sat, Apr 22, 2023 at 02:37:05PM +0100, Lorenzo Stoakes wrote:
> @@ -959,16 +959,46 @@ static int faultin_page(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +/*
> + * Writing to file-backed mappings using GUP is a fundamentally broken operation
> + * as kernel write access to GUP mappings may not adhere to the semantics
> + * expected by a file system.
> + *
> + * In most instances we disallow this broken behaviour, however there are some
> + * exceptions to this enforced here.
> + */
> +static inline bool can_write_file_mapping(struct vm_area_struct *vma,
> +					  unsigned long gup_flags)
> +{
> +	struct file *file = vma->vm_file;
> +
> +	/* If we aren't pinning then no problematic write can occur. */
> +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> +		return true;
> +
> +	/* Special mappings should pose no problem. */
> +	if (!file)
> +		return true;
> +
> +	/* Has the caller explicitly indicated this case is acceptable? */
> +	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING)
> +		return true;
> +
> +	/* shmem and hugetlb mappings do not have problematic semantics. */
> +	return vma_is_shmem(vma) || is_file_hugepages(file);

Can this be generalized to any fs that doesn't have vm_ops->page_mkwrite()?
Lorenzo Stoakes April 24, 2023, 12:31 p.m. UTC | #10
On Mon, Apr 24, 2023 at 03:02:47PM +0300, Kirill A. Shutemov wrote:
> On Sat, Apr 22, 2023 at 02:37:05PM +0100, Lorenzo Stoakes wrote:
> > @@ -959,16 +959,46 @@ static int faultin_page(struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >
> > +/*
> > + * Writing to file-backed mappings using GUP is a fundamentally broken operation
> > + * as kernel write access to GUP mappings may not adhere to the semantics
> > + * expected by a file system.
> > + *
> > + * In most instances we disallow this broken behaviour, however there are some
> > + * exceptions to this enforced here.
> > + */
> > +static inline bool can_write_file_mapping(struct vm_area_struct *vma,
> > +					  unsigned long gup_flags)
> > +{
> > +	struct file *file = vma->vm_file;
> > +
> > +	/* If we aren't pinning then no problematic write can occur. */
> > +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> > +		return true;
> > +
> > +	/* Special mappings should pose no problem. */
> > +	if (!file)
> > +		return true;
> > +
> > +	/* Has the caller explicitly indicated this case is acceptable? */
> > +	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING)
> > +		return true;
> > +
> > +	/* shmem and hugetlb mappings do not have problematic semantics. */
> > +	return vma_is_shmem(vma) || is_file_hugepages(file);
>
> Can this be generalized to any fs that doesn't have vm_ops->page_mkwrite()?
>

Something more general would be preferable, however I believe there were
concerns broader than write notify, for instance not correctly marking the
folio dirty after writing to it, though arguably the caller should
certainly be ensuring that (and in many cases, do).

Jason will have more of a sense of this I think!

> --
>   Kiryl Shutsemau / Kirill A. Shutemov
Kirill A. Shutemov April 24, 2023, 1:40 p.m. UTC | #11
On Mon, Apr 24, 2023 at 01:31:56PM +0100, Lorenzo Stoakes wrote:
> On Mon, Apr 24, 2023 at 03:02:47PM +0300, Kirill A. Shutemov wrote:
> > On Sat, Apr 22, 2023 at 02:37:05PM +0100, Lorenzo Stoakes wrote:
> > > @@ -959,16 +959,46 @@ static int faultin_page(struct vm_area_struct *vma,
> > >  	return 0;
> > >  }
> > >
> > > +/*
> > > + * Writing to file-backed mappings using GUP is a fundamentally broken operation
> > > + * as kernel write access to GUP mappings may not adhere to the semantics
> > > + * expected by a file system.
> > > + *
> > > + * In most instances we disallow this broken behaviour, however there are some
> > > + * exceptions to this enforced here.
> > > + */
> > > +static inline bool can_write_file_mapping(struct vm_area_struct *vma,
> > > +					  unsigned long gup_flags)
> > > +{
> > > +	struct file *file = vma->vm_file;
> > > +
> > > +	/* If we aren't pinning then no problematic write can occur. */
> > > +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> > > +		return true;
> > > +
> > > +	/* Special mappings should pose no problem. */
> > > +	if (!file)
> > > +		return true;
> > > +
> > > +	/* Has the caller explicitly indicated this case is acceptable? */
> > > +	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING)
> > > +		return true;
> > > +
> > > +	/* shmem and hugetlb mappings do not have problematic semantics. */
> > > +	return vma_is_shmem(vma) || is_file_hugepages(file);
> >
> > Can this be generalized to any fs that doesn't have vm_ops->page_mkwrite()?
> >
> 
> Something more general would be preferable, however I believe there were
> concerns broader than write notify, for instance not correctly marking the
> folio dirty after writing to it, though arguably the caller should
> certainly be ensuring that (and in many cases, do).

It doesn't make much sense to me.

Shared writable mapping without page_mkwrite (or pfn_write) will setup
writeable PTE even on read faults[1], so you will not get the page dirty,
unless you scan page table entries for dirty bit.

[1] See logic around vm_page_prot vs. vma_wants_writenotify().
Jan Kara April 24, 2023, 4:09 p.m. UTC | #12
On Sat 22-04-23 14:37:05, Lorenzo Stoakes wrote:
> It isn't safe to write to file-backed mappings as GUP does not ensure that
> the semantics associated with such a write are performed correctly, for
> instance filesystems which rely upon write-notify will not be correctly
> notified.

I agree that this is currently subtly broken. But we also know from the bug
reports there are definitely users of this functionality out in the wild.
After all that was the reason why John Hubbard (added to CC) added
FOLL_PIN functionality to the kernel - so that filesystems can recognize
the pages may be modified behind their back and act accordingly.

So I'm maybe missing a bigger picture why we would like a change like this.
Because we still need to teach filesystems to handle pinned pages for the
usecases you don't forbid and for which we know there are users and then
what's the benefit of this patch?

								Honza

> There are exceptions to this - shmem and hugetlb mappings are (in effect)
> anonymous mappings by other names so we do permit this operation in these
> cases.
> 
> In addition, if no pinning takes place (neither FOLL_GET nor FOLL_PIN is
> specified and neither flags gets implicitly set) then no writing can occur
> so we do not perform the check in this instance.
> 
> This is an important exception, as populate_vma_page_range() invokes
> __get_user_pages() in this way (and thus so does __mm_populate(), used by
> MAP_POPULATE mmap() and mlock() invocations).
> 
> There are GUP users within the kernel that do nevertheless rely upon this
> behaviour, so we introduce the FOLL_ALLOW_BROKEN_FILE_MAPPING flag to
> explicitly permit this kind of GUP access.
> 
> This is required in order to not break userspace in instances where the
> uAPI might permit file-mapped addresses - a number of RDMA users require
> this for instance, as do the process_vm_[read/write]v() system calls,
> /proc/$pid/mem, ptrace and SDT uprobes. Each of these callers have been
> updated to use this flag.
> 
> Making this change is an important step towards a more reliable GUP, and
> explicitly indicates which callers might encouter issues moving forward.
> 
> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> ---
>  drivers/infiniband/hw/qib/qib_user_pages.c |  3 +-
>  drivers/infiniband/hw/usnic/usnic_uiom.c   |  2 +-
>  drivers/infiniband/sw/siw/siw_mem.c        |  3 +-
>  fs/proc/base.c                             |  3 +-
>  include/linux/mm_types.h                   |  8 +++++
>  kernel/events/uprobes.c                    |  3 +-
>  mm/gup.c                                   | 36 +++++++++++++++++++++-
>  mm/memory.c                                |  3 +-
>  mm/process_vm_access.c                     |  2 +-
>  net/xdp/xdp_umem.c                         |  2 +-
>  10 files changed, 56 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
> index f693bc753b6b..b9019dad8008 100644
> --- a/drivers/infiniband/hw/qib/qib_user_pages.c
> +++ b/drivers/infiniband/hw/qib/qib_user_pages.c
> @@ -110,7 +110,8 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
>  	for (got = 0; got < num_pages; got += ret) {
>  		ret = pin_user_pages(start_page + got * PAGE_SIZE,
>  				     num_pages - got,
> -				     FOLL_LONGTERM | FOLL_WRITE,
> +				     FOLL_LONGTERM | FOLL_WRITE |
> +				     FOLL_ALLOW_BROKEN_FILE_MAPPING,
>  				     p + got, NULL);
>  		if (ret < 0) {
>  			mmap_read_unlock(current->mm);
> diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
> index 2a5cac2658ec..33cf79b248a9 100644
> --- a/drivers/infiniband/hw/usnic/usnic_uiom.c
> +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
> @@ -85,7 +85,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
>  				int dmasync, struct usnic_uiom_reg *uiomr)
>  {
>  	struct list_head *chunk_list = &uiomr->chunk_list;
> -	unsigned int gup_flags = FOLL_LONGTERM;
> +	unsigned int gup_flags = FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING;
>  	struct page **page_list;
>  	struct scatterlist *sg;
>  	struct usnic_uiom_chunk *chunk;
> diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
> index f51ab2ccf151..bc3e8c0898e5 100644
> --- a/drivers/infiniband/sw/siw/siw_mem.c
> +++ b/drivers/infiniband/sw/siw/siw_mem.c
> @@ -368,7 +368,8 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
>  	struct mm_struct *mm_s;
>  	u64 first_page_va;
>  	unsigned long mlock_limit;
> -	unsigned int foll_flags = FOLL_LONGTERM;
> +	unsigned int foll_flags =
> +		FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING;
>  	int num_pages, num_chunks, i, rv = 0;
>  
>  	if (!can_do_mlock())
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 96a6a08c8235..3e3f5ea9849f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -855,7 +855,8 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
>  	if (!mmget_not_zero(mm))
>  		goto free;
>  
> -	flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
> +	flags = FOLL_FORCE | FOLL_ALLOW_BROKEN_FILE_MAPPING |
> +		(write ? FOLL_WRITE : 0);
>  
>  	while (count > 0) {
>  		size_t this_len = min_t(size_t, count, PAGE_SIZE);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3fc9e680f174..e76637b4c78f 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1185,6 +1185,14 @@ enum {
>  	FOLL_PCI_P2PDMA = 1 << 10,
>  	/* allow interrupts from generic signals */
>  	FOLL_INTERRUPTIBLE = 1 << 11,
> +	/*
> +	 * By default we disallow write access to known broken file-backed
> +	 * memory mappings (i.e. anything other than hugetlb/shmem
> +	 * mappings). Some code may rely upon being able to access this
> +	 * regardless for legacy reasons, thus we provide a flag to indicate
> +	 * this.
> +	 */
> +	FOLL_ALLOW_BROKEN_FILE_MAPPING = 1 << 12,
>  
>  	/* See also internal only FOLL flags in mm/internal.h */
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 59887c69d54c..ec330d3b0218 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -373,7 +373,8 @@ __update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
>  		return -EINVAL;
>  
>  	ret = get_user_pages_remote(mm, vaddr, 1,
> -			FOLL_WRITE, &page, &vma, NULL);
> +				    FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING,
> +				    &page, &vma, NULL);
>  	if (unlikely(ret <= 0)) {
>  		/*
>  		 * We are asking for 1 page. If get_user_pages_remote() fails,
> diff --git a/mm/gup.c b/mm/gup.c
> index 1f72a717232b..68d5570c0bae 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -959,16 +959,46 @@ static int faultin_page(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +/*
> + * Writing to file-backed mappings using GUP is a fundamentally broken operation
> + * as kernel write access to GUP mappings may not adhere to the semantics
> + * expected by a file system.
> + *
> + * In most instances we disallow this broken behaviour, however there are some
> + * exceptions to this enforced here.
> + */
> +static inline bool can_write_file_mapping(struct vm_area_struct *vma,
> +					  unsigned long gup_flags)
> +{
> +	struct file *file = vma->vm_file;
> +
> +	/* If we aren't pinning then no problematic write can occur. */
> +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> +		return true;
> +
> +	/* Special mappings should pose no problem. */
> +	if (!file)
> +		return true;
> +
> +	/* Has the caller explicitly indicated this case is acceptable? */
> +	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING)
> +		return true;
> +
> +	/* shmem and hugetlb mappings do not have problematic semantics. */
> +	return vma_is_shmem(vma) || is_file_hugepages(file);
> +}
> +
>  static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>  {
>  	vm_flags_t vm_flags = vma->vm_flags;
>  	int write = (gup_flags & FOLL_WRITE);
>  	int foreign = (gup_flags & FOLL_REMOTE);
> +	bool vma_anon = vma_is_anonymous(vma);
>  
>  	if (vm_flags & (VM_IO | VM_PFNMAP))
>  		return -EFAULT;
>  
> -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
> +	if ((gup_flags & FOLL_ANON) && !vma_anon)
>  		return -EFAULT;
>  
>  	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> @@ -978,6 +1008,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
>  		return -EFAULT;
>  
>  	if (write) {
> +		if (!vma_anon &&
> +		    WARN_ON_ONCE(!can_write_file_mapping(vma, gup_flags)))
> +			return -EFAULT;
> +
>  		if (!(vm_flags & VM_WRITE)) {
>  			if (!(gup_flags & FOLL_FORCE))
>  				return -EFAULT;
> diff --git a/mm/memory.c b/mm/memory.c
> index 146bb94764f8..e3d535991548 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5683,7 +5683,8 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr,
>  	if (!mm)
>  		return 0;
>  
> -	ret = __access_remote_vm(mm, addr, buf, len, gup_flags);
> +	ret = __access_remote_vm(mm, addr, buf, len,
> +				 gup_flags | FOLL_ALLOW_BROKEN_FILE_MAPPING);
>  
>  	mmput(mm);
>  
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 78dfaf9e8990..ef126c08e89c 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -81,7 +81,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
>  	ssize_t rc = 0;
>  	unsigned long max_pages_per_loop = PVM_MAX_KMALLOC_PAGES
>  		/ sizeof(struct pages *);
> -	unsigned int flags = 0;
> +	unsigned int flags = FOLL_ALLOW_BROKEN_FILE_MAPPING;
>  
>  	/* Work out address and page range required */
>  	if (len == 0)
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 02207e852d79..b93cfcaccb0d 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -93,7 +93,7 @@ void xdp_put_umem(struct xdp_umem *umem, bool defer_cleanup)
>  
>  static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
>  {
> -	unsigned int gup_flags = FOLL_WRITE;
> +	unsigned int gup_flags = FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING;
>  	long npgs;
>  	int err;
>  
> -- 
> 2.40.0
>
Lorenzo Stoakes April 24, 2023, 4:22 p.m. UTC | #13
On Mon, Apr 24, 2023 at 06:09:52PM +0200, Jan Kara wrote:
> On Sat 22-04-23 14:37:05, Lorenzo Stoakes wrote:
> > It isn't safe to write to file-backed mappings as GUP does not ensure that
> > the semantics associated with such a write are performed correctly, for
> > instance filesystems which rely upon write-notify will not be correctly
> > notified.
>
> I agree that this is currently subtly broken. But we also know from the bug
> reports there are definitely users of this functionality out in the wild.
> After all that was the reason why John Hubbard (added to CC) added
> FOLL_PIN functionality to the kernel - so that filesystems can recognize
> the pages may be modified behind their back and act accordingly.
>
> So I'm maybe missing a bigger picture why we would like a change like this.
> Because we still need to teach filesystems to handle pinned pages for the
> usecases you don't forbid and for which we know there are users and then
> what's the benefit of this patch?
>
> 								Honza

The concern is that it is more than subtly broken, at least in some
instances. The fact that there are use cases is reflected in the effort to
provide a flag, the purpose is to explicitly disallow broken behaviour in
all but those we know need it.

A motivating case is io_uring which performs its own equivalent check
precisely because GUP does not, which is clearly a broken state of affairs.

While folio_maybe_dma_pinned() definitely helps avoid some broken
situations, it certainly does not bring this pinned pages in line with
file semantics.

The current state of the discussion is leaning towards simply disallowing
this for FOLL_WRITE | FOLL_PIN | FOLL_LONGTERM file-backed mappings for
file systems with write notify, which really under no circumstances should
be occurring.

I will respin a v3 shortly which will be much simplified along these lines.

>
> > There are exceptions to this - shmem and hugetlb mappings are (in effect)
> > anonymous mappings by other names so we do permit this operation in these
> > cases.
> >
> > In addition, if no pinning takes place (neither FOLL_GET nor FOLL_PIN is
> > specified and neither flags gets implicitly set) then no writing can occur
> > so we do not perform the check in this instance.
> >
> > This is an important exception, as populate_vma_page_range() invokes
> > __get_user_pages() in this way (and thus so does __mm_populate(), used by
> > MAP_POPULATE mmap() and mlock() invocations).
> >
> > There are GUP users within the kernel that do nevertheless rely upon this
> > behaviour, so we introduce the FOLL_ALLOW_BROKEN_FILE_MAPPING flag to
> > explicitly permit this kind of GUP access.
> >
> > This is required in order to not break userspace in instances where the
> > uAPI might permit file-mapped addresses - a number of RDMA users require
> > this for instance, as do the process_vm_[read/write]v() system calls,
> > /proc/$pid/mem, ptrace and SDT uprobes. Each of these callers have been
> > updated to use this flag.
> >
> > Making this change is an important step towards a more reliable GUP, and
> > explicitly indicates which callers might encouter issues moving forward.
> >
> > Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> > Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> > ---
> >  drivers/infiniband/hw/qib/qib_user_pages.c |  3 +-
> >  drivers/infiniband/hw/usnic/usnic_uiom.c   |  2 +-
> >  drivers/infiniband/sw/siw/siw_mem.c        |  3 +-
> >  fs/proc/base.c                             |  3 +-
> >  include/linux/mm_types.h                   |  8 +++++
> >  kernel/events/uprobes.c                    |  3 +-
> >  mm/gup.c                                   | 36 +++++++++++++++++++++-
> >  mm/memory.c                                |  3 +-
> >  mm/process_vm_access.c                     |  2 +-
> >  net/xdp/xdp_umem.c                         |  2 +-
> >  10 files changed, 56 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
> > index f693bc753b6b..b9019dad8008 100644
> > --- a/drivers/infiniband/hw/qib/qib_user_pages.c
> > +++ b/drivers/infiniband/hw/qib/qib_user_pages.c
> > @@ -110,7 +110,8 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
> >  	for (got = 0; got < num_pages; got += ret) {
> >  		ret = pin_user_pages(start_page + got * PAGE_SIZE,
> >  				     num_pages - got,
> > -				     FOLL_LONGTERM | FOLL_WRITE,
> > +				     FOLL_LONGTERM | FOLL_WRITE |
> > +				     FOLL_ALLOW_BROKEN_FILE_MAPPING,
> >  				     p + got, NULL);
> >  		if (ret < 0) {
> >  			mmap_read_unlock(current->mm);
> > diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
> > index 2a5cac2658ec..33cf79b248a9 100644
> > --- a/drivers/infiniband/hw/usnic/usnic_uiom.c
> > +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
> > @@ -85,7 +85,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
> >  				int dmasync, struct usnic_uiom_reg *uiomr)
> >  {
> >  	struct list_head *chunk_list = &uiomr->chunk_list;
> > -	unsigned int gup_flags = FOLL_LONGTERM;
> > +	unsigned int gup_flags = FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING;
> >  	struct page **page_list;
> >  	struct scatterlist *sg;
> >  	struct usnic_uiom_chunk *chunk;
> > diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
> > index f51ab2ccf151..bc3e8c0898e5 100644
> > --- a/drivers/infiniband/sw/siw/siw_mem.c
> > +++ b/drivers/infiniband/sw/siw/siw_mem.c
> > @@ -368,7 +368,8 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
> >  	struct mm_struct *mm_s;
> >  	u64 first_page_va;
> >  	unsigned long mlock_limit;
> > -	unsigned int foll_flags = FOLL_LONGTERM;
> > +	unsigned int foll_flags =
> > +		FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING;
> >  	int num_pages, num_chunks, i, rv = 0;
> >
> >  	if (!can_do_mlock())
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 96a6a08c8235..3e3f5ea9849f 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -855,7 +855,8 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
> >  	if (!mmget_not_zero(mm))
> >  		goto free;
> >
> > -	flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
> > +	flags = FOLL_FORCE | FOLL_ALLOW_BROKEN_FILE_MAPPING |
> > +		(write ? FOLL_WRITE : 0);
> >
> >  	while (count > 0) {
> >  		size_t this_len = min_t(size_t, count, PAGE_SIZE);
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 3fc9e680f174..e76637b4c78f 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -1185,6 +1185,14 @@ enum {
> >  	FOLL_PCI_P2PDMA = 1 << 10,
> >  	/* allow interrupts from generic signals */
> >  	FOLL_INTERRUPTIBLE = 1 << 11,
> > +	/*
> > +	 * By default we disallow write access to known broken file-backed
> > +	 * memory mappings (i.e. anything other than hugetlb/shmem
> > +	 * mappings). Some code may rely upon being able to access this
> > +	 * regardless for legacy reasons, thus we provide a flag to indicate
> > +	 * this.
> > +	 */
> > +	FOLL_ALLOW_BROKEN_FILE_MAPPING = 1 << 12,
> >
> >  	/* See also internal only FOLL flags in mm/internal.h */
> >  };
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 59887c69d54c..ec330d3b0218 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -373,7 +373,8 @@ __update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
> >  		return -EINVAL;
> >
> >  	ret = get_user_pages_remote(mm, vaddr, 1,
> > -			FOLL_WRITE, &page, &vma, NULL);
> > +				    FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING,
> > +				    &page, &vma, NULL);
> >  	if (unlikely(ret <= 0)) {
> >  		/*
> >  		 * We are asking for 1 page. If get_user_pages_remote() fails,
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 1f72a717232b..68d5570c0bae 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -959,16 +959,46 @@ static int faultin_page(struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >
> > +/*
> > + * Writing to file-backed mappings using GUP is a fundamentally broken operation
> > + * as kernel write access to GUP mappings may not adhere to the semantics
> > + * expected by a file system.
> > + *
> > + * In most instances we disallow this broken behaviour, however there are some
> > + * exceptions to this enforced here.
> > + */
> > +static inline bool can_write_file_mapping(struct vm_area_struct *vma,
> > +					  unsigned long gup_flags)
> > +{
> > +	struct file *file = vma->vm_file;
> > +
> > +	/* If we aren't pinning then no problematic write can occur. */
> > +	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
> > +		return true;
> > +
> > +	/* Special mappings should pose no problem. */
> > +	if (!file)
> > +		return true;
> > +
> > +	/* Has the caller explicitly indicated this case is acceptable? */
> > +	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING)
> > +		return true;
> > +
> > +	/* shmem and hugetlb mappings do not have problematic semantics. */
> > +	return vma_is_shmem(vma) || is_file_hugepages(file);
> > +}
> > +
> >  static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> >  {
> >  	vm_flags_t vm_flags = vma->vm_flags;
> >  	int write = (gup_flags & FOLL_WRITE);
> >  	int foreign = (gup_flags & FOLL_REMOTE);
> > +	bool vma_anon = vma_is_anonymous(vma);
> >
> >  	if (vm_flags & (VM_IO | VM_PFNMAP))
> >  		return -EFAULT;
> >
> > -	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
> > +	if ((gup_flags & FOLL_ANON) && !vma_anon)
> >  		return -EFAULT;
> >
> >  	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> > @@ -978,6 +1008,10 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> >  		return -EFAULT;
> >
> >  	if (write) {
> > +		if (!vma_anon &&
> > +		    WARN_ON_ONCE(!can_write_file_mapping(vma, gup_flags)))
> > +			return -EFAULT;
> > +
> >  		if (!(vm_flags & VM_WRITE)) {
> >  			if (!(gup_flags & FOLL_FORCE))
> >  				return -EFAULT;
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 146bb94764f8..e3d535991548 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5683,7 +5683,8 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr,
> >  	if (!mm)
> >  		return 0;
> >
> > -	ret = __access_remote_vm(mm, addr, buf, len, gup_flags);
> > +	ret = __access_remote_vm(mm, addr, buf, len,
> > +				 gup_flags | FOLL_ALLOW_BROKEN_FILE_MAPPING);
> >
> >  	mmput(mm);
> >
> > diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> > index 78dfaf9e8990..ef126c08e89c 100644
> > --- a/mm/process_vm_access.c
> > +++ b/mm/process_vm_access.c
> > @@ -81,7 +81,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
> >  	ssize_t rc = 0;
> >  	unsigned long max_pages_per_loop = PVM_MAX_KMALLOC_PAGES
> >  		/ sizeof(struct pages *);
> > -	unsigned int flags = 0;
> > +	unsigned int flags = FOLL_ALLOW_BROKEN_FILE_MAPPING;
> >
> >  	/* Work out address and page range required */
> >  	if (len == 0)
> > diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> > index 02207e852d79..b93cfcaccb0d 100644
> > --- a/net/xdp/xdp_umem.c
> > +++ b/net/xdp/xdp_umem.c
> > @@ -93,7 +93,7 @@ void xdp_put_umem(struct xdp_umem *umem, bool defer_cleanup)
> >
> >  static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
> >  {
> > -	unsigned int gup_flags = FOLL_WRITE;
> > +	unsigned int gup_flags = FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING;
> >  	long npgs;
> >  	int err;
> >
> > --
> > 2.40.0
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jason Gunthorpe April 24, 2023, 5:49 p.m. UTC | #14
On Sun, Apr 23, 2023 at 11:56:48PM +0100, Lorenzo Stoakes wrote:

> This warned upon check should in reality not occur, because it implies the
> GUP user is trying to do something broken and is _not_ explicitly telling
> GUP that it knows it's doing it and can live with the consequences. And on
> that basis, is worthy of a warning so we know we have to go put this flag
> in that place (and know it is a source of problematic GUP usage), or fix
> the caller.

It is fine for debugging, but we can't merge user triggerable
WARN_ONs..

Since the GUP caller has no idea if userspace might be maliciously
passing in a file VMA we can't throw warnings.

Jason
Jason Gunthorpe April 24, 2023, 5:53 p.m. UTC | #15
On Mon, Apr 24, 2023 at 06:09:52PM +0200, Jan Kara wrote:

> So I'm maybe missing a bigger picture why we would like a change like this.
> Because we still need to teach filesystems to handle pinned pages for the
> usecases you don't forbid and for which we know there are users and then
> what's the benefit of this patch?

It immediately closes a security hole we have been talking about
fixing for years already and still don't have a full fix for.

An acknowledgment that this is not going to go away and starting to
clamp down on the insecurity may motivate some investment in the
proper fix. :)

Jason
Jason Gunthorpe April 24, 2023, 5:59 p.m. UTC | #16
On Mon, Apr 24, 2023 at 04:40:26PM +0300, Kirill A. Shutemov wrote:
> > Something more general would be preferable, however I believe there were
> > concerns broader than write notify, for instance not correctly marking the
> > folio dirty after writing to it, though arguably the caller should
> > certainly be ensuring that (and in many cases, do).
> 
> It doesn't make much sense to me.
> 
> Shared writable mapping without page_mkwrite (or pfn_write) will setup
> writeable PTE even on read faults[1], so you will not get the page dirty,
> unless you scan page table entries for dirty bit.

The general statement for supporting GUP is that the VMA owner never
relies on write protect, either explicitly through removing the write
bit in the PTE or implicitly through zapping the inode and removing
all PTEs.

The general bug we have is that the FS does some action to prevent
writes and then becomes surprised that the page was made dirty.

GUP allows write access to the page to continue past any write protect
action the FS takes.

AFAIK all GUP users do correctly do mkdirty and we have helpers to
support this during unpin, that is not the bug.

So, I don't know about page_mkwrite, if it correlates with the abvoe
then great :)

Jason
diff mbox series

Patch

diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index f693bc753b6b..b9019dad8008 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -110,7 +110,8 @@  int qib_get_user_pages(unsigned long start_page, size_t num_pages,
 	for (got = 0; got < num_pages; got += ret) {
 		ret = pin_user_pages(start_page + got * PAGE_SIZE,
 				     num_pages - got,
-				     FOLL_LONGTERM | FOLL_WRITE,
+				     FOLL_LONGTERM | FOLL_WRITE |
+				     FOLL_ALLOW_BROKEN_FILE_MAPPING,
 				     p + got, NULL);
 		if (ret < 0) {
 			mmap_read_unlock(current->mm);
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 2a5cac2658ec..33cf79b248a9 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -85,7 +85,7 @@  static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 				int dmasync, struct usnic_uiom_reg *uiomr)
 {
 	struct list_head *chunk_list = &uiomr->chunk_list;
-	unsigned int gup_flags = FOLL_LONGTERM;
+	unsigned int gup_flags = FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING;
 	struct page **page_list;
 	struct scatterlist *sg;
 	struct usnic_uiom_chunk *chunk;
diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
index f51ab2ccf151..bc3e8c0898e5 100644
--- a/drivers/infiniband/sw/siw/siw_mem.c
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -368,7 +368,8 @@  struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
 	struct mm_struct *mm_s;
 	u64 first_page_va;
 	unsigned long mlock_limit;
-	unsigned int foll_flags = FOLL_LONGTERM;
+	unsigned int foll_flags =
+		FOLL_LONGTERM | FOLL_ALLOW_BROKEN_FILE_MAPPING;
 	int num_pages, num_chunks, i, rv = 0;
 
 	if (!can_do_mlock())
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 96a6a08c8235..3e3f5ea9849f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -855,7 +855,8 @@  static ssize_t mem_rw(struct file *file, char __user *buf,
 	if (!mmget_not_zero(mm))
 		goto free;
 
-	flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
+	flags = FOLL_FORCE | FOLL_ALLOW_BROKEN_FILE_MAPPING |
+		(write ? FOLL_WRITE : 0);
 
 	while (count > 0) {
 		size_t this_len = min_t(size_t, count, PAGE_SIZE);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3fc9e680f174..e76637b4c78f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1185,6 +1185,14 @@  enum {
 	FOLL_PCI_P2PDMA = 1 << 10,
 	/* allow interrupts from generic signals */
 	FOLL_INTERRUPTIBLE = 1 << 11,
+	/*
+	 * By default we disallow write access to known broken file-backed
+	 * memory mappings (i.e. anything other than hugetlb/shmem
+	 * mappings). Some code may rely upon being able to access this
+	 * regardless for legacy reasons, thus we provide a flag to indicate
+	 * this.
+	 */
+	FOLL_ALLOW_BROKEN_FILE_MAPPING = 1 << 12,
 
 	/* See also internal only FOLL flags in mm/internal.h */
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 59887c69d54c..ec330d3b0218 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -373,7 +373,8 @@  __update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
 		return -EINVAL;
 
 	ret = get_user_pages_remote(mm, vaddr, 1,
-			FOLL_WRITE, &page, &vma, NULL);
+				    FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING,
+				    &page, &vma, NULL);
 	if (unlikely(ret <= 0)) {
 		/*
 		 * We are asking for 1 page. If get_user_pages_remote() fails,
diff --git a/mm/gup.c b/mm/gup.c
index 1f72a717232b..68d5570c0bae 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -959,16 +959,46 @@  static int faultin_page(struct vm_area_struct *vma,
 	return 0;
 }
 
+/*
+ * Writing to file-backed mappings using GUP is a fundamentally broken operation
+ * as kernel write access to GUP mappings may not adhere to the semantics
+ * expected by a file system.
+ *
+ * In most instances we disallow this broken behaviour, however there are some
+ * exceptions to this enforced here.
+ */
+static inline bool can_write_file_mapping(struct vm_area_struct *vma,
+					  unsigned long gup_flags)
+{
+	struct file *file = vma->vm_file;
+
+	/* If we aren't pinning then no problematic write can occur. */
+	if (!(gup_flags & (FOLL_GET | FOLL_PIN)))
+		return true;
+
+	/* Special mappings should pose no problem. */
+	if (!file)
+		return true;
+
+	/* Has the caller explicitly indicated this case is acceptable? */
+	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPING)
+		return true;
+
+	/* shmem and hugetlb mappings do not have problematic semantics. */
+	return vma_is_shmem(vma) || is_file_hugepages(file);
+}
+
 static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 {
 	vm_flags_t vm_flags = vma->vm_flags;
 	int write = (gup_flags & FOLL_WRITE);
 	int foreign = (gup_flags & FOLL_REMOTE);
+	bool vma_anon = vma_is_anonymous(vma);
 
 	if (vm_flags & (VM_IO | VM_PFNMAP))
 		return -EFAULT;
 
-	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
+	if ((gup_flags & FOLL_ANON) && !vma_anon)
 		return -EFAULT;
 
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
@@ -978,6 +1008,10 @@  static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 		return -EFAULT;
 
 	if (write) {
+		if (!vma_anon &&
+		    WARN_ON_ONCE(!can_write_file_mapping(vma, gup_flags)))
+			return -EFAULT;
+
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
 				return -EFAULT;
diff --git a/mm/memory.c b/mm/memory.c
index 146bb94764f8..e3d535991548 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5683,7 +5683,8 @@  int access_process_vm(struct task_struct *tsk, unsigned long addr,
 	if (!mm)
 		return 0;
 
-	ret = __access_remote_vm(mm, addr, buf, len, gup_flags);
+	ret = __access_remote_vm(mm, addr, buf, len,
+				 gup_flags | FOLL_ALLOW_BROKEN_FILE_MAPPING);
 
 	mmput(mm);
 
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 78dfaf9e8990..ef126c08e89c 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -81,7 +81,7 @@  static int process_vm_rw_single_vec(unsigned long addr,
 	ssize_t rc = 0;
 	unsigned long max_pages_per_loop = PVM_MAX_KMALLOC_PAGES
 		/ sizeof(struct pages *);
-	unsigned int flags = 0;
+	unsigned int flags = FOLL_ALLOW_BROKEN_FILE_MAPPING;
 
 	/* Work out address and page range required */
 	if (len == 0)
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 02207e852d79..b93cfcaccb0d 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -93,7 +93,7 @@  void xdp_put_umem(struct xdp_umem *umem, bool defer_cleanup)
 
 static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
 {
-	unsigned int gup_flags = FOLL_WRITE;
+	unsigned int gup_flags = FOLL_WRITE | FOLL_ALLOW_BROKEN_FILE_MAPPING;
 	long npgs;
 	int err;