diff mbox series

[5/7] io_uring: rsrc: use FOLL_SAME_FILE on pin_user_pages()

Message ID 17357dec04b32593b71e4fdf3c30a346020acf98.1681508038.git.lstoakes@gmail.com (mailing list archive)
State New
Headers show
Series None | expand

Commit Message

Lorenzo Stoakes April 14, 2023, 11:27 p.m. UTC
Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
prevents io_pin_pages() from pinning pages spanning multiple VMAs with
permitted characteristics (anon/huge), requiring that all VMAs share the
same vm_file.

The newly introduced FOLL_SAME_FILE flag permits this to be expressed as a
GUP flag rather than having to retrieve VMAs to perform the check.

We then only need to perform a VMA lookup for the first VMA to assert the
anon/hugepage requirement as we know the rest of the VMAs will possess the
same characteristics.

Doing this eliminates the one instance of vmas being used by
pin_user_pages().

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 io_uring/rsrc.c | 39 ++++++++++++++++-----------------------
 1 file changed, 16 insertions(+), 23 deletions(-)

Comments

Jason Gunthorpe April 17, 2023, 12:56 p.m. UTC | #1
On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
> Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
> prevents io_pin_pages() from pinning pages spanning multiple VMAs with
> permitted characteristics (anon/huge), requiring that all VMAs share the
> same vm_file.

That commmit doesn't really explain why io_uring is doing such a weird
thing.

What exactly is the problem with mixing struct pages from different
files and why of all the GUP users does only io_uring need to care
about this?

If there is no justification then lets revert that commit instead.

>  		/* don't support file backed memory */
> -		for (i = 0; i < nr_pages; i++) {
> -			if (vmas[i]->vm_file != file) {
> -				ret = -EINVAL;
> -				break;
> -			}
> -			if (!file)
> -				continue;
> -			if (!vma_is_shmem(vmas[i]) && !is_file_hugepages(file)) {
> -				ret = -EOPNOTSUPP;
> -				break;
> -			}
> -		}
> +		file = vma->vm_file;
> +		if (file && !vma_is_shmem(vma) && !is_file_hugepages(file))
> +			ret = -EOPNOTSUPP;
> +

Also, why is it doing this?

All GUP users don't work entirely right for any fops implementation
that assumes write protect is unconditionally possible. eg most
filesystems.

We've been ignoring blocking it because it is an ABI break and it does
sort of work in some cases.

I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
io_uring open coding this kind of stuff.

Jason
Lorenzo Stoakes April 17, 2023, 1:19 p.m. UTC | #2
On Mon, Apr 17, 2023 at 09:56:34AM -0300, Jason Gunthorpe wrote:
> On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
> > Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
> > prevents io_pin_pages() from pinning pages spanning multiple VMAs with
> > permitted characteristics (anon/huge), requiring that all VMAs share the
> > same vm_file.
>
> That commmit doesn't really explain why io_uring is doing such a weird
> thing.
>
> What exactly is the problem with mixing struct pages from different
> files and why of all the GUP users does only io_uring need to care
> about this?
>
> If there is no justification then lets revert that commit instead.
>
> >  		/* don't support file backed memory */
> > -		for (i = 0; i < nr_pages; i++) {
> > -			if (vmas[i]->vm_file != file) {
> > -				ret = -EINVAL;
> > -				break;
> > -			}
> > -			if (!file)
> > -				continue;
> > -			if (!vma_is_shmem(vmas[i]) && !is_file_hugepages(file)) {
> > -				ret = -EOPNOTSUPP;
> > -				break;
> > -			}
> > -		}
> > +		file = vma->vm_file;
> > +		if (file && !vma_is_shmem(vma) && !is_file_hugepages(file))
> > +			ret = -EOPNOTSUPP;
> > +
>
> Also, why is it doing this?
>
> All GUP users don't work entirely right for any fops implementation
> that assumes write protect is unconditionally possible. eg most
> filesystems.
>
> We've been ignoring blocking it because it is an ABI break and it does
> sort of work in some cases.
>

I will leave this to Jens and Pavel to revert on!

> I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> io_uring open coding this kind of stuff.
>

How would the semantics of this work? What is broken? It is a little
frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding
FOLL_ANON_OR_HUGETLB was another consideration...

> Jason
Jason Gunthorpe April 17, 2023, 1:26 p.m. UTC | #3
On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote:

> > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> > io_uring open coding this kind of stuff.
> >
> 
> How would the semantics of this work? What is broken? It is a little
> frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding
> FOLL_ANON_OR_HUGETLB was another consideration...

It says "historically this user has accepted file backed pages and we
we think there may actually be users doing that, so don't break the
uABI"

Without the flag GUP would refuse to return file backed pages that can
trigger kernel crashes or data corruption.

Eg we'd want most places to not specify the flag and the few that do
to have some justification.

We should consdier removing FOLL_ANON, I'm not sure it really makes
sense these days for what proc is doing with it. All that proc stuff
could likely be turned into a kthread_use_mm() and a simple
copy_to/from user?

I suspect that eliminates the need to check for FOLL_ANON?

Jason
Lorenzo Stoakes April 17, 2023, 2 p.m. UTC | #4
On Mon, Apr 17, 2023 at 10:26:09AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote:
>
> > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> > > io_uring open coding this kind of stuff.
> > >
> >
> > How would the semantics of this work? What is broken? It is a little
> > frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding
> > FOLL_ANON_OR_HUGETLB was another consideration...
>
> It says "historically this user has accepted file backed pages and we
> we think there may actually be users doing that, so don't break the
> uABI"

Having written a bunch here I suddenly realised that you probably mean for
this flag to NOT be applied to the io_uring code and thus have it enforce
the 'anonymous or hugetlb' check by default?

>
> Without the flag GUP would refuse to return file backed pages that can
> trigger kernel crashes or data corruption.
>
> Eg we'd want most places to not specify the flag and the few that do
> to have some justification.
>

So you mean to disallow file-backed page pinning as a whole unless this
flag is specified? For FOLL_GET I can see that access to the underlying
data is dangerous as the memory may get reclaimed or migrated, but surely
DMA-pinned memory (as is the case here) is safe?

Or is this a product more so of some kernel process accessing file-backed
pages for a file system which expects write-notify semantics and doesn't
get them in this case, which could indeed be horribly broken.

In which case yes this seems sensible.

> We should consdier removing FOLL_ANON, I'm not sure it really makes
> sense these days for what proc is doing with it. All that proc stuff
> could likely be turned into a kthread_use_mm() and a simple
> copy_to/from user?
>
> I suspect that eliminates the need to check for FOLL_ANON?
>
> Jason

I am definitely in favour of cutting things down if possible, and very much
prefer the use of uaccess if we are able to do so rather than GUP.

I do feel that GUP should be focused purely on pinning memory rather than
manipulating it (whether read or write) so I agree with this sentiment.
Jason Gunthorpe April 17, 2023, 2:15 p.m. UTC | #5
On Mon, Apr 17, 2023 at 03:00:16PM +0100, Lorenzo Stoakes wrote:
> On Mon, Apr 17, 2023 at 10:26:09AM -0300, Jason Gunthorpe wrote:
> > On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote:
> >
> > > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> > > > io_uring open coding this kind of stuff.
> > > >
> > >
> > > How would the semantics of this work? What is broken? It is a little
> > > frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding
> > > FOLL_ANON_OR_HUGETLB was another consideration...
> >
> > It says "historically this user has accepted file backed pages and we
> > we think there may actually be users doing that, so don't break the
> > uABI"
> 
> Having written a bunch here I suddenly realised that you probably mean for
> this flag to NOT be applied to the io_uring code and thus have it enforce
> the 'anonymous or hugetlb' check by default?

Yes

> So you mean to disallow file-backed page pinning as a whole unless this
> flag is specified? 

Yes

> For FOLL_GET I can see that access to the underlying
> data is dangerous as the memory may get reclaimed or migrated, but surely
> DMA-pinned memory (as is the case here) is safe?

No, it is all broken, read-only access is safe.

We are trying to get a point where pin access will interact properly
with the filesystem, but it isn't done yet.

> Or is this a product more so of some kernel process accessing file-backed
> pages for a file system which expects write-notify semantics and doesn't
> get them in this case, which could indeed be horribly broken.

Yes, broadly

> I am definitely in favour of cutting things down if possible, and very much
> prefer the use of uaccess if we are able to do so rather than GUP.
> 
> I do feel that GUP should be focused purely on pinning memory rather than
> manipulating it (whether read or write) so I agree with this sentiment.

Yes, someone needs to be brave enough to go and try to adjust these
old places :)

I see in the git history this was added to solve CVE-2018-1120 - eg
FUSE can hold off fault-in indefinitely. So the flag is really badly
misnamed - it is "FOLL_DONT_BLOCK_ON_USERSPACE" and anon memory is a
simple, but overly narrow, way to get that property.

If it is changed to use kthread_use_mm() it needs a VMA based check
for the same idea.

Jason
Lorenzo Stoakes April 17, 2023, 3:20 p.m. UTC | #6
On Mon, Apr 17, 2023 at 11:15:10AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 17, 2023 at 03:00:16PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Apr 17, 2023 at 10:26:09AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote:
> > >
> > > > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> > > > > io_uring open coding this kind of stuff.
> > > > >
> > > >
> > > > How would the semantics of this work? What is broken? It is a little
> > > > frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding
> > > > FOLL_ANON_OR_HUGETLB was another consideration...
> > >
> > > It says "historically this user has accepted file backed pages and we
> > > we think there may actually be users doing that, so don't break the
> > > uABI"
> >
> > Having written a bunch here I suddenly realised that you probably mean for
> > this flag to NOT be applied to the io_uring code and thus have it enforce
> > the 'anonymous or hugetlb' check by default?
>
> Yes
>
> > So you mean to disallow file-backed page pinning as a whole unless this
> > flag is specified?
>
> Yes
>
> > For FOLL_GET I can see that access to the underlying
> > data is dangerous as the memory may get reclaimed or migrated, but surely
> > DMA-pinned memory (as is the case here) is safe?
>
> No, it is all broken, read-only access is safe.
>
> We are trying to get a point where pin access will interact properly
> with the filesystem, but it isn't done yet.
>
> > Or is this a product more so of some kernel process accessing file-backed
> > pages for a file system which expects write-notify semantics and doesn't
> > get them in this case, which could indeed be horribly broken.
>
> Yes, broadly
>
> > I am definitely in favour of cutting things down if possible, and very much
> > prefer the use of uaccess if we are able to do so rather than GUP.
> >
> > I do feel that GUP should be focused purely on pinning memory rather than
> > manipulating it (whether read or write) so I agree with this sentiment.
>
> Yes, someone needs to be brave enough to go and try to adjust these
> old places :)

Well, I liek to think of myself as stupid^W brave enough to do such things
so may try a separate patch series on that :)

>
> I see in the git history this was added to solve CVE-2018-1120 - eg
> FUSE can hold off fault-in indefinitely. So the flag is really badly
> misnamed - it is "FOLL_DONT_BLOCK_ON_USERSPACE" and anon memory is a
> simple, but overly narrow, way to get that property.
>
> If it is changed to use kthread_use_mm() it needs a VMA based check
> for the same idea.
>
> Jason

I'll try my hand at patching this also!

As for FOLL_ALLOW_BROKEN_FILE_MAPPINGS, I do really like this idea, and
think it is actually probably quite important we do it, however this feels
a bit out of scope for this patch series.

I think perhaps the way forward is, if Jens and Pavel don't have any issue
with it, we open code the check and drop FOLL_SAME_FILE for this series,
then introduce it in a separate one + replace the open coding there?

I am eager to try to keep this focused on the specific task of dropping the
vmas parameter as I think FOLL_ALLOW_BROKEN_FILE_MAPPINGS is likely to
garner some discussion which should be kept separate.
Lorenzo Stoakes April 17, 2023, 7 p.m. UTC | #7
On Mon, Apr 17, 2023 at 10:26:09AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote:
>
> > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> > > io_uring open coding this kind of stuff.
> > >
> >
> > How would the semantics of this work? What is broken? It is a little
> > frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding
> > FOLL_ANON_OR_HUGETLB was another consideration...
>
> It says "historically this user has accepted file backed pages and we
> we think there may actually be users doing that, so don't break the
> uABI"
>
> Without the flag GUP would refuse to return file backed pages that can
> trigger kernel crashes or data corruption.
>
> Eg we'd want most places to not specify the flag and the few that do
> to have some justification.
>
> We should consdier removing FOLL_ANON, I'm not sure it really makes
> sense these days for what proc is doing with it. All that proc stuff
> could likely be turned into a kthread_use_mm() and a simple
> copy_to/from user?
>
> I suspect that eliminates the need to check for FOLL_ANON?
>
> Jason

The proc invocations utilising FOLL_ANON are get_mm_proctitle(),
get_mm_cmdline() and environ_read() which each pass it to
access_remote_vm() and which will be being called from a process context,
i.e. with tsk->mm != NULL, but kthread_use_mm() explicitly disallows the
(slightly mind boggling) idea of swapping out an established mm.

So I don't think this route is plausible unless you were thinking of
somehow offloading to a thread?

In any case, if we institute the FOLL_ALLOW_BROKEN_FILE_MAPPINGS flag we
can just drop FOLL_ANON altogether right, as this will be implied and
hugetlb should work here too?

Separately, I find the semantics of access_remote_vm() kind of weird, and
with a possible mmap_lock-free future it does make me wonder whether
something better could be done there.

(Section where I sound like I might be going mad) Perhaps having some means
of context switching into the kernel portion of the remote process as if
were a system call or soft interrupt handler and having that actually do
the uaccess operation could be useful here?

I'm guesing nothing like that exists yet?
Jason Gunthorpe April 17, 2023, 7:24 p.m. UTC | #8
On Mon, Apr 17, 2023 at 08:00:48PM +0100, Lorenzo Stoakes wrote:

> So I don't think this route is plausible unless you were thinking of
> somehow offloading to a thread?

ah, fair enough
 
> In any case, if we institute the FOLL_ALLOW_BROKEN_FILE_MAPPINGS flag we
> can just drop FOLL_ANON altogether right, as this will be implied and
> hugetlb should work here too?

Well.. no, as I said read-only access to the pages works fine, so GUP
should not block that. It is only write that has issues

> Separately, I find the semantics of access_remote_vm() kind of weird, and
> with a possible mmap_lock-free future it does make me wonder whether
> something better could be done there.

Yes, it is very weird, kthread_use_mm is much nicer.
 
> (Section where I sound like I might be going mad) Perhaps having some means
> of context switching into the kernel portion of the remote process as if
> were a system call or soft interrupt handler and having that actually do
> the uaccess operation could be useful here?

This is the kthread_use_mm() approach, that is basically what it
does. You are suggesting to extend it to kthreads that already have a
process attached...

access_remote_vm is basically copy_to/from_user built using kmap and
GUP.

even a simple step of localizing FOLL_ANON to __access_remote_vm,
since it must have the VMA nyhow, would be an improvement.

Jason
Lorenzo Stoakes April 17, 2023, 7:45 p.m. UTC | #9
On Mon, Apr 17, 2023 at 04:24:04PM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 17, 2023 at 08:00:48PM +0100, Lorenzo Stoakes wrote:
>
> > So I don't think this route is plausible unless you were thinking of
> > somehow offloading to a thread?
>
> ah, fair enough
>
> > In any case, if we institute the FOLL_ALLOW_BROKEN_FILE_MAPPINGS flag we
> > can just drop FOLL_ANON altogether right, as this will be implied and
> > hugetlb should work here too?
>
> Well.. no, as I said read-only access to the pages works fine, so GUP
> should not block that. It is only write that has issues
>
> > Separately, I find the semantics of access_remote_vm() kind of weird, and
> > with a possible mmap_lock-free future it does make me wonder whether
> > something better could be done there.
>
> Yes, it is very weird, kthread_use_mm is much nicer.
>
> > (Section where I sound like I might be going mad) Perhaps having some means
> > of context switching into the kernel portion of the remote process as if
> > were a system call or soft interrupt handler and having that actually do
> > the uaccess operation could be useful here?
>
> This is the kthread_use_mm() approach, that is basically what it
> does. You are suggesting to extend it to kthreads that already have a
> process attached...

Yeah, I wonder how plausible this is as we could in theory simply eliminate
these remote cases altogether which could be relatively efficient if we
could find a way to batch up operations.

>
> access_remote_vm is basically copy_to/from_user built using kmap and
> GUP.
>
> even a simple step of localizing FOLL_ANON to __access_remote_vm,
> since it must have the VMA nyhow, would be an improvement.

This is used from places where this flag might not be set though,
e.g. acess_process_vm() and ptrace.

However, access_remote_vm() is only used by the proc stuff, so I will spin
up a patch to move this function and treat it as a helper which sets
FOLL_ANON.

>
> Jason
Pavel Begunkov April 18, 2023, 4:25 p.m. UTC | #10
On 4/17/23 13:56, Jason Gunthorpe wrote:
> On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
>> Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
>> prevents io_pin_pages() from pinning pages spanning multiple VMAs with
>> permitted characteristics (anon/huge), requiring that all VMAs share the
>> same vm_file.
> 
> That commmit doesn't really explain why io_uring is doing such a weird
> thing.
> 
> What exactly is the problem with mixing struct pages from different
> files and why of all the GUP users does only io_uring need to care
> about this?

Simply because it doesn't seem sane to mix and register buffers of
different "nature" as one. It's not a huge deal for currently allowed
types, e.g. mixing normal and huge anon pages, but it's rather a matter
of time before it gets extended, and then I'll certainly become a
problem. We've been asked just recently to allow registering bufs
provided mapped by some specific driver, or there might be DMA mapped
memory in the future.

Rejecting based on vmas might be too conservative, I agree and am all
for if someone can help to make it right.


> If there is no justification then lets revert that commit instead.
> 
>>   		/* don't support file backed memory */
>> -		for (i = 0; i < nr_pages; i++) {
>> -			if (vmas[i]->vm_file != file) {
>> -				ret = -EINVAL;
>> -				break;
>> -			}
>> -			if (!file)
>> -				continue;
>> -			if (!vma_is_shmem(vmas[i]) && !is_file_hugepages(file)) {
>> -				ret = -EOPNOTSUPP;
>> -				break;
>> -			}
>> -		}
>> +		file = vma->vm_file;
>> +		if (file && !vma_is_shmem(vma) && !is_file_hugepages(file))
>> +			ret = -EOPNOTSUPP;
>> +
> 
> Also, why is it doing this?

There were problems with filesystem mappings, I believe.
Jens may remember what it was.


> All GUP users don't work entirely right for any fops implementation
> that assumes write protect is unconditionally possible. eg most
> filesystems.
> 
> We've been ignoring blocking it because it is an ABI break and it does
> sort of work in some cases.
> 
> I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> io_uring open coding this kind of stuff.
Pavel Begunkov April 18, 2023, 4:35 p.m. UTC | #11
On 4/18/23 17:25, Pavel Begunkov wrote:
> On 4/17/23 13:56, Jason Gunthorpe wrote:
>> On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
>>> Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
>>> prevents io_pin_pages() from pinning pages spanning multiple VMAs with
>>> permitted characteristics (anon/huge), requiring that all VMAs share the
>>> same vm_file.
>>
>> That commmit doesn't really explain why io_uring is doing such a weird
>> thing.
>>
>> What exactly is the problem with mixing struct pages from different
>> files and why of all the GUP users does only io_uring need to care
>> about this?
> 
> Simply because it doesn't seem sane to mix and register buffers of
> different "nature" as one. It's not a huge deal for currently allowed
> types, e.g. mixing normal and huge anon pages, but it's rather a matter
> of time before it gets extended, and then I'll certainly become a
> problem. We've been asked just recently to allow registering bufs
> provided mapped by some specific driver, or there might be DMA mapped
> memory in the future.
> 
> Rejecting based on vmas might be too conservative, I agree and am all
> for if someone can help to make it right.

For some reason I thought it was rejecting if involves more than
one different vma. ->vm_file checks still sound fair to me, but in
any case, open to changing it.
Jason Gunthorpe April 18, 2023, 4:36 p.m. UTC | #12
On Tue, Apr 18, 2023 at 05:25:08PM +0100, Pavel Begunkov wrote:
> On 4/17/23 13:56, Jason Gunthorpe wrote:
> > On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
> > > Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
> > > prevents io_pin_pages() from pinning pages spanning multiple VMAs with
> > > permitted characteristics (anon/huge), requiring that all VMAs share the
> > > same vm_file.
> > 
> > That commmit doesn't really explain why io_uring is doing such a weird
> > thing.
> > 
> > What exactly is the problem with mixing struct pages from different
> > files and why of all the GUP users does only io_uring need to care
> > about this?
> 
> Simply because it doesn't seem sane to mix and register buffers of
> different "nature" as one. 

That is not a good reason. Once things are converted to struct pages
they don't need to care about their "nature"

> problem. We've been asked just recently to allow registering bufs
> provided mapped by some specific driver, or there might be DMA mapped
> memory in the future.

We already have GUP flags to deal with it, eg FOLL_PCI_P2PDMA

> Rejecting based on vmas might be too conservative, I agree and am all
> for if someone can help to make it right.

It is GUP's problem to deal with this, not the callers.

GUP is defined to return a list of normal CPU DRAM in struct page
format. The caller doesn't care where or what this memory is, it is
all interchangable - by API contract of GUP itself.

If you use FOLL_PCI_P2PDMA then the definition expands to allow struct
pages that are MMIO.

In future, if someone invents new memory or new struct pages with
special needs it is their job to ensure it is blocked from GUP - for
*everyone*. eg how the PCI_P2PDMA was blocked from normal GUP.

io_uring is not special, there are many users of GUP, they all need to
work consistently.

> > Also, why is it doing this?
> 
> There were problems with filesystem mappings, I believe.
> Jens may remember what it was.

Yes, I know about this, but as above, io_uring is not special, if we
want to block this GUP blocks it to protect all users, not io_uring
just protects itself..

Jason
Pavel Begunkov April 18, 2023, 5:25 p.m. UTC | #13
On 4/18/23 17:36, Jason Gunthorpe wrote:
> On Tue, Apr 18, 2023 at 05:25:08PM +0100, Pavel Begunkov wrote:
>> On 4/17/23 13:56, Jason Gunthorpe wrote:
>>> On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
>>>> Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
>>>> prevents io_pin_pages() from pinning pages spanning multiple VMAs with
>>>> permitted characteristics (anon/huge), requiring that all VMAs share the
>>>> same vm_file.
>>>
>>> That commmit doesn't really explain why io_uring is doing such a weird
>>> thing.
>>>
>>> What exactly is the problem with mixing struct pages from different
>>> files and why of all the GUP users does only io_uring need to care
>>> about this?
>>
>> Simply because it doesn't seem sane to mix and register buffers of
>> different "nature" as one.
> 
> That is not a good reason. Once things are converted to struct pages
> they don't need to care about their "nature"

Arguing purely about uapi, I do think it is. Even though it can be
passed down and a page is a page, Frankenstein's Monster mixing anon
pages, pages for io_uring queues, device shared memory, and what not
else doesn't seem right for uapi. I see keeping buffers as a single
entity in opposite to a set of random pages beneficial for the future.

And again, as for how it's internally done, I don't have any preference
whatsoever.

>> problem. We've been asked just recently to allow registering bufs
>> provided mapped by some specific driver, or there might be DMA mapped
>> memory in the future.
> 
> We already have GUP flags to deal with it, eg FOLL_PCI_P2PDMA
> 
>> Rejecting based on vmas might be too conservative, I agree and am all
>> for if someone can help to make it right.
> 
> It is GUP's problem to deal with this, not the callers.

Ok, that's even better for io_uring if the same can be achieved
just by passing flags.


> GUP is defined to return a list of normal CPU DRAM in struct page
> format. The caller doesn't care where or what this memory is, it is
> all interchangable - by API contract of GUP itself.
> 
> If you use FOLL_PCI_P2PDMA then the definition expands to allow struct
> pages that are MMIO.
> 
> In future, if someone invents new memory or new struct pages with
> special needs it is their job to ensure it is blocked from GUP - for
> *everyone*. eg how the PCI_P2PDMA was blocked from normal GUP.
> 
> io_uring is not special, there are many users of GUP, they all need to
> work consistently.
Jason Gunthorpe April 18, 2023, 6:19 p.m. UTC | #14
On Tue, Apr 18, 2023 at 06:25:24PM +0100, Pavel Begunkov wrote:
> On 4/18/23 17:36, Jason Gunthorpe wrote:
> > On Tue, Apr 18, 2023 at 05:25:08PM +0100, Pavel Begunkov wrote:
> > > On 4/17/23 13:56, Jason Gunthorpe wrote:
> > > > On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
> > > > > Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
> > > > > prevents io_pin_pages() from pinning pages spanning multiple VMAs with
> > > > > permitted characteristics (anon/huge), requiring that all VMAs share the
> > > > > same vm_file.
> > > > 
> > > > That commmit doesn't really explain why io_uring is doing such a weird
> > > > thing.
> > > > 
> > > > What exactly is the problem with mixing struct pages from different
> > > > files and why of all the GUP users does only io_uring need to care
> > > > about this?
> > > 
> > > Simply because it doesn't seem sane to mix and register buffers of
> > > different "nature" as one.
> > 
> > That is not a good reason. Once things are converted to struct pages
> > they don't need to care about their "nature"
> 
> Arguing purely about uapi, I do think it is. Even though it can be
> passed down and a page is a page, Frankenstein's Monster mixing anon
> pages, pages for io_uring queues, device shared memory, and what not
> else doesn't seem right for uapi. I see keeping buffers as a single
> entity in opposite to a set of random pages beneficial for the future.

Again, it is not up to io_uring to make this choice. We have GUP as
part of our uAPI all over the place, GUP decides how it works, not
random different ideas all over the place.

We don't have these kinds of restrictions for O_DIRECT, for instance.

There should be consistency in the uAPI across everything.

Jason
diff mbox series

Patch

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 7a43aed8e395..adc860bcbd4f 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1141,9 +1141,8 @@  static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
 struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages)
 {
 	unsigned long start, end, nr_pages;
-	struct vm_area_struct **vmas = NULL;
 	struct page **pages = NULL;
-	int i, pret, ret = -ENOMEM;
+	int pret, ret = -ENOMEM;
 
 	end = (ubuf + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	start = ubuf >> PAGE_SHIFT;
@@ -1153,31 +1152,26 @@  struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages)
 	if (!pages)
 		goto done;
 
-	vmas = kvmalloc_array(nr_pages, sizeof(struct vm_area_struct *),
-			      GFP_KERNEL);
-	if (!vmas)
-		goto done;
-
 	ret = 0;
 	mmap_read_lock(current->mm);
-	pret = pin_user_pages(ubuf, nr_pages, FOLL_WRITE | FOLL_LONGTERM,
-			      pages, vmas);
+
+	pret = pin_user_pages(ubuf, nr_pages,
+			      FOLL_WRITE | FOLL_LONGTERM | FOLL_SAME_FILE,
+			      pages, NULL);
 	if (pret == nr_pages) {
-		struct file *file = vmas[0]->vm_file;
+		/*
+		 * lookup the first VMA, we require that all VMAs in range
+		 * maintain the same file characteristics, as enforced by
+		 * FOLL_SAME_FILE
+		 */
+		struct vm_area_struct *vma = vma_lookup(current->mm, ubuf);
+		struct file *file;
 
 		/* don't support file backed memory */
-		for (i = 0; i < nr_pages; i++) {
-			if (vmas[i]->vm_file != file) {
-				ret = -EINVAL;
-				break;
-			}
-			if (!file)
-				continue;
-			if (!vma_is_shmem(vmas[i]) && !is_file_hugepages(file)) {
-				ret = -EOPNOTSUPP;
-				break;
-			}
-		}
+		file = vma->vm_file;
+		if (file && !vma_is_shmem(vma) && !is_file_hugepages(file))
+			ret = -EOPNOTSUPP;
+
 		*npages = nr_pages;
 	} else {
 		ret = pret < 0 ? pret : -EFAULT;
@@ -1194,7 +1188,6 @@  struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages)
 	}
 	ret = 0;
 done:
-	kvfree(vmas);
 	if (ret < 0) {
 		kvfree(pages);
 		pages = ERR_PTR(ret);