Message ID | 17357dec04b32593b71e4fdf3c30a346020acf98.1681508038.git.lstoakes@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | None | expand |
On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote: > Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers") > prevents io_pin_pages() from pinning pages spanning multiple VMAs with > permitted characteristics (anon/huge), requiring that all VMAs share the > same vm_file. That commmit doesn't really explain why io_uring is doing such a weird thing. What exactly is the problem with mixing struct pages from different files and why of all the GUP users does only io_uring need to care about this? If there is no justification then lets revert that commit instead. > /* don't support file backed memory */ > - for (i = 0; i < nr_pages; i++) { > - if (vmas[i]->vm_file != file) { > - ret = -EINVAL; > - break; > - } > - if (!file) > - continue; > - if (!vma_is_shmem(vmas[i]) && !is_file_hugepages(file)) { > - ret = -EOPNOTSUPP; > - break; > - } > - } > + file = vma->vm_file; > + if (file && !vma_is_shmem(vma) && !is_file_hugepages(file)) > + ret = -EOPNOTSUPP; > + Also, why is it doing this? All GUP users don't work entirely right for any fops implementation that assumes write protect is unconditionally possible. eg most filesystems. We've been ignoring blocking it because it is an ABI break and it does sort of work in some cases. I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than io_uring open coding this kind of stuff. Jason
On Mon, Apr 17, 2023 at 09:56:34AM -0300, Jason Gunthorpe wrote: > On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote: > > Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers") > > prevents io_pin_pages() from pinning pages spanning multiple VMAs with > > permitted characteristics (anon/huge), requiring that all VMAs share the > > same vm_file. > > That commmit doesn't really explain why io_uring is doing such a weird > thing. > > What exactly is the problem with mixing struct pages from different > files and why of all the GUP users does only io_uring need to care > about this? > > If there is no justification then lets revert that commit instead. > > > /* don't support file backed memory */ > > - for (i = 0; i < nr_pages; i++) { > > - if (vmas[i]->vm_file != file) { > > - ret = -EINVAL; > > - break; > > - } > > - if (!file) > > - continue; > > - if (!vma_is_shmem(vmas[i]) && !is_file_hugepages(file)) { > > - ret = -EOPNOTSUPP; > > - break; > > - } > > - } > > + file = vma->vm_file; > > + if (file && !vma_is_shmem(vma) && !is_file_hugepages(file)) > > + ret = -EOPNOTSUPP; > > + > > Also, why is it doing this? > > All GUP users don't work entirely right for any fops implementation > that assumes write protect is unconditionally possible. eg most > filesystems. > > We've been ignoring blocking it because it is an ABI break and it does > sort of work in some cases. > I will leave this to Jens and Pavel to revert on! > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than > io_uring open coding this kind of stuff. > How would the semantics of this work? What is broken? It is a little frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding FOLL_ANON_OR_HUGETLB was another consideration... > Jason
On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote: > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than > > io_uring open coding this kind of stuff. > > > > How would the semantics of this work? What is broken? It is a little > frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding > FOLL_ANON_OR_HUGETLB was another consideration... It says "historically this user has accepted file backed pages and we we think there may actually be users doing that, so don't break the uABI" Without the flag GUP would refuse to return file backed pages that can trigger kernel crashes or data corruption. Eg we'd want most places to not specify the flag and the few that do to have some justification. We should consdier removing FOLL_ANON, I'm not sure it really makes sense these days for what proc is doing with it. All that proc stuff could likely be turned into a kthread_use_mm() and a simple copy_to/from user? I suspect that eliminates the need to check for FOLL_ANON? Jason
On Mon, Apr 17, 2023 at 10:26:09AM -0300, Jason Gunthorpe wrote: > On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote: > > > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than > > > io_uring open coding this kind of stuff. > > > > > > > How would the semantics of this work? What is broken? It is a little > > frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding > > FOLL_ANON_OR_HUGETLB was another consideration... > > It says "historically this user has accepted file backed pages and we > we think there may actually be users doing that, so don't break the > uABI" Having written a bunch here I suddenly realised that you probably mean for this flag to NOT be applied to the io_uring code and thus have it enforce the 'anonymous or hugetlb' check by default? > > Without the flag GUP would refuse to return file backed pages that can > trigger kernel crashes or data corruption. > > Eg we'd want most places to not specify the flag and the few that do > to have some justification. > So you mean to disallow file-backed page pinning as a whole unless this flag is specified? For FOLL_GET I can see that access to the underlying data is dangerous as the memory may get reclaimed or migrated, but surely DMA-pinned memory (as is the case here) is safe? Or is this a product more so of some kernel process accessing file-backed pages for a file system which expects write-notify semantics and doesn't get them in this case, which could indeed be horribly broken. In which case yes this seems sensible. > We should consdier removing FOLL_ANON, I'm not sure it really makes > sense these days for what proc is doing with it. All that proc stuff > could likely be turned into a kthread_use_mm() and a simple > copy_to/from user? > > I suspect that eliminates the need to check for FOLL_ANON? > > Jason I am definitely in favour of cutting things down if possible, and very much prefer the use of uaccess if we are able to do so rather than GUP. I do feel that GUP should be focused purely on pinning memory rather than manipulating it (whether read or write) so I agree with this sentiment.
On Mon, Apr 17, 2023 at 03:00:16PM +0100, Lorenzo Stoakes wrote: > On Mon, Apr 17, 2023 at 10:26:09AM -0300, Jason Gunthorpe wrote: > > On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote: > > > > > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than > > > > io_uring open coding this kind of stuff. > > > > > > > > > > How would the semantics of this work? What is broken? It is a little > > > frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding > > > FOLL_ANON_OR_HUGETLB was another consideration... > > > > It says "historically this user has accepted file backed pages and we > > we think there may actually be users doing that, so don't break the > > uABI" > > Having written a bunch here I suddenly realised that you probably mean for > this flag to NOT be applied to the io_uring code and thus have it enforce > the 'anonymous or hugetlb' check by default? Yes > So you mean to disallow file-backed page pinning as a whole unless this > flag is specified? Yes > For FOLL_GET I can see that access to the underlying > data is dangerous as the memory may get reclaimed or migrated, but surely > DMA-pinned memory (as is the case here) is safe? No, it is all broken, read-only access is safe. We are trying to get a point where pin access will interact properly with the filesystem, but it isn't done yet. > Or is this a product more so of some kernel process accessing file-backed > pages for a file system which expects write-notify semantics and doesn't > get them in this case, which could indeed be horribly broken. Yes, broadly > I am definitely in favour of cutting things down if possible, and very much > prefer the use of uaccess if we are able to do so rather than GUP. > > I do feel that GUP should be focused purely on pinning memory rather than > manipulating it (whether read or write) so I agree with this sentiment. Yes, someone needs to be brave enough to go and try to adjust these old places :) I see in the git history this was added to solve CVE-2018-1120 - eg FUSE can hold off fault-in indefinitely. So the flag is really badly misnamed - it is "FOLL_DONT_BLOCK_ON_USERSPACE" and anon memory is a simple, but overly narrow, way to get that property. If it is changed to use kthread_use_mm() it needs a VMA based check for the same idea. Jason
On Mon, Apr 17, 2023 at 11:15:10AM -0300, Jason Gunthorpe wrote: > On Mon, Apr 17, 2023 at 03:00:16PM +0100, Lorenzo Stoakes wrote: > > On Mon, Apr 17, 2023 at 10:26:09AM -0300, Jason Gunthorpe wrote: > > > On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote: > > > > > > > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than > > > > > io_uring open coding this kind of stuff. > > > > > > > > > > > > > How would the semantics of this work? What is broken? It is a little > > > > frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding > > > > FOLL_ANON_OR_HUGETLB was another consideration... > > > > > > It says "historically this user has accepted file backed pages and we > > > we think there may actually be users doing that, so don't break the > > > uABI" > > > > Having written a bunch here I suddenly realised that you probably mean for > > this flag to NOT be applied to the io_uring code and thus have it enforce > > the 'anonymous or hugetlb' check by default? > > Yes > > > So you mean to disallow file-backed page pinning as a whole unless this > > flag is specified? > > Yes > > > For FOLL_GET I can see that access to the underlying > > data is dangerous as the memory may get reclaimed or migrated, but surely > > DMA-pinned memory (as is the case here) is safe? > > No, it is all broken, read-only access is safe. > > We are trying to get a point where pin access will interact properly > with the filesystem, but it isn't done yet. > > > Or is this a product more so of some kernel process accessing file-backed > > pages for a file system which expects write-notify semantics and doesn't > > get them in this case, which could indeed be horribly broken. > > Yes, broadly > > > I am definitely in favour of cutting things down if possible, and very much > > prefer the use of uaccess if we are able to do so rather than GUP. > > > > I do feel that GUP should be focused purely on pinning memory rather than > > manipulating it (whether read or write) so I agree with this sentiment. > > Yes, someone needs to be brave enough to go and try to adjust these > old places :) Well, I liek to think of myself as stupid^W brave enough to do such things so may try a separate patch series on that :) > > I see in the git history this was added to solve CVE-2018-1120 - eg > FUSE can hold off fault-in indefinitely. So the flag is really badly > misnamed - it is "FOLL_DONT_BLOCK_ON_USERSPACE" and anon memory is a > simple, but overly narrow, way to get that property. > > If it is changed to use kthread_use_mm() it needs a VMA based check > for the same idea. > > Jason I'll try my hand at patching this also! As for FOLL_ALLOW_BROKEN_FILE_MAPPINGS, I do really like this idea, and think it is actually probably quite important we do it, however this feels a bit out of scope for this patch series. I think perhaps the way forward is, if Jens and Pavel don't have any issue with it, we open code the check and drop FOLL_SAME_FILE for this series, then introduce it in a separate one + replace the open coding there? I am eager to try to keep this focused on the specific task of dropping the vmas parameter as I think FOLL_ALLOW_BROKEN_FILE_MAPPINGS is likely to garner some discussion which should be kept separate.
On Mon, Apr 17, 2023 at 10:26:09AM -0300, Jason Gunthorpe wrote: > On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote: > > > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than > > > io_uring open coding this kind of stuff. > > > > > > > How would the semantics of this work? What is broken? It is a little > > frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding > > FOLL_ANON_OR_HUGETLB was another consideration... > > It says "historically this user has accepted file backed pages and we > we think there may actually be users doing that, so don't break the > uABI" > > Without the flag GUP would refuse to return file backed pages that can > trigger kernel crashes or data corruption. > > Eg we'd want most places to not specify the flag and the few that do > to have some justification. > > We should consdier removing FOLL_ANON, I'm not sure it really makes > sense these days for what proc is doing with it. All that proc stuff > could likely be turned into a kthread_use_mm() and a simple > copy_to/from user? > > I suspect that eliminates the need to check for FOLL_ANON? > > Jason The proc invocations utilising FOLL_ANON are get_mm_proctitle(), get_mm_cmdline() and environ_read() which each pass it to access_remote_vm() and which will be being called from a process context, i.e. with tsk->mm != NULL, but kthread_use_mm() explicitly disallows the (slightly mind boggling) idea of swapping out an established mm. So I don't think this route is plausible unless you were thinking of somehow offloading to a thread? In any case, if we institute the FOLL_ALLOW_BROKEN_FILE_MAPPINGS flag we can just drop FOLL_ANON altogether right, as this will be implied and hugetlb should work here too? Separately, I find the semantics of access_remote_vm() kind of weird, and with a possible mmap_lock-free future it does make me wonder whether something better could be done there. (Section where I sound like I might be going mad) Perhaps having some means of context switching into the kernel portion of the remote process as if were a system call or soft interrupt handler and having that actually do the uaccess operation could be useful here? I'm guesing nothing like that exists yet?
On Mon, Apr 17, 2023 at 08:00:48PM +0100, Lorenzo Stoakes wrote: > So I don't think this route is plausible unless you were thinking of > somehow offloading to a thread? ah, fair enough > In any case, if we institute the FOLL_ALLOW_BROKEN_FILE_MAPPINGS flag we > can just drop FOLL_ANON altogether right, as this will be implied and > hugetlb should work here too? Well.. no, as I said read-only access to the pages works fine, so GUP should not block that. It is only write that has issues > Separately, I find the semantics of access_remote_vm() kind of weird, and > with a possible mmap_lock-free future it does make me wonder whether > something better could be done there. Yes, it is very weird, kthread_use_mm is much nicer. > (Section where I sound like I might be going mad) Perhaps having some means > of context switching into the kernel portion of the remote process as if > were a system call or soft interrupt handler and having that actually do > the uaccess operation could be useful here? This is the kthread_use_mm() approach, that is basically what it does. You are suggesting to extend it to kthreads that already have a process attached... access_remote_vm is basically copy_to/from_user built using kmap and GUP. even a simple step of localizing FOLL_ANON to __access_remote_vm, since it must have the VMA nyhow, would be an improvement. Jason
On Mon, Apr 17, 2023 at 04:24:04PM -0300, Jason Gunthorpe wrote: > On Mon, Apr 17, 2023 at 08:00:48PM +0100, Lorenzo Stoakes wrote: > > > So I don't think this route is plausible unless you were thinking of > > somehow offloading to a thread? > > ah, fair enough > > > In any case, if we institute the FOLL_ALLOW_BROKEN_FILE_MAPPINGS flag we > > can just drop FOLL_ANON altogether right, as this will be implied and > > hugetlb should work here too? > > Well.. no, as I said read-only access to the pages works fine, so GUP > should not block that. It is only write that has issues > > > Separately, I find the semantics of access_remote_vm() kind of weird, and > > with a possible mmap_lock-free future it does make me wonder whether > > something better could be done there. > > Yes, it is very weird, kthread_use_mm is much nicer. > > > (Section where I sound like I might be going mad) Perhaps having some means > > of context switching into the kernel portion of the remote process as if > > were a system call or soft interrupt handler and having that actually do > > the uaccess operation could be useful here? > > This is the kthread_use_mm() approach, that is basically what it > does. You are suggesting to extend it to kthreads that already have a > process attached... Yeah, I wonder how plausible this is as we could in theory simply eliminate these remote cases altogether which could be relatively efficient if we could find a way to batch up operations. > > access_remote_vm is basically copy_to/from_user built using kmap and > GUP. > > even a simple step of localizing FOLL_ANON to __access_remote_vm, > since it must have the VMA nyhow, would be an improvement. This is used from places where this flag might not be set though, e.g. acess_process_vm() and ptrace. However, access_remote_vm() is only used by the proc stuff, so I will spin up a patch to move this function and treat it as a helper which sets FOLL_ANON. > > Jason
On 4/17/23 13:56, Jason Gunthorpe wrote: > On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote: >> Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers") >> prevents io_pin_pages() from pinning pages spanning multiple VMAs with >> permitted characteristics (anon/huge), requiring that all VMAs share the >> same vm_file. > > That commmit doesn't really explain why io_uring is doing such a weird > thing. > > What exactly is the problem with mixing struct pages from different > files and why of all the GUP users does only io_uring need to care > about this? Simply because it doesn't seem sane to mix and register buffers of different "nature" as one. It's not a huge deal for currently allowed types, e.g. mixing normal and huge anon pages, but it's rather a matter of time before it gets extended, and then I'll certainly become a problem. We've been asked just recently to allow registering bufs provided mapped by some specific driver, or there might be DMA mapped memory in the future. Rejecting based on vmas might be too conservative, I agree and am all for if someone can help to make it right. > If there is no justification then lets revert that commit instead. > >> /* don't support file backed memory */ >> - for (i = 0; i < nr_pages; i++) { >> - if (vmas[i]->vm_file != file) { >> - ret = -EINVAL; >> - break; >> - } >> - if (!file) >> - continue; >> - if (!vma_is_shmem(vmas[i]) && !is_file_hugepages(file)) { >> - ret = -EOPNOTSUPP; >> - break; >> - } >> - } >> + file = vma->vm_file; >> + if (file && !vma_is_shmem(vma) && !is_file_hugepages(file)) >> + ret = -EOPNOTSUPP; >> + > > Also, why is it doing this? There were problems with filesystem mappings, I believe. Jens may remember what it was. > All GUP users don't work entirely right for any fops implementation > that assumes write protect is unconditionally possible. eg most > filesystems. > > We've been ignoring blocking it because it is an ABI break and it does > sort of work in some cases. > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than > io_uring open coding this kind of stuff.
On 4/18/23 17:25, Pavel Begunkov wrote: > On 4/17/23 13:56, Jason Gunthorpe wrote: >> On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote: >>> Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers") >>> prevents io_pin_pages() from pinning pages spanning multiple VMAs with >>> permitted characteristics (anon/huge), requiring that all VMAs share the >>> same vm_file. >> >> That commmit doesn't really explain why io_uring is doing such a weird >> thing. >> >> What exactly is the problem with mixing struct pages from different >> files and why of all the GUP users does only io_uring need to care >> about this? > > Simply because it doesn't seem sane to mix and register buffers of > different "nature" as one. It's not a huge deal for currently allowed > types, e.g. mixing normal and huge anon pages, but it's rather a matter > of time before it gets extended, and then I'll certainly become a > problem. We've been asked just recently to allow registering bufs > provided mapped by some specific driver, or there might be DMA mapped > memory in the future. > > Rejecting based on vmas might be too conservative, I agree and am all > for if someone can help to make it right. For some reason I thought it was rejecting if involves more than one different vma. ->vm_file checks still sound fair to me, but in any case, open to changing it.
On Tue, Apr 18, 2023 at 05:25:08PM +0100, Pavel Begunkov wrote: > On 4/17/23 13:56, Jason Gunthorpe wrote: > > On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote: > > > Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers") > > > prevents io_pin_pages() from pinning pages spanning multiple VMAs with > > > permitted characteristics (anon/huge), requiring that all VMAs share the > > > same vm_file. > > > > That commmit doesn't really explain why io_uring is doing such a weird > > thing. > > > > What exactly is the problem with mixing struct pages from different > > files and why of all the GUP users does only io_uring need to care > > about this? > > Simply because it doesn't seem sane to mix and register buffers of > different "nature" as one. That is not a good reason. Once things are converted to struct pages they don't need to care about their "nature" > problem. We've been asked just recently to allow registering bufs > provided mapped by some specific driver, or there might be DMA mapped > memory in the future. We already have GUP flags to deal with it, eg FOLL_PCI_P2PDMA > Rejecting based on vmas might be too conservative, I agree and am all > for if someone can help to make it right. It is GUP's problem to deal with this, not the callers. GUP is defined to return a list of normal CPU DRAM in struct page format. The caller doesn't care where or what this memory is, it is all interchangable - by API contract of GUP itself. If you use FOLL_PCI_P2PDMA then the definition expands to allow struct pages that are MMIO. In future, if someone invents new memory or new struct pages with special needs it is their job to ensure it is blocked from GUP - for *everyone*. eg how the PCI_P2PDMA was blocked from normal GUP. io_uring is not special, there are many users of GUP, they all need to work consistently. > > Also, why is it doing this? > > There were problems with filesystem mappings, I believe. > Jens may remember what it was. Yes, I know about this, but as above, io_uring is not special, if we want to block this GUP blocks it to protect all users, not io_uring just protects itself.. Jason
On 4/18/23 17:36, Jason Gunthorpe wrote: > On Tue, Apr 18, 2023 at 05:25:08PM +0100, Pavel Begunkov wrote: >> On 4/17/23 13:56, Jason Gunthorpe wrote: >>> On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote: >>>> Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers") >>>> prevents io_pin_pages() from pinning pages spanning multiple VMAs with >>>> permitted characteristics (anon/huge), requiring that all VMAs share the >>>> same vm_file. >>> >>> That commmit doesn't really explain why io_uring is doing such a weird >>> thing. >>> >>> What exactly is the problem with mixing struct pages from different >>> files and why of all the GUP users does only io_uring need to care >>> about this? >> >> Simply because it doesn't seem sane to mix and register buffers of >> different "nature" as one. > > That is not a good reason. Once things are converted to struct pages > they don't need to care about their "nature" Arguing purely about uapi, I do think it is. Even though it can be passed down and a page is a page, Frankenstein's Monster mixing anon pages, pages for io_uring queues, device shared memory, and what not else doesn't seem right for uapi. I see keeping buffers as a single entity in opposite to a set of random pages beneficial for the future. And again, as for how it's internally done, I don't have any preference whatsoever. >> problem. We've been asked just recently to allow registering bufs >> provided mapped by some specific driver, or there might be DMA mapped >> memory in the future. > > We already have GUP flags to deal with it, eg FOLL_PCI_P2PDMA > >> Rejecting based on vmas might be too conservative, I agree and am all >> for if someone can help to make it right. > > It is GUP's problem to deal with this, not the callers. Ok, that's even better for io_uring if the same can be achieved just by passing flags. > GUP is defined to return a list of normal CPU DRAM in struct page > format. The caller doesn't care where or what this memory is, it is > all interchangable - by API contract of GUP itself. > > If you use FOLL_PCI_P2PDMA then the definition expands to allow struct > pages that are MMIO. > > In future, if someone invents new memory or new struct pages with > special needs it is their job to ensure it is blocked from GUP - for > *everyone*. eg how the PCI_P2PDMA was blocked from normal GUP. > > io_uring is not special, there are many users of GUP, they all need to > work consistently.
On Tue, Apr 18, 2023 at 06:25:24PM +0100, Pavel Begunkov wrote: > On 4/18/23 17:36, Jason Gunthorpe wrote: > > On Tue, Apr 18, 2023 at 05:25:08PM +0100, Pavel Begunkov wrote: > > > On 4/17/23 13:56, Jason Gunthorpe wrote: > > > > On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote: > > > > > Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers") > > > > > prevents io_pin_pages() from pinning pages spanning multiple VMAs with > > > > > permitted characteristics (anon/huge), requiring that all VMAs share the > > > > > same vm_file. > > > > > > > > That commmit doesn't really explain why io_uring is doing such a weird > > > > thing. > > > > > > > > What exactly is the problem with mixing struct pages from different > > > > files and why of all the GUP users does only io_uring need to care > > > > about this? > > > > > > Simply because it doesn't seem sane to mix and register buffers of > > > different "nature" as one. > > > > That is not a good reason. Once things are converted to struct pages > > they don't need to care about their "nature" > > Arguing purely about uapi, I do think it is. Even though it can be > passed down and a page is a page, Frankenstein's Monster mixing anon > pages, pages for io_uring queues, device shared memory, and what not > else doesn't seem right for uapi. I see keeping buffers as a single > entity in opposite to a set of random pages beneficial for the future. Again, it is not up to io_uring to make this choice. We have GUP as part of our uAPI all over the place, GUP decides how it works, not random different ideas all over the place. We don't have these kinds of restrictions for O_DIRECT, for instance. There should be consistency in the uAPI across everything. Jason
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c index 7a43aed8e395..adc860bcbd4f 100644 --- a/io_uring/rsrc.c +++ b/io_uring/rsrc.c @@ -1141,9 +1141,8 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages, struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages) { unsigned long start, end, nr_pages; - struct vm_area_struct **vmas = NULL; struct page **pages = NULL; - int i, pret, ret = -ENOMEM; + int pret, ret = -ENOMEM; end = (ubuf + len + PAGE_SIZE - 1) >> PAGE_SHIFT; start = ubuf >> PAGE_SHIFT; @@ -1153,31 +1152,26 @@ struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages) if (!pages) goto done; - vmas = kvmalloc_array(nr_pages, sizeof(struct vm_area_struct *), - GFP_KERNEL); - if (!vmas) - goto done; - ret = 0; mmap_read_lock(current->mm); - pret = pin_user_pages(ubuf, nr_pages, FOLL_WRITE | FOLL_LONGTERM, - pages, vmas); + + pret = pin_user_pages(ubuf, nr_pages, + FOLL_WRITE | FOLL_LONGTERM | FOLL_SAME_FILE, + pages, NULL); if (pret == nr_pages) { - struct file *file = vmas[0]->vm_file; + /* + * lookup the first VMA, we require that all VMAs in range + * maintain the same file characteristics, as enforced by + * FOLL_SAME_FILE + */ + struct vm_area_struct *vma = vma_lookup(current->mm, ubuf); + struct file *file; /* don't support file backed memory */ - for (i = 0; i < nr_pages; i++) { - if (vmas[i]->vm_file != file) { - ret = -EINVAL; - break; - } - if (!file) - continue; - if (!vma_is_shmem(vmas[i]) && !is_file_hugepages(file)) { - ret = -EOPNOTSUPP; - break; - } - } + file = vma->vm_file; + if (file && !vma_is_shmem(vma) && !is_file_hugepages(file)) + ret = -EOPNOTSUPP; + *npages = nr_pages; } else { ret = pret < 0 ? pret : -EFAULT; @@ -1194,7 +1188,6 @@ struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages) } ret = 0; done: - kvfree(vmas); if (ret < 0) { kvfree(pages); pages = ERR_PTR(ret);
Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers") prevents io_pin_pages() from pinning pages spanning multiple VMAs with permitted characteristics (anon/huge), requiring that all VMAs share the same vm_file. The newly introduced FOLL_SAME_FILE flag permits this to be expressed as a GUP flag rather than having to retrieve VMAs to perform the check. We then only need to perform a VMA lookup for the first VMA to assert the anon/hugepage requirement as we know the rest of the VMAs will possess the same characteristics. Doing this eliminates the one instance of vmas being used by pin_user_pages(). Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> --- io_uring/rsrc.c | 39 ++++++++++++++++----------------------- 1 file changed, 16 insertions(+), 23 deletions(-)