Message ID | 20230126141626.2809643-3-dhowells@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | iov_iter: Improve page extraction (pin or just list) | expand |
On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote: > +/** > + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained > + * @iter: The iterator > + * > + * Examine the iterator and indicate by returning true or false as to how, if > + * at all, pages extracted from the iterator will be retained by the extraction > + * function. > + * > + * %true indicates that the pages will have a pin placed in them that the > + * caller must unpin. This is must be done for DMA/async DIO to force fork() > + * to forcibly copy a page for the child (the parent must retain the original > + * page). > + * > + * %false indicates that no measures are taken and that it's up to the caller > + * to retain the pages. > + */ > +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter) > +{ > + return user_backed_iter(iter); > +} > + Wait a sec; why would we want a pin for pages we won't be modifying? A reference - sure, but...
On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote: > On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote: > > > +/** > > + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained > > + * @iter: The iterator > > + * > > + * Examine the iterator and indicate by returning true or false as to how, if > > + * at all, pages extracted from the iterator will be retained by the extraction > > + * function. > > + * > > + * %true indicates that the pages will have a pin placed in them that the > > + * caller must unpin. This is must be done for DMA/async DIO to force fork() > > + * to forcibly copy a page for the child (the parent must retain the original > > + * page). > > + * > > + * %false indicates that no measures are taken and that it's up to the caller > > + * to retain the pages. > > + */ > > +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter) > > +{ > > + return user_backed_iter(iter); > > +} > > + > > Wait a sec; why would we want a pin for pages we won't be modifying? > A reference - sure, but... After having looked through the earlier iterations of the patchset - sorry, but that won't fly for (at least) vmsplice(). There we can't pin those suckers; thankfully, we don't need to - they are used only for fetches, so FOLL_GET is sufficient. With your "we'll just pin them, source or destination" you won't be able to convert at least that call of iov_iter_get_pages2(). And there might be other similar cases; I won't swear there's more, but ISTR running into more than one of the "pin won't be OK here, but fortunately it's a data source" places.
On 1/26/23 14:36, Al Viro wrote: ... >>> +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter) >>> +{ >>> + return user_backed_iter(iter); >>> +} >>> + >> >> Wait a sec; why would we want a pin for pages we won't be modifying? >> A reference - sure, but... > > After having looked through the earlier iterations of the patchset - > sorry, but that won't fly for (at least) vmsplice(). There we can't > pin those suckers; thankfully, we don't need to - they are used only > for fetches, so FOLL_GET is sufficient. With your "we'll just pin them, > source or destination" you won't be able to convert at least that > call of iov_iter_get_pages2(). And there might be other similar cases; > I won't swear there's more, but ISTR running into more than one of > the "pin won't be OK here, but fortunately it's a data source" places. Assuming that "page is a data source" means that we are writing out from the page to a block device (so, a WRITE operation, which of course actually *reads* from the page), then... ...one thing I'm worried about now is whether Jan's original problem report [1] can be fixed, because that involves page writeback. And it seems like we need to mark the pages involved as "maybe dma-pinned" via FOLL_PIN pins, in order to solve it. Or am I missing a key point (I hope)? [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/T/#u thanks,
On 26.01.23 23:36, Al Viro wrote: > On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote: >> On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote: >> >>> +/** >>> + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained >>> + * @iter: The iterator >>> + * >>> + * Examine the iterator and indicate by returning true or false as to how, if >>> + * at all, pages extracted from the iterator will be retained by the extraction >>> + * function. >>> + * >>> + * %true indicates that the pages will have a pin placed in them that the >>> + * caller must unpin. This is must be done for DMA/async DIO to force fork() >>> + * to forcibly copy a page for the child (the parent must retain the original >>> + * page). >>> + * >>> + * %false indicates that no measures are taken and that it's up to the caller >>> + * to retain the pages. >>> + */ >>> +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter) >>> +{ >>> + return user_backed_iter(iter); >>> +} >>> + >> >> Wait a sec; why would we want a pin for pages we won't be modifying? >> A reference - sure, but... > > After having looked through the earlier iterations of the patchset - > sorry, but that won't fly for (at least) vmsplice(). There we can't > pin those suckers; We'll need a way to pass FOLL_LONGTERM to pin_user_pages_fast() to handle such long-term pinning as vmsplice() needs. But the release path (unpin) will be the same.
Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is vmspliced into a pipe with the pipe holding a pin on it because pinned pages are removed from all page tables. Is this actually the case? I can't see offhand where in mm/gup.c it does this. David
On 27.01.23 00:56, David Howells wrote: > Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is > vmspliced into a pipe with the pipe holding a pin on it because pinned pages > are removed from all page tables. Is this actually the case? I can't see > offhand where in mm/gup.c it does this. Pinning a page is mostly taking a "special" reference on the page, indicating to the system that the page may be pinned. For an ordinary order-0 page, this is increasing the refcount by 1024 instead of 1. In addition, we'll do some COW-unsharing magic depending on the page type (e.g., anon vs. file-backed), and FOLL_LONGTERM. So if the page is mapped R/O only and we want to pin it R/O (!FOLL_WRITE), we might replace it in the page table by a different page via a fault (FAULT_FLAG_UNSHARE). Last but not least, with FOLL_LONGTERM we will make sure to migrate the target page off of MIGRATE_MOVABLE/CMA memory where the unmovable page (while pinned) could otherwise cause trouble (e.g., blocking memory hotunplug). So again, we'd replace it in the page table by a different page via a fault. In all cases, the page won't be unmapped from the page table.
On Thu, Jan 26, 2023 at 11:56:50PM +0000, David Howells wrote: > Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is > vmspliced into a pipe with the pipe holding a pin on it because pinned pages > are removed from all page tables. Is this actually the case? I can't see > offhand where in mm/gup.c it does this. It doesn't; sorry, really confused memories of what's going on, took a while to sort them out (FWIW, writeback is where we unmap and check if the page is pinned, while pin_user_pages running into an unmapped page will end up with handle_mm_fault() (->fault(), actually) trying to get the sucker locked and blocking on that until the writeback is over). Said that, I still think that pinned pages (arbitrary pagecache ones, at that) ending up in a pipe is a seriously bad idea. It's trivial to arrange for them to stay that way indefinitely - no privileges needed, very few limits, etc.
On Fri, Jan 27, 2023 at 12:52:38AM +0000, Al Viro wrote: > On Thu, Jan 26, 2023 at 11:56:50PM +0000, David Howells wrote: > > Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is > > vmspliced into a pipe with the pipe holding a pin on it because pinned pages > > are removed from all page tables. Is this actually the case? I can't see > > offhand where in mm/gup.c it does this. > > It doesn't; sorry, really confused memories of what's going on, took a while > to sort them out (FWIW, writeback is where we unmap and check if page is > pinned, while pin_user_pages running into an unmapped page will end up > with handle_mm_fault() (->fault(), actually) try to get the sucker locked > and block on that until the writeback is over). Umm... OK, I really need to reread that area. Hopefully will be done with that by tomorrow...
On Fri, Jan 27, 2023 at 12:44:08AM +0100, David Hildenbrand wrote: > On 26.01.23 23:36, Al Viro wrote: > > On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote: > > > On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote: > > > > > > > +/** > > > > + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained > > > > + * @iter: The iterator > > > > + * > > > > + * Examine the iterator and indicate by returning true or false as to how, if > > > > + * at all, pages extracted from the iterator will be retained by the extraction > > > > + * function. > > > > + * > > > > + * %true indicates that the pages will have a pin placed in them that the > > > > + * caller must unpin. This is must be done for DMA/async DIO to force fork() > > > > + * to forcibly copy a page for the child (the parent must retain the original > > > > + * page). > > > > + * > > > > + * %false indicates that no measures are taken and that it's up to the caller > > > > + * to retain the pages. > > > > + */ > > > > +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter) > > > > +{ > > > > + return user_backed_iter(iter); > > > > +} > > > > + > > > > > > Wait a sec; why would we want a pin for pages we won't be modifying? > > > A reference - sure, but... > > > > After having looked through the earlier iterations of the patchset - > > sorry, but that won't fly for (at least) vmsplice(). There we can't > > pin those suckers; > > We'll need a way to pass FOLL_LONGTERM to pin_user_pages_fast() to handle > such long-term pinning as vmsplice() needs. But the release path (unpin) > will be the same. Umm... Are you saying that if the source area contains DAX mmaps, vmsplice() from it will fail?
On Fri 27-01-23 02:02:31, Al Viro wrote: > On Fri, Jan 27, 2023 at 12:44:08AM +0100, David Hildenbrand wrote: > > On 26.01.23 23:36, Al Viro wrote: > > > On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote: > > > > On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote: > > > > > > > > > +/** > > > > > + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained > > > > > + * @iter: The iterator > > > > > + * > > > > > + * Examine the iterator and indicate by returning true or false as to how, if > > > > > + * at all, pages extracted from the iterator will be retained by the extraction > > > > > + * function. > > > > > + * > > > > > + * %true indicates that the pages will have a pin placed in them that the > > > > > + * caller must unpin. This is must be done for DMA/async DIO to force fork() > > > > > + * to forcibly copy a page for the child (the parent must retain the original > > > > > + * page). > > > > > + * > > > > > + * %false indicates that no measures are taken and that it's up to the caller > > > > > + * to retain the pages. > > > > > + */ > > > > > +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter) > > > > > +{ > > > > > + return user_backed_iter(iter); > > > > > +} > > > > > + > > > > > > > > Wait a sec; why would we want a pin for pages we won't be modifying? > > > > A reference - sure, but... > > > > > > After having looked through the earlier iterations of the patchset - > > > sorry, but that won't fly for (at least) vmsplice(). There we can't > > > pin those suckers; > > > > We'll need a way to pass FOLL_LONGTERM to pin_user_pages_fast() to handle > > such long-term pinning as vmsplice() needs. But the release path (unpin) > > will be the same. > > Umm... Are you saying that if the source area contains DAX mmaps, vmsplice() > from it will fail? Yes, that's the plan. Because as you wrote elsewhere, it is otherwise too easy to lock up operations such as truncate(2) on DAX filesystems. Honza
On 27.01.23 13:30, Jan Kara wrote: > On Fri 27-01-23 02:02:31, Al Viro wrote: >> On Fri, Jan 27, 2023 at 12:44:08AM +0100, David Hildenbrand wrote: >>> On 26.01.23 23:36, Al Viro wrote: >>>> On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote: >>>>> On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote: >>>>> >>>>>> +/** >>>>>> + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained >>>>>> + * @iter: The iterator >>>>>> + * >>>>>> + * Examine the iterator and indicate by returning true or false as to how, if >>>>>> + * at all, pages extracted from the iterator will be retained by the extraction >>>>>> + * function. >>>>>> + * >>>>>> + * %true indicates that the pages will have a pin placed in them that the >>>>>> + * caller must unpin. This is must be done for DMA/async DIO to force fork() >>>>>> + * to forcibly copy a page for the child (the parent must retain the original >>>>>> + * page). >>>>>> + * >>>>>> + * %false indicates that no measures are taken and that it's up to the caller >>>>>> + * to retain the pages. >>>>>> + */ >>>>>> +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter) >>>>>> +{ >>>>>> + return user_backed_iter(iter); >>>>>> +} >>>>>> + >>>>> >>>>> Wait a sec; why would we want a pin for pages we won't be modifying? >>>>> A reference - sure, but... >>>> >>>> After having looked through the earlier iterations of the patchset - >>>> sorry, but that won't fly for (at least) vmsplice(). There we can't >>>> pin those suckers; >>> >>> We'll need a way to pass FOLL_LONGTERM to pin_user_pages_fast() to handle >>> such long-term pinning as vmsplice() needs. But the release path (unpin) >>> will be the same. >> >> Umm... Are you saying that if the source area contains DAX mmaps, vmsplice() >> from it will fail? > > Yes, that's the plan. Because as you wrote elsewhere, it is otherwise too easy > to lock up operations such as truncate(2) on DAX filesystems. 
Right, it's then the same behavior as we already have for other FOLL_LONGTERM users, such as RDMA or io_uring. ... if we're afraid of breaking existing setups we could add some kind of fallback to copy to a buffer like ordinary pipe writes.
On Fri 27-01-23 00:52:38, Al Viro wrote: > On Thu, Jan 26, 2023 at 11:56:50PM +0000, David Howells wrote: > > Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is > > vmspliced into a pipe with the pipe holding a pin on it because pinned pages > > are removed from all page tables. Is this actually the case? I can't see > > offhand where in mm/gup.c it does this. > > It doesn't; sorry, really confused memories of what's going on, took a while > to sort them out (FWIW, writeback is where we unmap and check if page is > pinned, while pin_user_pages running into an unmapped page will end up > with handle_mm_fault() (->fault(), actually) try to get the sucker locked > and block on that until the writeback is over). > > Said that, I still think that pinned pages (arbitrary pagecache ones, > at that) ending up in a pipe is a seriously bad idea. It's trivial to > arrange for them to stay that way indefinitely - no priveleges needed, > very few limits, etc. I tend to agree but is there a big difference compared to normal page references? There's no difference for memory usage, pages still can be truncated from the file and disk space reclaimed (this is where DAX has problems...) so standard file operations won't notice. The only difference is that they could stay permanently dirty (we don't know whether the pin owner copies data to or from the page) so it could cause trouble with dirty throttling - and it is really only the throttling itself - page reclaim will have the same troubles with both pins and ordinary page references... Am I missing something? Honza
diff --git a/include/linux/uio.h b/include/linux/uio.h index bf77cd3d5fb1..b1be128bb2fa 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -361,9 +361,34 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction, .count = count }; } - /* Flags for iov_iter_get/extract_pages*() */ /* Allow P2PDMA on the extracted pages */ #define ITER_ALLOW_P2PDMA ((__force iov_iter_extraction_t)0x01) +ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages, + size_t maxsize, unsigned int maxpages, + iov_iter_extraction_t extraction_flags, + size_t *offset0); + +/** + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained + * @iter: The iterator + * + * Examine the iterator and indicate by returning true or false as to how, if + * at all, pages extracted from the iterator will be retained by the extraction + * function. + * + * %true indicates that the pages will have a pin placed in them that the + * caller must unpin. This must be done for DMA/async DIO to force fork() + * to forcibly copy a page for the child (the parent must retain the original + * page). + * + * %false indicates that no measures are taken and that it's up to the caller + * to retain the pages. + */ +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter) +{ + return user_backed_iter(iter); +} + #endif diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 553afc870866..d69a05950555 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -1916,3 +1916,324 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state) i->iov -= state->nr_segs - i->nr_segs; i->nr_segs = state->nr_segs; } + +/* + * Extract a list of contiguous pages from an ITER_PIPE iterator. This does + * not get references of its own on the pages, nor does it get a pin on them. + * If there's a partial page, it adds that first and will then allocate and add + * pages into the pipe to make up the buffer space to the amount required. 
+ * + * The caller must hold the pipe locked and only transferring into a pipe is + * supported. + */ +static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i, + struct page ***pages, size_t maxsize, + unsigned int maxpages, + iov_iter_extraction_t extraction_flags, + size_t *offset0) +{ + unsigned int nr, offset, chunk, j; + struct page **p; + size_t left; + + if (!sanity(i)) + return -EFAULT; + + offset = pipe_npages(i, &nr); + if (!nr) + return -EFAULT; + *offset0 = offset; + + maxpages = min_t(size_t, nr, maxpages); + maxpages = want_pages_array(pages, maxsize, offset, maxpages); + if (!maxpages) + return -ENOMEM; + p = *pages; + + left = maxsize; + for (j = 0; j < maxpages; j++) { + struct page *page = append_pipe(i, left, &offset); + if (!page) + break; + chunk = min_t(size_t, left, PAGE_SIZE - offset); + left -= chunk; + *p++ = page; + } + if (!j) + return -EFAULT; + return maxsize - left; +} + +/* + * Extract a list of contiguous pages from an ITER_XARRAY iterator. This does not + * get references on the pages, nor does it get a pin on them. + */ +static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i, + struct page ***pages, size_t maxsize, + unsigned int maxpages, + iov_iter_extraction_t extraction_flags, + size_t *offset0) +{ + struct page *page, **p; + unsigned int nr = 0, offset; + loff_t pos = i->xarray_start + i->iov_offset; + pgoff_t index = pos >> PAGE_SHIFT; + XA_STATE(xas, i->xarray, index); + + offset = pos & ~PAGE_MASK; + *offset0 = offset; + + maxpages = want_pages_array(pages, maxsize, offset, maxpages); + if (!maxpages) + return -ENOMEM; + p = *pages; + + rcu_read_lock(); + for (page = xas_load(&xas); page; page = xas_next(&xas)) { + if (xas_retry(&xas, page)) + continue; + + /* Has the page moved or been split? 
*/ + if (unlikely(page != xas_reload(&xas))) { + xas_reset(&xas); + continue; + } + + p[nr++] = find_subpage(page, xas.xa_index); + if (nr == maxpages) + break; + } + rcu_read_unlock(); + + maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize); + iov_iter_advance(i, maxsize); + return maxsize; +} + +/* + * Extract a list of contiguous pages from an ITER_BVEC iterator. This does + * not get references on the pages, nor does it get a pin on them. + */ +static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i, + struct page ***pages, size_t maxsize, + unsigned int maxpages, + iov_iter_extraction_t extraction_flags, + size_t *offset0) +{ + struct page **p, *page; + size_t skip = i->iov_offset, offset; + int k; + + for (;;) { + if (i->nr_segs == 0) + return 0; + maxsize = min(maxsize, i->bvec->bv_len - skip); + if (maxsize) + break; + i->iov_offset = 0; + i->nr_segs--; + i->bvec++; + skip = 0; + } + + skip += i->bvec->bv_offset; + page = i->bvec->bv_page + skip / PAGE_SIZE; + offset = skip % PAGE_SIZE; + *offset0 = offset; + + maxpages = want_pages_array(pages, maxsize, offset, maxpages); + if (!maxpages) + return -ENOMEM; + p = *pages; + for (k = 0; k < maxpages; k++) + p[k] = page + k; + + maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset); + iov_iter_advance(i, maxsize); + return maxsize; +} + +/* + * Extract a list of virtually contiguous pages from an ITER_KVEC iterator. + * This does not get references on the pages, nor does it get a pin on them. 
+ */ +static ssize_t iov_iter_extract_kvec_pages(struct iov_iter *i, + struct page ***pages, size_t maxsize, + unsigned int maxpages, + iov_iter_extraction_t extraction_flags, + size_t *offset0) +{ + struct page **p, *page; + const void *kaddr; + size_t skip = i->iov_offset, offset, len; + int k; + + for (;;) { + if (i->nr_segs == 0) + return 0; + maxsize = min(maxsize, i->kvec->iov_len - skip); + if (maxsize) + break; + i->iov_offset = 0; + i->nr_segs--; + i->kvec++; + skip = 0; + } + + kaddr = i->kvec->iov_base + skip; + offset = (unsigned long)kaddr & ~PAGE_MASK; + *offset0 = offset; + + maxpages = want_pages_array(pages, maxsize, offset, maxpages); + if (!maxpages) + return -ENOMEM; + p = *pages; + + kaddr -= offset; + len = offset + maxsize; + for (k = 0; k < maxpages; k++) { + size_t seg = min_t(size_t, len, PAGE_SIZE); + + if (is_vmalloc_or_module_addr(kaddr)) + page = vmalloc_to_page(kaddr); + else + page = virt_to_page(kaddr); + + p[k] = page; + len -= seg; + kaddr += PAGE_SIZE; + } + + maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset); + iov_iter_advance(i, maxsize); + return maxsize; +} + +/* + * Extract a list of contiguous pages from a user iterator and get a pin on + * each of them. This should only be used if the iterator is user-backed + * (IOBUF/UBUF). + * + * It does not get refs on the pages, but the pages must be unpinned by the + * caller once the transfer is complete. + * + * This is safe to be used where background IO/DMA *is* going to be modifying + * the buffer; using a pin rather than a ref forces fork() to give the + * child a copy of the page. 
+ */ +static ssize_t iov_iter_extract_user_pages(struct iov_iter *i, + struct page ***pages, + size_t maxsize, + unsigned int maxpages, + iov_iter_extraction_t extraction_flags, + size_t *offset0) +{ + unsigned long addr; + unsigned int gup_flags = FOLL_PIN; + size_t offset; + int res; + + if (i->data_source == ITER_DEST) + gup_flags |= FOLL_WRITE; + if (extraction_flags & ITER_ALLOW_P2PDMA) + gup_flags |= FOLL_PCI_P2PDMA; + if (i->nofault) + gup_flags |= FOLL_NOFAULT; + + addr = first_iovec_segment(i, &maxsize); + *offset0 = offset = addr % PAGE_SIZE; + addr &= PAGE_MASK; + maxpages = want_pages_array(pages, maxsize, offset, maxpages); + if (!maxpages) + return -ENOMEM; + res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages); + if (unlikely(res <= 0)) + return res; + maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset); + iov_iter_advance(i, maxsize); + return maxsize; +} + +/** + * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator + * @i: The iterator to extract from + * @pages: Where to return the list of pages + * @maxsize: The maximum amount of iterator to extract + * @maxpages: The maximum size of the list of pages + * @extraction_flags: Flags to qualify request + * @offset0: Where to return the starting offset into (*@pages)[0] + * + * Extract a list of contiguous pages from the current point of the iterator, + * advancing the iterator. The maximum number of pages and the maximum amount + * of page contents can be set. + * + * If *@pages is NULL, a page list will be allocated to the required size and + * *@pages will be set to its base. If *@pages is not NULL, it will be assumed + * that the caller allocated a page list at least @maxpages in size and this + * will be filled in. + * + * @extraction_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer DMA + * be allowed on the pages extracted. + * + * The iov_iter_extract_will_pin() function can be used to query how cleanup + * should be performed. 
+ * + * Extra refs or pins on the pages may be obtained as follows: + * + * (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF), pins will be + * added to the pages, but refs will not be taken. + * iov_iter_extract_will_pin() will return true. + * + * (*) If the iterator is ITER_PIPE, this must describe a destination for the + * data. Additional pages may be allocated and added to the pipe (which + * will hold the refs), but pins will not be obtained for the caller. The + * caller must hold the pipe lock. iov_iter_extract_will_pin() will + * return false. + * + * (*) If the iterator is ITER_KVEC, ITER_BVEC or ITER_XARRAY, the pages are + * merely listed; no extra refs or pins are obtained. + * iov_iter_extract_will_pin() will return false. + * + * Note also: + * + * (*) Use with ITER_DISCARD is not supported as that has no content. + * + * On success, the function sets *@pages to the new pagelist, if allocated, and + * sets *offset0 to the offset into the first page. + * + * It may also return -ENOMEM and -EFAULT. 
+ */ +ssize_t iov_iter_extract_pages(struct iov_iter *i, + struct page ***pages, + size_t maxsize, + unsigned int maxpages, + iov_iter_extraction_t extraction_flags, + size_t *offset0) +{ + maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT); + if (!maxsize) + return 0; + + if (likely(user_backed_iter(i))) + return iov_iter_extract_user_pages(i, pages, maxsize, + maxpages, extraction_flags, + offset0); + if (iov_iter_is_kvec(i)) + return iov_iter_extract_kvec_pages(i, pages, maxsize, + maxpages, extraction_flags, + offset0); + if (iov_iter_is_bvec(i)) + return iov_iter_extract_bvec_pages(i, pages, maxsize, + maxpages, extraction_flags, + offset0); + if (iov_iter_is_pipe(i)) + return iov_iter_extract_pipe_pages(i, pages, maxsize, + maxpages, extraction_flags, + offset0); + if (iov_iter_is_xarray(i)) + return iov_iter_extract_xarray_pages(i, pages, maxsize, + maxpages, extraction_flags, + offset0); + return -EFAULT; +} +EXPORT_SYMBOL_GPL(iov_iter_extract_pages);
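To make the intended calling convention concrete, here is a rough caller-side sketch. This is kernel-style C that will not compile standalone; example_dio() and do_dma_to_pages() are hypothetical stand-ins for a real I/O path, while iov_iter_extract_pages(), iov_iter_extract_will_pin() and unpin_user_pages() are the interfaces discussed in this thread.

```c
/* Hypothetical direct-I/O caller: extract the pages, perform the transfer,
 * then release them according to what iov_iter_extract_will_pin() reports. */
static ssize_t example_dio(struct iov_iter *iter, size_t maxsize)
{
	struct page **pages = NULL;	/* let the extractor allocate the list */
	size_t offset0;
	ssize_t len;

	len = iov_iter_extract_pages(iter, &pages, maxsize, INT_MAX,
				     0, &offset0);
	if (len <= 0)
		return len;

	do_dma_to_pages(pages, len, offset0);	/* hypothetical transfer */

	/* User-backed iterators hand back pinned pages that we must unpin;
	 * for kvec/bvec/xarray/pipe it's the caller's job to keep the pages
	 * alive by other means. */
	if (iov_iter_extract_will_pin(iter))
		unpin_user_pages(pages, DIV_ROUND_UP(offset0 + len, PAGE_SIZE));

	kvfree(pages);
	return len;
}
```

Note the cleanup branch: this is exactly the query that iov_iter_extract_will_pin() exists to answer, and the point Al is probing above - whether a pin (rather than a plain ref) is the right retention model for every source iterator.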