diff mbox series

[v11,2/8] iov_iter: Add a function to extract a page list from an iterator

Message ID 20230126141626.2809643-3-dhowells@redhat.com (mailing list archive)
State New
Series iov_iter: Improve page extraction (pin or just list)

Commit Message

David Howells Jan. 26, 2023, 2:16 p.m. UTC
Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator.  The pages may be returned with a pin added or with nothing
taken on them, depending on the type of iterator.

Add a second function, iov_iter_extract_will_pin(), to determine how the
cleanup should be done.

There are two cases:

 (1) ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have pins (FOLL_PIN) obtained on them so that a
     concurrent fork() will forcibly copy the page so that DMA is done
     to/from the parent's buffer and is unavailable to/unaffected by the
     child process.

     iov_iter_extract_will_pin() will return true for this case.  The
     caller should use something like unpin_user_page() to dispose of the
     page.

 (2) Any other sort of iterator.

     No refs or pins are obtained on the pages; the assumption is made that
     the caller will manage page retention.

     iov_iter_extract_will_pin() will return false.  The pages don't need
     additional disposal.
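
The two cases above can be sketched as a small userspace model. This is only an illustration of the cleanup rule, not kernel code: every `model_*` name, the enum and the `pins` counter are hypothetical stand-ins, and the real extraction and unpinning go through the kernel's GUP machinery.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's iterator types. */
enum model_iter_type {
	MODEL_ITER_IOVEC,
	MODEL_ITER_UBUF,
	MODEL_ITER_KVEC,
	MODEL_ITER_BVEC,
	MODEL_ITER_PIPE,
	MODEL_ITER_XARRAY,
};

/* A page reduced to the one thing this model tracks. */
struct model_page {
	int pins;	/* outstanding pins taken by extraction */
};

/* Case (1) vs case (2): only user-backed iterators result in pins. */
static bool model_extract_will_pin(enum model_iter_type type)
{
	return type == MODEL_ITER_IOVEC || type == MODEL_ITER_UBUF;
}

/* Model of extraction: pin each page in case (1), do nothing in case (2). */
static void model_extract(enum model_iter_type type,
			  struct model_page *pages, size_t nr)
{
	if (!model_extract_will_pin(type))
		return;
	for (size_t k = 0; k < nr; k++)
		pages[k].pins++;
}

/*
 * Model of the caller's cleanup once the I/O is done: drop the pins only
 * if extraction took them.  Returns the number of pages unpinned.
 */
static size_t model_release(enum model_iter_type type,
			    struct model_page *pages, size_t nr)
{
	size_t unpinned = 0;

	if (!model_extract_will_pin(type))
		return 0;
	for (size_t k = 0; k < nr; k++) {
		pages[k].pins--;
		unpinned++;
	}
	return unpinned;
}
```

The point of the model is that the caller asks the same question at cleanup time that the extraction code answered at extraction time, so the two cannot disagree about whether a pin is outstanding.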

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

Notes:
    ver #11)
     - Fix iov_iter_extract_kvec_pages() to include the offset into the page in
       the returned starting offset.
     - Use __bitwise for the extraction flags
    
    ver #10)
     - Fix use of i->kvec in iov_iter_extract_bvec_pages() to be i->bvec.
    
    ver #9)
     - Rename iov_iter_extract_mode() to iov_iter_extract_will_pin() and make
       it return true/false not FOLL_PIN/0 as FOLL_PIN is going to be made
       private to mm/.
     - Change extract_flags to extraction_flags.
    
    ver #8)
     - It seems that all DIO is supposed to be done under FOLL_PIN now, and not
       FOLL_GET, so switch to only using pin_user_pages() for user-backed
       iters.
     - Wrap an argument in brackets in the iov_iter_extract_mode() macro.
     - Drop the extract_flags argument to iov_iter_extract_mode() for now
       [hch].
    
    ver #7)
     - Switch to passing in iter-specific flags rather than FOLL_* flags.
     - Drop the direction flags for now.
     - Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
     - Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
     - Add support for extraction from KVEC-type iters.
     - Use iov_iter_advance() rather than open-coding it.
     - Make BVEC- and KVEC-type skip over initial empty vectors.
    
    ver #6)
     - Add back the function to indicate the cleanup mode.
     - Drop the cleanup_mode return arg to iov_iter_extract_pages().
     - Pass FOLL_SOURCE/DEST_BUF in gup_flags.  Check this against the iter
       data_source.
    
    ver #4)
     - Use ITER_SOURCE/DEST instead of WRITE/READ.
     - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.
    
    ver #3)
     - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
       to get/pin_user_pages_fast()[1].

 include/linux/uio.h |  27 +++-
 lib/iov_iter.c      | 321 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 347 insertions(+), 1 deletion(-)

Comments

Al Viro Jan. 26, 2023, 9:59 p.m. UTC | #1
On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote:

> +/**
> + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
> + * @iter: The iterator
> + *
> + * Examine the iterator and indicate by returning true or false as to how, if
> + * at all, pages extracted from the iterator will be retained by the extraction
> + * function.
> + *
> + * %true indicates that the pages will have a pin placed in them that the
> + * caller must unpin.  This must be done for DMA/async DIO to force fork()
> + * to forcibly copy a page for the child (the parent must retain the original
> + * page).
> + *
> + * %false indicates that no measures are taken and that it's up to the caller
> + * to retain the pages.
> + */
> +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
> +{
> +	return user_backed_iter(iter);
> +}
> +

Wait a sec; why would we want a pin for pages we won't be modifying?
A reference - sure, but...
Al Viro Jan. 26, 2023, 10:36 p.m. UTC | #2
On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote:
> On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote:
> 
> > +/**
> > + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
> > + * @iter: The iterator
> > + *
> > + * Examine the iterator and indicate by returning true or false as to how, if
> > + * at all, pages extracted from the iterator will be retained by the extraction
> > + * function.
> > + *
> > + * %true indicates that the pages will have a pin placed in them that the
> > + * caller must unpin.  This must be done for DMA/async DIO to force fork()
> > + * to forcibly copy a page for the child (the parent must retain the original
> > + * page).
> > + *
> > + * %false indicates that no measures are taken and that it's up to the caller
> > + * to retain the pages.
> > + */
> > +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
> > +{
> > +	return user_backed_iter(iter);
> > +}
> > +
> 
> Wait a sec; why would we want a pin for pages we won't be modifying?
> A reference - sure, but...

After having looked through the earlier iterations of the patchset -
sorry, but that won't fly for (at least) vmsplice().  There we can't
pin those suckers; thankfully, we don't need to - they are used only
for fetches, so FOLL_GET is sufficient.  With your "we'll just pin them,
source or destination" you won't be able to convert at least that
call of iov_iter_get_pages2().  And there might be other similar cases;
I won't swear there's more, but ISTR running into more than one of
the "pin won't be OK here, but fortunately it's a data source" places.
John Hubbard Jan. 26, 2023, 10:49 p.m. UTC | #3
On 1/26/23 14:36, Al Viro wrote:
...
>>> +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
>>> +{
>>> +	return user_backed_iter(iter);
>>> +}
>>> +
>>
>> Wait a sec; why would we want a pin for pages we won't be modifying?
>> A reference - sure, but...
> 
> After having looked through the earlier iterations of the patchset -
> sorry, but that won't fly for (at least) vmsplice().  There we can't
> pin those suckers; thankfully, we don't need to - they are used only
> for fetches, so FOLL_GET is sufficient.  With your "we'll just pin them,
> source or destination" you won't be able to convert at least that
> call of iov_iter_get_pages2().  And there might be other similar cases;
> I won't swear there's more, but ISTR running into more than one of
> the "pin won't be OK here, but fortunately it's a data source" places.

Assuming that "page is a data source" means that we are writing out from
the page to a block device (so, a WRITE operation, which of course
actually *reads* from the page), then...

...one thing I'm worried about now is whether Jan's original problem
report [1] can be fixed, because that involves page writeback. And it
seems like we need to mark the pages involved as "maybe dma-pinned" via
FOLL_PIN pins, in order to solve it.

Or am I missing a key point (I hope)?


[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/T/#u

thanks,
David Hildenbrand Jan. 26, 2023, 11:44 p.m. UTC | #4
On 26.01.23 23:36, Al Viro wrote:
> On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote:
>> On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote:
>>
>>> +/**
>>> + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
>>> + * @iter: The iterator
>>> + *
>>> + * Examine the iterator and indicate by returning true or false as to how, if
>>> + * at all, pages extracted from the iterator will be retained by the extraction
>>> + * function.
>>> + *
>>> + * %true indicates that the pages will have a pin placed in them that the
>>> + * caller must unpin.  This must be done for DMA/async DIO to force fork()
>>> + * to forcibly copy a page for the child (the parent must retain the original
>>> + * page).
>>> + *
>>> + * %false indicates that no measures are taken and that it's up to the caller
>>> + * to retain the pages.
>>> + */
>>> +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
>>> +{
>>> +	return user_backed_iter(iter);
>>> +}
>>> +
>>
>> Wait a sec; why would we want a pin for pages we won't be modifying?
>> A reference - sure, but...
> 
> After having looked through the earlier iterations of the patchset -
> sorry, but that won't fly for (at least) vmsplice().  There we can't
> pin those suckers; 

We'll need a way to pass FOLL_LONGTERM to pin_user_pages_fast() to 
handle such long-term pinning as vmsplice() needs. But the release path 
(unpin) will be the same.
David Howells Jan. 26, 2023, 11:56 p.m. UTC | #5
Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is
vmspliced into a pipe with the pipe holding a pin on it because pinned pages
are removed from all page tables.  Is this actually the case?  I can't see
offhand where in mm/gup.c it does this.

David
David Hildenbrand Jan. 27, 2023, 12:10 a.m. UTC | #6
On 27.01.23 00:56, David Howells wrote:
> Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is
> vmspliced into a pipe with the pipe holding a pin on it because pinned pages
> are removed from all page tables.  Is this actually the case?  I can't see
> offhand where in mm/gup.c it does this.

Pinning a page is mostly taking a "special" reference on the page, 
indicating to the system that the page may be pinned. For an ordinary 
order-0 page, this is increasing the refcount by 1024 instead of 1.

In addition, we'll do some COW-unsharing magic depending on the page 
type (e.g., anon vs. file-backed), and FOLL_LONGTERM. So if the page is 
mapped R/O and we want to pin it R/O (!FOLL_WRITE), we might 
replace it in the page table by a different page via a fault 
(FAULT_FLAG_UNSHARE).

Last but not least, with FOLL_LONGTERM we will make sure to migrate the 
target page off of MIGRATE_MOVABLE/CMA memory where the unmovable page 
(while pinned) could otherwise cause trouble (e.g., blocking memory 
hotunplug). So again, we'd replace it in the page table by a different 
page via a fault.

In all cases, the page won't be unmapped from the page table.
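
The "special reference" described above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: the bias of 1024 for an ordinary order-0 page is taken from the message above, while the `model_*` names and the single `refcount` field are hypothetical and ignore large pages, COW-unsharing and migration entirely.

```c
#include <stdbool.h>

/* One pin counts as this many plain references (order-0 page, per the text). */
#define MODEL_PIN_BIAS 1024

/* A page reduced to its reference count. */
struct model_page {
	int refcount;
};

/* An ordinary reference moves the count by 1. */
static void model_get_page(struct model_page *p)
{
	p->refcount += 1;
}

static void model_put_page(struct model_page *p)
{
	p->refcount -= 1;
}

/* A pin moves the count by the bias instead. */
static void model_pin_page(struct model_page *p)
{
	p->refcount += MODEL_PIN_BIAS;
}

static void model_unpin_page(struct model_page *p)
{
	p->refcount -= MODEL_PIN_BIAS;
}

/*
 * Heuristic only: a count at or above the bias suggests at least one pin.
 * A page with very many ordinary references would also test true, which is
 * why the answer is "maybe pinned" rather than "pinned".
 */
static bool model_maybe_pinned(const struct model_page *p)
{
	return p->refcount >= MODEL_PIN_BIAS;
}
```

In this model the release path differs from an ordinary put only in the amount subtracted, and nothing about pinning touches the page tables.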
Al Viro Jan. 27, 2023, 12:52 a.m. UTC | #7
On Thu, Jan 26, 2023 at 11:56:50PM +0000, David Howells wrote:
> Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is
> vmspliced into a pipe with the pipe holding a pin on it because pinned pages
> are removed from all page tables.  Is this actually the case?  I can't see
> offhand where in mm/gup.c it does this.

It doesn't; sorry, really confused memories of what's going on, took a while
to sort them out (FWIW, writeback is where we unmap and check if page is
pinned, while pin_user_pages running into an unmapped page will end up
with handle_mm_fault() (->fault(), actually) try to get the sucker locked
and block on that until the writeback is over).

That said, I still think that pinned pages (arbitrary pagecache ones,
at that) ending up in a pipe is a seriously bad idea.  It's trivial to
arrange for them to stay that way indefinitely - no privileges needed,
very few limits, etc.
Al Viro Jan. 27, 2023, 1:21 a.m. UTC | #8
On Fri, Jan 27, 2023 at 12:52:38AM +0000, Al Viro wrote:
> On Thu, Jan 26, 2023 at 11:56:50PM +0000, David Howells wrote:
> > Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is
> > vmspliced into a pipe with the pipe holding a pin on it because pinned pages
> > are removed from all page tables.  Is this actually the case?  I can't see
> > offhand where in mm/gup.c it does this.
> 
> It doesn't; sorry, really confused memories of what's going on, took a while
> to sort them out (FWIW, writeback is where we unmap and check if page is
> pinned, while pin_user_pages running into an unmapped page will end up
> with handle_mm_fault() (->fault(), actually) try to get the sucker locked
> and block on that until the writeback is over).

Umm...  OK, I really need to reread that area.  Hopefully will be done with
that by tomorrow...
Al Viro Jan. 27, 2023, 2:02 a.m. UTC | #9
On Fri, Jan 27, 2023 at 12:44:08AM +0100, David Hildenbrand wrote:
> On 26.01.23 23:36, Al Viro wrote:
> > On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote:
> > > On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote:
> > > 
> > > > +/**
> > > > + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
> > > > + * @iter: The iterator
> > > > + *
> > > > + * Examine the iterator and indicate by returning true or false as to how, if
> > > > + * at all, pages extracted from the iterator will be retained by the extraction
> > > > + * function.
> > > > + *
> > > > + * %true indicates that the pages will have a pin placed in them that the
> > > > > + * caller must unpin.  This must be done for DMA/async DIO to force fork()
> > > > + * to forcibly copy a page for the child (the parent must retain the original
> > > > + * page).
> > > > + *
> > > > + * %false indicates that no measures are taken and that it's up to the caller
> > > > + * to retain the pages.
> > > > + */
> > > > +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
> > > > +{
> > > > +	return user_backed_iter(iter);
> > > > +}
> > > > +
> > > 
> > > Wait a sec; why would we want a pin for pages we won't be modifying?
> > > A reference - sure, but...
> > 
> > After having looked through the earlier iterations of the patchset -
> > sorry, but that won't fly for (at least) vmsplice().  There we can't
> > pin those suckers;
> 
> We'll need a way to pass FOLL_LONGTERM to pin_user_pages_fast() to handle
> such long-term pinning as vmsplice() needs. But the release path (unpin)
> will be the same.

Umm...  Are you saying that if the source area contains DAX mmaps, vmsplice()
from it will fail?
Jan Kara Jan. 27, 2023, 12:30 p.m. UTC | #10
On Fri 27-01-23 02:02:31, Al Viro wrote:
> On Fri, Jan 27, 2023 at 12:44:08AM +0100, David Hildenbrand wrote:
> > On 26.01.23 23:36, Al Viro wrote:
> > > On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote:
> > > > On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote:
> > > > 
> > > > > +/**
> > > > > + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
> > > > > + * @iter: The iterator
> > > > > + *
> > > > > + * Examine the iterator and indicate by returning true or false as to how, if
> > > > > + * at all, pages extracted from the iterator will be retained by the extraction
> > > > > + * function.
> > > > > + *
> > > > > + * %true indicates that the pages will have a pin placed in them that the
> > > > > > + * caller must unpin.  This must be done for DMA/async DIO to force fork()
> > > > > + * to forcibly copy a page for the child (the parent must retain the original
> > > > > + * page).
> > > > > + *
> > > > > + * %false indicates that no measures are taken and that it's up to the caller
> > > > > + * to retain the pages.
> > > > > + */
> > > > > +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
> > > > > +{
> > > > > +	return user_backed_iter(iter);
> > > > > +}
> > > > > +
> > > > 
> > > > Wait a sec; why would we want a pin for pages we won't be modifying?
> > > > A reference - sure, but...
> > > 
> > > After having looked through the earlier iterations of the patchset -
> > > sorry, but that won't fly for (at least) vmsplice().  There we can't
> > > pin those suckers;
> > 
> > We'll need a way to pass FOLL_LONGTERM to pin_user_pages_fast() to handle
> > such long-term pinning as vmsplice() needs. But the release path (unpin)
> > will be the same.
> 
> Umm...  Are you saying that if the source area contains DAX mmaps, vmsplice()
> from it will fail?

Yes, that's the plan. Because as you wrote elsewhere, it is otherwise too easy
to lock up operations such as truncate(2) on DAX filesystems.

								Honza
David Hildenbrand Jan. 27, 2023, 12:34 p.m. UTC | #11
On 27.01.23 13:30, Jan Kara wrote:
> On Fri 27-01-23 02:02:31, Al Viro wrote:
>> On Fri, Jan 27, 2023 at 12:44:08AM +0100, David Hildenbrand wrote:
>>> On 26.01.23 23:36, Al Viro wrote:
>>>> On Thu, Jan 26, 2023 at 09:59:36PM +0000, Al Viro wrote:
>>>>> On Thu, Jan 26, 2023 at 02:16:20PM +0000, David Howells wrote:
>>>>>
>>>>>> +/**
>>>>>> + * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
>>>>>> + * @iter: The iterator
>>>>>> + *
>>>>>> + * Examine the iterator and indicate by returning true or false as to how, if
>>>>>> + * at all, pages extracted from the iterator will be retained by the extraction
>>>>>> + * function.
>>>>>> + *
>>>>>> + * %true indicates that the pages will have a pin placed in them that the
>>>>>> + * caller must unpin.  This must be done for DMA/async DIO to force fork()
>>>>>> + * to forcibly copy a page for the child (the parent must retain the original
>>>>>> + * page).
>>>>>> + *
>>>>>> + * %false indicates that no measures are taken and that it's up to the caller
>>>>>> + * to retain the pages.
>>>>>> + */
>>>>>> +static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
>>>>>> +{
>>>>>> +	return user_backed_iter(iter);
>>>>>> +}
>>>>>> +
>>>>>
>>>>> Wait a sec; why would we want a pin for pages we won't be modifying?
>>>>> A reference - sure, but...
>>>>
>>>> After having looked through the earlier iterations of the patchset -
>>>> sorry, but that won't fly for (at least) vmsplice().  There we can't
>>>> pin those suckers;
>>>
>>> We'll need a way to pass FOLL_LONGTERM to pin_user_pages_fast() to handle
>>> such long-term pinning as vmsplice() needs. But the release path (unpin)
>>> will be the same.
>>
>> Umm...  Are you saying that if the source area contains DAX mmaps, vmsplice()
>> from it will fail?
> 
> Yes, that's the plan. Because as you wrote elsewhere, it is otherwise too easy
> to lock up operations such as truncate(2) on DAX filesystems.

Right, it's then the same behavior as we already have for other 
FOLL_LONGTERM users, such as RDMA or io_uring.

... if we're afraid of breaking existing setups we could add some kind 
of fallback to copy to a buffer like ordinary pipe writes.
Jan Kara Jan. 27, 2023, 12:38 p.m. UTC | #12
On Fri 27-01-23 00:52:38, Al Viro wrote:
> On Thu, Jan 26, 2023 at 11:56:50PM +0000, David Howells wrote:
> > Al says that pinning a page (ie. FOLL_PIN) could cause a deadlock if a page is
> > vmspliced into a pipe with the pipe holding a pin on it because pinned pages
> > are removed from all page tables.  Is this actually the case?  I can't see
> > offhand where in mm/gup.c it does this.
> 
> It doesn't; sorry, really confused memories of what's going on, took a while
> to sort them out (FWIW, writeback is where we unmap and check if page is
> pinned, while pin_user_pages running into an unmapped page will end up
> with handle_mm_fault() (->fault(), actually) try to get the sucker locked
> and block on that until the writeback is over).
> 
> That said, I still think that pinned pages (arbitrary pagecache ones,
> at that) ending up in a pipe is a seriously bad idea.  It's trivial to
> arrange for them to stay that way indefinitely - no privileges needed,
> very few limits, etc.

I tend to agree but is there a big difference compared to normal page
references? There's no difference for memory usage, pages still can be
truncated from the file and disk space reclaimed (this is where DAX has
problems...) so standard file operations won't notice. The only difference
is that they could stay permanently dirty (we don't know whether the pin
owner copies data to or from the page) so it could cause trouble with dirty
throttling - and it is really only the throttling itself - page reclaim
will have the same troubles with both pins and ordinary page references...
Am I missing something?

								Honza
Patch

diff --git a/include/linux/uio.h b/include/linux/uio.h
index bf77cd3d5fb1..b1be128bb2fa 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -361,9 +361,34 @@  static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 		.count = count
 	};
 }
-
 /* Flags for iov_iter_get/extract_pages*() */
 /* Allow P2PDMA on the extracted pages */
 #define ITER_ALLOW_P2PDMA	((__force iov_iter_extraction_t)0x01)
 
+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+			       size_t maxsize, unsigned int maxpages,
+			       iov_iter_extraction_t extraction_flags,
+			       size_t *offset0);
+
+/**
+ * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
+ * @iter: The iterator
+ *
+ * Examine the iterator and indicate by returning true or false as to how, if
+ * at all, pages extracted from the iterator will be retained by the extraction
+ * function.
+ *
+ * %true indicates that the pages will have a pin placed in them that the
+ * caller must unpin.  This must be done for DMA/async DIO to force fork()
+ * to forcibly copy a page for the child (the parent must retain the original
+ * page).
+ *
+ * %false indicates that no measures are taken and that it's up to the caller
+ * to retain the pages.
+ */
+static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
+{
+	return user_backed_iter(iter);
+}
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 553afc870866..d69a05950555 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1916,3 +1916,324 @@  void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 		i->iov -= state->nr_segs - i->nr_segs;
 	i->nr_segs = state->nr_segs;
 }
+
+/*
+ * Extract a list of contiguous pages from an ITER_PIPE iterator.  This does
+ * not get references of its own on the pages, nor does it get a pin on them.
+ * If there's a partial page, it adds that first and will then allocate and add
+ * pages into the pipe to make up the buffer space to the amount required.
+ *
+ * The caller must hold the pipe locked and only transferring into a pipe is
+ * supported.
+ */
+static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   iov_iter_extraction_t extraction_flags,
+					   size_t *offset0)
+{
+	unsigned int nr, offset, chunk, j;
+	struct page **p;
+	size_t left;
+
+	if (!sanity(i))
+		return -EFAULT;
+
+	offset = pipe_npages(i, &nr);
+	if (!nr)
+		return -EFAULT;
+	*offset0 = offset;
+
+	maxpages = min_t(size_t, nr, maxpages);
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	left = maxsize;
+	for (j = 0; j < maxpages; j++) {
+		struct page *page = append_pipe(i, left, &offset);
+		if (!page)
+			break;
+		chunk = min_t(size_t, left, PAGE_SIZE - offset);
+		left -= chunk;
+		*p++ = page;
+	}
+	if (!j)
+		return -EFAULT;
+	return maxsize - left;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator.  This does not
+ * get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+					     struct page ***pages, size_t maxsize,
+					     unsigned int maxpages,
+					     iov_iter_extraction_t extraction_flags,
+					     size_t *offset0)
+{
+	struct page *page, **p;
+	unsigned int nr = 0, offset;
+	loff_t pos = i->xarray_start + i->iov_offset;
+	pgoff_t index = pos >> PAGE_SHIFT;
+	XA_STATE(xas, i->xarray, index);
+
+	offset = pos & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	rcu_read_lock();
+	for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+		if (xas_retry(&xas, page))
+			continue;
+
+		/* Has the page moved or been split? */
+		if (unlikely(page != xas_reload(&xas))) {
+			xas_reset(&xas);
+			continue;
+		}
+
+		p[nr++] = find_subpage(page, xas.xa_index);
+		if (nr == maxpages)
+			break;
+	}
+	rcu_read_unlock();
+
+	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator.  This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   iov_iter_extraction_t extraction_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	size_t skip = i->iov_offset, offset;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->bvec->bv_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->bvec++;
+		skip = 0;
+	}
+
+	skip += i->bvec->bv_offset;
+	page = i->bvec->bv_page + skip / PAGE_SIZE;
+	offset = skip % PAGE_SIZE;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+	for (k = 0; k < maxpages; k++)
+		p[k] = page + k;
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of virtually contiguous pages from an ITER_KVEC iterator.
+ * This does not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_kvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   iov_iter_extraction_t extraction_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	const void *kaddr;
+	size_t skip = i->iov_offset, offset, len;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->kvec->iov_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->kvec++;
+		skip = 0;
+	}
+
+	kaddr = i->kvec->iov_base + skip;
+	offset = (unsigned long)kaddr & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	kaddr -= offset;
+	len = offset + maxsize;
+	for (k = 0; k < maxpages; k++) {
+		size_t seg = min_t(size_t, len, PAGE_SIZE);
+
+		if (is_vmalloc_or_module_addr(kaddr))
+			page = vmalloc_to_page(kaddr);
+		else
+			page = virt_to_page(kaddr);
+
+		p[k] = page;
+		len -= seg;
+		kaddr += PAGE_SIZE;
+	}
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them.  This should only be used if the iterator is user-backed
+ * (IOVEC/UBUF).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to be used where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref forces fork() to give the
+ * child a copy of the page.
+ */
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+					   struct page ***pages,
+					   size_t maxsize,
+					   unsigned int maxpages,
+					   iov_iter_extraction_t extraction_flags,
+					   size_t *offset0)
+{
+	unsigned long addr;
+	unsigned int gup_flags = FOLL_PIN;
+	size_t offset;
+	int res;
+
+	if (i->data_source == ITER_DEST)
+		gup_flags |= FOLL_WRITE;
+	if (extraction_flags & ITER_ALLOW_P2PDMA)
+		gup_flags |= FOLL_PCI_P2PDMA;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = first_iovec_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @extraction_flags: Flags to qualify request
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator.  The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base.  If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * @extraction_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer DMA
+ * be allowed on the pages extracted.
+ *
+ * The iov_iter_extract_will_pin() function can be used to query how cleanup
+ * should be performed.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF), pins will be
+ *      added to the pages, but refs will not be taken.
+ *      iov_iter_extract_will_pin() will return true.
+ *
+ *  (*) If the iterator is ITER_PIPE, this must describe a destination for the
+ *      data.  Additional pages may be allocated and added to the pipe (which
+ *      will hold the refs), but pins will not be obtained for the caller.  The
+ *      caller must hold the pipe lock.  iov_iter_extract_will_pin() will
+ *      return false.
+ *
+ *  (*) If the iterator is ITER_KVEC, ITER_BVEC or ITER_XARRAY, the pages are
+ *      merely listed; no extra refs or pins are obtained.
+ *      iov_iter_extract_will_pin() will return false.
+ *
+ * Note also:
+ *
+ *  (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated, and
+ * sets *offset0 to the offset into the first page.
+ *
+ * It may also return -ENOMEM and -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i,
+			       struct page ***pages,
+			       size_t maxsize,
+			       unsigned int maxpages,
+			       iov_iter_extraction_t extraction_flags,
+			       size_t *offset0)
+{
+	maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+	if (!maxsize)
+		return 0;
+
+	if (likely(user_backed_iter(i)))
+		return iov_iter_extract_user_pages(i, pages, maxsize,
+						   maxpages, extraction_flags,
+						   offset0);
+	if (iov_iter_is_kvec(i))
+		return iov_iter_extract_kvec_pages(i, pages, maxsize,
+						   maxpages, extraction_flags,
+						   offset0);
+	if (iov_iter_is_bvec(i))
+		return iov_iter_extract_bvec_pages(i, pages, maxsize,
+						   maxpages, extraction_flags,
+						   offset0);
+	if (iov_iter_is_pipe(i))
+		return iov_iter_extract_pipe_pages(i, pages, maxsize,
+						   maxpages, extraction_flags,
+						   offset0);
+	if (iov_iter_is_xarray(i))
+		return iov_iter_extract_xarray_pages(i, pages, maxsize,
+						     maxpages, extraction_flags,
+						     offset0);
+	return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_pages);
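
The relationship between the starting offset, the page count and the returned byte count in the extraction helpers can be checked with a small userspace model. This is a sketch, not the kernel code: it assumes a 4096-byte page, assumes the page count is the rounded-up number of pages needed to cover the offset plus the requested size (capped at the caller's limit), and uses hypothetical `model_*` names throughout.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumed page size for the model */

/* What an extraction would report for a given request. */
struct model_extent {
	size_t offset0;	/* offset into the first page */
	size_t npages;	/* number of pages listed */
	size_t len;	/* bytes covered, i.e. the value returned */
};

static struct model_extent model_extent_for(unsigned long addr,
					    size_t maxsize, size_t maxpages)
{
	struct model_extent e;
	size_t want;

	e.offset0 = addr % MODEL_PAGE_SIZE;

	/* Pages needed to cover offset0 + maxsize bytes, rounded up. */
	want = (e.offset0 + maxsize + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
	e.npages = want < maxpages ? want : maxpages;

	/* With no page slots nothing is covered; the model reports zero. */
	if (e.npages == 0) {
		e.len = 0;
		return e;
	}

	/* Bytes available past offset0 in the listed pages, capped at maxsize. */
	e.len = e.npages * MODEL_PAGE_SIZE - e.offset0;
	if (e.len > maxsize)
		e.len = maxsize;
	return e;
}
```

For example, a request for 8192 bytes starting 256 bytes into a page needs three pages; if the caller only allows two, the covered length shrinks to 2 * 4096 - 256 = 7936 bytes, which is the amount the iterator would be advanced by in this model.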