[v2,4/7] iov_iter: new iov_iter_pin_pages*() routines

Message ID	20220831041843.973026-5-jhubbard@nvidia.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates 12.22.5.238 as permitted sender) receiver=protection.outlook.com; client-ip=12.22.5.238; helo=mail.nvidia.com; pr=C From: John Hubbard <jhubbard@nvidia.com> To: Andrew Morton <akpm@linux-foundation.org> CC: Jens Axboe <axboe@kernel.dk>, Alexander Viro <viro@zeniv.linux.org.uk>, Miklos Szeredi <miklos@szeredi.hu>, Christoph Hellwig <hch@infradead.org>, "Darrick J . Wong" <djwong@kernel.org>, Trond Myklebust <trond.myklebust@hammerspace.com>, Anna Schumaker <anna@kernel.org>, Jan Kara <jack@suse.cz>, David Hildenbrand <david@redhat.com>, Logan Gunthorpe <logang@deltatee.com>, <linux-block@vger.kernel.org>, <linux-fsdevel@vger.kernel.org>, <linux-xfs@vger.kernel.org>, <linux-nfs@vger.kernel.org>, <linux-mm@kvack.org>, LKML <linux-kernel@vger.kernel.org>, John Hubbard <jhubbard@nvidia.com> Subject: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines Date: Tue, 30 Aug 2022 21:18:40 -0700 Message-ID: <20220831041843.973026-5-jhubbard@nvidia.com> In-Reply-To: <20220831041843.973026-1-jhubbard@nvidia.com> References: <20220831041843.973026-1-jhubbard@nvidia.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	convert most filesystems to pin_user_pages_fast() \| expand [v2,0/7] convert most filesystems to pin_user_pages_fast() [v2,1/7] mm: change release_pages() to use unsigned long for npages [v2,2/7] mm/gup: introduce pin_user_page() [v2,3/7] block: add dio_w_() wrappers for pin, unpin user pages [v2,4/7] iov_iter: new iov_iter_pin_pages() routines [v2,5/7] block, bio, fs: convert most filesystems to pin_user_pages_fast() [v2,6/7] NFS: direct-io: convert to FOLL_PIN pages [v2,7/7] fuse: convert direct IO paths to use FOLL_PIN

John Hubbard Aug. 31, 2022, 4:18 a.m. UTC

Provide two new wrapper routines that are intended for user space pages
only:

    iov_iter_pin_pages()
    iov_iter_pin_pages_alloc()

Internally, these routines call pin_user_pages_fast(), instead of
get_user_pages_fast(), for user_backed_iter(i) and iov_iter_bvec(i)
cases.

As always, callers must use unpin_user_pages() or a suitable FOLL_PIN
variant, to release the pages, if they actually were acquired via
pin_user_pages_fast().

This is a prerequisite to converting bio/block layers over to use
pin_user_pages_fast().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/uio.h |  4 +++
 lib/iov_iter.c      | 86 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 84 insertions(+), 6 deletions(-)

Al Viro Sept. 1, 2022, 12:42 a.m. UTC | #1

On Tue, Aug 30, 2022 at 09:18:40PM -0700, John Hubbard wrote:
> Provide two new wrapper routines that are intended for user space pages
> only:
> 
>     iov_iter_pin_pages()
>     iov_iter_pin_pages_alloc()
> 
> Internally, these routines call pin_user_pages_fast(), instead of
> get_user_pages_fast(), for user_backed_iter(i) and iov_iter_bvec(i)
> cases.
> 
> As always, callers must use unpin_user_pages() or a suitable FOLL_PIN
> variant, to release the pages, if they actually were acquired via
> pin_user_pages_fast().
> 
> This is a prerequisite to converting bio/block layers over to use
> pin_user_pages_fast().

What of ITER_PIPE (splice from O_DIRECT fd to a to pipe, for filesystem
that uses generic_file_splice_read())?

John Hubbard Sept. 1, 2022, 1:48 a.m. UTC | #2

On 8/31/22 17:42, Al Viro wrote:
> On Tue, Aug 30, 2022 at 09:18:40PM -0700, John Hubbard wrote:
>> Provide two new wrapper routines that are intended for user space pages
>> only:
>>
>>     iov_iter_pin_pages()
>>     iov_iter_pin_pages_alloc()
>>
>> Internally, these routines call pin_user_pages_fast(), instead of
>> get_user_pages_fast(), for user_backed_iter(i) and iov_iter_bvec(i)
>> cases.
>>
>> As always, callers must use unpin_user_pages() or a suitable FOLL_PIN
>> variant, to release the pages, if they actually were acquired via
>> pin_user_pages_fast().
>>
>> This is a prerequisite to converting bio/block layers over to use
>> pin_user_pages_fast().
> 
> What of ITER_PIPE (splice from O_DIRECT fd to a to pipe, for filesystem
> that uses generic_file_splice_read())?

Yes. And it turns out that I sent this v2 just a little too early: it
does not include Jan Kara's latest idea [1] of including ITER_PIPE and
ITER_XARRAY. That should fix this up.


[1] https://lore.kernel.org/r/20220831094349.boln4jjajkdtykx3@quack3

thanks,

Christoph Hellwig Sept. 6, 2022, 6:47 a.m. UTC | #3

I'd it one step back.  For BVECS we never need a get or pin.  The
block layer already does this, an the other callers should as well.
For KVEC the same is true.  For PIPE and xarray as you pointed out
we can probably just do the pin, it is not like these are performance
paths.

So, I'd suggest to:

 - factor out the user backed and bvec cases from
   __iov_iter_get_pages_alloc into helper just to keep
   __iov_iter_get_pages_alloc readable.
 - for the pin case don't use the existing bvec helper at all, but
   copy the logic for the block layer for not pinning.

John Hubbard Sept. 6, 2022, 7:44 a.m. UTC | #4

On 9/5/22 23:47, Christoph Hellwig wrote:
> I'd it one step back.  For BVECS we never need a get or pin.  The
> block layer already does this, an the other callers should as well.
> For KVEC the same is true.  For PIPE and xarray as you pointed out
> we can probably just do the pin, it is not like these are performance
> paths.
> 
> So, I'd suggest to:
> 
>  - factor out the user backed and bvec cases from
>    __iov_iter_get_pages_alloc into helper just to keep
>    __iov_iter_get_pages_alloc readable.

OK, that part is clear.

>  - for the pin case don't use the existing bvec helper at all, but
>    copy the logic for the block layer for not pinning.

I'm almost, but not quite sure I get the idea above. Overall, what
happens to bvec pages? Leave the get_page() pin in place for FOLL_GET
(or USE_FOLL_GET), I suppose, but do...what, for FOLL_PIN callers?

thanks,

Christoph Hellwig Sept. 6, 2022, 7:48 a.m. UTC | #5

On Tue, Sep 06, 2022 at 12:44:28AM -0700, John Hubbard wrote:
> OK, that part is clear.
> 
> >  - for the pin case don't use the existing bvec helper at all, but
> >    copy the logic for the block layer for not pinning.
> 
> I'm almost, but not quite sure I get the idea above. Overall, what
> happens to bvec pages? Leave the get_page() pin in place for FOLL_GET
> (or USE_FOLL_GET), I suppose, but do...what, for FOLL_PIN callers?

Do not change anyhing for FOLL_GET callers, as they are on the way out
anyway.

For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
not acquiring a reference is obviously safe, and the other callers will
need an audit, but I can't think of why it woul  ever be unsafe.

John Hubbard Sept. 6, 2022, 7:58 a.m. UTC | #6

On 9/6/22 00:48, Christoph Hellwig wrote:
> On Tue, Sep 06, 2022 at 12:44:28AM -0700, John Hubbard wrote:
>> OK, that part is clear.
>>
>>>  - for the pin case don't use the existing bvec helper at all, but
>>>    copy the logic for the block layer for not pinning.
>>
>> I'm almost, but not quite sure I get the idea above. Overall, what
>> happens to bvec pages? Leave the get_page() pin in place for FOLL_GET
>> (or USE_FOLL_GET), I suppose, but do...what, for FOLL_PIN callers?
> 
> Do not change anyhing for FOLL_GET callers, as they are on the way out
> anyway.
> 

OK, got it.

> For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
> not acquiring a reference is obviously safe, and the other callers will
> need an audit, but I can't think of why it woul  ever be unsafe.

In order to do that, one would need to be confident that such bvec and kvec
pages do not get passed down to bio_release_pages() (or the new
bio_unpin_pages()). Also, I'm missing a key point, because today bvec pages get
a get_page() reference from __iov_iter_get_pages_alloc(). If I just skip
that, then the get/put calls are unbalanced...

I can just hear Al Viro repeating his points about splice() and vmsplice(),
heh. :)

thanks,

Jan Kara Sept. 6, 2022, 10:21 a.m. UTC | #7

On Tue 06-09-22 00:48:49, Christoph Hellwig wrote:
> On Tue, Sep 06, 2022 at 12:44:28AM -0700, John Hubbard wrote:
> > OK, that part is clear.
> > 
> > >  - for the pin case don't use the existing bvec helper at all, but
> > >    copy the logic for the block layer for not pinning.
> > 
> > I'm almost, but not quite sure I get the idea above. Overall, what
> > happens to bvec pages? Leave the get_page() pin in place for FOLL_GET
> > (or USE_FOLL_GET), I suppose, but do...what, for FOLL_PIN callers?
> 
> Do not change anyhing for FOLL_GET callers, as they are on the way out
> anyway.
> 
> For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
> not acquiring a reference is obviously safe, and the other callers will
> need an audit, but I can't think of why it woul  ever be unsafe.

Are you sure about "For file systems not acquiring a reference is obviously
safe"? I can see places e.g. in orangefs, afs, etc. which create bvec iters
from pagecache pages. And then we have iter_file_splice_write() which
creates bvec from pipe pages (which can also be pagecache pages if
vmsplice() is used). So perhaps there are no lifetime issues even without
acquiring a reference (but looking at the code I would not say it is
obvious) but I definitely don't see how it would be safe to not get a pin
to signal to filesystem backing the pagecache page that there is DMA
happening to/from the page.

								Honza

Christoph Hellwig Sept. 7, 2022, 8:45 a.m. UTC | #8

On Tue, Sep 06, 2022 at 12:21:06PM +0200, Jan Kara wrote:
> > For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
> > not acquiring a reference is obviously safe, and the other callers will
> > need an audit, but I can't think of why it woul  ever be unsafe.
> 
> Are you sure about "For file systems not acquiring a reference is obviously
> safe"? I can see places e.g. in orangefs, afs, etc. which create bvec iters
> from pagecache pages. And then we have iter_file_splice_write() which
> creates bvec from pipe pages (which can also be pagecache pages if
> vmsplice() is used). So perhaps there are no lifetime issues even without
> acquiring a reference (but looking at the code I would not say it is
> obvious) but I definitely don't see how it would be safe to not get a pin
> to signal to filesystem backing the pagecache page that there is DMA
> happening to/from the page.

I mean in the context of iov_iter_get_pages callers, that is direct
I/O.  Direct callers of iov_iter_bvec which then pass that iov to
->read_iter / ->write_iter will need to hold references (those are
the references that the callers of iov_iter_get_pages rely on!).

Christoph Hellwig Sept. 7, 2022, 8:50 a.m. UTC | #9

On Tue, Sep 06, 2022 at 12:58:59AM -0700, John Hubbard wrote:
> > For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
> > not acquiring a reference is obviously safe, and the other callers will
> > need an audit, but I can't think of why it woul  ever be unsafe.
> 
> In order to do that, one would need to be confident that such bvec and kvec
> pages do not get passed down to bio_release_pages() (or the new
> bio_unpin_pages()). Also, I'm missing a key point, because today bvec pages get
> a get_page() reference from __iov_iter_get_pages_alloc(). If I just skip
> that, then the get/put calls are unbalanced...

Except that for the most relevant callers (bdev and iomap) we never
end up in __iov_iter_get_pages_alloc. What I suggest is that, with
the move to the pin helper, we codify that logic.

For callers of the bio_iov_iter_pin_pages helper, the corresponding
bio_unpin_pages helper will encapsulate that logic.  For direct calls
to iov_iter_pin_pages, we should add a iov_iter_unpin_pages helper that
also encapsulates the logic.  The beauty is that this means the caller
itself does not have to care about any of thise, and the logic is in
one (well two due to the block special case that reuses the bio_vec
array for the pages space) place.

> I can just hear Al Viro repeating his points about splice() and vmsplice(),
> heh. :)

splice does take these references before calling into the file system
(see my reply to Jan).

Al Viro Sept. 14, 2022, 3:51 a.m. UTC | #10

On Wed, Sep 07, 2022 at 01:45:26AM -0700, Christoph Hellwig wrote:
> On Tue, Sep 06, 2022 at 12:21:06PM +0200, Jan Kara wrote:
> > > For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
> > > not acquiring a reference is obviously safe, and the other callers will
> > > need an audit, but I can't think of why it woul  ever be unsafe.
> > 
> > Are you sure about "For file systems not acquiring a reference is obviously
> > safe"? I can see places e.g. in orangefs, afs, etc. which create bvec iters
> > from pagecache pages. And then we have iter_file_splice_write() which
> > creates bvec from pipe pages (which can also be pagecache pages if
> > vmsplice() is used). So perhaps there are no lifetime issues even without
> > acquiring a reference (but looking at the code I would not say it is
> > obvious) but I definitely don't see how it would be safe to not get a pin
> > to signal to filesystem backing the pagecache page that there is DMA
> > happening to/from the page.
> 
> I mean in the context of iov_iter_get_pages callers, that is direct
> I/O.  Direct callers of iov_iter_bvec which then pass that iov to
> ->read_iter / ->write_iter will need to hold references (those are
> the references that the callers of iov_iter_get_pages rely on!).

Unless I'm misreading Jan, the question is whether they should get or
pin.  AFAICS, anyone who passes the sucker to ->read_iter() (or ->recvmsg(),
or does direct copy_to_iter()/zero_iter(), etc.) is falling under
=================================================================================
CASE 5: Pinning in order to write to the data within the page
-------------------------------------------------------------
Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
write to a page's data, unpin" can cause a problem. Case 5 may be considered a
superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
other words, if the code is neither Case 1 nor Case 2, it may still require
FOLL_PIN, for patterns like this:

Correct (uses FOLL_PIN calls):
    pin_user_pages()
    write to the data within the pages
    unpin_user_pages()

INCORRECT (uses FOLL_GET calls):
    get_user_pages()
    write to the data within the pages
    put_page()
=================================================================================

Regarding iter_file_splice_write() case, do we need to pin pages
when we are not going to modify the data in those?

The same goes for afs, AFAICS; I started to type "... and everything that passes
WRITE to iov_iter_bvec()", but...
drivers/vhost/vringh.c:1165:            iov_iter_bvec(&iter, READ, iov, ret, translated);
drivers/vhost/vringh.c:1198:            iov_iter_bvec(&iter, WRITE, iov, ret, translated);
is backwards - READ is for data destinations, comes with copy_to_iter(); WRITE is
for data sources and it comes with copy_from_iter()...
I'm really tempted to slap
	if (WARN_ON(i->data_source))
		return 0;
into copy_to_iter() et.al., along with its opposite for copy_from_iter().
And see who comes screaming...  Things like
        if (unlikely(iov_iter_is_pipe(i) || iov_iter_is_discard(i))) {
                WARN_ON(1);
                return 0;
        }
in e.g. csum_and_copy_from_iter() would be replaced by that, and become
easier to understand...
These two are also getting it wrong, BTW:
drivers/target/target_core_file.c:340:  iov_iter_bvec(&iter, READ, bvec, sgl_nents, len);
drivers/target/target_core_file.c:476:  iov_iter_bvec(&iter, READ, bvec, nolb, len);

Jan Kara Sept. 14, 2022, 2:52 p.m. UTC | #11

On Wed 14-09-22 04:51:17, Al Viro wrote:
> On Wed, Sep 07, 2022 at 01:45:26AM -0700, Christoph Hellwig wrote:
> > On Tue, Sep 06, 2022 at 12:21:06PM +0200, Jan Kara wrote:
> > > > For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
> > > > not acquiring a reference is obviously safe, and the other callers will
> > > > need an audit, but I can't think of why it woul  ever be unsafe.
> > > 
> > > Are you sure about "For file systems not acquiring a reference is obviously
> > > safe"? I can see places e.g. in orangefs, afs, etc. which create bvec iters
> > > from pagecache pages. And then we have iter_file_splice_write() which
> > > creates bvec from pipe pages (which can also be pagecache pages if
> > > vmsplice() is used). So perhaps there are no lifetime issues even without
> > > acquiring a reference (but looking at the code I would not say it is
> > > obvious) but I definitely don't see how it would be safe to not get a pin
> > > to signal to filesystem backing the pagecache page that there is DMA
> > > happening to/from the page.
> > 
> > I mean in the context of iov_iter_get_pages callers, that is direct
> > I/O.  Direct callers of iov_iter_bvec which then pass that iov to
> > ->read_iter / ->write_iter will need to hold references (those are
> > the references that the callers of iov_iter_get_pages rely on!).
> 
> Unless I'm misreading Jan, the question is whether they should get or
> pin.  AFAICS, anyone who passes the sucker to ->read_iter() (or ->recvmsg(),
> or does direct copy_to_iter()/zero_iter(), etc.) is falling under
> =================================================================================
> CASE 5: Pinning in order to write to the data within the page
> -------------------------------------------------------------
> Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
> write to a page's data, unpin" can cause a problem. Case 5 may be considered a
> superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
> other words, if the code is neither Case 1 nor Case 2, it may still require
> FOLL_PIN, for patterns like this:
> 
> Correct (uses FOLL_PIN calls):
>     pin_user_pages()
>     write to the data within the pages
>     unpin_user_pages()
> 
> INCORRECT (uses FOLL_GET calls):
>     get_user_pages()
>     write to the data within the pages
>     put_page()
> =================================================================================

Yes, that was my point.

> Regarding iter_file_splice_write() case, do we need to pin pages
> when we are not going to modify the data in those?

Strictly speaking not. So far we are pinning pages even if they serve as
data source because it is simpler not to bother about data access direction
but I'm not really aware of anything that would mandate that.

								Honza

Al Viro Sept. 14, 2022, 4:42 p.m. UTC | #12

On Wed, Sep 14, 2022 at 04:52:33PM +0200, Jan Kara wrote:
> > =================================================================================
> > CASE 5: Pinning in order to write to the data within the page
> > -------------------------------------------------------------
> > Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
> > write to a page's data, unpin" can cause a problem. Case 5 may be considered a
> > superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
> > other words, if the code is neither Case 1 nor Case 2, it may still require
> > FOLL_PIN, for patterns like this:
> > 
> > Correct (uses FOLL_PIN calls):
> >     pin_user_pages()
> >     write to the data within the pages
> >     unpin_user_pages()
> > 
> > INCORRECT (uses FOLL_GET calls):
> >     get_user_pages()
> >     write to the data within the pages
> >     put_page()
> > =================================================================================
> 
> Yes, that was my point.

The thing is, at which point do we pin those pages?  pin_user_pages() works by
userland address; by the time we get to any of those we have struct page
references and no idea whether they are still mapped anywhere.

How would that work?  What protects the area where you want to avoid running
into pinned pages from previously acceptable page getting pinned?  If "they
must have been successfully unmapped" is a part of what you are planning, we
really do have a problem...

Jan Kara Sept. 15, 2022, 8:16 a.m. UTC | #13

On Wed 14-09-22 17:42:40, Al Viro wrote:
> On Wed, Sep 14, 2022 at 04:52:33PM +0200, Jan Kara wrote:
> > > =================================================================================
> > > CASE 5: Pinning in order to write to the data within the page
> > > -------------------------------------------------------------
> > > Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
> > > write to a page's data, unpin" can cause a problem. Case 5 may be considered a
> > > superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
> > > other words, if the code is neither Case 1 nor Case 2, it may still require
> > > FOLL_PIN, for patterns like this:
> > > 
> > > Correct (uses FOLL_PIN calls):
> > >     pin_user_pages()
> > >     write to the data within the pages
> > >     unpin_user_pages()
> > > 
> > > INCORRECT (uses FOLL_GET calls):
> > >     get_user_pages()
> > >     write to the data within the pages
> > >     put_page()
> > > =================================================================================
> > 
> > Yes, that was my point.
> 
> The thing is, at which point do we pin those pages?  pin_user_pages() works by
> userland address; by the time we get to any of those we have struct page
> references and no idea whether they are still mapped anywhere.

Yes, pin_user_pages() currently works by page address but there's nothing
fundamental about that. Technically, pin is currently just another type of
page reference so we can as well just pin the page when given struct page.
In fact John Hubbart has added such helper in this series.

> How would that work?  What protects the area where you want to avoid running
> into pinned pages from previously acceptable page getting pinned?  If "they
> must have been successfully unmapped" is a part of what you are planning, we
> really do have a problem...

But this is a very good question. So far the idea was that we lock the
page, unmap (or writeprotect) the page, and then check pincount == 0 and
that is a reliable method for making sure page data is stable (until we
unlock the page & release other locks blocking page faults and writes). But
once suddently ordinary page references can be used to create pins this
does not work anymore. Hrm.

Just brainstorming ideas now: So we'd either need to obtain the pins early
when we still have the virtual address (but I guess that is often not
practical but should work e.g. for normal direct IO path) or we need some
way to "simulate" the page fault when pinning the page, just don't map it
into page tables in the end. This simulated page fault could be perhaps
avoided if rmap walk shows that the page is already mapped somewhere with
suitable permissions.

								Honza

Al Viro Sept. 16, 2022, 1:55 a.m. UTC | #14

On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote:

> > How would that work?  What protects the area where you want to avoid running
> > into pinned pages from previously acceptable page getting pinned?  If "they
> > must have been successfully unmapped" is a part of what you are planning, we
> > really do have a problem...
> 
> But this is a very good question. So far the idea was that we lock the
> page, unmap (or writeprotect) the page, and then check pincount == 0 and
> that is a reliable method for making sure page data is stable (until we
> unlock the page & release other locks blocking page faults and writes). But
> once suddently ordinary page references can be used to create pins this
> does not work anymore. Hrm.
> 
> Just brainstorming ideas now: So we'd either need to obtain the pins early
> when we still have the virtual address (but I guess that is often not
> practical but should work e.g. for normal direct IO path) or we need some
> way to "simulate" the page fault when pinning the page, just don't map it
> into page tables in the end. This simulated page fault could be perhaps
> avoided if rmap walk shows that the page is already mapped somewhere with
> suitable permissions.

OK...  I'd done some digging; results so far

	* READ vs. WRITE turned out to be an awful way to specify iov_iter
data direction.  Local iov_iter branch so far:
	get rid of unlikely() on page_copy_sane() calls
	csum_and_copy_to_iter(): handle ITER_DISCARD
	[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
	[fsi] WRITE is "data source", not destination...
	[infiniband] READ is "data destination", not source...
	[s390] zcore: WRITE is "data source", not destination...
	[target] fix iov_iter_bvec() "direction" argument
	[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
	[xen] fix "direction" argument of iov_iter_kvec()
	[trace] READ means "data destination", not source...
	iov_iter: saner checks for attempt to copy to/from iterator
	use less confusing names for iov_iter direction initializers
those 8 commits in the middle consist of fixes, some of them with more than
one call site affected.  Folks keep going "oh, we are going to copy data
into that iterator, must be WRITE".  Wrong - WRITE means "as for write(2)",
i.e. the data _source_, not data destination.  And the same kind of bugs
goes in the opposite direction, of course.
	I think something like ITER_DEST vs. ITER_SOURCE would be less
confusing.

	* anything that goes with ITER_SOURCE doesn't need pin.
	* ITER_IOVEC/ITER_UBUF need pin for get_pages and for nothing else.
Need to grab reference on get_pages, obviously.
	* even more obviously, ITER_DISCARD is irrelevant here.
	* ITER_PIPE only modifies anonymous pages that had been allocated
by iov_iter primitives and hadn't been observed by anything outside until
we are done with said ITER_PIPE.
	* quite a few instances are similar to e.g. REQ_OP_READ handling in
/dev/loop - we work with ITER_BVEC there and we do modify the page contents,
but the damn thing would better be given to us locked and stay locked until
all involved modifications (be it real IO/decoding/whatever) is complete.
That ought to be safe, unless I'm missing something.

That doesn't cover everything; still going through the list...

Al Viro Sept. 20, 2022, 5:02 a.m. UTC | #15

On Fri, Sep 16, 2022 at 02:55:53AM +0100, Al Viro wrote:
> 	* READ vs. WRITE turned out to be an awful way to specify iov_iter
> data direction.  Local iov_iter branch so far:
> 	get rid of unlikely() on page_copy_sane() calls
> 	csum_and_copy_to_iter(): handle ITER_DISCARD
> 	[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
> 	[fsi] WRITE is "data source", not destination...
> 	[infiniband] READ is "data destination", not source...
> 	[s390] zcore: WRITE is "data source", not destination...
> 	[target] fix iov_iter_bvec() "direction" argument
> 	[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
> 	[xen] fix "direction" argument of iov_iter_kvec()
> 	[trace] READ means "data destination", not source...
> 	iov_iter: saner checks for attempt to copy to/from iterator
> 	use less confusing names for iov_iter direction initializers
> those 8 commits in the middle consist of fixes, some of them with more than
> one call site affected.  Folks keep going "oh, we are going to copy data
> into that iterator, must be WRITE".  Wrong - WRITE means "as for write(2)",
> i.e. the data _source_, not data destination.  And the same kind of bugs
> goes in the opposite direction, of course.
> 	I think something like ITER_DEST vs. ITER_SOURCE would be less
> confusing.
> 
> 	* anything that goes with ITER_SOURCE doesn't need pin.
> 	* ITER_IOVEC/ITER_UBUF need pin for get_pages and for nothing else.
> Need to grab reference on get_pages, obviously.
> 	* even more obviously, ITER_DISCARD is irrelevant here.
> 	* ITER_PIPE only modifies anonymous pages that had been allocated
> by iov_iter primitives and hadn't been observed by anything outside until
> we are done with said ITER_PIPE.
> 	* quite a few instances are similar to e.g. REQ_OP_READ handling in
> /dev/loop - we work with ITER_BVEC there and we do modify the page contents,
> but the damn thing would better be given to us locked and stay locked until
> all involved modifications (be it real IO/decoding/whatever) is complete.
> That ought to be safe, unless I'm missing something.
> 
> That doesn't cover everything; still going through the list...

More:

nvme target: nvme read requests end up with somebody allocating and filling
sglist, followed by reading from file into it (using ITER_BVEC).  Then the
pages are sent out, presumably.  I would be very surprised if it turned out
to be anything other than anon pages allocated by the driver, but I'd like
to see that confirmed by nvme folks.  Probably doesn't need pinning.

->read_folio() instances - page locked by caller, not unlocked until we are done.

->readahead() instances - pages are in the segment of page cache that had been
populated and locked by the caller; some are ITER_BVEC (with page(s) extracted
by readahead_page()), some - ITER_XARRAY.
other similar places (some of ->write_begin() instances, after having grabbed
a locked page, etc.)

->issue_read() instances - the call graph is scary (in particular, recursion
prevention there is non-obvious), but unless netfs folks say otherwise, I'd
assume that all pages involved are supposed to be locked by the caller.
swap reads (ending up at __swap_read_unplug()) - pages locked by callers.

in some cases (cifs) pages are privately allocated and not visible to anyone
else.

io_import_fixed() sets ITER_BVEC over pinned pages; see io_pin_pages() for
the place where that's done.

In cifs_send_async_read() we take the pages that will eventually go into
ITER_BVEC iterator from iov_iter_get_pages() - that one wants pinning if
the type of ctx->iter would demand so.  The same goes for setup_aio_ctx_iter() -
iov_iter_get_pages() is used to make an ITER_BVEC counterpart of the
iov_iter passed to ->read_iter(), with the same considerations re pinning.
The same goes for ceph __iter_get_bvecs().

Haven't done yet:

drivers/target/target_core_file.c:292:  iov_iter_bvec(&iter, is_write, aio_cmd->bvecs, sgl_nents, len);
drivers/vhost/vringh.c:1198:            iov_iter_bvec(&iter, ITER_DEST, iov, ret, translated);
fs/afs/dir.c:308:       iov_iter_xarray(&req->def_iter, ITER_DEST, &dvnode->netfs.inode.i_mapping->i_pages,
net/ceph/messenger_v1.c:52:     iov_iter_bvec(&msg.msg_iter, ITER_DEST, &bvec, 1, length);
net/ceph/messenger_v2.c:236:    iov_iter_bvec(&con->v2.in_iter, ITER_DEST, &con->v2.in_bvec, 1, bv->bv_len);
net/sunrpc/svcsock.c:263:       iov_iter_bvec(&msg.msg_iter, ITER_DEST, bvec, i, buflen);
net/sunrpc/xprtsock.c:376:      iov_iter_bvec(&msg->msg_iter, ITER_DEST, bvec, nr, count);

The picture so far looks like we mostly need to take care of pinning when
we obtain the references from iov_iter_get_pages().  What's more, it looks
like ITER_BVEC/ITER_XARRAY/ITER_PIPE we really don't need to pin anything on
get_pages/pin_pages - they are already protected (or, in case of ITER_PIPE,
allocated by iov_iter itself and not reachable by anybody outside).
Might or might not be true for the remaining 7 call sites...

NOTE: all of the above assumes that callers with pre-locked pages are
either synchronous or do not unlock until the completion callbacks.
It does appear to be true; if it is true, I really wonder if we need
to even grab references in iov_iter_pin_pages() for anything other
than ITER_IOVEC/ITER_UBUF.  The right primitive might be
	if user-backed
		pin pages
	else
		just copy the pointers; any lifetime-related issues are
		up to the caller.
	advance iterator in either case

Al Viro Sept. 22, 2022, 2:22 a.m. UTC | #16

On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote:

> > How would that work?  What protects the area where you want to avoid running
> > into pinned pages from previously acceptable page getting pinned?  If "they
> > must have been successfully unmapped" is a part of what you are planning, we
> > really do have a problem...
> 
> But this is a very good question. So far the idea was that we lock the
> page, unmap (or writeprotect) the page, and then check pincount == 0 and
> that is a reliable method for making sure page data is stable (until we
> unlock the page & release other locks blocking page faults and writes). But
> once suddently ordinary page references can be used to create pins this
> does not work anymore. Hrm.
> 
> Just brainstorming ideas now: So we'd either need to obtain the pins early
> when we still have the virtual address (but I guess that is often not
> practical but should work e.g. for normal direct IO path) or we need some
> way to "simulate" the page fault when pinning the page, just don't map it
> into page tables in the end. This simulated page fault could be perhaps
> avoided if rmap walk shows that the page is already mapped somewhere with
> suitable permissions.

OK.  As far as I can see, the rules are along the lines of
	* creator of ITER_BVEC/ITER_XARRAY is responsible for pages being safe.
	  That includes
		* page known to be locked by caller
		* page being privately allocated and not visible to anyone else
		* iterator being data source
		* page coming from pin_user_pages(), possibly as the result of
		  iov_iter_pin_pages() on ITER_IOVEC/ITER_UBUF.
	* ITER_PIPE pages are always safe
	* pages found in ITER_BVEC/ITER_XARRAY are safe, since the iterator
	  had been created with such.
My preference would be to have iov_iter_get_pages() and friends pin if and
only if we have data-destination iov_iter that is user-backed.  For
data-source user-backed we only need FOLL_GET, and for all other flavours
(ITER_BVEC, etc.) we only do get_page(), if we need to grab any references
at all.

What I'd like to have is the understanding of the places where we drop
the references acquired by iov_iter_get_pages().  How do we decide
whether to unpin?  E.g. pipe_buffer carries a reference to page and no
way to tell whether it's a pinned one; results of iov_iter_get_pages()
on ITER_IOVEC *can* end up there, but thankfully only from data-source
(== WRITE, aka.  ITER_SOURCE) iov_iter.  So for those we don't care.
Then there's nfs_request; AFAICS, we do need to pin the references in
those if they are coming from nfs_direct_read_schedule_iovec(), but
not if they come from readpage_async_filler().  How do we deal with
coalescence, etc.?  It's been a long time since I really looked at
that code...  Christoph, could you give any comments on that one?

Note, BTW, that nfs_request coming from readpage_async_filler() have
pages locked by caller; the ones from nfs_direct_read_schedule_iovec()
do not, and that's where we want them pinned.  Resulting page references
end up (after quite a trip through data structures) stuffed into struct
rpc_rqst ->rc_recv_buf.pages[] and when a response arrives from server,
they get picked by xs_read_bvec() and fed to iov_iter_bvec().  In one
case it's safe since the pages are locked; in another - since they would
come from pin_user_pages().  The call chain at the time they are used
has nothing to do with the originator - sunrpc is looking at the arrived
response to READ that matches an rpc_rqst that had been created by sender
of that request and safety is the sender's responsibility.

John Hubbard Sept. 22, 2022, 6:09 a.m. UTC | #17

On 9/21/22 19:22, Al Viro wrote:
> On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote:
> 
>>> How would that work?  What protects the area where you want to avoid running
>>> into pinned pages from previously acceptable page getting pinned?  If "they
>>> must have been successfully unmapped" is a part of what you are planning, we
>>> really do have a problem...
>>
>> But this is a very good question. So far the idea was that we lock the
>> page, unmap (or writeprotect) the page, and then check pincount == 0 and
>> that is a reliable method for making sure page data is stable (until we
>> unlock the page & release other locks blocking page faults and writes). But
>> once suddently ordinary page references can be used to create pins this
>> does not work anymore. Hrm.
>>
>> Just brainstorming ideas now: So we'd either need to obtain the pins early
>> when we still have the virtual address (but I guess that is often not
>> practical but should work e.g. for normal direct IO path) or we need some
>> way to "simulate" the page fault when pinning the page, just don't map it
>> into page tables in the end. This simulated page fault could be perhaps
>> avoided if rmap walk shows that the page is already mapped somewhere with
>> suitable permissions.
> 
> OK.  As far as I can see, the rules are along the lines of
> 	* creator of ITER_BVEC/ITER_XARRAY is responsible for pages being safe.
> 	  That includes
> 		* page known to be locked by caller
> 		* page being privately allocated and not visible to anyone else
> 		* iterator being data source
> 		* page coming from pin_user_pages(), possibly as the result of
> 		  iov_iter_pin_pages() on ITER_IOVEC/ITER_UBUF.
> 	* ITER_PIPE pages are always safe
> 	* pages found in ITER_BVEC/ITER_XARRAY are safe, since the iterator
> 	  had been created with such.
> My preference would be to have iov_iter_get_pages() and friends pin if and
> only if we have data-destination iov_iter that is user-backed.  For
> data-source user-backed we only need FOLL_GET, and for all other flavours
> (ITER_BVEC, etc.) we only do get_page(), if we need to grab any references
> at all.

This rule would mostly work, as long as we can relax it in some cases, to
allow pinning of both source and dest pages, instead of just destination
pages, in some cases. In particular, bio_release_pages() has lost all
context about whether it was a read or a write request, as far as I can
tell. And bio_release_pages() is the primary place to unpin pages for
direct IO.

> 
> What I'd like to have is the understanding of the places where we drop
> the references acquired by iov_iter_get_pages().  How do we decide
> whether to unpin?  E.g. pipe_buffer carries a reference to page and no
> way to tell whether it's a pinned one; results of iov_iter_get_pages()
> on ITER_IOVEC *can* end up there, but thankfully only from data-source
> (== WRITE, aka.  ITER_SOURCE) iov_iter.  So for those we don't care.
> Then there's nfs_request; AFAICS, we do need to pin the references in
> those if they are coming from nfs_direct_read_schedule_iovec(), but
> not if they come from readpage_async_filler().  How do we deal with
> coalescence, etc.?  It's been a long time since I really looked at
> that code...  Christoph, could you give any comments on that one?
> 
> Note, BTW, that nfs_request coming from readpage_async_filler() have
> pages locked by caller; the ones from nfs_direct_read_schedule_iovec()
> do not, and that's where we want them pinned.  Resulting page references
> end up (after quite a trip through data structures) stuffed into struct
> rpc_rqst ->rc_recv_buf.pages[] and when a response arrives from server,
> they get picked by xs_read_bvec() and fed to iov_iter_bvec().  In one
> case it's safe since the pages are locked; in another - since they would
> come from pin_user_pages().  The call chain at the time they are used
> has nothing to do with the originator - sunrpc is looking at the arrived
> response to READ that matches an rpc_rqst that had been created by sender
> of that request and safety is the sender's responsibility.

For NFS Direct, is there any reason it can't be as simple as this
(conceptually, that is--the implementation of iov_iter_pin_pages_alloc()
is not shown here)? Here:


diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 1707f46b1335..7dbc705bab83 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -142,13 +142,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 	return 0;
 }
 
-static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
-{
-	unsigned int i;
-	for (i = 0; i < npages; i++)
-		put_page(pages[i]);
-}
-
 void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
 			      struct nfs_direct_req *dreq)
 {
@@ -332,7 +325,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc2(iter, &pagevec,
+		result = iov_iter_pin_pages_alloc(iter, &pagevec,
 						  rsize, &pgbase);
 		if (result < 0)
 			break;
@@ -362,7 +355,16 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+
+		/*
+		 * iov_iter_pin_pages_alloc() calls pin_user_pages_fast() for
+		 * the user_backed_iter() case (only).
+		 */
+		if (user_backed_iter(iter))
+			unpin_user_pages(pagevec, npages);
+		else
+			release_pages(pagevec, npages);
+
 		kvfree(pagevec);
 		if (result < 0)
 			break;
@@ -829,7 +831,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		release_pages(pagevec, npages);
 		kvfree(pagevec);
 		if (result < 0)
 			break;

thanks,

Jan Kara Sept. 22, 2022, 11:29 a.m. UTC | #18

On Wed 21-09-22 23:09:06, John Hubbard wrote:
> On 9/21/22 19:22, Al Viro wrote:
> > On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote:
> > 
> >>> How would that work?  What protects the area where you want to avoid running
> >>> into pinned pages from previously acceptable page getting pinned?  If "they
> >>> must have been successfully unmapped" is a part of what you are planning, we
> >>> really do have a problem...
> >>
> >> But this is a very good question. So far the idea was that we lock the
> >> page, unmap (or writeprotect) the page, and then check pincount == 0 and
> >> that is a reliable method for making sure page data is stable (until we
> >> unlock the page & release other locks blocking page faults and writes). But
> >> once suddently ordinary page references can be used to create pins this
> >> does not work anymore. Hrm.
> >>
> >> Just brainstorming ideas now: So we'd either need to obtain the pins early
> >> when we still have the virtual address (but I guess that is often not
> >> practical but should work e.g. for normal direct IO path) or we need some
> >> way to "simulate" the page fault when pinning the page, just don't map it
> >> into page tables in the end. This simulated page fault could be perhaps
> >> avoided if rmap walk shows that the page is already mapped somewhere with
> >> suitable permissions.
> > 
> > OK.  As far as I can see, the rules are along the lines of
> > 	* creator of ITER_BVEC/ITER_XARRAY is responsible for pages being safe.
> > 	  That includes
> > 		* page known to be locked by caller
> > 		* page being privately allocated and not visible to anyone else
> > 		* iterator being data source
> > 		* page coming from pin_user_pages(), possibly as the result of
> > 		  iov_iter_pin_pages() on ITER_IOVEC/ITER_UBUF.
> > 	* ITER_PIPE pages are always safe
> > 	* pages found in ITER_BVEC/ITER_XARRAY are safe, since the iterator
> > 	  had been created with such.
> > My preference would be to have iov_iter_get_pages() and friends pin if and
> > only if we have data-destination iov_iter that is user-backed.  For
> > data-source user-backed we only need FOLL_GET, and for all other flavours
> > (ITER_BVEC, etc.) we only do get_page(), if we need to grab any references
> > at all.
> 
> This rule would mostly work, as long as we can relax it in some cases, to
> allow pinning of both source and dest pages, instead of just destination
> pages, in some cases. In particular, bio_release_pages() has lost all
> context about whether it was a read or a write request, as far as I can
> tell. And bio_release_pages() is the primary place to unpin pages for
> direct IO.

Well, we already do have BIO_NO_PAGE_REF bio flag that gets checked in
bio_release_pages(). I think we can easily spare another bio flag to tell
whether we need to unpin or not. So as long as all the pages in the created
bio need the same treatment, the situation should be simple.

> > What I'd like to have is the understanding of the places where we drop
> > the references acquired by iov_iter_get_pages().  How do we decide
> > whether to unpin?  E.g. pipe_buffer carries a reference to page and no
> > way to tell whether it's a pinned one; results of iov_iter_get_pages()
> > on ITER_IOVEC *can* end up there, but thankfully only from data-source
> > (== WRITE, aka.  ITER_SOURCE) iov_iter.  So for those we don't care.
> > Then there's nfs_request; AFAICS, we do need to pin the references in
> > those if they are coming from nfs_direct_read_schedule_iovec(), but
> > not if they come from readpage_async_filler().  How do we deal with
> > coalescence, etc.?  It's been a long time since I really looked at
> > that code...  Christoph, could you give any comments on that one?
> > 
> > Note, BTW, that nfs_request coming from readpage_async_filler() have
> > pages locked by caller; the ones from nfs_direct_read_schedule_iovec()
> > do not, and that's where we want them pinned.  Resulting page references
> > end up (after quite a trip through data structures) stuffed into struct
> > rpc_rqst ->rc_recv_buf.pages[] and when a response arrives from server,
> > they get picked by xs_read_bvec() and fed to iov_iter_bvec().  In one
> > case it's safe since the pages are locked; in another - since they would
> > come from pin_user_pages().  The call chain at the time they are used
> > has nothing to do with the originator - sunrpc is looking at the arrived
> > response to READ that matches an rpc_rqst that had been created by sender
> > of that request and safety is the sender's responsibility.
> 
> For NFS Direct, is there any reason it can't be as simple as this
> (conceptually, that is--the implementation of iov_iter_pin_pages_alloc()
> is not shown here)? Here:
> 
> 
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 1707f46b1335..7dbc705bab83 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -142,13 +142,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
>  	return 0;
>  }
>  
> -static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
> -{
> -	unsigned int i;
> -	for (i = 0; i < npages; i++)
> -		put_page(pages[i]);
> -}
> -
>  void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
>  			      struct nfs_direct_req *dreq)
>  {
> @@ -332,7 +325,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
>  		size_t pgbase;
>  		unsigned npages, i;
>  
> -		result = iov_iter_get_pages_alloc2(iter, &pagevec,
> +		result = iov_iter_pin_pages_alloc(iter, &pagevec,
>  						  rsize, &pgbase);
>  		if (result < 0)
>  			break;
> @@ -362,7 +355,16 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
>  			pos += req_len;
>  			dreq->bytes_left -= req_len;
>  		}
> -		nfs_direct_release_pages(pagevec, npages);
> +
> +		/*
> +		 * iov_iter_pin_pages_alloc() calls pin_user_pages_fast() for
> +		 * the user_backed_iter() case (only).
> +		 */
> +		if (user_backed_iter(iter))
> +			unpin_user_pages(pagevec, npages);
> +		else
> +			release_pages(pagevec, npages);
> +

I don't think this will work. The pin nfs_direct_read_schedule_iovec()
obtains needs to be released once the IO is completed. Not once the IO is
submitted. Notice how nfs_create_request()->__nfs_create_request() gets
another page reference which is released on completion
(nfs_direct_read_completion->nfs_release_request->nfs_page_group_destroy->
nfs_free_request->nfs_clear_request). And we need to stop releasing the
obtained pin in nfs_direct_read_schedule_iovec() (and acquiring another
reference in __nfs_create_request()) and instead propagate it to
nfs_clear_request() where it can get released.

								Honza

Christoph Hellwig Sept. 22, 2022, 2:31 p.m. UTC | #19

On Wed, Sep 14, 2022 at 04:51:17AM +0100, Al Viro wrote:
> Unless I'm misreading Jan, the question is whether they should get or
> pin.

And I think the answer is:  inside ->read_iter or ->write_iter they
should neither get or pin.  The callers of it need to pin the pages
if they are pagecache pages that can potentially be written to through
shared mappings, else a get would be enough.  But the method instance
should not have to care and just be able to rely on the caller making
sure they do not go away.

> I'm really tempted to slap
> 	if (WARN_ON(i->data_source))
> 		return 0;
> into copy_to_iter() et.al., along with its opposite for copy_from_iter().

Ys, I think that would be useful.  And we could use something more
descriptive than READ/WRITE to start with.

Christoph Hellwig Sept. 22, 2022, 2:36 p.m. UTC | #20

On Tue, Sep 20, 2022 at 06:02:11AM +0100, Al Viro wrote:
> nvme target: nvme read requests end up with somebody allocating and filling
> sglist, followed by reading from file into it (using ITER_BVEC).  Then the
> pages are sent out, presumably

Yes.

> .  I would be very surprised if it turned out
> to be anything other than anon pages allocated by the driver, but I'd like
> to see that confirmed by nvme folks.  Probably doesn't need pinning.

They are anon pages allocated by the driver using sgl_alloc().

> drivers/target/target_core_file.c:292:  iov_iter_bvec(&iter, is_write, aio_cmd->bvecs, sgl_nents, len);

Same as nvme target.

> The picture so far looks like we mostly need to take care of pinning when
> we obtain the references from iov_iter_get_pages().  What's more, it looks
> like ITER_BVEC/ITER_XARRAY/ITER_PIPE we really don't need to pin anything on
> get_pages/pin_pages - they are already protected (or, in case of ITER_PIPE,
> allocated by iov_iter itself and not reachable by anybody outside).

That's what I've been trying to say for a while..

Al Viro Sept. 22, 2022, 2:36 p.m. UTC | #21

On Thu, Sep 22, 2022 at 07:31:36AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 14, 2022 at 04:51:17AM +0100, Al Viro wrote:
> > Unless I'm misreading Jan, the question is whether they should get or
> > pin.
> 
> And I think the answer is:  inside ->read_iter or ->write_iter they
> should neither get or pin.  The callers of it need to pin the pages
> if they are pagecache pages that can potentially be written to through
> shared mappings, else a get would be enough.  But the method instance
> should not have to care and just be able to rely on the caller making
> sure they do not go away.

The interesting part, AFAICS, is where do we _unpin_ them and how do
we keep track which pages (obtained from iov_iter_get_pages et.al.)
need to be unpinned.

> > I'm really tempted to slap
> > 	if (WARN_ON(i->data_source))
> > 		return 0;
> > into copy_to_iter() et.al., along with its opposite for copy_from_iter().
> 
> Ys, I think that would be useful.  And we could use something more
> descriptive than READ/WRITE to start with.

See #work.iov_iter; done, but it took a bit of fixing the places that
create iov_iter instances.

Christoph Hellwig Sept. 22, 2022, 2:38 p.m. UTC | #22

On Thu, Sep 22, 2022 at 03:22:48AM +0100, Al Viro wrote:
> What I'd like to have is the understanding of the places where we drop
> the references acquired by iov_iter_get_pages().  How do we decide
> whether to unpin?

Add a iov_iter_unpin_pages that does the right thing based on the
type.  (block will need a modified copy of it as it doesn't keep
the pages array around, but logic will be the same).

> E.g. pipe_buffer carries a reference to page and no
> way to tell whether it's a pinned one; results of iov_iter_get_pages()
> on ITER_IOVEC *can* end up there, but thankfully only from data-source
> (== WRITE, aka.  ITER_SOURCE) iov_iter.  So for those we don't care.
> Then there's nfs_request; AFAICS, we do need to pin the references in
> those if they are coming from nfs_direct_read_schedule_iovec(), but
> not if they come from readpage_async_filler().  How do we deal with
> coalescence, etc.?  It's been a long time since I really looked at
> that code...  Christoph, could you give any comments on that one?

I think the above should work, but I'll need to look at the NFS code
in more detail to be sure.

David Hildenbrand Sept. 22, 2022, 2:43 p.m. UTC | #23

On 22.09.22 16:36, Christoph Hellwig wrote:
> On Tue, Sep 20, 2022 at 06:02:11AM +0100, Al Viro wrote:
>> nvme target: nvme read requests end up with somebody allocating and filling
>> sglist, followed by reading from file into it (using ITER_BVEC).  Then the
>> pages are sent out, presumably
> 
> Yes.
> 
>> .  I would be very surprised if it turned out
>> to be anything other than anon pages allocated by the driver, but I'd like
>> to see that confirmed by nvme folks.  Probably doesn't need pinning.
> 
> They are anon pages allocated by the driver using sgl_alloc().

I assume they are not anon pages as in "PageAnon()", but simply not 
pagecache pages, correct?

Christoph Hellwig Sept. 22, 2022, 2:45 p.m. UTC | #24

On Thu, Sep 22, 2022 at 04:43:57PM +0200, David Hildenbrand wrote:
> I assume they are not anon pages as in "PageAnon()", but simply not
> pagecache pages, correct?

Yes, sorry.  From the page allocator and not added to the page cache.

Al Viro Sept. 23, 2022, 3:19 a.m. UTC | #25

On Thu, Sep 22, 2022 at 01:29:35PM +0200, Jan Kara wrote:

> > This rule would mostly work, as long as we can relax it in some cases, to
> > allow pinning of both source and dest pages, instead of just destination
> > pages, in some cases. In particular, bio_release_pages() has lost all
> > context about whether it was a read or a write request, as far as I can
> > tell. And bio_release_pages() is the primary place to unpin pages for
> > direct IO.
> 
> Well, we already do have BIO_NO_PAGE_REF bio flag that gets checked in
> bio_release_pages(). I think we can easily spare another bio flag to tell
> whether we need to unpin or not. So as long as all the pages in the created
> bio need the same treatment, the situation should be simple.

Yes.  Incidentally, the same condition is already checked by the creators
of those bio - see the assorted should_dirty logics.

While we are at it - how much of the rationale around bio_check_pages_dirty()
doing dirtying is still applicable with pinning pages before we stick them
into bio?  We do dirty them before submitting bio, then on completion
bio_check_pages_dirty() checks if something has marked them clean while
we'd been doing IO; if all of them are still dirty we just drop the pages
(well, unpin and drop), otherwise we arrange for dirty + unpin + drop
done in process context (via schedule_work()).  Can they be marked clean by
anyone while they are pinned?  After all, pinning is done to prevent
writeback getting done on them while we are modifying the suckers...

John Hubbard Sept. 23, 2022, 4:05 a.m. UTC | #26

On 9/22/22 20:19, Al Viro wrote:
> On Thu, Sep 22, 2022 at 01:29:35PM +0200, Jan Kara wrote:
> 
>>> This rule would mostly work, as long as we can relax it in some cases, to
>>> allow pinning of both source and dest pages, instead of just destination
>>> pages, in some cases. In particular, bio_release_pages() has lost all
>>> context about whether it was a read or a write request, as far as I can
>>> tell. And bio_release_pages() is the primary place to unpin pages for
>>> direct IO.
>>
>> Well, we already do have BIO_NO_PAGE_REF bio flag that gets checked in
>> bio_release_pages(). I think we can easily spare another bio flag to tell
>> whether we need to unpin or not. So as long as all the pages in the created
>> bio need the same treatment, the situation should be simple.
> 
> Yes.  Incidentally, the same condition is already checked by the creators
> of those bio - see the assorted should_dirty logics.

Beautiful!

> 
> While we are at it - how much of the rationale around bio_check_pages_dirty()
> doing dirtying is still applicable with pinning pages before we stick them
> into bio?  We do dirty them before submitting bio, then on completion
> bio_check_pages_dirty() checks if something has marked them clean while
> we'd been doing IO; if all of them are still dirty we just drop the pages
> (well, unpin and drop), otherwise we arrange for dirty + unpin + drop
> done in process context (via schedule_work()).  Can they be marked clean by
> anyone while they are pinned?  After all, pinning is done to prevent
> writeback getting done on them while we are modifying the suckers...

I certainly hope not. And in fact, we should really just say that that's
a rule: the whole time the page is pinned, it simply must remain dirty
and writable, at least with the way things are right now.

This reminds me that I'm not exactly sure what the rules for
FOLL_LONGTERM callers should be, with respect to dirtying. At the
moment, most, if not all of the code that does "set_page_dirty_lock();
unpin_user_page()" is wrong.

To fix those cases, IIUC, the answer is: you must make the page dirty
properly, with page_mkwrite(), not just with set_page_dirty_lock(). And
that has to be done probably a lot earlier, for reasons that I'm still
vague on. But perhaps right after pinning the page. (Assuming that we
hold off writeback while the page is pinned.)

Just wanted to see if that sounds right, while we're on the topic.

thanks,

Al Viro Sept. 23, 2022, 4:22 a.m. UTC | #27

On Thu, Sep 22, 2022 at 07:38:25AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 22, 2022 at 03:22:48AM +0100, Al Viro wrote:
> > What I'd like to have is the understanding of the places where we drop
> > the references acquired by iov_iter_get_pages().  How do we decide
> > whether to unpin?
> 
> Add a iov_iter_unpin_pages that does the right thing based on the
> type.  (block will need a modified copy of it as it doesn't keep
> the pages array around, but logic will be the same).

Huh?  You want to keep the type (+ direction) of iov_iter in any structure
a page reference coming from iov_iter_get_pages might end up in?  IDGI...

BTW, speaking of lifetime rules - am I right assuming that fd_execute_rw()
does IO on pages of the scatterlist passed to it?  Where are they getting
dropped and what guarantees that IO is complete by that point?

The reason I'm asking is that here you have an ITER_BVEC possibly fed to
__blkdev_direct_IO_async(), with its
        if (iov_iter_is_bvec(iter)) {
                /*
                 * Users don't rely on the iterator being in any particular
                 * state for async I/O returning -EIOCBQUEUED, hence we can
                 * avoid expensive iov_iter_advance(). Bypass
                 * bio_iov_iter_get_pages() and set the bvec directly.
                 */
                bio_iov_bvec_set(bio, iter);
which does *not* grab the page referneces.  Sure, bio_release_pages() knows
to leave those alone and doesn't drop anything.  However, what is the
mechanism preventing the pages getting freed before the IO completion
in this case?

John Hubbard Sept. 23, 2022, 4:34 a.m. UTC | #28

On 9/22/22 04:29, Jan Kara wrote:
> I don't think this will work. The pin nfs_direct_read_schedule_iovec()
> obtains needs to be released once the IO is completed. Not once the IO is
> submitted. Notice how nfs_create_request()->__nfs_create_request() gets
> another page reference which is released on completion
> (nfs_direct_read_completion->nfs_release_request->nfs_page_group_destroy->
> nfs_free_request->nfs_clear_request). And we need to stop releasing the
> obtained pin in nfs_direct_read_schedule_iovec() (and acquiring another
> reference in __nfs_create_request()) and instead propagate it to
> nfs_clear_request() where it can get released.
> 
> 								Honza

OK, now I see how that is supposed to work, thanks for explaining!


thanks,

Christoph Hellwig Sept. 23, 2022, 8:39 a.m. UTC | #29

On Thu, Sep 22, 2022 at 09:05:16PM -0700, John Hubbard wrote:
> I certainly hope not. And in fact, we should really just say that that's
> a rule: the whole time the page is pinned, it simply must remain dirty
> and writable, at least with the way things are right now.

Yes, if we can stick to that rule and make sure shared pagecache is
never dirtied through get_user_pags anywhere that will allow us to
fix a lot of mess

> To fix those cases, IIUC, the answer is: you must make the page dirty
> properly, with page_mkwrite(), not just with set_page_dirty_lock(). And
> that has to be done probably a lot earlier, for reasons that I'm still
> vague on. But perhaps right after pinning the page. (Assuming that we
> hold off writeback while the page is pinned.)

I think we need to hold off the writeback for it to work properly.
The big question is, is if there are callers that do expect data
to be written back on mappings that are longterm pinned.  RDMA or
vfio would come to mind.

Christoph Hellwig Sept. 23, 2022, 8:44 a.m. UTC | #30

On Fri, Sep 23, 2022 at 05:22:17AM +0100, Al Viro wrote:
> > Add a iov_iter_unpin_pages that does the right thing based on the
> > type.  (block will need a modified copy of it as it doesn't keep
> > the pages array around, but logic will be the same).
> 
> Huh?  You want to keep the type (+ direction) of iov_iter in any structure
> a page reference coming from iov_iter_get_pages might end up in?  IDGI...

Why would I?  We generall do have or should have the iov_iter around.
And for the common case where we don't (bios) we can carry that
information in the bio as it needs a special unmap helper anyway.

But if you don't want to use the iov_iter for some reason, we'll just
need to condense the information to a flags variable and then pass that.

> 
> BTW, speaking of lifetime rules - am I right assuming that fd_execute_rw()
> does IO on pages of the scatterlist passed to it?

Yes.

> Where are they getting
> dropped and what guarantees that IO is complete by that point?

The exact place depens on the exact taaraget frontend of which we have
a few.  But it happens from the end_io callback that is triggered
through a call to target_complete_cmd.

> The reason I'm asking is that here you have an ITER_BVEC possibly fed to
> __blkdev_direct_IO_async(), with its
>         if (iov_iter_is_bvec(iter)) {
>                 /*
>                  * Users don't rely on the iterator being in any particular
>                  * state for async I/O returning -EIOCBQUEUED, hence we can
>                  * avoid expensive iov_iter_advance(). Bypass
>                  * bio_iov_iter_get_pages() and set the bvec directly.
>                  */
>                 bio_iov_bvec_set(bio, iter);
> which does *not* grab the page referneces.  Sure, bio_release_pages() knows
> to leave those alone and doesn't drop anything.  However, what is the
> mechanism preventing the pages getting freed before the IO completion
> in this case?

The contract that callers of bvec iters need to hold their own
references as without that doing I/O do them would be unsafe.  It they
did not hold references the pages could go away before even calling
bio_iov_iter_get_pages (or this open coded bio_iov_bvec_set).

Jan Kara Sept. 23, 2022, 12:22 p.m. UTC | #31

On Thu 22-09-22 21:05:16, John Hubbard wrote:
> On 9/22/22 20:19, Al Viro wrote:
> > On Thu, Sep 22, 2022 at 01:29:35PM +0200, Jan Kara wrote:
> > 
> >>> This rule would mostly work, as long as we can relax it in some cases, to
> >>> allow pinning of both source and dest pages, instead of just destination
> >>> pages, in some cases. In particular, bio_release_pages() has lost all
> >>> context about whether it was a read or a write request, as far as I can
> >>> tell. And bio_release_pages() is the primary place to unpin pages for
> >>> direct IO.
> >>
> >> Well, we already do have BIO_NO_PAGE_REF bio flag that gets checked in
> >> bio_release_pages(). I think we can easily spare another bio flag to tell
> >> whether we need to unpin or not. So as long as all the pages in the created
> >> bio need the same treatment, the situation should be simple.
> > 
> > Yes.  Incidentally, the same condition is already checked by the creators
> > of those bio - see the assorted should_dirty logics.
> 
> Beautiful!
> 
> > 
> > While we are at it - how much of the rationale around bio_check_pages_dirty()
> > doing dirtying is still applicable with pinning pages before we stick them
> > into bio?  We do dirty them before submitting bio, then on completion
> > bio_check_pages_dirty() checks if something has marked them clean while
> > we'd been doing IO; if all of them are still dirty we just drop the pages
> > (well, unpin and drop), otherwise we arrange for dirty + unpin + drop
> > done in process context (via schedule_work()).  Can they be marked clean by
> > anyone while they are pinned?  After all, pinning is done to prevent
> > writeback getting done on them while we are modifying the suckers...
> 
> I certainly hope not. And in fact, we should really just say that that's
> a rule: the whole time the page is pinned, it simply must remain dirty
> and writable, at least with the way things are right now.

I agree the page should be staying dirty the whole time it is pinned. I
don't think it is feasible to keep it writeable in the page tables because
that would mean you would need to block e.g. munmap() until the pages gets
unpinned and that will almost certainly upset some current userspace.

But keeping page dirty should be enough so that we can get rid of all these
nasty calls to set_page_dirty() from IO completion.

> This reminds me that I'm not exactly sure what the rules for
> FOLL_LONGTERM callers should be, with respect to dirtying. At the
> moment, most, if not all of the code that does "set_page_dirty_lock();
> unpin_user_page()" is wrong.

Right.

> To fix those cases, IIUC, the answer is: you must make the page dirty
> properly, with page_mkwrite(), not just with set_page_dirty_lock(). And

Correct, and GUP (or PUP) actually does that under the hood so I don't
think we need to change anything there.

> that has to be done probably a lot earlier, for reasons that I'm still
> vague on. But perhaps right after pinning the page. (Assuming that we
> hold off writeback while the page is pinned.)

Holding off writeback is not always doable - as Christoph mentions, for
data integrity writeback we'll have to get the data to disk before the page
is unpinned (as for longterm users it can take days for the page to be
unpinned). But we can just writeback the page without clearing the dirty
bit in these cases. We may need to use bounce pages to be able to safely
writeback pinned pages but that's another part of the story...

								Honza

Al Viro Sept. 23, 2022, 4:13 p.m. UTC | #32

On Fri, Sep 23, 2022 at 01:44:33AM -0700, Christoph Hellwig wrote:

> Why would I?  We generall do have or should have the iov_iter around.

Not for async IO.

> And for the common case where we don't (bios) we can carry that
> information in the bio as it needs a special unmap helper anyway.

*Any* async read_iter is like that.

> > Where are they getting
> > dropped and what guarantees that IO is complete by that point?
> 
> The exact place depens on the exact taaraget frontend of which we have
> a few.  But it happens from the end_io callback that is triggered
> through a call to target_complete_cmd.

OK...

> > The reason I'm asking is that here you have an ITER_BVEC possibly fed to
> > __blkdev_direct_IO_async(), with its
> >         if (iov_iter_is_bvec(iter)) {
> >                 /*
> >                  * Users don't rely on the iterator being in any particular
> >                  * state for async I/O returning -EIOCBQUEUED, hence we can
> >                  * avoid expensive iov_iter_advance(). Bypass
> >                  * bio_iov_iter_get_pages() and set the bvec directly.
> >                  */
> >                 bio_iov_bvec_set(bio, iter);
> > which does *not* grab the page referneces.  Sure, bio_release_pages() knows
> > to leave those alone and doesn't drop anything.  However, what is the
> > mechanism preventing the pages getting freed before the IO completion
> > in this case?
> 
> The contract that callers of bvec iters need to hold their own
> references as without that doing I/O do them would be unsafe.  It they
> did not hold references the pages could go away before even calling
> bio_iov_iter_get_pages (or this open coded bio_iov_bvec_set).

You are mixing two issues here - holding references to pages while using
iov_iter instance is obvious; holding them until async IO is complete, even
though struct iov_iter might be long gone by that point is a different
story.

And originating iov_iter instance really can be long-gone by the time
of IO completion - requirement to keep it around would be very hard to
satisfy.  I've no objections to requiring the pages in ITER_BVEC to be
preserved at least until the IO completion by means independent of
whatever ->read_iter/->write_iter does to them, but
	* that needs to be spelled out very clearly and
	* we need to verify that it is, indeed, the case for all existing
iov_iter_bvec callers, preferably with comments next to non-obvious ones
(something that is followed only by the sync IO is obvious)

That goes not just for bio - if we make get_pages *NOT* grab references
on ITER_BVEC (and I'm all for it), we need to make sure that those
pages won't be retained after the original protection runs out.  Which
includes the reference put into struct nfs_request, for example, as well
as whatever ceph transport is doing, etc.  Another thing that needs to
be sorted out is __zerocopy_sg_from_iter() and its callers - AFAICS,
all of those are in ->sendmsg() with MSG_ZEROCOPY in flags.

It's a non-trivial amount of code audit - we have about 40 iov_iter_bvec()
callers in the tree, and while many are clearly sync-only... the ones
that are not tend to balloon into interesting analysis of call chains, etc.

Don't get me wrong - that analysis needs to be done, just don't expect
it to be trivial.  And it will require quite a bit of cooperation from the
folks familiar with e.g. drivers/target, or with ceph transport layer,
etc.

FWIW, my preference would be along the lines of

	Backing memory for any non user-backed iov_iter should be protected
	from reuse by creator of iov_iter and that protection should continue
	through not just all synchronous operations with iov_iter in question
	- it should last until all async operations involving that memory are
	finished.  That continued protection must not depend upon taking
	extra page references, etc. while we are working with iov_iter.

But getting there will take quite a bit of code audit and probably some massage
as well.

Christoph Hellwig Sept. 26, 2022, 3:53 p.m. UTC | #33

On Fri, Sep 23, 2022 at 05:13:42PM +0100, Al Viro wrote:
> You are mixing two issues here - holding references to pages while using
> iov_iter instance is obvious; holding them until async IO is complete, even
> though struct iov_iter might be long gone by that point is a different
> story.

But someone needs to hold a refernce until the I/O is completed, because
the I/O obviously needs the pages.  Yes, we could say the callers holds
them and can drop the references right after I/O submission, while
the method needs to grab another reference.  But that is more
complicated and is more costly than just holding the damn reference.

> And originating iov_iter instance really can be long-gone by the time
> of IO completion - requirement to keep it around would be very hard to
> satisfy.  I've no objections to requiring the pages in ITER_BVEC to be
> preserved at least until the IO completion by means independent of
> whatever ->read_iter/->write_iter does to them, but
> 	* that needs to be spelled out very clearly and
> 	* we need to verify that it is, indeed, the case for all existing
> iov_iter_bvec callers, preferably with comments next to non-obvious ones
> (something that is followed only by the sync IO is obvious)

Agreed.

> That goes not just for bio - if we make get_pages *NOT* grab references
> on ITER_BVEC (and I'm all for it), we need to make sure that those
> pages won't be retained after the original protection runs out.

Yes.

Al Viro Sept. 26, 2022, 7:55 p.m. UTC | #34

On Mon, Sep 26, 2022 at 08:53:43AM -0700, Christoph Hellwig wrote:
> On Fri, Sep 23, 2022 at 05:13:42PM +0100, Al Viro wrote:
> > You are mixing two issues here - holding references to pages while using
> > iov_iter instance is obvious; holding them until async IO is complete, even
> > though struct iov_iter might be long gone by that point is a different
> > story.
> 
> But someone needs to hold a refernce until the I/O is completed, because
> the I/O obviously needs the pages.  Yes, we could say the callers holds
> them and can drop the references right after I/O submission, while
> the method needs to grab another reference.  But that is more
> complicated and is more costly than just holding the damn reference.

Take a look at __nfs_create_request().  And trace the call chains leading
to nfs_clear_request() where the corresponding put_page() happens.

What I'm afraid of is something similar in the bowels of some RDMA driver.
With upper layers shoving page references into sglist using iov_iter_get_pages(),
then passing sglist to some intermediate layer, then *that* getting passed down
into a driver which grabs references for its own use and releases them from
destructor of some private structure.  Done via kref_put().  Have that
delayed by, hell - anything, up to and including debugfs shite somewhere
in the same driver, iterating through those private structures, grabbing
a reference to do some pretty-print into kmalloc'ed buffer, then drooping it.
Voila - we have page refs duplicated from ITER_BVEC and occasionally staying
around after the ->ki_complete() of async ->write_iter() that got that
ITER_BVEC.

It's really not a trivial rule change.

[v2,4/7] iov_iter: new iov_iter_pin_pages*() routines

Commit Message

Comments

Patch