Message ID | 20220831041843.973026-5-jhubbard@nvidia.com (mailing list archive) |
---|---|
State | Deferred, archived |
Headers | show |
Series | convert most filesystems to pin_user_pages_fast() | expand |
On Tue, Aug 30, 2022 at 09:18:40PM -0700, John Hubbard wrote: > Provide two new wrapper routines that are intended for user space pages > only: > > iov_iter_pin_pages() > iov_iter_pin_pages_alloc() > > Internally, these routines call pin_user_pages_fast(), instead of > get_user_pages_fast(), for user_backed_iter(i) and iov_iter_bvec(i) > cases. > > As always, callers must use unpin_user_pages() or a suitable FOLL_PIN > variant, to release the pages, if they actually were acquired via > pin_user_pages_fast(). > > This is a prerequisite to converting bio/block layers over to use > pin_user_pages_fast(). What of ITER_PIPE (splice from O_DIRECT fd to a to pipe, for filesystem that uses generic_file_splice_read())?
On 8/31/22 17:42, Al Viro wrote: > On Tue, Aug 30, 2022 at 09:18:40PM -0700, John Hubbard wrote: >> Provide two new wrapper routines that are intended for user space pages >> only: >> >> iov_iter_pin_pages() >> iov_iter_pin_pages_alloc() >> >> Internally, these routines call pin_user_pages_fast(), instead of >> get_user_pages_fast(), for user_backed_iter(i) and iov_iter_bvec(i) >> cases. >> >> As always, callers must use unpin_user_pages() or a suitable FOLL_PIN >> variant, to release the pages, if they actually were acquired via >> pin_user_pages_fast(). >> >> This is a prerequisite to converting bio/block layers over to use >> pin_user_pages_fast(). > > What of ITER_PIPE (splice from O_DIRECT fd to a to pipe, for filesystem > that uses generic_file_splice_read())? Yes. And it turns out that I sent this v2 just a little too early: it does not include Jan Kara's latest idea [1] of including ITER_PIPE and ITER_XARRAY. That should fix this up. [1] https://lore.kernel.org/r/20220831094349.boln4jjajkdtykx3@quack3 thanks,
I'd it one step back. For BVECS we never need a get or pin. The block layer already does this, an the other callers should as well. For KVEC the same is true. For PIPE and xarray as you pointed out we can probably just do the pin, it is not like these are performance paths. So, I'd suggest to: - factor out the user backed and bvec cases from __iov_iter_get_pages_alloc into helper just to keep __iov_iter_get_pages_alloc readable. - for the pin case don't use the existing bvec helper at all, but copy the logic for the block layer for not pinning.
On 9/5/22 23:47, Christoph Hellwig wrote: > I'd it one step back. For BVECS we never need a get or pin. The > block layer already does this, an the other callers should as well. > For KVEC the same is true. For PIPE and xarray as you pointed out > we can probably just do the pin, it is not like these are performance > paths. > > So, I'd suggest to: > > - factor out the user backed and bvec cases from > __iov_iter_get_pages_alloc into helper just to keep > __iov_iter_get_pages_alloc readable. OK, that part is clear. > - for the pin case don't use the existing bvec helper at all, but > copy the logic for the block layer for not pinning. I'm almost, but not quite sure I get the idea above. Overall, what happens to bvec pages? Leave the get_page() pin in place for FOLL_GET (or USE_FOLL_GET), I suppose, but do...what, for FOLL_PIN callers? thanks,
On Tue, Sep 06, 2022 at 12:44:28AM -0700, John Hubbard wrote: > OK, that part is clear. > > > - for the pin case don't use the existing bvec helper at all, but > > copy the logic for the block layer for not pinning. > > I'm almost, but not quite sure I get the idea above. Overall, what > happens to bvec pages? Leave the get_page() pin in place for FOLL_GET > (or USE_FOLL_GET), I suppose, but do...what, for FOLL_PIN callers? Do not change anyhing for FOLL_GET callers, as they are on the way out anyway. For FOLL_PIN callers, never pin bvec and kvec pages: For file systems not acquiring a reference is obviously safe, and the other callers will need an audit, but I can't think of why it woul ever be unsafe.
On 9/6/22 00:48, Christoph Hellwig wrote: > On Tue, Sep 06, 2022 at 12:44:28AM -0700, John Hubbard wrote: >> OK, that part is clear. >> >>> - for the pin case don't use the existing bvec helper at all, but >>> copy the logic for the block layer for not pinning. >> >> I'm almost, but not quite sure I get the idea above. Overall, what >> happens to bvec pages? Leave the get_page() pin in place for FOLL_GET >> (or USE_FOLL_GET), I suppose, but do...what, for FOLL_PIN callers? > > Do not change anyhing for FOLL_GET callers, as they are on the way out > anyway. > OK, got it. > For FOLL_PIN callers, never pin bvec and kvec pages: For file systems > not acquiring a reference is obviously safe, and the other callers will > need an audit, but I can't think of why it woul ever be unsafe. In order to do that, one would need to be confident that such bvec and kvec pages do not get passed down to bio_release_pages() (or the new bio_unpin_pages()). Also, I'm missing a key point, because today bvec pages get a get_page() reference from __iov_iter_get_pages_alloc(). If I just skip that, then the get/put calls are unbalanced... I can just hear Al Viro repeating his points about splice() and vmsplice(), heh. :) thanks,
On Tue 06-09-22 00:48:49, Christoph Hellwig wrote: > On Tue, Sep 06, 2022 at 12:44:28AM -0700, John Hubbard wrote: > > OK, that part is clear. > > > > > - for the pin case don't use the existing bvec helper at all, but > > > copy the logic for the block layer for not pinning. > > > > I'm almost, but not quite sure I get the idea above. Overall, what > > happens to bvec pages? Leave the get_page() pin in place for FOLL_GET > > (or USE_FOLL_GET), I suppose, but do...what, for FOLL_PIN callers? > > Do not change anyhing for FOLL_GET callers, as they are on the way out > anyway. > > For FOLL_PIN callers, never pin bvec and kvec pages: For file systems > not acquiring a reference is obviously safe, and the other callers will > need an audit, but I can't think of why it woul ever be unsafe. Are you sure about "For file systems not acquiring a reference is obviously safe"? I can see places e.g. in orangefs, afs, etc. which create bvec iters from pagecache pages. And then we have iter_file_splice_write() which creates bvec from pipe pages (which can also be pagecache pages if vmsplice() is used). So perhaps there are no lifetime issues even without acquiring a reference (but looking at the code I would not say it is obvious) but I definitely don't see how it would be safe to not get a pin to signal to filesystem backing the pagecache page that there is DMA happening to/from the page. Honza
On Tue, Sep 06, 2022 at 12:21:06PM +0200, Jan Kara wrote: > > For FOLL_PIN callers, never pin bvec and kvec pages: For file systems > > not acquiring a reference is obviously safe, and the other callers will > > need an audit, but I can't think of why it woul ever be unsafe. > > Are you sure about "For file systems not acquiring a reference is obviously > safe"? I can see places e.g. in orangefs, afs, etc. which create bvec iters > from pagecache pages. And then we have iter_file_splice_write() which > creates bvec from pipe pages (which can also be pagecache pages if > vmsplice() is used). So perhaps there are no lifetime issues even without > acquiring a reference (but looking at the code I would not say it is > obvious) but I definitely don't see how it would be safe to not get a pin > to signal to filesystem backing the pagecache page that there is DMA > happening to/from the page. I mean in the context of iov_iter_get_pages callers, that is direct I/O. Direct callers of iov_iter_bvec which then pass that iov to ->read_iter / ->write_iter will need to hold references (those are the references that the callers of iov_iter_get_pages rely on!).
On Tue, Sep 06, 2022 at 12:58:59AM -0700, John Hubbard wrote: > > For FOLL_PIN callers, never pin bvec and kvec pages: For file systems > > not acquiring a reference is obviously safe, and the other callers will > > need an audit, but I can't think of why it woul ever be unsafe. > > In order to do that, one would need to be confident that such bvec and kvec > pages do not get passed down to bio_release_pages() (or the new > bio_unpin_pages()). Also, I'm missing a key point, because today bvec pages get > a get_page() reference from __iov_iter_get_pages_alloc(). If I just skip > that, then the get/put calls are unbalanced... Except that for the most relevant callers (bdev and iomap) we never end up in __iov_iter_get_pages_alloc. What I suggest is that, with the move to the pin helper, we codify that logic. For callers of the bio_iov_iter_pin_pages helper, the corresponding bio_unpin_pages helper will encapsulate that logic. For direct calls to iov_iter_pin_pages, we should add a iov_iter_unpin_pages helper that also encapsulates the logic. The beauty is that this means the caller itself does not have to care about any of thise, and the logic is in one (well two due to the block special case that reuses the bio_vec array for the pages space) place. > I can just hear Al Viro repeating his points about splice() and vmsplice(), > heh. :) splice does take these references before calling into the file system (see my reply to Jan).
On Wed, Sep 07, 2022 at 01:45:26AM -0700, Christoph Hellwig wrote: > On Tue, Sep 06, 2022 at 12:21:06PM +0200, Jan Kara wrote: > > > For FOLL_PIN callers, never pin bvec and kvec pages: For file systems > > > not acquiring a reference is obviously safe, and the other callers will > > > need an audit, but I can't think of why it woul ever be unsafe. > > > > Are you sure about "For file systems not acquiring a reference is obviously > > safe"? I can see places e.g. in orangefs, afs, etc. which create bvec iters > > from pagecache pages. And then we have iter_file_splice_write() which > > creates bvec from pipe pages (which can also be pagecache pages if > > vmsplice() is used). So perhaps there are no lifetime issues even without > > acquiring a reference (but looking at the code I would not say it is > > obvious) but I definitely don't see how it would be safe to not get a pin > > to signal to filesystem backing the pagecache page that there is DMA > > happening to/from the page. > > I mean in the context of iov_iter_get_pages callers, that is direct > I/O. Direct callers of iov_iter_bvec which then pass that iov to > ->read_iter / ->write_iter will need to hold references (those are > the references that the callers of iov_iter_get_pages rely on!). Unless I'm misreading Jan, the question is whether they should get or pin. AFAICS, anyone who passes the sucker to ->read_iter() (or ->recvmsg(), or does direct copy_to_iter()/zero_iter(), etc.) is falling under ================================================================================= CASE 5: Pinning in order to write to the data within the page ------------------------------------------------------------- Even though neither DMA nor Direct IO is involved, just a simple case of "pin, write to a page's data, unpin" can cause a problem. Case 5 may be considered a superset of Case 1, plus Case 2, plus anything that invokes that pattern. In other words, if the code is neither Case 1 nor Case 2, it may still require FOLL_PIN, for patterns like this: Correct (uses FOLL_PIN calls): pin_user_pages() write to the data within the pages unpin_user_pages() INCORRECT (uses FOLL_GET calls): get_user_pages() write to the data within the pages put_page() ================================================================================= Regarding iter_file_splice_write() case, do we need to pin pages when we are not going to modify the data in those? The same goes for afs, AFAICS; I started to type "... and everything that passes WRITE to iov_iter_bvec()", but... drivers/vhost/vringh.c:1165: iov_iter_bvec(&iter, READ, iov, ret, translated); drivers/vhost/vringh.c:1198: iov_iter_bvec(&iter, WRITE, iov, ret, translated); is backwards - READ is for data destinations, comes with copy_to_iter(); WRITE is for data sources and it comes with copy_from_iter()... I'm really tempted to slap if (WARN_ON(i->data_source)) return 0; into copy_to_iter() et.al., along with its opposite for copy_from_iter(). And see who comes screaming... Things like if (unlikely(iov_iter_is_pipe(i) || iov_iter_is_discard(i))) { WARN_ON(1); return 0; } in e.g. csum_and_copy_from_iter() would be replaced by that, and become easier to understand... These two are also getting it wrong, BTW: drivers/target/target_core_file.c:340: iov_iter_bvec(&iter, READ, bvec, sgl_nents, len); drivers/target/target_core_file.c:476: iov_iter_bvec(&iter, READ, bvec, nolb, len);
On Wed 14-09-22 04:51:17, Al Viro wrote: > On Wed, Sep 07, 2022 at 01:45:26AM -0700, Christoph Hellwig wrote: > > On Tue, Sep 06, 2022 at 12:21:06PM +0200, Jan Kara wrote: > > > > For FOLL_PIN callers, never pin bvec and kvec pages: For file systems > > > > not acquiring a reference is obviously safe, and the other callers will > > > > need an audit, but I can't think of why it woul ever be unsafe. > > > > > > Are you sure about "For file systems not acquiring a reference is obviously > > > safe"? I can see places e.g. in orangefs, afs, etc. which create bvec iters > > > from pagecache pages. And then we have iter_file_splice_write() which > > > creates bvec from pipe pages (which can also be pagecache pages if > > > vmsplice() is used). So perhaps there are no lifetime issues even without > > > acquiring a reference (but looking at the code I would not say it is > > > obvious) but I definitely don't see how it would be safe to not get a pin > > > to signal to filesystem backing the pagecache page that there is DMA > > > happening to/from the page. > > > > I mean in the context of iov_iter_get_pages callers, that is direct > > I/O. Direct callers of iov_iter_bvec which then pass that iov to > > ->read_iter / ->write_iter will need to hold references (those are > > the references that the callers of iov_iter_get_pages rely on!). > > Unless I'm misreading Jan, the question is whether they should get or > pin. AFAICS, anyone who passes the sucker to ->read_iter() (or ->recvmsg(), > or does direct copy_to_iter()/zero_iter(), etc.) is falling under > ================================================================================= > CASE 5: Pinning in order to write to the data within the page > ------------------------------------------------------------- > Even though neither DMA nor Direct IO is involved, just a simple case of "pin, > write to a page's data, unpin" can cause a problem. Case 5 may be considered a > superset of Case 1, plus Case 2, plus anything that invokes that pattern. In > other words, if the code is neither Case 1 nor Case 2, it may still require > FOLL_PIN, for patterns like this: > > Correct (uses FOLL_PIN calls): > pin_user_pages() > write to the data within the pages > unpin_user_pages() > > INCORRECT (uses FOLL_GET calls): > get_user_pages() > write to the data within the pages > put_page() > ================================================================================= Yes, that was my point. > Regarding iter_file_splice_write() case, do we need to pin pages > when we are not going to modify the data in those? Strictly speaking not. So far we are pinning pages even if they serve as data source because it is simpler not to bother about data access direction but I'm not really aware of anything that would mandate that. Honza
On Wed, Sep 14, 2022 at 04:52:33PM +0200, Jan Kara wrote: > > ================================================================================= > > CASE 5: Pinning in order to write to the data within the page > > ------------------------------------------------------------- > > Even though neither DMA nor Direct IO is involved, just a simple case of "pin, > > write to a page's data, unpin" can cause a problem. Case 5 may be considered a > > superset of Case 1, plus Case 2, plus anything that invokes that pattern. In > > other words, if the code is neither Case 1 nor Case 2, it may still require > > FOLL_PIN, for patterns like this: > > > > Correct (uses FOLL_PIN calls): > > pin_user_pages() > > write to the data within the pages > > unpin_user_pages() > > > > INCORRECT (uses FOLL_GET calls): > > get_user_pages() > > write to the data within the pages > > put_page() > > ================================================================================= > > Yes, that was my point. The thing is, at which point do we pin those pages? pin_user_pages() works by userland address; by the time we get to any of those we have struct page references and no idea whether they are still mapped anywhere. How would that work? What protects the area where you want to avoid running into pinned pages from previously acceptable page getting pinned? If "they must have been successfully unmapped" is a part of what you are planning, we really do have a problem...
On Wed 14-09-22 17:42:40, Al Viro wrote: > On Wed, Sep 14, 2022 at 04:52:33PM +0200, Jan Kara wrote: > > > ================================================================================= > > > CASE 5: Pinning in order to write to the data within the page > > > ------------------------------------------------------------- > > > Even though neither DMA nor Direct IO is involved, just a simple case of "pin, > > > write to a page's data, unpin" can cause a problem. Case 5 may be considered a > > > superset of Case 1, plus Case 2, plus anything that invokes that pattern. In > > > other words, if the code is neither Case 1 nor Case 2, it may still require > > > FOLL_PIN, for patterns like this: > > > > > > Correct (uses FOLL_PIN calls): > > > pin_user_pages() > > > write to the data within the pages > > > unpin_user_pages() > > > > > > INCORRECT (uses FOLL_GET calls): > > > get_user_pages() > > > write to the data within the pages > > > put_page() > > > ================================================================================= > > > > Yes, that was my point. > > The thing is, at which point do we pin those pages? pin_user_pages() works by > userland address; by the time we get to any of those we have struct page > references and no idea whether they are still mapped anywhere. Yes, pin_user_pages() currently works by page address but there's nothing fundamental about that. Technically, pin is currently just another type of page reference so we can as well just pin the page when given struct page. In fact John Hubbart has added such helper in this series. > How would that work? What protects the area where you want to avoid running > into pinned pages from previously acceptable page getting pinned? If "they > must have been successfully unmapped" is a part of what you are planning, we > really do have a problem... But this is a very good question. So far the idea was that we lock the page, unmap (or writeprotect) the page, and then check pincount == 0 and that is a reliable method for making sure page data is stable (until we unlock the page & release other locks blocking page faults and writes). But once suddently ordinary page references can be used to create pins this does not work anymore. Hrm. Just brainstorming ideas now: So we'd either need to obtain the pins early when we still have the virtual address (but I guess that is often not practical but should work e.g. for normal direct IO path) or we need some way to "simulate" the page fault when pinning the page, just don't map it into page tables in the end. This simulated page fault could be perhaps avoided if rmap walk shows that the page is already mapped somewhere with suitable permissions. Honza
On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote: > > How would that work? What protects the area where you want to avoid running > > into pinned pages from previously acceptable page getting pinned? If "they > > must have been successfully unmapped" is a part of what you are planning, we > > really do have a problem... > > But this is a very good question. So far the idea was that we lock the > page, unmap (or writeprotect) the page, and then check pincount == 0 and > that is a reliable method for making sure page data is stable (until we > unlock the page & release other locks blocking page faults and writes). But > once suddently ordinary page references can be used to create pins this > does not work anymore. Hrm. > > Just brainstorming ideas now: So we'd either need to obtain the pins early > when we still have the virtual address (but I guess that is often not > practical but should work e.g. for normal direct IO path) or we need some > way to "simulate" the page fault when pinning the page, just don't map it > into page tables in the end. This simulated page fault could be perhaps > avoided if rmap walk shows that the page is already mapped somewhere with > suitable permissions. OK... I'd done some digging; results so far * READ vs. WRITE turned out to be an awful way to specify iov_iter data direction. Local iov_iter branch so far: get rid of unlikely() on page_copy_sane() calls csum_and_copy_to_iter(): handle ITER_DISCARD [s390] copy_oldmem_kernel() - WRITE is "data source", not destination [fsi] WRITE is "data source", not destination... [infiniband] READ is "data destination", not source... [s390] zcore: WRITE is "data source", not destination... [target] fix iov_iter_bvec() "direction" argument [vhost] fix 'direction' argument of iov_iter_{init,bvec}() [xen] fix "direction" argument of iov_iter_kvec() [trace] READ means "data destination", not source... iov_iter: saner checks for attempt to copy to/from iterator use less confusing names for iov_iter direction initializers those 8 commits in the middle consist of fixes, some of them with more than one call site affected. Folks keep going "oh, we are going to copy data into that iterator, must be WRITE". Wrong - WRITE means "as for write(2)", i.e. the data _source_, not data destination. And the same kind of bugs goes in the opposite direction, of course. I think something like ITER_DEST vs. ITER_SOURCE would be less confusing. * anything that goes with ITER_SOURCE doesn't need pin. * ITER_IOVEC/ITER_UBUF need pin for get_pages and for nothing else. Need to grab reference on get_pages, obviously. * even more obviously, ITER_DISCARD is irrelevant here. * ITER_PIPE only modifies anonymous pages that had been allocated by iov_iter primitives and hadn't been observed by anything outside until we are done with said ITER_PIPE. * quite a few instances are similar to e.g. REQ_OP_READ handling in /dev/loop - we work with ITER_BVEC there and we do modify the page contents, but the damn thing would better be given to us locked and stay locked until all involved modifications (be it real IO/decoding/whatever) is complete. That ought to be safe, unless I'm missing something. That doesn't cover everything; still going through the list...
On Fri, Sep 16, 2022 at 02:55:53AM +0100, Al Viro wrote: > * READ vs. WRITE turned out to be an awful way to specify iov_iter > data direction. Local iov_iter branch so far: > get rid of unlikely() on page_copy_sane() calls > csum_and_copy_to_iter(): handle ITER_DISCARD > [s390] copy_oldmem_kernel() - WRITE is "data source", not destination > [fsi] WRITE is "data source", not destination... > [infiniband] READ is "data destination", not source... > [s390] zcore: WRITE is "data source", not destination... > [target] fix iov_iter_bvec() "direction" argument > [vhost] fix 'direction' argument of iov_iter_{init,bvec}() > [xen] fix "direction" argument of iov_iter_kvec() > [trace] READ means "data destination", not source... > iov_iter: saner checks for attempt to copy to/from iterator > use less confusing names for iov_iter direction initializers > those 8 commits in the middle consist of fixes, some of them with more than > one call site affected. Folks keep going "oh, we are going to copy data > into that iterator, must be WRITE". Wrong - WRITE means "as for write(2)", > i.e. the data _source_, not data destination. And the same kind of bugs > goes in the opposite direction, of course. > I think something like ITER_DEST vs. ITER_SOURCE would be less > confusing. > > * anything that goes with ITER_SOURCE doesn't need pin. > * ITER_IOVEC/ITER_UBUF need pin for get_pages and for nothing else. > Need to grab reference on get_pages, obviously. > * even more obviously, ITER_DISCARD is irrelevant here. > * ITER_PIPE only modifies anonymous pages that had been allocated > by iov_iter primitives and hadn't been observed by anything outside until > we are done with said ITER_PIPE. > * quite a few instances are similar to e.g. REQ_OP_READ handling in > /dev/loop - we work with ITER_BVEC there and we do modify the page contents, > but the damn thing would better be given to us locked and stay locked until > all involved modifications (be it real IO/decoding/whatever) is complete. > That ought to be safe, unless I'm missing something. > > That doesn't cover everything; still going through the list... More: nvme target: nvme read requests end up with somebody allocating and filling sglist, followed by reading from file into it (using ITER_BVEC). Then the pages are sent out, presumably. I would be very surprised if it turned out to be anything other than anon pages allocated by the driver, but I'd like to see that confirmed by nvme folks. Probably doesn't need pinning. ->read_folio() instances - page locked by caller, not unlocked until we are done. ->readahead() instances - pages are in the segment of page cache that had been populated and locked by the caller; some are ITER_BVEC (with page(s) extracted by readahead_page()), some - ITER_XARRAY. other similar places (some of ->write_begin() instances, after having grabbed a locked page, etc.) ->issue_read() instances - the call graph is scary (in particular, recursion prevention there is non-obvious), but unless netfs folks say otherwise, I'd assume that all pages involved are supposed to be locked by the caller. swap reads (ending up at __swap_read_unplug()) - pages locked by callers. in some cases (cifs) pages are privately allocated and not visible to anyone else. io_import_fixed() sets ITER_BVEC over pinned pages; see io_pin_pages() for the place where that's done. In cifs_send_async_read() we take the pages that will eventually go into ITER_BVEC iterator from iov_iter_get_pages() - that one wants pinning if the type of ctx->iter would demand so. The same goes for setup_aio_ctx_iter() - iov_iter_get_pages() is used to make an ITER_BVEC counterpart of the iov_iter passed to ->read_iter(), with the same considerations re pinning. The same goes for ceph __iter_get_bvecs(). Haven't done yet: drivers/target/target_core_file.c:292: iov_iter_bvec(&iter, is_write, aio_cmd->bvecs, sgl_nents, len); drivers/vhost/vringh.c:1198: iov_iter_bvec(&iter, ITER_DEST, iov, ret, translated); fs/afs/dir.c:308: iov_iter_xarray(&req->def_iter, ITER_DEST, &dvnode->netfs.inode.i_mapping->i_pages, net/ceph/messenger_v1.c:52: iov_iter_bvec(&msg.msg_iter, ITER_DEST, &bvec, 1, length); net/ceph/messenger_v2.c:236: iov_iter_bvec(&con->v2.in_iter, ITER_DEST, &con->v2.in_bvec, 1, bv->bv_len); net/sunrpc/svcsock.c:263: iov_iter_bvec(&msg.msg_iter, ITER_DEST, bvec, i, buflen); net/sunrpc/xprtsock.c:376: iov_iter_bvec(&msg->msg_iter, ITER_DEST, bvec, nr, count); The picture so far looks like we mostly need to take care of pinning when we obtain the references from iov_iter_get_pages(). What's more, it looks like ITER_BVEC/ITER_XARRAY/ITER_PIPE we really don't need to pin anything on get_pages/pin_pages - they are already protected (or, in case of ITER_PIPE, allocated by iov_iter itself and not reachable by anybody outside). Might or might not be true for the remaining 7 call sites... NOTE: all of the above assumes that callers with pre-locked pages are either synchronous or do not unlock until the completion callbacks. It does appear to be true; if it is true, I really wonder if we need to even grab references in iov_iter_pin_pages() for anything other than ITER_IOVEC/ITER_UBUF. The right primitive might be if user-backed pin pages else just copy the pointers; any lifetime-related issues are up to the caller. advance iterator in either case
On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote: > > How would that work? What protects the area where you want to avoid running > > into pinned pages from previously acceptable page getting pinned? If "they > > must have been successfully unmapped" is a part of what you are planning, we > > really do have a problem... > > But this is a very good question. So far the idea was that we lock the > page, unmap (or writeprotect) the page, and then check pincount == 0 and > that is a reliable method for making sure page data is stable (until we > unlock the page & release other locks blocking page faults and writes). But > once suddently ordinary page references can be used to create pins this > does not work anymore. Hrm. > > Just brainstorming ideas now: So we'd either need to obtain the pins early > when we still have the virtual address (but I guess that is often not > practical but should work e.g. for normal direct IO path) or we need some > way to "simulate" the page fault when pinning the page, just don't map it > into page tables in the end. This simulated page fault could be perhaps > avoided if rmap walk shows that the page is already mapped somewhere with > suitable permissions. OK. As far as I can see, the rules are along the lines of * creator of ITER_BVEC/ITER_XARRAY is responsible for pages being safe. That includes * page known to be locked by caller * page being privately allocated and not visible to anyone else * iterator being data source * page coming from pin_user_pages(), possibly as the result of iov_iter_pin_pages() on ITER_IOVEC/ITER_UBUF. * ITER_PIPE pages are always safe * pages found in ITER_BVEC/ITER_XARRAY are safe, since the iterator had been created with such. My preference would be to have iov_iter_get_pages() and friends pin if and only if we have data-destination iov_iter that is user-backed. For data-source user-backed we only need FOLL_GET, and for all other flavours (ITER_BVEC, etc.) we only do get_page(), if we need to grab any references at all. What I'd like to have is the understanding of the places where we drop the references acquired by iov_iter_get_pages(). How do we decide whether to unpin? E.g. pipe_buffer carries a reference to page and no way to tell whether it's a pinned one; results of iov_iter_get_pages() on ITER_IOVEC *can* end up there, but thankfully only from data-source (== WRITE, aka. ITER_SOURCE) iov_iter. So for those we don't care. Then there's nfs_request; AFAICS, we do need to pin the references in those if they are coming from nfs_direct_read_schedule_iovec(), but not if they come from readpage_async_filler(). How do we deal with coalescence, etc.? It's been a long time since I really looked at that code... Christoph, could you give any comments on that one? Note, BTW, that nfs_request coming from readpage_async_filler() have pages locked by caller; the ones from nfs_direct_read_schedule_iovec() do not, and that's where we want them pinned. Resulting page references end up (after quite a trip through data structures) stuffed into struct rpc_rqst ->rc_recv_buf.pages[] and when a response arrives from server, they get picked by xs_read_bvec() and fed to iov_iter_bvec(). In one case it's safe since the pages are locked; in another - since they would come from pin_user_pages(). The call chain at the time they are used has nothing to do with the originator - sunrpc is looking at the arrived response to READ that matches an rpc_rqst that had been created by sender of that request and safety is the sender's responsibility.
On 9/21/22 19:22, Al Viro wrote: > On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote: > >>> How would that work? What protects the area where you want to avoid running >>> into pinned pages from previously acceptable page getting pinned? If "they >>> must have been successfully unmapped" is a part of what you are planning, we >>> really do have a problem... >> >> But this is a very good question. So far the idea was that we lock the >> page, unmap (or writeprotect) the page, and then check pincount == 0 and >> that is a reliable method for making sure page data is stable (until we >> unlock the page & release other locks blocking page faults and writes). But >> once suddently ordinary page references can be used to create pins this >> does not work anymore. Hrm. >> >> Just brainstorming ideas now: So we'd either need to obtain the pins early >> when we still have the virtual address (but I guess that is often not >> practical but should work e.g. for normal direct IO path) or we need some >> way to "simulate" the page fault when pinning the page, just don't map it >> into page tables in the end. This simulated page fault could be perhaps >> avoided if rmap walk shows that the page is already mapped somewhere with >> suitable permissions. > > OK. As far as I can see, the rules are along the lines of > * creator of ITER_BVEC/ITER_XARRAY is responsible for pages being safe. > That includes > * page known to be locked by caller > * page being privately allocated and not visible to anyone else > * iterator being data source > * page coming from pin_user_pages(), possibly as the result of > iov_iter_pin_pages() on ITER_IOVEC/ITER_UBUF. > * ITER_PIPE pages are always safe > * pages found in ITER_BVEC/ITER_XARRAY are safe, since the iterator > had been created with such. > My preference would be to have iov_iter_get_pages() and friends pin if and > only if we have data-destination iov_iter that is user-backed. For > data-source user-backed we only need FOLL_GET, and for all other flavours > (ITER_BVEC, etc.) we only do get_page(), if we need to grab any references > at all. This rule would mostly work, as long as we can relax it in some cases, to allow pinning of both source and dest pages, instead of just destination pages, in some cases. In particular, bio_release_pages() has lost all context about whether it was a read or a write request, as far as I can tell. And bio_release_pages() is the primary place to unpin pages for direct IO. > > What I'd like to have is the understanding of the places where we drop > the references acquired by iov_iter_get_pages(). How do we decide > whether to unpin? E.g. pipe_buffer carries a reference to page and no > way to tell whether it's a pinned one; results of iov_iter_get_pages() > on ITER_IOVEC *can* end up there, but thankfully only from data-source > (== WRITE, aka. ITER_SOURCE) iov_iter. So for those we don't care. > Then there's nfs_request; AFAICS, we do need to pin the references in > those if they are coming from nfs_direct_read_schedule_iovec(), but > not if they come from readpage_async_filler(). How do we deal with > coalescence, etc.? It's been a long time since I really looked at > that code... Christoph, could you give any comments on that one? > > Note, BTW, that nfs_request coming from readpage_async_filler() have > pages locked by caller; the ones from nfs_direct_read_schedule_iovec() > do not, and that's where we want them pinned. Resulting page references > end up (after quite a trip through data structures) stuffed into struct > rpc_rqst ->rc_recv_buf.pages[] and when a response arrives from server, > they get picked by xs_read_bvec() and fed to iov_iter_bvec(). In one > case it's safe since the pages are locked; in another - since they would > come from pin_user_pages(). The call chain at the time they are used > has nothing to do with the originator - sunrpc is looking at the arrived > response to READ that matches an rpc_rqst that had been created by sender > of that request and safety is the sender's responsibility. For NFS Direct, is there any reason it can't be as simple as this (conceptually, that is--the implementation of iov_iter_pin_pages_alloc() is not shown here)? Here: diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c index 1707f46b1335..7dbc705bab83 100644 --- a/fs/nfs/direct.c +++ b/fs/nfs/direct.c @@ -142,13 +142,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter) return 0; } -static void nfs_direct_release_pages(struct page **pages, unsigned int npages) -{ - unsigned int i; - for (i = 0; i < npages; i++) - put_page(pages[i]); -} - void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo, struct nfs_direct_req *dreq) { @@ -332,7 +325,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq, size_t pgbase; unsigned npages, i; - result = iov_iter_get_pages_alloc2(iter, &pagevec, + result = iov_iter_pin_pages_alloc(iter, &pagevec, rsize, &pgbase); if (result < 0) break; @@ -362,7 +355,16 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq, pos += req_len; dreq->bytes_left -= req_len; } - nfs_direct_release_pages(pagevec, npages); + + /* + * iov_iter_pin_pages_alloc() calls pin_user_pages_fast() for + * the user_backed_iter() case (only). + */ + if (user_backed_iter(iter)) + unpin_user_pages(pagevec, npages); + else + release_pages(pagevec, npages); + kvfree(pagevec); if (result < 0) break; @@ -829,7 +831,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq, pos += req_len; dreq->bytes_left -= req_len; } - nfs_direct_release_pages(pagevec, npages); + release_pages(pagevec, npages); kvfree(pagevec); if (result < 0) break; thanks,
On Wed 21-09-22 23:09:06, John Hubbard wrote: > On 9/21/22 19:22, Al Viro wrote: > > On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote: > > > >>> How would that work? What protects the area where you want to avoid running > >>> into pinned pages from previously acceptable page getting pinned? If "they > >>> must have been successfully unmapped" is a part of what you are planning, we > >>> really do have a problem... > >> > >> But this is a very good question. So far the idea was that we lock the > >> page, unmap (or writeprotect) the page, and then check pincount == 0 and > >> that is a reliable method for making sure page data is stable (until we > >> unlock the page & release other locks blocking page faults and writes). But > >> once suddently ordinary page references can be used to create pins this > >> does not work anymore. Hrm. > >> > >> Just brainstorming ideas now: So we'd either need to obtain the pins early > >> when we still have the virtual address (but I guess that is often not > >> practical but should work e.g. for normal direct IO path) or we need some > >> way to "simulate" the page fault when pinning the page, just don't map it > >> into page tables in the end. This simulated page fault could be perhaps > >> avoided if rmap walk shows that the page is already mapped somewhere with > >> suitable permissions. > > > > OK. As far as I can see, the rules are along the lines of > > * creator of ITER_BVEC/ITER_XARRAY is responsible for pages being safe. > > That includes > > * page known to be locked by caller > > * page being privately allocated and not visible to anyone else > > * iterator being data source > > * page coming from pin_user_pages(), possibly as the result of > > iov_iter_pin_pages() on ITER_IOVEC/ITER_UBUF. > > * ITER_PIPE pages are always safe > > * pages found in ITER_BVEC/ITER_XARRAY are safe, since the iterator > > had been created with such. > > My preference would be to have iov_iter_get_pages() and friends pin if and > > only if we have data-destination iov_iter that is user-backed. For > > data-source user-backed we only need FOLL_GET, and for all other flavours > > (ITER_BVEC, etc.) we only do get_page(), if we need to grab any references > > at all. > > This rule would mostly work, as long as we can relax it in some cases, to > allow pinning of both source and dest pages, instead of just destination > pages, in some cases. In particular, bio_release_pages() has lost all > context about whether it was a read or a write request, as far as I can > tell. And bio_release_pages() is the primary place to unpin pages for > direct IO. Well, we already do have BIO_NO_PAGE_REF bio flag that gets checked in bio_release_pages(). I think we can easily spare another bio flag to tell whether we need to unpin or not. So as long as all the pages in the created bio need the same treatment, the situation should be simple. > > What I'd like to have is the understanding of the places where we drop > > the references acquired by iov_iter_get_pages(). How do we decide > > whether to unpin? E.g. pipe_buffer carries a reference to page and no > > way to tell whether it's a pinned one; results of iov_iter_get_pages() > > on ITER_IOVEC *can* end up there, but thankfully only from data-source > > (== WRITE, aka. ITER_SOURCE) iov_iter. So for those we don't care. > > Then there's nfs_request; AFAICS, we do need to pin the references in > > those if they are coming from nfs_direct_read_schedule_iovec(), but > > not if they come from readpage_async_filler(). How do we deal with > > coalescence, etc.? It's been a long time since I really looked at > > that code... Christoph, could you give any comments on that one? > > > > Note, BTW, that nfs_request coming from readpage_async_filler() have > > pages locked by caller; the ones from nfs_direct_read_schedule_iovec() > > do not, and that's where we want them pinned. Resulting page references > > end up (after quite a trip through data structures) stuffed into struct > > rpc_rqst ->rc_recv_buf.pages[] and when a response arrives from server, > > they get picked by xs_read_bvec() and fed to iov_iter_bvec(). In one > > case it's safe since the pages are locked; in another - since they would > > come from pin_user_pages(). The call chain at the time they are used > > has nothing to do with the originator - sunrpc is looking at the arrived > > response to READ that matches an rpc_rqst that had been created by sender > > of that request and safety is the sender's responsibility. > > For NFS Direct, is there any reason it can't be as simple as this > (conceptually, that is--the implementation of iov_iter_pin_pages_alloc() > is not shown here)? Here: > > > diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c > index 1707f46b1335..7dbc705bab83 100644 > --- a/fs/nfs/direct.c > +++ b/fs/nfs/direct.c > @@ -142,13 +142,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter) > return 0; > } > > -static void nfs_direct_release_pages(struct page **pages, unsigned int npages) > -{ > - unsigned int i; > - for (i = 0; i < npages; i++) > - put_page(pages[i]); > -} > - > void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo, > struct nfs_direct_req *dreq) > { > @@ -332,7 +325,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq, > size_t pgbase; > unsigned npages, i; > > - result = iov_iter_get_pages_alloc2(iter, &pagevec, > + result = iov_iter_pin_pages_alloc(iter, &pagevec, > rsize, &pgbase); > if (result < 0) > break; > @@ -362,7 +355,16 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq, > pos += req_len; > dreq->bytes_left -= req_len; > } > - nfs_direct_release_pages(pagevec, npages); > + > + /* > + * iov_iter_pin_pages_alloc() calls pin_user_pages_fast() for > + * the user_backed_iter() case (only). > + */ > + if (user_backed_iter(iter)) > + unpin_user_pages(pagevec, npages); > + else > + release_pages(pagevec, npages); > + I don't think this will work. The pin nfs_direct_read_schedule_iovec() obtains needs to be released once the IO is completed. Not once the IO is submitted. Notice how nfs_create_request()->__nfs_create_request() gets another page reference which is released on completion (nfs_direct_read_completion->nfs_release_request->nfs_page_group_destroy-> nfs_free_request->nfs_clear_request). And we need to stop releasing the obtained pin in nfs_direct_read_schedule_iovec() (and acquiring another reference in __nfs_create_request()) and instead propagate it to nfs_clear_request() where it can get released. Honza
On Wed, Sep 14, 2022 at 04:51:17AM +0100, Al Viro wrote: > Unless I'm misreading Jan, the question is whether they should get or > pin. And I think the answer is: inside ->read_iter or ->write_iter they should neither get or pin. The callers of it need to pin the pages if they are pagecache pages that can potentially be written to through shared mappings, else a get would be enough. But the method instance should not have to care and just be able to rely on the caller making sure they do not go away. > I'm really tempted to slap > if (WARN_ON(i->data_source)) > return 0; > into copy_to_iter() et.al., along with its opposite for copy_from_iter(). Ys, I think that would be useful. And we could use something more descriptive than READ/WRITE to start with.
On Tue, Sep 20, 2022 at 06:02:11AM +0100, Al Viro wrote: > nvme target: nvme read requests end up with somebody allocating and filling > sglist, followed by reading from file into it (using ITER_BVEC). Then the > pages are sent out, presumably Yes. > . I would be very surprised if it turned out > to be anything other than anon pages allocated by the driver, but I'd like > to see that confirmed by nvme folks. Probably doesn't need pinning. They are anon pages allocated by the driver using sgl_alloc(). > drivers/target/target_core_file.c:292: iov_iter_bvec(&iter, is_write, aio_cmd->bvecs, sgl_nents, len); Same as nvme target. > The picture so far looks like we mostly need to take care of pinning when > we obtain the references from iov_iter_get_pages(). What's more, it looks > like ITER_BVEC/ITER_XARRAY/ITER_PIPE we really don't need to pin anything on > get_pages/pin_pages - they are already protected (or, in case of ITER_PIPE, > allocated by iov_iter itself and not reachable by anybody outside). That's what I've been trying to say for a while..
On Thu, Sep 22, 2022 at 07:31:36AM -0700, Christoph Hellwig wrote: > On Wed, Sep 14, 2022 at 04:51:17AM +0100, Al Viro wrote: > > Unless I'm misreading Jan, the question is whether they should get or > > pin. > > And I think the answer is: inside ->read_iter or ->write_iter they > should neither get or pin. The callers of it need to pin the pages > if they are pagecache pages that can potentially be written to through > shared mappings, else a get would be enough. But the method instance > should not have to care and just be able to rely on the caller making > sure they do not go away. The interesting part, AFAICS, is where do we _unpin_ them and how do we keep track which pages (obtained from iov_iter_get_pages et.al.) need to be unpinned. > > I'm really tempted to slap > > if (WARN_ON(i->data_source)) > > return 0; > > into copy_to_iter() et.al., along with its opposite for copy_from_iter(). > > Ys, I think that would be useful. And we could use something more > descriptive than READ/WRITE to start with. See #work.iov_iter; done, but it took a bit of fixing the places that create iov_iter instances.
On Thu, Sep 22, 2022 at 03:22:48AM +0100, Al Viro wrote: > What I'd like to have is the understanding of the places where we drop > the references acquired by iov_iter_get_pages(). How do we decide > whether to unpin? Add a iov_iter_unpin_pages that does the right thing based on the type. (block will need a modified copy of it as it doesn't keep the pages array around, but logic will be the same). > E.g. pipe_buffer carries a reference to page and no > way to tell whether it's a pinned one; results of iov_iter_get_pages() > on ITER_IOVEC *can* end up there, but thankfully only from data-source > (== WRITE, aka. ITER_SOURCE) iov_iter. So for those we don't care. > Then there's nfs_request; AFAICS, we do need to pin the references in > those if they are coming from nfs_direct_read_schedule_iovec(), but > not if they come from readpage_async_filler(). How do we deal with > coalescence, etc.? It's been a long time since I really looked at > that code... Christoph, could you give any comments on that one? I think the above should work, but I'll need to look at the NFS code in more detail to be sure.
On 22.09.22 16:36, Christoph Hellwig wrote: > On Tue, Sep 20, 2022 at 06:02:11AM +0100, Al Viro wrote: >> nvme target: nvme read requests end up with somebody allocating and filling >> sglist, followed by reading from file into it (using ITER_BVEC). Then the >> pages are sent out, presumably > > Yes. > >> . I would be very surprised if it turned out >> to be anything other than anon pages allocated by the driver, but I'd like >> to see that confirmed by nvme folks. Probably doesn't need pinning. > > They are anon pages allocated by the driver using sgl_alloc(). I assume they are not anon pages as in "PageAnon()", but simply not pagecache pages, correct?
On Thu, Sep 22, 2022 at 04:43:57PM +0200, David Hildenbrand wrote: > I assume they are not anon pages as in "PageAnon()", but simply not > pagecache pages, correct? Yes, sorry. From the page allocator and not added to the page cache.
On Thu, Sep 22, 2022 at 01:29:35PM +0200, Jan Kara wrote: > > This rule would mostly work, as long as we can relax it in some cases, to > > allow pinning of both source and dest pages, instead of just destination > > pages, in some cases. In particular, bio_release_pages() has lost all > > context about whether it was a read or a write request, as far as I can > > tell. And bio_release_pages() is the primary place to unpin pages for > > direct IO. > > Well, we already do have BIO_NO_PAGE_REF bio flag that gets checked in > bio_release_pages(). I think we can easily spare another bio flag to tell > whether we need to unpin or not. So as long as all the pages in the created > bio need the same treatment, the situation should be simple. Yes. Incidentally, the same condition is already checked by the creators of those bio - see the assorted should_dirty logics. While we are at it - how much of the rationale around bio_check_pages_dirty() doing dirtying is still applicable with pinning pages before we stick them into bio? We do dirty them before submitting bio, then on completion bio_check_pages_dirty() checks if something has marked them clean while we'd been doing IO; if all of them are still dirty we just drop the pages (well, unpin and drop), otherwise we arrange for dirty + unpin + drop done in process context (via schedule_work()). Can they be marked clean by anyone while they are pinned? After all, pinning is done to prevent writeback getting done on them while we are modifying the suckers...
On 9/22/22 20:19, Al Viro wrote: > On Thu, Sep 22, 2022 at 01:29:35PM +0200, Jan Kara wrote: > >>> This rule would mostly work, as long as we can relax it in some cases, to >>> allow pinning of both source and dest pages, instead of just destination >>> pages, in some cases. In particular, bio_release_pages() has lost all >>> context about whether it was a read or a write request, as far as I can >>> tell. And bio_release_pages() is the primary place to unpin pages for >>> direct IO. >> >> Well, we already do have BIO_NO_PAGE_REF bio flag that gets checked in >> bio_release_pages(). I think we can easily spare another bio flag to tell >> whether we need to unpin or not. So as long as all the pages in the created >> bio need the same treatment, the situation should be simple. > > Yes. Incidentally, the same condition is already checked by the creators > of those bio - see the assorted should_dirty logics. Beautiful! > > While we are at it - how much of the rationale around bio_check_pages_dirty() > doing dirtying is still applicable with pinning pages before we stick them > into bio? We do dirty them before submitting bio, then on completion > bio_check_pages_dirty() checks if something has marked them clean while > we'd been doing IO; if all of them are still dirty we just drop the pages > (well, unpin and drop), otherwise we arrange for dirty + unpin + drop > done in process context (via schedule_work()). Can they be marked clean by > anyone while they are pinned? After all, pinning is done to prevent > writeback getting done on them while we are modifying the suckers... I certainly hope not. And in fact, we should really just say that that's a rule: the whole time the page is pinned, it simply must remain dirty and writable, at least with the way things are right now. This reminds me that I'm not exactly sure what the rules for FOLL_LONGTERM callers should be, with respect to dirtying. At the moment, most, if not all of the code that does "set_page_dirty_lock(); unpin_user_page()" is wrong. To fix those cases, IIUC, the answer is: you must make the page dirty properly, with page_mkwrite(), not just with set_page_dirty_lock(). And that has to be done probably a lot earlier, for reasons that I'm still vague on. But perhaps right after pinning the page. (Assuming that we hold off writeback while the page is pinned.) Just wanted to see if that sounds right, while we're on the topic. thanks,
On Thu, Sep 22, 2022 at 07:38:25AM -0700, Christoph Hellwig wrote: > On Thu, Sep 22, 2022 at 03:22:48AM +0100, Al Viro wrote: > > What I'd like to have is the understanding of the places where we drop > > the references acquired by iov_iter_get_pages(). How do we decide > > whether to unpin? > > Add a iov_iter_unpin_pages that does the right thing based on the > type. (block will need a modified copy of it as it doesn't keep > the pages array around, but logic will be the same). Huh? You want to keep the type (+ direction) of iov_iter in any structure a page reference coming from iov_iter_get_pages might end up in? IDGI... BTW, speaking of lifetime rules - am I right assuming that fd_execute_rw() does IO on pages of the scatterlist passed to it? Where are they getting dropped and what guarantees that IO is complete by that point? The reason I'm asking is that here you have an ITER_BVEC possibly fed to __blkdev_direct_IO_async(), with its if (iov_iter_is_bvec(iter)) { /* * Users don't rely on the iterator being in any particular * state for async I/O returning -EIOCBQUEUED, hence we can * avoid expensive iov_iter_advance(). Bypass * bio_iov_iter_get_pages() and set the bvec directly. */ bio_iov_bvec_set(bio, iter); which does *not* grab the page referneces. Sure, bio_release_pages() knows to leave those alone and doesn't drop anything. However, what is the mechanism preventing the pages getting freed before the IO completion in this case?
On 9/22/22 04:29, Jan Kara wrote: > I don't think this will work. The pin nfs_direct_read_schedule_iovec() > obtains needs to be released once the IO is completed. Not once the IO is > submitted. Notice how nfs_create_request()->__nfs_create_request() gets > another page reference which is released on completion > (nfs_direct_read_completion->nfs_release_request->nfs_page_group_destroy-> > nfs_free_request->nfs_clear_request). And we need to stop releasing the > obtained pin in nfs_direct_read_schedule_iovec() (and acquiring another > reference in __nfs_create_request()) and instead propagate it to > nfs_clear_request() where it can get released. > > Honza OK, now I see how that is supposed to work, thanks for explaining! thanks,
On Thu, Sep 22, 2022 at 09:05:16PM -0700, John Hubbard wrote: > I certainly hope not. And in fact, we should really just say that that's > a rule: the whole time the page is pinned, it simply must remain dirty > and writable, at least with the way things are right now. Yes, if we can stick to that rule and make sure shared pagecache is never dirtied through get_user_pags anywhere that will allow us to fix a lot of mess > To fix those cases, IIUC, the answer is: you must make the page dirty > properly, with page_mkwrite(), not just with set_page_dirty_lock(). And > that has to be done probably a lot earlier, for reasons that I'm still > vague on. But perhaps right after pinning the page. (Assuming that we > hold off writeback while the page is pinned.) I think we need to hold off the writeback for it to work properly. The big question is, is if there are callers that do expect data to be written back on mappings that are longterm pinned. RDMA or vfio would come to mind.
On Fri, Sep 23, 2022 at 05:22:17AM +0100, Al Viro wrote: > > Add a iov_iter_unpin_pages that does the right thing based on the > > type. (block will need a modified copy of it as it doesn't keep > > the pages array around, but logic will be the same). > > Huh? You want to keep the type (+ direction) of iov_iter in any structure > a page reference coming from iov_iter_get_pages might end up in? IDGI... Why would I? We generall do have or should have the iov_iter around. And for the common case where we don't (bios) we can carry that information in the bio as it needs a special unmap helper anyway. But if you don't want to use the iov_iter for some reason, we'll just need to condense the information to a flags variable and then pass that. > > BTW, speaking of lifetime rules - am I right assuming that fd_execute_rw() > does IO on pages of the scatterlist passed to it? Yes. > Where are they getting > dropped and what guarantees that IO is complete by that point? The exact place depens on the exact taaraget frontend of which we have a few. But it happens from the end_io callback that is triggered through a call to target_complete_cmd. > The reason I'm asking is that here you have an ITER_BVEC possibly fed to > __blkdev_direct_IO_async(), with its > if (iov_iter_is_bvec(iter)) { > /* > * Users don't rely on the iterator being in any particular > * state for async I/O returning -EIOCBQUEUED, hence we can > * avoid expensive iov_iter_advance(). Bypass > * bio_iov_iter_get_pages() and set the bvec directly. > */ > bio_iov_bvec_set(bio, iter); > which does *not* grab the page referneces. Sure, bio_release_pages() knows > to leave those alone and doesn't drop anything. However, what is the > mechanism preventing the pages getting freed before the IO completion > in this case? The contract that callers of bvec iters need to hold their own references as without that doing I/O do them would be unsafe. It they did not hold references the pages could go away before even calling bio_iov_iter_get_pages (or this open coded bio_iov_bvec_set).
On Thu 22-09-22 21:05:16, John Hubbard wrote: > On 9/22/22 20:19, Al Viro wrote: > > On Thu, Sep 22, 2022 at 01:29:35PM +0200, Jan Kara wrote: > > > >>> This rule would mostly work, as long as we can relax it in some cases, to > >>> allow pinning of both source and dest pages, instead of just destination > >>> pages, in some cases. In particular, bio_release_pages() has lost all > >>> context about whether it was a read or a write request, as far as I can > >>> tell. And bio_release_pages() is the primary place to unpin pages for > >>> direct IO. > >> > >> Well, we already do have BIO_NO_PAGE_REF bio flag that gets checked in > >> bio_release_pages(). I think we can easily spare another bio flag to tell > >> whether we need to unpin or not. So as long as all the pages in the created > >> bio need the same treatment, the situation should be simple. > > > > Yes. Incidentally, the same condition is already checked by the creators > > of those bio - see the assorted should_dirty logics. > > Beautiful! > > > > > While we are at it - how much of the rationale around bio_check_pages_dirty() > > doing dirtying is still applicable with pinning pages before we stick them > > into bio? We do dirty them before submitting bio, then on completion > > bio_check_pages_dirty() checks if something has marked them clean while > > we'd been doing IO; if all of them are still dirty we just drop the pages > > (well, unpin and drop), otherwise we arrange for dirty + unpin + drop > > done in process context (via schedule_work()). Can they be marked clean by > > anyone while they are pinned? After all, pinning is done to prevent > > writeback getting done on them while we are modifying the suckers... > > I certainly hope not. And in fact, we should really just say that that's > a rule: the whole time the page is pinned, it simply must remain dirty > and writable, at least with the way things are right now. I agree the page should be staying dirty the whole time it is pinned. I don't think it is feasible to keep it writeable in the page tables because that would mean you would need to block e.g. munmap() until the pages gets unpinned and that will almost certainly upset some current userspace. But keeping page dirty should be enough so that we can get rid of all these nasty calls to set_page_dirty() from IO completion. > This reminds me that I'm not exactly sure what the rules for > FOLL_LONGTERM callers should be, with respect to dirtying. At the > moment, most, if not all of the code that does "set_page_dirty_lock(); > unpin_user_page()" is wrong. Right. > To fix those cases, IIUC, the answer is: you must make the page dirty > properly, with page_mkwrite(), not just with set_page_dirty_lock(). And Correct, and GUP (or PUP) actually does that under the hood so I don't think we need to change anything there. > that has to be done probably a lot earlier, for reasons that I'm still > vague on. But perhaps right after pinning the page. (Assuming that we > hold off writeback while the page is pinned.) Holding off writeback is not always doable - as Christoph mentions, for data integrity writeback we'll have to get the data to disk before the page is unpinned (as for longterm users it can take days for the page to be unpinned). But we can just writeback the page without clearing the dirty bit in these cases. We may need to use bounce pages to be able to safely writeback pinned pages but that's another part of the story... Honza
On Fri, Sep 23, 2022 at 01:44:33AM -0700, Christoph Hellwig wrote: > Why would I? We generall do have or should have the iov_iter around. Not for async IO. > And for the common case where we don't (bios) we can carry that > information in the bio as it needs a special unmap helper anyway. *Any* async read_iter is like that. > > Where are they getting > > dropped and what guarantees that IO is complete by that point? > > The exact place depens on the exact taaraget frontend of which we have > a few. But it happens from the end_io callback that is triggered > through a call to target_complete_cmd. OK... > > The reason I'm asking is that here you have an ITER_BVEC possibly fed to > > __blkdev_direct_IO_async(), with its > > if (iov_iter_is_bvec(iter)) { > > /* > > * Users don't rely on the iterator being in any particular > > * state for async I/O returning -EIOCBQUEUED, hence we can > > * avoid expensive iov_iter_advance(). Bypass > > * bio_iov_iter_get_pages() and set the bvec directly. > > */ > > bio_iov_bvec_set(bio, iter); > > which does *not* grab the page referneces. Sure, bio_release_pages() knows > > to leave those alone and doesn't drop anything. However, what is the > > mechanism preventing the pages getting freed before the IO completion > > in this case? > > The contract that callers of bvec iters need to hold their own > references as without that doing I/O do them would be unsafe. It they > did not hold references the pages could go away before even calling > bio_iov_iter_get_pages (or this open coded bio_iov_bvec_set). You are mixing two issues here - holding references to pages while using iov_iter instance is obvious; holding them until async IO is complete, even though struct iov_iter might be long gone by that point is a different story. And originating iov_iter instance really can be long-gone by the time of IO completion - requirement to keep it around would be very hard to satisfy. I've no objections to requiring the pages in ITER_BVEC to be preserved at least until the IO completion by means independent of whatever ->read_iter/->write_iter does to them, but * that needs to be spelled out very clearly and * we need to verify that it is, indeed, the case for all existing iov_iter_bvec callers, preferably with comments next to non-obvious ones (something that is followed only by the sync IO is obvious) That goes not just for bio - if we make get_pages *NOT* grab references on ITER_BVEC (and I'm all for it), we need to make sure that those pages won't be retained after the original protection runs out. Which includes the reference put into struct nfs_request, for example, as well as whatever ceph transport is doing, etc. Another thing that needs to be sorted out is __zerocopy_sg_from_iter() and its callers - AFAICS, all of those are in ->sendmsg() with MSG_ZEROCOPY in flags. It's a non-trivial amount of code audit - we have about 40 iov_iter_bvec() callers in the tree, and while many are clearly sync-only... the ones that are not tend to balloon into interesting analysis of call chains, etc. Don't get me wrong - that analysis needs to be done, just don't expect it to be trivial. And it will require quite a bit of cooperation from the folks familiar with e.g. drivers/target, or with ceph transport layer, etc. FWIW, my preference would be along the lines of Backing memory for any non user-backed iov_iter should be protected from reuse by creator of iov_iter and that protection should continue through not just all synchronous operations with iov_iter in question - it should last until all async operations involving that memory are finished. That continued protection must not depend upon taking extra page references, etc. while we are working with iov_iter. But getting there will take quite a bit of code audit and probably some massage as well.
On Fri, Sep 23, 2022 at 05:13:42PM +0100, Al Viro wrote: > You are mixing two issues here - holding references to pages while using > iov_iter instance is obvious; holding them until async IO is complete, even > though struct iov_iter might be long gone by that point is a different > story. But someone needs to hold a refernce until the I/O is completed, because the I/O obviously needs the pages. Yes, we could say the callers holds them and can drop the references right after I/O submission, while the method needs to grab another reference. But that is more complicated and is more costly than just holding the damn reference. > And originating iov_iter instance really can be long-gone by the time > of IO completion - requirement to keep it around would be very hard to > satisfy. I've no objections to requiring the pages in ITER_BVEC to be > preserved at least until the IO completion by means independent of > whatever ->read_iter/->write_iter does to them, but > * that needs to be spelled out very clearly and > * we need to verify that it is, indeed, the case for all existing > iov_iter_bvec callers, preferably with comments next to non-obvious ones > (something that is followed only by the sync IO is obvious) Agreed. > That goes not just for bio - if we make get_pages *NOT* grab references > on ITER_BVEC (and I'm all for it), we need to make sure that those > pages won't be retained after the original protection runs out. Yes.
On Mon, Sep 26, 2022 at 08:53:43AM -0700, Christoph Hellwig wrote: > On Fri, Sep 23, 2022 at 05:13:42PM +0100, Al Viro wrote: > > You are mixing two issues here - holding references to pages while using > > iov_iter instance is obvious; holding them until async IO is complete, even > > though struct iov_iter might be long gone by that point is a different > > story. > > But someone needs to hold a refernce until the I/O is completed, because > the I/O obviously needs the pages. Yes, we could say the callers holds > them and can drop the references right after I/O submission, while > the method needs to grab another reference. But that is more > complicated and is more costly than just holding the damn reference. Take a look at __nfs_create_request(). And trace the call chains leading to nfs_clear_request() where the corresponding put_page() happens. What I'm afraid of is something similar in the bowels of some RDMA driver. With upper layers shoving page references into sglist using iov_iter_get_pages(), then passing sglist to some intermediate layer, then *that* getting passed down into a driver which grabs references for its own use and releases them from destructor of some private structure. Done via kref_put(). Have that delayed by, hell - anything, up to and including debugfs shite somewhere in the same driver, iterating through those private structures, grabbing a reference to do some pretty-print into kmalloc'ed buffer, then drooping it. Voila - we have page refs duplicated from ITER_BVEC and occasionally staying around after the ->ki_complete() of async ->write_iter() that got that ITER_BVEC. It's really not a trivial rule change.
diff --git a/include/linux/uio.h b/include/linux/uio.h index 5896af36199c..e26908e443d1 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -251,6 +251,10 @@ ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages, size_t maxsize, unsigned maxpages, size_t *start); ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages, size_t maxsize, size_t *start); +ssize_t iov_iter_pin_pages(struct iov_iter *i, struct page **pages, + size_t maxsize, unsigned int maxpages, size_t *start); +ssize_t iov_iter_pin_pages_alloc(struct iov_iter *i, struct page ***pages, + size_t maxsize, size_t *start); int iov_iter_npages(const struct iov_iter *i, int maxpages); void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state); diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 4b7fce72e3e5..c63ce0eadfcb 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -1425,9 +1425,31 @@ static struct page *first_bvec_segment(const struct iov_iter *i, return page; } +enum pages_alloc_internal_flags { + USE_FOLL_GET, + MAYBE_USE_FOLL_PIN +}; + +/* + * Pins pages, either via get_page(), or via pin_user_page*(). The caller is + * responsible for tracking which pinning mechanism was used here, and releasing + * pages via the appropriate call: put_page() or unpin_user_page(). + * + * The way to figure that out is: + * + * a) If how_to_pin == FOLL_GET, then this routine will always pin via + * get_page(). + * + * b) If how_to_pin == MAYBE_USE_FOLL_PIN, then this routine will pin via + * pin_user_page*() for either user_backed_iter(i) cases, or + * iov_iter_is_bvec(i) cases. However, for the other cases (pipe, + * xarray), pages will be pinned via get_page(). + */ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages, size_t maxsize, - unsigned int maxpages, size_t *start) + unsigned int maxpages, size_t *start, + enum pages_alloc_internal_flags how_to_pin) + { unsigned int n; @@ -1454,7 +1476,12 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i, n = want_pages_array(pages, maxsize, *start, maxpages); if (!n) return -ENOMEM; - res = get_user_pages_fast(addr, n, gup_flags, *pages); + + if (how_to_pin == MAYBE_USE_FOLL_PIN) + res = pin_user_pages_fast(addr, n, gup_flags, *pages); + else + res = get_user_pages_fast(addr, n, gup_flags, *pages); + if (unlikely(res <= 0)) return res; maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - *start); @@ -1470,8 +1497,13 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i, if (!n) return -ENOMEM; p = *pages; - for (int k = 0; k < n; k++) - get_page(p[k] = page + k); + for (int k = 0; k < n; k++) { + p[k] = page + k; + if (how_to_pin == MAYBE_USE_FOLL_PIN) + pin_user_page(p[k]); + else + get_page(p[k]); + } maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start); i->count -= maxsize; i->iov_offset += maxsize; @@ -1497,10 +1529,29 @@ ssize_t iov_iter_get_pages2(struct iov_iter *i, return 0; BUG_ON(!pages); - return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start); + return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start, + USE_FOLL_GET); } EXPORT_SYMBOL(iov_iter_get_pages2); +/* + * A FOLL_PIN variant that calls pin_user_pages_fast() instead of + * get_user_pages_fast(). + */ +ssize_t iov_iter_pin_pages(struct iov_iter *i, + struct page **pages, size_t maxsize, unsigned int maxpages, + size_t *start) +{ + if (!maxpages) + return 0; + if (WARN_ON_ONCE(!pages)) + return -EINVAL; + + return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start, + MAYBE_USE_FOLL_PIN); +} +EXPORT_SYMBOL(iov_iter_pin_pages); + ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages, size_t maxsize, size_t *start) @@ -1509,7 +1560,8 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, *pages = NULL; - len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start); + len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start, + USE_FOLL_GET); if (len <= 0) { kvfree(*pages); *pages = NULL; @@ -1518,6 +1570,28 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, } EXPORT_SYMBOL(iov_iter_get_pages_alloc2); +/* + * A FOLL_PIN variant that calls pin_user_pages_fast() instead of + * get_user_pages_fast(). + */ +ssize_t iov_iter_pin_pages_alloc(struct iov_iter *i, + struct page ***pages, size_t maxsize, + size_t *start) +{ + ssize_t len; + + *pages = NULL; + + len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start, + MAYBE_USE_FOLL_PIN); + if (len <= 0) { + kvfree(*pages); + *pages = NULL; + } + return len; +} +EXPORT_SYMBOL(iov_iter_pin_pages_alloc); + size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i) {
Provide two new wrapper routines that are intended for user space pages only: iov_iter_pin_pages() iov_iter_pin_pages_alloc() Internally, these routines call pin_user_pages_fast(), instead of get_user_pages_fast(), for user_backed_iter(i) and iov_iter_bvec(i) cases. As always, callers must use unpin_user_pages() or a suitable FOLL_PIN variant, to release the pages, if they actually were acquired via pin_user_pages_fast(). This is a prerequisite to converting bio/block layers over to use pin_user_pages_fast(). Signed-off-by: John Hubbard <jhubbard@nvidia.com> --- include/linux/uio.h | 4 +++ lib/iov_iter.c | 86 +++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 84 insertions(+), 6 deletions(-)