diff mbox

iov_iter: allow iov_iter_get_pages_alloc to allocate more pages per call

Message ID 20170124212327.14517-1-jlayton@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jeff Layton Jan. 24, 2017, 9:23 p.m. UTC
Currently, iov_iter_get_pages_alloc will only ever operate on the first
vector that iterate_all_kinds hands back. Many of the callers however
would like to have as long a set of pages as possible, to allow for
fewer, but larger I/Os.

When the previous vector ends on a page boundary and the current one
begins on one, we can continue to add more pages.

Change the function to first scan the iov_iter to see how long an
array of pages we could create from the current position. Then,
allocate an array that large (or up to the maxsize), and fill that
many pages.

This patch would allow us to rip out a chunk of code in ceph that
tries to do the same thing, but doesn't handle it right.

For NFS, this allows the client to do larger I/Os when userland passes
down an array of page-aligned iovecs in an O_DIRECT request. I imagine
this will also make splice writes into an O_DIRECT file more efficient.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 lib/iov_iter.c | 140 +++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 111 insertions(+), 29 deletions(-)

Comments

Jeff Layton Jan. 25, 2017, 1:32 p.m. UTC | #1
Small respin of the patch that I sent yesterday for the same thing.

This moves the maxsize handling into iov_iter_pvec_size, so that we don't
end up iterating past the max size we'll use anyway when trying to
determine the pagevec length.

Also, a respun patch to make ceph use iov_iter_get_pages_alloc instead of
trying to do it via its own routine.

Al, if these look ok, do you want to pick these up or shall I ask
Ilya to merge them via the ceph tree?

Jeff Layton (2):
  iov_iter: allow iov_iter_get_pages_alloc to allocate more pages per
    call
  ceph: switch DIO code to use iov_iter_get_pages_alloc

 fs/ceph/file.c |  75 ++----------------------------
 lib/iov_iter.c | 142 +++++++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 116 insertions(+), 101 deletions(-)
Al Viro Feb. 2, 2017, 9:51 a.m. UTC | #2
On Wed, Jan 25, 2017 at 08:32:03AM -0500, Jeff Layton wrote:
> Small respin of the patch that I sent yesterday for the same thing.
> 
> This moves the maxsize handling into iov_iter_pvec_size, so that we don't
> end up iterating past the max size we'll use anyway when trying to
> determine the pagevec length.
> 
> Also, a respun patch to make ceph use iov_iter_get_pages_alloc instead of
> trying to do it via its own routine.
> 
> Al, if these look ok, do you want to pick these up or shall I ask
> Ilya to merge them via the ceph tree?

I'd rather have that kind of work go through the vfs tree; said that,
I really wonder if this is the right approach.  Most of the users of
iov_iter_get_pages()/iov_iter_get_pages_alloc() look like they want
something like
	iov_iter_for_each_page(iter, size, f, data)
with int (*f)(struct page *page, size_t from, size_t size, void *data)
passed as callback.  Not everything fits that model, but there's a whole
lot of things that do.

Examples:
	* fs/direct_io.c:do_direct_IO().  We loop through the pages
returned by dio_get_page().  For each of those we find the subrange
of page (from/to) and handle IO on that range.  Then we drop the reference
to page and move on to the next one.  dio_get_page() uses dio->pages and
sdio->{head,tail,from,to} to avoid calling iov_iter_get_pages() on each
page - iov_iter_get_pages() is called for bigger chunks (up to 64 pages,
AFAICS) and results are kept in dio->pages for subsequent calls of
dio_get_page().  Unconsumed references are dropped by dio_cleanup();
AFAICS, it could've been called unconditionally right after the call
of do_direct_IO() (or from it, for that matter) - all remaining references
to pages are never looked at after do_direct_IO().  As it is, we call
it immediately on failure return from do_direct_IO() and then unconditionally
after blk_finish_plug().
	That oddity aside (and AFAICS it's really pointless - all pages
we'd done something with in do_direct_IO() won't be affected by
dio_cleanup()), there's potentially more interesting issue.  If
iov_iter_get_pages() fails on write at the moment when we have pending
mapped blocks, we treat that as write from zero page.  Once that has
happened, we remember to stop mapping new blocks and arrange for having
the error eventually treated as if it had come from IO failure.  I'm not
sure if this sucker handles all cases correctly, BTW - can we end up with
a few pages worth of pending mapped blocks?
	But aside of that, it's really a "loop through all page subranges"
kind of code.  The inner loop in do_direct_IO() could be converted into
a callback quite easily

	* nfs_direct_read_schedule_iovec(): same kind of batching, only there
we have an outer loop calling iov_iter_get_pages_alloc() and then the
inner loop goes through the page subranges, with the same work done for
each.  In this case we grab a reference inside the would-be callback and
drop all references from iov_iter_get_pages_alloc() after the inner loop
is done.  Could've gotten rid of grabbing extra refs - that would mean dropping
only the unused ones if the 'callback' (== inner loop body) has told us
to bugger off early.  IMO that would be a better model.  Incidentally, we
keep allocating/freeing the array used to store page references for each batch.

	* nfs_direct_write_schedule_iovec(): very similar to the read side.

	* zerocopy_sg_from_iter(): similar loop, batch size is MAX_SKB_FRAGS
(i.e. 16 or 17, depending upon the PAGE_SIZE; unless somebody has done
a port with 2Kb pages it shouldn't be greater than 17).  Array of references
is on stack, skb_fill_page_desc(skb, frag++, page, from, size) should
become the callback.  References are consumed by it and it can't fail, so
there's nothing left to drop.

	* af_alg_make_sg().  Looks like it might be massaged to the same
model; the tricky part is af_alg_free_sg() users.  We keep references to
pages in sgl->pages[] *and* store them in sgl->sg[...] (via sg_set_page()).
af_alg_free_sg() drops them using ->pages[] instead of sg_page(...->sg + ...).
Might or might not be a problem - I'm not familiar with that code.

	* fuse_get_user_pages().  It pretty much fills an equivalent of
bio_vec array; the difference is, array of struct page * and arrays of
(start, len) pairs are kept separately.  The only benefit is using the
first array as destination of iov_iter_get_pages(); might as well work
into a separate batching array instead - copying struct page * is noise
compared to storing (and calculating) start/len pairs we have to do there.
Again, what we do there is a pure loop over page subranges.

	* fuse_copy_fill().  I'm not at all sure that iov_iter_get_pages()
is a good idea there - fuse_copy_do() could bloody well just use
copy_{to,from}_iter().

	* fs/splice.c:iter_to_pipe().  Loop over page subranges, consuming
page references.  Unused ones are dropped.

	* bio_iov_iter_get_pages().  Wants to populate bio_vec array; should've
been a loop calling iov_iter_get_pages(); gets tripped on each iovec boundary
instead.  IMO would've been better off with a loop and separate 'batching'
array; would've killed the "Deep magic" mess in there, while we are at it.

That's the majority of iov_iter_get_pages{,_alloc} callers.  There's one
I'm not sure about in lustre (looks like their O_DIRECT is complicated by
rudiments of lloop stuff), there's a mess in p9_get_mapped_pages() (with
special-casing the kvec-backed iterators using kmap_to_page() and
vmalloc_to_page(), no less), there's default_file_splice_read() and there's
ceph stuff.  Everything else is covered by the 'loop over page subranges'
stuff.

I'm massaging that code (along with a lot of RTFS); the interesting questions
related to VM side of things are
	* what are the relative costs of doing small vs. large batches?  Some
of get_user_pages_fast() instances have comments along the lines of "we ought
to limit the batch size, but then nobody's doing more than 64 pages at a time
anyway".
	* Not a question: any ->fault() that returns VM_FAULT_RETRY when
*not* passed FAULT_FLAG_ALLOW_RETRY in flags ought to be shot.  cxlflash one
sure as hell is.
	* drivers/gpu/drm/vgem/vgem_drv.c:vgem_gem_fault() is bloody odd -
shmem_read_mapping_page() can't return -EBUSY, AFAICS.  vm_insert_page()
used to (and back then vgem_gem_fault() used to be broken), but these days
it looks like dead code...
	* ->page_mkwrite() instances sometimes return VM_FAULT_RETRY; AFAICS,
it's only (ab)used there as 'not zero, but doesn't contain any error bits';
VM_FAULT_RETRY from that source does *not* reach handle_mm_fault() callers,
right?
	* get_user_pages_fast() only returns 0 on zero size.  AFAICS, that's
true and some callers seems to rely upon that.  Correct?
	* aligning the start address passed to get_user_pages_fast() et.al.
Happens in many callers, but not all of them.  Most of the instances
forcibly align it in get_user_pages_fast() itself, but... not the fallback
one.  I'm not sure if it can be used to screw the things up, but it feels
like aligning the sucker in get_user_pages...() would be safer - callers
outnumber them and they are scattered in bad places (including drivers/staging)

	Comments?
Christoph Hellwig Feb. 2, 2017, 10:56 a.m. UTC | #3
On Thu, Feb 02, 2017 at 09:51:25AM +0000, Al Viro wrote:
> On Wed, Jan 25, 2017 at 08:32:03AM -0500, Jeff Layton wrote:
> > Small respin of the patch that I sent yesterday for the same thing.
> > 
> > This moves the maxsize handling into iov_iter_pvec_size, so that we don't
> > end up iterating past the max size we'll use anyway when trying to
> > determine the pagevec length.
> > 
> > Also, a respun patch to make ceph use iov_iter_get_pages_alloc instead of
> > trying to do it via its own routine.
> > 
> > Al, if these look ok, do you want to pick these up or shall I ask
> > Ilya to merge them via the ceph tree?
> 
> I'd rather have that kind of work go through the vfs tree; said that,
> I really wonder if this is the right approach.  Most of the users of
> iov_iter_get_pages()/iov_iter_get_pages_alloc() look like they want
> something like
> 	iov_iter_for_each_page(iter, size, f, data)
> with int (*f)(struct page *page, size_t from, size_t size, void *data)
> passed as callback.  Not everything fits that model, but there's a whole
> lot of things that do.

I was planning to do that, mostly because of the iomap dio code that
would not only get a lot cleaner with this, but also support multi-page
bvecs that we hope to have in the block layer soon.  The issue with it
is that we need to touch all the arch get_user_pages_fast
implementations, so it's going to be a relatively invasive change that I
didn't want to fix with just introducing the new direct I/O code.
Al Viro Feb. 2, 2017, 11:16 a.m. UTC | #4
On Thu, Feb 02, 2017 at 02:56:51AM -0800, Christoph Hellwig wrote:
> > I really wonder if this is the right approach.  Most of the users of
> > iov_iter_get_pages()/iov_iter_get_pages_alloc() look like they want
> > something like
> > 	iov_iter_for_each_page(iter, size, f, data)
> > with int (*f)(struct page *page, size_t from, size_t size, void *data)
> > passed as callback.  Not everything fits that model, but there's a whole
> > lot of things that do.
> 
> I was planning to do that, mostly because of the iomap dio code that
> would not only get a lot cleaner with this, but also support multi-page
> bvecs that we hope to have in the block layer soon.  The issue with it
> is that we need to touch all the arch get_user_pages_fast
> implementations, so it's going to be a relatively invasive change that I
> didn't want to fix with just introducing the new direct I/O code.

I'm not sure we need to touch any get_user_pages_fast() at all; let it
fill a medium-sized array and use that as a buffer.  In particular,
I *really* don't like the idea of having the callbacks done in an
inconsistent locking environment - sometimes under ->mmap_sem, sometimes
not.

I played with "let it fill bio_vec array", but it doesn't really fit the
users; variant with callbacks is cleaner, IMO.
Jeff Layton Feb. 2, 2017, 1 p.m. UTC | #5
On Thu, 2017-02-02 at 11:16 +0000, Al Viro wrote:
> On Thu, Feb 02, 2017 at 02:56:51AM -0800, Christoph Hellwig wrote:
> > > I really wonder if this is the right approach.  Most of the users of
> > > iov_iter_get_pages()/iov_iter_get_pages_alloc() look like they want
> > > something like
> > > 	iov_iter_for_each_page(iter, size, f, data)
> > > with int (*f)(struct page *page, size_t from, size_t size, void *data)
> > > passed as callback.  Not everything fits that model, but there's a whole
> > > lot of things that do.
> > 

Yeah, that does seem nicer.

My only concern is that it sounds more invasive, and I'd like to get
some sort of interim fix in for Ceph in the meantime (preferably
something we can send to stable). It really is trivial to vmsplice some
data into the kernel, and then splice write that to a DIO file --
instant softlockup in my testing.

This patch also makes vectored DIO with small iovecs on NFS a lot more
efficient, which is a nice bonus.
 
> > I was planning to do that, mostly because of the iomap dio code that
> > would not only get a lot cleaner with this, but also support multi-page
> > bvecs that we hope to have in the block layer soon.  The issue with it
> > is that we need to touch all the arch get_user_pages_fast
> > implementations, so it's going to be a relatively invasive change that I
> > didn't want to fix with just introducing the new direct I/O code.
> 
> I'm not sure we need to touch any get_user_pages_fast() at all; let it
> fill a medium-sized array and use that as a buffer.  In particular,
> I *really* don't like the idea of having the callbacks done in an
> inconsistent locking environment - sometimes under ->mmap_sem, sometimes
> not.
> 

Yeah, that might work. You could kmalloc the buffer array according to
the maxsize value. For small ones we could even consider using an on-
stack buffer.

> I played with "let it fill bio_vec array", but it doesn't really fit the
> users; variant with callbacks is cleaner, IMO.

I looked at converting this over to a bio_vec array as well, but that
really doesn't work well for Ceph. I like the callback idea better too.
Jan Kara Feb. 2, 2017, 2:48 p.m. UTC | #6
On Thu 02-02-17 09:51:25, Al Viro wrote:
> I'm massaging that code (along with a lot of RTFS); the interesting questions
> related to VM side of things are
> 	* what are the relative costs of doing small vs. large batches?  Some
> of get_user_pages_fast() instances have comments along the lines of "we ought
> to limit the batch size, but then nobody's doing more than 64 pages at a time
> anyway".

Well, I believe it's only about the cost of setting up page table walk and
walking down those several levels of page tables. It is cheaper to copy
several PTE entries from the leaf level than doing the full walk every
time. But I didn't ever profile it to get actual numbers.

Large batches are only a problem because of possible soft-lockups and irq
latencies AFAIK. So the batch just should not be insanely large...

> 	* Not a question: any ->fault() that returns VM_FAULT_RETRY when
> *not* passed FAULT_FLAG_ALLOW_RETRY in flags ought to be shot.  cxlflash one
> sure as hell is.

Yep.

> 	* drivers/gpu/drm/vgem/vgem_drv.c:vgem_gem_fault() is bloody odd -
> shmem_read_mapping_page() can't return -EBUSY, AFAICS.  vm_insert_page()
> used to (and back then vgem_gem_fault() used to be broken), but these days
> it looks like dead code...

Looks like that.

> 	* ->page_mkwrite() instances sometimes return VM_FAULT_RETRY; AFAICS,
> it's only (ab)used there as 'not zero, but doesn't contain any error bits';
> VM_FAULT_RETRY from that source does *not* reach handle_mm_fault() callers,
> right?

I can see only Lustre doing it and IMHO it is abuse. VM_FAULT_RETRY is used
for mmap_sem latency reduction when paging in pages and so not everybody
handles it. If a handler wants to simply retry the fault, returning
VM_FAULT_NOPAGE is a more common way to do that...

								Honza
Al Viro Feb. 2, 2017, 6:28 p.m. UTC | #7
On Thu, Feb 02, 2017 at 03:48:17PM +0100, Jan Kara wrote:

> > 	* ->page_mkwrite() instances sometimes return VM_FAULT_RETRY; AFAICS,
> > it's only (ab)used there as 'not zero, but doesn't contain any error bits';
> > VM_FAULT_RETRY from that source does *not* reach handle_mm_fault() callers,
> > right?
> 
> I can see only Lustre doing it and IMHO it is abuse. VM_FAULT_RETRY is used
> for mmap_sem latency reduction when paging in pages and so not everybody
> handles it. If a handler wants to simply retry the fault, returning
> VM_FAULT_NOPAGE is a more common way to do that...

/* Convert errno to return value from ->page_mkwrite() call */
static inline int block_page_mkwrite_return(int err)
{
        if (err == 0)
                return VM_FAULT_LOCKED;
        if (err == -EFAULT)
                return VM_FAULT_NOPAGE;
        if (err == -ENOMEM)
                return VM_FAULT_OOM;
        if (err == -EAGAIN)
                return VM_FAULT_RETRY;
        /* -ENOSPC, -EDQUOT, -EIO ... */
        return VM_FAULT_SIGBUS;
}

and a bunch of ->page_mkwrite() instances using that.  However, the only
callers of ->page_mkwrite() are wp_page_shared()->do_page_mkwrite() and
do_shared_fault()->do_page_mkwrite().  do_page_mkwrite() treates
VM_FAULT_RETRY as "lock page and return VM_FAULT_RETRY|VM_FAULT_LOCKED".
Both callers do the same check -
                if (unlikely(!tmp || (tmp &
                                      (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
and the return value if that predicate is false.  FWIW, use of VM_FAULT_RETRY
comes from your patch back in 2011 and AFAICS the same analysis used to
apply back then, except for the open-coded method calls where we use
do_page_mkwrite() these days...
Al Viro Feb. 3, 2017, 7:29 a.m. UTC | #8
On Thu, Feb 02, 2017 at 08:00:52AM -0500, Jeff Layton wrote:
> > I'm not sure we need to touch any get_user_pages_fast() at all; let it
> > fill a medium-sized array and use that as a buffer.  In particular,
> > I *really* don't like the idea of having the callbacks done in an
> > inconsistent locking environment - sometimes under ->mmap_sem, sometimes
> > not.
> > 
> 
> Yeah, that might work. You could kmalloc the buffer array according to
> the maxsize value. For small ones we could even consider using an on-
> stack buffer.

Yes.  FWIW, I'm not proposing to kill iov_iter_get_pages{,_alloc}()
immediately - the new primitive would initially use the damn thing.
That can be backported without any problems, and conversions would be
one-by-one.  If/when we get to the point where most of the users have
been switched to the new helper, we could try and see if what remains
could be dealt with; if that works, it might be possible to fold
iov_iter_get_pages() into it.  In particular, for ITER_BVEC and
ITER_PIPE we don't need to bother with any intermediate arrays at all, etc.
But that's only after several cycles when everyone is asked to try and use
the new primitive instead of iov_iter_get_pages().

The calling conventions of iov_iter_get_pages() are ugly and I would love to
get rid of them, but doing that in a flagday conversion is a lot of PITA
for no good reason.

Speaking of get_user_pages() and conversions:

get_user_pages() relies upon find_extend_vma() to reject kernel
addresses; the fast side of get_user_pages_fast() doesn't have anything
of that sort in case e.g. HAVE_GENERIC_RCU_GUP.  Sure, usually we have
that verified and rejected earlier anyway, but it's a fairly subtle difference
that doesn't seem to be documented anywhere.

For example, /dev/st* reads and writes (st_read()/st_write()) feed the
address of buffer to get_user_pages_unlocked().  If somebody replaced
those with get_user_pages_fast(), we'd be in trouble as soon as
some code got tricked into using kernel_write() on /dev/st*.

access_ok() in HAVE_GENERIC_RCU_GUP {__,}get_user_pages_fast() obviously
doesn't help in that scenario.  What am I missing here?
Christoph Hellwig Feb. 3, 2017, 7:49 a.m. UTC | #9
On Thu, Feb 02, 2017 at 08:00:52AM -0500, Jeff Layton wrote:
> Yeah, that might work. You could kmalloc the buffer array according to
> the maxsize value. For small ones we could even consider using an on-
> stack buffer.

For the block direct I/O code we defintively want to avoid any
allocations for small I/O a that shows up in the performance numbers.
And we'd like to reuse the on-stack bio_vec so that the defintion of
a small I/O can be as big as possible without blowing up the stack.
Al Viro Feb. 3, 2017, 8:54 a.m. UTC | #10
On Thu, Feb 02, 2017 at 11:49:01PM -0800, Christoph Hellwig wrote:
> On Thu, Feb 02, 2017 at 08:00:52AM -0500, Jeff Layton wrote:
> > Yeah, that might work. You could kmalloc the buffer array according to
> > the maxsize value. For small ones we could even consider using an on-
> > stack buffer.
> 
> For the block direct I/O code we defintively want to avoid any
> allocations for small I/O a that shows up in the performance numbers.
> And we'd like to reuse the on-stack bio_vec so that the defintion of
> a small I/O can be as big as possible without blowing up the stack.

Hmm...  Reuse part is really nasty ;-/  OTOH, it might make sense to have
a "fill bio_vec array" as separate primitive - having that sucker come
from bio looks like an artificial restriction.

OK, next question, seeing that you've dealt with O_DIRECT guts more than
I have.  When we have iov_iter_get_pages() fail on do_direct_IO() write
with some blocks already allocated, we pick zero page as data source.
So far, so good, but:
	* should we bother zeroing anything unless buffer_new() is true?
	* why, in case of more than a page worth of pending allocated
blocks, do we bother with calling iov_iter_get_pages() again and again?
We *do* take care not to allocate anything else after that point, but
dio_get_page() will be calling iov_iter_get_pages() every time in that
case - there's only one page in queue.
Christoph Hellwig Feb. 3, 2017, 11:09 a.m. UTC | #11
On Fri, Feb 03, 2017 at 08:54:15AM +0000, Al Viro wrote:
> Hmm...  Reuse part is really nasty ;-/  OTOH, it might make sense to have
> a "fill bio_vec array" as separate primitive - having that sucker come
> from bio looks like an artificial restriction.

Or just the only usecase :)  But yes, it could be generalized to take a
bio_vec without too much effort.

> OK, next question, seeing that you've dealt with O_DIRECT guts more than
> I have.  When we have iov_iter_get_pages() fail on do_direct_IO() write
> with some blocks already allocated, we pick zero page as data source.
> So far, so good, but:
> 	* should we bother zeroing anything unless buffer_new() is true?

I don't think so.

> 	* why, in case of more than a page worth of pending allocated
> blocks, do we bother with calling iov_iter_get_pages() again and again?
> We *do* take care not to allocate anything else after that point, but
> dio_get_page() will be calling iov_iter_get_pages() every time in that
> case - there's only one page in queue.

There shouldn't be a need for it.

But take it with a grain of salt - fs/direct-io.c is a hairy mess,
that's one of the reasons why I replaced it with the new iomap
code instead of trying to gradually move it to iomap API.
Jan Kara Feb. 3, 2017, 2:47 p.m. UTC | #12
On Thu 02-02-17 18:28:02, Al Viro wrote:
> On Thu, Feb 02, 2017 at 03:48:17PM +0100, Jan Kara wrote:
> 
> > > 	* ->page_mkwrite() instances sometimes return VM_FAULT_RETRY; AFAICS,
> > > it's only (ab)used there as 'not zero, but doesn't contain any error bits';
> > > VM_FAULT_RETRY from that source does *not* reach handle_mm_fault() callers,
> > > right?
> > 
> > I can see only Lustre doing it and IMHO it is abuse. VM_FAULT_RETRY is used
> > for mmap_sem latency reduction when paging in pages and so not everybody
> > handles it. If a handler wants to simply retry the fault, returning
> > VM_FAULT_NOPAGE is a more common way to do that...
> 
> /* Convert errno to return value from ->page_mkwrite() call */
> static inline int block_page_mkwrite_return(int err)
> {
>         if (err == 0)
>                 return VM_FAULT_LOCKED;
>         if (err == -EFAULT)
>                 return VM_FAULT_NOPAGE;
>         if (err == -ENOMEM)
>                 return VM_FAULT_OOM;
>         if (err == -EAGAIN)
>                 return VM_FAULT_RETRY;
>         /* -ENOSPC, -EDQUOT, -EIO ... */
>         return VM_FAULT_SIGBUS;
> }
> 
> and a bunch of ->page_mkwrite() instances using that.  However, the only
> callers of ->page_mkwrite() are wp_page_shared()->do_page_mkwrite() and
> do_shared_fault()->do_page_mkwrite().  do_page_mkwrite() treates
> VM_FAULT_RETRY as "lock page and return VM_FAULT_RETRY|VM_FAULT_LOCKED".
> Both callers do the same check -
>                 if (unlikely(!tmp || (tmp &
>                                       (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
> and the return value if that predicate is false.  FWIW, use of VM_FAULT_RETRY
> comes from your patch back in 2011 and AFAICS the same analysis used to
> apply back then, except for the open-coded method calls where we use
> do_page_mkwrite() these days...

Yeah, back then I was not aware of VM_FAULT_RETRY limitations and your
analysis above just shows that its handling from do_page_mkwrite() is
simply broken (or better non-existent). Actually that VM_FAULT_RETRY
return was added by fs freeze handling patch. The freeze handling was
later changed but that change to block_page_mkwrite_return() remained.
I'll send a patch to remove it. Thanks for spotting this.

								Honza
Linus Torvalds Feb. 3, 2017, 6:29 p.m. UTC | #13
On Thu, Feb 2, 2017 at 11:29 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> get_user_pages() relies upon find_extend_vma() to reject kernel
> addresses; the fast side of get_user_pages_fast() doesn't have anything
> of that sort

It is *supposed* to have it.

See pte_allows_gup(), for example. In particular, it requires the
_PAGE_USER bit in the PTE (and the devpte case should require
_PAGE_BIT_DEVMAP).

So no, get_user_pages_fast() can not look up kernel pages. If it does,
that would be a bug.

              Linus
Al Viro Feb. 3, 2017, 7:08 p.m. UTC | #14
On Fri, Feb 03, 2017 at 10:29:23AM -0800, Linus Torvalds wrote:
> On Thu, Feb 2, 2017 at 11:29 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > get_user_pages() relies upon find_extend_vma() to reject kernel
> > addresses; the fast side of get_user_pages_fast() doesn't have anything
> > of that sort
> 
> It is *supposed* to have it.
> 
> See pte_allows_gup(), for example. In particular, it requires the
> _PAGE_USER bit in the PTE (and the devpte case should require
> _PAGE_BIT_DEVMAP).

On x86 it does.  I don't see anything equivalent in mm/gup.c one, and the
only kinda-sorta similar thing (access_ok() in __get_user_pages_fast()
there) is vulnerable to e.g. access via kernel_write().  The comment in
there
 * Before activating this code, please be aware that the following assumptions
 * are currently made:
 *
 *  *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free
 *      pages containing page tables.
 *
 *  *) ptes can be read atomically by the architecture.
 *
 *  *) access_ok is sufficient to validate userspace address ranges.
 *
 * The last two assumptions can be relaxed by the addition of helper functions.

doesn't look promising - access_ok() is never sufficient.  Something like
_PAGE_USER tests in x86 one solves that problem, but if anything similar
works for HAVE_GENERIC_RCU_GUP I don't see it.  Thus the question re
what am I missing here...
Linus Torvalds Feb. 3, 2017, 7:28 p.m. UTC | #15
On Fri, Feb 3, 2017 at 11:08 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On x86 it does.  I don't see anything equivalent in mm/gup.c one, and the
> only kinda-sorta similar thing (access_ok() in __get_user_pages_fast()
> there) is vulnerable to e.g. access via kernel_write().

Yeah, access_ok() is bogus. It needs to just check against TASK_SIZE
or whatever.

> doesn't look promising - access_ok() is never sufficient.  Something like
> _PAGE_USER tests in x86 one solves that problem, but if anything similar
> works for HAVE_GENERIC_RCU_GUP I don't see it.  Thus the question re
> what am I missing here...

Ok, I definitely agree that it looks like __get_user_pages_fast() just
needs to get rid of the access_ok() and replace it with a proper check
for the user address space range.

Looks like arm[64] and powerpc.are the current users. Adding in some
people involved with the original submission a few years ago.

I do note that the x86 __get_user_pages_fast() thing looks dodgy too.

In particular, we do it right in the *real* get_user_pages_fast(), see
commit 7f8189068726 ("x86: don't use 'access_ok()' as a range check in
get_user_pages_fast()"). But then the same bug was re-introduced when
the "irq safe" version was merged. As well as in the GENERIC_RCU_GUP
version.

Gaah. Apparently PeterZ copied the old buggy version before the fix
when he added __get_user_pages_fast() in commit 465a454f254e ("x86,
mm: Add __get_user_pages_fast()").

I guess it could be considered a merge error (both happened during the
2.6.31 merge window).

                   Linus
Al Viro Feb. 4, 2017, 3:08 a.m. UTC | #16
On Thu, Feb 02, 2017 at 09:51:25AM +0000, Al Viro wrote:

> 	* fuse_copy_fill().  I'm not at all sure that iov_iter_get_pages()
> is a good idea there - fuse_copy_do() could bloody well just use
> copy_{to,from}_iter().

Miklos, could you explain why does lock_request() prohibit page faults until
the matching unlock_request()?  All it does is setting FR_LOCKED on
our request and the only thing that even looks at that is fuse_abort_conn(),
which doesn't (AFAICS) wait for anything.

Where does the deadlock come from, and if it's not a deadlock - what is
it?  Or is that comment stale since "fuse: simplify request abort"?
Al Viro Feb. 4, 2017, 7:26 p.m. UTC | #17
On Sat, Feb 04, 2017 at 03:08:42AM +0000, Al Viro wrote:
> On Thu, Feb 02, 2017 at 09:51:25AM +0000, Al Viro wrote:
> 
> > 	* fuse_copy_fill().  I'm not at all sure that iov_iter_get_pages()
> > is a good idea there - fuse_copy_do() could bloody well just use
> > copy_{to,from}_iter().
> 
> Miklos, could you explain why does lock_request() prohibit page faults until
> the matching unlock_request()?  All it does is setting FR_LOCKED on
> our request and the only thing that even looks at that is fuse_abort_conn(),
> which doesn't (AFAICS) wait for anything.
> 
> Where does the deadlock come from, and if it's not a deadlock - what is
> it?  Or is that comment stale since "fuse: simplify request abort"?

While we are at it...  How can fuse_copy_page() ever get called with
*pagep == NULL?  AFAICS, for that to happen you need either
	i < req->num_pages && req->pages[i] == NULL  
or in fuse_notify_store() have
                err = fuse_copy_page(cs, &page, offset, this_num, 0);
reached with page == NULL.  The latter is flat-out impossible - we have
                if (!page)
                        goto out_iput;

                this_num = min_t(unsigned, num, PAGE_SIZE - offset);
immediately before the call in question.  As for the former...  I don't see
any place where we would increase ->num_pages without having all additional
->pages[] elements guaranteed to be non-NULL.  Stores to ->num_pages are

in cuse_send_init():
	req->num_pages = 1;
with
        req->pages[0] = page;
shortly before that and
        if (!page)
                goto err_put_req;
earlier.

In fuse_retrieve():
                if (!page)
                        break;

                this_num = min_t(unsigned, num, PAGE_SIZE - offset);
                req->pages[req->num_pages] = page;
                req->page_descs[req->num_pages].length = this_num;
                req->num_pages++;

In fuse_readdir():
        req->num_pages = 1;
        req->pages[0] = page;
with
        if (!page) {
                fuse_put_request(fc, req);
                return -ENOMEM;
        }
several lines above.

In fuse_do_readpage():
        req->num_pages = 1;
        req->pages[0] = page;
with page dereferenced earlier.

In fuse_fill_write_pages():
                req->pages[req->num_pages] = page;
                req->page_descs[req->num_pages].length = tmp;
                req->num_pages++;
with
                if (!page)
                        break;
earlier.

In fuse_get_user_pages():
                ret = iov_iter_get_pages(ii, &req->pages[req->num_pages],
                                        *nbytesp - nbytes, 
                                        req->max_pages - req->num_pages,
                                        &start);
                if (ret < 0)
                        break;

                iov_iter_advance(ii, ret);
                nbytes += ret;

                ret += start;
                npages = (ret + PAGE_SIZE - 1) / PAGE_SIZE;

                req->page_descs[req->num_pages].offset = start;
                fuse_page_descs_length_init(req, req->num_pages, npages);

                req->num_pages += npages;
which also shouldn't leave any NULLs in the area in question.

In fuse_writepage_locked():
        req->num_pages = 1;
        req->pages[0] = tmp_page;
with
        if (!tmp_page)
                goto err_free;
done earlier.

In fuse_writepage_in_flight():
	req->num_pages = 1;
with
        BUG_ON(new_req->num_pages != 0);
earlier and
        req->pages[req->num_pages] = tmp_page;
done in the caller (which passes req as new_req).  Earlier in the caller
we have
        if (!tmp_page)
                goto out_unlock;

In fuse_writepages_fill():
		req->num_pages = 0;
is obviously OK and
        req->num_pages++;
in the end of the same function is preceded by the same
        req->pages[req->num_pages] = tmp_page;
(fuse_writepages_fill() is the caller of fuse_writepage_in_flight();
reassignment in fuse_writepage_in_flight() happens only in case when
it returns 1 and in that case we don't reach the increment in
fuse_writepages_fill() at all).

In fuse_do_ioctl():
        memcpy(req->pages, pages, sizeof(req->pages[0]) * num_pages);
        req->num_pages = num_pages;
and all increments of num_pages in there are
                pages[num_pages] = alloc_page(GFP_KERNEL | __GFP_HIGHMEM);
                if (!pages[num_pages])
                        goto out;
                num_pages++;
so the array we copy into req->pages has the same property wrt num_pages.

req->pages is assigned only in fuse_request_init(); that gets called
in two places - one is at request allocation time, another (from
put_reserved_req()) passes the current req->pages value, so that leaves it
unchanged.  The contents of req->pages[] is changed only in the aforementioned
places + fuse_request_init(), which zeros ->num_pages + fuse_copy_page()
called from fuse_copy_pages() with &req->pages[i] as argument.  The last
one can modify the damn thing, but only if it hits
                                err = fuse_try_move_page(cs, pagep);
and that sucker never stores a NULL in there -
                *pagep = newpage;
with get_page(newpage) upstream from that point.

What am I missing here?  Looks like those checks in fuse_copy_page() are
dead code...  They had been there since the initial merge, but AFAICS
they were just as useless in 2.6.14.  Rudiments of some prehistorical stuff
that never had been cleaned out, perhaps?
Miklos Szeredi Feb. 4, 2017, 10:11 p.m. UTC | #18
On Sat, Feb 4, 2017 at 4:08 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Feb 02, 2017 at 09:51:25AM +0000, Al Viro wrote:
>
>>       * fuse_copy_fill().  I'm not at all sure that iov_iter_get_pages()
>> is a good idea there - fuse_copy_do() could bloody well just use
>> copy_{to,from}_iter().
>
> Miklos, could you explain why does lock_request() prohibit page faults until
> the matching unlock_request()?  All it does is setting FR_LOCKED on
> our request and the only thing that even looks at that is fuse_abort_conn(),
> which doesn't (AFAICS) wait for anything.
>
> Where does the deadlock come from, and if it's not a deadlock - what is
> it?  Or is that comment stale since "fuse: simplify request abort"?

Well, it's not historical; at least not yet.  The deadlock is there
alright: mmap fuse file to addr; read byte from mapped page -> page
locked; this triggeres read request served in same process but
separate thread; write addr-headerlen to fuse dev; trying to lock same
page -> deadlock.

The deadlock can be broken by aborting or force unmounting: return
error for original read request; page unlocked; device write can get
page lock and return.

The reason we need to prohibit pagefault while copying is that when
request is aborted and the caller returns the memory in the request
may become invalid (e.g. data from stack).

Another solution would be to copy all data and keep a ref on the copy
by the request even after being aborted.  This is the plan for the
future.

Thanks,
Miklos
Miklos Szeredi Feb. 4, 2017, 10:12 p.m. UTC | #19
On Sat, Feb 4, 2017 at 8:26 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Feb 04, 2017 at 03:08:42AM +0000, Al Viro wrote:

> What am I missing here?  Looks like those checks in fuse_copy_page() are
> dead code...  They had been there since the initial merge, but AFAICS
> they were just as useless in 2.6.14.  Rudiments of some prehistorical stuff
> that never had been cleaned out, perhaps?

Yep, that one is probably historical.  I'll double check to make sure, though.

Thanks,
Miklos
Al Viro Feb. 5, 2017, 1:51 a.m. UTC | #20
On Sat, Feb 04, 2017 at 11:11:27PM +0100, Miklos Szeredi wrote:

> Well, it's not historical; at least not yet.  The deadlock is there
> alright: mmap fuse file to addr; read byte from mapped page -> page
> locked; this triggeres read request served in same process but
> separate thread; write addr-headerlen to fuse dev; trying to lock same
> page -> deadlock.

Let me see if I got it straight - you have the same fuse file mmapped
in two processes, one of them being fuse server (either sharing the
entire address space, or the same area mapped in both).  Another process
faults the sucker in; filemap_fault() locks the page and goes
fuse_readpage() -> fuse_do_readpage() -> fuse_send_read() ->
-> fuse_request_send() -> __fuse_request_send() which puts request into
queue and goes to sleep in request_wait_answer().  Eventually, read()
on /dev/fuse (or splice(), whatever) by server picks that request and reply
is formed and fed back into /dev/fuse.  There we (in fuse_do_dev_write())
call copy_out_args(), which tries to copy into our (still locked) page
a piece of data coming from server-supplied iovec.  As it is, you
are calling get_user_pages_fast(), triggering handle_mm_fault().  Since that
malicous FPOS of a server tried to feed you the _same_ mmapped file, you
hit a deadlock.  In server's context.  Correct?

Convoluted, but possible.  But.  Why the hell do we care whether that deadlock
hits in get_user_pages_fast() or in copy_from_user()?  Put it another way,
what difference does it make whether we take that fault with or without
FR_LOCKED in req->flags?

> The deadlock can be broken by aborting or force unmounting: return
> error for original read request; page unlocked; device write can get
> page lock and return.
> 
> The reason we need to prohibit pagefault while copying is that when
> request is aborted and the caller returns the memory in the request
> may become invalid (e.g. data from stack).

???

IDGI.  Your request is marked aborted and should presumably fail, so
that when request_wait_answer() wakes up and finds it screwed, fuse_readpage()
would just return an error and filemap_fault() will return VM_FAULT_SIGBUS,
with page left not uptodate and _not_ inserted into page tables.  What's
leaking where?
Miklos Szeredi Feb. 5, 2017, 8:15 p.m. UTC | #21
On Sun, Feb 5, 2017 at 2:51 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Feb 04, 2017 at 11:11:27PM +0100, Miklos Szeredi wrote:
>
>> Well, it's not historical; at least not yet.  The deadlock is there
>> alright: mmap fuse file to addr; read byte from mapped page -> page
>> locked; this triggeres read request served in same process but
>> separate thread; write addr-headerlen to fuse dev; trying to lock same
>> page -> deadlock.
>
> Let me see if I got it straight - you have the same fuse file mmapped
> in two processes, one of them being fuse server (either sharing the
> entire address space, or the same area mapped in both).  Another process
> faults the sucker in; filemap_fault() locks the page and goes
> fuse_readpage() -> fuse_do_readpage() -> fuse_send_read() ->
> -> fuse_request_send() -> __fuse_request_send() which puts request into
> queue and goes to sleep in request_wait_answer().  Eventually, read()
> on /dev/fuse (or splice(), whatever) by server picks that request and reply
> is formed and fed back into /dev/fuse.  There we (in fuse_do_dev_write())
> call copy_out_args(), which tries to copy into our (still locked) page
> a piece of data coming from server-supplied iovec.  As it is, you
> are calling get_user_pages_fast(), triggering handle_mm_fault().  Since that
> malicous FPOS of a server tried to feed you the _same_ mmapped file, you
> hit a deadlock.  In server's context.  Correct?

Yes.

> Convoluted, but possible.  But.  Why the hell do we care whether that deadlock
> hits in get_user_pages_fast() or in copy_from_user()?  Put it another way,
> what difference does it make whether we take that fault with or without
> FR_LOCKED in req->flags?

The difference is that if the page fault happens without FR_LOCKED,
then we can abort the request then and there (done by moving the
request to to_end1 and calling request_end() on it).

If abort happens with FR_LOCKED, we can't end the request now, because
data is possibly being copied to/from the request args.  But we are
guaranteed that the request will end shortly, because no sleeping
under FR_LOCKED is allowed.

But with copy_from_user() page fault and copy aren't separated and so
we don't know whether it's safe to abort or not.

Maybe I'm missing something obvious, but I don't see a simple way out of this.

>> The deadlock can be broken by aborting or force unmounting: return
>> error for original read request; page unlocked; device write can get
>> page lock and return.
>>
>> The reason we need to prohibit pagefault while copying is that when
>> request is aborted and the caller returns the memory in the request
>> may become invalid (e.g. data from stack).
>
> ???
>
> IDGI.  Your request is marked aborted and should presumably fail, so
> that when request_wait_answer() wakes up and finds it screwed, fuse_readpage()
> would just return an error and filemap_fault() will return VM_FAULT_SIGBUS,
> with page left not uptodate and _not_ inserted into page tables.  What's
> leaking where?

That case is fine.   But nothing guarantees that fuse_abort_conn()
won't be called (in the non-deadlock case) when data is being copied
to the request args.  Ending the request at such a point could easily
lead to use after free,

Thanks,
Miklos
Al Viro Feb. 5, 2017, 8:56 p.m. UTC | #22
On Sun, Feb 05, 2017 at 01:51:49AM +0000, Al Viro wrote:

> IDGI.  Your request is marked aborted and should presumably fail, so
> that when request_wait_answer() wakes up and finds it screwed, fuse_readpage()
> would just return an error and filemap_fault() will return VM_FAULT_SIGBUS,
> with page left not uptodate and _not_ inserted into page tables.  What's
> leaking where?

Egads...  Do you mean that req->pages[] contents can be dropped by connection
abort right under fuse_copy_pages()?  Or is it that args[...].value can
end up freed under you?

<goes to look at the ->end() callbacks>

Both, apparently, plus the request initiator might have seen that
request has failed and buggered off, with args[...].value pointing to
what used to be initiator's stack frame.  Is that what you are talking about?


If so, why not mark request as "being handled by fuse_dev_do_{read,write}()"
for the duration, and leave the request_end() on such requests for
fuse_dev_do_{read,write}(), seeing that they will call request_end() for such
anyway?

Looks like your FR_LOCKED is not far from that already.  Why not stop
dropping/regaining FR_LOCKED in lock_request()/unlock_request() and simply
have your
                end_requests(fc, &to_end2);
in fuse_abort_conn() skip the actual calls of request_end()?
Al Viro Feb. 5, 2017, 9:01 p.m. UTC | #23
On Sun, Feb 05, 2017 at 09:15:24PM +0100, Miklos Szeredi wrote:

> That case is fine.   But nothing guarantees that fuse_abort_conn()
> won't be called (in the non-deadlock case) when data is being copied
> to the request args.  Ending the request at such a point could easily
> lead to use after free,

So why not leave ending it to your fuse_dev_do_write()/fuse_dev_do_read()?
See the reply I'd just sent (your mail arrived while I'd been writing that
one - saw it only after I'd sent mine).

Basically, what if we keep FR_LOCKED through *all* fuse_dev_do_{read,write}(),
rather than dropping and regaining it many times and have fuse_abort_conn()
skip request_end() on FR_LOCKED ones?
Miklos Szeredi Feb. 5, 2017, 9:19 p.m. UTC | #24
On Sun, Feb 5, 2017 at 10:01 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Feb 05, 2017 at 09:15:24PM +0100, Miklos Szeredi wrote:
>
>> That case is fine.   But nothing guarantees that fuse_abort_conn()
>> won't be called (in the non-deadlock case) when data is being copied
>> to the request args.  Ending the request at such a point could easily
>> lead to use after free,
>
> So why not leave ending it to your fuse_dev_do_write()/fuse_dev_do_read()?
> See the reply I'd just sent (your mail arrived while I'd been writing that
> one - saw it only after I'd sent mine).
>
> Basically, what if we keep FR_LOCKED through *all* fuse_dev_do_{read,write}(),
> rather than dropping and regaining it many times and have fuse_abort_conn()
> skip request_end() on FR_LOCKED ones?

Then we can't break out of that deadlock: we wait until
fuse_dev_do_write() is done until calling request_end() which
ultimately results in unlocking page.  But fuse_dev_do_write() won't
complete until the page is unlocked.

The only way out that I see is to have a refcount on all pages in
args.  Which means copying everything not already in refcountable page
(i.e. args on stack) to a page array.   It's definitely doable, but
needs time to sort out, and I'm definitely lacking that (overlayfs
currently trumps fuse).

Thanks,
Miklos
Al Viro Feb. 5, 2017, 10:04 p.m. UTC | #25
On Sun, Feb 05, 2017 at 10:19:20PM +0100, Miklos Szeredi wrote:

> Then we can't break out of that deadlock: we wait until
> fuse_dev_do_write() is done until calling request_end() which
> ultimately results in unlocking page.  But fuse_dev_do_write() won't
> complete until the page is unlocked.

Wait a sec.  What happens if

process A: fuse_lookup()
		struct fuse_entry_out outarg on stack
		...
		fuse_request_send() with req->out.args[0].value = &outarg
			sleep in request_wait_answer() on req->waitq
server: read the request, write reply
	fuse_dev_do_write()
		copy_out_args()
			fuse_copy_args()
				fuse_copy_one()
					FR_LOCKED is guaranteed to be set
					fuse_copy_do()
process C on another CPU: umount -f
	fuse_conn_abort()
		end_requests()
			request_end()
				set FR_FINISHED
				wake A up (via req->waitq)
process A: regain CPU
	bugger off from request_wait_answer(), through __fuse_request_send(),
	fuse_request_send(), fuse_simple_request(), fuse_lookup_name(),
	fuse_lookup() and out of fuse_lookup().

In the meanwhile, server in fuse_copy_do() does memcpy() to what used to
be outarg, corrupting the stack of process A.

Sure, you need to hit a fairly narrow window, especially if you are to
cause damage in A, but AFAICS it's not impossible.  Consider e.g. the
situation when you lose CPU on preempt on the way to memcpy(); in that
case server might come back when A has incremented its stack footprint
again.  Or A might end up taking a hardware interrupt and handling it
on the normal kernel stack, etc.

Looks like *any* scenario where fuse_conn_abort() manages to run during
that memcpy() has potential for that kind of trouble; any SMP box appears
to be vulnerable, along with preempt UP...

Am I missing something that prevents that kind of problem?

> The only way out that I see is to have a refcount on all pages in
> args.  Which means copying everything not already in refcountable page
> (i.e. args on stack) to a page array.   It's definitely doable, but
> needs time to sort out, and I'm definitely lacking that (overlayfs
> currently trumps fuse).

Hrm...  Then maybe I'll have to try and cook something along those lines;
AFAICS the current mainline is vulnerable...
Al Viro Feb. 6, 2017, 3:05 a.m. UTC | #26
On Sun, Feb 05, 2017 at 10:04:45PM +0000, Al Viro wrote:

> Sure, you need to hit a fairly narrow window, especially if you are to
> cause damage in A, but AFAICS it's not impossible.  Consider e.g. the
> situation when you lose CPU on preempt on the way to memcpy(); in that
> case server might come back when A has incremented its stack footprint
> again.  Or A might end up taking a hardware interrupt and handling it
> on the normal kernel stack, etc.
> 
> Looks like *any* scenario where fuse_conn_abort() manages to run during
> that memcpy() has potential for that kind of trouble; any SMP box appears
> to be vulnerable, along with preempt UP...
> 
> Am I missing something that prevents that kind of problem?

For that matter, it doesn't have to be on-stack - e.g. fuse_get_link()
has kmalloc'ed buffer for destination, kfree'd upon failure.  Have the
damn thing lose the timeslice in fuse_copy_do() and you might very well
end up spraying user-supplied data over whatever ends up picking your
kfreed buffer.  That one could be reasonably dealt with if we switched
to page_alloc() and stuffed it into the ->pages[] instead...

Some observations regarding the arguments:
	* stack footprint is atrocious.  Consider e.g. fuse_mknod() - you
get 16 bytes of fuse_mknod_in + 120 bytes of struct fuse_args + 128 bytes
of fuse_entry_out.  All on stack, and that's on top of whatever the
callchain already has eaten, which might include e.g. nfsd stuff or
ecryptfs, etc.  Or fuse_get_parent(), for that matter, with 128 bytes of
fuse_entry_out + 120 bytes of fuse_args, both on stack.  This one is
guaranteed to have a nasty call chain - fuse_get_parent() <- reconnect_one()
<- reconnect_path() <- exportfs_decode_fh() (itself with a 256-byte array of
char on stack) <- nfsd_set_fh_dentry() <- fh_verify() <- a bunch of call
chains in nfsd.
	* "out" args (i.e. reply) are probably best dealt with by having
coallocated with request itself - some already are and the sizes tend
to be fixed and not too large (->get_link() is an exception, and it's
probably better handled as mentioned above).
	* "in" args (request) are in some cases easily dealt with by
coallocating with request, but there's a large class of situations where
we are passing dentry->d_name.name and then there's fuse_symlink().
The last one is ugly - potentially up to a page worth of data, coming
straight from method caller; usually it's a part of getname() result,
but e.g. ecryptfs might have it kmalloc'ed, nfsd - picked from sunrpc
request payload, etc.

	AFAICS, your argument applies to the requests that have
some page(s) locked until the request completion (unlock_page() either
by ->end() callback or in the originator of request).  If so, I would
rather mark those as "call request_end() early"; they seem to have
the non-page parts of args hosted in req->misc, so for them it's not
a problem.

	So how about this:

* explicit FR_END_IMMEDIATELY on read/write-related requests
* no FR_LOCKED flipping in lock_request()/unlock_request()
* modifying the call of end_requests() in fuse_abort_conn() so that it
would skip request_end() for everything that isn't marked FR_END_IMMEDIATELY
* make fuse_copy_pages() grab page references around the actual
fuse_copy_page() - grab req->waitq.lock, check FR_ABORTED, grab a page
reference in case it's not, drop req->waitq.lock and bugger off if FR_ABORTED
was set.  Adjust fuse_try_move_page() accordingly.

Do you see any problems with that approach for minimal fix?  If all requests
in need of FR_END_IMMEDIATELY turn out to have non-page part of args already
embedded into req->misc, it looks like this ought to suffice.  I probably
could post something along those lines tomorrow, if you see any serious
problems with that - please yell...
Miklos Szeredi Feb. 6, 2017, 8:37 a.m. UTC | #27
On Sun, Feb 5, 2017 at 11:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Feb 05, 2017 at 10:19:20PM +0100, Miklos Szeredi wrote:
>
>> Then we can't break out of that deadlock: we wait until
>> fuse_dev_do_write() is done until calling request_end() which
>> ultimately results in unlocking page.  But fuse_dev_do_write() won't
>> complete until the page is unlocked.
>
> Wait a sec.  What happens if
>
> process A: fuse_lookup()
>                 struct fuse_entry_out outarg on stack
>                 ...
>                 fuse_request_send() with req->out.args[0].value = &outarg
>                         sleep in request_wait_answer() on req->waitq
> server: read the request, write reply
>         fuse_dev_do_write()
>                 copy_out_args()
>                         fuse_copy_args()
>                                 fuse_copy_one()
>                                         FR_LOCKED is guaranteed to be set
>                                         fuse_copy_do()
> process C on another CPU: umount -f
>         fuse_conn_abort()
>                 end_requests()
>                         request_end()
>                                 set FR_FINISHED
>                                 wake A up (via req->waitq)
> process A: regain CPU
>         bugger off from request_wait_answer(), through __fuse_request_send(),
>         fuse_request_send(), fuse_simple_request(), fuse_lookup_name(),
>         fuse_lookup() and out of fuse_lookup().
>
> In the meanwhile, server in fuse_copy_do() does memcpy() to what used to
> be outarg, corrupting the stack of process A.
>
> Sure, you need to hit a fairly narrow window, especially if you are to
> cause damage in A, but AFAICS it's not impossible.  Consider e.g. the
> situation when you lose CPU on preempt on the way to memcpy(); in that
> case server might come back when A has incremented its stack footprint
> again.  Or A might end up taking a hardware interrupt and handling it
> on the normal kernel stack, etc.
>
> Looks like *any* scenario where fuse_conn_abort() manages to run during
> that memcpy() has potential for that kind of trouble; any SMP box appears
> to be vulnerable, along with preempt UP...
>
> Am I missing something that prevents that kind of problem?

Yes: if FR_LOCKED is set, then we leave the request alone in
fuse_abort_conn().  Then, when the copy is finished and
request_unlock() is called, we return -ENOENT to fuse_dev_do_write(),
which in turn calls request_end() to wake up the original caller,
which gets -ECONNABORTED.

So basically FR_LOCKED is protecting the copy, which is guaranteed to
be atomic due to the get_user_pages magic that faults in all pages
beforehand.

Thanks,
Miklos
Miklos Szeredi Feb. 6, 2017, 9:08 a.m. UTC | #28
On Mon, Feb 6, 2017 at 4:05 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:

> Some observations regarding the arguments:
>         * stack footprint is atrocious.  Consider e.g. fuse_mknod() - you
> get 16 bytes of fuse_mknod_in + 120 bytes of struct fuse_args + 128 bytes
> of fuse_entry_out.  All on stack, and that's on top of whatever the
> callchain already has eaten, which might include e.g. nfsd stuff or
> ecryptfs, etc.  Or fuse_get_parent(), for that matter, with 128 bytes of
> fuse_entry_out + 120 bytes of fuse_args, both on stack.  This one is
> guaranteed to have a nasty call chain - fuse_get_parent() <- reconnect_one()
> <- reconnect_path() <- exportfs_decode_fh() (itself with a 256-byte array of
> char on stack) <- nfsd_set_fh_dentry() <- fh_verify() <- a bunch of call
> chains in nfsd.

Indeed.

>         * "out" args (i.e. reply) are probably best dealt with by having
> coallocated with request itself - some already are and the sizes tend
> to be fixed and not too large (->get_link() is an exception, and it's
> probably better handled as mentioned above).
>         * "in" args (request) are in some cases easily dealt with by
> coallocating with request, but there's a large class of situations where
> we are passing dentry->d_name.name and then there's fuse_symlink().
> The last one is ugly - potentially up to a page worth of data, coming
> straight from method caller; usually it's a part of getname() result,
> but e.g. ecryptfs might have it kmalloc'ed, nfsd - picked from sunrpc
> request payload, etc.
>
>         AFAICS, your argument applies to the requests that have
> some page(s) locked until the request completion (unlock_page() either
> by ->end() callback or in the originator of request).  If so, I would
> rather mark those as "call request_end() early"; they seem to have
> the non-page parts of args hosted in req->misc, so for them it's not
> a problem.

Yes, I think only page lock can be used to deadlock inside
fuse_dev_read/write().  So requests that don't have locked pages
should be okay  with just waiting until copy_to/from_user() finishes
and only then proceeding with the abort.

Those that have locked pages must be able to be aborted during
copy_to/from_user() because the copy itself may try to acquire the
page lock.

So yes, if we want to switch to copy_to/from_user(), then we can just
fix the page refcounting for read and write requests and handle the
two cases differently.

>         So how about this:
>
> * explicit FR_END_IMMEDIATELY on read/write-related requests
> * no FR_LOCKED flipping in lock_request()/unlock_request()
> * modifying the call of end_requests() in fuse_abort_conn() so that it
> would skip request_end() for everything that isn't marked FR_END_IMMEDIATELY
> * make fuse_copy_pages() grab page references around the actual
> fuse_copy_page() - grab req->waitq.lock, check FR_ABORTED, grab a page
> reference in case it's not, drop req->waitq.lock and bugger off if FR_ABORTED
> was set.  Adjust fuse_try_move_page() accordingly.
>
> Do you see any problems with that approach for minimal fix?  If all requests
> in need of FR_END_IMMEDIATELY turn out to have non-page part of args already
> embedded into req->misc, it looks like this ought to suffice.  I probably
> could post something along those lines tomorrow, if you see any serious
> problems with that - please yell...

See previous mail, I don't think there's an issue with the current
code.  Other than being convoluted as hell.

Thanks,
Miklos
Al Viro Feb. 6, 2017, 9:57 a.m. UTC | #29
On Mon, Feb 06, 2017 at 10:08:06AM +0100, Miklos Szeredi wrote:

> Yes, I think only page lock can be used to deadlock inside
> fuse_dev_read/write().  So requests that don't have locked pages
> should be okay  with just waiting until copy_to/from_user() finishes
> and only then proceeding with the abort.

Actually, looking at that some more, this might be not true.  Anything
that takes ->mmap_sem exclusive and *not* killable makes for another
source of deadlock.

Initial page fault takes ->mmap_sem shared.  OK, request sent to
server and server tries to read() it.  In the meanwhile, something
has closed userfaultfd for the same mm_struct.  We have userfaultfd_release()
block on attempt to take ->mmap_sem exclusive and from now on any attempt
to grab ->mmap_sem shared will deadlock.  And get_user_pages(), as well
as copy_to_user(), etc. can end up doing just that.  It doesn't have to
be an mmap of the same file, BTW - any page fault would do.

All you really need is to have server sharing address space with the
process that steps into original page fault, plus an evicted page
of any nature (anon mmap, whatever) being used as a destination of
read() in server.

down_read() inside down_read() is fine, unless there had been down_write()
in between.  And there are unkillable down_write() on ->mmap_sem -
userfaultfd_release() being one example of such.  Many of those can and
probably should become down_write_killable(), but this one can't - there
might be nothing to deliver the signal to, if the final close() happens
e.g. from exit(2).

Warning: the above might be completely bogus - I'm on way too large
uptime at the moment and most of the last day had been spent digging
through various convoluted code, so take the above with a cartload of
salt.  _If_ it's true, that kind of deadlock won't be possible to
break with killing anything or doing umount -f, though.

> 
> Those that have locked pages must be able to be aborted during
> copy_to/from_user() because the copy itself may try to acquire the
> page lock.
> 
> So yes, if we want to switch to copy_to/from_user(), then we can just
> fix the page refcounting for read and write requests and handle the
> two cases differently.
> 
> >         So how about this:
> >
> > * explicit FR_END_IMMEDIATELY on read/write-related requests
> > * no FR_LOCKED flipping in lock_request()/unlock_request()
> > * modifying the call of end_requests() in fuse_abort_conn() so that it
> > would skip request_end() for everything that isn't marked FR_END_IMMEDIATELY
> > * make fuse_copy_pages() grab page references around the actual
> > fuse_copy_page() - grab req->waitq.lock, check FR_ABORTED, grab a page
> > reference in case it's not, drop req->waitq.lock and bugger off if FR_ABORTED
> > was set.  Adjust fuse_try_move_page() accordingly.
> >
> > Do you see any problems with that approach for minimal fix?  If all requests
> > in need of FR_END_IMMEDIATELY turn out to have non-page part of args already
> > embedded into req->misc, it looks like this ought to suffice.  I probably
> > could post something along those lines tomorrow, if you see any serious
> > problems with that - please yell...
> 
> See previous mail, I don't think there's an issue with the current
> code.  Other than being convoluted as hell.

OK - I'm an idiot and I've managed to misread fuse_abort_conn() despite having
reread it many times last couple of days.  And yes, state transitions of
requests are convoluted as hell ;-/

Anyway, bedtime for me.  With any luck the scare above re ->mmap_sem *is*
bogus and I'll find "Al, you are an idiot - deadlock on ->mmap_sem can't
happen for <reasons>" from somebody in the mailbox when I get up...
Miklos Szeredi Feb. 6, 2017, 2:18 p.m. UTC | #30
On Mon, Feb 6, 2017 at 10:57 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Feb 06, 2017 at 10:08:06AM +0100, Miklos Szeredi wrote:
>
>> Yes, I think only page lock can be used to deadlock inside
>> fuse_dev_read/write().  So requests that don't have locked pages
>> should be okay  with just waiting until copy_to/from_user() finishes
>> and only then proceeding with the abort.
>
> Actually, looking at that some more, this might be not true.  Anything
> that takes ->mmap_sem exclusive and *not* killable makes for another
> source of deadlock.
>
> Initial page fault takes ->mmap_sem shared.  OK, request sent to
> server and server tries to read() it.  In the meanwhile, something
> has closed userfaultfd for the same mm_struct.  We have userfaultfd_release()
> block on attempt to take ->mmap_sem exclusive and from now on any attempt
> to grab ->mmap_sem shared will deadlock.  And get_user_pages(), as well
> as copy_to_user(), etc. can end up doing just that.  It doesn't have to
> be an mmap of the same file, BTW - any page fault would do.
>
> All you really need is to have server sharing address space with the
> process that steps into original page fault, plus an evicted page
> of any nature (anon mmap, whatever) being used as a destination of
> read() in server.
>
> down_read() inside down_read() is fine, unless there had been down_write()
> in between.  And there are unkillable down_write() on ->mmap_sem -
> userfaultfd_release() being one example of such.  Many of those can and
> probably should become down_write_killable(), but this one can't - there
> might be nothing to deliver the signal to, if the final close() happens
> e.g. from exit(2).
>
> Warning: the above might be completely bogus - I'm on way too large
> uptime at the moment and most of the last day had been spent digging
> through various convoluted code, so take the above with a cartload of
> salt.  _If_ it's true, that kind of deadlock won't be possible to
> break with killing anything or doing umount -f, though.

It's not bogus, the deadlock is there.

But I think it's breakable in the same way: if the deadlocked request
is aborted, the fault will release the page lock as well as mmap_sem,
and from there things will resolve themselves.

But you are definitely right about needing to clean up that mess in
fuse/dev.c and doing so by fixing up the arg refcounting for just the
read and write requests is going to be a lot simpler than having to do
that for all of them (which was my original plan).

So, I'll have a go at that sometime.

Thanks,
Miklos
Al Viro Feb. 7, 2017, 7:19 a.m. UTC | #31
On Mon, Feb 06, 2017 at 03:18:42PM +0100, Miklos Szeredi wrote:

> But I think it's breakable in the same way: if the deadlocked request
> is aborted, the fault will release the page lock as well as mmap_sem,
> and from there things will resolve themselves.

Right you are - original holder of ->mmap_sem is waiting for request and
abort will wake it up...

> But you are definitely right about needing to clean up that mess in
> fuse/dev.c and doing so by fixing up the arg refcounting for just the
> read and write requests is going to be a lot simpler than having to do
> that for all of them (which was my original plan).

Speaking of refcounting - what's going on with fuse_file one?  My reading
of that code is that you have 4 states of that thing:
	* new (just created, fallback request allocated, use fuse_file_free()
to kill).  Refcount is 0.
	* intermediate - in fact it's already opened, but still not
put into ->private_data.  Refcount is still 0.  Use fuse_sync_release() to kill.
	* live - normal refcounting (fuse_file_get()/fuse_file_put()).
	* shutdown - refcount has reached 0.  Can't happen until ->release()
(obviously - ->private_data holds a counting reference), some pieces of
fuse_sync_release() correspond to some stuff in fuse_release_common(), some -
to final fuse_file_put().

To make it even more convoluted, cuse is using fuse_sync_release() and
apparently relies upon no references being grabbed after fuse_do_open(),
so that thing can be called with refcount 0 *or* refcount 1.

Another thing: what guarantees that places in writepages-related paths
where we store a reference into req->ff won't hit a request with already
non-NULL ->ff?
Steve Capper Feb. 13, 2017, 9:56 a.m. UTC | #32
On Fri, Feb 03, 2017 at 11:28:48AM -0800, Linus Torvalds wrote:
> On Fri, Feb 3, 2017 at 11:08 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On x86 it does.  I don't see anything equivalent in mm/gup.c one, and the
> > only kinda-sorta similar thing (access_ok() in __get_user_pages_fast()
> > there) is vulnerable to e.g. access via kernel_write().
> 
> Yeah, access_ok() is bogus. It needs to just check against TASK_SIZE
> or whatever.
> 
> > doesn't look promising - access_ok() is never sufficient.  Something like
> > _PAGE_USER tests in x86 one solves that problem, but if anything similar
> > works for HAVE_GENERIC_RCU_GUP I don't see it.  Thus the question re
> > what am I missing here...
> 
> Ok, I definitely agree that it looks like __get_user_pages_fast() just
> needs to get rid of the access_ok() and replace it with a proper check
> for the user address space range.
> 
> Looks like arm[64] and powerpc.are the current users. Adding in some
> people involved with the original submission a few years ago.

Hi,

[ Apologies for my late reply, I was on vacation then catchup... ]

> 
> I do note that the x86 __get_user_pages_fast() thing looks dodgy too.
> 
> In particular, we do it right in the *real* get_user_pages_fast(), see
> commit 7f8189068726 ("x86: don't use 'access_ok()' as a range check in
> get_user_pages_fast()"). But then the same bug was re-introduced when
> the "irq safe" version was merged. As well as in the GENERIC_RCU_GUP
> version.
> 
> Gaah. Apparently PeterZ copied the old buggy version before the fix
> when he added __get_user_pages_fast() in commit 465a454f254e ("x86,
> mm: Add __get_user_pages_fast()").
> 
> I guess it could be considered a merge error (both happened during the
> 2.6.31 merge window).
> 

Okay so looking at what we have for access_ok(.) on arm64, my
understanding is that we perform a 65-bit add/compare (in assembler) to
see whether or not the range is below the current_thread_info->addr_limit.
So I think this is a roundabout way of checking for no-wrap around and <= TASK_SIZE.

Looking at powerpc, I see it's a little different...

So if it sounds reasonable to folk I was going to send a patch to
replace the call to access_ok(.) with a wraparound + TASK_SIZE check
written explicitly in C? (and remove some of the comments talking about
access_ok(.)).

Cheers,
Linus Torvalds Feb. 13, 2017, 9:40 p.m. UTC | #33
On Mon, Feb 13, 2017 at 1:56 AM, Steve Capper <steve.capper@linaro.org> wrote:
>
> Okay so looking at what we have for access_ok(.) on arm64, my
> understanding is that we perform a 65-bit add/compare (in assembler) to
> see whether or not the range is below the current_thread_info->addr_limit.
> So I think this is a roundabout way of checking for no-wrap around and <= TASK_SIZE.

No, that's the problem. It's *not* testing against TASK_SIZE.

Because add_limit is not always TASK_SIZE. When you do
set_fs(KERNEL_DS), you set addr_limit to infinity.

And yes, the kernel does read and write calls too. Seldom, but it
happens. And walking the page tables with kernel addresses is not
supposed to work (sometimes it happens to work by mistake). So if
somebody finds a path that gets from that kind of situation into the
get_user_pages() interface, bad things happen.

                 Linus
Jeff Layton Feb. 16, 2017, 1:10 p.m. UTC | #34
On Thu, 2017-02-02 at 09:51 +0000, Al Viro wrote:
> On Wed, Jan 25, 2017 at 08:32:03AM -0500, Jeff Layton wrote:
> > Small respin of the patch that I sent yesterday for the same thing.
> > 
> > This moves the maxsize handling into iov_iter_pvec_size, so that we don't
> > end up iterating past the max size we'll use anyway when trying to
> > determine the pagevec length.
> > 
> > Also, a respun patch to make ceph use iov_iter_get_pages_alloc instead of
> > trying to do it via its own routine.
> > 
> > Al, if these look ok, do you want to pick these up or shall I ask
> > Ilya to merge them via the ceph tree?
> 
> I'd rather have that kind of work go through the vfs tree; said that,
> I really wonder if this is the right approach.  Most of the users of
> iov_iter_get_pages()/iov_iter_get_pages_alloc() look like they want
> something like
> 	iov_iter_for_each_page(iter, size, f, data)
> with int (*f)(struct page *page, size_t from, size_t size, void *data)
> passed as callback.  Not everything fits that model, but there's a whole
> lot of things that do.
> 

While I do like the above proposal better than what I originally had,
I'm guessing it won't be ready in time for v4.11.

Would it be reasonable to take the patch I proposed for v4.11 as an
interim fix? It does fix a rather easy-to-trigger softlockup in the ceph
code that xfstests can reliably hit.
diff mbox

Patch

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e68604ae3ced..956d17767a3e 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -883,6 +883,58 @@  unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_gap_alignment);
 
+/**
+ * iov_iter_pvec_size - find length of page aligned iovecs in iov_iter
+ * @i: iov_iter to in which to find the size
+ *
+ * Some filesystems can stitch together multiple iovecs into a single
+ * page vector when both the previous tail and current base are page
+ * aligned. This function discovers the length that can fit in a single
+ * pagevec and returns it.
+ */
+static size_t iov_iter_pvec_size(const struct iov_iter *i)
+{
+	size_t size = i->count;
+	size_t pv_size = 0;
+	bool contig = false, first = true;
+
+	if (!size)
+		return 0;
+
+	/* Pipes are naturally aligned for this */
+	if (unlikely(i->type & ITER_PIPE))
+		return size;
+
+	/*
+	 * An iov can be page vectored when the current base and previous
+	 * tail are both page aligned. Note that we don't require that the
+	 * initial base in the first iovec also be page aligned.
+	 */
+	iterate_all_kinds(i, size, v,
+		({
+		 if (first || (contig && PAGE_ALIGNED(v.iov_base))) {
+			pv_size += v.iov_len;
+			first = false;
+			contig = PAGE_ALIGNED(v.iov_base + v.iov_len);
+		 }; 0;
+		 }),
+		({
+		 if (first || (contig && v.bv_offset == 0)) {
+			pv_size += v.bv_len;
+			first = false;
+			contig = PAGE_ALIGNED(v.bv_offset + v.bv_len);
+		 }
+		 }),
+		({
+		 if (first || (contig && PAGE_ALIGNED(v.iov_base))) {
+			pv_size += v.iov_len;
+			first = false;
+			contig = PAGE_ALIGNED(v.iov_base + v.iov_len);
+		 }
+		 }))
+	return pv_size;
+}
+
 static inline size_t __pipe_get_pages(struct iov_iter *i,
 				size_t maxsize,
 				struct page **pages,
@@ -1006,47 +1058,77 @@  static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 }
 
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
-		   struct page ***pages, size_t maxsize,
-		   size_t *start)
+		   struct page ***ppages, size_t maxsize,
+		   size_t *pstart)
 {
-	struct page **p;
-
-	if (maxsize > i->count)
-		maxsize = i->count;
+	struct page **p, **pc;
+	size_t start = 0;
+	ssize_t len = 0;
+	int npages, res = 0;
+	bool first = true;
 
 	if (unlikely(i->type & ITER_PIPE))
-		return pipe_get_pages_alloc(i, pages, maxsize, start);
+		return pipe_get_pages_alloc(i, ppages, maxsize, pstart);
+
+	maxsize = min(iov_iter_pvec_size(i), maxsize);
+	npages = DIV_ROUND_UP(maxsize, PAGE_SIZE);
+	p = get_pages_array(npages);
+	if (!p)
+		return -ENOMEM;
+
+	pc = p;
 	iterate_all_kinds(i, maxsize, v, ({
 		unsigned long addr = (unsigned long)v.iov_base;
-		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
+		size_t slen = v.iov_len;
 		int n;
-		int res;
 
-		addr &= ~(PAGE_SIZE - 1);
-		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		p = get_pages_array(n);
-		if (!p)
-			return -ENOMEM;
-		res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, p);
-		if (unlikely(res < 0)) {
-			kvfree(p);
-			return res;
+		if (first) {
+			start = addr & (PAGE_SIZE - 1);
+			slen += start;
+			first = false;
 		}
-		*pages = p;
-		return (res == n ? len : res * PAGE_SIZE) - *start;
+
+		n = DIV_ROUND_UP(slen, PAGE_SIZE);
+		if ((pc + n) > (p + npages)) {
+			/* Did something change the iov array?!? */
+			res = -EFAULT;
+			goto out;
+		}
+		addr &= ~(PAGE_SIZE - 1);
+		res = get_user_pages_fast(addr, n, (i->type & WRITE) != WRITE, pc);
+		if (unlikely(res < 0))
+			goto out;
+		len += (res == n ? slen : res * PAGE_SIZE) - start;
+		pc += res;
 	0;}),({
-		/* can't be more than PAGE_SIZE */
-		*start = v.bv_offset;
-		*pages = p = get_pages_array(1);
-		if (!p)
-			return -ENOMEM;
-		get_page(*p = v.bv_page);
-		return v.bv_len;
+		/* bio_vecs are limited to a single page each */
+		if (first) {
+			start = v.bv_offset;
+			first = false;
+		}
+		get_page(*pc = v.bv_page);
+		len += v.bv_len;
+		++pc;
+		BUG_ON(pc > p + npages);
 	}),({
-		return -EFAULT;
+		/* FIXME: should we handle this case? */
+		res = -EFAULT;
+		goto out;
 	})
 	)
-	return 0;
+out:
+	if (unlikely(res < 0)) {
+		struct page **i;
+
+		for (i = p; i < pc; i++)
+			put_page(*i);
+		kvfree(p);
+		return res;
+	}
+
+	*ppages = p;
+	*pstart = start;
+	return len;
 }
 EXPORT_SYMBOL(iov_iter_get_pages_alloc);