
[1/2] nfsd: don't replace page in rq_pages if it's a continuation of last page

Message ID 20230317105608.19393-1-jlayton@kernel.org (mailing list archive)
State New, archived

Commit Message

Jeff Layton March 17, 2023, 10:56 a.m. UTC
A splice read calls nfsd_splice_actor to put the pages containing file
data into the svc_rqst->rq_pages array. It's possible, however, to get a
splice result that only has a partial page at the end, if (e.g.) the
filesystem hands back a short read that doesn't cover the whole page.

nfsd_splice_actor will plop the partial page into its rq_pages array and
return. Later, when nfsd_splice_actor is called again, the remainder of
the page may end up being filled out. At that point, nfsd_splice_actor
will put the page into the array _again_, corrupting the reply. If this
is done enough times, rq_next_page will overrun the array and corrupt
the trailing fields -- the rq_respages and rq_next_page pointers
themselves.

If we've already added the page to the array in the last pass, don't add
it to the array a second time when dealing with a splice continuation.
This was originally handled properly in nfsd_splice_actor, but commit
91e23b1c3982 removed the check for it, and started universally replacing
pages.

Fixes: 91e23b1c3982 ("NFSD: Clean up nfsd_splice_actor()")
Reported-by: Dario Lesca <d.lesca@solinos.it>
Tested-by: David Critch <dcritch@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2150630
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/nfsd/vfs.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Jeff Layton March 17, 2023, 1:06 p.m. UTC | #1
On Fri, 2023-03-17 at 06:56 -0400, Jeff Layton wrote:
> The splice read calls nfsd_splice_actor to put the pages containing file
> data into the svc_rqst->rq_pages array. It's possible however to get a
> splice result that only has a partial page at the end, if (e.g.) the
> filesystem hands back a short read that doesn't cover the whole page.
> 
> nfsd_splice_actor will plop the partial page into its rq_pages array and
> return. Then later, when nfsd_splice_actor is called again, the
> remainder of the page may end up being filled out. At this point,
> nfsd_splice_actor will put the page into the array _again_ corrupting
> the reply. If this is done enough times, rq_next_page will overrun the
> array and corrupt the trailing fields -- the rq_respages and
> rq_next_page pointers themselves.
> 
> If we've already added the page to the array in the last pass, don't add
> it to the array a second time when dealing with a splice continuation.
> This was originally handled properly in nfsd_splice_actor, but commit
> 91e23b1c3982 removed the check for it, and started universally replacing
> pages.
> 
> Fixes: 91e23b1c3982 ("NFSD: Clean up nfsd_splice_actor()")
> Reported-by: Dario Lesca <d.lesca@solinos.it>
> Tested-by: David Critch <dcritch@redhat.com>
> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2150630
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/nfsd/vfs.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 502e1b7742db..3709ef57d96e 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -941,8 +941,11 @@ nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>  	struct page *last_page;
>  
>  	last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
> -	for (page += offset / PAGE_SIZE; page <= last_page; page++)
> -		svc_rqst_replace_page(rqstp, page);
> +	for (page += offset / PAGE_SIZE; page <= last_page; page++) {
> +		/* Only replace page if we haven't already done so */

Note that I think that this was probably the real rationale for the pp[-
1] check that 91e23b1c3982 removed. Given that, maybe we should flesh
this comment out a bit more for posterity?

		/*
		 * When we're splicing from a pipe, it's possible that
		 * we'll get an incomplete page that may be updated on
		 * a later call. Only splice it into rq_pages once.
		 */

> +		if (page != *(rqstp->rq_next_page - 1))
> +			svc_rqst_replace_page(rqstp, page);
> +	}
>  	if (rqstp->rq_res.page_len == 0)	// first call
>  		rqstp->rq_res.page_base = offset % PAGE_SIZE;
>  	rqstp->rq_res.page_len += sd->len;
Chuck Lever March 17, 2023, 2:16 p.m. UTC | #2
> On Mar 17, 2023, at 9:06 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Fri, 2023-03-17 at 06:56 -0400, Jeff Layton wrote:
>> The splice read calls nfsd_splice_actor to put the pages containing file
>> data into the svc_rqst->rq_pages array. It's possible however to get a
>> splice result that only has a partial page at the end, if (e.g.) the
>> filesystem hands back a short read that doesn't cover the whole page.
>> 
>> nfsd_splice_actor will plop the partial page into its rq_pages array and
>> return. Then later, when nfsd_splice_actor is called again, the
>> remainder of the page may end up being filled out. At this point,
>> nfsd_splice_actor will put the page into the array _again_ corrupting
>> the reply. If this is done enough times, rq_next_page will overrun the
>> array and corrupt the trailing fields -- the rq_respages and
>> rq_next_page pointers themselves.
>> 
>> If we've already added the page to the array in the last pass, don't add
>> it to the array a second time when dealing with a splice continuation.
>> This was originally handled properly in nfsd_splice_actor, but commit
>> 91e23b1c3982 removed the check for it, and started universally replacing
>> pages.
>> 
>> Fixes: 91e23b1c3982 ("NFSD: Clean up nfsd_splice_actor()")
>> Reported-by: Dario Lesca <d.lesca@solinos.it>
>> Tested-by: David Critch <dcritch@redhat.com>
>> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2150630
>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>> ---
>> fs/nfsd/vfs.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>> 
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 502e1b7742db..3709ef57d96e 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -941,8 +941,11 @@ nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>> 	struct page *last_page;
>> 
>> 	last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
>> -	for (page += offset / PAGE_SIZE; page <= last_page; page++)
>> -		svc_rqst_replace_page(rqstp, page);
>> +	for (page += offset / PAGE_SIZE; page <= last_page; page++) {
>> +		/* Only replace page if we haven't already done so */
> 
> Note that I think that this was probably the real rationale for the pp[-
> 1] check that 91e23b1c3982 removed. Given that, maybe we should flesh
> this comment out a bit more for posterity?
> 
> 		/*
> 		 * When we're splicing from a pipe, it's possible that
> 		 * we'll get an incomplete page that may be updated on
> 		 * a later call. Only splice it into rq_pages once.
> 		 */

The "real" bug here is that the API contract for pipe splicing
isn't well defined, so I agree that it's very likely the pp[-1]
check was because a splice can call the actor repeatedly for the
same page. No one could remember why that check was there.

To be clear, if the passed-in page matches the current page in
the rqst, we're "extending the current page" rather than avoiding
replacement... maybe:

	/*
	 * Skip page replacement when extending the contents
	 * of the current page.
	 */

In the patch description, would you mention that this case
arises if the READ request is not page-aligned?

If you resend this patch, please Cc: viro@ . Thanks for chasing
this down!


>> +		if (page != *(rqstp->rq_next_page - 1))
>> +			svc_rqst_replace_page(rqstp, page);
>> +	}
>> 	if (rqstp->rq_res.page_len == 0)	// first call
>> 		rqstp->rq_res.page_base = offset % PAGE_SIZE;
>> 	rqstp->rq_res.page_len += sd->len;
> 
> -- 
> Jeff Layton <jlayton@kernel.org>

--
Chuck Lever
Jeff Layton March 17, 2023, 2:59 p.m. UTC | #3
On Fri, 2023-03-17 at 14:16 +0000, Chuck Lever III wrote:
> 
> > On Mar 17, 2023, at 9:06 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Fri, 2023-03-17 at 06:56 -0400, Jeff Layton wrote:
> > > The splice read calls nfsd_splice_actor to put the pages containing file
> > > data into the svc_rqst->rq_pages array. It's possible however to get a
> > > splice result that only has a partial page at the end, if (e.g.) the
> > > filesystem hands back a short read that doesn't cover the whole page.
> > > 
> > > nfsd_splice_actor will plop the partial page into its rq_pages array and
> > > return. Then later, when nfsd_splice_actor is called again, the
> > > remainder of the page may end up being filled out. At this point,
> > > nfsd_splice_actor will put the page into the array _again_ corrupting
> > > the reply. If this is done enough times, rq_next_page will overrun the
> > > array and corrupt the trailing fields -- the rq_respages and
> > > rq_next_page pointers themselves.
> > > 
> > > If we've already added the page to the array in the last pass, don't add
> > > it to the array a second time when dealing with a splice continuation.
> > > This was originally handled properly in nfsd_splice_actor, but commit
> > > 91e23b1c3982 removed the check for it, and started universally replacing
> > > pages.
> > > 
> > > Fixes: 91e23b1c3982 ("NFSD: Clean up nfsd_splice_actor()")
> > > Reported-by: Dario Lesca <d.lesca@solinos.it>
> > > Tested-by: David Critch <dcritch@redhat.com>
> > > Link: https://bugzilla.redhat.com/show_bug.cgi?id=2150630
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > > fs/nfsd/vfs.c | 7 +++++--
> > > 1 file changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > index 502e1b7742db..3709ef57d96e 100644
> > > --- a/fs/nfsd/vfs.c
> > > +++ b/fs/nfsd/vfs.c
> > > @@ -941,8 +941,11 @@ nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> > > 	struct page *last_page;
> > > 
> > > 	last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
> > > -	for (page += offset / PAGE_SIZE; page <= last_page; page++)
> > > -		svc_rqst_replace_page(rqstp, page);
> > > +	for (page += offset / PAGE_SIZE; page <= last_page; page++) {
> > > +		/* Only replace page if we haven't already done so */
> > 
> > Note that I think that this was probably the real rationale for the pp[-
> > 1] check that 91e23b1c3982 removed. Given that, maybe we should flesh
> > this comment out a bit more for posterity?
> > 
> > 		/*
> > 		 * When we're splicing from a pipe, it's possible that
> > 		 * we'll get an incomplete page that may be updated on
> > 		 * a later call. Only splice it into rq_pages once.
> > 		 */
> 
> The "real" bug here is that the API contract for pipe splicing
> isn't well defined, so I agree that it's very likely the pp[-1]
> check was because a splice can call the actor repeatedly for the
> same page. No one could remember why that check was there.
> 

The whole splice API is a minefield.

> To be clear, if the passed-in page matches the current page in
> the rqst, we're "extending the current page" rather than avoiding
> replacement... maybe:
> 
> 	/*
> 	 * Skip page replacement when extending the contents
> 	 * of the current page.
> 	 */
> 

Sure, sounds good.

> In the patch description, would you mention that this case
> arises if the READ request is not page-aligned?
> 

Does it though? I'm not sure that page alignment has that much to do
with it. I imagine you can hit this even with aligned I/Os.

My guess is the bigger issue is when your storage is doing sub-page-size
I/Os under the hood. We end up filling up part of a page from storage
and the kernel submits what it has to the pipe and then the next bit
comes in and the page is updated for the next actor call.

> If you resend this patch, please Cc: viro@ . Thanks for chasing
> this down!
> 

Will do.

> 
> > > +		if (page != *(rqstp->rq_next_page - 1))
> > > +			svc_rqst_replace_page(rqstp, page);
> > > +	}
> > > 	if (rqstp->rq_res.page_len == 0)	// first call
> > > 		rqstp->rq_res.page_base = offset % PAGE_SIZE;
> > > 	rqstp->rq_res.page_len += sd->len;
> > 
> > -- 
> > Jeff Layton <jlayton@kernel.org>
> 
> --
> Chuck Lever
> 
>
Chuck Lever March 17, 2023, 3:04 p.m. UTC | #4
> On Mar 17, 2023, at 10:59 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Fri, 2023-03-17 at 14:16 +0000, Chuck Lever III wrote:
> 
>> In the patch description, would you mention that this case
>> arises if the READ request is not page-aligned?
> 
> Does it though? I'm not sure that page alignment has that much to do
> with it. I imagine you can hit this even with aligned I/Os.

Maybe, but no-one has actually seen that. The vast majority of
reports of this problem are with unaligned I/O, which POSIX OS
NFS clients (like the Linux NFS client) usually avoid.

I didn't mean to exclude the possibility of hitting this issue
in other ways, but simply observing a common way it is hit.


--
Chuck Lever
Jeff Layton March 17, 2023, 5:23 p.m. UTC | #5
On Fri, 2023-03-17 at 15:04 +0000, Chuck Lever III wrote:
> 
> > On Mar 17, 2023, at 10:59 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Fri, 2023-03-17 at 14:16 +0000, Chuck Lever III wrote:
> > 
> > > In the patch description, would you mention that this case
> > > arises if the READ request is not page-aligned?
> > 
> > Does it though? I'm not sure that page alignment has that much to do
> > with it. I imagine you can hit this even with aligned I/Os.
> 
> Maybe, but no-one has actually seen that. The vast majority of
> reports of this problem are with unaligned I/O, which POSIX OS
> NFS clients (like the Linux NFS client) usually avoid.
> 
> I didn't mean to exclude the possibility of hitting this issue
> in other ways, but simply observing a common way it is hit.
> 

An unaligned read will consume an extra page, so maybe it just makes it
more likely to overrun the array in that case?

Patch

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 502e1b7742db..3709ef57d96e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -941,8 +941,11 @@  nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
 	struct page *last_page;
 
 	last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
-	for (page += offset / PAGE_SIZE; page <= last_page; page++)
-		svc_rqst_replace_page(rqstp, page);
+	for (page += offset / PAGE_SIZE; page <= last_page; page++) {
+		/* Only replace page if we haven't already done so */
+		if (page != *(rqstp->rq_next_page - 1))
+			svc_rqst_replace_page(rqstp, page);
+	}
 	if (rqstp->rq_res.page_len == 0)	// first call
 		rqstp->rq_res.page_base = offset % PAGE_SIZE;
 	rqstp->rq_res.page_len += sd->len;