nfsd: Fix reading via splice

Message ID	169054754615.3783.11682801287165281930.stgit@klimt.1015granger.net (mailing list archive)
State	New, archived
Headers	Return-Path: <linux-fsdevel-owner@vger.kernel.org>
	Subject: [PATCH] nfsd: Fix reading via splice
	From: Chuck Lever <cel@kernel.org>
	Cc: Chuck Lever <chuck.lever@oracle.com>, David Howells <dhowells@redhat.com>, Jeff Layton <jlayton@kernel.org>, Hugh Dickins <hughd@google.com>, Jens Axboe <axboe@kernel.dk>, Matthew Wilcox <willy@infradead.org>, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
	Date: Fri, 28 Jul 2023 08:32:35 -0400
	Message-ID: <169054754615.3783.11682801287165281930.stgit@klimt.1015granger.net>
	User-Agent: StGit/1.5
	MIME-Version: 1.0
	Content-Type: text/plain; charset="utf-8"
	Content-Transfer-Encoding: 7bit
	To: unlisted-recipients:; (no To-header on input)
	Precedence: bulk
Series	nfsd: Fix reading via splice

Chuck Lever July 28, 2023, 12:32 p.m. UTC

From: David Howells <dhowells@redhat.com>

nfsd_splice_actor() has a clause in its loop that chops up a compound page
into individual pages such that if the same page is seen twice in a row, it
is discarded the second time.  This is a problem with the advent of
shmem_splice_read() as that inserts zero_pages into the pipe in lieu of
pages that aren't present in the pagecache.

Fix this by assuming that the last page is being extended only if the
currently stored length + starting offset is not currently on a page
boundary.
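
As a rough illustration of the old and new tests, here is a userspace
model, not the kernel code.  struct model_reply, old_skip(), new_skip() and
feed() are hypothetical names, a 4 KiB page size is assumed, and the
bookkeeping of page_base and page_len is assumed to match what the actor
does for rq_res:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */
#define MODEL_MAX_PAGES 16u

/* Hypothetical stand-in for the parts of svc_rqst the actor touches. */
struct model_reply {
	const void *pages[MODEL_MAX_PAGES];	/* pages attached so far */
	size_t next;		/* models rq_next_page */
	size_t page_base;	/* models rq_res.page_base */
	size_t page_len;	/* models rq_res.page_len */
};

/* Old test: skip whenever the page equals the last one attached. */
static bool old_skip(const struct model_reply *r, const void *page)
{
	return r->next > 0 && r->pages[r->next - 1] == page;
}

/* Patched test: additionally require that the data stored so far ends
 * in the middle of a page, i.e. the last page really is being extended. */
static bool new_skip(const struct model_reply *r, const void *page)
{
	return old_skip(r, page) &&
	       (r->page_base + r->page_len) % MODEL_PAGE_SIZE != 0;
}

/* Feed one pipe buffer that lies within a single page.
 * Returns the number of pages attached to the reply afterwards. */
static size_t feed(struct model_reply *r, const void *page,
		   size_t offset, size_t len, bool patched)
{
	bool skip = patched ? new_skip(r, page) : old_skip(r, page);

	assert(len > 0 && offset + len <= MODEL_PAGE_SIZE);
	if (!skip) {
		assert(r->next < MODEL_MAX_PAGES);
		r->pages[r->next++] = page;
	}
	if (r->page_len == 0)
		r->page_base = offset;
	r->page_len += len;
	return r->next;
}
```

Feeding the same page twice as two full 4096-byte buffers (the shmem
zero_page case) leaves one attached page under the old test, although
page_len has grown to 8192, and two under the patched test.  Feeding the
same page as two 1024-byte pieces at offsets 0 and 1024 leaves one attached
page under both tests, which is the extension case the skip was written for.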

This can be tested by NFS-exporting a tmpfs filesystem on the test machine,
creating a file on it that is extended to more than a page in size (e.g.
truncate -s 8192) and then reading it by NFS.  The first page will be all
zeros, but thereafter garbage will be read.
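
A minimal userspace sketch of that test follows.  It is an assumption about
how one might script the check, not part of the patch, and
make_sparse_file() and count_nonzero_bytes() are hypothetical helper names:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) a file and extend it to `size` bytes of hole,
 * much like `truncate -s`.  Returns 0 on success, -1 on error. */
static int make_sparse_file(const char *path, off_t size)
{
	int fd, ret;

	assert(path != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	ret = ftruncate(fd, size);
	if (close(fd) < 0)
		ret = -1;
	return ret;
}

/* Read the whole file and count the bytes that are not zero.
 * Returns the count, or -1 on an I/O error. */
static long count_nonzero_bytes(const char *path)
{
	unsigned char buf[4096];
	long nonzero = 0;
	ssize_t n, i;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		for (i = 0; i < n; i++)
			if (buf[i] != 0)
				nonzero++;
	close(fd);
	return n < 0 ? -1 : nonzero;
}
```

The idea would be to call make_sparse_file() with a size of 8192 on a path
inside the exported tmpfs on the server, then call count_nonzero_bytes() on
the same file through the NFS mount on the client.  A file that is one big
hole should give 0; a positive count would point at the garbage described
above.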

Note: I wonder if we can ever get a situation now where we get a splice
that gives us contiguous parts of a page in separate actor calls.  As NFSD
can only be splicing from a file (I think), there are only three sources of
the page: copy_splice_read(), shmem_splice_read() and file_splice_read().
The first allocates pages for the data it reads, so the problem cannot
occur; the second should never see a partial page; and the third waits for
each page to become available before we're allowed to read from it.

Fixes: bd194b187115 ("shmem: Implement splice-read")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Hugh Dickins <hughd@google.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/vfs.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

NeilBrown July 28, 2023, 11:54 p.m. UTC | #1

On Fri, 28 Jul 2023, Chuck Lever wrote:
> From: David Howells <dhowells@redhat.com>
> 
> nfsd_splice_actor() has a clause in its loop that chops up a compound page
> into individual pages such that if the same page is seen twice in a row, it
> is discarded the second time.  This is a problem with the advent of
> shmem_splice_read() as that inserts zero_pages into the pipe in lieu of
> pages that aren't present in the pagecache.
> 
> Fix this by assuming that the last page is being extended only if the
> currently stored length + starting offset is not currently on a page
> boundary.
> 
> This can be tested by NFS-exporting a tmpfs filesystem on the test machine
> and truncating it to more than a page in size (eg. truncate -s 8192) and
> then reading it by NFS.  The first page will be all zeros, but thereafter
> garbage will be read.
> 
> Note: I wonder if we can ever get a situation now where we get a splice
> that gives us contiguous parts of a page in separate actor calls.  As NFSD
> can only be splicing from a file (I think), there are only three sources of
> the page: copy_splice_read(), shmem_splice_read() and file_splice_read().
> The first allocates pages for the data it reads, so the problem cannot
> occur; the second should never see a partial page; and the third waits for
> each page to become available before we're allowed to read from it.
> 
> Fixes: bd194b187115 ("shmem: Implement splice-read")
> Reported-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> cc: Hugh Dickins <hughd@google.com>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: linux-nfs@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-mm@kvack.org
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c |    9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 59b7d60ae33e..ee3bbaa79478 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -956,10 +956,13 @@ nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>  	last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
>  	for (page += offset / PAGE_SIZE; page <= last_page; page++) {
>  		/*
> -		 * Skip page replacement when extending the contents
> -		 * of the current page.
> +		 * Skip page replacement when extending the contents of the
> +		 * current page.  But note that we may get two zero_pages in a
> +		 * row from shmem.
>  		 */
> -		if (page == *(rqstp->rq_next_page - 1))
> +		if (page == *(rqstp->rq_next_page - 1) &&
> +		    offset_in_page(rqstp->rq_res.page_base +
> +				   rqstp->rq_res.page_len))

This seems fragile in that it makes assumptions about the pages being
sent and their alignment.
Given that it was broken by the splice-read change, that confirms it is
fragile.  Maybe we could make the code a bit more explicit about what is
expected.

Also, I don't think this test can ever be relevant after the first time
through the loop.  So I think it would be clearest to have the
interesting case outside the loop.

 page += offset / PAGE_SIZE;
 if (rqstp->rq_res.page_len > 0) {
      /* appending to page list - check alignment */
      if (offset % PAGE_SIZE != (rqstp->rq_res.page_base +
                                 rqstp->rq_res.page_len) % PAGE_SIZE)
           return -EIO;
      if (offset % PAGE_SIZE != 0) {
           /* continuing previous page */
           if (page != rqstp->rq_next_page[-1])
                return -EIO;
           page += 1;
      }
 } else
      /* Starting new page list */
      rqstp->rq_res.page_base = offset % PAGE_SIZE;

 for ( ; page <= last_page ; page++)
      if (unlikely(!svc_rqst_replace_page(rqstp, page)))
           return -EIO;

 rqstp->rq_res.page_len += sd->len;
 return sd->len;
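
The checks in the sketch above can be read as a four-way classification of
each incoming buffer.  The following is a userspace model of just that
decision; enum append_kind and classify_append() are hypothetical names, a
4 KiB page size is assumed, and APPEND_INVALID stands for the -EIO returns:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

enum append_kind {
	APPEND_NEW_LIST,	/* nothing stored yet, start a new page list */
	APPEND_ALIGNED,		/* stored data ends on a page boundary */
	APPEND_CONTINUE,	/* buffer continues the partly filled last page */
	APPEND_INVALID,		/* misaligned or wrong page: -EIO in the sketch */
};

/* page_base and page_len describe the data stored so far; offset is where
 * the new buffer starts; same_as_last_page says whether its first page is
 * the page most recently attached to the reply. */
static enum append_kind classify_append(size_t page_base, size_t page_len,
					size_t offset, bool same_as_last_page)
{
	size_t in_page = offset % MODEL_PAGE_SIZE;

	assert(page_base < MODEL_PAGE_SIZE);
	if (page_len == 0)
		return APPEND_NEW_LIST;
	if (in_page != (page_base + page_len) % MODEL_PAGE_SIZE)
		return APPEND_INVALID;
	if (in_page == 0)
		return APPEND_ALIGNED;
	return same_as_last_page ? APPEND_CONTINUE : APPEND_INVALID;
}
```

With 4096 bytes already stored from offset 0, a second buffer starting at
offset 0 is APPEND_ALIGNED even if it is the same page again, so two
zero_pages in a row are both attached.  Only a buffer that starts mid-page,
exactly where the stored data ends and on the same page, is treated as
APPEND_CONTINUE.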


Also, the name "svc_rqst_replace_page" doesn't give any hint that the
next_page pointer is advanced.  Maybe svc_rqst_add_page() ???  Not great
I admit.

NeilBrown

   

>  			continue;
>  		if (unlikely(!svc_rqst_replace_page(rqstp, page)))
>  			return -EIO;
> 
> 
>

Chuck Lever III July 30, 2023, 3:29 p.m. UTC | #2

On Sat, Jul 29, 2023 at 09:54:58AM +1000, NeilBrown wrote:
> On Fri, 28 Jul 2023, Chuck Lever wrote:
> > From: David Howells <dhowells@redhat.com>
> > 
> > nfsd_splice_actor() has a clause in its loop that chops up a compound page
> > into individual pages such that if the same page is seen twice in a row, it
> > is discarded the second time.  This is a problem with the advent of
> > shmem_splice_read() as that inserts zero_pages into the pipe in lieu of
> > pages that aren't present in the pagecache.
> > 
> > Fix this by assuming that the last page is being extended only if the
> > currently stored length + starting offset is not currently on a page
> > boundary.
> > 
> > This can be tested by NFS-exporting a tmpfs filesystem on the test machine
> > and truncating it to more than a page in size (eg. truncate -s 8192) and
> > then reading it by NFS.  The first page will be all zeros, but thereafter
> > garbage will be read.
> > 
> > Note: I wonder if we can ever get a situation now where we get a splice
> > that gives us contiguous parts of a page in separate actor calls.  As NFSD
> > can only be splicing from a file (I think), there are only three sources of
> > the page: copy_splice_read(), shmem_splice_read() and file_splice_read().
> > The first allocates pages for the data it reads, so the problem cannot
> > occur; the second should never see a partial page; and the third waits for
> > each page to become available before we're allowed to read from it.
> > 
> > Fixes: bd194b187115 ("shmem: Implement splice-read")
> > Reported-by: Chuck Lever <chuck.lever@oracle.com>
> > Signed-off-by: David Howells <dhowells@redhat.com>
> > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > cc: Hugh Dickins <hughd@google.com>
> > cc: Jens Axboe <axboe@kernel.dk>
> > cc: Matthew Wilcox <willy@infradead.org>
> > cc: linux-nfs@vger.kernel.org
> > cc: linux-fsdevel@vger.kernel.org
> > cc: linux-mm@kvack.org
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >  fs/nfsd/vfs.c |    9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 59b7d60ae33e..ee3bbaa79478 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -956,10 +956,13 @@ nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> >  	last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
> >  	for (page += offset / PAGE_SIZE; page <= last_page; page++) {
> >  		/*
> > -		 * Skip page replacement when extending the contents
> > -		 * of the current page.
> > +		 * Skip page replacement when extending the contents of the
> > +		 * current page.  But note that we may get two zero_pages in a
> > +		 * row from shmem.
> >  		 */
> > -		if (page == *(rqstp->rq_next_page - 1))
> > +		if (page == *(rqstp->rq_next_page - 1) &&
> > +		    offset_in_page(rqstp->rq_res.page_base +
> > +				   rqstp->rq_res.page_len))
> 
> This seems fragile in that it makes assumptions about the pages being
> sent and their alignment.
> Given that it was broken by the splice-read change, that confirms it is
> fragile.  Maybe we could make the code a bit more explicit about what is
> expected.

Indeed, this code is brittle. This is not even the only time the
actor has been broken in the past four or five kernel releases.

IMO the problem is that there is no API contract or documentation
for splice actors. And as far as I am aware, only a few other
examples are in use to learn from.


> Also, I don't think this test can ever be relevant after the first time
> through the loop.  So I think it would be clearest to have the
> interesting case outside the loop.
> 
>  page += offset / PAGE_SIZE;
>  if (rqstp->rq_res.page_len > 0) {
>       /* appending to page list - check alignment */
>       if (offset % PAGE_SIZE != (rqstp->rq_res.page_base +
>                                  rqstp->rq_res.page_len) % PAGE_SIZE)
>            return -EIO;
>       if (offset % PAGE_SIZE != 0) {
>            /* continuing previous page */
>            if (page != rqstp->rq_next_page[-1])
>                 return -EIO;
>            page += 1;
>       }
>  } else
>       /* Starting new page list */
>       rqstp->rq_res.page_base = offset % PAGE_SIZE;
> 
>  for ( ; page <= last_page ; page++)
>       if (unlikely(!svc_rqst_replace_page(rqstp, page)))
>            return -EIO;
> 
>  rqstp->rq_res.page_len += sd->len;
>  return sd->len;
> 
> 
> Also, the name "svc_rqst_replace_page" doesn't give any hint that the
> next_page pointer is advanced.  Maybe svc_rqst_add_page() ???  Not great
> I admit.

All reasonable suggestions.

However, I'm getting ready to replace the splice read code with...
je ne sais pas.

- There are reports that splice read doesn't perform well

- It's a brittle piece of engineering, as observed

- The "zero copy" read path will need to support folios, hopefully
  sooner rather than later

- We want the server's read path to use iomap when that is more
  broadly available in local filesystems

- This fix is destined for 6.5-rc, which limits the amount of
  clean up and optimization we should be doing

I'd like to apply David's fix as-is, unless it's truly broken or
someone has a better quick solution.


> >  			continue;
> >  		if (unlikely(!svc_rqst_replace_page(rqstp, page)))
> >  			return -EIO;
> > 
> > 
> > 
>

Hugh Dickins July 30, 2023, 4:50 p.m. UTC | #3

On Sun, 30 Jul 2023, Chuck Lever wrote:
> On Sat, Jul 29, 2023 at 09:54:58AM +1000, NeilBrown wrote:
> > On Fri, 28 Jul 2023, Chuck Lever wrote:
> > > From: David Howells <dhowells@redhat.com>
...
> - This fix is destined for 6.5-rc, which limits the amount of
>   clean up and optimization we should be doing
> 
> I'd like to apply David's fix as-is, unless it's truly broken or
> someone has a better quick solution.

I certainly have no objection to you doing so; and think that you
and David will have a much better appreciation of the risks than me.

But I ought to mention that this two-ZERO_PAGEs-in-a-row behaviour
was problematic for splice() in the past - see the comments on
ZERO_PAGE(0) and its alternative block in shmem_file_read_iter().
1bdec44b1eee ("tmpfs: fix regressions from wider use of ZERO_PAGE"):
ah, that came from a report by you too, xfstests on nfsd.

In principle there's a very simple (but inferior) solution at the
shmem end: for shmem_file_splice_read() to use SGP_CACHE (used when
faulting in a hole) instead of SGP_READ in its call to shmem_get_folio().
(And delete all of shmem's splice_zeropage_into_pipe() code.)

I say "in principle" because all David's testing has been with the
SGP_READ there, and perhaps there's some gotcha I'm overlooking which
would turn up when switching over to SGP_CACHE.  And I say "inferior"
because that way entails allocating and zeroing pages for holes (which
page reclaim will then free later on if they remain clean).

My vote would be for putting David's nfsd patch in for now, but
keeping an open mind as to whether the shmem end has to change,
if there might be further problems elsewhere than nfsd.

Hugh

Chuck Lever III July 30, 2023, 4:55 p.m. UTC | #4

On Sun, Jul 30, 2023 at 09:50:44AM -0700, Hugh Dickins wrote:
> On Sun, 30 Jul 2023, Chuck Lever wrote:
> > On Sat, Jul 29, 2023 at 09:54:58AM +1000, NeilBrown wrote:
> > > On Fri, 28 Jul 2023, Chuck Lever wrote:
> > > > From: David Howells <dhowells@redhat.com>
> ...
> > - This fix is destined for 6.5-rc, which limits the amount of
> >   clean up and optimization we should be doing
> > 
> > I'd like to apply David's fix as-is, unless it's truly broken or
> > someone has a better quick solution.
> 
> I certainly have no objection to you doing so; and think that you
> and David will have a much better appreciation of the risks than me.
> 
> But I ought to mention that this two-ZERO_PAGEs-in-a-row behaviour
> was problematic for splice() in the past - see the comments on
> ZERO_PAGE(0) and its alternative block in shmem_file_read_iter().
> 1bdec44b1eee ("tmpfs: fix regressions from wider use of ZERO_PAGE"):
> ah, that came from a report by you too, xfstests on nfsd.

Yes, I thought we had visited this ZERO_PAGE approach before, but
couldn't put my finger on exactly when or where.


> In principle there's a very simple (but inferior) solution at the
> shmem end: for shmem_file_splice_read() to use SGP_CACHE (used when
> faulting in a hole) instead of SGP_READ in its call to shmem_get_folio().
> (And delete all of shmem's splice_zeropage_into_pipe() code.)
> 
> I say "in principle" because all David's testing has been with the
> SGP_READ there, and perhaps there's some gotcha I'm overlooking which
> would turn up when switching over to SGP_CACHE.  And I say "inferior"
> because that way entails allocating and zeroing pages for holes (which
> page reclaim will then free later on if they remain clean).
> 
> My vote would be for putting David's nfsd patch in for now, but
> keeping an open mind as to whether the shmem end has to change,
> if there might be further problems elsewhere than nfsd.

I'm open to that.

NeilBrown July 30, 2023, 10:02 p.m. UTC | #5

On Mon, 31 Jul 2023, Chuck Lever wrote:
> 
> I'd like to apply David's fix as-is, unless it's truly broken or
> someone has a better quick solution.
> 

Your reasoning is sound.  From a behavioural perspective (though not
from a maintenance perspective) the patch is no worse than the current
code, so
   Reviewed-by: NeilBrown <neilb@suse.de>

NeilBrown
