Message ID | 169054754615.3783.11682801287165281930.stgit@klimt.1015granger.net (mailing list archive) |
---|---|
State | New, archived |
Series | nfsd: Fix reading via splice |
On Fri, 28 Jul 2023, Chuck Lever wrote:
> From: David Howells <dhowells@redhat.com>
>
> nfsd_splice_actor() has a clause in its loop that chops up a compound page
> into individual pages such that if the same page is seen twice in a row, it
> is discarded the second time.  This is a problem with the advent of
> shmem_splice_read() as that inserts zero_pages into the pipe in lieu of
> pages that aren't present in the pagecache.
>
> Fix this by assuming that the last page is being extended only if the
> currently stored length + starting offset is not currently on a page
> boundary.
>
> This can be tested by NFS-exporting a tmpfs filesystem on the test machine
> and truncating it to more than a page in size (eg. truncate -s 8192) and
> then reading it by NFS.  The first page will be all zeros, but thereafter
> garbage will be read.
>
> Note: I wonder if we can ever get a situation now where we get a splice
> that gives us contiguous parts of a page in separate actor calls.  As NFSD
> can only be splicing from a file (I think), there are only three sources of
> the page: copy_splice_read(), shmem_splice_read() and file_splice_read().
> The first allocates pages for the data it reads, so the problem cannot
> occur; the second should never see a partial page; and the third waits for
> each page to become available before we're allowed to read from it.
>
> Fixes: bd194b187115 ("shmem: Implement splice-read")
> Reported-by: Chuck Lever <chuck.lever@oracle.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> cc: Hugh Dickins <hughd@google.com>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: linux-nfs@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-mm@kvack.org
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/vfs.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 59b7d60ae33e..ee3bbaa79478 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -956,10 +956,13 @@ nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>  	last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
>  	for (page += offset / PAGE_SIZE; page <= last_page; page++) {
>  		/*
> -		 * Skip page replacement when extending the contents
> -		 * of the current page.
> +		 * Skip page replacement when extending the contents of the
> +		 * current page.  But note that we may get two zero_pages in a
> +		 * row from shmem.
>  		 */
> -		if (page == *(rqstp->rq_next_page - 1))
> +		if (page == *(rqstp->rq_next_page - 1) &&
> +		    offset_in_page(rqstp->rq_res.page_base +
> +				   rqstp->rq_res.page_len))

This seems fragile in that it makes assumptions about the pages being
sent and their alignment.
Given that it was broken by the splice-read change, that confirms it is
fragile.  Maybe we could make the code a bit more explicit about what is
expected.

Also, I don't think this test can ever be relevant after the first time
through the loop.  So I think it would be clearest to have the
interesting case outside the loop.

	page += offset / PAGE_SIZE;
	if (rqstp->rq_res.page_len > 0) {
		/* appending to page list - check alignment */
		if (offset % PAGE_SIZE != (rqstp->rq_res.page_base +
					   rqstp->rq_res.page_len) % PAGE_SIZE)
			return -EIO;
		if (offset % PAGE_SIZE != 0) {
			/* continuing previous page */
			if (page != rqstp->rq_next_page[-1])
				return -EIO;
			page += 1;
		}
	} else
		/* Starting new page list */
		rqstp->rq_res.page_base = offset % PAGE_SIZE;

	for ( ; page <= last_page ; page++)
		if (unlikely(!svc_rqst_replace_page(rqstp, page)))
			return -EIO;

	rqstp->rq_res.page_len += sd->len;
	return sd->len;

Also, the name "svc_rqst_replace_page" doesn't give any hint that the
next_page pointer is advanced.  Maybe svc_rqst_add_page() ???  Not great
I admit.

NeilBrown

>  			continue;
>  		if (unlikely(!svc_rqst_replace_page(rqstp, page)))
>  			return -EIO;
>
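
The alignment rules in the proposal above can be sketched in userspace C. This is only an illustration under stated assumptions: a 4096-byte page is assumed, `classify()` and `enum splice_case` are hypothetical names that do not exist in the kernel, and the sketch leaves out the page-identity check against rq_next_page[-1].

```c
#include <stddef.h>

#define PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Hypothetical labels for the cases the proposed code distinguishes. */
enum splice_case {
	START_LIST,	/* nothing stored yet: record page_base */
	CONTINUE_PAGE,	/* data continues part-way into the last stored page */
	NEW_PAGE,	/* data starts on a page boundary: store a new page */
	MISALIGNED	/* offset does not line up with stored data (-EIO) */
};

/*
 * page_base and page_len model rq_res.page_base and rq_res.page_len;
 * offset models the offset of the incoming pipe buffer.
 */
static enum splice_case classify(size_t page_base, size_t page_len,
				 size_t offset)
{
	if (page_len == 0)
		return START_LIST;
	if (offset % PAGE_SIZE != (page_base + page_len) % PAGE_SIZE)
		return MISALIGNED;
	return (offset % PAGE_SIZE != 0) ? CONTINUE_PAGE : NEW_PAGE;
}
```

In this model, a second whole page arriving after one full page has been stored is classified as NEW_PAGE whether or not it is the same struct page as the previous one, which is the behaviour wanted for two zero_pages in a row.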
On Sat, Jul 29, 2023 at 09:54:58AM +1000, NeilBrown wrote:
> On Fri, 28 Jul 2023, Chuck Lever wrote:
> > From: David Howells <dhowells@redhat.com>
> >
> > nfsd_splice_actor() has a clause in its loop that chops up a compound page
> > into individual pages such that if the same page is seen twice in a row, it
> > is discarded the second time.  This is a problem with the advent of
> > shmem_splice_read() as that inserts zero_pages into the pipe in lieu of
> > pages that aren't present in the pagecache.
> >
> > Fix this by assuming that the last page is being extended only if the
> > currently stored length + starting offset is not currently on a page
> > boundary.
> >
> > This can be tested by NFS-exporting a tmpfs filesystem on the test machine
> > and truncating it to more than a page in size (eg. truncate -s 8192) and
> > then reading it by NFS.  The first page will be all zeros, but thereafter
> > garbage will be read.
> >
> > Note: I wonder if we can ever get a situation now where we get a splice
> > that gives us contiguous parts of a page in separate actor calls.  As NFSD
> > can only be splicing from a file (I think), there are only three sources of
> > the page: copy_splice_read(), shmem_splice_read() and file_splice_read().
> > The first allocates pages for the data it reads, so the problem cannot
> > occur; the second should never see a partial page; and the third waits for
> > each page to become available before we're allowed to read from it.
> >
> > Fixes: bd194b187115 ("shmem: Implement splice-read")
> > Reported-by: Chuck Lever <chuck.lever@oracle.com>
> > Signed-off-by: David Howells <dhowells@redhat.com>
> > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > cc: Hugh Dickins <hughd@google.com>
> > cc: Jens Axboe <axboe@kernel.dk>
> > cc: Matthew Wilcox <willy@infradead.org>
> > cc: linux-nfs@vger.kernel.org
> > cc: linux-fsdevel@vger.kernel.org
> > cc: linux-mm@kvack.org
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >  fs/nfsd/vfs.c | 9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 59b7d60ae33e..ee3bbaa79478 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -956,10 +956,13 @@ nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> >  	last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
> >  	for (page += offset / PAGE_SIZE; page <= last_page; page++) {
> >  		/*
> > -		 * Skip page replacement when extending the contents
> > -		 * of the current page.
> > +		 * Skip page replacement when extending the contents of the
> > +		 * current page.  But note that we may get two zero_pages in a
> > +		 * row from shmem.
> >  		 */
> > -		if (page == *(rqstp->rq_next_page - 1))
> > +		if (page == *(rqstp->rq_next_page - 1) &&
> > +		    offset_in_page(rqstp->rq_res.page_base +
> > +				   rqstp->rq_res.page_len))
>
> This seems fragile in that it makes assumptions about the pages being
> sent and their alignment.
> Given that it was broken by the splice-read change, that confirms it is
> fragile.  Maybe we could make the code a bit more explicit about what is
> expected.

Indeed, this code is brittle. This is not even the only time the
actor has been broken in the past four or five kernel releases.

IMO the problem is that there is no API contract or documentation
for splice actors. And as far as I am aware, only a few other
examples are in use to learn from.

> Also, I don't think this test can ever be relevant after the first time
> through the loop.  So I think it would be clearest to have the
> interesting case outside the loop.
>
> 	page += offset / PAGE_SIZE;
> 	if (rqstp->rq_res.page_len > 0) {
> 		/* appending to page list - check alignment */
> 		if (offset % PAGE_SIZE != (rqstp->rq_res.page_base +
> 					   rqstp->rq_res.page_len) % PAGE_SIZE)
> 			return -EIO;
> 		if (offset % PAGE_SIZE != 0) {
> 			/* continuing previous page */
> 			if (page != rqstp->rq_next_page[-1])
> 				return -EIO;
> 			page += 1;
> 		}
> 	} else
> 		/* Starting new page list */
> 		rqstp->rq_res.page_base = offset % PAGE_SIZE;
>
> 	for ( ; page <= last_page ; page++)
> 		if (unlikely(!svc_rqst_replace_page(rqstp, page)))
> 			return -EIO;
>
> 	rqstp->rq_res.page_len += sd->len;
> 	return sd->len;
>
>
> Also, the name "svc_rqst_replace_page" doesn't give any hint that the
> next_page pointer is advanced.  Maybe svc_rqst_add_page() ???  Not great
> I admit.

All reasonable suggestions.

However, I'm getting ready to replace the splice read code with...
je ne sais pas.

- There are reports that splice read doesn't perform well
- It's a brittle piece of engineering, as observed
- The "zero copy" read path will need to support folios, hopefully
  sooner rather than later
- We want the server's read path to use iomap when that is more
  broadly available in local filesystems
- This fix is destined for 6.5-rc, which limits the amount of
  clean up and optimization we should be doing

I'd like to apply David's fix as-is, unless it's truly broken or
someone has a better quick solution.

> >  			continue;
> >  		if (unlikely(!svc_rqst_replace_page(rqstp, page)))
> >  			return -EIO;
> >
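
The test described in the quoted commit message (truncate a tmpfs file to 8192 bytes, then read it over NFS) comes down to checking that a sparse file reads back as all zeros. A minimal sketch of that check in userspace C follows; `file_is_all_zero()` is a hypothetical helper written for this note, and the path passed to it (for example, a file on the NFS mount of the exported tmpfs) is left to the tester.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if every byte of the file is zero,
 * 0 if a non-zero byte is found, and -1 on an open or read error.
 */
static int file_is_all_zero(const char *path)
{
	unsigned char buf[4096];
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		for (ssize_t i = 0; i < n; i++) {
			if (buf[i] != 0) {
				close(fd);
				return 0;
			}
		}
	}

	close(fd);
	return n < 0 ? -1 : 1;
}
```

With the bug present, the commit message says garbage is read after the first page, so a check like this would be expected to return 0 for the 8192-byte file when read through the NFS mount.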
On Sun, 30 Jul 2023, Chuck Lever wrote:
> On Sat, Jul 29, 2023 at 09:54:58AM +1000, NeilBrown wrote:
> > On Fri, 28 Jul 2023, Chuck Lever wrote:
> > > From: David Howells <dhowells@redhat.com>
...
> - This fix is destined for 6.5-rc, which limits the amount of
>   clean up and optimization we should be doing
>
> I'd like to apply David's fix as-is, unless it's truly broken or
> someone has a better quick solution.

I certainly have no objection to you doing so; and think that you
and David will have a much better appreciation of the risks than me.

But I ought to mention that this two-ZERO_PAGEs-in-a-row behaviour
was problematic for splice() in the past - see the comments on
ZERO_PAGE(0) and its alternative block in shmem_file_read_iter().
1bdec44b1eee ("tmpfs: fix regressions from wider use of ZERO_PAGE"):
ah, that came from a report by you too, xfstests on nfsd.

In principle there's a very simple (but inferior) solution at the
shmem end: for shmem_file_splice_read() to use SGP_CACHE (used when
faulting in a hole) instead of SGP_READ in its call to shmem_get_folio().
(And delete all of shmem's splice_zeropage_into_pipe() code.)

I say "in principle" because all David's testing has been with the
SGP_READ there, and perhaps there's some gotcha I'm overlooking which
would turn up when switching over to SGP_CACHE.  And I say "inferior"
because that way entails allocating and zeroing pages for holes (which
page reclaim will then free later on if they remain clean).

My vote would be for putting David's nfsd patch in for now, but
keeping an open mind as to whether the shmem end has to change,
if there might be further problems elsewhere than nfsd.

Hugh
On Sun, Jul 30, 2023 at 09:50:44AM -0700, Hugh Dickins wrote:
> On Sun, 30 Jul 2023, Chuck Lever wrote:
> > On Sat, Jul 29, 2023 at 09:54:58AM +1000, NeilBrown wrote:
> > > On Fri, 28 Jul 2023, Chuck Lever wrote:
> > > > From: David Howells <dhowells@redhat.com>
> ...
> > - This fix is destined for 6.5-rc, which limits the amount of
> >   clean up and optimization we should be doing
> >
> > I'd like to apply David's fix as-is, unless it's truly broken or
> > someone has a better quick solution.
>
> I certainly have no objection to you doing so; and think that you
> and David will have a much better appreciation of the risks than me.
>
> But I ought to mention that this two-ZERO_PAGEs-in-a-row behaviour
> was problematic for splice() in the past - see the comments on
> ZERO_PAGE(0) and its alternative block in shmem_file_read_iter().
> 1bdec44b1eee ("tmpfs: fix regressions from wider use of ZERO_PAGE"):
> ah, that came from a report by you too, xfstests on nfsd.

Yes, I thought we had visited this ZERO_PAGE approach before, but
couldn't put my finger on exactly when or where.

> In principle there's a very simple (but inferior) solution at the
> shmem end: for shmem_file_splice_read() to use SGP_CACHE (used when
> faulting in a hole) instead of SGP_READ in its call to shmem_get_folio().
> (And delete all of shmem's splice_zeropage_into_pipe() code.)
>
> I say "in principle" because all David's testing has been with the
> SGP_READ there, and perhaps there's some gotcha I'm overlooking which
> would turn up when switching over to SGP_CACHE.  And I say "inferior"
> because that way entails allocating and zeroing pages for holes (which
> page reclaim will then free later on if they remain clean).
>
> My vote would be for putting David's nfsd patch in for now, but
> keeping an open mind as to whether the shmem end has to change,
> if there might be further problems elsewhere than nfsd.

I'm open to that.
On Mon, 31 Jul 2023, Chuck Lever wrote:
>
> I'd like to apply David's fix as-is, unless it's truly broken or
> someone has a better quick solution.
>

Your reasoning is sound.  From a behavioural perspective (though not
from a maintenance perspective) the patch is no worse than the current
code, so
  Reviewed-by: NeilBrown <neilb@suse.de>

NeilBrown
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 59b7d60ae33e..ee3bbaa79478 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -956,10 +956,13 @@ nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
 	last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
 	for (page += offset / PAGE_SIZE; page <= last_page; page++) {
 		/*
-		 * Skip page replacement when extending the contents
-		 * of the current page.
+		 * Skip page replacement when extending the contents of the
+		 * current page.  But note that we may get two zero_pages in a
+		 * row from shmem.
 		 */
-		if (page == *(rqstp->rq_next_page - 1))
+		if (page == *(rqstp->rq_next_page - 1) &&
+		    offset_in_page(rqstp->rq_res.page_base +
+				   rqstp->rq_res.page_len))
 			continue;
 		if (unlikely(!svc_rqst_replace_page(rqstp, page)))
 			return -EIO;
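
The effect of the changed condition can be modelled in userspace. The sketch below is not kernel code: `struct model_rqst`, `skip_old()`, `skip_new()` and `feed_same_page()` are hypothetical stand-ins, a 4096-byte page is assumed, and each simulated actor call delivers exactly one whole page.

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the fields of struct svc_rqst that the test reads. */
struct model_rqst {
	const void *pages[8];	/* stored pages */
	size_t next;		/* index of the next free slot (models rq_next_page) */
	size_t page_base;	/* models rq_res.page_base */
	size_t page_len;	/* models rq_res.page_len */
};

/* Condition before the patch: skip whenever the same page repeats. */
static bool skip_old(const struct model_rqst *r, const void *page)
{
	return r->next > 0 && page == r->pages[r->next - 1];
}

/* Condition after the patch: skip only if stored data ends mid-page. */
static bool skip_new(const struct model_rqst *r, const void *page)
{
	return r->next > 0 && page == r->pages[r->next - 1] &&
	       (r->page_base + r->page_len) % PAGE_SIZE != 0;
}

/*
 * Deliver `count` whole-page buffers that all point at the same page,
 * as shmem does with the zero page, and return how many were stored.
 */
static size_t feed_same_page(bool (*skip)(const struct model_rqst *,
					  const void *), size_t count)
{
	static const char zero_page[PAGE_SIZE];
	struct model_rqst r = {0};

	for (size_t i = 0; i < count && r.next < 8; i++) {
		if (!skip(&r, zero_page))
			r.pages[r.next++] = zero_page;
		r.page_len += PAGE_SIZE;
	}
	return r.next;
}
```

In this model, feed_same_page(skip_old, 2) stores only one page although page_len has grown to two pages, which matches the reported symptom of garbage after the first page; feed_same_page(skip_new, 2) stores both.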