[0/3] Handling NFSv3 I/O errors in knfsd

Message ID: 20190826165021.81075-1-trond.myklebust@hammerspace.com

Message

Trond Myklebust Aug. 26, 2019, 4:50 p.m. UTC
Recently, a number of changes went into the kernel to try to ensure
that I/O errors (specifically write errors) are reported to the
application once and only once. The vehicle for ensuring the errors
are reported is the struct file, which uses the 'f_wb_err' field to
track which errors have been reported.
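
As a rough illustration of that mechanism, here is a userspace toy
model of once-only error reporting (the kernel's real implementation,
lib/errseq.c, packs the error and a "seen" flag into a single counter,
but the idea is the same):

#include <stdio.h>

/*
 * Toy model: the mapping records the latest error plus a sequence
 * number; each observer (playing the role of a struct file) keeps a
 * cursor of the last sequence it sampled, so it sees a given error at
 * most once.
 */
struct wb_err {
	int err;		/* most recent writeback error, 0 if none */
	unsigned int seq;	/* bumped whenever a new error is recorded */
};

static void wb_err_set(struct wb_err *e, int err)
{
	e->err = err;
	e->seq++;
}

static int wb_err_check_and_advance(struct wb_err *e, unsigned int *since)
{
	int err = 0;

	if (*since != e->seq) {
		err = e->err;
		*since = e->seq;	/* the error now counts as reported */
	}
	return err;
}

int main(void)
{
	struct wb_err mapping = { 0, 0 };
	unsigned int cursor = 0;	/* lives in the struct file */

	wb_err_set(&mapping, -5);	/* writeback hits EIO */
	printf("first check:  %d\n", wb_err_check_and_advance(&mapping, &cursor));
	printf("second check: %d\n", wb_err_check_and_advance(&mapping, &cursor));
	return 0;
}

If the structure holding the cursor is discarded before anything checks
it, the error is never delivered; that is the gap described next.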

The problem is that errors are mainly intended to be reported through
fsync(). If the client is doing synchronous writes, then all is well,
but if it is doing unstable writes, then the errors may not be
reported until the client calls COMMIT. If the file cache has
thrown out the struct file, due to memory pressure, or just because
the client took a long while between the last WRITE and the COMMIT,
then the error report may be lost, and the client may just think
its data is safely stored.

Note that the problem is compounded by the fact that NFSv3 is stateless,
so the server never knows that the client may have rebooted, so there
can be no guarantee that a COMMIT will ever be sent.

The following patch set attempts to remedy the situation using 2
strategies:

1) If the inode is dirty, then avoid garbage collecting the file
   from the file cache.
2) If the file is closed, and we see that it would have reported
   an error to COMMIT, then we bump the boot verifier in order to
   ensure the client retransmits all its writes.

Note that if multiple clients were writing to the same file, then
we probably want to bump the boot verifier anyway, since only one
COMMIT will see the error report (because the cached file is also
shared).

So in order to implement the above strategy, we first have to do
the following: split up the file cache to act per net namespace,
since the boot verifier is per net namespace. Then add a helper
to update the boot verifier.
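
As a sketch of how those pieces might fit together (the function and
field names below are illustrative guesses based on this description,
not necessarily what the patches themselves use):

/*
 * Strategy 1: don't garbage collect a cached file while it still has
 * dirty or in-flight pages.
 */
static bool nfsd_file_check_writeback(struct nfsd_file *nf)
{
	struct file *file = nf->nf_file;

	if (!file || !(file->f_mode & FMODE_WRITE))
		return false;
	return mapping_tagged(file->f_mapping, PAGECACHE_TAG_DIRTY) ||
	       mapping_tagged(file->f_mapping, PAGECACHE_TAG_WRITEBACK);
}

/*
 * Strategy 2: "reboot" the server from the clients' point of view. The
 * boot verifier lives in the per-net-namespace nfsd_net, so resetting
 * it affects every export in that namespace.
 */
static void nfsd_reset_boot_verifier(struct nfsd_net *nn)
{
	write_seqlock(&nn->boot_lock);
	ktime_get_real_ts64(&nn->nfssvc_boot);
	write_sequnlock(&nn->boot_lock);
}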

Trond Myklebust (3):
  nfsd: nfsd_file cache entries should be per net namespace
  nfsd: Support the server resetting the boot verifier
  nfsd: Don't garbage collect files that might contain write errors

 fs/nfsd/export.c    |  2 +-
 fs/nfsd/filecache.c | 76 +++++++++++++++++++++++++++++++++++++--------
 fs/nfsd/filecache.h |  3 +-
 fs/nfsd/netns.h     |  4 +++
 fs/nfsd/nfs3xdr.c   | 13 +++++---
 fs/nfsd/nfs4proc.c  | 14 +++------
 fs/nfsd/nfsctl.c    |  1 +
 fs/nfsd/nfssvc.c    | 32 ++++++++++++++++++-
 8 files changed, 115 insertions(+), 30 deletions(-)

Comments

J. Bruce Fields Aug. 26, 2019, 8:51 p.m. UTC | #1
On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> Recently, a number of changes went into the kernel to try to ensure
> that I/O errors (specifically write errors) are reported to the
> application once and only once. The vehicle for ensuring the errors
> are reported is the struct file, which uses the 'f_wb_err' field to
> track which errors have been reported.
> 
> The problem is that errors are mainly intended to be reported through
> fsync(). If the client is doing synchronous writes, then all is well,
> but if it is doing unstable writes, then the errors may not be
> reported until the client calls COMMIT. If the file cache has
> thrown out the struct file, due to memory pressure, or just because
> the client took a long while between the last WRITE and the COMMIT,
> then the error report may be lost, and the client may just think
> its data is safely stored.

These were lost before the file caching patches as well, right?  Or is
there some regression? 

> Note that the problem is compounded by the fact that NFSv3 is stateless,
> so the server never knows that the client may have rebooted, so there
> can be no guarantee that a COMMIT will ever be sent.
> 
> The following patch set attempts to remedy the situation using 2
> strategies:
> 
> 1) If the inode is dirty, then avoid garbage collecting the file
>    from the file cache.
> 2) If the file is closed, and we see that it would have reported
>    an error to COMMIT, then we bump the boot verifier in order to
>    ensure the client retransmits all its writes.

Sounds sensible to me.

> Note that if multiple clients were writing to the same file, then
> we probably want to bump the boot verifier anyway, since only one
> COMMIT will see the error report (because the cached file is also
> shared).

I'm confused by the "probably should".  So that's future work?  I guess
it'd mean some additional work to identify that case.  You can't really
even distinguish clients in the NFSv3 case, but I suppose you could use
IP address or TCP connection as an approximation.

--b.

> So in order to implement the above strategy, we first have to do
> the following: split up the file cache to act per net namespace,
> since the boot verifier is per net namespace. Then add a helper
> to update the boot verifier.
> 
> Trond Myklebust (3):
>   nfsd: nfsd_file cache entries should be per net namespace
>   nfsd: Support the server resetting the boot verifier
>   nfsd: Don't garbage collect files that might contain write errors
> 
>  fs/nfsd/export.c    |  2 +-
>  fs/nfsd/filecache.c | 76 +++++++++++++++++++++++++++++++++++++--------
>  fs/nfsd/filecache.h |  3 +-
>  fs/nfsd/netns.h     |  4 +++
>  fs/nfsd/nfs3xdr.c   | 13 +++++---
>  fs/nfsd/nfs4proc.c  | 14 +++------
>  fs/nfsd/nfsctl.c    |  1 +
>  fs/nfsd/nfssvc.c    | 32 ++++++++++++++++++-
>  8 files changed, 115 insertions(+), 30 deletions(-)
> 
> -- 
> 2.21.0
Trond Myklebust Aug. 26, 2019, 9:02 p.m. UTC | #2
On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > Recently, a number of changes went into the kernel to try to ensure
> > that I/O errors (specifically write errors) are reported to the
> > application once and only once. The vehicle for ensuring the errors
> > are reported is the struct file, which uses the 'f_wb_err' field to
> > track which errors have been reported.
> > 
> > The problem is that errors are mainly intended to be reported
> > through
> > fsync(). If the client is doing synchronous writes, then all is
> > well,
> > but if it is doing unstable writes, then the errors may not be
> > reported until the client calls COMMIT. If the file cache has
> > thrown out the struct file, due to memory pressure, or just because
> > the client took a long while between the last WRITE and the COMMIT,
> > then the error report may be lost, and the client may just think
> > its data is safely stored.
> 
> These were lost before the file caching patches as well, right?  Or
> is
> there some regression? 

Correct. This is not a regression, but an attempt to fix a problem that
has existed for some time now.

> 
> > Note that the problem is compounded by the fact that NFSv3 is
> > stateless,
> > so the server never knows that the client may have rebooted, so
> > there
> > can be no guarantee that a COMMIT will ever be sent.
> > 
> > The following patch set attempts to remedy the situation using 2
> > strategies:
> > 
> > 1) If the inode is dirty, then avoid garbage collecting the file
> >    from the file cache.
> > 2) If the file is closed, and we see that it would have reported
> >    an error to COMMIT, then we bump the boot verifier in order to
> >    ensure the client retransmits all its writes.
> 
> Sounds sensible to me.
> 
> > Note that if multiple clients were writing to the same file, then
> > we probably want to bump the boot verifier anyway, since only one
> > COMMIT will see the error report (because the cached file is also
> > shared).
> 
> I'm confused by the "probably should".  So that's future work?  I
> guess
> it'd mean some additional work to identify that case.  You can't
> really
> even distinguish clients in the NFSv3 case, but I suppose you could
> use
> IP address or TCP connection as an approximation.

I'm suggesting we should do this too, but I haven't done so yet in
these patches. I'd like to hear other opinions (particularly from you,
Chuck and Jeff).

> --b.
> 
> > So in order to implement the above strategy, we first have to do
> > the following: split up the file cache to act per net namespace,
> > since the boot verifier is per net namespace. Then add a helper
> > to update the boot verifier.
> > 
> > Trond Myklebust (3):
> >   nfsd: nfsd_file cache entries should be per net namespace
> >   nfsd: Support the server resetting the boot verifier
> >   nfsd: Don't garbage collect files that might contain write errors
> > 
> >  fs/nfsd/export.c    |  2 +-
> >  fs/nfsd/filecache.c | 76 +++++++++++++++++++++++++++++++++++++--
> > ------
> >  fs/nfsd/filecache.h |  3 +-
> >  fs/nfsd/netns.h     |  4 +++
> >  fs/nfsd/nfs3xdr.c   | 13 +++++---
> >  fs/nfsd/nfs4proc.c  | 14 +++------
> >  fs/nfsd/nfsctl.c    |  1 +
> >  fs/nfsd/nfssvc.c    | 32 ++++++++++++++++++-
> >  8 files changed, 115 insertions(+), 30 deletions(-)
> > 
> > -- 
> > 2.21.0
J. Bruce Fields Aug. 27, 2019, 12:48 a.m. UTC | #3
On Mon, Aug 26, 2019 at 09:02:31PM +0000, Trond Myklebust wrote:
> On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > > Recently, a number of changes went into the kernel to try to
> > > ensure that I/O errors (specifically write errors) are reported to
> > > the application once and only once. The vehicle for ensuring the
> > > errors are reported is the struct file, which uses the 'f_wb_err'
> > > field to track which errors have been reported.
> > > 
> > > The problem is that errors are mainly intended to be reported
> > > through fsync(). If the client is doing synchronous writes, then
> > > all is well, but if it is doing unstable writes, then the errors
> > > may not be reported until the client calls COMMIT. If the file
> > > cache has thrown out the struct file, due to memory pressure, or
> > > just because the client took a long while between the last WRITE
> > > and the COMMIT, then the error report may be lost, and the client
> > > may just think its data is safely stored.
> > 
> > These were lost before the file caching patches as well, right?  Or
> > is there some regression? 
> 
> Correct. This is not a regression, but an attempt to fix a problem
> that has existed for some time now.
> 
> > 
> > > Note that the problem is compounded by the fact that NFSv3 is
> > > stateless, so the server never knows that the client may have
> > > rebooted, so there can be no guarantee that a COMMIT will ever be
> > > sent.
> > > 
> > > The following patch set attempts to remedy the situation using 2
> > > strategies:
> > > 
> > > 1) If the inode is dirty, then avoid garbage collecting the file
> > > from the file cache.  2) If the file is closed, and we see that it
> > > would have reported an error to COMMIT, then we bump the boot
> > > verifier in order to ensure the client retransmits all its writes.
> > 
> > Sounds sensible to me.
> > 
> > > Note that if multiple clients were writing to the same file, then
> > > we probably want to bump the boot verifier anyway, since only one
> > > COMMIT will see the error report (because the cached file is also
> > > shared).
> > 
> > I'm confused by the "probably should".  So that's future work?  I
> > guess it'd mean some additional work to identify that case.  You
> > can't really even distinguish clients in the NFSv3 case, but I
> > suppose you could use IP address or TCP connection as an
> > approximation.
> 
> I'm suggesting we should do this too, but I haven't done so yet in
> these patches. I'd like to hear other opinions (particularly from you,
> Chuck and Jeff).

Does this process actually converge, or do we end up with all the
clients retrying the writes and, again, only one of them getting the
error?

I wonder what the typical errors are, anyway.

--b.
Trond Myklebust Aug. 27, 2019, 12:56 a.m. UTC | #4
On Mon, 2019-08-26 at 20:48 -0400, bfields@fieldses.org wrote:
> On Mon, Aug 26, 2019 at 09:02:31PM +0000, Trond Myklebust wrote:
> > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > > > Recently, a number of changes went into the kernel to try to
> > > > ensure that I/O errors (specifically write errors) are reported
> > > > to
> > > > the application once and only once. The vehicle for ensuring
> > > > the
> > > > errors are reported is the struct file, which uses the
> > > > 'f_wb_err'
> > > > field to track which errors have been reported.
> > > > 
> > > > The problem is that errors are mainly intended to be reported
> > > > through fsync(). If the client is doing synchronous writes,
> > > > then
> > > > all is well, but if it is doing unstable writes, then the
> > > > errors
> > > > may not be reported until the client calls COMMIT. If the file
> > > > cache has thrown out the struct file, due to memory pressure,
> > > > or
> > > > just because the client took a long while between the last
> > > > WRITE
> > > > and the COMMIT, then the error report may be lost, and the
> > > > client
> > > > may just think its data is safely stored.
> > > 
> > > These were lost before the file caching patches as well,
> > > right?  Or
> > > is there some regression? 
> > 
> > Correct. This is not a regression, but an attempt to fix a problem
> > that has existed for some time now.
> > 
> > > > Note that the problem is compounded by the fact that NFSv3 is
> > > > stateless, so the server never knows that the client may have
> > > > rebooted, so there can be no guarantee that a COMMIT will ever
> > > > be
> > > > sent.
> > > > 
> > > > The following patch set attempts to remedy the situation using
> > > > 2
> > > > strategies:
> > > > 
> > > > 1) If the inode is dirty, then avoid garbage collecting the
> > > > file
> > > > from the file cache.  2) If the file is closed, and we see that
> > > > it
> > > > would have reported an error to COMMIT, then we bump the boot
> > > > verifier in order to ensure the client retransmits all its
> > > > writes.
> > > 
> > > Sounds sensible to me.
> > > 
> > > > Note that if multiple clients were writing to the same file,
> > > > then
> > > > we probably want to bump the boot verifier anyway, since only
> > > > one
> > > > COMMIT will see the error report (because the cached file is
> > > > also
> > > > shared).
> > > 
> > > I'm confused by the "probably should".  So that's future work?  I
> > > guess it'd mean some additional work to identify that case.  You
> > > can't really even distinguish clients in the NFSv3 case, but I
> > > suppose you could use IP address or TCP connection as an
> > > approximation.
> > 
> > I'm suggesting we should do this too, but I haven't done so yet in
> > these patches. I'd like to hear other opinions (particularly from
> > you,
> > Chuck and Jeff).
> 
> Does this process actually converge, or do we end up with all the
> clients retrying the writes and, again, only one of them getting the
> error?

The client that gets the error should stop retrying if the error is
fatal.

> I wonder what the typical errors are, anyway.

I would expect ENOSPC and EIO to be the most common. The former if
delayed allocation and/or snapshots result in writes failing after
writing to the page cache. The latter if we hit a disk outage or other
such problem.
J. Bruce Fields Aug. 27, 2019, 1:13 a.m. UTC | #5
On Tue, Aug 27, 2019 at 12:56:07AM +0000, Trond Myklebust wrote:
> On Mon, 2019-08-26 at 20:48 -0400, bfields@fieldses.org wrote:
> > On Mon, Aug 26, 2019 at 09:02:31PM +0000, Trond Myklebust wrote:
> > > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > > > > Note that if multiple clients were writing to the same file,
> > > > > then we probably want to bump the boot verifier anyway, since
> > > > > only one COMMIT will see the error report (because the cached
> > > > > file is also shared).
> > > > 
> > > > I'm confused by the "probably should".  So that's future work?
> > > > I guess it'd mean some additional work to identify that case.
> > > > You can't really even distinguish clients in the NFSv3 case, but
> > > > I suppose you could use IP address or TCP connection as an
> > > > approximation.
> > > 
> > > I'm suggesting we should do this too, but I haven't done so yet in
> > > these patches. I'd like to hear other opinions (particularly from
> > > you, Chuck and Jeff).
> > 
> > Does this process actually converge, or do we end up with all the
> > clients retrying the writes and, again, only one of them getting the
> > error?
> 
> The client that gets the error should stop retrying if the error is
> fatal.

Have clients historically been good about that?  I just wonder whether
it's a concern that boot-verifier-bumping could magnify the impact of
clients that are overly persistent about retrying IO errors.

> > I wonder what the typical errors are, anyway.
> 
> I would expect ENOSPC and EIO to be the most common. The former if
> delayed allocation and/or snapshots result in writes failing after
> writing to the page cache. The latter if we hit a disk outage or other
> such problem.

Makes sense.

--b.
Trond Myklebust Aug. 27, 2019, 1:28 a.m. UTC | #6
On Mon, 2019-08-26 at 21:13 -0400, bfields@fieldses.org wrote:
> On Tue, Aug 27, 2019 at 12:56:07AM +0000, Trond Myklebust wrote:
> > On Mon, 2019-08-26 at 20:48 -0400, bfields@fieldses.org wrote:
> > > On Mon, Aug 26, 2019 at 09:02:31PM +0000, Trond Myklebust wrote:
> > > > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust
> > > > > wrote:
> > > > > > Note that if multiple clients were writing to the same
> > > > > > file,
> > > > > > then we probably want to bump the boot verifier anyway,
> > > > > > since
> > > > > > only one COMMIT will see the error report (because the
> > > > > > cached
> > > > > > file is also shared).
> > > > > 
> > > > > I'm confused by the "probably should".  So that's future
> > > > > work?
> > > > > I guess it'd mean some additional work to identify that case.
> > > > > You can't really even distinguish clients in the NFSv3 case,
> > > > > but
> > > > > I suppose you could use IP address or TCP connection as an
> > > > > approximation.
> > > > 
> > > > I'm suggesting we should do this too, but I haven't done so yet
> > > > in
> > > > these patches. I'd like to hear other opinions (particularly
> > > > from
> > > > you, Chuck and Jeff).
> > > 
> > > Does this process actually converge, or do we end up with all the
> > > clients retrying the writes and, again, only one of them getting
> > > the
> > > error?
> > 
> > The client that gets the error should stop retrying if the error is
> > fatal.
> 
> Have clients historically been good about that?  I just wonder
> whether
> it's a concern that boot-verifier-bumping could magnify the impact of
> clients that are overly persistent about retrying IO errors.
> 

Clients have always been required to handle I/O errors, yes, and this
isn't just a Linux server thing. All other servers that support
unstable writes impose the same requirement on the client to check the
return value of COMMIT and to handle any errors.

> > > I wonder what the typical errors are, anyway.
> > 
> > I would expect ENOSPC and EIO to be the most common. The former if
> > delayed allocation and/or snapshots result in writes failing after
> > writing to the page cache. The latter if we hit a disk outage or
> > other
> > such problem.
> 
> Makes sense.
> 
> --b.
Chuck Lever III Aug. 27, 2019, 1:59 p.m. UTC | #7
> On Aug 26, 2019, at 5:02 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
>> On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
>>> Recently, a number of changes went into the kernel to try to ensure
>>> that I/O errors (specifically write errors) are reported to the
>>> application once and only once. The vehicle for ensuring the errors
>>> are reported is the struct file, which uses the 'f_wb_err' field to
>>> track which errors have been reported.
>>> 
>>> The problem is that errors are mainly intended to be reported
>>> through
>>> fsync(). If the client is doing synchronous writes, then all is
>>> well,
>>> but if it is doing unstable writes, then the errors may not be
>>> reported until the client calls COMMIT. If the file cache has
>>> thrown out the struct file, due to memory pressure, or just because
>>> the client took a long while between the last WRITE and the COMMIT,
>>> then the error report may be lost, and the client may just think
>>> its data is safely stored.
>> 
>> These were lost before the file caching patches as well, right?  Or
>> is
>> there some regression? 
> 
> Correct. This is not a regression, but an attempt to fix a problem that
> has existed for some time now.
> 
>> 
>>> Note that the problem is compounded by the fact that NFSv3 is
>>> stateless,
>>> so the server never knows that the client may have rebooted, so
>>> there
>>> can be no guarantee that a COMMIT will ever be sent.
>>> 
>>> The following patch set attempts to remedy the situation using 2
>>> strategies:
>>> 
>>> 1) If the inode is dirty, then avoid garbage collecting the file
>>>   from the file cache.
>>> 2) If the file is closed, and we see that it would have reported
>>>   an error to COMMIT, then we bump the boot verifier in order to
>>>   ensure the client retransmits all its writes.
>> 
>> Sounds sensible to me.
>> 
>>> Note that if multiple clients were writing to the same file, then
>>> we probably want to bump the boot verifier anyway, since only one
>>> COMMIT will see the error report (because the cached file is also
>>> shared).
>> 
>> I'm confused by the "probably should".  So that's future work?  I
>> guess
>> it'd mean some additional work to identify that case.  You can't
>> really
>> even distinguish clients in the NFSv3 case, but I suppose you could
>> use
>> IP address or TCP connection as an approximation.
> 
> I'm suggesting we should do this too, but I haven't done so yet in
> these patches. I'd like to hear other opinions (particularly from you,
> Chuck and Jeff).

The strategy of handling these errors more carefully seems good.
Bumping the write/commit verifier so the client writes again to
retrieve the latent error is clever!

It's not clear to me though that the NFSv3 protocol can deal with
the multi-client write scenario, since it is stateless. We are now
making it stateful in some sense by preserving error state on the
server across NFS requests, without having any sense of an open
file in the protocol itself.

Would an "approximation" without open state be good enough? I
assume you are doing this to more fully support the FlexFiles
layout type. Do you have any analysis or thought about this next
step?

I also echo Bruce's concern about whether the client implementations
are up to snuff. There could be long-standing bugs or their protocol
implementation could be missing parts. This is more curiosity than
an objection, but maybe noting which client implementations you've
tested with would be good.


>> --b.
>> 
>>> So in order to implement the above strategy, we first have to do
>>> the following: split up the file cache to act per net namespace,
>>> since the boot verifier is per net namespace. Then add a helper
>>> to update the boot verifier.
>>> 
>>> Trond Myklebust (3):
>>>  nfsd: nfsd_file cache entries should be per net namespace
>>>  nfsd: Support the server resetting the boot verifier
>>>  nfsd: Don't garbage collect files that might contain write errors
>>> 
>>> fs/nfsd/export.c    |  2 +-
>>> fs/nfsd/filecache.c | 76 +++++++++++++++++++++++++++++++++++++--
>>> ------
>>> fs/nfsd/filecache.h |  3 +-
>>> fs/nfsd/netns.h     |  4 +++
>>> fs/nfsd/nfs3xdr.c   | 13 +++++---
>>> fs/nfsd/nfs4proc.c  | 14 +++------
>>> fs/nfsd/nfsctl.c    |  1 +
>>> fs/nfsd/nfssvc.c    | 32 ++++++++++++++++++-
>>> 8 files changed, 115 insertions(+), 30 deletions(-)
>>> 
>>> -- 
>>> 2.21.0
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com

--
Chuck Lever
Trond Myklebust Aug. 27, 2019, 2:53 p.m. UTC | #8
On Tue, 2019-08-27 at 09:59 -0400, Chuck Lever wrote:
> > On Aug 26, 2019, at 5:02 PM, Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > 
> > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > > > Recently, a number of changes went into the kernel to try to
> > > > ensure
> > > > that I/O errors (specifically write errors) are reported to the
> > > > application once and only once. The vehicle for ensuring the
> > > > errors
> > > > are reported is the struct file, which uses the 'f_wb_err'
> > > > field to
> > > > track which errors have been reported.
> > > > 
> > > > The problem is that errors are mainly intended to be reported
> > > > through
> > > > fsync(). If the client is doing synchronous writes, then all is
> > > > well,
> > > > but if it is doing unstable writes, then the errors may not be
> > > > reported until the client calls COMMIT. If the file cache has
> > > > thrown out the struct file, due to memory pressure, or just
> > > > because
> > > > the client took a long while between the last WRITE and the
> > > > COMMIT,
> > > > then the error report may be lost, and the client may just
> > > > think
> > > > its data is safely stored.
> > > 
> > > These were lost before the file caching patches as well,
> > > right?  Or
> > > is
> > > there some regression? 
> > 
> > Correct. This is not a regression, but an attempt to fix a problem
> > that
> > has existed for some time now.
> > 
> > > > Note that the problem is compounded by the fact that NFSv3 is
> > > > stateless,
> > > > so the server never knows that the client may have rebooted, so
> > > > there
> > > > can be no guarantee that a COMMIT will ever be sent.
> > > > 
> > > > The following patch set attempts to remedy the situation using
> > > > 2
> > > > strategies:
> > > > 
> > > > 1) If the inode is dirty, then avoid garbage collecting the
> > > > file
> > > >   from the file cache.
> > > > 2) If the file is closed, and we see that it would have
> > > > reported
> > > >   an error to COMMIT, then we bump the boot verifier in order
> > > > to
> > > >   ensure the client retransmits all its writes.
> > > 
> > > Sounds sensible to me.
> > > 
> > > > Note that if multiple clients were writing to the same file,
> > > > then
> > > > we probably want to bump the boot verifier anyway, since only
> > > > one
> > > > COMMIT will see the error report (because the cached file is
> > > > also
> > > > shared).
> > > 
> > > I'm confused by the "probably should".  So that's future work?  I
> > > guess
> > > it'd mean some additional work to identify that case.  You can't
> > > really
> > > even distinguish clients in the NFSv3 case, but I suppose you
> > > could
> > > use
> > > IP address or TCP connection as an approximation.
> > 
> > I'm suggesting we should do this too, but I haven't done so yet in
> > these patches. I'd like to hear other opinions (particularly from
> > you,
> > Chuck and Jeff).
> 
> The strategy of handling these errors more carefully seems good.
> Bumping the write/commit verifier so the client writes again to
> retrieve the latent error is clever!
> 
> It's not clear to me though that the NFSv3 protocol can deal with
> the multi-client write scenario, since it is stateless. We are now
> making it stateful in some sense by preserving error state on the
> server across NFS requests, without having any sense of an open
> file in the protocol itself.
> 
> Would an "approximation" without open state be good enough? I
> assume you are doing this to more fully support the FlexFiles
> layout type. Do you have any analysis or thought about this next
> step?

No, see above. This has nothing to do with the flex files layout type.
It is about ensuring the integrity of _all_ unstable writes. Any client
can hit writeback errors that only manifest themselves when the data is
flushed to disk.

> I also echo Bruce's concern about whether the client implementations
> are up to snuff. There could be long-standing bugs or their protocol
> implementation could be missing parts. This is more curiosity than
> an objection, but maybe noting which client implementations you've
> tested with would be good.

So what exactly is the concern? How do you and Bruce expect this might
make things worse?

As far as I can see, if a client doesn't handle errors in COMMIT, then
it is screwed either way, since we can already end up reporting such
errors today through the fsync() mechanism. If the client ignores the
error, it ends up silently corrupting the file. If it retries the
writes as unstable, then it loops until the error is cleared. The
current patchset does not affect either case for this single client:
the declaration of a reboot just ensures that silent corruption does
not occur, and if the client handles the error correctly once it
catches it, then great, otherwise no extra harm occurs.

If we want to fix the case of multiple writers (i.e. the case where
someone is using NLM locking to mediate between writers, or they are
using O_DIRECT and some external synchronisation mechanism) then we
need to do more (hence the proposal that we extend the current patches
to also declare a reboot on any failure of COMMIT).
In that case, a broken client that loops forever is going to cause
contention with all the other clients anyway, since it will never
complete its I/O (or it will silently corrupt the file if it doesn't
see the error). By declaring a reboot we don't make this case worse as
far as I can see: we still end up looping forever, but the file doesn't
get silently corrupted.
The one problem is that the looping forever client can cause other
clients to loop forever on their otherwise successful writes on other
files. That's bad, but again, that's due to client behaviour that is
toxic even today.
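
For reference, the client-side contract assumed here is roughly the
following (a schematic only, not any particular client's code):

#include <string.h>

#define NFS3_WRITEVERFSIZE 8

struct commit_reply {
	int status;				/* 0 on success */
	unsigned char verf[NFS3_WRITEVERFSIZE];	/* server write verifier */
};

/*
 * Schematic COMMIT handling: a fatal error must be surfaced to the
 * application; a changed verifier on success means "server rebooted,
 * resend all uncommitted WRITEs".
 */
static int handle_commit_reply(const struct commit_reply *r,
			       const unsigned char *saved_verf)
{
	if (r->status != 0)
		return -1;	/* report the error, stop retrying */
	if (memcmp(r->verf, saved_verf, NFS3_WRITEVERFSIZE) != 0)
		return 1;	/* resend all uncommitted writes */
	return 0;		/* data is stable on disk */
}

A client that keeps resending because its error path is broken is the
"toxic" behaviour described above.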
J. Bruce Fields Aug. 27, 2019, 2:54 p.m. UTC | #9
On Tue, Aug 27, 2019 at 09:59:25AM -0400, Chuck Lever wrote:
> The strategy of handling these errors more carefully seems good.
> Bumping the write/commit verifier so the client writes again to
> retrieve the latent error is clever!
> 
> It's not clear to me though that the NFSv3 protocol can deal with
> the multi-client write scenario, since it is stateless. We are now
> making it stateful in some sense by preserving error state on the
> server across NFS requests, without having any sense of an open
> file in the protocol itself.
> 
> Would an "approximation" without open state be good enough?

I figure there's a correct-but-slow approach which is to increment the
write verifier every time there's a write error.  Does that work?  If
write errors are rare enough, maybe it doesn't matter that that's slow?

Then there's various approximations you could use (IP address?) to guess
when only one client's accessing the file.  You save forcing all the
clients to resend writes at the expense of some complexity and
correctness--e.g., multiple clients behind NAT might not all get write
errors.

Am I thinking about this right?

--b.
J. Bruce Fields Aug. 27, 2019, 2:58 p.m. UTC | #10
On Tue, Aug 27, 2019 at 02:53:01PM +0000, Trond Myklebust wrote:
> The one problem is that the looping forever client can cause other
> clients to loop forever on their otherwise successful writes on other
> files.

Yeah, that's the case I was wondering about.

> That's bad, but again, that's due to client behaviour that is
> toxic even today.

So my worry was that if write errors are rare and the consequences of
the single client looping forever are relatively mild, then there might
be deployed clients that get away with that behavior.

But maybe the behavior's a lot more "toxic" than I imagined, hence
unlikely to be very common.

--b.
J. Bruce Fields Aug. 27, 2019, 2:59 p.m. UTC | #11
On Tue, Aug 27, 2019 at 10:58:19AM -0400, bfields@fieldses.org wrote:
> On Tue, Aug 27, 2019 at 02:53:01PM +0000, Trond Myklebust wrote:
> > The one problem is that the looping forever client can cause other
> > clients to loop forever on their otherwise successful writes on other
> > files.
> 
> Yeah, that's the case I was wondering about.
> 
> > That's bad, but again, that's due to client behaviour that is
> > toxic even today.
> 
> So my worry was that if write errors are rare and the consequences of
> the single client looping forever are relatively mild, then there might
> be deployed clients that get away with that behavior.
> 
> But maybe the behavior's a lot more "toxic" than I imagined, hence
> unlikely to be very common.

(And, to be clear, I like the idea, just making sure I'm not overlooking
any problems....)

--b.
Trond Myklebust Aug. 27, 2019, 2:59 p.m. UTC | #12
On Tue, 2019-08-27 at 10:54 -0400, Bruce Fields wrote:
> On Tue, Aug 27, 2019 at 09:59:25AM -0400, Chuck Lever wrote:
> > The strategy of handling these errors more carefully seems good.
> > Bumping the write/commit verifier so the client writes again to
> > retrieve the latent error is clever!
> > 
> > It's not clear to me though that the NFSv3 protocol can deal with
> > the multi-client write scenario, since it is stateless. We are now
> > making it stateful in some sense by preserving error state on the
> > server across NFS requests, without having any sense of an open
> > file in the protocol itself.
> > 
> > Would an "approximation" without open state be good enough?
> 
> I figure there's a correct-but-slow approach which is to increment
> the
> write verifier every time there's a write error.  Does that work?  If
> write errors are rare enough, maybe it doesn't matter that that's
> slow?

How is that different from changing the boot verifier? Are you
proposing to track write verifiers on a per-client basis? That seems
onerous too.

> Then there's various approximations you could use (IP address?) to
> guess
> when only one client's accessing the file.  You save forcing all the
> clients to resend writes at the expense of some complexity and
> correctness--e.g., multiple clients behind NAT might not all get
> write
> errors.
> 
> Am I thinking about this right?

I agree that there are multiple problems with tracking on a per-client
basis.
J. Bruce Fields Aug. 27, 2019, 3 p.m. UTC | #13
On Tue, Aug 27, 2019 at 02:59:34PM +0000, Trond Myklebust wrote:
> On Tue, 2019-08-27 at 10:54 -0400, Bruce Fields wrote:
> > On Tue, Aug 27, 2019 at 09:59:25AM -0400, Chuck Lever wrote:
> > > The strategy of handling these errors more carefully seems good.
> > > Bumping the write/commit verifier so the client writes again to
> > > retrieve the latent error is clever!
> > > 
> > > It's not clear to me though that the NFSv3 protocol can deal with
> > > the multi-client write scenario, since it is stateless. We are now
> > > making it stateful in some sense by preserving error state on the
> > > server across NFS requests, without having any sense of an open
> > > file in the protocol itself.
> > > 
> > > Would an "approximation" without open state be good enough?
> > 
> > I figure there's a correct-but-slow approach which is to increment
> > the
> > write verifier every time there's a write error.  Does that work?  If
> > write errors are rare enough, maybe it doesn't matter that that's
> > slow?
> 
> How is that different from changing the boot verifier?
> Are you
> proposing to track write verifiers on a per-client basis? That seems
> onerous too.

Sorry, I thought "write verifier" and "boot verifier" were synonyms,
and, no, wasn't suggesting tracking it per-file.

--b.

> > Then there's various approximations you could use (IP address?) to
> > guess
> > when only one client's accessing the file.  You save forcing all the
> > clients to resend writes at the expense of some complexity and
> > correctness--e.g., multiple clients behind NAT might not all get
> > write
> > errors.
> > 
> > Am I thinking about this right?
> 
> I agree that there are multiple problems with tracking on a per-client
> basis.
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
>
Trond Myklebust Aug. 27, 2019, 3:15 p.m. UTC | #14
On Tue, 2019-08-27 at 10:59 -0400, bfields@fieldses.org wrote:
> On Tue, Aug 27, 2019 at 10:58:19AM -0400, bfields@fieldses.org wrote:
> > On Tue, Aug 27, 2019 at 02:53:01PM +0000, Trond Myklebust wrote:
> > > The one problem is that the looping forever client can cause
> > > other
> > > clients to loop forever on their otherwise successful writes on
> > > other
> > > files.
> > 
> > Yeah, that's the case I was wondering about.
> > 
> > > That's bad, but again, that's due to client behaviour that is
> > > toxic even today.
> > 
> > So my worry was that if write errors are rare and the consequences
> > of
> > the single client looping forever are relatively mild, then there
> > might
> > be deployed clients that get away with that behavior.
> > 
> > But maybe the behavior's a lot more "toxic" than I imagined, hence
> > unlikely to be very common.
> 
> (And, to be clear, I like the idea, just making sure I'm not
> overlooking
> any problems....)
> 
I'm open to other suggestions, but I'm having trouble finding one that
can scale correctly (i.e. not require per-client tracking), prevent
silent corruption (by causing clients to miss errors), while not
relying on optional features that may not be implemented by all NFSv3
clients (e.g. per-file write verifiers are not implemented by *BSD).

That said, it seems to me that to do nothing should not be an option,
as that would imply tolerating silent corruption of file data.
Jeff Layton Aug. 27, 2019, 3:17 p.m. UTC | #15
On Tue, 2019-08-27 at 09:59 -0400, Chuck Lever wrote:
> > On Aug 26, 2019, at 5:02 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> > 
> > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > > > Recently, a number of changes went into the kernel to try to ensure
> > > > that I/O errors (specifically write errors) are reported to the
> > > > application once and only once. The vehicle for ensuring the errors
> > > > are reported is the struct file, which uses the 'f_wb_err' field to
> > > > track which errors have been reported.
> > > > 
> > > > The problem is that errors are mainly intended to be reported
> > > > through
> > > > fsync(). If the client is doing synchronous writes, then all is
> > > > well,
> > > > but if it is doing unstable writes, then the errors may not be
> > > > reported until the client calls COMMIT. If the file cache has
> > > > thrown out the struct file, due to memory pressure, or just because
> > > > the client took a long while between the last WRITE and the COMMIT,
> > > > then the error report may be lost, and the client may just think
> > > > its data is safely stored.
> > > 
> > > These were lost before the file caching patches as well, right?  Or
> > > is
> > > there some regression? 
> > 
> > Correct. This is not a regression, but an attempt to fix a problem that
> > has existed for some time now.
> > 
> > > > Note that the problem is compounded by the fact that NFSv3 is
> > > > stateless,
> > > > so the server never knows that the client may have rebooted, so
> > > > there
> > > > can be no guarantee that a COMMIT will ever be sent.
> > > > 
> > > > The following patch set attempts to remedy the situation using 2
> > > > strategies:
> > > > 
> > > > 1) If the inode is dirty, then avoid garbage collecting the file
> > > >   from the file cache.
> > > > 2) If the file is closed, and we see that it would have reported
> > > >   an error to COMMIT, then we bump the boot verifier in order to
> > > >   ensure the client retransmits all its writes.
> > > 
> > > Sounds sensible to me.
> > > 
> > > > Note that if multiple clients were writing to the same file, then
> > > > we probably want to bump the boot verifier anyway, since only one
> > > > COMMIT will see the error report (because the cached file is also
> > > > shared).
> > > 
> > > I'm confused by the "probably should".  So that's future work?  I
> > > guess
> > > it'd mean some additional work to identify that case.  You can't
> > > really
> > > even distinguish clients in the NFSv3 case, but I suppose you could
> > > use
> > > IP address or TCP connection as an approximation.
> > 
> > I'm suggesting we should do this too, but I haven't done so yet in
> > these patches. I'd like to hear other opinions (particularly from you,
> > Chuck and Jeff).
> 
> The strategy of handling these errors more carefully seems good.
> Bumping the write/commit verifier so the client writes again to
> retrieve the latent error is clever!
> 

Yes, this would seem to neatly solve a whole class of related problems
in these sorts of scenarios. I also think we ought to bump the verifier
whenever nfsd sees a writeback error, as you have a great point that
we'll only report writeback errors on the first COMMIT after an error
today. Fixing that would be a nice goal.

We should note though that the verifier is a per net-namespace value, so
if you get a transient writeback error on one inode, any inode that is
currently dirty will probably end up having its writes retransmitted
once you bump the verifier.

That seems like an acceptable thing to do in these scenarios, but we may
need to consider making the verifier more granular if that turns out to
cause a lot of re-write activity that isn't necessary.
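
Detecting "nfsd sees a writeback error" could plausibly reuse the
f_wb_err machinery directly. filemap_check_wb_err() is an existing
kernel primitive; the nfsd_file field names and the reset helper below
are assumptions:

/*
 * Sketch: has this file hit a writeback error that no COMMIT has
 * consumed yet? filemap_check_wb_err() compares the mapping's error
 * cursor against the point this struct file last sampled (f_wb_err).
 */
static bool nfsd_file_check_write_error(struct nfsd_file *nf)
{
	struct file *file = nf->nf_file;

	return filemap_check_wb_err(file->f_mapping,
				    READ_ONCE(file->f_wb_err)) != 0;
}

/* e.g. on final release of a cached file: */
static void nfsd_file_put_final(struct nfsd_file *nf, struct nfsd_net *nn)
{
	if (nfsd_file_check_write_error(nf))
		nfsd_reset_boot_verifier(nn);	/* force resends, don't lose it */
	/* ... normal teardown ... */
}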

> It's not clear to me though that the NFSv3 protocol can deal with
> the multi-client write scenario, since it is stateless. We are now
> making it stateful in some sense by preserving error state on the
> server across NFS requests, without having any sense of an open
> file in the protocol itself.
> 

I think it's worthwhile to do the best we can here. WRITE/COMMIT are
inherently stateful to some degree in that they involve a verifier.
Trond's proposal is just utilizing that fact to ensure that we deliver
writeback errors more widely. I like it!

> Would an "approximation" without open state be good enough? I
> assume you are doing this to more fully support the FlexFiles
> layout type. Do you have any analysis or thought about this next
> step?
> 
> I also echo Bruce's concern about whether the client implementations
> are up to snuff. There could be long-standing bugs or their protocol
> implementation could be missing parts. This is more curiosity than
> an objection, but maybe noting which client implementations you've
> tested with would be good.
> 
> 
> > > --b.
> > > 
> > > > So in order to implement the above strategy, we first have to do
> > > > the following: split up the file cache to act per net namespace,
> > > > since the boot verifier is per net namespace. Then add a helper
> > > > to update the boot verifier.
> > > > 
> > > > Trond Myklebust (3):
> > > >  nfsd: nfsd_file cache entries should be per net namespace
> > > >  nfsd: Support the server resetting the boot verifier
> > > >  nfsd: Don't garbage collect files that might contain write errors
> > > > 
> > > > fs/nfsd/export.c    |  2 +-
> > > > fs/nfsd/filecache.c | 76 +++++++++++++++++++++++++++++++++++++--
> > > > ------
> > > > fs/nfsd/filecache.h |  3 +-
> > > > fs/nfsd/netns.h     |  4 +++
> > > > fs/nfsd/nfs3xdr.c   | 13 +++++---
> > > > fs/nfsd/nfs4proc.c  | 14 +++------
> > > > fs/nfsd/nfsctl.c    |  1 +
> > > > fs/nfsd/nfssvc.c    | 32 ++++++++++++++++++-
> > > > 8 files changed, 115 insertions(+), 30 deletions(-)
> > > > 
> > > > -- 
> > > > 2.21.0
> > -- 
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@hammerspace.com
> 
> --
> Chuck Lever
> 
> 
>
Chuck Lever III Aug. 27, 2019, 3:20 p.m. UTC | #16
> On Aug 27, 2019, at 11:15 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Tue, 2019-08-27 at 10:59 -0400, bfields@fieldses.org wrote:
>> On Tue, Aug 27, 2019 at 10:58:19AM -0400, bfields@fieldses.org wrote:
>>> On Tue, Aug 27, 2019 at 02:53:01PM +0000, Trond Myklebust wrote:
>>>> The one problem is that the looping forever client can cause
>>>> other
>>>> clients to loop forever on their otherwise successful writes on
>>>> other
>>>> files.
>>> 
>>> Yeah, that's the case I was wondering about.
>>> 
>>>> That's bad, but again, that's due to client behaviour that is
>>>> toxic even today.
>>> 
>>> So my worry was that if write errors are rare and the consequences
>>> of
>>> the single client looping forever are relatively mild, then there
>>> might
>>> be deployed clients that get away with that behavior.
>>> 
>>> But maybe the behavior's a lot more "toxic" than I imagined, hence
>>> unlikely to be very common.
>> 
>> (And, to be clear, I like the idea, just making sure I'm not
>> overlooking
>> any problems....)
>> 
> I'm open to other suggestions, but I'm having trouble finding one that
> can scale correctly (i.e. not require per-client tracking), prevent
> silent corruption (by causing clients to miss errors), while not
> relying on optional features that may not be implemented by all NFSv3
> clients (e.g. per-file write verifiers are not implemented by *BSD).
> 
> That said, it seems to me that to do nothing should not be an option,
> as that would imply tolerating silent corruption of file data.

Agree, we should move forward. I'm not saying "do nothing," I'm
just trying to understand what is improved and what is still left
to do (maybe nothing).


--
Chuck Lever
J. Bruce Fields Aug. 28, 2019, 1:48 p.m. UTC | #17
On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
> I'm open to other suggestions, but I'm having trouble finding one that
> can scale correctly (i.e. not require per-client tracking), prevent
> silent corruption (by causing clients to miss errors), while not
> relying on optional features that may not be implemented by all NFSv3
> clients (e.g. per-file write verifiers are not implemented by *BSD).
> 
> That said, it seems to me that to do nothing should not be an option,
> as that would imply tolerating silent corruption of file data.

So should we increment the boot verifier every time we discover an error
on an asynchronous write?

--b.
Jeff Layton Aug. 28, 2019, 1:51 p.m. UTC | #18
On Wed, 2019-08-28 at 09:48 -0400, bfields@fieldses.org wrote:
> On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
> > I'm open to other suggestions, but I'm having trouble finding one that
> > can scale correctly (i.e. not require per-client tracking), prevent
> > silent corruption (by causing clients to miss errors), while not
> > relying on optional features that may not be implemented by all NFSv3
> > clients (e.g. per-file write verifiers are not implemented by *BSD).
> > 
> > That said, it seems to me that to do nothing should not be an option,
> > as that would imply tolerating silent corruption of file data.
> 
> So should we increment the boot verifier every time we discover an error
> on an asynchronous write?
> 

I think so. Otherwise, only one client will ever see that error.
Chuck Lever III Aug. 28, 2019, 1:57 p.m. UTC | #19
> On Aug 28, 2019, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> 
> On Wed, 2019-08-28 at 09:48 -0400, bfields@fieldses.org wrote:
>> On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
>>> I'm open to other suggestions, but I'm having trouble finding one that
>>> can scale correctly (i.e. not require per-client tracking), prevent
>>> silent corruption (by causing clients to miss errors), while not
>>> relying on optional features that may not be implemented by all NFSv3
>>> clients (e.g. per-file write verifiers are not implemented by *BSD).
>>> 
>>> That said, it seems to me that to do nothing should not be an option,
>>> as that would imply tolerating silent corruption of file data.
>> 
>> So should we increment the boot verifier every time we discover an error
>> on an asynchronous write?
>> 
> 
> I think so. Otherwise, only one client will ever see that error.

+1

I'm not familiar with the details of how the Linux NFS server
implements the boot verifier: Will a verifier bump be effective
for all file systems that server exports? If so, is that an
acceptable cost?


--
Chuck Lever
Bruce Fields Aug. 28, 2019, 2 p.m. UTC | #20
On Wed, Aug 28, 2019 at 09:57:25AM -0400, Chuck Lever wrote:
> 
> 
> > On Aug 28, 2019, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > On Wed, 2019-08-28 at 09:48 -0400, bfields@fieldses.org wrote:
> >> On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
> >>> I'm open to other suggestions, but I'm having trouble finding one that
> >>> can scale correctly (i.e. not require per-client tracking), prevent
> >>> silent corruption (by causing clients to miss errors), while not
> >>> relying on optional features that may not be implemented by all NFSv3
> >>> clients (e.g. per-file write verifiers are not implemented by *BSD).
> >>> 
> >>> That said, it seems to me that to do nothing should not be an option,
> >>> as that would imply tolerating silent corruption of file data.
> >> 
> >> So should we increment the boot verifier every time we discover an error
> >> on an asynchronous write?
> >> 
> > 
> > I think so. Otherwise, only one client will ever see that error.
> 
> +1
> 
> I'm not familiar with the details of how the Linux NFS server
> implements the boot verifier: Will a verifier bump be effective
> for all file systems that server exports?

Yes.  It will be per network namespace, but that's the only limit.

> If so, is that an acceptable cost?

It means clients will resend all their uncommitted writes.  That could
certainly make write errors more expensive.  But maybe you've already
got bigger problems if you've got a full or failing disk?

--b.
Chuck Lever III Aug. 28, 2019, 2:03 p.m. UTC | #21
> On Aug 28, 2019, at 10:00 AM, J. Bruce Fields <bfields@redhat.com> wrote:
> 
> On Wed, Aug 28, 2019 at 09:57:25AM -0400, Chuck Lever wrote:
>> 
>> 
>>> On Aug 28, 2019, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
>>> 
>>> On Wed, 2019-08-28 at 09:48 -0400, bfields@fieldses.org wrote:
>>>> On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
>>>>> I'm open to other suggestions, but I'm having trouble finding one that
>>>>> can scale correctly (i.e. not require per-client tracking), prevent
>>>>> silent corruption (by causing clients to miss errors), while not
>>>>> relying on optional features that may not be implemented by all NFSv3
>>>>> clients (e.g. per-file write verifiers are not implemented by *BSD).
>>>>> 
>>>>> That said, it seems to me that to do nothing should not be an option,
>>>>> as that would imply tolerating silent corruption of file data.
>>>> 
>>>> So should we increment the boot verifier every time we discover an error
>>>> on an asynchronous write?
>>>> 
>>> 
>>> I think so. Otherwise, only one client will ever see that error.
>> 
>> +1
>> 
>> I'm not familiar with the details of how the Linux NFS server
>> implements the boot verifier: Will a verifier bump be effective
>> for all file systems that server exports?
> 
> Yes.  It will be per network namespace, but that's the only limit.
> 
>> If so, is that an acceptable cost?
> 
> It means clients will resend all their uncommitted writes.  That could
> certainly make write errors more expensive.  But maybe you've already
> got bigger problems if you've got a full or failing disk?

One full filesystem will impact the behavior of all other exported
filesystems. That might be surprising behavior to a server administrator.
I don't have any suggestions other than maintaining a separate verifier
for each exported filesystem in each net namespace.


--
Chuck Lever
Jeff Layton Aug. 28, 2019, 2:16 p.m. UTC | #22
On Wed, 2019-08-28 at 10:03 -0400, Chuck Lever wrote:
> > On Aug 28, 2019, at 10:00 AM, J. Bruce Fields <bfields@redhat.com> wrote:
> > 
> > On Wed, Aug 28, 2019 at 09:57:25AM -0400, Chuck Lever wrote:
> > > 
> > > > On Aug 28, 2019, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > 
> > > > On Wed, 2019-08-28 at 09:48 -0400, bfields@fieldses.org wrote:
> > > > > On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
> > > > > > I'm open to other suggestions, but I'm having trouble finding one that
> > > > > > can scale correctly (i.e. not require per-client tracking), prevent
> > > > > > silent corruption (by causing clients to miss errors), while not
> > > > > > relying on optional features that may not be implemented by all NFSv3
> > > > > > clients (e.g. per-file write verifiers are not implemented by *BSD).
> > > > > > 
> > > > > > That said, it seems to me that to do nothing should not be an option,
> > > > > > as that would imply tolerating silent corruption of file data.
> > > > > 
> > > > > So should we increment the boot verifier every time we discover an error
> > > > > on an asynchronous write?
> > > > > 
> > > > 
> > > > I think so. Otherwise, only one client will ever see that error.
> > > 
> > > +1
> > > 
> > > I'm not familiar with the details of how the Linux NFS server
> > > implements the boot verifier: Will a verifier bump be effective
> > > for all file systems that server exports?
> > 
> > Yes.  It will be per network namespace, but that's the only limit.
> > 
> > > If so, is that an acceptable cost?
> > 
> > It means clients will resend all their uncommitted writes.  That could
> > certainly make write errors more expensive.  But maybe you've already
> > got bigger problems if you've got a full or failing disk?
> 
> One full filesystem will impact the behavior of all other exported
> filesystems. That might be surprising behavior to a server administrator.
> I don't have any suggestions other than maintaining a separate verifier
> for each exported filesystem in each net namespace.
> 
> 

Yeah, it's not pretty, but I think the alternative is worse. Most admins
would take rotten performance over data corruption.

For the most part, these sorts of errors tend to be rare. If it turns
out to be a problem we could consider moving the verifier into
svc_export or something?
Chuck Lever III Aug. 28, 2019, 2:21 p.m. UTC | #23
> On Aug 28, 2019, at 10:16 AM, Jeff Layton <jlayton@redhat.com> wrote:
> 
> On Wed, 2019-08-28 at 10:03 -0400, Chuck Lever wrote:
>>> On Aug 28, 2019, at 10:00 AM, J. Bruce Fields <bfields@redhat.com> wrote:
>>> 
>>> On Wed, Aug 28, 2019 at 09:57:25AM -0400, Chuck Lever wrote:
>>>> 
>>>>> On Aug 28, 2019, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
>>>>> 
>>>>> On Wed, 2019-08-28 at 09:48 -0400, bfields@fieldses.org wrote:
>>>>>> On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
>>>>>>> I'm open to other suggestions, but I'm having trouble finding one that
>>>>>>> can scale correctly (i.e. not require per-client tracking), prevent
>>>>>>> silent corruption (by causing clients to miss errors), while not
>>>>>>> relying on optional features that may not be implemented by all NFSv3
>>>>>>> clients (e.g. per-file write verifiers are not implemented by *BSD).
>>>>>>> 
>>>>>>> That said, it seems to me that to do nothing should not be an option,
>>>>>>> as that would imply tolerating silent corruption of file data.
>>>>>> 
>>>>>> So should we increment the boot verifier every time we discover an error
>>>>>> on an asynchronous write?
>>>>>> 
>>>>> 
>>>>> I think so. Otherwise, only one client will ever see that error.
>>>> 
>>>> +1
>>>> 
>>>> I'm not familiar with the details of how the Linux NFS server
>>>> implements the boot verifier: Will a verifier bump be effective
>>>> for all file systems that server exports?
>>> 
>>> Yes.  It will be per network namespace, but that's the only limit.
>>> 
>>>> If so, is that an acceptable cost?
>>> 
>>> It means clients will resend all their uncommitted writes.  That could
>>> certainly make write errors more expensive.  But maybe you've already
>>> got bigger problems if you've got a full or failing disk?
>> 
>> One full filesystem will impact the behavior of all other exported
>> filesystems. That might be surprising behavior to a server administrator.
>> I don't have any suggestions other than maintaining a separate verifier
>> for each exported filesystem in each net namespace.
>> 
>> 
> 
> Yeah, it's not pretty, but I think the alternative is worse. Most admins
> would take rotten performance over data corruption.

Again, I'm not saying we should do nothing. It doesn't seem
like a per-export verifier would be much more work than a
per-net-namespace verifier.


> For the most part, these sorts of errors tend to be rare.

EIO is certainly going to be rare, agreed. ENOSPC might not be.


> If it turns
> out to be a problem we could consider moving the verifier into
> svc_export or something?


--
Chuck Lever
Bruce Fields Aug. 28, 2019, 2:40 p.m. UTC | #24
On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
> For the most part, these sorts of errors tend to be rare. If it turns
> out to be a problem we could consider moving the verifier into
> svc_export or something?

As Trond says, this isn't really a server implementation issue, it's a
protocol issue.  If a client detects when to resend writes by storing a
single verifier per server, then returning different verifiers from
writes to different exports will have it resending every time it writes
to one export then another.
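
To spell out that failure mode: a client that keeps one verifier per
server does roughly the following on each COMMIT reply (a sketch, not
any particular client's source):

    #include <linux/types.h>

    /* One verifier for the whole server.  If exports A and B return
     * different verifiers, every COMMIT that alternates between them
     * sees a mismatch and triggers a full resend of uncommitted
     * writes, even though nothing was actually lost. */
    struct nfs_server_verf {
            u64     wverf;
            bool    wverf_valid;
    };

    static bool commit_verf_matches(struct nfs_server_verf *srv, u64 verf)
    {
            if (srv->wverf_valid && srv->wverf == verf)
                    return true;    /* data is safe, COMMIT stands */
            srv->wverf = verf;      /* new "server instance"... */
            srv->wverf_valid = true;
            return false;           /* ...resend uncommitted WRITEs */
    }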

--b.
J. Bruce Fields Aug. 28, 2019, 2:48 p.m. UTC | #25
On Wed, Aug 28, 2019 at 10:40:31AM -0400, J. Bruce Fields wrote:
> On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
> > For the most part, these sorts of errors tend to be rare. If it turns
> > out to be a problem we could consider moving the verifier into
> > svc_export or something?
> 
> As Trond says, this isn't really a server implementation issue, it's a
> protocol issue.  If a client detects when to resend writes by storing a
> single verifier per server, then returning different verifiers from
> writes to different exports will have it resending every time it writes
> to one export then another.

The other way to limit this is by trying to detect the single-writer
case, since it's probably also rare for multiple clients to write to the
same file.

Are there any common cases where two clients share the same TCP
connection?  Maybe that would be a conservative enough test?

And you wouldn't have to track every connection that's ever sent a WRITE
to a given file.  Just remember the first one, with a flag to track
whether anyone else has ever written to it.
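
Something like this, say (all names invented for the sketch; the
connection is identified by an opaque cookie rather than a socket
pointer, for the lifetime reasons that come up later in the thread):

    /* Per cached file: remember the first connection that wrote,
     * and whether any other connection has written since. */
    struct nfsd_file_writers {
            unsigned long   nf_first_conn;  /* 0 == no writer yet */
            bool            nf_multi_writer;
    };

    static void nfsd_file_note_writer(struct nfsd_file_writers *w,
                                      unsigned long conn_cookie)
    {
            if (!w->nf_first_conn)
                    w->nf_first_conn = conn_cookie;
            else if (w->nf_first_conn != conn_cookie)
                    w->nf_multi_writer = true;
    }

On a COMMIT error the server would then bump the boot verifier only
when nf_multi_writer is set; a single writer gets the error through
the normal COMMIT reply.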

??

--b.
Chuck Lever III Aug. 28, 2019, 2:50 p.m. UTC | #26
> On Aug 28, 2019, at 10:48 AM, Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Wed, Aug 28, 2019 at 10:40:31AM -0400, J. Bruce Fields wrote:
>> On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
>>> For the most part, these sorts of errors tend to be rare. If it turns
>>> out to be a problem we could consider moving the verifier into
>>> svc_export or something?
>> 
>> As Trond says, this isn't really a server implementation issue, it's a
>> protocol issue.  If a client detects when to resend writes by storing a
>> single verifier per server, then returning different verifiers from
>> writes to different exports will have it resending every time it writes
>> to one export then another.
> 
> The other way to limit this is by trying to detect the single-writer
> case, since it's probably also rare for multiple clients to write to the
> same file.
> 
> Are there any common cases where two clients share the same TCP
> connection?  Maybe that would be a conservative enough test?

What happens if a client reconnects? Does that bump the verifier?

If it reconnects from the same source port, can the server tell
that the connection is different?


> And you wouldn't have to track every connection that's ever sent a WRITE
> to a given file.  Just remember the first one, with a flag to track
> whether anyone else has ever written to it.
> 
> ??
> 
> --b.

--
Chuck Lever
Jeff Layton Aug. 28, 2019, 3:09 p.m. UTC | #27
On Wed, 2019-08-28 at 10:40 -0400, J. Bruce Fields wrote:
> On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
> > For the most part, these sorts of errors tend to be rare. If it turns
> > out to be a problem we could consider moving the verifier into
> > svc_export or something?
> 
> As Trond says, this isn't really a server implementation issue, it's a
> protocol issue.  If a client detects when to resend writes by storing a
> single verifier per server, then returning different verifiers from
> writes to different exports will have it resending every time it writes
> to one export then another.
> 

Huh. I've always considered the verifier to be a per-inode quantity that
we just happened to fill with a global value, but the spec _is_ a bit
vague in this regard.

I think modern Linux clients track this on a per-inode basis, but I'm
not sure about really old Linux or other clients. It'd be good to know
about BSD and Solaris, in particular.
Rick Macklem Aug. 28, 2019, 3:12 p.m. UTC | #28
J. Bruce Fields wrote:
>On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
>> For the most part, these sorts of errors tend to be rare. If it turns
>> out to be a problem we could consider moving the verifier into
>> svc_export or something?
>
>As Trond says, this isn't really a server implementation issue, it's a
>protocol issue.  If a client detects when to resend writes by storing a
>single verifier per server, then returning different verifiers from
>writes to different exports will have it resending every time it writes
>to one export then another.

Well, here's what RFC-1813 says about the write verifier:
         This cookie must be consistent during a single instance
         of the NFS version 3 protocol service and must be
         unique between instances of the NFS version 3 protocol
         server, where uncommitted data may be lost.
You could debate what "service" means in the above, but I'd say it isn't "per file".

However, since there is no way for an NFSv3 client to know what a "server" is,
the FreeBSD client (and maybe the other *BSD, although I haven't been involved
in them for a long time) stores the write verifier "per mount".
--> So, for the FreeBSD client, it is at the granularity of what the NFSv3 client sees as
     a single file system. (Typically a single file system on the server unless the knfsd
     plays the game of coalescing multiple file systems by uniquifying the I-node#, etc.)

It is conceivable that some other NFSv3 client might assume
"same server IP address --> same server" and store it "per server IP", but I have
no idea if any such client exists.

rick
Trond Myklebust Aug. 28, 2019, 3:37 p.m. UTC | #29
On Wed, 2019-08-28 at 15:12 +0000, Rick Macklem wrote:
> J. Bruce Fields wrote:
> > On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
> > > For the most part, these sorts of errors tend to be rare. If it
> > > turns out to be a problem we could consider moving the verifier
> > > into svc_export or something?
> > 
> > As Trond says, this isn't really a server implementation issue, it's
> > a protocol issue.  If a client detects when to resend writes by
> > storing a single verifier per server, then returning different
> > verifiers from writes to different exports will have it resending
> > every time it writes to one export then another.
> 
> Well, here's what RFC-1813 says about the write verifier:
>          This cookie must be consistent during a single instance
>          of the NFS version 3 protocol service and must be
>          unique between instances of the NFS version 3 protocol
>          server, where uncommitted data may be lost.
> You could debate what "service" means in the above, but I'd say it
> isn't "per file".
> 
> However, since there is no way for an NFSv3 client to know what a
> "server" is, the FreeBSD client (and maybe the other *BSD, although I
> haven't been involved in them for a long time) stores the write
> verifier "per mount".
> --> So, for the FreeBSD client, it is at the granularity of what the
>      NFSv3 client sees as a single file system. (Typically a single
>      file system on the server unless the knfsd plays the game of
>      coalescing multiple file systems by uniquifying the I-node#, etc.)
> 
> It is conceivable that some other NFSv3 client might assume "same
> server IP address --> same server" and store it "per server IP", but I
> have no idea if any such client exists.
> 

The patchsets we've been discussing should all be compatible with
FreeBSD, as they implement a per-server boot verifier (as in different
containers representing different knfsd servers on different networks
are allowed to have different boot verifiers, but each server applies
its boot verifier globally to all exported filesystems).

We could make these boot verifiers per-filesystem to reduce the
frequency of these 'server reboots', but the expectation is that
filesystem errors on COMMIT are rare.
J. Bruce Fields Aug. 28, 2019, 3:46 p.m. UTC | #30
On Wed, Aug 28, 2019 at 03:12:26PM +0000, Rick Macklem wrote:
> J. Bruce Fields wrote:
> >On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
> >> For the most part, these sorts of errors tend to be rare. If it turns
> >> out to be a problem we could consider moving the verifier into
> >> svc_export or something?
> >
> >As Trond says, this isn't really a server implementation issue, it's a
> >protocol issue.  If a client detects when to resend writes by storing a
> >single verifier per server, then returning different verifiers from
> >writes to different exports will have it resending every time it writes
> >to one export then another.
> 
> Well, here's what RFC-1813 says about the write verifier:
>          This cookie must be consistent during a single instance
>          of the NFS version 3 protocol service and must be
>          unique between instances of the NFS version 3 protocol
>          server, where uncommitted data may be lost.
> You could debate what "service" means in the above, but I'd say it isn't "per file".
> 
> However, since there is no way for an NFSv3 client to know what a "server" is,
> the FreeBSD client (and maybe the other *BSD, although I haven't been involved
> in them for a long time) stores the write verifier "per mount".
> --> So, for the FreeBSD client, it is at the granularity of what the NFSv3 client sees as
>      a single file system. (Typically a single file system on the server unless the knfsd
>      plays the game of coalescing multiple file systems by uniquifying the I-node#, etc.)

Oh, well, it would be nice if we could at least count on per-filesystem
granularity.

> It is conceivable that some other NFSv3 client might assume "same
> server IP address --> same server" and store it "per server IP", but I
> have no idea if any such client exists.

Hm.

--b.
J. Bruce Fields Aug. 28, 2019, 5:07 p.m. UTC | #31
On Wed, Aug 28, 2019 at 10:50:30AM -0400, Chuck Lever wrote:
> 
> 
> > On Aug 28, 2019, at 10:48 AM, Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > On Wed, Aug 28, 2019 at 10:40:31AM -0400, J. Bruce Fields wrote:
> >> On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
> >>> For the most part, these sorts of errors tend to be rare. If it turns
> >>> out to be a problem we could consider moving the verifier into
> >>> svc_export or something?
> >> 
> >> As Trond says, this isn't really a server implementation issue, it's a
> >> protocol issue.  If a client detects when to resend writes by storing a
> >> single verifier per server, then returning different verifiers from
> >> writes to different exports will have it resending every time it writes
> >> to one export then another.
> > 
> > The other way to limit this is by trying to detect the single-writer
> > case, since it's probably also rare for multiple clients to write to the
> > same file.
> > 
> > Are there any common cases where two clients share the same TCP
> > connection?  Maybe that would be a conservative enough test?
> 
> What happens if a client reconnects? Does that bump the verifier?
> 
> If it reconnects from the same source port, can the server tell
> that the connection is different?

I assume it creates a new socket structure.

So, you could key off either that or the (address, port) depending on
whether you'd rather risk unnecessarily incrementing the verifier after
a reconnection and write error, or risk failing to increment due to port
reuse.

There's some precedent for taking the latter choice in the DRC (the
duplicate reply cache).  But the DRC also includes some other data
(xid, hash) to minimize the chance of a collision.

If you took the former choice, you wouldn't literally want to key off a
pointer to the socket, as it'd cause complications when freeing the
socket.  But maybe you could add some kind of birth time or generation
number to struct svc_sock to help differentiate?
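
E.g., something like this (the generation counter is invented for the
sketch, not an existing svc_sock field):

    #include <linux/atomic.h>
    #include <linux/types.h>

    /* Tag each accepted connection with a generation number, so a
     * reconnect from the same (address, port) still compares as a
     * different connection, and a freed-and-reused socket can't be
     * mistaken for the old one the way a raw pointer could. */
    static atomic64_t svc_sock_generation = ATOMIC64_INIT(0);

    struct svc_sock_id {
            u64     sk_gen;         /* assigned once, at accept time */
    };

    static void svc_sock_assign_id(struct svc_sock_id *id)
    {
            id->sk_gen = atomic64_inc_return(&svc_sock_generation);
    }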

--b.

> 
> 
> > And you wouldn't have to track every connection that's ever sent a WRITE
> > to a given file.  Just remember the first one, with a flag to track
> > whether anyone else has ever written to it.
> > 
> > ??
> > 
> > --b.
> 
> --
> Chuck Lever
> 
>