Message ID: 20190826165021.81075-1-trond.myklebust@hammerspace.com (mailing list archive)
Series: Handling NFSv3 I/O errors in knfsd
On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> Recently, a number of changes went into the kernel to try to ensure
> that I/O errors (specifically write errors) are reported to the
> application once and only once. The vehicle for ensuring the errors
> are reported is the struct file, which uses the 'f_wb_err' field to
> track which errors have been reported.
>
> The problem is that errors are mainly intended to be reported through
> fsync(). If the client is doing synchronous writes, then all is well,
> but if it is doing unstable writes, then the errors may not be
> reported until the client calls COMMIT. If the file cache has
> thrown out the struct file, due to memory pressure, or just because
> the client took a long while between the last WRITE and the COMMIT,
> then the error report may be lost, and the client may just think
> its data is safely stored.

These were lost before the file caching patches as well, right?  Or is
there some regression?

> Note that the problem is compounded by the fact that NFSv3 is stateless,
> so the server never knows that the client may have rebooted, so there
> can be no guarantee that a COMMIT will ever be sent.
>
> The following patch set attempts to remedy the situation using 2
> strategies:
>
> 1) If the inode is dirty, then avoid garbage collecting the file
>    from the file cache.
> 2) If the file is closed, and we see that it would have reported
>    an error to COMMIT, then we bump the boot verifier in order to
>    ensure the client retransmits all its writes.

Sounds sensible to me.

> Note that if multiple clients were writing to the same file, then
> we probably want to bump the boot verifier anyway, since only one
> COMMIT will see the error report (because the cached file is also
> shared).

I'm confused by the "probably should".  So that's future work?  I guess
it'd mean some additional work to identify that case.  You can't really
even distinguish clients in the NFSv3 case, but I suppose you could use
IP address or TCP connection as an approximation.

--b.

> So in order to implement the above strategy, we first have to do
> the following: split up the file cache to act per net namespace,
> since the boot verifier is per net namespace. Then add a helper
> to update the boot verifier.
>
> Trond Myklebust (3):
>   nfsd: nfsd_file cache entries should be per net namespace
>   nfsd: Support the server resetting the boot verifier
>   nfsd: Don't garbage collect files that might contain write errors
>
>  fs/nfsd/export.c    |  2 +-
>  fs/nfsd/filecache.c | 76 ++++++++++++++++++++++++++++++++++++--------
>  fs/nfsd/filecache.h |  3 +-
>  fs/nfsd/netns.h     |  4 +++
>  fs/nfsd/nfs3xdr.c   | 13 +++++---
>  fs/nfsd/nfs4proc.c  | 14 +++------
>  fs/nfsd/nfsctl.c    |  1 +
>  fs/nfsd/nfssvc.c    | 32 ++++++++++++++++++-
>  8 files changed, 115 insertions(+), 30 deletions(-)
>
> --
> 2.21.0
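The cover letter's "helper to update the boot verifier" is conceptually
simple. A minimal sketch of what such a helper might look like, assuming
a per-net-namespace verifier guarded by a seqlock (the struct layout and
names here are illustrative, not the actual patch):

#include <linux/types.h>
#include <linux/seqlock.h>
#include <linux/time64.h>
#include <linux/timekeeping.h>

/* Illustrative per-net-namespace state; the real layout lives in
 * fs/nfsd/netns.h. */
struct nfsd_net_sketch {
	u32		boot_verf[2];	/* verifier sent in WRITE/COMMIT replies */
	seqlock_t	boot_lock;	/* guards updates to boot_verf */
};

/*
 * "Reboot" the server from the clients' point of view: hand out a new
 * verifier so that every client retransmits its uncommitted writes.
 */
static void nfsd_bump_boot_verifier(struct nfsd_net_sketch *nn)
{
	struct timespec64 now;

	ktime_get_real_ts64(&now);
	write_seqlock(&nn->boot_lock);
	nn->boot_verf[0] = (u32)now.tv_sec;
	nn->boot_verf[1] = (u32)now.tv_nsec;
	write_sequnlock(&nn->boot_lock);
}

Because clients compare the verifier returned by COMMIT against the one
returned by their earlier WRITEs, handing out a new value makes every
client treat its uncommitted writes as lost and retransmit them.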
On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > Recently, a number of changes went into the kernel to try to ensure
> > that I/O errors (specifically write errors) are reported to the
> > application once and only once. The vehicle for ensuring the errors
> > are reported is the struct file, which uses the 'f_wb_err' field to
> > track which errors have been reported.
> >
> > The problem is that errors are mainly intended to be reported
> > through fsync(). If the client is doing synchronous writes, then
> > all is well, but if it is doing unstable writes, then the errors
> > may not be reported until the client calls COMMIT. If the file
> > cache has thrown out the struct file, due to memory pressure, or
> > just because the client took a long while between the last WRITE
> > and the COMMIT, then the error report may be lost, and the client
> > may just think its data is safely stored.
>
> These were lost before the file caching patches as well, right?  Or
> is there some regression?

Correct. This is not a regression, but an attempt to fix a problem
that has existed for some time now.

> > Note that the problem is compounded by the fact that NFSv3 is
> > stateless, so the server never knows that the client may have
> > rebooted, so there can be no guarantee that a COMMIT will ever be
> > sent.
> >
> > The following patch set attempts to remedy the situation using 2
> > strategies:
> >
> > 1) If the inode is dirty, then avoid garbage collecting the file
> >    from the file cache.
> > 2) If the file is closed, and we see that it would have reported
> >    an error to COMMIT, then we bump the boot verifier in order to
> >    ensure the client retransmits all its writes.
>
> Sounds sensible to me.
>
> > Note that if multiple clients were writing to the same file, then
> > we probably want to bump the boot verifier anyway, since only one
> > COMMIT will see the error report (because the cached file is also
> > shared).
>
> I'm confused by the "probably should".  So that's future work?  I
> guess it'd mean some additional work to identify that case.  You
> can't really even distinguish clients in the NFSv3 case, but I
> suppose you could use IP address or TCP connection as an
> approximation.

I'm suggesting we should do this too, but I haven't done so yet in
these patches. I'd like to hear other opinions (particularly from you,
Chuck and Jeff).

> --b.
>
> > So in order to implement the above strategy, we first have to do
> > the following: split up the file cache to act per net namespace,
> > since the boot verifier is per net namespace. Then add a helper
> > to update the boot verifier.
> >
> > Trond Myklebust (3):
> >   nfsd: nfsd_file cache entries should be per net namespace
> >   nfsd: Support the server resetting the boot verifier
> >   nfsd: Don't garbage collect files that might contain write errors
> >
> >  fs/nfsd/export.c    |  2 +-
> >  fs/nfsd/filecache.c | 76 +++++++++++++++++++++++++++++++++-------
> >  fs/nfsd/filecache.h |  3 +-
> >  fs/nfsd/netns.h     |  4 +++
> >  fs/nfsd/nfs3xdr.c   | 13 +++++---
> >  fs/nfsd/nfs4proc.c  | 14 +++------
> >  fs/nfsd/nfsctl.c    |  1 +
> >  fs/nfsd/nfssvc.c    | 32 ++++++++++++++++++-
> >  8 files changed, 115 insertions(+), 30 deletions(-)
> >
> > --
> > 2.21.0
On Mon, Aug 26, 2019 at 09:02:31PM +0000, Trond Myklebust wrote:
> On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > > Recently, a number of changes went into the kernel to try to
> > > ensure that I/O errors (specifically write errors) are reported to
> > > the application once and only once. The vehicle for ensuring the
> > > errors are reported is the struct file, which uses the 'f_wb_err'
> > > field to track which errors have been reported.
> > >
> > > The problem is that errors are mainly intended to be reported
> > > through fsync(). If the client is doing synchronous writes, then
> > > all is well, but if it is doing unstable writes, then the errors
> > > may not be reported until the client calls COMMIT. If the file
> > > cache has thrown out the struct file, due to memory pressure, or
> > > just because the client took a long while between the last WRITE
> > > and the COMMIT, then the error report may be lost, and the client
> > > may just think its data is safely stored.
> >
> > These were lost before the file caching patches as well, right?  Or
> > is there some regression?
>
> Correct. This is not a regression, but an attempt to fix a problem
> that has existed for some time now.
>
> > > Note that the problem is compounded by the fact that NFSv3 is
> > > stateless, so the server never knows that the client may have
> > > rebooted, so there can be no guarantee that a COMMIT will ever be
> > > sent.
> > >
> > > The following patch set attempts to remedy the situation using 2
> > > strategies:
> > >
> > > 1) If the inode is dirty, then avoid garbage collecting the file
> > >    from the file cache.
> > > 2) If the file is closed, and we see that it would have reported
> > >    an error to COMMIT, then we bump the boot verifier in order to
> > >    ensure the client retransmits all its writes.
> >
> > Sounds sensible to me.
> >
> > > Note that if multiple clients were writing to the same file, then
> > > we probably want to bump the boot verifier anyway, since only one
> > > COMMIT will see the error report (because the cached file is also
> > > shared).
> >
> > I'm confused by the "probably should".  So that's future work?  I
> > guess it'd mean some additional work to identify that case.  You
> > can't really even distinguish clients in the NFSv3 case, but I
> > suppose you could use IP address or TCP connection as an
> > approximation.
>
> I'm suggesting we should do this too, but I haven't done so yet in
> these patches. I'd like to hear other opinions (particularly from you,
> Chuck and Jeff).

Does this process actually converge, or do we end up with all the
clients retrying the writes and, again, only one of them getting the
error?

I wonder what the typical errors are, anyway.

--b.
On Mon, 2019-08-26 at 20:48 -0400, bfields@fieldses.org wrote:
> On Mon, Aug 26, 2019 at 09:02:31PM +0000, Trond Myklebust wrote:
> > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > > > Recently, a number of changes went into the kernel to try to
> > > > ensure that I/O errors (specifically write errors) are reported
> > > > to the application once and only once. The vehicle for ensuring
> > > > the errors are reported is the struct file, which uses the
> > > > 'f_wb_err' field to track which errors have been reported.
> > > >
> > > > The problem is that errors are mainly intended to be reported
> > > > through fsync(). If the client is doing synchronous writes, then
> > > > all is well, but if it is doing unstable writes, then the errors
> > > > may not be reported until the client calls COMMIT. If the file
> > > > cache has thrown out the struct file, due to memory pressure, or
> > > > just because the client took a long while between the last WRITE
> > > > and the COMMIT, then the error report may be lost, and the
> > > > client may just think its data is safely stored.
> > >
> > > These were lost before the file caching patches as well, right?
> > > Or is there some regression?
> >
> > Correct. This is not a regression, but an attempt to fix a problem
> > that has existed for some time now.
> >
> > > > Note that the problem is compounded by the fact that NFSv3 is
> > > > stateless, so the server never knows that the client may have
> > > > rebooted, so there can be no guarantee that a COMMIT will ever
> > > > be sent.
> > > >
> > > > The following patch set attempts to remedy the situation using 2
> > > > strategies:
> > > >
> > > > 1) If the inode is dirty, then avoid garbage collecting the file
> > > >    from the file cache.
> > > > 2) If the file is closed, and we see that it would have reported
> > > >    an error to COMMIT, then we bump the boot verifier in order
> > > >    to ensure the client retransmits all its writes.
> > >
> > > Sounds sensible to me.
> > >
> > > > Note that if multiple clients were writing to the same file,
> > > > then we probably want to bump the boot verifier anyway, since
> > > > only one COMMIT will see the error report (because the cached
> > > > file is also shared).
> > >
> > > I'm confused by the "probably should".  So that's future work?  I
> > > guess it'd mean some additional work to identify that case.  You
> > > can't really even distinguish clients in the NFSv3 case, but I
> > > suppose you could use IP address or TCP connection as an
> > > approximation.
> >
> > I'm suggesting we should do this too, but I haven't done so yet in
> > these patches. I'd like to hear other opinions (particularly from
> > you, Chuck and Jeff).
>
> Does this process actually converge, or do we end up with all the
> clients retrying the writes and, again, only one of them getting the
> error?

The client that gets the error should stop retrying if the error is
fatal.

> I wonder what the typical errors are, anyway.

I would expect ENOSPC, and EIO to be the most common. The former if
delayed allocation and/or snapshots result in writes failing after
writing to the page cache. The latter if we hit a disk outage or other
such problem.
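For context on where those ENOSPC/EIO errors surface: an NFSv3 COMMIT on
the server ultimately boils down to flushing the requested range through
the VFS, which is the first point at which a deferred writeback failure
becomes visible. A simplified sketch (the real nfsd commit path in
fs/nfsd/vfs.c handles more cases):

#include <linux/fs.h>
#include <linux/limits.h>

/* Sketch of a COMMIT handler: flush the range, report any deferred error. */
static int commit_range_sketch(struct file *file, loff_t offset, loff_t count)
{
	loff_t end = count ? offset + count - 1 : LLONG_MAX;

	/*
	 * vfs_fsync_range() returns errors such as -ENOSPC or -EIO that
	 * the filesystem hit while writing back data from earlier
	 * unstable WRITEs, even though those WRITEs appeared to succeed.
	 */
	return vfs_fsync_range(file, offset, end, 0);
}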
On Tue, Aug 27, 2019 at 12:56:07AM +0000, Trond Myklebust wrote:
> On Mon, 2019-08-26 at 20:48 -0400, bfields@fieldses.org wrote:
> > On Mon, Aug 26, 2019 at 09:02:31PM +0000, Trond Myklebust wrote:
> > > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > > > > Note that if multiple clients were writing to the same file,
> > > > > then we probably want to bump the boot verifier anyway, since
> > > > > only one COMMIT will see the error report (because the cached
> > > > > file is also shared).
> > > >
> > > > I'm confused by the "probably should".  So that's future work?
> > > > I guess it'd mean some additional work to identify that case.
> > > > You can't really even distinguish clients in the NFSv3 case, but
> > > > I suppose you could use IP address or TCP connection as an
> > > > approximation.
> > >
> > > I'm suggesting we should do this too, but I haven't done so yet in
> > > these patches. I'd like to hear other opinions (particularly from
> > > you, Chuck and Jeff).
> >
> > Does this process actually converge, or do we end up with all the
> > clients retrying the writes and, again, only one of them getting the
> > error?
>
> The client that gets the error should stop retrying if the error is
> fatal.

Have clients historically been good about that?  I just wonder whether
it's a concern that boot-verifier-bumping could magnify the impact of
clients that are overly persistent about retrying IO errors.

> > I wonder what the typical errors are, anyway.
>
> I would expect ENOSPC, and EIO to be the most common. The former if
> delayed allocation and/or snapshots result in writes failing after
> writing to the page cache. The latter if we hit a disk outage or other
> such problem.

Makes sense.

--b.
On Mon, 2019-08-26 at 21:13 -0400, bfields@fieldses.org wrote:
> On Tue, Aug 27, 2019 at 12:56:07AM +0000, Trond Myklebust wrote:
> > On Mon, 2019-08-26 at 20:48 -0400, bfields@fieldses.org wrote:
> > > On Mon, Aug 26, 2019 at 09:02:31PM +0000, Trond Myklebust wrote:
> > > > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust
> > > > > wrote:
> > > > > > Note that if multiple clients were writing to the same file,
> > > > > > then we probably want to bump the boot verifier anyway,
> > > > > > since only one COMMIT will see the error report (because the
> > > > > > cached file is also shared).
> > > > >
> > > > > I'm confused by the "probably should".  So that's future work?
> > > > > I guess it'd mean some additional work to identify that case.
> > > > > You can't really even distinguish clients in the NFSv3 case,
> > > > > but I suppose you could use IP address or TCP connection as an
> > > > > approximation.
> > > >
> > > > I'm suggesting we should do this too, but I haven't done so yet
> > > > in these patches. I'd like to hear other opinions (particularly
> > > > from you, Chuck and Jeff).
> > >
> > > Does this process actually converge, or do we end up with all the
> > > clients retrying the writes and, again, only one of them getting
> > > the error?
> >
> > The client that gets the error should stop retrying if the error is
> > fatal.
>
> Have clients historically been good about that?  I just wonder whether
> it's a concern that boot-verifier-bumping could magnify the impact of
> clients that are overly persistent about retrying IO errors.

Clients have always been required to handle I/O errors, yes, and this
isn't just a Linux server thing. All other servers that support
unstable writes impose the same requirement on the client to check the
return value of COMMIT and to handle any errors.

> > > I wonder what the typical errors are, anyway.
> >
> > I would expect ENOSPC, and EIO to be the most common. The former if
> > delayed allocation and/or snapshots result in writes failing after
> > writing to the page cache. The latter if we hit a disk outage or
> > other such problem.
>
> Makes sense.
>
> --b.
> On Aug 26, 2019, at 5:02 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
>> On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
>>> Recently, a number of changes went into the kernel to try to ensure
>>> that I/O errors (specifically write errors) are reported to the
>>> application once and only once. The vehicle for ensuring the errors
>>> are reported is the struct file, which uses the 'f_wb_err' field to
>>> track which errors have been reported.
>>>
>>> The problem is that errors are mainly intended to be reported
>>> through fsync(). If the client is doing synchronous writes, then
>>> all is well, but if it is doing unstable writes, then the errors
>>> may not be reported until the client calls COMMIT. If the file
>>> cache has thrown out the struct file, due to memory pressure, or
>>> just because the client took a long while between the last WRITE
>>> and the COMMIT, then the error report may be lost, and the client
>>> may just think its data is safely stored.
>>
>> These were lost before the file caching patches as well, right?  Or
>> is there some regression?
>
> Correct. This is not a regression, but an attempt to fix a problem
> that has existed for some time now.
>
>>> Note that the problem is compounded by the fact that NFSv3 is
>>> stateless, so the server never knows that the client may have
>>> rebooted, so there can be no guarantee that a COMMIT will ever be
>>> sent.
>>>
>>> The following patch set attempts to remedy the situation using 2
>>> strategies:
>>>
>>> 1) If the inode is dirty, then avoid garbage collecting the file
>>>    from the file cache.
>>> 2) If the file is closed, and we see that it would have reported
>>>    an error to COMMIT, then we bump the boot verifier in order to
>>>    ensure the client retransmits all its writes.
>>
>> Sounds sensible to me.
>>
>>> Note that if multiple clients were writing to the same file, then
>>> we probably want to bump the boot verifier anyway, since only one
>>> COMMIT will see the error report (because the cached file is also
>>> shared).
>>
>> I'm confused by the "probably should".  So that's future work?  I
>> guess it'd mean some additional work to identify that case.  You
>> can't really even distinguish clients in the NFSv3 case, but I
>> suppose you could use IP address or TCP connection as an
>> approximation.
>
> I'm suggesting we should do this too, but I haven't done so yet in
> these patches. I'd like to hear other opinions (particularly from you,
> Chuck and Jeff).

The strategy of handling these errors more carefully seems good.
Bumping the write/commit verifier so the client writes again to
retrieve the latent error is clever!

It's not clear to me though that the NFSv3 protocol can deal with
the multi-client write scenario, since it is stateless. We are now
making it stateful in some sense by preserving error state on the
server across NFS requests, without having any sense of an open
file in the protocol itself.

Would an "approximation" without open state be good enough? I
assume you are doing this to more fully support the FlexFiles
layout type. Do you have any analysis or thought about this next
step?

I also echo Bruce's concern about whether the client implementations
are up to snuff. There could be long-standing bugs or their protocol
implementation could be missing parts. This is more curiosity than
an objection, but maybe noting which client implementations you've
tested with would be good.

>> --b.
>>
>>> So in order to implement the above strategy, we first have to do
>>> the following: split up the file cache to act per net namespace,
>>> since the boot verifier is per net namespace. Then add a helper
>>> to update the boot verifier.
>>>
>>> Trond Myklebust (3):
>>>   nfsd: nfsd_file cache entries should be per net namespace
>>>   nfsd: Support the server resetting the boot verifier
>>>   nfsd: Don't garbage collect files that might contain write errors
>>>
>>>  fs/nfsd/export.c    |  2 +-
>>>  fs/nfsd/filecache.c | 76 ++++++++++++++++++++++++++++++++--------
>>>  fs/nfsd/filecache.h |  3 +-
>>>  fs/nfsd/netns.h     |  4 +++
>>>  fs/nfsd/nfs3xdr.c   | 13 +++++---
>>>  fs/nfsd/nfs4proc.c  | 14 +++------
>>>  fs/nfsd/nfsctl.c    |  1 +
>>>  fs/nfsd/nfssvc.c    | 32 ++++++++++++++++++-
>>>  8 files changed, 115 insertions(+), 30 deletions(-)
>>>
>>> --
>>> 2.21.0
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com

--
Chuck Lever
On Tue, 2019-08-27 at 09:59 -0400, Chuck Lever wrote:
> > On Aug 26, 2019, at 5:02 PM, Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> >
> > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > > > Recently, a number of changes went into the kernel to try to
> > > > ensure that I/O errors (specifically write errors) are reported
> > > > to the application once and only once. The vehicle for ensuring
> > > > the errors are reported is the struct file, which uses the
> > > > 'f_wb_err' field to track which errors have been reported.
> > > >
> > > > The problem is that errors are mainly intended to be reported
> > > > through fsync(). If the client is doing synchronous writes, then
> > > > all is well, but if it is doing unstable writes, then the errors
> > > > may not be reported until the client calls COMMIT. If the file
> > > > cache has thrown out the struct file, due to memory pressure, or
> > > > just because the client took a long while between the last WRITE
> > > > and the COMMIT, then the error report may be lost, and the
> > > > client may just think its data is safely stored.
> > >
> > > These were lost before the file caching patches as well, right?
> > > Or is there some regression?
> >
> > Correct. This is not a regression, but an attempt to fix a problem
> > that has existed for some time now.
> >
> > > > Note that the problem is compounded by the fact that NFSv3 is
> > > > stateless, so the server never knows that the client may have
> > > > rebooted, so there can be no guarantee that a COMMIT will ever
> > > > be sent.
> > > >
> > > > The following patch set attempts to remedy the situation using 2
> > > > strategies:
> > > >
> > > > 1) If the inode is dirty, then avoid garbage collecting the file
> > > >    from the file cache.
> > > > 2) If the file is closed, and we see that it would have reported
> > > >    an error to COMMIT, then we bump the boot verifier in order
> > > >    to ensure the client retransmits all its writes.
> > >
> > > Sounds sensible to me.
> > >
> > > > Note that if multiple clients were writing to the same file,
> > > > then we probably want to bump the boot verifier anyway, since
> > > > only one COMMIT will see the error report (because the cached
> > > > file is also shared).
> > >
> > > I'm confused by the "probably should".  So that's future work?  I
> > > guess it'd mean some additional work to identify that case.  You
> > > can't really even distinguish clients in the NFSv3 case, but I
> > > suppose you could use IP address or TCP connection as an
> > > approximation.
> >
> > I'm suggesting we should do this too, but I haven't done so yet in
> > these patches. I'd like to hear other opinions (particularly from
> > you, Chuck and Jeff).
>
> The strategy of handling these errors more carefully seems good.
> Bumping the write/commit verifier so the client writes again to
> retrieve the latent error is clever!
>
> It's not clear to me though that the NFSv3 protocol can deal with
> the multi-client write scenario, since it is stateless. We are now
> making it stateful in some sense by preserving error state on the
> server across NFS requests, without having any sense of an open
> file in the protocol itself.
>
> Would an "approximation" without open state be good enough?  I
> assume you are doing this to more fully support the FlexFiles
> layout type. Do you have any analysis or thought about this next
> step?

No, see above. This has nothing to do with the flex files layout type.
It is about ensuring the integrity of _all_ unstable writes. Any client
can hit writeback errors that only manifest themselves when the data is
flushed to disk.

> I also echo Bruce's concern about whether the client implementations
> are up to snuff. There could be long-standing bugs or their protocol
> implementation could be missing parts. This is more curiosity than
> an objection, but maybe noting which client implementations you've
> tested with would be good.

So what exactly is the concern? How do you and Bruce expect this might
make things worse?

As far as I can see, if a client doesn't handle errors in COMMIT, then
it is screwed either way, since we can already end up reporting such
errors today through the fsync() mechanism. If the client ignores the
error, it ends up silently corrupting the file. If it retries the
writes as unstable, then it loops until the error is cleared. The
current patchset does not affect either case for this single client:
the declaration of a reboot just ensures that silent corruption does
not occur, and if the client handles the error correctly once it
catches it, then great, otherwise no extra harm occurs.

If we want to fix the case of multiple writers (i.e. the case where
someone is using NLM locking to mediate between writers, or they are
using O_DIRECT and some external synchronisation mechanism) then we
need to do more (hence the proposal that we extend the current patches
to also declare a reboot on any failure of COMMIT). In that case, the
broken client, that loops forever, is going to cause contention with
all the other clients anyway, since it will never complete its I/O (or
it will silently corrupt the file if it doesn't see the error). By
declaring a reboot we don't make this case worse as far as I can see:
we still end up looping forever, but the file doesn't get silently
corrupted.

The one problem is that the looping forever client can cause other
clients to loop forever on their otherwise successful writes on other
files. That's bad, but again, that's due to client behaviour that is
toxic even today.
On Tue, Aug 27, 2019 at 09:59:25AM -0400, Chuck Lever wrote:
> The strategy of handling these errors more carefully seems good.
> Bumping the write/commit verifier so the client writes again to
> retrieve the latent error is clever!
>
> It's not clear to me though that the NFSv3 protocol can deal with
> the multi-client write scenario, since it is stateless. We are now
> making it stateful in some sense by preserving error state on the
> server across NFS requests, without having any sense of an open
> file in the protocol itself.
>
> Would an "approximation" without open state be good enough?

I figure there's a correct-but-slow approach, which is to increment the
write verifier every time there's a write error.  Does that work?  If
write errors are rare enough, maybe it doesn't matter that that's slow?

Then there are various approximations you could use (IP address?) to
guess when only one client's accessing the file.  You save forcing all
the clients to resend writes at the expense of some complexity and
correctness--e.g., multiple clients behind NAT might not all get write
errors.

Am I thinking about this right?

--b.
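The "correct-but-slow" approach maps naturally onto the kernel's errseq
machinery, which is what 'f_wb_err' is built on. A rough sketch, reusing
the hypothetical nfsd_bump_boot_verifier() helper sketched earlier (all
names illustrative):

#include <linux/errseq.h>
#include <linux/fs.h>

/* Illustrative: run after each unstable WRITE (or periodically). */
static void nfsd_check_write_error_sketch(struct nfsd_net_sketch *nn,
					  struct file *file)
{
	/*
	 * errseq_check() is a cheap, lockless peek: it reports whether
	 * a writeback error has been recorded in the mapping since this
	 * struct file last sampled it (file->f_wb_err).
	 */
	if (errseq_check(&file->f_mapping->wb_err, file->f_wb_err))
		nfsd_bump_boot_verifier(nn);	/* every client re-sends */
}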
On Tue, Aug 27, 2019 at 02:53:01PM +0000, Trond Myklebust wrote:
> The one problem is that the looping forever client can cause other
> clients to loop forever on their otherwise successful writes on other
> files.

Yeah, that's the case I was wondering about.

> That's bad, but again, that's due to client behaviour that is
> toxic even today.

So my worry was that if write errors are rare and the consequences of
the single client looping forever are relatively mild, then there might
be deployed clients that get away with that behavior.

But maybe the behavior's a lot more "toxic" than I imagined, hence
unlikely to be very common.

--b.
On Tue, Aug 27, 2019 at 10:58:19AM -0400, bfields@fieldses.org wrote:
> On Tue, Aug 27, 2019 at 02:53:01PM +0000, Trond Myklebust wrote:
> > The one problem is that the looping forever client can cause other
> > clients to loop forever on their otherwise successful writes on
> > other files.
>
> Yeah, that's the case I was wondering about.
>
> > That's bad, but again, that's due to client behaviour that is
> > toxic even today.
>
> So my worry was that if write errors are rare and the consequences of
> the single client looping forever are relatively mild, then there
> might be deployed clients that get away with that behavior.
>
> But maybe the behavior's a lot more "toxic" than I imagined, hence
> unlikely to be very common.

(And, to be clear, I like the idea, just making sure I'm not
overlooking any problems....)

--b.
On Tue, 2019-08-27 at 10:54 -0400, Bruce Fields wrote:
> On Tue, Aug 27, 2019 at 09:59:25AM -0400, Chuck Lever wrote:
> > The strategy of handling these errors more carefully seems good.
> > Bumping the write/commit verifier so the client writes again to
> > retrieve the latent error is clever!
> >
> > It's not clear to me though that the NFSv3 protocol can deal with
> > the multi-client write scenario, since it is stateless. We are now
> > making it stateful in some sense by preserving error state on the
> > server across NFS requests, without having any sense of an open
> > file in the protocol itself.
> >
> > Would an "approximation" without open state be good enough?
>
> I figure there's a correct-but-slow approach, which is to increment
> the write verifier every time there's a write error.  Does that work?
> If write errors are rare enough, maybe it doesn't matter that that's
> slow?

How is that different from changing the boot verifier? Are you
proposing to track write verifiers on a per-client basis? That seems
onerous too.

> Then there are various approximations you could use (IP address?) to
> guess when only one client's accessing the file.  You save forcing
> all the clients to resend writes at the expense of some complexity
> and correctness--e.g., multiple clients behind NAT might not all get
> write errors.
>
> Am I thinking about this right?

I agree that there are multiple problems with tracking on a per-client
basis.
On Tue, Aug 27, 2019 at 02:59:34PM +0000, Trond Myklebust wrote:
> On Tue, 2019-08-27 at 10:54 -0400, Bruce Fields wrote:
> > On Tue, Aug 27, 2019 at 09:59:25AM -0400, Chuck Lever wrote:
> > > The strategy of handling these errors more carefully seems good.
> > > Bumping the write/commit verifier so the client writes again to
> > > retrieve the latent error is clever!
> > >
> > > It's not clear to me though that the NFSv3 protocol can deal with
> > > the multi-client write scenario, since it is stateless. We are now
> > > making it stateful in some sense by preserving error state on the
> > > server across NFS requests, without having any sense of an open
> > > file in the protocol itself.
> > >
> > > Would an "approximation" without open state be good enough?
> >
> > I figure there's a correct-but-slow approach, which is to increment
> > the write verifier every time there's a write error.  Does that
> > work?  If write errors are rare enough, maybe it doesn't matter that
> > that's slow?
>
> How is that different from changing the boot verifier? Are you
> proposing to track write verifiers on a per-client basis? That seems
> onerous too.

Sorry, I thought "write verifier" and "boot verifier" were synonyms,
and, no, wasn't suggesting tracking it per-file.

--b.

> > Then there are various approximations you could use (IP address?)
> > to guess when only one client's accessing the file.  You save
> > forcing all the clients to resend writes at the expense of some
> > complexity and correctness--e.g., multiple clients behind NAT might
> > not all get write errors.
> >
> > Am I thinking about this right?
>
> I agree that there are multiple problems with tracking on a
> per-client basis.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
On Tue, 2019-08-27 at 10:59 -0400, bfields@fieldses.org wrote:
> On Tue, Aug 27, 2019 at 10:58:19AM -0400, bfields@fieldses.org wrote:
> > On Tue, Aug 27, 2019 at 02:53:01PM +0000, Trond Myklebust wrote:
> > > The one problem is that the looping forever client can cause
> > > other clients to loop forever on their otherwise successful
> > > writes on other files.
> >
> > Yeah, that's the case I was wondering about.
> >
> > > That's bad, but again, that's due to client behaviour that is
> > > toxic even today.
> >
> > So my worry was that if write errors are rare and the consequences
> > of the single client looping forever are relatively mild, then
> > there might be deployed clients that get away with that behavior.
> >
> > But maybe the behavior's a lot more "toxic" than I imagined, hence
> > unlikely to be very common.
>
> (And, to be clear, I like the idea, just making sure I'm not
> overlooking any problems....)

I'm open to other suggestions, but I'm having trouble finding one that
can scale correctly (i.e. not require per-client tracking), prevent
silent corruption (by causing clients to miss errors), while not
relying on optional features that may not be implemented by all NFSv3
clients (e.g. per-file write verifiers are not implemented by *BSD).

That said, it seems to me that to do nothing should not be an option,
as that would imply tolerating silent corruption of file data.
On Tue, 2019-08-27 at 09:59 -0400, Chuck Lever wrote:
> > On Aug 26, 2019, at 5:02 PM, Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> >
> > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > > > Recently, a number of changes went into the kernel to try to
> > > > ensure that I/O errors (specifically write errors) are reported
> > > > to the application once and only once. The vehicle for ensuring
> > > > the errors are reported is the struct file, which uses the
> > > > 'f_wb_err' field to track which errors have been reported.
> > > >
> > > > The problem is that errors are mainly intended to be reported
> > > > through fsync(). If the client is doing synchronous writes, then
> > > > all is well, but if it is doing unstable writes, then the errors
> > > > may not be reported until the client calls COMMIT. If the file
> > > > cache has thrown out the struct file, due to memory pressure, or
> > > > just because the client took a long while between the last WRITE
> > > > and the COMMIT, then the error report may be lost, and the
> > > > client may just think its data is safely stored.
> > >
> > > These were lost before the file caching patches as well, right?
> > > Or is there some regression?
> >
> > Correct. This is not a regression, but an attempt to fix a problem
> > that has existed for some time now.
> >
> > > > Note that the problem is compounded by the fact that NFSv3 is
> > > > stateless, so the server never knows that the client may have
> > > > rebooted, so there can be no guarantee that a COMMIT will ever
> > > > be sent.
> > > >
> > > > The following patch set attempts to remedy the situation using 2
> > > > strategies:
> > > >
> > > > 1) If the inode is dirty, then avoid garbage collecting the file
> > > >    from the file cache.
> > > > 2) If the file is closed, and we see that it would have reported
> > > >    an error to COMMIT, then we bump the boot verifier in order
> > > >    to ensure the client retransmits all its writes.
> > >
> > > Sounds sensible to me.
> > >
> > > > Note that if multiple clients were writing to the same file,
> > > > then we probably want to bump the boot verifier anyway, since
> > > > only one COMMIT will see the error report (because the cached
> > > > file is also shared).
> > >
> > > I'm confused by the "probably should".  So that's future work?  I
> > > guess it'd mean some additional work to identify that case.  You
> > > can't really even distinguish clients in the NFSv3 case, but I
> > > suppose you could use IP address or TCP connection as an
> > > approximation.
> >
> > I'm suggesting we should do this too, but I haven't done so yet in
> > these patches. I'd like to hear other opinions (particularly from
> > you, Chuck and Jeff).
>
> The strategy of handling these errors more carefully seems good.
> Bumping the write/commit verifier so the client writes again to
> retrieve the latent error is clever!

Yes, this would seem to neatly solve a whole class of related problems
in these sorts of scenarios. I also think we ought to bump the verifier
whenever nfsd sees a writeback error, as you have a great point that
we'll only report writeback errors on the first COMMIT after an error
today. Fixing that would be a nice goal.

We should note though that the verifier is a per net-namespace value,
so if you get a transient writeback error on one inode, any inode that
is currently dirty will probably end up having its writes retransmitted
once you bump the verifier. That seems like an acceptable thing to do
in these scenarios, but we may need to consider making the verifier
more granular if that turns out to cause a lot of re-write activity
that isn't necessary.

> It's not clear to me though that the NFSv3 protocol can deal with
> the multi-client write scenario, since it is stateless. We are now
> making it stateful in some sense by preserving error state on the
> server across NFS requests, without having any sense of an open
> file in the protocol itself.

I think it's worthwhile to do the best we can here. WRITE/COMMIT are
inherently stateful to some degree in that they involve a verifier.
Trond's proposal is just utilizing that fact to ensure that we deliver
writeback errors more widely. I like it!

> Would an "approximation" without open state be good enough? I
> assume you are doing this to more fully support the FlexFiles
> layout type. Do you have any analysis or thought about this next
> step?
>
> I also echo Bruce's concern about whether the client implementations
> are up to snuff. There could be long-standing bugs or their protocol
> implementation could be missing parts. This is more curiosity than
> an objection, but maybe noting which client implementations you've
> tested with would be good.
>
> > > --b.
> > >
> > > > So in order to implement the above strategy, we first have to
> > > > do the following: split up the file cache to act per net
> > > > namespace, since the boot verifier is per net namespace. Then
> > > > add a helper to update the boot verifier.
> > > >
> > > > Trond Myklebust (3):
> > > >   nfsd: nfsd_file cache entries should be per net namespace
> > > >   nfsd: Support the server resetting the boot verifier
> > > >   nfsd: Don't garbage collect files that might contain write
> > > >     errors
> > > >
> > > >  fs/nfsd/export.c    |  2 +-
> > > >  fs/nfsd/filecache.c | 76 ++++++++++++++++++++++++++++++------
> > > >  fs/nfsd/filecache.h |  3 +-
> > > >  fs/nfsd/netns.h     |  4 +++
> > > >  fs/nfsd/nfs3xdr.c   | 13 +++++---
> > > >  fs/nfsd/nfs4proc.c  | 14 +++------
> > > >  fs/nfsd/nfsctl.c    |  1 +
> > > >  fs/nfsd/nfssvc.c    | 32 ++++++++++++++++++-
> > > >  8 files changed, 115 insertions(+), 30 deletions(-)
> > > >
> > > > --
> > > > 2.21.0
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@hammerspace.com
>
> --
> Chuck Lever
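In the spirit of the series' third patch, here is a sketch of a check at
the point where the file cache drops an entry: if an unreported error is
found there, no future COMMIT could ever deliver it, so the verifier
gets bumped (names again illustrative, not the actual patch):

#include <linux/fs.h>

/* Illustrative: called when the file cache is about to release an entry. */
static void nfsd_file_drop_sketch(struct nfsd_net_sketch *nn,
				  struct file *file)
{
	/*
	 * file_check_and_advance_wb_err() returns a negative errno if a
	 * writeback error occurred that this struct file has not yet
	 * reported.  Consuming the error here means no future COMMIT
	 * would ever see it, so simulate a reboot instead: bump the
	 * verifier and let clients retransmit their uncommitted writes.
	 */
	if (file_check_and_advance_wb_err(file) < 0)
		nfsd_bump_boot_verifier(nn);
}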
> On Aug 27, 2019, at 11:15 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Tue, 2019-08-27 at 10:59 -0400, bfields@fieldses.org wrote:
>> On Tue, Aug 27, 2019 at 10:58:19AM -0400, bfields@fieldses.org wrote:
>>> On Tue, Aug 27, 2019 at 02:53:01PM +0000, Trond Myklebust wrote:
>>>> The one problem is that the looping forever client can cause other
>>>> clients to loop forever on their otherwise successful writes on
>>>> other files.
>>>
>>> Yeah, that's the case I was wondering about.
>>>
>>>> That's bad, but again, that's due to client behaviour that is
>>>> toxic even today.
>>>
>>> So my worry was that if write errors are rare and the consequences
>>> of the single client looping forever are relatively mild, then
>>> there might be deployed clients that get away with that behavior.
>>>
>>> But maybe the behavior's a lot more "toxic" than I imagined, hence
>>> unlikely to be very common.
>>
>> (And, to be clear, I like the idea, just making sure I'm not
>> overlooking any problems....)
>
> I'm open to other suggestions, but I'm having trouble finding one that
> can scale correctly (i.e. not require per-client tracking), prevent
> silent corruption (by causing clients to miss errors), while not
> relying on optional features that may not be implemented by all NFSv3
> clients (e.g. per-file write verifiers are not implemented by *BSD).
>
> That said, it seems to me that to do nothing should not be an option,
> as that would imply tolerating silent corruption of file data.

Agree, we should move forward. I'm not saying "do nothing," I'm just
trying to understand what is improved and what is still left to do
(maybe nothing).

--
Chuck Lever
On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
> I'm open to other suggestions, but I'm having trouble finding one that
> can scale correctly (i.e. not require per-client tracking), prevent
> silent corruption (by causing clients to miss errors), while not
> relying on optional features that may not be implemented by all NFSv3
> clients (e.g. per-file write verifiers are not implemented by *BSD).
>
> That said, it seems to me that to do nothing should not be an option,
> as that would imply tolerating silent corruption of file data.

So should we increment the boot verifier every time we discover an
error on an asynchronous write?

--b.
On Wed, 2019-08-28 at 09:48 -0400, bfields@fieldses.org wrote:
> On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
> > I'm open to other suggestions, but I'm having trouble finding one
> > that can scale correctly (i.e. not require per-client tracking),
> > prevent silent corruption (by causing clients to miss errors), while
> > not relying on optional features that may not be implemented by all
> > NFSv3 clients (e.g. per-file write verifiers are not implemented by
> > *BSD).
> >
> > That said, it seems to me that to do nothing should not be an
> > option, as that would imply tolerating silent corruption of file
> > data.
>
> So should we increment the boot verifier every time we discover an
> error on an asynchronous write?

I think so. Otherwise, only one client will ever see that error.
> On Aug 28, 2019, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
>
> On Wed, 2019-08-28 at 09:48 -0400, bfields@fieldses.org wrote:
>> On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
>>> I'm open to other suggestions, but I'm having trouble finding one
>>> that can scale correctly (i.e. not require per-client tracking),
>>> prevent silent corruption (by causing clients to miss errors),
>>> while not relying on optional features that may not be implemented
>>> by all NFSv3 clients (e.g. per-file write verifiers are not
>>> implemented by *BSD).
>>>
>>> That said, it seems to me that to do nothing should not be an
>>> option, as that would imply tolerating silent corruption of file
>>> data.
>>
>> So should we increment the boot verifier every time we discover an
>> error on an asynchronous write?
>
> I think so. Otherwise, only one client will ever see that error.

+1

I'm not familiar with the details of how the Linux NFS server
implements the boot verifier: Will a verifier bump be effective
for all file systems that server exports? If so, is that an
acceptable cost?

--
Chuck Lever
On Wed, Aug 28, 2019 at 09:57:25AM -0400, Chuck Lever wrote:
> > On Aug 28, 2019, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> >
> > On Wed, 2019-08-28 at 09:48 -0400, bfields@fieldses.org wrote:
> > > On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
> > > > I'm open to other suggestions, but I'm having trouble finding
> > > > one that can scale correctly (i.e. not require per-client
> > > > tracking), prevent silent corruption (by causing clients to miss
> > > > errors), while not relying on optional features that may not be
> > > > implemented by all NFSv3 clients (e.g. per-file write verifiers
> > > > are not implemented by *BSD).
> > > >
> > > > That said, it seems to me that to do nothing should not be an
> > > > option, as that would imply tolerating silent corruption of file
> > > > data.
> > >
> > > So should we increment the boot verifier every time we discover an
> > > error on an asynchronous write?
> >
> > I think so. Otherwise, only one client will ever see that error.
>
> +1
>
> I'm not familiar with the details of how the Linux NFS server
> implements the boot verifier: Will a verifier bump be effective
> for all file systems that server exports?

Yes.  It will be per network namespace, but that's the only limit.

> If so, is that an acceptable cost?

It means clients will resend all their uncommitted writes.  That could
certainly make write errors more expensive.  But maybe you've already
got bigger problems if you've got a full or failing disk?

--b.
> On Aug 28, 2019, at 10:00 AM, J. Bruce Fields <bfields@redhat.com> wrote:
>
> On Wed, Aug 28, 2019 at 09:57:25AM -0400, Chuck Lever wrote:
>>
>>> On Aug 28, 2019, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
>>>
>>> On Wed, 2019-08-28 at 09:48 -0400, bfields@fieldses.org wrote:
>>>> On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
>>>>> I'm open to other suggestions, but I'm having trouble finding one
>>>>> that can scale correctly (i.e. not require per-client tracking),
>>>>> prevent silent corruption (by causing clients to miss errors),
>>>>> while not relying on optional features that may not be
>>>>> implemented by all NFSv3 clients (e.g. per-file write verifiers
>>>>> are not implemented by *BSD).
>>>>>
>>>>> That said, it seems to me that to do nothing should not be an
>>>>> option, as that would imply tolerating silent corruption of file
>>>>> data.
>>>>
>>>> So should we increment the boot verifier every time we discover an
>>>> error on an asynchronous write?
>>>
>>> I think so. Otherwise, only one client will ever see that error.
>>
>> +1
>>
>> I'm not familiar with the details of how the Linux NFS server
>> implements the boot verifier: Will a verifier bump be effective
>> for all file systems that server exports?
>
> Yes.  It will be per network namespace, but that's the only limit.
>
>> If so, is that an acceptable cost?
>
> It means clients will resend all their uncommitted writes.  That could
> certainly make write errors more expensive.  But maybe you've already
> got bigger problems if you've got a full or failing disk?

One full filesystem will impact the behavior of all other exported
filesystems. That might be surprising behavior to a server
administrator. I don't have any suggestions other than maintaining a
separate verifier for each exported filesystem in each net namespace.

--
Chuck Lever
On Wed, 2019-08-28 at 10:03 -0400, Chuck Lever wrote:
> > On Aug 28, 2019, at 10:00 AM, J. Bruce Fields <bfields@redhat.com> wrote:
> >
> > On Wed, Aug 28, 2019 at 09:57:25AM -0400, Chuck Lever wrote:
> > >
> > > > On Aug 28, 2019, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > > >
> > > > On Wed, 2019-08-28 at 09:48 -0400, bfields@fieldses.org wrote:
> > > > > On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
> > > > > > I'm open to other suggestions, but I'm having trouble
> > > > > > finding one that can scale correctly (i.e. not require
> > > > > > per-client tracking), prevent silent corruption (by causing
> > > > > > clients to miss errors), while not relying on optional
> > > > > > features that may not be implemented by all NFSv3 clients
> > > > > > (e.g. per-file write verifiers are not implemented by *BSD).
> > > > > >
> > > > > > That said, it seems to me that to do nothing should not be
> > > > > > an option, as that would imply tolerating silent corruption
> > > > > > of file data.
> > > > >
> > > > > So should we increment the boot verifier every time we
> > > > > discover an error on an asynchronous write?
> > > >
> > > > I think so. Otherwise, only one client will ever see that error.
> > >
> > > +1
> > >
> > > I'm not familiar with the details of how the Linux NFS server
> > > implements the boot verifier: Will a verifier bump be effective
> > > for all file systems that server exports?
> >
> > Yes.  It will be per network namespace, but that's the only limit.
> >
> > > If so, is that an acceptable cost?
> >
> > It means clients will resend all their uncommitted writes.  That
> > could certainly make write errors more expensive.  But maybe you've
> > already got bigger problems if you've got a full or failing disk?
>
> One full filesystem will impact the behavior of all other exported
> filesystems. That might be surprising behavior to a server
> administrator. I don't have any suggestions other than maintaining a
> separate verifier for each exported filesystem in each net namespace.

Yeah, it's not pretty, but I think the alternative is worse. Most
admins would take rotten performance over data corruption.

For the most part, these sorts of errors tend to be rare. If it turns
out to be a problem we could consider moving the verifier into
svc_export or something?
> On Aug 28, 2019, at 10:16 AM, Jeff Layton <jlayton@redhat.com> wrote:
>
> On Wed, 2019-08-28 at 10:03 -0400, Chuck Lever wrote:
>>> On Aug 28, 2019, at 10:00 AM, J. Bruce Fields <bfields@redhat.com> wrote:
>>>
>>> On Wed, Aug 28, 2019 at 09:57:25AM -0400, Chuck Lever wrote:
>>>>
>>>>> On Aug 28, 2019, at 9:51 AM, Jeff Layton <jlayton@redhat.com> wrote:
>>>>>
>>>>> On Wed, 2019-08-28 at 09:48 -0400, bfields@fieldses.org wrote:
>>>>>> On Tue, Aug 27, 2019 at 03:15:35PM +0000, Trond Myklebust wrote:
>>>>>>> I'm open to other suggestions, but I'm having trouble finding
>>>>>>> one that can scale correctly (i.e. not require per-client
>>>>>>> tracking), prevent silent corruption (by causing clients to
>>>>>>> miss errors), while not relying on optional features that may
>>>>>>> not be implemented by all NFSv3 clients (e.g. per-file write
>>>>>>> verifiers are not implemented by *BSD).
>>>>>>>
>>>>>>> That said, it seems to me that to do nothing should not be an
>>>>>>> option, as that would imply tolerating silent corruption of
>>>>>>> file data.
>>>>>>
>>>>>> So should we increment the boot verifier every time we discover
>>>>>> an error on an asynchronous write?
>>>>>
>>>>> I think so. Otherwise, only one client will ever see that error.
>>>>
>>>> +1
>>>>
>>>> I'm not familiar with the details of how the Linux NFS server
>>>> implements the boot verifier: Will a verifier bump be effective
>>>> for all file systems that server exports?
>>>
>>> Yes.  It will be per network namespace, but that's the only limit.
>>>
>>>> If so, is that an acceptable cost?
>>>
>>> It means clients will resend all their uncommitted writes.  That
>>> could certainly make write errors more expensive.  But maybe you've
>>> already got bigger problems if you've got a full or failing disk?
>>
>> One full filesystem will impact the behavior of all other exported
>> filesystems. That might be surprising behavior to a server
>> administrator. I don't have any suggestions other than maintaining a
>> separate verifier for each exported filesystem in each net namespace.
>
> Yeah, it's not pretty, but I think the alternative is worse. Most
> admins would take rotten performance over data corruption.

Again, I'm not saying we should do nothing. It doesn't seem like a
per-export verifier would be much more work than a per-net-namespace
verifier.

> For the most part, these sorts of errors tend to be rare.

EIO is certainly going to be rare, agreed. ENOSPC might not be.

> If it turns out to be a problem we could consider moving the verifier
> into svc_export or something?

--
Chuck Lever
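For comparison, a per-export verifier along the lines Chuck suggests
might look roughly like this; svc_export carries no such field today,
so this is purely a sketch of the idea:

#include <linux/types.h>
#include <linux/seqlock.h>
#include <linux/time64.h>
#include <linux/timekeeping.h>

/* Hypothetical: one write verifier per exported filesystem. */
struct export_verf_sketch {
	u32		verf[2];	/* returned in WRITE/COMMIT replies */
	seqlock_t	lock;		/* guards updates to verf */
};

/* Bump only this export's verifier on a writeback error. */
static void export_bump_verifier(struct export_verf_sketch *ev)
{
	struct timespec64 now;

	ktime_get_real_ts64(&now);
	write_seqlock(&ev->lock);
	ev->verf[0] = (u32)now.tv_sec;
	ev->verf[1] = (u32)now.tv_nsec;
	write_sequnlock(&ev->lock);
}

Bumping it would only force re-writes against the one export, but, as
the next message points out, that only helps if clients actually track
verifiers at per-export granularity or finer.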
On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
> For the most part, these sorts of errors tend to be rare. If it turns
> out to be a problem we could consider moving the verifier into
> svc_export or something?

As Trond says, this isn't really a server implementation issue, it's a
protocol issue.  If a client detects when to resend writes by storing a
single verifier per server, then returning different verifiers from
writes to different exports will have it resending every time it writes
to one export then another.

--b.
On Wed, Aug 28, 2019 at 10:40:31AM -0400, J. Bruce Fields wrote:
> On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
> > For the most part, these sorts of errors tend to be rare. If it
> > turns out to be a problem we could consider moving the verifier into
> > svc_export or something?
>
> As Trond says, this isn't really a server implementation issue, it's a
> protocol issue.  If a client detects when to resend writes by storing
> a single verifier per server, then returning different verifiers from
> writes to different exports will have it resending every time it
> writes to one export then another.

The other way to limit this is by trying to detect the single-writer
case, since it's probably also rare for multiple clients to write to
the same file.

Are there any common cases where two clients share the same TCP
connection?  Maybe that would be a conservative enough test?

And you wouldn't have to track every connection that's ever sent a
WRITE to a given file.  Just remember the first one, with a flag to
track whether anyone else has ever written to it.

??

--b.
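Bruce's first-writer heuristic needs very little per-file state.
Nothing like this exists in nfsd, so the sketch below is entirely
hypothetical:

#include <linux/socket.h>
#include <linux/string.h>
#include <linux/types.h>

/* Hypothetical per-file tracking for the single-writer heuristic. */
struct write_origin_sketch {
	struct sockaddr_storage	first_writer;	/* source of the first WRITE */
	bool			shared;		/* has anyone else written? */
};

/* Record a WRITE's source address; detect a second writer. */
static void note_writer_sketch(struct write_origin_sketch *wo,
			       const struct sockaddr_storage *addr)
{
	/*
	 * A real implementation would compare family-specific fields
	 * rather than raw bytes; this is just the shape of the idea.
	 */
	if (wo->first_writer.ss_family == AF_UNSPEC)
		memcpy(&wo->first_writer, addr, sizeof(*addr));
	else if (memcmp(&wo->first_writer, addr, sizeof(*addr)) != 0)
		wo->shared = true;	/* fall back to bumping the verifier */
}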
> On Aug 28, 2019, at 10:48 AM, Bruce Fields <bfields@fieldses.org> wrote:
>
> On Wed, Aug 28, 2019 at 10:40:31AM -0400, J. Bruce Fields wrote:
>> On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
>>> For the most part, these sorts of errors tend to be rare. If it turns
>>> out to be a problem we could consider moving the verifier into
>>> svc_export or something?
>>
>> As Trond says, this isn't really a server implementation issue, it's a
>> protocol issue.  If a client detects when to resend writes by storing a
>> single verifier per server, then returning different verifiers from
>> writes to different exports will have it resending every time it writes
>> to one export then another.
>
> The other way to limit this is by trying to detect the single-writer
> case, since it's probably also rare for multiple clients to write to the
> same file.
>
> Are there any common cases where two clients share the same TCP
> connection?  Maybe that would be a conservative enough test?

What happens if a client reconnects? Does that bump the verifier? If it
reconnects from the same source port, can the server tell that the
connection is different?

> And you wouldn't have to track every connection that's ever sent a WRITE
> to a given file.  Just remember the first one, with a flag to track
> whether anyone else has ever written to it.
>
> ??
>
> --b.

--
Chuck Lever
On Wed, 2019-08-28 at 10:40 -0400, J. Bruce Fields wrote:
> On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
> > For the most part, these sorts of errors tend to be rare. If it
> > turns out to be a problem we could consider moving the verifier into
> > svc_export or something?
>
> As Trond says, this isn't really a server implementation issue, it's a
> protocol issue. If a client detects when to resend writes by storing a
> single verifier per server, then returning different verifiers from
> writes to different exports will have it resending every time it
> writes to one export then another.

Huh. I've always considered the verifier to be a per-inode quantity
that we just happened to fill with a global value, but the spec _is_ a
bit vague in this regard.

I think modern Linux clients track this on a per-inode basis, but I'm
not sure about really old Linux or other clients. It'd be good to know
about BSD and Solaris, in particular.
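Schematically, per-inode tracking on the client just means keeping the
8-byte write verifier with the inode's uncommitted-write state rather
than one copy per server; a sketch of the comparison, not any particular
client's code:

#include <linux/string.h>
#include <linux/types.h>

#define WRITEVERF_SIZE 8	/* NFS3_WRITEVERFSIZE in the Linux headers */

/* Schematic client-side, per-inode verifier bookkeeping. */
struct inode_commit_sketch {
	u8	verf[WRITEVERF_SIZE];	/* from the first unstable WRITE */
	bool	have_verf;
};

/* A changed verifier in a WRITE/COMMIT reply means: resend uncommitted data. */
static bool verifier_changed_sketch(struct inode_commit_sketch *st,
				    const u8 *seen)
{
	if (!st->have_verf) {
		memcpy(st->verf, seen, WRITEVERF_SIZE);
		st->have_verf = true;
		return false;
	}
	return memcmp(st->verf, seen, WRITEVERF_SIZE) != 0;
}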
J. Bruce Fields wrote:
> On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
> > For the most part, these sorts of errors tend to be rare. If it
> > turns out to be a problem we could consider moving the verifier into
> > svc_export or something?
>
> As Trond says, this isn't really a server implementation issue, it's a
> protocol issue. If a client detects when to resend writes by storing a
> single verifier per server, then returning different verifiers from
> writes to different exports will have it resending every time it
> writes to one export then another.

Well, here's what RFC-1813 says about the write verifier:

   This cookie must be consistent during a single instance of the NFS
   version 3 protocol service and must be unique between instances of
   the NFS version 3 protocol server, where uncommitted data may be
   lost.

You could debate what "service" means in the above, but I'd say it
isn't "per file".

However, since there is no way for an NFSv3 client to know what a
"server" is, the FreeBSD client (and maybe the other *BSD, although I
haven't been involved in them for a long time) stores the write
verifier "per mount".
--> So, for the FreeBSD client, it is at the granularity of what the
    NFSv3 client sees as a single file system.  (Typically a single
    file system on the server, unless the knfsd plays the game of
    coalescing multiple file systems by uniquifying the I-node#, etc.)

It is conceivable that some other NFSv3 client might assume "same
server IP address --> same server" and store it "per server IP", but I
have no idea if any such client exists.

rick
On Wed, 2019-08-28 at 15:12 +0000, Rick Macklem wrote:
> J. Bruce Fields wrote:
> > On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
> > > For the most part, these sorts of errors tend to be rare. If it
> > > turns out to be a problem we could consider moving the verifier
> > > into svc_export or something?
> > 
> > As Trond says, this isn't really a server implementation issue,
> > it's a protocol issue. If a client detects when to resend writes by
> > storing a single verifier per server, then returning different
> > verifiers from writes to different exports will have it resending
> > every time it writes to one export then another.
> 
> Well, here's what RFC-1813 says about the write verifier:
>       This cookie must be consistent during a single instance
>       of the NFS version 3 protocol service and must be
>       unique between instances of the NFS version 3 protocol
>       server, where uncommitted data may be lost.
> You could debate what "service" means in the above, but I'd say it
> isn't "per file".
> 
> However, since there is no way for an NFSv3 client to know what a
> "server" is, the FreeBSD client (and maybe the other *BSD, although I
> haven't been involved in them for a long time) stores the write
> verifier "per mount".
> --> So, for the FreeBSD client, it is at the granularity of what the
>     NFSv3 client sees as a single file system. (Typically a single
>     file system on the server unless the knfsd plays the game of
>     coalescing multiple file systems by uniquifying the I-node#, etc.)
> 
> It is conceivable that some other NFSv3 client might assume
> "same server IP address --> same server" and store it "per server
> IP", but I have no idea if any such client exists.

The patchsets we've been discussing should all be compatible with
FreeBSD, as they implement a per-server boot verifier (as in: different
containers representing different knfsd servers on different networks
are allowed to have different boot verifiers, but each server applies
its boot verifier globally to all exported filesystems).

We could make these boot verifiers be per filesystem to reduce the
frequency of these 'server reboots', but the expectation is that
filesystem errors on COMMIT are rare.
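In outline, the per-namespace verifier and reset helper Trond describes
could look like the sketch below. The structure and function names are
stand-ins for illustration, not necessarily what the patches use, and
real code would have to serialize the update against concurrent readers
encoding WRITE and COMMIT replies.

#include <stdint.h>

/* Hypothetical stand-in for the per-net-namespace nfsd state. */
struct nfsd_net_sketch {
	uint32_t boot_verf[2];	/* handed out in every WRITE/COMMIT reply */
};

/*
 * Make the verifier change so that clients treat the server as having
 * rebooted and retransmit their uncommitted writes.
 */
static void reset_boot_verifier(struct nfsd_net_sketch *nn)
{
	nn->boot_verf[1]++;	/* any change suffices; a bump is simplest */
}

Because the verifier lives in per-namespace state, one container
"rebooting" in this sense leaves clients of every other knfsd instance
on the host undisturbed.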
On Wed, Aug 28, 2019 at 03:12:26PM +0000, Rick Macklem wrote:
> J. Bruce Fields wrote:
> >On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
> >> For the most part, these sorts of errors tend to be rare. If it turns
> >> out to be a problem we could consider moving the verifier into
> >> svc_export or something?
> >
> >As Trond says, this isn't really a server implementation issue, it's a
> >protocol issue. If a client detects when to resend writes by storing a
> >single verifier per server, then returning different verifiers from
> >writes to different exports will have it resending every time it writes
> >to one export then another.
> 
> Well, here's what RFC-1813 says about the write verifier:
>       This cookie must be consistent during a single instance
>       of the NFS version 3 protocol service and must be
>       unique between instances of the NFS version 3 protocol
>       server, where uncommitted data may be lost.
> You could debate what "service" means in the above, but I'd say it
> isn't "per file".
> 
> However, since there is no way for an NFSv3 client to know what a
> "server" is, the FreeBSD client (and maybe the other *BSD, although I
> haven't been involved in them for a long time) stores the write
> verifier "per mount".
> --> So, for the FreeBSD client, it is at the granularity of what the
>     NFSv3 client sees as a single file system. (Typically a single
>     file system on the server unless the knfsd plays the game of
>     coalescing multiple file systems by uniquifying the I-node#, etc.)

Oh, well, it would be nice if we could at least count on
per-filesystem.

> It is conceivable that some other NFSv3 client might assume "same
> server IP address --> same server" and store it "per server IP", but I
> have no idea if any such client exists.

Hm.

--b.
On Wed, Aug 28, 2019 at 10:50:30AM -0400, Chuck Lever wrote:
> > On Aug 28, 2019, at 10:48 AM, Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > On Wed, Aug 28, 2019 at 10:40:31AM -0400, J. Bruce Fields wrote:
> >> On Wed, Aug 28, 2019 at 10:16:09AM -0400, Jeff Layton wrote:
> >>> For the most part, these sorts of errors tend to be rare. If it turns
> >>> out to be a problem we could consider moving the verifier into
> >>> svc_export or something?
> >> 
> >> As Trond says, this isn't really a server implementation issue, it's a
> >> protocol issue. If a client detects when to resend writes by storing a
> >> single verifier per server, then returning different verifiers from
> >> writes to different exports will have it resending every time it writes
> >> to one export then another.
> > 
> > The other way to limit this is by trying to detect the single-writer
> > case, since it's probably also rare for multiple clients to write to the
> > same file.
> > 
> > Are there any common cases where two clients share the same TCP
> > connection? Maybe that would be a conservative enough test?
> 
> What happens if a client reconnects? Does that bump the verifier?
> 
> If it reconnects from the same source port, can the server tell
> that the connection is different?

I assume it creates a new socket structure. So, you could key off
either that or the (address, port), depending on whether you'd rather
risk unnecessarily incrementing the verifier after a reconnection and
write error, or risk failing to increment due to port reuse.

There's some precedent for taking the latter choice in the DRC. But
the DRC also includes some other stuff (xid, hash) to minimize the
chance of a collision.

If you took the former choice you wouldn't literally want to key off a
pointer to the socket, as it'd cause complications when freeing the
socket. But maybe you could add some kind of birth time or generation
number to struct svc_sock to help differentiate?

--b.

> > And you wouldn't have to track every connection that's ever sent a WRITE
> > to a given file. Just remember the first one, with a flag to track
> > whether anyone else has ever written to it.
> > 
> > ??
> > 
> > --b.
> 
> --
> Chuck Lever
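Bruce's generation-number alternative might look roughly like the
sketch below. Every name here is hypothetical: struct svc_sock carries
no such field today, this is only a proposal from the thread, and a
real counter would need to be atomic.

#include <stdint.h>

static uint64_t sock_generation;	/* global; atomic in real code */

/* Hypothetical stand-in for the transport socket structure. */
struct svc_sock_sketch {
	uint64_t sk_generation;	/* assigned once when the socket is set up */
};

/* Stamp a newly created socket with a unique generation value. */
static void svc_sock_stamp(struct svc_sock_sketch *svsk)
{
	svsk->sk_generation = ++sock_generation;
}

/* Two writes came from "the same connection" only if the stamps match. */
static int same_connection(uint64_t a, uint64_t b)
{
	return a == b;
}

Storing the stamp rather than the socket pointer sidesteps the
lifetime problem Bruce raises: a stale generation value is harmless,
whereas a dangling pointer to a freed socket is not.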