Message ID | 7e97897a29878a56236ef8e15bce7a295d5e8a41.1676403514.git.bcodding@redhat.com (mailing list archive)
---|---
State | New, archived
Series | NFSv4: Ensure we revalidate data after OPEN expired
On Tue, 2023-02-14 at 14:39 -0500, Benjamin Coddington wrote:
> We've observed that if the NFS client experiences a network partition and
> the server revokes the client's state, the client may not revalidate
> cached data for an open file during recovery. If the file is extended by
> a second client during this network partition, the first client will
> correctly update the file's size and attributes during recovery, but
> another extending write will discard the second client's data.

I'm having trouble fully understanding your problem description. Is the
issue that both clients are opening the same file with something like
O_WRONLY|O_DSYNC|O_APPEND?

If so, what if the network partition happens during the write() system
call of client 1, so that the page cache is updated but the flush of the
write data ends up being delayed by the partition? In that case, client 2
doesn't know that client 1 has writes outstanding, so it may write its
data to where the server thinks the eof offset is. However, once client 1
is able to recover its open state, it will still have dirty page cache
data that is going to overwrite that same offset.

> In the case where another client opened the file during the network
> partition and the server revoked the first client's state, the recovery
> can forego optimizations and instead attempt to avoid corruption.
>
> It's a little tricky to solve this in a per-file way during recovery
> without plumbing arguments or introducing new flags. This patch
> side-steps the per-file complexity by simply checking if the client is
> within a NOGRACE recovery window, and if so, invalidates data during the
> open recovery.

I don't see how this scenario can ever be made fully safe. If people care,
then we should probably have the open recovery of client 1 fail altogether
in this case (subject to some switch similar to the existing
'recover_lost_locks' kernel option).
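For concreteness, the append pattern in question would look something like
the sketch below. This is illustrative only: the mount path and record
contents are invented here, and the open flags are the ones suggested
above.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	static const char buf[] = "record from this client\n";
	/* /mnt/nfs/shared.log is an assumed path on an NFSv4 mount */
	int fd = open("/mnt/nfs/shared.log", O_WRONLY | O_DSYNC | O_APPEND);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* With O_APPEND the client writes at what it believes is EOF.
	 * If a network partition delays the flush while another client
	 * appends, both clients can target the same offset. */
	if (write(fd, buf, sizeof(buf) - 1) < 0)
		perror("write");
	close(fd);
	return 0;
}

If both clients run this in a loop across the partition, the EOF offsets
they compute can collide in the way described above.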
On 14 Feb 2023, at 15:30, Trond Myklebust wrote:

> On Tue, 2023-02-14 at 14:39 -0500, Benjamin Coddington wrote:
>> We've observed that if the NFS client experiences a network partition
>> and the server revokes the client's state, the client may not
>> revalidate cached data for an open file during recovery. If the file is
>> extended by a second client during this network partition, the first
>> client will correctly update the file's size and attributes during
>> recovery, but another extending write will discard the second client's
>> data.
>
> I'm having trouble fully understanding your problem description. Is the
> issue that both clients are opening the same file with something like
> O_WRONLY|O_DSYNC|O_APPEND?

Yes.

> If so, what if the network partition happens during the write() system
> call of client 1, so that the page cache is updated but the flush of the
> write data ends up being delayed by the partition? In that case, client
> 2 doesn't know that client 1 has writes outstanding, so it may write its
> data to where the server thinks the eof offset is. However, once client
> 1 is able to recover its open state, it will still have dirty page cache
> data that is going to overwrite that same offset.

Ah, yes. :(

In this case we might be safe saying that close-to-open consistency is
preserved, though, if the second client has closed the file. At least
client 1 will have the data laid out in the file the way it expected.

>> In the case where another client opened the file during the network
>> partition and the server revoked the first client's state, the recovery
>> can forego optimizations and instead attempt to avoid corruption.
>>
>> It's a little tricky to solve this in a per-file way during recovery
>> without plumbing arguments or introducing new flags. This patch
>> side-steps the per-file complexity by simply checking if the client is
>> within a NOGRACE recovery window, and if so, invalidates data during
>> the open recovery.
>
> I don't see how this scenario can ever be made fully safe. If people
> care, then we should probably have the open recovery of client 1 fail
> altogether in this case (subject to some switch similar to the existing
> 'recover_lost_locks' kernel option).

It's quite a corner case. I actually don't have a broken setup yet. We
noticed this behavior appear in a regression test now that knfsd will
both:

1) give a read delegation for RW open, and
2) forget client state if a delegation callback times out.

Otherwise, without the delegation, the server doesn't send the client back
through recovery and the appending open/write happens at the correct eof.

Maybe that server behavior will trigger some real problem reports, and I
should look into making the server a little nicer to the client for a
short partition.

Ben
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 40d749f29ed3..b9e1e817a723 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2604,7 +2604,9 @@ static int nfs4_run_open_task(struct nfs4_opendata *data,
 static int _nfs4_recover_proc_open(struct nfs4_opendata *data)
 {
+	struct inode *inode = d_inode(data->dentry);
 	struct inode *dir = d_inode(data->dir);
+	struct nfs_openargs *o_arg = &data->o_arg;
 	struct nfs_openres *o_res = &data->o_res;
 	int status;
 
@@ -2614,6 +2616,12 @@ static int _nfs4_recover_proc_open(struct nfs4_opendata *data)
 	nfs_fattr_map_and_free_names(NFS_SERVER(dir), &data->f_attr);
 
+	if (test_bit(NFS4CLNT_RECLAIM_NOGRACE, &o_arg->server->nfs_client->cl_state)) {
+		spin_lock(&inode->i_lock);
+		nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
+		spin_unlock(&inode->i_lock);
+	}
+
 	if (o_res->rflags & NFS4_OPEN_RESULT_CONFIRM)
 		status = _nfs4_proc_open_confirm(data);
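For readers less familiar with the client recovery flags, here is the new
hunk again with explanatory comments. The comments are an interpretation
of the intent, not text from the patch itself:

	/* In _nfs4_recover_proc_open(), before replaying the OPEN:
	 * NFS4CLNT_RECLAIM_NOGRACE is set when the client must reclaim
	 * state outside the server's grace period, i.e. the server has
	 * revoked or forgotten this client's state. Another client may
	 * have modified the file in the meantime, so cached pages can't
	 * be trusted. */
	if (test_bit(NFS4CLNT_RECLAIM_NOGRACE,
		     &o_arg->server->nfs_client->cl_state)) {
		/* nfs_set_cache_invalid() is called with i_lock held. */
		spin_lock(&inode->i_lock);
		/* Mark cached file data stale so it is re-read from the
		 * server instead of being written back over another
		 * client's bytes. */
		nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
		spin_unlock(&inode->i_lock);
	}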
We've observed that if the NFS client experiences a network partition and
the server revokes the client's state, the client may not revalidate cached
data for an open file during recovery. If the file is extended by a second
client during this network partition, the first client will correctly
update the file's size and attributes during recovery, but another
extending write will discard the second client's data.

In the case where another client opened the file during the network
partition and the server revoked the first client's state, the recovery can
forego optimizations and instead attempt to avoid corruption.

It's a little tricky to solve this in a per-file way during recovery
without plumbing arguments or introducing new flags. This patch side-steps
the per-file complexity by simply checking if the client is within a
NOGRACE recovery window, and if so, invalidates data during the open
recovery.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/nfs4proc.c | 8 ++++++++
 1 file changed, 8 insertions(+)