Message ID | 1557115023-86769-1-git-send-email-zhangxiaoxu5@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | NFS4: Fix v4.0 client state corruption when mount | expand |
ping. On 5/6/2019 11:57 AM, ZhangXiaoxu wrote: > stat command with soft mount never return after server is stopped. > > When alloc a new client, the state of the client will be set to > NFS4CLNT_LEASE_EXPIRED. > > When the server is stopped, the state manager will work, and accord > the state to recover. But the state is NFS4CLNT_LEASE_EXPIRED, it > will drain the slot table and lead other task to wait queue, until > the client recovered. Then the stat command is hung. > > When discover server trunking, the client will renew the lease, > but check the client state, it lead the client state corruption. > > So, we need to call state manager to recover it when detect server > ip trunking. > > Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> > --- > fs/nfs/nfs4state.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c > index 3de3647..f502f1c 100644 > --- a/fs/nfs/nfs4state.c > +++ b/fs/nfs/nfs4state.c > @@ -159,6 +159,10 @@ int nfs40_discover_server_trunking(struct nfs_client *clp, > /* Sustain the lease, even if it's empty. If the clientid4 > * goes stale it's of no use for trunking discovery. */ > nfs4_schedule_state_renewal(*result); > + > + /* If the client state need to recover, do it. */ > + if (clp->cl_state) > + nfs4_schedule_state_manager(clp); > } > out: > return status; >
Hi ZhangXiaoxu, I'm having a bit of trouble with this fix (which went upstream in f02f3755dbd14fb935d24b14650fff9ba92243b8). Since this change, my client calls SETCLIENTID/SETCLIENTID_CONFIRM twice in quick succession on mount, and the second SETCLIENTID_CONFIRM sent by the state manager can sometimes have the same verifier sent back by the first SETCLIENTID's response. I think we're missing a memory barrier somewhere.. But, I do not understand how the client was able to corrupt the state before this patch, and I don't understand how the patch fixes state corruption. Can anyone enlighten me as to how we were corrupting state here? Ben On 5 May 2019, at 23:57, ZhangXiaoxu wrote: > stat command with soft mount never return after server is stopped. > > When alloc a new client, the state of the client will be set to > NFS4CLNT_LEASE_EXPIRED. > > When the server is stopped, the state manager will work, and accord > the state to recover. But the state is NFS4CLNT_LEASE_EXPIRED, it > will drain the slot table and lead other task to wait queue, until > the client recovered. Then the stat command is hung. > > When discover server trunking, the client will renew the lease, > but check the client state, it lead the client state corruption. > > So, we need to call state manager to recover it when detect server > ip trunking. > > Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> > --- > fs/nfs/nfs4state.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c > index 3de3647..f502f1c 100644 > --- a/fs/nfs/nfs4state.c > +++ b/fs/nfs/nfs4state.c > @@ -159,6 +159,10 @@ int nfs40_discover_server_trunking(struct > nfs_client *clp, > /* Sustain the lease, even if it's empty. If the clientid4 > * goes stale it's of no use for trunking discovery. */ > nfs4_schedule_state_renewal(*result); > + > + /* If the client state need to recover, do it. */ > + if (clp->cl_state) > + nfs4_schedule_state_manager(clp); > } > out: > return status; > -- > 2.7.4
在 2019/11/7 0:47, Benjamin Coddington 写道: > Hi ZhangXiaoxu, > > I'm having a bit of trouble with this fix (which went upstream in > f02f3755dbd14fb935d24b14650fff9ba92243b8). > > Since this change, my client calls SETCLIENTID/SETCLIENTID_CONFIRM twice in > quick succession on mount, and the second SETCLIENTID_CONFIRM sent by the state > manager can sometimes have the same verifier sent back by the first > SETCLIENTID's response. I think we're missing a memory barrier somewhere.. nfs40_discover_server_trunking nfs4_proc_setclientid # the first time after nfs4_schedule_state_manager, the state manager: nfs4_run_state_manager nfs4_state_manager # 'nfs4_alloc_client' init state to NFS4CLNT_LEASE_EXPIRED nfs4_reclaim_lease nfs4_establish_lease nfs4_init_clientid nfs4_proc_setclientid # the second time. > > But, I do not understand how the client was able to corrupt the state before > this patch, and I don't understand how the patch fixes state corruption. > > Can anyone enlighten me as to how we were corrupting state here? when 'nfs4_alloc_client', the client state initialized with 'NFS4CLNT_LEASE_EXPIRED', So, we should recover it when the client init. After the first setclientid, maybe we should clear the 'NFS4CLNT_LEASE_EXPIRED', then the state manager won't be called it again. > > Ben > > On 5 May 2019, at 23:57, ZhangXiaoxu wrote: > >> stat command with soft mount never return after server is stopped. >> >> When alloc a new client, the state of the client will be set to >> NFS4CLNT_LEASE_EXPIRED. >> >> When the server is stopped, the state manager will work, and accord >> the state to recover. But the state is NFS4CLNT_LEASE_EXPIRED, it >> will drain the slot table and lead other task to wait queue, until >> the client recovered. Then the stat command is hung. >> >> When discover server trunking, the client will renew the lease, >> but check the client state, it lead the client state corruption. >> >> So, we need to call state manager to recover it when detect server >> ip trunking. >> >> Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> >> --- >> fs/nfs/nfs4state.c | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c >> index 3de3647..f502f1c 100644 >> --- a/fs/nfs/nfs4state.c >> +++ b/fs/nfs/nfs4state.c >> @@ -159,6 +159,10 @@ int nfs40_discover_server_trunking(struct nfs_client *clp, >> /* Sustain the lease, even if it's empty. If the clientid4 >> * goes stale it's of no use for trunking discovery. */ >> nfs4_schedule_state_renewal(*result); >> + >> + /* If the client state need to recover, do it. */ >> + if (clp->cl_state) >> + nfs4_schedule_state_manager(clp); >> } >> out: >> return status; >> -- >> 2.7.4 > > > .
On 6 Nov 2019, at 21:34, zhangxiaoxu (A) wrote: > 在 2019/11/7 0:47, Benjamin Coddington 写道: >> Hi ZhangXiaoxu, >> >> I'm having a bit of trouble with this fix (which went upstream in >> f02f3755dbd14fb935d24b14650fff9ba92243b8). >> >> Since this change, my client calls SETCLIENTID/SETCLIENTID_CONFIRM >> twice in >> quick succession on mount, and the second SETCLIENTID_CONFIRM sent by >> the state >> manager can sometimes have the same verifier sent back by the first >> SETCLIENTID's response. I think we're missing a memory barrier >> somewhere.. > > nfs40_discover_server_trunking > nfs4_proc_setclientid # the first time > > after nfs4_schedule_state_manager, the state manager: > nfs4_run_state_manager > nfs4_state_manager > # 'nfs4_alloc_client' init state to NFS4CLNT_LEASE_EXPIRED > nfs4_reclaim_lease > nfs4_establish_lease > nfs4_init_clientid > nfs4_proc_setclientid # the second time. > >> >> But, I do not understand how the client was able to corrupt the state >> before >> this patch, and I don't understand how the patch fixes state >> corruption. >> >> Can anyone enlighten me as to how we were corrupting state here? > when 'nfs4_alloc_client', the client state initialized with > 'NFS4CLNT_LEASE_EXPIRED', > So, we should recover it when the client init. Why? The commit message asserts that if we don't, there will be corruption of the client's state. How can the client's state be corrupted? > After the first setclientid, maybe we should clear the > 'NFS4CLNT_LEASE_EXPIRED', then > the state manager won't be called it again. What about just never setting it in nfs4_alloc_client? It has been set there since 286d7d6a0cb, and that commit needed it because it changed the paths so that the only way to ever send a SETCLIENTID was through state recovery. Today, we have two paths - the trunking discovery for new clients as well as state recovery, and all new clients will go through trunking discovery. Removing the call to kick the state manager thread will fix the problem that the update to cl_confirm happens after the state manager has already started doing its own SETCLIENTID dance. Ben
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c index 3de3647..f502f1c 100644 --- a/fs/nfs/nfs4state.c +++ b/fs/nfs/nfs4state.c @@ -159,6 +159,10 @@ int nfs40_discover_server_trunking(struct nfs_client *clp, /* Sustain the lease, even if it's empty. If the clientid4 * goes stale it's of no use for trunking discovery. */ nfs4_schedule_state_renewal(*result); + + /* If the client state need to recover, do it. */ + if (clp->cl_state) + nfs4_schedule_state_manager(clp); } out: return status;
stat command with soft mount never return after server is stopped. When alloc a new client, the state of the client will be set to NFS4CLNT_LEASE_EXPIRED. When the server is stopped, the state manager will work, and accord the state to recover. But the state is NFS4CLNT_LEASE_EXPIRED, it will drain the slot table and lead other task to wait queue, until the client recovered. Then the stat command is hung. When discover server trunking, the client will renew the lease, but check the client state, it lead the client state corruption. So, we need to call state manager to recover it when detect server ip trunking. Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> --- fs/nfs/nfs4state.c | 4 ++++ 1 file changed, 4 insertions(+)