Message ID | 20210603181438.109851-1-dai.ngo@oracle.com (mailing list archive)
---|---
State | New, archived
Series | [RFC,1/1] nfsd: Initial implementation of NFSv4 Courteous Server
Hi Bruce,

Just a reminder that this RFC still needs a review.

Thanks,
-Dai

On 6/3/21 11:14 AM, Dai Ngo wrote:
> Currently an NFSv4 client must maintain its lease by using at least
> one of its state tokens or, if nothing else, by issuing a RENEW (4.0)
> or a singleton SEQUENCE (4.1) at least once during each lease period.
> If the client fails to renew the lease, for any reason, the Linux
> server expunges the state tokens immediately upon detecting the
> "failure to renew the lease" condition and begins returning
> NFS4ERR_EXPIRED if the client should reconnect and attempt to use the
> (now) expired state.
>
> The default lease period for the Linux server is 90 seconds. The
> typical client cuts that in half and will issue a lease-renewing
> operation every 45 seconds. The 90-second lease period is very short
> considering the potential for moderately long-term network
> partitions. A network partition refers to any loss of network
> connectivity between the NFS client and the NFS server, regardless of
> its root cause. This includes NIC failures, NIC driver bugs, network
> misconfigurations and administrative errors, routers and switches
> crashing and/or having software updates applied, even down to cables
> being physically pulled. In most cases these network failures are
> transient, although the duration is unknown.
>
> A server which does not immediately expunge the state on lease
> expiration is known as a Courteous Server. A Courteous Server
> continues to recognize previously generated state tokens as valid
> until a conflict arises between the expired state and a request from
> another client, or until the server reboots.
>
> The initial implementation of the Courteous Server does the
> following:
>
> . when the laundromat thread detects an expired client that still has
> established state on the Linux server, it marks the client as a
> COURTESY_CLIENT and skips destroying the client and all its state;
> otherwise it destroys the client as usual.
>
> . on an OPEN request that conflicts with a COURTESY_CLIENT, it
> destroys the expired client and all its state, skips the delegation
> recall, then allows the conflicting request to succeed.
>
> . on a LOCK/LOCKT request that conflicts with a COURTESY_CLIENT, it
> destroys the expired client and all its state, then allows the
> conflicting request to succeed.
>
> To be done:
>
> . fix a problem with 4.1 where the Linux server keeps returning
> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply after the
> expired client re-connects, causing the client to keep sending BCTS
> requests to the server.
>
> . handle OPEN conflicts with share reservations.
>
> . instead of destroying the client and all its state on conflict,
> destroy only the state that conflicts with the current request.
>
> . destroy the COURTESY_CLIENT either after a fixed period of time, to
> release resources, or in reaction to memory pressure.
>
> Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
> ---
>  fs/nfsd/nfs4state.c        | 94 +++++++++++++++++++++++++++++++++++++++++++---
>  fs/nfsd/state.h            |  1 +
>  include/linux/sunrpc/svc.h |  1 +
>  3 files changed, 91 insertions(+), 5 deletions(-)
>
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index b517a8794400..969995872752 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -171,6 +171,7 @@ renew_client_locked(struct nfs4_client *clp)
>
>  	list_move_tail(&clp->cl_lru, &nn->client_lru);
>  	clp->cl_time = ktime_get_boottime_seconds();
> +	clear_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
>  }
>
>  static void put_client_renew_locked(struct nfs4_client *clp)
> @@ -4610,6 +4611,38 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
>  	nfsd4_run_cb(&dp->dl_recall);
>  }
>
> +/*
> + * Set rq_conflict_client if the conflicting client has expired
> + * and return true, otherwise return false.
> + */
> +static bool
> +nfsd_set_conflict_client(struct nfs4_delegation *dp)
> +{
> +	struct svc_rqst *rqst;
> +	struct nfs4_client *clp;
> +	struct nfsd_net *nn;
> +	bool ret;
> +
> +	if (!i_am_nfsd())
> +		return false;
> +	rqst = kthread_data(current);
> +	if (rqst->rq_prog != NFS_PROGRAM || rqst->rq_vers < 4)
> +		return false;
> +	rqst->rq_conflict_client = NULL;
> +	clp = dp->dl_recall.cb_clp;
> +	nn = net_generic(clp->net, nfsd_net_id);
> +	spin_lock(&nn->client_lock);
> +
> +	if (test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) &&
> +			!mark_client_expired_locked(clp)) {
> +		rqst->rq_conflict_client = clp;
> +		ret = true;
> +	} else
> +		ret = false;
> +	spin_unlock(&nn->client_lock);
> +	return ret;
> +}
> +
>  /* Called from break_lease() with i_lock held. */
>  static bool
>  nfsd_break_deleg_cb(struct file_lock *fl)
> @@ -4618,6 +4651,8 @@ nfsd_break_deleg_cb(struct file_lock *fl)
>  	struct nfs4_delegation *dp = (struct nfs4_delegation *)fl->fl_owner;
>  	struct nfs4_file *fp = dp->dl_stid.sc_file;
>
> +	if (nfsd_set_conflict_client(dp))
> +		return false;
>  	trace_nfsd_deleg_break(&dp->dl_stid.sc_stateid);
>
>  	/*
> @@ -5237,6 +5272,22 @@ static void nfsd4_deleg_xgrade_none_ext(struct nfsd4_open *open,
>  	 */
>  }
>
> +static bool
> +nfs4_destroy_courtesy_client(struct nfs4_client *clp)
> +{
> +	struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
> +
> +	spin_lock(&nn->client_lock);
> +	if (!test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) ||
> +			mark_client_expired_locked(clp)) {
> +		spin_unlock(&nn->client_lock);
> +		return false;
> +	}
> +	spin_unlock(&nn->client_lock);
> +	expire_client(clp);
> +	return true;
> +}
> +
>  __be32
>  nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nfsd4_open *open)
>  {
> @@ -5286,7 +5337,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
>  			goto out;
>  		}
>  	} else {
> +		rqstp->rq_conflict_client = NULL;
>  		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
> +		if (status == nfserr_jukebox && rqstp->rq_conflict_client) {
> +			if (nfs4_destroy_courtesy_client(rqstp->rq_conflict_client))
> +				status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
> +		}
> +
>  		if (status) {
>  			stp->st_stid.sc_type = NFS4_CLOSED_STID;
>  			release_open_stateid(stp);
> @@ -5472,6 +5529,8 @@ nfs4_laundromat(struct nfsd_net *nn)
>  	};
>  	struct nfs4_cpntf_state *cps;
>  	copy_stateid_t *cps_t;
> +	struct nfs4_stid *stid;
> +	int id = 0;
>  	int i;
>
>  	if (clients_still_reclaiming(nn)) {
> @@ -5495,6 +5554,15 @@ nfs4_laundromat(struct nfsd_net *nn)
>  		clp = list_entry(pos, struct nfs4_client, cl_lru);
>  		if (!state_expired(&lt, clp->cl_time))
>  			break;
> +
> +		spin_lock(&clp->cl_lock);
> +		stid = idr_get_next(&clp->cl_stateids, &id);
> +		spin_unlock(&clp->cl_lock);
> +		if (stid) {
> +			/* client still has states */
> +			set_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
> +			continue;
> +		}
>  		if (mark_client_expired_locked(clp)) {
>  			trace_nfsd_clid_expired(&clp->cl_clientid);
>  			continue;
> @@ -6440,7 +6508,7 @@ static const struct lock_manager_operations nfsd_posix_mng_ops = {
>  	.lm_put_owner = nfsd4_fl_put_owner,
>  };
>
> -static inline void
> +static inline int
>  nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>  {
>  	struct nfs4_lockowner *lo;
> @@ -6453,6 +6521,8 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>  			/* We just don't care that much */
>  			goto nevermind;
>  		deny->ld_clientid = lo->lo_owner.so_client->cl_clientid;
> +		if (nfs4_destroy_courtesy_client(lo->lo_owner.so_client))
> +			return -EAGAIN;
>  	} else {
>  nevermind:
>  		deny->ld_owner.len = 0;
> @@ -6467,6 +6537,7 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
>  	deny->ld_type = NFS4_READ_LT;
>  	if (fl->fl_type != F_RDLCK)
>  		deny->ld_type = NFS4_WRITE_LT;
> +	return 0;
>  }
>
>  static struct nfs4_lockowner *
> @@ -6734,6 +6805,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	unsigned int fl_flags = FL_POSIX;
>  	struct net *net = SVC_NET(rqstp);
>  	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
> +	bool retried = false;
>
>  	dprintk("NFSD: nfsd4_lock: start=%Ld length=%Ld\n",
>  		(long long) lock->lk_offset,
> @@ -6860,7 +6932,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  		list_add_tail(&nbl->nbl_lru, &nn->blocked_locks_lru);
>  		spin_unlock(&nn->blocked_locks_lock);
>  	}
> -
> +again:
>  	err = vfs_lock_file(nf->nf_file, F_SETLK, file_lock, conflock);
>  	switch (err) {
>  	case 0: /* success! */
> @@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	case -EAGAIN:		/* conflock holds conflicting lock */
>  		status = nfserr_denied;
>  		dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
> -		nfs4_set_lock_denied(conflock, &lock->lk_denied);
> +
> +		/* try again if conflict with courtesy client */
> +		if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == -EAGAIN && !retried) {
> +			retried = true;
> +			goto again;
> +		}
>  		break;
>  	case -EDEADLK:
>  		status = nfserr_deadlock;
> @@ -6962,6 +7039,8 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	struct nfs4_lockowner *lo = NULL;
>  	__be32 status;
>  	struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
> +	bool retried = false;
> +	int ret;
>
>  	if (locks_in_grace(SVC_NET(rqstp)))
>  		return nfserr_grace;
> @@ -7010,14 +7089,19 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	file_lock->fl_end = last_byte_offset(lockt->lt_offset, lockt->lt_length);
>
>  	nfs4_transform_lock_offset(file_lock);
> -
> +again:
>  	status = nfsd_test_lock(rqstp, &cstate->current_fh, file_lock);
>  	if (status)
>  		goto out;
>
>  	if (file_lock->fl_type != F_UNLCK) {
>  		status = nfserr_denied;
> -		nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
> +		ret = nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
> +		if (ret == -EAGAIN && !retried) {
> +			retried = true;
> +			fh_clear_wcc(&cstate->current_fh);
> +			goto again;
> +		}
>  	}
> out:
>  	if (lo)
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index e73bdbb1634a..bdc3605e3722 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -345,6 +345,7 @@ struct nfs4_client {
>  #define NFSD4_CLIENT_UPCALL_LOCK	(5)	/* upcall serialization */
>  #define NFSD4_CLIENT_CB_FLAG_MASK	(1 << NFSD4_CLIENT_CB_UPDATE | \
>  					 1 << NFSD4_CLIENT_CB_KILL)
> +#define NFSD4_CLIENT_COURTESY	(6)	/* be nice to expired client */
>  	unsigned long	cl_flags;
>  	const struct cred	*cl_cb_cred;
>  	struct rpc_clnt		*cl_cb_client;
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index e91d51ea028b..2f0382f9d0ff 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -304,6 +304,7 @@ struct svc_rqst {
>  						 * net namespace
>  						 */
>  	void			**rq_lease_breaker; /* The v4 client breaking a lease */
> +	void			*rq_conflict_client;
>  };
>
>  #define SVC_NET(rqst)	(rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
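For readers who don't want to wade through the whole diff, the decision at the heart of the laundromat change can be condensed into a sketch like the following. The helper name is mine; in the patch this logic is open-coded inside nfs4_laundromat, with the surrounding loop running under nn->client_lock:

```c
/*
 * Condensed sketch of the laundromat hunk above (simplified): an
 * expired client that still holds any stateid is marked courteous
 * and skipped instead of being destroyed.
 */
static bool keep_as_courtesy_client(struct nfs4_client *clp)
{
	struct nfs4_stid *stid;
	int id = 0;

	spin_lock(&clp->cl_lock);
	stid = idr_get_next(&clp->cl_stateids, &id);	/* any state left? */
	spin_unlock(&clp->cl_lock);

	if (!stid)
		return false;	/* no state: destroy the client as usual */

	/* keep the client around; conflicts will destroy it later */
	set_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
	return true;
}
```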
On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
> To be done:
>
> . fix a problem with 4.1 where the Linux server keeps returning
> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply after the
> expired client re-connects, causing the client to keep sending BCTS
> requests to the server.

Hm, any progress working out what's causing that?

> . handle OPEN conflicts with share reservations.

Sounds doable.

> . instead of destroying the client and all its state on conflict,
> destroy only the state that conflicts with the current request.

The other todos I think have to be done before we merge, but this one I
think can wait.

> . destroy the COURTESY_CLIENT either after a fixed period of time, to
> release resources, or in reaction to memory pressure.

I think we need something here, but it can be pretty simple.
> +/*
> + * Set rq_conflict_client if the conflicting client has expired
> + * and return true, otherwise return false.
> + */
> +static bool
> +nfsd_set_conflict_client(struct nfs4_delegation *dp)
> +{
> +	struct svc_rqst *rqst;
> +	struct nfs4_client *clp;
> +	struct nfsd_net *nn;
> +	bool ret;
> +
> +	if (!i_am_nfsd())
> +		return false;
> +	rqst = kthread_data(current);
> +	if (rqst->rq_prog != NFS_PROGRAM || rqst->rq_vers < 4)
> +		return false;

This says we do nothing if the lock request is not coming from nfsd and
v4. That doesn't sound right.

For example, if the conflicting lock is coming from a local process (not
from an NFS client at all), we still need to revoke the courtesy state
so that process can make progress.

--b.
> On Jun 16, 2021, at 12:02 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>> . instead of destroying the client and all its state on conflict,
>> destroy only the state that conflicts with the current request.
>
> The other todos I think have to be done before we merge, but this one I
> think can wait.

I agree on both points: this one can wait, but the others should be
done before merge.

>> . destroy the COURTESY_CLIENT either after a fixed period of time, to
>> release resources, or in reaction to memory pressure.
>
> I think we need something here, but it can be pretty simple.

We should work out a policy now.

A lower bound is good to have: keep courtesy clients at least this
long. Average network partition length times two as a shot in the dark,
or it could be N times the lease expiry time.

An upper bound is harder to guess at. Obviously these things will go
away when the server reboots, and the laundromat could handle this
sooner. However, using a shrinker might be nicer and more Linux-y,
keeping the clients as long as practical without the need for adding
another administrative setting.

--
Chuck Lever
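For what a shrinker-based upper bound might look like, here is a rough sketch using the standard kernel shrinker API of that era. The counter and the expiry walk are hypothetical helpers, not part of the patch:

```c
/* hypothetical bookkeeping, not in the patch */
static atomic_long_t num_courtesy_clients;
static unsigned long expire_courtesy_clients(unsigned long nr);

/* report how many courtesy clients could be reclaimed */
static unsigned long
courtesy_client_count(struct shrinker *shrink, struct shrink_control *sc)
{
	return atomic_long_read(&num_courtesy_clients);
}

/* under memory pressure, expire up to nr_to_scan of them, oldest first */
static unsigned long
courtesy_client_scan(struct shrinker *shrink, struct shrink_control *sc)
{
	return expire_courtesy_clients(sc->nr_to_scan);
}

static struct shrinker courtesy_client_shrinker = {
	.count_objects	= courtesy_client_count,
	.scan_objects	= courtesy_client_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* at nfsd startup: register_shrinker(&courtesy_client_shrinker); */
```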
On 6/16/21 9:02 AM, J. Bruce Fields wrote:
> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>> To be done:
>>
>> . fix a problem with 4.1 where the Linux server keeps returning
>> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply after the
>> expired client re-connects, causing the client to keep sending BCTS
>> requests to the server.
> Hm, any progress working out what's causing that?

I will fix this in v2 patch.

>> . handle OPEN conflicts with share reservations.
> Sounds doable.

Yes. But I don't know a way to force the Linux client to specify the
deny mode on OPEN. I may have to test this with an SMB client: an
expired NFS client should not prevent an SMB client from opening the
file with a deny mode.

>> . instead of destroying the client and all its state on conflict,
>> destroy only the state that conflicts with the current request.
> The other todos I think have to be done before we merge, but this one I
> think can wait.
>
>> . destroy the COURTESY_CLIENT either after a fixed period of time, to
>> release resources, or in reaction to memory pressure.
> I think we need something here, but it can be pretty simple.

ok.
>> +	if (!i_am_nfsd())
>> +		return false;
>> +	rqst = kthread_data(current);
>> +	if (rqst->rq_prog != NFS_PROGRAM || rqst->rq_vers < 4)
>> +		return false;
> This says we do nothing if the lock request is not coming from nfsd and
> v4. That doesn't sound right.
>
> For example, if the conflicting lock is coming from a local process (not
> from an NFS client at all), we still need to revoke the courtesy state
> so that process can make progress.

The reason it checks for an NFS request is that it uses
rq_conflict_client to pass the expired client back to the upper layer,
which then does the expire_client. If this is not an NFS request then
kthread_data() might not be the svc_rqst. In that case the code will
try to recall the delegation from the expired client, and the recall
will eventually time out.

I will rework this code.

Thanks,
-Dai
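One conceivable shape for that rework, sketched under the assumption that the courtesy client is expired asynchronously rather than by fishing the svc_rqst out of kthread_data(); the flag name and helper are hypothetical, not from the patch:

```c
/*
 * Hypothetical sketch: any lease breaker -- an nfsd thread or a local
 * process -- can flag a courtesy client for destruction and let the
 * laundromat do the actual expire_client(). The conflicting opener
 * then simply waits out the lease break as usual.
 */
#define NFSD4_DESTROY_COURTESY_CLIENT	(7)	/* hypothetical flag bit */

static bool nfsd_courtesy_client_expired(struct nfs4_delegation *dp)
{
	struct nfs4_client *clp = dp->dl_recall.cb_clp;

	if (!test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags))
		return false;
	/* defer the actual destruction to the laundromat thread */
	set_bit(NFSD4_DESTROY_COURTESY_CLIENT, &clp->cl_flags);
	return true;
}
```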
On 16/06/2021 8:17 pm, dai.ngo@oracle.com wrote:
> On 6/16/21 9:02 AM, J. Bruce Fields wrote:
>> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>>> . handle OPEN conflicts with share reservations.
>> Sounds doable.
>
> Yes. But I don't know a way to force the Linux client to specify the
> deny mode on OPEN.

We could do this with pynfs, perhaps, for testing?

cheers,
c.

> I may have to test this with an SMB client: an expired NFS client
> should not prevent an SMB client from opening the file with a deny
> mode.
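For reference, the share-reservation conflict that this todo and the pynfs suggestion are about reduces to a simple bitmask test. A generic sketch of the NFSv4 rule (not nfsd code; names are illustrative):

```c
/*
 * NFSv4 share reservations: an OPEN must be refused if its requested
 * access mode intersects an existing deny mode, or its requested deny
 * mode intersects an existing access mode.
 */
#define SHARE_ACCESS_READ	0x1
#define SHARE_ACCESS_WRITE	0x2

struct share_state {
	unsigned int access;	/* union of access modes already granted */
	unsigned int deny;	/* union of deny modes already granted */
};

static bool open_conflicts(const struct share_state *cur,
			   unsigned int req_access, unsigned int req_deny)
{
	return (req_access & cur->deny) || (req_deny & cur->access);
}
```

An OPEN for write against a file already opened with a deny-write reservation fails this test; that is exactly the case where a courtesy client's stale reservation must be destroyed rather than allowed to block the new opener.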
On 6/16/21 9:32 AM, Chuck Lever III wrote:
>> On Jun 16, 2021, at 12:02 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>
>> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>>> . instead of destroying the client and all its state on conflict,
>>> destroy only the state that conflicts with the current request.
>> The other todos I think have to be done before we merge, but this one I
>> think can wait.
> I agree on both points: this one can wait, but the others
> should be done before merge.

yes, will do.

>>> . destroy the COURTESY_CLIENT either after a fixed period of time, to
>>> release resources, or in reaction to memory pressure.
>> I think we need something here, but it can be pretty simple.
> We should work out a policy now.
>
> A lower bound is good to have: keep courtesy clients at least this
> long. Average network partition length times two as a shot in the dark,
> or it could be N times the lease expiry time.
>
> An upper bound is harder to guess at. Obviously these things will go
> away when the server reboots, and the laundromat could handle this
> sooner. However, using a shrinker might be nicer and more Linux-y,
> keeping the clients as long as practical without the need for adding
> another administrative setting.

Can we start out with a simple 12 or 24 hours to accommodate long
network outages for this phase?

Thanks,
-Dai
On 6/16/21 12:19 PM, Calum Mackay wrote:
> On 16/06/2021 8:17 pm, dai.ngo@oracle.com wrote:
>> On 6/16/21 9:02 AM, J. Bruce Fields wrote:
>>> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>>>> . handle OPEN conflicts with share reservations.
>>> Sounds doable.
>>
>> Yes. But I don't know a way to force the Linux client to specify the
>> deny mode on OPEN.
>
> We could do this with pynfs, perhaps, for testing?

Ah you're right. I think we can, might need your help with this.

Thanks,
-Dai
> On Jun 16, 2021, at 3:25 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>
> On 6/16/21 9:32 AM, Chuck Lever III wrote:
>>> On Jun 16, 2021, at 12:02 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>>
>>> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>>>> . instead of destroying the client and all its state on conflict,
>>>> only destroy the state that conflicts with the current request.
>>> The other todos I think have to be done before we merge, but this one I
>>> think can wait.
>> I agree on both points: this one can wait, but the others
>> should be done before merge.
>
> yes, will do.
>
>>>> . destroy the COURTESY_CLIENT either after a fixed period of time to
>>>> release resources or as reacting to memory pressure.
>>> I think we need something here, but it can be pretty simple.
>> We should work out a policy now.
>>
>> A lower bound is good to have. Keep courtesy clients at least
>> this long. Average network partition length times two as a shot
>> in the dark. Or it could be N times the lease expiry time.
>>
>> An upper bound is harder to guess at. Obviously these things
>> will go away when the server reboots. The laundromat could
>> handle this sooner. However, using a shrinker might be nicer and
>> more Linux-y, keeping the clients as long as practical, without
>> the need for adding another administrative setting.
>
> Can we start out with a simple 12 or 24 hours to accommodate long
> network outages for this phase?

Sure. Let's go with 24 hours.

Bill suggested adding a "clear_locks"-like mechanism that could be
used to throw out all courteous clients at once. Maybe another
phase 2 project!

--
Chuck Lever
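To make the 24-hour policy concrete, here is a minimal sketch of what the
laundromat check could look like, reusing the courtesy flag from the patch.
NFSD4_COURTESY_CLIENT_TIMEOUT and the helper name are illustrative only,
not part of the posted code; cl_time is used as an approximation of when
the client went quiet.

```c
/* Illustrative cutoff; not part of the posted patch. */
#define NFSD4_COURTESY_CLIENT_TIMEOUT	(24 * 60 * 60)	/* seconds */

/*
 * Sketch: called from nfs4_laundromat() for each client on client_lru.
 * cl_time stops being renewed once the client goes quiet, so it also
 * records roughly when the client became a courtesy client.
 */
static bool courtesy_client_has_expired(struct nfs4_client *clp)
{
	time64_t now = ktime_get_boottime_seconds();

	return test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) &&
	       now - clp->cl_time > NFSD4_COURTESY_CLIENT_TIMEOUT;
}
```

A client for which this returns true would then be handed to
mark_client_expired_locked()/expire_client() exactly like an ordinary
expired client; a shrinker could reuse the same predicate with a smaller
cutoff under memory pressure.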
On Wed, Jun 16, 2021 at 07:29:37PM +0000, Chuck Lever III wrote:
> > Can we start out with a simple 12 or 24 hours to accommodate long
> > network outages for this phase?
>
> Sure. Let's go with 24 hours.
>
> Bill suggested adding a "clear_locks"-like mechanism that could be
> used to throw out all courteous clients at once. Maybe another
> phase 2 project!

For what it's worth, you can forcibly expire a client by writing
"expire" to /proc/fs/nfsd/clients/xxx/ctl. So it shouldn't be hard to
script this, if we add some kind of "courtesy" flag to
client_info_show() and/or a number of seconds since the most recent
renew.

Maybe adding a command like "expire_if_courtesy" would also simplify
that and avoid a race where the renew comes in simultaneously with the
expire command. Or we could just add a single call to clear all
courtesy clients. But the per-client approach would allow more
flexibility if you wanted (e.g. to throw out only clients over a
certain age).

--b.
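As an illustration of scripting that interface, a small userspace sketch
that force-expires one client through its ctl file. The path layout
follows the existing /proc/fs/nfsd/clients/ directory and the "expire"
command described above; the helper name is invented and error reporting
is minimal.

```c
#include <fcntl.h>
#include <unistd.h>

/* Write "expire" to one client's ctl file, e.g.
 * "/proc/fs/nfsd/clients/4/ctl".  Returns 0 on success, -1 on error. */
static int nfsd_expire_client(const char *ctl_path)
{
	static const char cmd[] = "expire\n";
	int fd = open(ctl_path, O_WRONLY);
	int ret = -1;

	if (fd < 0)
		return -1;
	if (write(fd, cmd, sizeof(cmd) - 1) == (ssize_t)(sizeof(cmd) - 1))
		ret = 0;
	close(fd);
	return ret;
}
```

An "expire_if_courtesy" command would let the same loop run without
racing a renewal: the kernel would check the courtesy bit under
client_lock before actually expiring.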
By the way, have we thought about the -ENOSPC case?

An expired client could prevent a lot of disk space from being freed if
it happens to be holding the last reference to a large file.

I'm not sure how to deal with this. I don't think there's an efficient
way to determine which expired client is responsible for an ENOSPC. So,
maybe you'd check for ENOSPC and when you see it, remove all expired
clients and retry if there were any?

--b.
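A minimal sketch of that retry idea, reusing the reaplist pattern the
laundromat already uses. The helper name and the decision of where the
ENOSPC check would live are assumptions; the thread never settled either.

```c
/*
 * Sketch: throw out every courtesy client so a retry of the failing
 * operation can reclaim whatever space their state was pinning.
 * Returns true if at least one client was expired, i.e. a retry is
 * worth attempting.  Mirrors nfs4_laundromat()'s reaplist pattern;
 * mark_client_expired_locked() unhashes the client and drops it from
 * the lru, so moving it onto the local reaplist is safe.
 */
static bool nfsd_expire_all_courtesy_clients(struct nfsd_net *nn)
{
	struct nfs4_client *clp, *tmp;
	LIST_HEAD(reaplist);
	bool found = false;

	spin_lock(&nn->client_lock);
	list_for_each_entry_safe(clp, tmp, &nn->client_lru, cl_lru) {
		if (!test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags))
			continue;
		if (mark_client_expired_locked(clp))
			continue;	/* raced with a renewal */
		list_add(&clp->cl_lru, &reaplist);
	}
	spin_unlock(&nn->client_lock);

	list_for_each_entry_safe(clp, tmp, &reaplist, cl_lru) {
		list_del_init(&clp->cl_lru);
		expire_client(clp);
		found = true;
	}
	return found;
}
```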
On 6/24/21 7:02 AM, J. Bruce Fields wrote:
> By the way, have we thought about the -ENOSPC case?
>
> An expired client could prevent a lot of disk space from being freed if
> it happens to be holding the last reference to a large file.

I think a memory shortage is probably more likely to happen than running
out of disk space, but we have to handle this too.

> I'm not sure how to deal with this. I don't think there's an efficient
> way to determine which expired client is responsible for an ENOSPC. So,
> maybe you'd check for ENOSPC and when you see it, remove all expired
> clients and retry if there were any?

I did a quick check of fs/nfsd and did not see anywhere the server
returns ENOSPC/nfserr_nospc, so I'm not sure where to check. Currently
we plan to destroy the courtesy client after 24hr if it has not
reconnected, but that's a long time.

There are other immediate issues that need to be resolved:

1. Before an expired client can be made a courtesy client, we need to
make sure no other NFSv4 client is interested in any of the locks owned
by the expired client; another v4 client that tries to acquire the lock
gets denied (the lock request is not blocked). We need a mechanism to
detect this condition and not allow the expired client to become a
courtesy client.

2. Handle conflict with a v3 lock request. The v3 lock code can detect
a lock conflict caused by a v4 client, but it needs a way to call into
the v4 code to destroy the courtesy client.

3. Handle conflict with a local lock. Currently the local lock code
does not even need to get the conflicting lock's info in case of a
conflict; it just blocks and waits for the lock to be released. We need
to detect this condition before allowing an expired v4 client to become
a courtesy client, and also handle the case where the local lock comes
in after the v4 client was made a courtesy client. I'm not sure how to
handle this without modifying fs/locks.c.

-Dai
On Thu, Jun 24, 2021 at 12:50:25PM -0700, dai.ngo@oracle.com wrote:
> On 6/24/21 7:02 AM, J. Bruce Fields wrote:
> > I'm not sure how to deal with this. I don't think there's an efficient
> > way to determine which expired client is responsible for an ENOSPC. So,
> > maybe you'd check for ENOSPC and when you see it, remove all expired
> > clients and retry if there were any?
>
> I did a quick check of fs/nfsd and did not see anywhere the server
> returns ENOSPC/nfserr_nospc, so I'm not sure where to check. Currently
> we plan to destroy the courtesy client after 24hr if it has not
> reconnected, but that's a long time.

The error normally originates from the filesystem. Anything that might
need to allocate space could hit it, and it could happen to a non-nfsd
process. I'm not seeing an easy way to do this; we might need ideas
from other filesystem developers.

--b.
On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
> @@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	case -EAGAIN:		/* conflock holds conflicting lock */
>  		status = nfserr_denied;
>  		dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
> -		nfs4_set_lock_denied(conflock, &lock->lk_denied);
> +
> +		/* try again if conflict with courtesy client */
> +		if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == -EAGAIN && !retried) {
> +			retried = true;
> +			goto again;
> +		}

Ugh, apologies, this was my idea, but I just noticed it only handles
conflicts from other NFSv4 clients. The conflicting lock could just as
well come from NLM or a local process. So we need cooperation from the
common locks.c code.

I'm not sure what to suggest.... Maybe something like:

@@ -1159,6 +1159,7 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
 	}
 
 	percpu_down_read(&file_rwsem);
+retry:
 	spin_lock(&ctx->flc_lock);
 	/*
 	 * New lock request. Walk all POSIX locks and look for conflicts. If
@@ -1169,6 +1170,11 @@ static int posix_lock_inode(struct inode *inode, struct file_lock *request,
 	list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
 		if (!posix_locks_conflict(request, fl))
 			continue;
+		if (fl->fl_lmops->lm_expire_lock(fl, 1)) {
+			spin_unlock(&ctx->flc_lock);
+			fl->fl_lmops->lm_expire_lock(fl, 0);
+			goto retry;
+		}
 		if (conflock)
 			locks_copy_conflock(conflock, fl);
 		error = -EAGAIN;

where ->lm_expire_lock is a new lock callback with second argument
"check" where:

	check = 1 means: just check whether this lock could be freed
	check = 0 means: go ahead and free this lock if you can

--b.
On 6/28/21 1:23 PM, J. Bruce Fields wrote:
> Ugh, apologies, this was my idea, but I just noticed it only handles
> conflicts from other NFSv4 clients. The conflicting lock could just as
> well come from NLM or a local process. So we need cooperation from the
> common locks.c code.
>
> I'm not sure what to suggest.... Maybe something like:
>
> [posix_lock_inode sketch quoted above; snipped]
>
> where ->lm_expire_lock is a new lock callback with second argument
> "check" where:
>
> 	check = 1 means: just check whether this lock could be freed
> 	check = 0 means: go ahead and free this lock if you can

Thanks Bruce, I will look into this approach.

-Dai

> --b.
On 6/28/21 4:39 PM, dai.ngo@oracle.com wrote:
> On 6/28/21 1:23 PM, J. Bruce Fields wrote:
>> Ugh, apologies, this was my idea, but I just noticed it only handles
>> conflicts from other NFSv4 clients. The conflicting lock could just as
>> well come from NLM or a local process. So we need cooperation from the
>> common locks.c code.
>>
>> I'm not sure what to suggest....

One option is to use locks_copy_conflock/nfsd4_fl_get_owner to detect
that the lock being copied belongs to a courtesy client and schedule
the laundromat to run to destroy the courtesy client. This option
requires callers of vfs_lock_file to provide the 'conflock' argument.
This option, let's call it option 1, is similar to what you suggested
below (option 2). Both of these options can handle lock conflicts
between NFSv4, NFSv3/NLM and local lock requests.

Option 1: For a lock request that uses F_LOCK, the client retries the
lock request on lock denied. The lock request that gets denied also
causes the courtesy client to be destroyed, so the client succeeds on
the next retry. For a lock request that uses F_TLOCK, the request fails
with lock denied; the client does not retry and returns the error to
the application. If the application tries again it will succeed, since
the courtesy client was destroyed when the lock conflict was detected
by nfsd4_fl_get_owner. I think this behavior is ok, since the client
can not rely on a specific lease expiration time maintained by the
server.

Option 2: For a lock request that uses F_LOCK, since the new call,
lm_expire_lock, is made outside of a spinlock context, we can call
expire_client to destroy the courtesy client in this new call and
modify posix_lock_inode to retry the loop. With this option, the
conflicting lock request is allowed to succeed without the need for
the client to receive a denial and then retry. For a lock request that
uses F_TLOCK, the conflicting lock request is allowed to succeed the
same as in the F_LOCK case; the application does not need to retry.

Option 1 does not need any modification in fs/locks.c to add the new
call and do the retry. It does need to modify fs/lockd/svclock.c to
add the 'conflock' arg for vfs_lock_file. Which option do you think we
should take? I'm open to either one.

Regarding local lock conflicts, do_lock_file_wait calls vfs_lock_file
and just blocks waiting for the lock to be released. Both of the
options above do not handle the case where the local lock happens
before the v4 client expires and becomes a courtesy client. In this
case we can not let the v4 client become a courtesy client. We need a
way to detect that someone is blocked on a lock owned by the v4 client
and not allow that client to become a courtesy client. One way to
handle this is to mark the v4 lock as 'has_waiter'; then, before
allowing the expired v4 client to become a courtesy client, we search
all the locks of this v4 client for any lock with the 'has_waiter'
flag and disallow it. The part that I don't like about this approach
is having to search all locks of each lockowner of the v4 client for
a lock with 'has_waiter'. I need some suggestions here.

-Dai
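To make option 1's detection point concrete, a hedged sketch of what the
lm_get_owner hook could do. The real nfsd4_fl_get_owner() only takes a
reference on the lockowner; the courtesy check, the "reap me" flag, and
the laundromat kick shown here are assumptions this option would need
(the laundromat would additionally have to learn to expire clients
carrying the hypothetical flag).

```c
/*
 * Sketch of option 1: lm_get_owner is invoked via
 * locks_copy_conflock() when vfs_lock_file() reports a conflict, so
 * this runs with flc_lock held; mod_delayed_work() is safe in atomic
 * context.  NFSD4_CLIENT_COURTESY_REAP is a hypothetical flag.
 */
static fl_owner_t
nfsd4_fl_get_owner(fl_owner_t owner)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)owner;
	struct nfs4_client *clp = lo->lo_owner.so_client;
	struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);

	if (test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags)) {
		/* flag the client and run the laundromat right away */
		set_bit(NFSD4_CLIENT_COURTESY_REAP, &clp->cl_flags);
		mod_delayed_work(laundry_wq, &nn->laundromat_work, 0);
	}

	nfs4_get_stateowner(&lo->lo_owner);
	return owner;
}
```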
On Mon, Jun 28, 2021 at 09:40:56PM -0700, dai.ngo@oracle.com wrote:
> One option is to use locks_copy_conflock/nfsd4_fl_get_owner to detect
> that the lock being copied belongs to a courtesy client and schedule
> the laundromat to run to destroy the courtesy client. This option
> requires callers of vfs_lock_file to provide the 'conflock' argument.

I'm not sure I follow. What's the advantage of doing it this way?

> Regarding local lock conflicts, do_lock_file_wait calls vfs_lock_file
> and just blocks waiting for the lock to be released. Both of the
> options above do not handle the case where the local lock happens
> before the v4 client expires and becomes a courtesy client. In this
> case we can not let the v4 client become a courtesy client.

Oh, good point, yes, we don't want that waiter stuck waiting forever on
this expired client....

> We need a way to detect that someone is blocked on a lock owned by the
> v4 client and not allow that client to become a courtesy client. One
> way to handle this is to mark the v4 lock as 'has_waiter'; then, before
> allowing the expired v4 client to become a courtesy client, we search
> all the locks of this v4 client for any lock with the 'has_waiter'
> flag and disallow it. The part that I don't like about this approach
> is having to search all locks of each lockowner of the v4 client for
> a lock with 'has_waiter'. I need some suggestions here.

I'm not seeing a way to do it without iterating over all the client's
locks.

I don't think you should need a new flag, though; shouldn't
!list_empty(&lock->fl_blocked_requests) be enough?

--b.
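To show the shape of that iteration, a hedged sketch of walking one
lockowner's lock stateids and testing fl_blocked_requests on each of its
vfs locks. The helper name is invented; the laundromat would run it over
every lockowner of a candidate client before marking the client a
courtesy client. fi_inode is an assumption here: if nfs4_file does not
keep an inode pointer, one would need to be added.

```c
static bool
nfs4_lockowner_has_blockers(struct nfs4_lockowner *lo)
{
	struct nfs4_ol_stateid *stp;
	struct file_lock_context *ctx;
	struct file_lock *fl;
	struct inode *inode;

	list_for_each_entry(stp, &lo->lo_owner.so_stateids,
			st_perstateowner) {
		/* assumed inode pointer on the nfs4_file, see above */
		inode = stp->st_stid.sc_file->fi_inode;
		ctx = smp_load_acquire(&inode->i_flctx);
		if (!ctx)
			continue;
		spin_lock(&ctx->flc_lock);
		list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
			/* nfsd's v4 locks carry the lockowner as fl_owner */
			if (fl->fl_owner != (fl_owner_t)lo)
				continue;
			if (!list_empty(&fl->fl_blocked_requests)) {
				spin_unlock(&ctx->flc_lock);
				return true;
			}
		}
		spin_unlock(&ctx->flc_lock);
	}
	return false;
}
```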
On 6/29/21 6:35 PM, J. Bruce Fields wrote:
> On Mon, Jun 28, 2021 at 09:40:56PM -0700, dai.ngo@oracle.com wrote:
>> One option is to use locks_copy_conflock/nfsd4_fl_get_owner to detect
>> that the lock being copied belongs to a courtesy client and schedule
>> the laundromat to run to destroy the courtesy client. This option
>> requires callers of vfs_lock_file to provide the 'conflock' argument.
> I'm not sure I follow. What's the advantage of doing it this way?

I'm not sure it's an advantage, but I was trying to minimize changes to
the fs code. The only change we need is to add the conflock argument to
do_lock_file_wait to handle local lock conflicts.

If you don't think we're going to get objections to the new callback,
lm_expire_lock, then I will take that approach. We still need to add
the conflock argument to do_lock_file_wait in this case.

>> Regarding local lock conflicts, do_lock_file_wait calls vfs_lock_file
>> and just blocks waiting for the lock to be released. Both of the
>> options above do not handle the case where the local lock happens
>> before the v4 client expires and becomes a courtesy client. In this
>> case we can not let the v4 client become a courtesy client.
> Oh, good point, yes, we don't want that waiter stuck waiting forever on
> this expired client....
>
> I'm not seeing a way to do it without iterating over all the client's
> locks.

ok, I feel a bit better :-)

> I don't think you should need a new flag, though; shouldn't
> !list_empty(&lock->fl_blocked_requests) be enough?

Thanks Bruce, this is what I was looking for.

-Dai
On Wed, Jun 30, 2021 at 01:41:32AM -0700, dai.ngo@oracle.com wrote:
> I'm not sure it's an advantage, but I was trying to minimize changes to
> the fs code. The only change we need is to add the conflock argument to
> do_lock_file_wait to handle local lock conflicts.

Got it. That's a clever but kind of unexpected use of lm_get_owner; I
think it could be confusing to a future reader. And I'd rather not
require the extra retry. A new lock callback is a complication, but at
least it's pretty obvious what it does.

> If you don't think we're going to get objections to the new callback,
> lm_expire_lock, then I will take that approach. We still need to add
> the conflock argument to do_lock_file_wait in this case.

Why is that?

--b.
On Mon, Jun 28, 2021 at 04:23:31PM -0400, J. Bruce Fields wrote:
> On Thu, Jun 03, 2021 at 02:14:38PM -0400, Dai Ngo wrote:
>> [nfsd4_lock retry hunk quoted above; snipped]
>
> Ugh, apologies, this was my idea, but I just noticed it only handles
> conflicts from other NFSv4 clients.

(Oh, and I only just *now* noticed that you'd already pointed out that
problem in an email I apparently stopped reading after the first
paragraph:

https://lore.kernel.org/linux-nfs/c983218b-d270-bbae-71c5-b591bfbce473@oracle.com/

Sorry!)

--b.
> On 6/28/21 1:23 PM, J. Bruce Fields wrote:
>> where ->lm_expire_lock is a new lock callback with second argument
>> "check" where:
>>
>> 	check = 1 means: just check whether this lock could be freed

Why do we need this? Is there a use case for it? Can we just always try
to expire the lock and return success/fail?

-Dai

>> 	check = 0 means: go ahead and free this lock if you can
On Wed, Jun 30, 2021 at 10:51:27AM -0700, dai.ngo@oracle.com wrote:
> > On 6/28/21 1:23 PM, J. Bruce Fields wrote:
> >> where ->lm_expire_lock is a new lock callback with second argument
> >> "check" where:
> >>
> >> 	check = 1 means: just check whether this lock could be freed
>
> Why do we need this? Is there a use case for it? Can we just always try
> to expire the lock and return success/fail?

We can't expire the client while holding the flc_lock. And once we drop
that lock we need to restart the loop. Clearly we can't do that every
time.

(So, my code was wrong, it should have been:

	if (fl->fl_lmops->lm_expire_lock(fl, 1)) {
		spin_unlock(&ctx->flc_lock);
		fl->fl_lmops->lm_expire_lock(fl, 0);
		goto retry;
	}

)

But the 1 and 0 cases are starting to look pretty different; maybe they
should be two different callbacks.

--b.
On 6/30/21 11:05 AM, J. Bruce Fields wrote:
> On Wed, Jun 30, 2021 at 10:51:27AM -0700, dai.ngo@oracle.com wrote:
>> Why do we need this? Is there a use case for it? Can we just always try
>> to expire the lock and return success/fail?
> We can't expire the client while holding the flc_lock. And once we drop
> that lock we need to restart the loop. Clearly we can't do that every
> time.

This is what I currently have:

retry:
	list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
		if (!posix_locks_conflict(request, fl))
			continue;

		if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock) {
			spin_unlock(&ctx->flc_lock);
			ret = fl->fl_lmops->lm_expire_lock(fl, 0);
			spin_lock(&ctx->flc_lock);
			if (ret)
				goto retry;
		}

		if (conflock)
			locks_copy_conflock(conflock, fl);

> But the 1 and 0 cases are starting to look pretty different; maybe they
> should be two different callbacks.

Why is the case of 1 (test only) needed? Who would use this call?

-Dai
On Wed, Jun 30, 2021 at 11:49:18AM -0700, dai.ngo@oracle.com wrote:
> This is what I currently have:
>
> retry:
> 	list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
> 		if (!posix_locks_conflict(request, fl))
> 			continue;
>
> 		if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock) {
> 			spin_unlock(&ctx->flc_lock);
> 			ret = fl->fl_lmops->lm_expire_lock(fl, 0);
> 			spin_lock(&ctx->flc_lock);
> 			if (ret)
> 				goto retry;

We have to retry regardless of the return value. Once we've dropped
flc_lock, it's not safe to continue trying to iterate through the list.

> 		}
>
> 		if (conflock)
> 			locks_copy_conflock(conflock, fl);
>
>> But the 1 and 0 cases are starting to look pretty different; maybe they
>> should be two different callbacks.
>
> Why is the case of 1 (test only) needed? Who would use this call?

We need to avoid dropping the spinlock in the case there are no clients
to expire, otherwise we'll make no forward progress.

--b.
On 6/30/21 11:55 AM, J. Bruce Fields wrote:
> We have to retry regardless of the return value. Once we've dropped
> flc_lock, it's not safe to continue trying to iterate through the list.

Yes, thanks!

> We need to avoid dropping the spinlock in the case there are no clients
> to expire, otherwise we'll make no forward progress.

I think we can remember the last checked file_lock and skip it:

retry:
	list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
		if (!posix_locks_conflict(request, fl))
			continue;

		if (checked_fl != fl && fl->fl_lmops &&
				fl->fl_lmops->lm_expire_lock) {
			checked_fl = fl;
			spin_unlock(&ctx->flc_lock);
			fl->fl_lmops->lm_expire_lock(fl);
			spin_lock(&ctx->flc_lock);
			goto retry;
		}

		if (conflock)
			locks_copy_conflock(conflock, fl);

-Dai
On Wed, Jun 30, 2021 at 12:13:35PM -0700, dai.ngo@oracle.com wrote:
> 
> On 6/30/21 11:55 AM, J. Bruce Fields wrote:
> >On Wed, Jun 30, 2021 at 11:49:18AM -0700, dai.ngo@oracle.com wrote:
> >>On 6/30/21 11:05 AM, J. Bruce Fields wrote:
> >>>On Wed, Jun 30, 2021 at 10:51:27AM -0700, dai.ngo@oracle.com wrote:
> >>>>>On 6/28/21 1:23 PM, J. Bruce Fields wrote:
> >>>>>>where ->fl_expire_lock is a new lock callback with second
> >>>>>>argument "check"
> >>>>>>where:
> >>>>>>
> >>>>>>	check = 1 means: just check whether this lock could be freed
> >>>>Why do we need this, is there a use case for it? can we just always try
> >>>>to expire the lock and return success/fail?
> >>>We can't expire the client while holding the flc_lock.  And once we drop
> >>>that lock we need to restart the loop.  Clearly we can't do that every
> >>>time.
> >>>
> >>>(So, my code was wrong, it should have been:
> >>>
> >>>	if (fl->fl_lops->fl_expire_lock(fl, 1)) {
> >>>		spin_unlock(&ct->flc_lock);
> >>>		fl->fl_lops->fl_expire_locks(fl, 0);
> >>>		goto retry;
> >>>	}
> >>>
> >>>)
> >>This is what I currently have:
> >>
> >>retry:
> >>	list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
> >>		if (!posix_locks_conflict(request, fl))
> >>			continue;
> >>
> >>		if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock) {
> >>			spin_unlock(&ctx->flc_lock);
> >>			ret = fl->fl_lmops->lm_expire_lock(fl, 0);
> >>			spin_lock(&ctx->flc_lock);
> >>			if (ret)
> >>				goto retry;
> >We have to retry regardless of the return value.  Once we've dropped
> >flc_lock, it's not safe to continue trying to iterate through the list.
> 
> Yes, thanks!
> 
> >
> >>		}
> >>
> >>		if (conflock)
> >>			locks_copy_conflock(conflock, fl);
> >>
> >>>But the 1 and 0 cases are starting to look pretty different; maybe they
> >>>should be two different callbacks.
> >>why the case of 1 (test only) is needed, who would use this call?
> >We need to avoid dropping the spinlock in the case there are no clients
> >to expire, otherwise we'll make no forward progress.
> 
> I think we can remember the last checked file_lock and skip it:

I doubt that works in the case there are multiple locks with
lm_expire_lock set.

If you really don't want another callback here, maybe you could set some
kind of flag on the lock.

At the time a client expires, you're going to have to walk all of its
locks to see if anyone's waiting for them.  At the same time maybe you
could set an FL_EXPIRABLE flag on all those locks, and test for that
here.

If the network partition heals and the client comes back, you'd have to
remember to clear that flag again.

--b.

> retry:
> 	list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
> 		if (!posix_locks_conflict(request, fl))
> 			continue;
> 
> 		if (checked_fl != fl && fl->fl_lmops &&
> 				fl->fl_lmops->lm_expire_lock) {
> 			checked_fl = fl;
> 			spin_unlock(&ctx->flc_lock);
> 			fl->fl_lmops->lm_expire_lock(fl);
> 			spin_lock(&ctx->flc_lock);
> 			goto retry;
> 		}
> 
> 		if (conflock)
> 			locks_copy_conflock(conflock, fl);
> 
> -Dai
> 
> >
> >--b.
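Note: the FL_EXPIRABLE flag Bruce floats above is hypothetical; it does
not exist in fs/locks.c. A minimal sketch of the idea, with the flag's
setting and clearing sites only described in comments:

	/*
	 * Hypothetical sketch of the FL_EXPIRABLE idea; the flag does
	 * not exist upstream.  The laundromat would set it on every
	 * lock of a client whose lease has run out, and renewing the
	 * client would clear it again.
	 */
retry:
	list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
		if (!posix_locks_conflict(request, fl))
			continue;
		if (fl->fl_flags & FL_EXPIRABLE) {	/* hypothetical flag */
			spin_unlock(&ctx->flc_lock);
			fl->fl_lmops->lm_expire_lock(fl);
			spin_lock(&ctx->flc_lock);
			goto retry;
		}
		if (conflock)
			locks_copy_conflock(conflock, fl);
	}

Because expiring a courtesy client frees its locks and removes them from
the list, each retry would eliminate at least one flagged lock, so the
scan makes forward progress without a separate "check" call.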
On 6/30/21 12:24 PM, J. Bruce Fields wrote:
> On Wed, Jun 30, 2021 at 12:13:35PM -0700, dai.ngo@oracle.com wrote:
>> On 6/30/21 11:55 AM, J. Bruce Fields wrote:
>>> On Wed, Jun 30, 2021 at 11:49:18AM -0700, dai.ngo@oracle.com wrote:
>>>> On 6/30/21 11:05 AM, J. Bruce Fields wrote:
>>>>> On Wed, Jun 30, 2021 at 10:51:27AM -0700, dai.ngo@oracle.com wrote:
>>>>>>> On 6/28/21 1:23 PM, J. Bruce Fields wrote:
>>>>>>>> where ->fl_expire_lock is a new lock callback with second
>>>>>>>> argument "check"
>>>>>>>> where:
>>>>>>>>
>>>>>>>>	check = 1 means: just check whether this lock could be freed
>>>>>> Why do we need this, is there a use case for it? can we just always try
>>>>>> to expire the lock and return success/fail?
>>>>> We can't expire the client while holding the flc_lock.  And once we drop
>>>>> that lock we need to restart the loop.  Clearly we can't do that every
>>>>> time.
>>>>>
>>>>> (So, my code was wrong, it should have been:
>>>>>
>>>>>	if (fl->fl_lops->fl_expire_lock(fl, 1)) {
>>>>>		spin_unlock(&ct->flc_lock);
>>>>>		fl->fl_lops->fl_expire_locks(fl, 0);
>>>>>		goto retry;
>>>>>	}
>>>>>
>>>>> )
>>>> This is what I currently have:
>>>>
>>>> retry:
>>>>	list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
>>>>		if (!posix_locks_conflict(request, fl))
>>>>			continue;
>>>>
>>>>		if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock) {
>>>>			spin_unlock(&ctx->flc_lock);
>>>>			ret = fl->fl_lmops->lm_expire_lock(fl, 0);
>>>>			spin_lock(&ctx->flc_lock);
>>>>			if (ret)
>>>>				goto retry;
>>> We have to retry regardless of the return value.  Once we've dropped
>>> flc_lock, it's not safe to continue trying to iterate through the list.
>> Yes, thanks!
>>
>>>>		}
>>>>
>>>>		if (conflock)
>>>>			locks_copy_conflock(conflock, fl);
>>>>
>>>>> But the 1 and 0 cases are starting to look pretty different; maybe they
>>>>> should be two different callbacks.
>>>> why the case of 1 (test only) is needed, who would use this call?
>>> We need to avoid dropping the spinlock in the case there are no clients
>>> to expire, otherwise we'll make no forward progress.
>> I think we can remember the last checked file_lock and skip it:
> I doubt that works in the case there are multiple locks with
> lm_expire_lock set.
>
> If you really don't want another callback here, maybe you could set some
> kind of flag on the lock.
>
> At the time a client expires, you're going to have to walk all of its
> locks to see if anyone's waiting for them.  At the same time maybe you
> could set an FL_EXPIRABLE flag on all those locks, and test for that
> here.
>
> If the network partition heals and the client comes back, you'd have to
> remember to clear that flag again.

It's too much unnecessary work.

Would this suffice:

retry:
	list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
		if (!posix_locks_conflict(request, fl))
			continue;
		if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock &&
				fl->fl_lmops->lm_expire_lock(fl, 1)) {
			spin_unlock(&ctx->flc_lock);
			fl->fl_lmops->lm_expire_lock(fl, 0);
			spin_lock(&ctx->flc_lock);
			goto retry;
		}
		if (conflock)
			locks_copy_conflock(conflock, fl);

-Dai

>
> --b.
>
>> retry:
>>	list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
>>		if (!posix_locks_conflict(request, fl))
>>			continue;
>>
>>		if (checked_fl != fl && fl->fl_lmops &&
>>				fl->fl_lmops->lm_expire_lock) {
>>			checked_fl = fl;
>>			spin_unlock(&ctx->flc_lock);
>>			fl->fl_lmops->lm_expire_lock(fl);
>>			spin_lock(&ctx->flc_lock);
>>			goto retry;
>>		}
>>
>>		if (conflock)
>>			locks_copy_conflock(conflock, fl);
>>
>> -Dai
>>
>>> --b.
On Wed, Jun 30, 2021 at 04:48:57PM -0700, dai.ngo@oracle.com wrote:
> 
> On 6/30/21 12:24 PM, J. Bruce Fields wrote:
> >On Wed, Jun 30, 2021 at 12:13:35PM -0700, dai.ngo@oracle.com wrote:
> >>On 6/30/21 11:55 AM, J. Bruce Fields wrote:
> >>>On Wed, Jun 30, 2021 at 11:49:18AM -0700, dai.ngo@oracle.com wrote:
> >>>>On 6/30/21 11:05 AM, J. Bruce Fields wrote:
> >>>>>On Wed, Jun 30, 2021 at 10:51:27AM -0700, dai.ngo@oracle.com wrote:
> >>>>>>>On 6/28/21 1:23 PM, J. Bruce Fields wrote:
> >>>>>>>>where ->fl_expire_lock is a new lock callback with second
> >>>>>>>>argument "check"
> >>>>>>>>where:
> >>>>>>>>
> >>>>>>>>	check = 1 means: just check whether this lock could be freed
> >>>>>>Why do we need this, is there a use case for it? can we just always try
> >>>>>>to expire the lock and return success/fail?
> >>>>>We can't expire the client while holding the flc_lock.  And once we drop
> >>>>>that lock we need to restart the loop.  Clearly we can't do that every
> >>>>>time.
> >>>>>
> >>>>>(So, my code was wrong, it should have been:
> >>>>>
> >>>>>	if (fl->fl_lops->fl_expire_lock(fl, 1)) {
> >>>>>		spin_unlock(&ct->flc_lock);
> >>>>>		fl->fl_lops->fl_expire_locks(fl, 0);
> >>>>>		goto retry;
> >>>>>	}
> >>>>>
> >>>>>)
> >>>>This is what I currently have:
> >>>>
> >>>>retry:
> >>>>	list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
> >>>>		if (!posix_locks_conflict(request, fl))
> >>>>			continue;
> >>>>
> >>>>		if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock) {
> >>>>			spin_unlock(&ctx->flc_lock);
> >>>>			ret = fl->fl_lmops->lm_expire_lock(fl, 0);
> >>>>			spin_lock(&ctx->flc_lock);
> >>>>			if (ret)
> >>>>				goto retry;
> >>>We have to retry regardless of the return value.  Once we've dropped
> >>>flc_lock, it's not safe to continue trying to iterate through the list.
> >>Yes, thanks!
> >>
> >>>>		}
> >>>>
> >>>>		if (conflock)
> >>>>			locks_copy_conflock(conflock, fl);
> >>>>
> >>>>>But the 1 and 0 cases are starting to look pretty different; maybe they
> >>>>>should be two different callbacks.
> >>>>why the case of 1 (test only) is needed, who would use this call?
> >>>We need to avoid dropping the spinlock in the case there are no clients
> >>>to expire, otherwise we'll make no forward progress.
> >>I think we can remember the last checked file_lock and skip it:
> >I doubt that works in the case there are multiple locks with
> >lm_expire_lock set.
> >
> >If you really don't want another callback here, maybe you could set some
> >kind of flag on the lock.
> >
> >At the time a client expires, you're going to have to walk all of its
> >locks to see if anyone's waiting for them.  At the same time maybe you
> >could set an FL_EXPIRABLE flag on all those locks, and test for that
> >here.
> >
> >If the network partition heals and the client comes back, you'd have to
> >remember to clear that flag again.
> 
> It's too much unnecessary work.
> 
> Would this suffice:
> 
> retry:
> 	list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
> 		if (!posix_locks_conflict(request, fl))
> 			continue;
> 		if (fl->fl_lmops && fl->fl_lmops->lm_expire_lock &&
> 				fl->fl_lmops->lm_expire_lock(fl, 1)) {
> 			spin_unlock(&ctx->flc_lock);
> 			fl->fl_lmops->lm_expire_lock(fl, 0);
> 			spin_lock(&ctx->flc_lock);
> 			goto retry;
> 		}
> 		if (conflock)
> 			locks_copy_conflock(conflock, fl);

Looks OK to me.

--b.

> 
> -Dai
> 
> >
> >--b.
> 
> >>retry:
> >>	list_for_each_entry(fl, &ctx->flc_posix, fl_list) {
> >>		if (!posix_locks_conflict(request, fl))
> >>			continue;
> >>
> >>		if (checked_fl != fl && fl->fl_lmops &&
> >>				fl->fl_lmops->lm_expire_lock) {
> >>			checked_fl = fl;
> >>			spin_unlock(&ctx->flc_lock);
> >>			fl->fl_lmops->lm_expire_lock(fl);
> >>			spin_lock(&ctx->flc_lock);
> >>			goto retry;
> >>		}
> >>
> >>		if (conflock)
> >>			locks_copy_conflock(conflock, fl);
> >>
> >>-Dai
> >>
> >>>--b.
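Note: to make the agreed interface concrete, here is one possible shape
for the nfsd side of the two-phase callback, reusing
nfs4_destroy_courtesy_client() and the NFSD4_CLIENT_COURTESY bit from the
patch below. The function name and its wiring into nfsd_posix_mng_ops are
a sketch only, not part of the posted diff:

/*
 * Illustrative only: a possible nfsd implementation of the two-phase
 * lm_expire_lock(fl, check) callback discussed above.  Not part of
 * the posted diff.  For nfsd posix locks, fl_owner is the
 * nfs4_lockowner, which leads back to the owning client.
 */
static bool
nfsd4_lm_expire_lock(struct file_lock *fl, int check)
{
	struct nfs4_lockowner *lo = (struct nfs4_lockowner *)fl->fl_owner;
	struct nfs4_client *clp = lo->lo_owner.so_client;

	if (check)	/* called under flc_lock: must not block */
		return test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
	/* caller has dropped flc_lock: safe to expire the client */
	return nfs4_destroy_courtesy_client(clp);
}

static const struct lock_manager_operations nfsd_posix_mng_ops = {
	.lm_notify = nfsd4_lm_notify,
	.lm_get_owner = nfsd4_fl_get_owner,
	.lm_put_owner = nfsd4_fl_put_owner,
	.lm_expire_lock = nfsd4_lm_expire_lock,	/* proposed new member */
};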
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index b517a8794400..969995872752 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -171,6 +171,7 @@ renew_client_locked(struct nfs4_client *clp)
 
 	list_move_tail(&clp->cl_lru, &nn->client_lru);
 	clp->cl_time = ktime_get_boottime_seconds();
+	clear_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
 }
 
 static void put_client_renew_locked(struct nfs4_client *clp)
@@ -4610,6 +4611,38 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
 	nfsd4_run_cb(&dp->dl_recall);
 }
 
+/*
+ * Set rq_conflict_client if the conflicting client has expired
+ * and return true, otherwise return false.
+ */
+static bool
+nfsd_set_conflict_client(struct nfs4_delegation *dp)
+{
+	struct svc_rqst *rqst;
+	struct nfs4_client *clp;
+	struct nfsd_net *nn;
+	bool ret;
+
+	if (!i_am_nfsd())
+		return false;
+	rqst = kthread_data(current);
+	if (rqst->rq_prog != NFS_PROGRAM || rqst->rq_vers < 4)
+		return false;
+	rqst->rq_conflict_client = NULL;
+	clp = dp->dl_recall.cb_clp;
+	nn = net_generic(clp->net, nfsd_net_id);
+	spin_lock(&nn->client_lock);
+
+	if (test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) &&
+			!mark_client_expired_locked(clp)) {
+		rqst->rq_conflict_client = clp;
+		ret = true;
+	} else
+		ret = false;
+	spin_unlock(&nn->client_lock);
+	return ret;
+}
+
 /* Called from break_lease() with i_lock held. */
 static bool
 nfsd_break_deleg_cb(struct file_lock *fl)
@@ -4618,6 +4651,8 @@ nfsd_break_deleg_cb(struct file_lock *fl)
 	struct nfs4_delegation *dp = (struct nfs4_delegation *)fl->fl_owner;
 	struct nfs4_file *fp = dp->dl_stid.sc_file;
 
+	if (nfsd_set_conflict_client(dp))
+		return false;
 	trace_nfsd_deleg_break(&dp->dl_stid.sc_stateid);
 
 	/*
@@ -5237,6 +5272,22 @@ static void nfsd4_deleg_xgrade_none_ext(struct nfsd4_open *open,
 	 */
 }
 
+static bool
+nfs4_destroy_courtesy_client(struct nfs4_client *clp)
+{
+	struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
+
+	spin_lock(&nn->client_lock);
+	if (!test_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags) ||
+			mark_client_expired_locked(clp)) {
+		spin_unlock(&nn->client_lock);
+		return false;
+	}
+	spin_unlock(&nn->client_lock);
+	expire_client(clp);
+	return true;
+}
+
 __be32
 nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nfsd4_open *open)
 {
@@ -5286,7 +5337,13 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 			goto out;
 		}
 	} else {
+		rqstp->rq_conflict_client = NULL;
 		status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
+		if (status == nfserr_jukebox && rqstp->rq_conflict_client) {
+			if (nfs4_destroy_courtesy_client(rqstp->rq_conflict_client))
+				status = nfs4_get_vfs_file(rqstp, fp, current_fh, stp, open);
+		}
+
 		if (status) {
 			stp->st_stid.sc_type = NFS4_CLOSED_STID;
 			release_open_stateid(stp);
@@ -5472,6 +5529,8 @@ nfs4_laundromat(struct nfsd_net *nn)
 	};
 	struct nfs4_cpntf_state *cps;
 	copy_stateid_t *cps_t;
+	struct nfs4_stid *stid;
+	int id = 0;
 	int i;
 
 	if (clients_still_reclaiming(nn)) {
@@ -5495,6 +5554,15 @@ nfs4_laundromat(struct nfsd_net *nn)
 		clp = list_entry(pos, struct nfs4_client, cl_lru);
 		if (!state_expired(&lt, clp->cl_time))
 			break;
+
+		spin_lock(&clp->cl_lock);
+		stid = idr_get_next(&clp->cl_stateids, &id);
+		spin_unlock(&clp->cl_lock);
+		if (stid) {
+			/* client still has states */
+			set_bit(NFSD4_CLIENT_COURTESY, &clp->cl_flags);
+			continue;
+		}
 		if (mark_client_expired_locked(clp)) {
 			trace_nfsd_clid_expired(&clp->cl_clientid);
 			continue;
@@ -6440,7 +6508,7 @@ static const struct lock_manager_operations nfsd_posix_mng_ops = {
 	.lm_put_owner = nfsd4_fl_put_owner,
 };
 
-static inline void
+static inline int
 nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
 {
 	struct nfs4_lockowner *lo;
@@ -6453,6 +6521,8 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
 			/* We just don't care that much */
 			goto nevermind;
 		deny->ld_clientid = lo->lo_owner.so_client->cl_clientid;
+		if (nfs4_destroy_courtesy_client(lo->lo_owner.so_client))
+			return -EAGAIN;
 	} else {
 nevermind:
 		deny->ld_owner.len = 0;
@@ -6467,6 +6537,7 @@ nfs4_set_lock_denied(struct file_lock *fl, struct nfsd4_lock_denied *deny)
 		deny->ld_type = NFS4_READ_LT;
 	if (fl->fl_type != F_RDLCK)
 		deny->ld_type = NFS4_WRITE_LT;
+	return 0;
 }
 
 static struct nfs4_lockowner *
@@ -6734,6 +6805,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	unsigned int fl_flags = FL_POSIX;
 	struct net *net = SVC_NET(rqstp);
 	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+	bool retried = false;
 
 	dprintk("NFSD: nfsd4_lock: start=%Ld length=%Ld\n",
 		(long long) lock->lk_offset,
@@ -6860,7 +6932,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		list_add_tail(&nbl->nbl_lru, &nn->blocked_locks_lru);
 		spin_unlock(&nn->blocked_locks_lock);
 	}
-
+again:
 	err = vfs_lock_file(nf->nf_file, F_SETLK, file_lock, conflock);
 	switch (err) {
 	case 0: /* success! */
@@ -6875,7 +6947,12 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	case -EAGAIN:		/* conflock holds conflicting lock */
 		status = nfserr_denied;
 		dprintk("NFSD: nfsd4_lock: conflicting lock found!\n");
-		nfs4_set_lock_denied(conflock, &lock->lk_denied);
+
+		/* try again if conflict with courtesy client */
+		if (nfs4_set_lock_denied(conflock, &lock->lk_denied) == -EAGAIN && !retried) {
+			retried = true;
+			goto again;
+		}
 		break;
 	case -EDEADLK:
 		status = nfserr_deadlock;
@@ -6962,6 +7039,8 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	struct nfs4_lockowner *lo = NULL;
 	__be32 status;
 	struct nfsd_net *nn = net_generic(SVC_NET(rqstp), nfsd_net_id);
+	bool retried = false;
+	int ret;
 
 	if (locks_in_grace(SVC_NET(rqstp)))
 		return nfserr_grace;
@@ -7010,14 +7089,19 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	file_lock->fl_end = last_byte_offset(lockt->lt_offset, lockt->lt_length);
 
 	nfs4_transform_lock_offset(file_lock);
-
+again:
 	status = nfsd_test_lock(rqstp, &cstate->current_fh, file_lock);
 	if (status)
 		goto out;
 
 	if (file_lock->fl_type != F_UNLCK) {
 		status = nfserr_denied;
-		nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
+		ret = nfs4_set_lock_denied(file_lock, &lockt->lt_denied);
+		if (ret == -EAGAIN && !retried) {
+			retried = true;
+			fh_clear_wcc(&cstate->current_fh);
+			goto again;
+		}
 	}
 out:
 	if (lo)
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index e73bdbb1634a..bdc3605e3722 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -345,6 +345,7 @@ struct nfs4_client {
 #define NFSD4_CLIENT_UPCALL_LOCK	(5)	/* upcall serialization */
 #define NFSD4_CLIENT_CB_FLAG_MASK	(1 << NFSD4_CLIENT_CB_UPDATE | \
 					 1 << NFSD4_CLIENT_CB_KILL)
+#define NFSD4_CLIENT_COURTESY	(6)	/* be nice to expired client */
 	unsigned long	cl_flags;
 	const struct cred *cl_cb_cred;
 	struct rpc_clnt		*cl_cb_client;
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index e91d51ea028b..2f0382f9d0ff 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -304,6 +304,7 @@ struct svc_rqst {
 						 * net namespace
 						 */
 	void			**rq_lease_breaker; /* The v4 client breaking a lease */
+	void			*rq_conflict_client;
 };
 
 #define	SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
Currently an NFSv4 client must maintain its lease by using at least
one of the state tokens or, if nothing else, by issuing a RENEW (4.0) or
a singleton SEQUENCE (4.1) at least once during each lease period. If the
client fails to renew the lease, for any reason, the Linux server expunges
the state tokens immediately upon detection of the "failure to renew the
lease" condition and begins returning NFS4ERR_EXPIRED if the client should
reconnect and attempt to use the (now) expired state.

The default lease period for the Linux server is 90 seconds. The typical
client cuts that in half and will issue a lease renewing operation every
45 seconds. The 90 second lease period is very short considering the
potential for moderately long term network partitions. A network partition
refers to any loss of network connectivity between the NFS client and the
NFS server, regardless of its root cause. This includes NIC failures, NIC
driver bugs, network misconfigurations & administrative errors, routers &
switches crashing and/or having software updates applied, even down to
cables being physically pulled. In most cases, these network failures are
transient, although the duration is unknown.

A server which does not immediately expunge the state on lease expiration
is known as a Courteous Server. A Courteous Server continues to recognize
previously generated state tokens as valid until conflict arises between
the expired state and the requests from another client, or the server
reboots.

The initial implementation of the Courteous Server does the following:

. when the laundromat thread detects an expired client that still has
established state on the Linux server, it marks the client as a
COURTESY_CLIENT and skips destroying the client and all its state;
otherwise it destroys the client as usual.

. on detecting an OPEN request that conflicts with a COURTESY_CLIENT,
destroys the expired client and all its state, skips the delegation
recall, then allows the conflicting request to succeed.

. on detecting a LOCK/LOCKT request that conflicts with a COURTESY_CLIENT,
destroys the expired client and all its state, then allows the conflicting
request to succeed.

To be done:

. fix a problem with 4.1 where the Linux server keeps returning
SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply after the
expired client re-connects, causing the client to keep sending BCTS
requests to the server.

. handle OPEN conflicts with share reservations.

. instead of destroying the client and all its state on conflict, only
destroy the state that conflicts with the current request.

. destroy the COURTESY_CLIENT either after a fixed period of time to
release resources, or in reaction to memory pressure.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
---
 fs/nfsd/nfs4state.c        | 94 +++++++++++++++++++++++++++++++++++++++++++---
 fs/nfsd/state.h            |  1 +
 include/linux/sunrpc/svc.h |  1 +
 3 files changed, 91 insertions(+), 5 deletions(-)