Message ID | 87h8nxkyjk.fsf@notabene.neil.brown.name (mailing list archive)
---|---
State | New, archived |
----- On Apr 27, 2018, at 1:29 AM, NeilBrown neilb@suse.com wrote:

> If an NFS client has 10,000 delegations which are between 90 and 180
> seconds old, and 10,000 which are between 180 and 270 seconds old, with
> none of them still in use, it is likely that the old ones are at the
> end of the list. The first 10,000 will not be marked to be returned,
> the last 10,000 will.
>
> To return these, the code starts at the front of the list and finds the
> first delegation needing to be returned (testing 10,000 on the way).
> Then it drops rcu_read_lock(), returns the delegation and starts again,
> and does this 10,000 times.
>
> As delegation return is async, it may never block, so these 10,000
> delegations will be returned without stopping for a breath. The
> soft-lockup detector will notice and complain.
>
> This patch makes 3 changes.
>
> 1/ cond_resched() is added so that the soft-lockup detector doesn't notice.
> 2/ A place-holder (an inode) is kept when locks are dropped, so that
>    the place can usually be found again after taking rcu_read_lock().
>    This means we don't need to skip over 10,000 entries 10,000 times -
>    100 million pointless operations, which could easily be a larger
>    number.
> 3/ If nfs_sb_active() fails, break out of the loop - there is no point
>    in continuing.
>
> The patch also adds list_for_each_entry_from_rcu() to rculist.h in
> order to achieve 2/.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>
> Hi,
> I'm hoping one of the RCU reviewers will provide an Acked-by for the
> rculist.h change.
>
> If you'd like 3 patches instead of just one, please let me know. But
> they all seem to fit well together.

I think the RCU list API changes should sit in their own patch.

>
> thanks,
> NeilBrown
>
>
>  fs/nfs/delegation.c     | 57 ++++++++++++++++++++++++++++++++++++++++++++-----
>  include/linux/rculist.h | 10 +++++++++
>  2 files changed, 62 insertions(+), 5 deletions(-)
>
> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
> index 1819d0d0ba4b..c3d9e21ab440 100644
> --- a/fs/nfs/delegation.c
> +++ b/fs/nfs/delegation.c
> @@ -483,19 +483,56 @@ static bool nfs_delegation_need_return(struct nfs_delegation *delegation)
>  int nfs_client_return_marked_delegations(struct nfs_client *clp)
>  {
>  	struct nfs_delegation *delegation;
> +	struct nfs_delegation *prev;
>  	struct nfs_server *server;
>  	struct inode *inode;
> +	struct inode *place_holder = NULL;
>  	int err = 0;
>
>  restart:
> +	/*
> +	 * To avoid quadratic looping we hold an reference

an reference -> a reference

> +	 * to an inode place_holder. Each time we restart, we
> +	 * list nfs_servers from the server of that inode, and
> +	 * delegation in the server from the delegations of that
> +	 * inode.
> +	 * prev is an RCU-protected pointer to a delegation which
> +	 * wasn't marked for return and might be a good choice for
> +	 * the next place_holder.
> +	 */
>  	rcu_read_lock();
> -	list_for_each_entry_rcu(server, &clp->cl_superblocks, client_link) {
> -		list_for_each_entry_rcu(delegation, &server->delegations,
> -				super_list) {
> -			if (!nfs_delegation_need_return(delegation))
> +	prev = NULL;
> +	if (place_holder)
> +		server = NFS_SERVER(place_holder);
> +	else
> +		server = list_entry_rcu(clp->cl_superblocks.next,
> +				struct nfs_server, client_link);
> +	list_for_each_entry_from_rcu(server, &clp->cl_superblocks, client_link) {

Unless I'm missing something, you should define the delegation variable
within this scope.

> +		delegation = NULL;
> +		if (place_holder && server == NFS_SERVER(place_holder))
> +			delegation = rcu_dereference(NFS_I(place_holder)->delegation);
> +		if (!delegation)
> +			delegation = list_entry_rcu(server->delegations.next,
> +					struct nfs_delegation, super_list);

AFAIU, what makes this work is that you expect grabbing the inode
reference will ensure the entry is not removed from the RCU list until
the iput().

> +		list_for_each_entry_from_rcu(delegation, &server->delegations, super_list) {
> +			struct inode *to_put = NULL;
> +
> +			if (!nfs_delegation_need_return(delegation)) {
> +				prev = delegation;
>  				continue;
> +			}
>  			if (!nfs_sb_active(server->super))
> -				continue;
> +				break;
> +
> +			if (prev) {
> +				struct inode *tmp;

missing blank line, or the assignment can be done at the variable
definition.

> +				tmp = nfs_delegation_grab_inode(prev);
> +				if (tmp) {
> +					to_put = place_holder;
> +					place_holder = tmp;
> +				}
> +			}
> +
>  			inode = nfs_delegation_grab_inode(delegation);
>  			if (inode == NULL) {
>  				rcu_read_unlock();
> @@ -505,16 +542,26 @@ int nfs_client_return_marked_delegations(struct nfs_client *clp)
>  			delegation = nfs_start_delegation_return_locked(NFS_I(inode));
>  			rcu_read_unlock();
>
> +			if (to_put) {
> +				iput(to_put);
> +				to_put = NULL;

considering the scope of to_put, I think assigning it to NULL is
redundant with the variable definition.

> +			}
> +
>  			err = nfs_end_delegation_return(inode, delegation, 0);
>  			iput(inode);
>  			nfs_sb_deactive(server->super);
> +			cond_resched();
>  			if (!err)
>  				goto restart;
>  			set_bit(NFS4CLNT_DELEGRETURN, &clp->cl_state);
> +			if (place_holder)
> +				iput(place_holder);
>  			return err;
>  		}
>  	}
>  	rcu_read_unlock();
> +	if (place_holder)
> +		iput(place_holder);
>  	return 0;
>  }
>
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 127f534fec94..2d86f9869842 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -403,6 +403,16 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>  		&pos->member != (head); \
>  	pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
>
> +/**
> + * list_for_each_entry_from_rcu - iterate over a list continuing from current point

There is some documentation missing about the requirements for
continuing an RCU list iteration.

The simple case is that both the original iteration and the continued
iteration need to be in the same RCU read-side critical section.

A more complex case like yours is along these lines:

  rcu_read_lock()
  list_iteration
    break;
  grab inode ref (prohibit removal of node from list)
  rcu_read_unlock()

  rcu_read_lock()
  put inode ref
  continue list iteration
  rcu_read_unlock()

I'm thinking these allowed patterns should be documented somehow,
else we'll see list iteration being continued across different
RCU read-side critical sections without any reference counting
(or locking) to ensure protection against removal.

Thanks,

Mathieu

> + * @pos: the type * to use as a loop cursor.
> + * @head: the head for your list.
> + * @member: the name of the list_node within the struct.
> + */
> +#define list_for_each_entry_from_rcu(pos, head, member) \
> +	for (; &(pos)->member != (head); \
> +		pos = list_entry_rcu(pos->member.next, typeof(*(pos)), member))
> +
>  /**
>   * hlist_del_rcu - deletes entry from hash list without re-initialization
>   * @n: the element to delete from the hash list.
> --
> 2.14.0.rc0.dirty
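A concrete rendering of the pattern Mathieu sketches may help. The following
is a minimal sketch, not code from the patch: "struct my_entry", the
get_my_entry()/put_my_entry() reference helpers and the work functions are
hypothetical, and the reference is assumed - like the inode reference in the
patch - to prevent the entry from being unlinked from the list. It also
assumes the first loop found a match; a real caller must handle the no-match
case.

	struct my_entry *pos;

	rcu_read_lock();
	list_for_each_entry_rcu(pos, &my_list, node) {
		if (needs_slow_work(pos)) {
			/* Pin the entry; it cannot leave the list
			 * until put_my_entry(). */
			get_my_entry(pos);
			break;
		}
	}
	rcu_read_unlock();

	do_slow_work(pos);	/* may sleep; the entry stays on the list */

	rcu_read_lock();
	/* Dropping the pin inside the new read-side critical section is
	 * fine: even if the entry is unlinked now, list_del_rcu() leaves
	 * its ->next intact, so the traversal below remains safe until
	 * rcu_read_unlock(). */
	put_my_entry(pos);
	list_for_each_entry_from_rcu(pos, &my_list, node) {
		/* resume the scan; the first iteration revisits 'pos' */
	}
	rcu_read_unlock();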
On Fri, Apr 27 2018, Mathieu Desnoyers wrote:

> ----- On Apr 27, 2018, at 1:29 AM, NeilBrown neilb@suse.com wrote:
>
>> If an NFS client has 10,000 delegations which are between 90 and 180
>> seconds old, and 10,000 which are between 180 and 270 seconds old, with
>> none of them still in use, it is likely that the old ones are at the
>> end of the list. The first 10,000 will not be marked to be returned,
>> the last 10,000 will.
>>
>> To return these, the code starts at the front of the list and finds the
>> first delegation needing to be returned (testing 10,000 on the way).
>> Then it drops rcu_read_lock(), returns the delegation and starts again,
>> and does this 10,000 times.
>>
>> As delegation return is async, it may never block, so these 10,000
>> delegations will be returned without stopping for a breath. The
>> soft-lockup detector will notice and complain.
>>
>> This patch makes 3 changes.
>>
>> 1/ cond_resched() is added so that the soft-lockup detector doesn't notice.
>> 2/ A place-holder (an inode) is kept when locks are dropped, so that
>>    the place can usually be found again after taking rcu_read_lock().
>>    This means we don't need to skip over 10,000 entries 10,000 times -
>>    100 million pointless operations, which could easily be a larger
>>    number.
>> 3/ If nfs_sb_active() fails, break out of the loop - there is no point
>>    in continuing.
>>
>> The patch also adds list_for_each_entry_from_rcu() to rculist.h in
>> order to achieve 2/.
>>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>
>> Hi,
>> I'm hoping one of the RCU reviewers will provide an Acked-by for the
>> rculist.h change.
>>
>> If you'd like 3 patches instead of just one, please let me know. But
>> they all seem to fit well together.

Thanks for your review!

>
> I think the RCU list API changes should sit in their own patch.

I guess I can split it into a separate patch if you like. Hopefully it
can still go in through the NFS tree so we don't need to sync separate
trees.

>
>>
>> thanks,
>> NeilBrown
>>
>>
>>  fs/nfs/delegation.c     | 57 ++++++++++++++++++++++++++++++++++++++++++++-----
>>  include/linux/rculist.h | 10 +++++++++
>>  2 files changed, 62 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>> index 1819d0d0ba4b..c3d9e21ab440 100644
>> --- a/fs/nfs/delegation.c
>> +++ b/fs/nfs/delegation.c
>> @@ -483,19 +483,56 @@ static bool nfs_delegation_need_return(struct nfs_delegation *delegation)
>>  int nfs_client_return_marked_delegations(struct nfs_client *clp)
>>  {
>>  	struct nfs_delegation *delegation;
>> +	struct nfs_delegation *prev;
>>  	struct nfs_server *server;
>>  	struct inode *inode;
>> +	struct inode *place_holder = NULL;
>>  	int err = 0;
>>
>>  restart:
>> +	/*
>> +	 * To avoid quadratic looping we hold an reference
>
> an reference -> a reference

Fixed, thanks.

>
>> +	 * to an inode place_holder. Each time we restart, we
>> +	 * list nfs_servers from the server of that inode, and
>> +	 * delegation in the server from the delegations of that
>> +	 * inode.
>> +	 * prev is an RCU-protected pointer to a delegation which
>> +	 * wasn't marked for return and might be a good choice for
>> +	 * the next place_holder.
>> +	 */
>>  	rcu_read_lock();
>> -	list_for_each_entry_rcu(server, &clp->cl_superblocks, client_link) {
>> -		list_for_each_entry_rcu(delegation, &server->delegations,
>> -				super_list) {
>> -			if (!nfs_delegation_need_return(delegation))
>> +	prev = NULL;
>> +	if (place_holder)
>> +		server = NFS_SERVER(place_holder);
>> +	else
>> +		server = list_entry_rcu(clp->cl_superblocks.next,
>> +				struct nfs_server, client_link);
>> +	list_for_each_entry_from_rcu(server, &clp->cl_superblocks, client_link) {
>
> Unless I'm missing something, you should define the delegation variable
> within this scope.

I could, yes. But that was true before the patch as well. I don't
really want to change things that I don't need to change.
This is really a style issue. If Trond or Anna prefer the variable in
the most local scope, I'll move it.

>
>> +		delegation = NULL;
>> +		if (place_holder && server == NFS_SERVER(place_holder))
>> +			delegation = rcu_dereference(NFS_I(place_holder)->delegation);
>> +		if (!delegation)
>> +			delegation = list_entry_rcu(server->delegations.next,
>> +					struct nfs_delegation, super_list);
>
> AFAIU, what makes this work is that you expect grabbing the inode
> reference will ensure the entry is not removed from the RCU list until
> the iput().

There are two lists: a per-"struct nfs_client" list of "struct
nfs_server", and a per-"struct nfs_server" list of "struct
nfs_delegation".

I expect that grabbing the inode will stop the filesystem from being
unmounted, so that the nfs_server will remain in the nfs_client's list.

Also I expect that grabbing the inode will mean that it remains safe to
dereference NFS_I(place_holder)->delegation. If that is not NULL, then
it is either the same delegation that we had before or, just possibly,
a new one. If it is a new one it will have been added to the end of the
list, so we could end up missing a few delegations. This is not a
critical problem, as this is a periodic cleanup, and if it misses some
occasionally, that isn't a big problem.

However it would be easy enough to keep track of what the previous
delegation was, and only use the new one if it is the same. I'll do
that.

>
>> +		list_for_each_entry_from_rcu(delegation, &server->delegations, super_list) {
>> +			struct inode *to_put = NULL;
>> +
>> +			if (!nfs_delegation_need_return(delegation)) {
>> +				prev = delegation;
>>  				continue;
>> +			}
>>  			if (!nfs_sb_active(server->super))
>> -				continue;
>> +				break;
>> +
>> +			if (prev) {
>> +				struct inode *tmp;
>
> missing blank line, or the assignment can be done at the variable
> definition.
>
>> +				tmp = nfs_delegation_grab_inode(prev);
>> +				if (tmp) {
>> +					to_put = place_holder;
>> +					place_holder = tmp;
>> +				}
>> +			}
>> +
>>  			inode = nfs_delegation_grab_inode(delegation);
>>  			if (inode == NULL) {
>>  				rcu_read_unlock();
>> @@ -505,16 +542,26 @@ int nfs_client_return_marked_delegations(struct nfs_client *clp)
>>  			delegation = nfs_start_delegation_return_locked(NFS_I(inode));
>>  			rcu_read_unlock();
>>
>> +			if (to_put) {
>> +				iput(to_put);
>> +				to_put = NULL;
>
> considering the scope of to_put, I think assigning it to NULL is
> redundant with the variable definition.

True ... and that actually highlights a bug for me. Earlier, where I
have "goto restart", I need to iput(to_put). Thanks.

>
>> +			}
>> +
>>  			err = nfs_end_delegation_return(inode, delegation, 0);
>>  			iput(inode);
>>  			nfs_sb_deactive(server->super);
>> +			cond_resched();
>>  			if (!err)
>>  				goto restart;
>>  			set_bit(NFS4CLNT_DELEGRETURN, &clp->cl_state);
>> +			if (place_holder)
>> +				iput(place_holder);
>>  			return err;
>>  		}
>>  	}
>>  	rcu_read_unlock();
>> +	if (place_holder)
>> +		iput(place_holder);
>>  	return 0;
>>  }
>>
>> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> index 127f534fec94..2d86f9869842 100644
>> --- a/include/linux/rculist.h
>> +++ b/include/linux/rculist.h
>> @@ -403,6 +403,16 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>>  		&pos->member != (head); \
>>  	pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
>>
>> +/**
>> + * list_for_each_entry_from_rcu - iterate over a list continuing from current point
>
> There is some documentation missing about the requirements for
> continuing an RCU list iteration.

If that is true, then the already existing
list_for_each_entry_continue_rcu() is also missing some documentation.

I've removed the word "continue" so it now reads:

 * list_for_each_entry_from_rcu - iterate over a list from current point

Does it really need to be said that you actually have to have a
"current point" that is known to be in the list?
I've added

 * Iterate over the tail of a list starting from a given position,
 * which must have been in the list when the RCU read lock was taken.

in case it does help.

>
> The simple case is that both the original iteration and the continued
> iteration need to be in the same RCU read-side critical section.
>
> A more complex case like yours is along these lines:
>
>   rcu_read_lock()
>   list_iteration
>     break;
>   grab inode ref (prohibit removal of node from list)
>   rcu_read_unlock()
>
>   rcu_read_lock()
>   put inode ref
>   continue list iteration
>   rcu_read_unlock()
>
> I'm thinking these allowed patterns should be documented somehow,
> else we'll see list iteration being continued across different
> RCU read-side critical sections without any reference counting
> (or locking) to ensure protection against removal.

If that was a real risk, surely we would already see it with
list_for_each_entry_continue_rcu().

Thanks a lot,
NeilBrown

>
> Thanks,
>
> Mathieu
>
>> + * @pos: the type * to use as a loop cursor.
>> + * @head: the head for your list.
>> + * @member: the name of the list_node within the struct.
>> + */
>> +#define list_for_each_entry_from_rcu(pos, head, member) \
>> +	for (; &(pos)->member != (head); \
>> +		pos = list_entry_rcu(pos->member.next, typeof(*(pos)), member))
>> +
>>  /**
>>   * hlist_del_rcu - deletes entry from hash list without re-initialization
>>   * @n: the element to delete from the hash list.
>> --
>> 2.14.0.rc0.dirty
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
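For illustration only: the tweak Neil promises above (use the place-holder
inode's delegation on restart only if it is still the one remembered from
before the locks were dropped) could take roughly this shape on restart.
"place_holder_deleg" is a hypothetical extra local, saved when the
place_holder reference was taken; this is a sketch, not the actual
follow-up patch:

	delegation = NULL;
	if (place_holder && server == NFS_SERVER(place_holder)) {
		delegation = rcu_dereference(NFS_I(place_holder)->delegation);
		if (delegation != place_holder_deleg)
			/* The inode picked up a *new* delegation, which
			 * was added at the tail of the list; resuming
			 * there would skip entries not yet examined. */
			delegation = NULL;
	}
	if (!delegation)
		delegation = list_entry_rcu(server->delegations.next,
				struct nfs_delegation, super_list);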
diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index 1819d0d0ba4b..c3d9e21ab440 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -483,19 +483,56 @@ static bool nfs_delegation_need_return(struct nfs_delegation *delegation)
 int nfs_client_return_marked_delegations(struct nfs_client *clp)
 {
 	struct nfs_delegation *delegation;
+	struct nfs_delegation *prev;
 	struct nfs_server *server;
 	struct inode *inode;
+	struct inode *place_holder = NULL;
 	int err = 0;
 
 restart:
+	/*
+	 * To avoid quadratic looping we hold an reference
+	 * to an inode place_holder. Each time we restart, we
+	 * list nfs_servers from the server of that inode, and
+	 * delegation in the server from the delegations of that
+	 * inode.
+	 * prev is an RCU-protected pointer to a delegation which
+	 * wasn't marked for return and might be a good choice for
+	 * the next place_holder.
+	 */
 	rcu_read_lock();
-	list_for_each_entry_rcu(server, &clp->cl_superblocks, client_link) {
-		list_for_each_entry_rcu(delegation, &server->delegations,
-				super_list) {
-			if (!nfs_delegation_need_return(delegation))
+	prev = NULL;
+	if (place_holder)
+		server = NFS_SERVER(place_holder);
+	else
+		server = list_entry_rcu(clp->cl_superblocks.next,
+				struct nfs_server, client_link);
+	list_for_each_entry_from_rcu(server, &clp->cl_superblocks, client_link) {
+		delegation = NULL;
+		if (place_holder && server == NFS_SERVER(place_holder))
+			delegation = rcu_dereference(NFS_I(place_holder)->delegation);
+		if (!delegation)
+			delegation = list_entry_rcu(server->delegations.next,
+					struct nfs_delegation, super_list);
+		list_for_each_entry_from_rcu(delegation, &server->delegations, super_list) {
+			struct inode *to_put = NULL;
+
+			if (!nfs_delegation_need_return(delegation)) {
+				prev = delegation;
 				continue;
+			}
 			if (!nfs_sb_active(server->super))
-				continue;
+				break;
+
+			if (prev) {
+				struct inode *tmp;
+				tmp = nfs_delegation_grab_inode(prev);
+				if (tmp) {
+					to_put = place_holder;
+					place_holder = tmp;
+				}
+			}
+
 			inode = nfs_delegation_grab_inode(delegation);
 			if (inode == NULL) {
 				rcu_read_unlock();
@@ -505,16 +542,26 @@ int nfs_client_return_marked_delegations(struct nfs_client *clp)
 			delegation = nfs_start_delegation_return_locked(NFS_I(inode));
 			rcu_read_unlock();
 
+			if (to_put) {
+				iput(to_put);
+				to_put = NULL;
+			}
+
 			err = nfs_end_delegation_return(inode, delegation, 0);
 			iput(inode);
 			nfs_sb_deactive(server->super);
+			cond_resched();
 			if (!err)
 				goto restart;
 			set_bit(NFS4CLNT_DELEGRETURN, &clp->cl_state);
+			if (place_holder)
+				iput(place_holder);
 			return err;
 		}
 	}
 	rcu_read_unlock();
+	if (place_holder)
+		iput(place_holder);
 	return 0;
 }
 
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 127f534fec94..2d86f9869842 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -403,6 +403,16 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
 		&pos->member != (head); \
 	pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
 
+/**
+ * list_for_each_entry_from_rcu - iterate over a list continuing from current point
+ * @pos: the type * to use as a loop cursor.
+ * @head: the head for your list.
+ * @member: the name of the list_node within the struct.
+ */
+#define list_for_each_entry_from_rcu(pos, head, member) \
+	for (; &(pos)->member != (head); \
+		pos = list_entry_rcu(pos->member.next, typeof(*(pos)), member))
+
 /**
  * hlist_del_rcu - deletes entry from hash list without re-initialization
  * @n: the element to delete from the hash list.
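A note on the calling convention of the new macro, since it differs from
list_for_each_entry_rcu(): the caller positions the cursor on a known list
entry first, and iteration then starts at that entry inclusive. A hedged
sketch, with hypothetical my_entry/my_list/process() names; "resume_point",
if set, is assumed to be pinned (as the place-holder inode is in the patch)
so it is still on the list:

	rcu_read_lock();
	if (resume_point)
		pos = resume_point;	/* pinned earlier; known to be on the list */
	else
		pos = list_entry_rcu(my_list.next,
				struct my_entry, node);
	list_for_each_entry_from_rcu(pos, &my_list, node)
		process(pos);		/* first iteration sees 'pos' itself */
	rcu_read_unlock();

As in the patch's own use, the else branch computes a cursor from the list
head; on an empty list the loop condition fails immediately and the bogus
head-derived pointer is never dereferenced.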
If an NFS client has 10,000 delegations which are between 90 and 180
seconds old, and 10,000 which are between 180 and 270 seconds old, with
none of them still in use, it is likely that the old ones are at the
end of the list. The first 10,000 will not be marked to be returned,
the last 10,000 will.

To return these, the code starts at the front of the list and finds the
first delegation needing to be returned (testing 10,000 on the way).
Then it drops rcu_read_lock(), returns the delegation and starts again,
and does this 10,000 times.

As delegation return is async, it may never block, so these 10,000
delegations will be returned without stopping for a breath. The
soft-lockup detector will notice and complain.

This patch makes 3 changes.

1/ cond_resched() is added so that the soft-lockup detector doesn't notice.
2/ A place-holder (an inode) is kept when locks are dropped, so that
   the place can usually be found again after taking rcu_read_lock().
   This means we don't need to skip over 10,000 entries 10,000 times -
   100 million pointless operations, which could easily be a larger
   number.
3/ If nfs_sb_active() fails, break out of the loop - there is no point
   in continuing.

The patch also adds list_for_each_entry_from_rcu() to rculist.h in
order to achieve 2/.

Signed-off-by: NeilBrown <neilb@suse.com>
---

Hi,
I'm hoping one of the RCU reviewers will provide an Acked-by for the
rculist.h change.

If you'd like 3 patches instead of just one, please let me know. But
they all seem to fit well together.

thanks,
NeilBrown


 fs/nfs/delegation.c     | 57 ++++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/rculist.h | 10 +++++++++
 2 files changed, 62 insertions(+), 5 deletions(-)