Message ID | 52D5589A.7090507@panasas.com (mailing list archive) |
---|---|
State | New, archived |
On Tue, 2014-01-14 at 17:32 +0200, Boaz Harrosh wrote:
> An NFS4ERR_RECALLCONFLICT is returned by the server from a GET_LAYOUT
> only when the server sent a RECALL due to that GET_LAYOUT, or
> the RECALL and GET_LAYOUT crossed on the wire.
> Either way, this means we want to wait at most until in-flight IO
> is finished and the RECALL can be satisfied.
>
> So a proper wait here is more like 1/10 of a second, not 15 seconds
> like we have now. (We use NFS4_POLL_RETRY_MIN here)
>
> Current code totally craps out performance of very large files on
> most pnfs-objects layouts, because of how the map changes when the
> file has grown beyond a raid group.
>
> CC: Stable Tree <stable@vger.kernel.org>
> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
> ---
>  fs/nfs/nfs4proc.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index d53d678..3264fca 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -7058,7 +7058,7 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
>  	struct nfs4_state *state = NULL;
>  	unsigned long timeo, giveup;
>
> -	dprintk("--> %s\n", __func__);
> +	dprintk("--> %s tk_status => %d\n", __func__, task->tk_status);
>
>  	if (!nfs41_sequence_done(task, &lgp->res.seq_res))
>  		goto out;
> @@ -7067,11 +7067,27 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
>  	case 0:
>  		goto out;
>  	case -NFS4ERR_LAYOUTTRYLATER:
> +	/* NFS4ERR_RECALLCONFLICT is always a minimal delay (conflict with
> +	 * self)
> +	 * TODO: NFS4ERR_LAYOUTTRYLATER is a conflict with another client
> +	 * (or clients). What we should do is randomize a short delay like on a
> +	 * network broadcast burst, and raise the random max every failure.
> +	 * For now leave it stateless and do this polling.
> +	 */
>  	case -NFS4ERR_RECALLCONFLICT:
>  		timeo = rpc_get_timeout(task->tk_client);
>  		giveup = lgp->args.timestamp + timeo;
> -		if (time_after(giveup, jiffies))
> -			task->tk_status = -NFS4ERR_DELAY;
> +		if (time_after(giveup, jiffies)) {
> +			/* Do a minimum delay, We are actually waiting for our
> +			 * own IO to finish (In most cases)
> +			 */
> +			dprintk("%s: NFS4ERR_RECALLCONFLICT waiting\n",
> +				__func__);
> +			rpc_delay(task, NFS4_POLL_RETRY_MIN);
> +			task->tk_status = 0;
> +			rpc_restart_call_prepare(task);
> +			goto out; /* Do not call nfs4_async_handle_error() */
> +		}
>

For the default mount option of 'timeo=600', and the default #define
NFS4_POLL_RETRY_MIN==HZ/10, this means we can end up pounding the server
with 600 LAYOUTGET requests within the space of 1 minute, before giving
up. Is that reasonable?
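The arithmetic behind the 600-request figure can be checked with a short sketch. It is not kernel code: `timeo_to_jiffies` and `max_fixed_retries` are hypothetical helper names, and the sketch assumes that `timeo` is given in tenths of a second (as the NFS mount option is) and that every retry is sent exactly one fixed interval after the previous one.

```c
#include <assert.h>

/* Hypothetical helper: convert a timeo= value (tenths of a second)
 * into jiffies for a given tick rate hz. */
unsigned long timeo_to_jiffies(unsigned long timeo_deciseconds,
                               unsigned long hz)
{
    return timeo_deciseconds * hz / 10;
}

/* Hypothetical helper: upper bound on the number of LAYOUTGET retries
 * that fit into the give-up window when each retry waits a fixed
 * interval. Both arguments are in jiffies. */
unsigned long max_fixed_retries(unsigned long timeo_jiffies,
                                unsigned long retry_interval)
{
    assert(retry_interval > 0);
    return timeo_jiffies / retry_interval;
}
```

With `hz = 1000`, `timeo=600` becomes 60000 jiffies and HZ/10 becomes 100 jiffies, so `max_fixed_retries(60000, 100)` gives 600. The bound does not depend on the tick rate, because both values scale with it.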
On 01/14/2014 09:05 PM, Trond Myklebust wrote:
> On Tue, 2014-01-14 at 17:32 +0200, Boaz Harrosh wrote:
>>
>
> For the default mount option of 'timeo=600', and the default #define
> NFS4_POLL_RETRY_MIN==HZ/10, this means we can end up pounding the server
> with 600 LAYOUTGET requests within the space of 1 minute, before giving
> up. Is that reasonable?
>

It will never get there; it will always be one or two sends. Usually it is
just so the sequence of layout_get_done is out of the way and the
LAYOUT_RECALL sequence+1 can get through and the layout released. Then
the next time it will all be good and the LAYOUT_GET will succeed.

Worst case is when the client is very busy with a queue full of IO
on the same busy layout that needs to be released by the recall. Personally
I found that this never exceeds 40 IOPs in flight. Note that this is not
the amount of total dirty memory but only the amount of already submitted
IO. I guess that on a very slow connection these can take time but at
regular line speeds I never observed more than 2 retries with this patch.

It is all up to the client. NFS4ERR_RECALLCONFLICT means "the layouts you
have need to be released" (I say released because the forgetful model does
not actually return them). Can you see a critical time when layouts are
held for longer than a second?

Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Jan 14, 2014, at 17:21, Boaz Harrosh <bharrosh@panasas.com> wrote:

> On 01/14/2014 09:05 PM, Trond Myklebust wrote:
>> On Tue, 2014-01-14 at 17:32 +0200, Boaz Harrosh wrote:
>>>
>>
>> For the default mount option of 'timeo=600', and the default #define
>> NFS4_POLL_RETRY_MIN==HZ/10, this means we can end up pounding the server
>> with 600 LAYOUTGET requests within the space of 1 minute, before giving
>> up. Is that reasonable?
>>
>
> It will never get there; it will always be one or two sends. Usually it is
> just so the sequence of layout_get_done is out of the way and the
> LAYOUT_RECALL sequence+1 can get through and the layout released. Then
> the next time it will all be good and the LAYOUT_GET will succeed.
>
> Worst case is when the client is very busy with a queue full of IO
> on the same busy layout that needs to be released by the recall. Personally
> I found that this never exceeds 40 IOPs in flight. Note that this is not
> the amount of total dirty memory but only the amount of already submitted
> IO. I guess that on a very slow connection these can take time but at
> regular line speeds I never observed more than 2 retries with this patch.
>
> It is all up to the client. NFS4ERR_RECALLCONFLICT means "the layouts you
> have need to be released" (I say released because the forgetful model does
> not actually return them). Can you see a critical time when layouts are
> held for longer than a second?

That will probably depend on the workload and possibly on the layout type.

My point was, however, about the potential for mischief due to the mismatch
between the number of retries that the resulting code allows, and the fixed
period between those retries of 1/10 seconds. Why not rather use something
along the lines of
"rpc_delay(rpc_task, min(giveup - jiffies, max(jiffies - lgp->args.timestamp, NFS4_POLL_RETRY_MIN)));"?
That gives you an initially exponential back off with a minimum period of
NFS4_POLL_RETRY_MIN, and with an expiry date of 'timeo' jiffies after the
first attempt.

--
Trond Myklebust
Linux NFS client maintainer
On Jan 14, 2014, at 17:43, Trond Myklebust <trond.myklebust@primarydata.com> wrote:

>
> On Jan 14, 2014, at 17:21, Boaz Harrosh <bharrosh@panasas.com> wrote:
>
>> On 01/14/2014 09:05 PM, Trond Myklebust wrote:
>>> On Tue, 2014-01-14 at 17:32 +0200, Boaz Harrosh wrote:
>>>>
>>>
>>> For the default mount option of 'timeo=600', and the default #define
>>> NFS4_POLL_RETRY_MIN==HZ/10, this means we can end up pounding the server
>>> with 600 LAYOUTGET requests within the space of 1 minute, before giving
>>> up. Is that reasonable?
>>>
>>
>> It will never get there; it will always be one or two sends. Usually it is
>> just so the sequence of layout_get_done is out of the way and the
>> LAYOUT_RECALL sequence+1 can get through and the layout released. Then
>> the next time it will all be good and the LAYOUT_GET will succeed.
>>
>> Worst case is when the client is very busy with a queue full of IO
>> on the same busy layout that needs to be released by the recall. Personally
>> I found that this never exceeds 40 IOPs in flight. Note that this is not
>> the amount of total dirty memory but only the amount of already submitted
>> IO. I guess that on a very slow connection these can take time but at
>> regular line speeds I never observed more than 2 retries with this patch.
>>
>> It is all up to the client. NFS4ERR_RECALLCONFLICT means "the layouts you
>> have need to be released" (I say released because the forgetful model does
>> not actually return them). Can you see a critical time when layouts are
>> held for longer than a second?
>
> That will probably depend on the workload and possibly on the layout type.
>
> My point was, however, about the potential for mischief due to the mismatch
> between the number of retries that the resulting code allows, and the fixed
> period between those retries of 1/10 seconds. Why not rather use something
> along the lines of
> "rpc_delay(rpc_task, min(giveup - jiffies, max(jiffies - lgp->args.timestamp, NFS4_POLL_RETRY_MIN)));"?
> That gives you an initially exponential back off with a minimum period of
> NFS4_POLL_RETRY_MIN, and with an expiry date of 'timeo' jiffies after the
> first attempt.

Whoops. That should probably be

max(NFS4_POLL_RETRY_MIN, min(giveup - jiffies, jiffies - lgp->args.timestamp))

so that the time interval is not < NFS4_POLL_RETRY_MIN.

--
Trond Myklebust
Linux NFS client maintainer
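The corrected expression can be modelled in userspace to see how the delay grows. This is a sketch under stated assumptions, not the patch: `layoutget_retry_delay` and `count_retries` are hypothetical names, `now`, `start` and `giveup` stand in for `jiffies`, `lgp->args.timestamp` and `lgp->args.timestamp + timeo`, every server reply is assumed to arrive instantly, and the tick counter is assumed not to wrap.

```c
#include <assert.h>

/* Model of max(retry_min, min(giveup - now, now - start)).
 * Assumes start <= now < giveup, so the subtractions cannot wrap. */
unsigned long layoutget_retry_delay(unsigned long now, unsigned long start,
                                    unsigned long giveup,
                                    unsigned long retry_min)
{
    unsigned long elapsed = now - start;     /* time since first attempt */
    unsigned long remaining = giveup - now;  /* time left before giving up */
    unsigned long delay = elapsed < remaining ? elapsed : remaining;

    return delay > retry_min ? delay : retry_min;
}

/* Count how many retries are scheduled before the give-up time when
 * every attempt fails immediately with the same error. */
unsigned long count_retries(unsigned long timeo, unsigned long retry_min)
{
    unsigned long start = 0;
    unsigned long giveup = start + timeo;
    unsigned long now = start;
    unsigned long retries = 0;

    assert(retry_min > 0); /* a zero minimum would never advance `now` */

    /* Mirrors the time_after(giveup, jiffies) check in the patch. */
    while (now < giveup) {
        now += layoutget_retry_delay(now, start, giveup, retry_min);
        retries++;
    }
    return retries;
}
```

Under these assumptions, `count_retries(60000, 100)` (a 60-second window and a 1/10-second minimum at 1000 ticks per second) schedules 11 retries, with delays of 100, 100, 200, 400 and so on, compared with the 600 requests discussed above for a fixed 1/10-second delay.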
On 01/15/2014 12:47 AM, Trond Myklebust wrote:
>
> On Jan 14, 2014, at 17:43, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>
>>
>> On Jan 14, 2014, at 17:21, Boaz Harrosh <bharrosh@panasas.com> wrote:
>>
>>> On 01/14/2014 09:05 PM, Trond Myklebust wrote:
>>>> On Tue, 2014-01-14 at 17:32 +0200, Boaz Harrosh wrote:
>>>>>
>>>>
>>>> For the default mount option of 'timeo=600', and the default #define
>>>> NFS4_POLL_RETRY_MIN==HZ/10, this means we can end up pounding the server
>>>> with 600 LAYOUTGET requests within the space of 1 minute, before giving
>>>> up. Is that reasonable?
>>>>
>>>
>>> It will never get there; it will always be one or two sends. Usually it is
>>> just so the sequence of layout_get_done is out of the way and the
>>> LAYOUT_RECALL sequence+1 can get through and the layout released. Then
>>> the next time it will all be good and the LAYOUT_GET will succeed.
>>>
>>> Worst case is when the client is very busy with a queue full of IO
>>> on the same busy layout that needs to be released by the recall. Personally
>>> I found that this never exceeds 40 IOPs in flight. Note that this is not
>>> the amount of total dirty memory but only the amount of already submitted
>>> IO. I guess that on a very slow connection these can take time but at
>>> regular line speeds I never observed more than 2 retries with this patch.
>>>
>>> It is all up to the client. NFS4ERR_RECALLCONFLICT means "the layouts you
>>> have need to be released" (I say released because the forgetful model does
>>> not actually return them). Can you see a critical time when layouts are
>>> held for longer than a second?
>>
>> That will probably depend on the workload and possibly on the layout type.
>>
>> My point was, however, about the potential for mischief due to the mismatch
>> between the number of retries that the resulting code allows, and the fixed
>> period between those retries of 1/10 seconds. Why not rather use something
>> along the lines of
>> "rpc_delay(rpc_task, min(giveup - jiffies, max(jiffies - lgp->args.timestamp, NFS4_POLL_RETRY_MIN)));"?
>> That gives you an initially exponential back off with a minimum period of
>> NFS4_POLL_RETRY_MIN, and with an expiry date of 'timeo' jiffies after the
>> first attempt.
>
> Whoops. That should probably be
>
> max(NFS4_POLL_RETRY_MIN, min(giveup - jiffies, jiffies - lgp->args.timestamp))
>
> so that the time interval is not < NFS4_POLL_RETRY_MIN.

OK I'll try that.

Thanks
Boaz
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index d53d678..3264fca 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -7058,7 +7058,7 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
 	struct nfs4_state *state = NULL;
 	unsigned long timeo, giveup;
 
-	dprintk("--> %s\n", __func__);
+	dprintk("--> %s tk_status => %d\n", __func__, task->tk_status);
 
 	if (!nfs41_sequence_done(task, &lgp->res.seq_res))
 		goto out;
@@ -7067,11 +7067,27 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
 	case 0:
 		goto out;
 	case -NFS4ERR_LAYOUTTRYLATER:
+	/* NFS4ERR_RECALLCONFLICT is always a minimal delay (conflict with
+	 * self)
+	 * TODO: NFS4ERR_LAYOUTTRYLATER is a conflict with another client
+	 * (or clients). What we should do is randomize a short delay like on a
+	 * network broadcast burst, and raise the random max every failure.
+	 * For now leave it stateless and do this polling.
+	 */
 	case -NFS4ERR_RECALLCONFLICT:
 		timeo = rpc_get_timeout(task->tk_client);
 		giveup = lgp->args.timestamp + timeo;
-		if (time_after(giveup, jiffies))
-			task->tk_status = -NFS4ERR_DELAY;
+		if (time_after(giveup, jiffies)) {
+			/* Do a minimum delay, We are actually waiting for our
+			 * own IO to finish (In most cases)
+			 */
+			dprintk("%s: NFS4ERR_RECALLCONFLICT waiting\n",
+				__func__);
+			rpc_delay(task, NFS4_POLL_RETRY_MIN);
+			task->tk_status = 0;
+			rpc_restart_call_prepare(task);
+			goto out; /* Do not call nfs4_async_handle_error() */
+		}
 		break;
 	case -NFS4ERR_EXPIRED:
 	case -NFS4ERR_BAD_STATEID:
An NFS4ERR_RECALLCONFLICT is returned by the server from a GET_LAYOUT
only when the server sent a RECALL due to that GET_LAYOUT, or
the RECALL and GET_LAYOUT crossed on the wire.
Either way, this means we want to wait at most until in-flight IO
is finished and the RECALL can be satisfied.

So a proper wait here is more like 1/10 of a second, not 15 seconds
like we have now. (We use NFS4_POLL_RETRY_MIN here)

Current code totally craps out performance of very large files on
most pnfs-objects layouts, because of how the map changes when the
file has grown beyond a raid group.

CC: Stable Tree <stable@vger.kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/nfs/nfs4proc.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)