Message ID | 1539543498-29105-5-git-send-email-jsimmons@infradead.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | lustre: more assorted fixes for lustre 2.10 | expand |
On Sun, Oct 14 2018, James Simmons wrote: > From: Doug Oucharek <dougso@me.com> > > There is a case in the routine ptlrpc_register_bulk() where we were > asserting if bd_nob_transferred != 0 when not resending. There is > evidence that network errors can create a situation where > this does happen. So we should not be asserting! > > This patch changes that assert to an error return code of -EIO. > > Signed-off-by: Doug Oucharek <dougso@me.com> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828 > Reviewed-on: https://review.whamcloud.com/28491 > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com> > Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com> > Reviewed-by: Oleg Drokin <green@whamcloud.com> > Signed-off-by: James Simmons <jsimmons@infradead.org> > --- > drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c > index 27eb1c0..7e7db24 100644 > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req) > /* cleanup the state of the bulk for it will be reused */ > if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY) > desc->bd_nob_transferred = 0; > - else > - LASSERT(desc->bd_nob_transferred == 0); > + else if (desc->bd_nob_transferred != 0) > + /* If the network failed after an RPC was sent, this condition > + * could happen. Rather than assert (was here before), return > + * an EIO error. > + */ > + return -EIO; This looks weird, and the justification is rather lame. I wonder if this is an attempt to fix the same problem that the smp_mb() in the previous patch was attempting to fix (and I'm not yet convinced that either is the correct fix). NeilBrown > > desc->bd_failure = 0; > > -- > 1.8.3.1
> On Sun, Oct 14 2018, James Simmons wrote: > > > From: Doug Oucharek <dougso@me.com> > > > > There is a case in the routine ptlrpc_register_bulk() where we were > > asserting if bd_nob_transferred != 0 when not resending. There is > > evidence that network errors can create a situation where > > this does happen. So we should not be asserting! > > > > This patch changes that assert to an error return code of -EIO. > > > > Signed-off-by: Doug Oucharek <dougso@me.com> > > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828 > > Reviewed-on: https://review.whamcloud.com/28491 > > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com> > > Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com> > > Reviewed-by: Oleg Drokin <green@whamcloud.com> > > Signed-off-by: James Simmons <jsimmons@infradead.org> > > --- > > drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++-- > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c > > index 27eb1c0..7e7db24 100644 > > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c > > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c > > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req) > > /* cleanup the state of the bulk for it will be reused */ > > if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY) > > desc->bd_nob_transferred = 0; > > - else > > - LASSERT(desc->bd_nob_transferred == 0); > > + else if (desc->bd_nob_transferred != 0) > > + /* If the network failed after an RPC was sent, this condition > > + * could happen. Rather than assert (was here before), return > > + * an EIO error. > > + */ > > + return -EIO; > > This looks weird, and the justification is rather lame. > I wonder if this is an attempt to fix the same problem that the smp_mb() > in the previous patch was attempting to fix (and I'm not yet convinced > that either is the correct fix). When the above condition happens the LASSERT ends up taking out the node with a panic which in turn kills the application running on the cluster. When replaced with reporting an EIO error the node survives as well as the job. The job might fail at its IO but it wouldn't fail performing its work flow which is way more important. > > desc->bd_failure = 0; > > > > -- > > 1.8.3.1 >
On Sun, Oct 21 2018, James Simmons wrote: >> On Sun, Oct 14 2018, James Simmons wrote: >> >> > From: Doug Oucharek <dougso@me.com> >> > >> > There is a case in the routine ptlrpc_register_bulk() where we were >> > asserting if bd_nob_transferred != 0 when not resending. There is >> > evidence that network errors can create a situation where >> > this does happen. So we should not be asserting! >> > >> > This patch changes that assert to an error return code of -EIO. >> > >> > Signed-off-by: Doug Oucharek <dougso@me.com> >> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828 >> > Reviewed-on: https://review.whamcloud.com/28491 >> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com> >> > Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com> >> > Reviewed-by: Oleg Drokin <green@whamcloud.com> >> > Signed-off-by: James Simmons <jsimmons@infradead.org> >> > --- >> > drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++-- >> > 1 file changed, 6 insertions(+), 2 deletions(-) >> > >> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c >> > index 27eb1c0..7e7db24 100644 >> > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c >> > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c >> > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req) >> > /* cleanup the state of the bulk for it will be reused */ >> > if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY) >> > desc->bd_nob_transferred = 0; >> > - else >> > - LASSERT(desc->bd_nob_transferred == 0); >> > + else if (desc->bd_nob_transferred != 0) >> > + /* If the network failed after an RPC was sent, this condition >> > + * could happen. Rather than assert (was here before), return >> > + * an EIO error. >> > + */ >> > + return -EIO; >> >> This looks weird, and the justification is rather lame. >> I wonder if this is an attempt to fix the same problem that the smp_mb() >> in the previous patch was attempting to fix (and I'm not yet convinced >> that either is the correct fix). > > When the above condition happens the LASSERT ends up taking out the > node with a panic which in turn kills the application running on the cluster. > When replaced with reporting an EIO error the node survives as well as the > job. The job might fail at its IO but it wouldn't fail performing its work > flow which is way more important. Yes, a meaningless error is better than a crash, but a proper fix is better still. As I said, my guess is that the memory barrier in the previous patch might have fixed the bug, so the LASSERT can remain. Doug: is there any chance that this might be the case? Thanks, NeilBrown
> >> On Sun, Oct 14 2018, James Simmons wrote: > >> > >> > From: Doug Oucharek <dougso@me.com> > >> > > >> > There is a case in the routine ptlrpc_register_bulk() where we were > >> > asserting if bd_nob_transferred != 0 when not resending. There is > >> > evidence that network errors can create a situation where > >> > this does happen. So we should not be asserting! > >> > > >> > This patch changes that assert to an error return code of -EIO. > >> > > >> > Signed-off-by: Doug Oucharek <dougso@me.com> > >> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828 > >> > Reviewed-on: https://review.whamcloud.com/28491 > >> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com> > >> > Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com> > >> > Reviewed-by: Oleg Drokin <green@whamcloud.com> > >> > Signed-off-by: James Simmons <jsimmons@infradead.org> > >> > --- > >> > drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++-- > >> > 1 file changed, 6 insertions(+), 2 deletions(-) > >> > > >> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c > >> > index 27eb1c0..7e7db24 100644 > >> > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c > >> > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c > >> > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req) > >> > /* cleanup the state of the bulk for it will be reused */ > >> > if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY) > >> > desc->bd_nob_transferred = 0; > >> > - else > >> > - LASSERT(desc->bd_nob_transferred == 0); > >> > + else if (desc->bd_nob_transferred != 0) > >> > + /* If the network failed after an RPC was sent, this condition > >> > + * could happen. Rather than assert (was here before), return > >> > + * an EIO error. > >> > + */ > >> > + return -EIO; > >> > >> This looks weird, and the justification is rather lame. > >> I wonder if this is an attempt to fix the same problem that the smp_mb() > >> in the previous patch was attempting to fix (and I'm not yet convinced > >> that either is the correct fix). > > > > When the above condition happens the LASSERT ends up taking out the > > node with a panic which in turn kills the application running on the cluster. > > When replaced with reporting an EIO error the node survives as well as the > > job. The job might fail at its IO but it wouldn't fail performing its work > > flow which is way more important. > > Yes, a meaningless error is better than a crash, but a proper fix is > better still. As I said, my guess is that the memory barrier in the > previous patch might have fixed the bug, so the LASSERT can remain. > > Doug: is there any chance that this might be the case? I got a hold of Doug and discussed this issue. So the answer is that the original logs to track down the original problem no longer exist. So finding the original source of the problem can't be done at this point. Would you be okay with a version of this patch with dump_stack() and treat it as a debug patch. We really need to collect logs to figure out the real problem. I will push a debug patch to OpenSFS branch since it is more widely used.
On Sun, Nov 04 2018, James Simmons wrote: >> >> On Sun, Oct 14 2018, James Simmons wrote: >> >> >> >> > From: Doug Oucharek <dougso@me.com> >> >> > >> >> > There is a case in the routine ptlrpc_register_bulk() where we were >> >> > asserting if bd_nob_transferred != 0 when not resending. There is >> >> > evidence that network errors can create a situation where >> >> > this does happen. So we should not be asserting! >> >> > >> >> > This patch changes that assert to an error return code of -EIO. >> >> > >> >> > Signed-off-by: Doug Oucharek <dougso@me.com> >> >> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828 >> >> > Reviewed-on: https://review.whamcloud.com/28491 >> >> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com> >> >> > Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com> >> >> > Reviewed-by: Oleg Drokin <green@whamcloud.com> >> >> > Signed-off-by: James Simmons <jsimmons@infradead.org> >> >> > --- >> >> > drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++-- >> >> > 1 file changed, 6 insertions(+), 2 deletions(-) >> >> > >> >> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c >> >> > index 27eb1c0..7e7db24 100644 >> >> > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c >> >> > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c >> >> > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req) >> >> > /* cleanup the state of the bulk for it will be reused */ >> >> > if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY) >> >> > desc->bd_nob_transferred = 0; >> >> > - else >> >> > - LASSERT(desc->bd_nob_transferred == 0); >> >> > + else if (desc->bd_nob_transferred != 0) >> >> > + /* If the network failed after an RPC was sent, this condition >> >> > + * could happen. Rather than assert (was here before), return >> >> > + * an EIO error. >> >> > + */ >> >> > + return -EIO; >> >> >> >> This looks weird, and the justification is rather lame. >> >> I wonder if this is an attempt to fix the same problem that the smp_mb() >> >> in the previous patch was attempting to fix (and I'm not yet convinced >> >> that either is the correct fix). >> > >> > When the above condition happens the LASSERT ends up taking out the >> > node with a panic which in turn kills the application running on the cluster. >> > When replaced with reporting an EIO error the node survives as well as the >> > job. The job might fail at its IO but it wouldn't fail performing its work >> > flow which is way more important. >> >> Yes, a meaningless error is better than a crash, but a proper fix is >> better still. As I said, my guess is that the memory barrier in the >> previous patch might have fixed the bug, so the LASSERT can remain. >> >> Doug: is there any chance that this might be the case? > > I got a hold of Doug and discussed this issue. So the answer is that the > original logs to track down the original problem no longer exist. So > finding the original source of the problem can't be done at this point. > Would you be okay with a version of this patch with dump_stack() and > treat it as a debug patch. We really need to collect logs to figure out > the real problem. I will push a debug patch to OpenSFS branch since it > is more widely used. Yes. else if (WARN_ON(desc->nb_nob_transferred != 0)) /* comment explaining what we know 8/ return -EIO would be perfectly appropriate. Thanks, NeilBrown
diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c index 27eb1c0..7e7db24 100644 --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req) /* cleanup the state of the bulk for it will be reused */ if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY) desc->bd_nob_transferred = 0; - else - LASSERT(desc->bd_nob_transferred == 0); + else if (desc->bd_nob_transferred != 0) + /* If the network failed after an RPC was sent, this condition + * could happen. Rather than assert (was here before), return + * an EIO error. + */ + return -EIO; desc->bd_failure = 0;