Message ID | 20190614221020.19173-4-hmadhani@marvell.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | qla2xxx: Fix crashes with FC-NVMe devices | expand |
On 6/14/19 3:10 PM, Himanshu Madhani wrote: > From: Quinn Tran <qutran@marvell.com> > > This patch uses kref to protect access between fcp_abort > path and nvme command and LS command completion path. > Stack trace below shows the abort path is accessing stale > memory (nvme_private->sp). > > When command kref reaches 0, nvme_private & srb resource will > be disconnected from each other. Any subsequence nvme abort > request will not be able to reference the original srb. > > [ 5631.003998] BUG: unable to handle kernel paging request at 00000010000005d8 > [ 5631.004016] IP: [<ffffffffc087df92>] qla_nvme_abort_work+0x22/0x100 [qla2xxx] > [ 5631.004086] Workqueue: events qla_nvme_abort_work [qla2xxx] > [ 5631.004097] RIP: 0010:[<ffffffffc087df92>] [<ffffffffc087df92>] qla_nvme_abort_work+0x22/0x100 [qla2xxx] > [ 5631.004109] Call Trace: > [ 5631.004115] [<ffffffffaa4b8174>] ? pwq_dec_nr_in_flight+0x64/0xb0 > [ 5631.004117] [<ffffffffaa4b9d4f>] process_one_work+0x17f/0x440 > [ 5631.004120] [<ffffffffaa4bade6>] worker_thread+0x126/0x3c0 > > Signed-off-by: Quinn Tran <qutran@marvell.com> > Signed-off-by: Himanshu Madhani <hmadhani@marvell.com> > --- > drivers/scsi/qla2xxx/qla_def.h | 2 + > drivers/scsi/qla2xxx/qla_nvme.c | 164 ++++++++++++++++++++++++++++------------ > drivers/scsi/qla2xxx/qla_nvme.h | 1 + > 3 files changed, 117 insertions(+), 50 deletions(-) > > diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h > index 602ed24bb806..85a27ee5d647 100644 > --- a/drivers/scsi/qla2xxx/qla_def.h > +++ b/drivers/scsi/qla2xxx/qla_def.h > @@ -532,6 +532,8 @@ typedef struct srb { > uint8_t cmd_type; > uint8_t pad[3]; > atomic_t ref_count; > + struct kref cmd_kref; /* need to migrate ref_count over to this */ > + void *priv; > wait_queue_head_t nvme_ls_waitq; > struct fc_port *fcport; > struct scsi_qla_host *vha; My feedback about this patch is as follows: - I think that having two atomic counters in struct srb with a very similar purpose is overkill and also that a single atomic counter should be sufficient in this structure. - The problem fixed by this patch not only can happen with NVMe-FC but also with FC. I would prefer to see all code paths fixed for which a race between completion and abort can occur. Thanks, Bart.
Hi Bart, On 6/14/19, 3:24 PM, "Bart Van Assche" <bvanassche@acm.org> wrote: External Email ---------------------------------------------------------------------- On 6/14/19 3:10 PM, Himanshu Madhani wrote: > From: Quinn Tran <qutran@marvell.com> > > This patch uses kref to protect access between fcp_abort > path and nvme command and LS command completion path. > Stack trace below shows the abort path is accessing stale > memory (nvme_private->sp). > > When command kref reaches 0, nvme_private & srb resource will > be disconnected from each other. Any subsequence nvme abort > request will not be able to reference the original srb. > > [ 5631.003998] BUG: unable to handle kernel paging request at 00000010000005d8 > [ 5631.004016] IP: [<ffffffffc087df92>] qla_nvme_abort_work+0x22/0x100 [qla2xxx] > [ 5631.004086] Workqueue: events qla_nvme_abort_work [qla2xxx] > [ 5631.004097] RIP: 0010:[<ffffffffc087df92>] [<ffffffffc087df92>] qla_nvme_abort_work+0x22/0x100 [qla2xxx] > [ 5631.004109] Call Trace: > [ 5631.004115] [<ffffffffaa4b8174>] ? pwq_dec_nr_in_flight+0x64/0xb0 > [ 5631.004117] [<ffffffffaa4b9d4f>] process_one_work+0x17f/0x440 > [ 5631.004120] [<ffffffffaa4bade6>] worker_thread+0x126/0x3c0 > > Signed-off-by: Quinn Tran <qutran@marvell.com> > Signed-off-by: Himanshu Madhani <hmadhani@marvell.com> > --- > drivers/scsi/qla2xxx/qla_def.h | 2 + > drivers/scsi/qla2xxx/qla_nvme.c | 164 ++++++++++++++++++++++++++++------------ > drivers/scsi/qla2xxx/qla_nvme.h | 1 + > 3 files changed, 117 insertions(+), 50 deletions(-) > > diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h > index 602ed24bb806..85a27ee5d647 100644 > --- a/drivers/scsi/qla2xxx/qla_def.h > +++ b/drivers/scsi/qla2xxx/qla_def.h > @@ -532,6 +532,8 @@ typedef struct srb { > uint8_t cmd_type; > uint8_t pad[3]; > atomic_t ref_count; > + struct kref cmd_kref; /* need to migrate ref_count over to this */ > + void *priv; > wait_queue_head_t nvme_ls_waitq; > struct fc_port *fcport; > struct scsi_qla_host *vha; My feedback about this patch is as follows: - I think that having two atomic counters in struct srb with a very similar purpose is overkill and also that a single atomic counter should be sufficient in this structure. - The problem fixed by this patch not only can happen with NVMe-FC but also with FC. I would prefer to see all code paths fixed for which a race between completion and abort can occur. Yes. We are in process of doing the larger cleanup. However, this patch was part of fixes we verified for a crash and want to get this in a distro before the wider cleanup is submitted for inclusion. Would you consider allowing us to add this patch and we'll provide larger patch fixing all code path in next series. Thanks, Bart.
On 6/14/19 3:58 PM, Himanshu Madhani wrote: > Would you consider allowing us to add this patch and we'll provide > larger patch fixing all code path in next series. Hi Himanshu, I think this is something Martin should decide. Thanks, Bart.
On 6/14/19 3:58 PM, Himanshu Madhani wrote: > Yes. We are in process of doing the larger cleanup. However, this > patch was part of fixes we verified for a crash and want to get this > in a distro before the wider cleanup is submitted for inclusion. Hi Himanshu, It's not that hard to fix this properly. It would be appreciated if you could have a look at the series with six patches that I just posted. Thanks, Bart.
Bart, Attached is the clean-up patch that we held back from the series. We felt it wasn't ready for wider audience because it needed additional soak time with our test group. We want to ahead and share it with you to let you know that we intent to cleanup the duplicate atomic [ref_count|kref]. Once it has some soak time in our test group, we'll submit it in the next RC window. Thanks. Regards, Quinn Tran On 6/17/19, 1:42 PM, "linux-scsi-owner@vger.kernel.org on behalf of Bart Van Assche" <linux-scsi-owner@vger.kernel.org on behalf of bvanassche@acm.org> wrote: On 6/14/19 3:58 PM, Himanshu Madhani wrote: > Yes. We are in process of doing the larger cleanup. However, this > patch was part of fixes we verified for a crash and want to get this > in a distro before the wider cleanup is submitted for inclusion. Hi Himanshu, It's not that hard to fix this properly. It would be appreciated if you could have a look at the series with six patches that I just posted. Thanks, Bart.
On 6/17/19 5:01 PM, Quinn Tran wrote: > Attached is the clean-up patch that we held back from the series. > We felt it wasn't ready for wider audience because it needed additional > soak time with our test group. > > We want to ahead and share it with you to let you know that we intent > to cleanup the duplicate atomic [ref_count|kref]. Once it has some > soak time in our test group, we'll submit it in the next RC window. Hi Quinn, Thank you for having shared that patch early. My comments about that patch are as follows: - The patch description is not correct. Today freeing of an SRB does not happen when ref_count reaches zero but it happens when the firmware reports a completion. That is why today the abort code can trigger a use-after-free. ref_count is only useful today for the abort code to detect atomically whether or not the firmware already reported that a request completed. - Only calling cmd->scsi-done(cmd) when the reference count reaches zero involves a behavior change. If a command completion and a request to abort a command race, this patch will report the command as aborted to the SCSI mid-layer instead of as completed. This change has not been mentioned in the patch description. Is this change perhaps unintentional? - I think that this patch does not address the memory leak that can be triggered by aborting a command. If a command is aborted it will be freed by qla_release_fcp_cmd_kref() calling qla2xxx_rel_qpair_sp() and by qla2xxx_rel_qpair_sp() calling sp->free(sp). However, the implementation of the free function for multiple SRB types is incomplete. - The approach for avoiding that qla2xxx_eh_abort() triggers a use-after-free (the new srb_private structure) is interesting. However, that approach does not look correct to me. The patch attached to the previous e-mail inserts the following code in qla2xxx_eh_abort(): 'sp = CMD_SP(cmd); if (!sp || !sp->qpair || ...) return SUCCESS' As one can see the sp pointer is dereferenced although the memory it points at could already have been freed and could have been reallocated by another driver or another process. So I don't think the new code for avoiding a use-after-free is correct. Thanks, Bart.
On 6/18/19, 8:57 AM, "Bart Van Assche" <bvanassche@acm.org> wrote: On 6/17/19 5:01 PM, Quinn Tran wrote: > Attached is the clean-up patch that we held back from the series. > We felt it wasn't ready for wider audience because it needed additional > soak time with our test group. > > We want to ahead and share it with you to let you know that we intent > to cleanup the duplicate atomic [ref_count|kref]. Once it has some > soak time in our test group, we'll submit it in the next RC window. Hi Quinn, Thank you for having shared that patch early. My comments about that patch are as follows: - The patch description is not correct. Today freeing of an SRB does not happen when ref_count reaches zero but it happens when the firmware reports a completion. That is why today the abort code can trigger a use-after-free. ref_count is only useful today for the abort code to detect atomically whether or not the firmware already reported that a request completed. QT: Bart, thanks for the additional eyes. This patch made additional safe guard to prevent use after free. When the original IO arrive into QLA driver, QLA will bind scsi_cmnd & srb together. When the kref reaches 0, we will unbind the 2 structures with NULL pointer. Any attempt on 'use after free' will be block by null pointer. We reserve additional scratch space at the end of scsi_cmnd (srb_private) to facilitate the bind and unbind. 879 qla_release_fcp_cmd_kref(struct kref *kref) 880 { 881 struct srb *sp = container_of(kref, struct srb, cmd_kref); 882 struct scsi_cmnd *cmd = GET_CMD_SP(sp); 884 struct srb_private *priv = (struct srb_private*)(cmd + 1); .. 888 spin_lock_irqsave(&priv->cmd_lock, flags); 889 CMD_SP(cmd) = NULL; <<<< unbind scsi_cmnd from srb. 890 sp->u.scmd.cmd = NULL; 891 spin_unlock_irqrestore(&priv->cmd_lock, flags); } 1359 static int 1360 qla2xxx_eh_abort(struct scsi_cmnd *cmd) 1361 { .. 1370 struct srb_private *priv = (struct srb_private*)(cmd + 1); .. 1382 1383 spin_lock_irqsave(&priv->cmd_lock, flags); 1384 sp = (srb_t *) CMD_SP(cmd); 1385 if (!sp || !sp->qpair || 1386 (priv->cmd_id != sp->cmd_id) || 1387 (sp->fcport && sp->fcport->deleted)) { 1388 spin_unlock_irqrestore(&priv->cmd_lock, flags); 1389 return SUCCESS; 1390 } - Only calling cmd->scsi-done(cmd) when the reference count reaches zero involves a behavior change. If a command completion and a request to abort a command race, this patch will report the command as aborted to the SCSI mid-layer instead of as completed. This change has not been mentioned in the patch description. Is this change perhaps unintentional? QT: I believe the original code has the same behavior. We're trying to preserve precedent. Will revisit to differentiate the 2 status code the described race. - I think that this patch does not address the memory leak that can be triggered by aborting a command. If a command is aborted it will be freed by qla_release_fcp_cmd_kref() calling qla2xxx_rel_qpair_sp() and by qla2xxx_rel_qpair_sp() calling sp->free(sp). However, the implementation of the free function for multiple SRB types is incomplete. QT: As for the other type SRB, I follow what you're pointing at. We do have another patch(is) to address this other corner. Will queue it up to go out also with the next window. It's currently being soak in our test group. - The approach for avoiding that qla2xxx_eh_abort() triggers a use-after-free (the new srb_private structure) is interesting. However, that approach does not look correct to me. The patch attached to the previous e-mail inserts the following code in qla2xxx_eh_abort(): 'sp = CMD_SP(cmd); if (!sp || !sp->qpair || ...) return SUCCESS' As one can see the sp pointer is dereferenced although the memory it points at could already have been freed and could have been reallocated by another driver or another process. So I don't think the new code for avoiding a use-after-free is correct. QT: OK. Will remove old defensive check that may still cause use after free. ---- Thanks, Bart.
diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h index 602ed24bb806..85a27ee5d647 100644 --- a/drivers/scsi/qla2xxx/qla_def.h +++ b/drivers/scsi/qla2xxx/qla_def.h @@ -532,6 +532,8 @@ typedef struct srb { uint8_t cmd_type; uint8_t pad[3]; atomic_t ref_count; + struct kref cmd_kref; /* need to migrate ref_count over to this */ + void *priv; wait_queue_head_t nvme_ls_waitq; struct fc_port *fcport; struct scsi_qla_host *vha; diff --git a/drivers/scsi/qla2xxx/qla_nvme.c b/drivers/scsi/qla2xxx/qla_nvme.c index ead10e1a81fc..b56dcab9d265 100644 --- a/drivers/scsi/qla2xxx/qla_nvme.c +++ b/drivers/scsi/qla2xxx/qla_nvme.c @@ -135,53 +135,91 @@ static int qla_nvme_alloc_queue(struct nvme_fc_local_port *lport, return 0; } +static void qla_nvme_release_fcp_cmd_kref(struct kref *kref) +{ + struct srb *sp = container_of(kref, struct srb, cmd_kref); + struct nvme_private *priv = (struct nvme_private *)sp->priv; + struct nvmefc_fcp_req *fd; + struct srb_iocb *nvme; + unsigned long flags; + + if (!priv) + goto out; + + nvme = &sp->u.iocb_cmd; + fd = nvme->u.nvme.desc; + + spin_lock_irqsave(&priv->cmd_lock, flags); + priv->sp = NULL; + sp->priv = NULL; + if (priv->comp_status == QLA_SUCCESS) { + fd->rcv_rsplen = nvme->u.nvme.rsp_pyld_len; + } else { + fd->rcv_rsplen = 0; + fd->transferred_length = 0; + } + fd->status = 0; + spin_unlock_irqrestore(&priv->cmd_lock, flags); + + fd->done(fd); +out: + qla2xxx_rel_qpair_sp(sp->qpair, sp); +} + +static void qla_nvme_release_ls_cmd_kref(struct kref *kref) +{ + struct srb *sp = container_of(kref, struct srb, cmd_kref); + struct nvme_private *priv = (struct nvme_private *)sp->priv; + struct nvmefc_ls_req *fd; + unsigned long flags; + + if (!priv) + goto out; + + spin_lock_irqsave(&priv->cmd_lock, flags); + priv->sp = NULL; + sp->priv = NULL; + spin_unlock_irqrestore(&priv->cmd_lock, flags); + + fd = priv->fd; + fd->done(fd, priv->comp_status); +out: + qla2x00_rel_sp(sp); +} + +static void qla_nvme_ls_complete(struct work_struct *work) +{ + struct nvme_private *priv = + container_of(work, struct nvme_private, ls_work); + + kref_put(&priv->sp->cmd_kref, qla_nvme_release_ls_cmd_kref); +} + static void qla_nvme_sp_ls_done(void *ptr, int res) { srb_t *sp = ptr; - struct srb_iocb *nvme; - struct nvmefc_ls_req *fd; struct nvme_private *priv; - if (WARN_ON_ONCE(atomic_read(&sp->ref_count) == 0)) + if (WARN_ON_ONCE(kref_read(&sp->cmd_kref) == 0)) return; - atomic_dec(&sp->ref_count); - if (res) res = -EINVAL; - nvme = &sp->u.iocb_cmd; - fd = nvme->u.nvme.desc; - priv = fd->private; + priv = (struct nvme_private *)sp->priv; priv->comp_status = res; + INIT_WORK(&priv->ls_work, qla_nvme_ls_complete); schedule_work(&priv->ls_work); - /* work schedule doesn't need the sp */ - qla2x00_rel_sp(sp); } +/* it assumed that QPair lock is held. */ static void qla_nvme_sp_done(void *ptr, int res) { srb_t *sp = ptr; - struct srb_iocb *nvme; - struct nvmefc_fcp_req *fd; + struct nvme_private *priv = (struct nvme_private *)sp->priv; - nvme = &sp->u.iocb_cmd; - fd = nvme->u.nvme.desc; - - if (WARN_ON_ONCE(atomic_read(&sp->ref_count) == 0)) - return; - - atomic_dec(&sp->ref_count); - - if (res == QLA_SUCCESS) { - fd->rcv_rsplen = nvme->u.nvme.rsp_pyld_len; - } else { - fd->rcv_rsplen = 0; - fd->transferred_length = 0; - } - fd->status = 0; - fd->done(fd); - qla2xxx_rel_qpair_sp(sp->qpair, sp); + priv->comp_status = res; + kref_put(&sp->cmd_kref, qla_nvme_release_fcp_cmd_kref); return; } @@ -200,44 +238,53 @@ static void qla_nvme_abort_work(struct work_struct *work) __func__, sp, sp->handle, fcport, fcport->deleted); if (!ha->flags.fw_started && (fcport && fcport->deleted)) - return; + goto out; if (ha->flags.host_shutting_down) { ql_log(ql_log_info, sp->fcport->vha, 0xffff, "%s Calling done on sp: %p, type: 0x%x, sp->ref_count: 0x%x\n", __func__, sp, sp->type, atomic_read(&sp->ref_count)); sp->done(sp, 0); - return; + goto out; } - if (WARN_ON_ONCE(atomic_read(&sp->ref_count) == 0)) - return; - rval = ha->isp_ops->abort_command(sp); ql_dbg(ql_dbg_io, fcport->vha, 0x212b, "%s: %s command for sp=%p, handle=%x on fcport=%p rval=%x\n", __func__, (rval != QLA_SUCCESS) ? "Failed to abort" : "Aborted", sp, sp->handle, fcport, rval); + +out: + /* kref_get was done before work was schedule. */ + if (sp->type == SRB_NVME_CMD) + kref_put(&sp->cmd_kref, qla_nvme_release_fcp_cmd_kref); + else if (sp->type == SRB_NVME_LS) + kref_put(&sp->cmd_kref, qla_nvme_release_ls_cmd_kref); } static void qla_nvme_ls_abort(struct nvme_fc_local_port *lport, struct nvme_fc_remote_port *rport, struct nvmefc_ls_req *fd) { struct nvme_private *priv = fd->private; + unsigned long flags; + + spin_lock_irqsave(&priv->cmd_lock, flags); + if (!priv->sp) { + spin_unlock_irqrestore(&priv->cmd_lock, flags); + return; + } + + if (!kref_get_unless_zero(&priv->sp->cmd_kref)) { + spin_unlock_irqrestore(&priv->cmd_lock, flags); + return; + } + spin_unlock_irqrestore(&priv->cmd_lock, flags); INIT_WORK(&priv->abort_work, qla_nvme_abort_work); schedule_work(&priv->abort_work); } -static void qla_nvme_ls_complete(struct work_struct *work) -{ - struct nvme_private *priv = - container_of(work, struct nvme_private, ls_work); - struct nvmefc_ls_req *fd = priv->fd; - - fd->done(fd, priv->comp_status); -} static int qla_nvme_ls_req(struct nvme_fc_local_port *lport, struct nvme_fc_remote_port *rport, struct nvmefc_ls_req *fd) @@ -265,11 +312,12 @@ static int qla_nvme_ls_req(struct nvme_fc_local_port *lport, sp->type = SRB_NVME_LS; sp->name = "nvme_ls"; sp->done = qla_nvme_sp_ls_done; - atomic_set(&sp->ref_count, 1); - nvme = &sp->u.iocb_cmd; + sp->priv = (void *)priv; priv->sp = sp; + kref_init(&sp->cmd_kref); + spin_lock_init(&priv->cmd_lock); + nvme = &sp->u.iocb_cmd; priv->fd = fd; - INIT_WORK(&priv->ls_work, qla_nvme_ls_complete); nvme->u.nvme.desc = fd; nvme->u.nvme.dir = 0; nvme->u.nvme.dl = 0; @@ -286,9 +334,10 @@ static int qla_nvme_ls_req(struct nvme_fc_local_port *lport, if (rval != QLA_SUCCESS) { ql_log(ql_log_warn, vha, 0x700e, "qla2x00_start_sp failed = %d\n", rval); - atomic_dec(&sp->ref_count); wake_up(&sp->nvme_ls_waitq); - sp->free(sp); + sp->priv = NULL; + priv->sp = NULL; + qla2x00_rel_sp(sp); return rval; } @@ -300,6 +349,18 @@ static void qla_nvme_fcp_abort(struct nvme_fc_local_port *lport, struct nvmefc_fcp_req *fd) { struct nvme_private *priv = fd->private; + unsigned long flags; + + spin_lock_irqsave(&priv->cmd_lock, flags); + if (!priv->sp) { + spin_unlock_irqrestore(&priv->cmd_lock, flags); + return; + } + if (!kref_get_unless_zero(&priv->sp->cmd_kref)) { + spin_unlock_irqrestore(&priv->cmd_lock, flags); + return; + } + spin_unlock_irqrestore(&priv->cmd_lock, flags); INIT_WORK(&priv->abort_work, qla_nvme_abort_work); schedule_work(&priv->abort_work); @@ -523,8 +584,10 @@ static int qla_nvme_post_cmd(struct nvme_fc_local_port *lport, if (!sp) return -EBUSY; - atomic_set(&sp->ref_count, 1); init_waitqueue_head(&sp->nvme_ls_waitq); + kref_init(&sp->cmd_kref); + spin_lock_init(&priv->cmd_lock); + sp->priv = (void *)priv; priv->sp = sp; sp->type = SRB_NVME_CMD; sp->name = "nvme_cmd"; @@ -538,9 +601,10 @@ static int qla_nvme_post_cmd(struct nvme_fc_local_port *lport, if (rval != QLA_SUCCESS) { ql_log(ql_log_warn, vha, 0x212d, "qla2x00_start_nvme_mq failed = %d\n", rval); - atomic_dec(&sp->ref_count); wake_up(&sp->nvme_ls_waitq); - sp->free(sp); + sp->priv = NULL; + priv->sp = NULL; + qla2xxx_rel_qpair_sp(sp->qpair, sp); } return rval; diff --git a/drivers/scsi/qla2xxx/qla_nvme.h b/drivers/scsi/qla2xxx/qla_nvme.h index 2d088add7011..67bb4a2a3742 100644 --- a/drivers/scsi/qla2xxx/qla_nvme.h +++ b/drivers/scsi/qla2xxx/qla_nvme.h @@ -34,6 +34,7 @@ struct nvme_private { struct work_struct ls_work; struct work_struct abort_work; int comp_status; + spinlock_t cmd_lock; }; struct qla_nvme_rport {