Message ID | 1546997746-50323-1-git-send-email-tanxiaofei@huawei.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [resend,v3,rdma-rc,1/1] RDMA/hns: Add the process of AEQ overflow for hip08 | expand |
On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote: > AEQ overflow will be reported by hardware when too many > asynchronous events occurred but not be handled in time. > Normally, AEQ overflow error is not easy to occur. Once > happened, we have to do physical function reset to recover. > PF reset is implemented in two steps. Firstly, set reset > level with ae_dev->ops->set_default_reset_request. > Secondly, run reset with ae_dev->ops->reset_event. > > Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> > Signed-off-by: Yixian Liu <liuyixian@huawei.com> > --- > drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) Why should this be a -rc patch? It doesn't look like it meets the requires for a stable kernel, or fixing something introduced in the merge window. This looks like a new feature to me. Jason
Hi Jason, On 2019/1/12 5:25, Jason Gunthorpe wrote: > On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote: >> AEQ overflow will be reported by hardware when too many >> asynchronous events occurred but not be handled in time. >> Normally, AEQ overflow error is not easy to occur. Once >> happened, we have to do physical function reset to recover. >> PF reset is implemented in two steps. Firstly, set reset >> level with ae_dev->ops->set_default_reset_request. >> Secondly, run reset with ae_dev->ops->reset_event. >> >> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> >> Signed-off-by: Yixian Liu <liuyixian@huawei.com> >> --- >> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++ >> 1 file changed, 11 insertions(+) > > Why should this be a -rc patch? It doesn't look like it meets the > requires for a stable kernel, or fixing something introduced in the > merge window. > > This looks like a new feature to me. > I think we could take this as a bug. Because the device can't continue working, if we don't handle the AEQ overflow error. And the code is simple and doesn't bring any unstable factor. Then i take it as -rc patch.
On Sat, Jan 12, 2019 at 08:21:38PM +0800, tanxiaofei wrote: > Hi Jason, > > On 2019/1/12 5:25, Jason Gunthorpe wrote: > > On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote: > >> AEQ overflow will be reported by hardware when too many > >> asynchronous events occurred but not be handled in time. > >> Normally, AEQ overflow error is not easy to occur. Once > >> happened, we have to do physical function reset to recover. > >> PF reset is implemented in two steps. Firstly, set reset > >> level with ae_dev->ops->set_default_reset_request. > >> Secondly, run reset with ae_dev->ops->reset_event. > >> > >> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> > >> Signed-off-by: Yixian Liu <liuyixian@huawei.com> > >> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++ > >> 1 file changed, 11 insertions(+) > > > > Why should this be a -rc patch? It doesn't look like it meets the > > requires for a stable kernel, or fixing something introduced in the > > merge window. > > > > This looks like a new feature to me. > > I think we could take this as a bug. Because the device can't > continue working, if we don't handle the AEQ overflow error. And the > code is simple and doesn't bring any unstable factor. Then i take > it as -rc patch. recovery from hardware failures is a new feature. If this is fixing a bug in the existing recovery, or is a major problem that many users are hitting, then explain in the commit comment. Jason
On 2019/1/16 6:03, Jason Gunthorpe wrote: > On Sat, Jan 12, 2019 at 08:21:38PM +0800, tanxiaofei wrote: >> Hi Jason, >> >> On 2019/1/12 5:25, Jason Gunthorpe wrote: >>> On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote: >>>> AEQ overflow will be reported by hardware when too many >>>> asynchronous events occurred but not be handled in time. >>>> Normally, AEQ overflow error is not easy to occur. Once >>>> happened, we have to do physical function reset to recover. >>>> PF reset is implemented in two steps. Firstly, set reset >>>> level with ae_dev->ops->set_default_reset_request. >>>> Secondly, run reset with ae_dev->ops->reset_event. >>>> >>>> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> >>>> Signed-off-by: Yixian Liu <liuyixian@huawei.com> >>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++ >>>> 1 file changed, 11 insertions(+) >>> >>> Why should this be a -rc patch? It doesn't look like it meets the >>> requires for a stable kernel, or fixing something introduced in the >>> merge window. >>> >>> This looks like a new feature to me. >> >> I think we could take this as a bug. Because the device can't >> continue working, if we don't handle the AEQ overflow error. And the >> code is simple and doesn't bring any unstable factor. Then i take >> it as -rc patch. > > recovery from hardware failures is a new feature. > > If this is fixing a bug in the existing recovery, or is a major > problem that many users are hitting, then explain in the commit > comment. > Hi Jason, Thank you very much. How about change the commit log as following? RDMA/hns: fix the issue of handling AEQ overflow for hip08 AEQ overflow will be reported by hardware when too many asynchronous events occurred but not be handled in time. Currently, we just do the print when AEQ overflow happened. This is not right. We should also do physical function reset, otherwise, the device can't continue working. PF reset is implemented in two steps. Firstly, set reset level with ae_dev->ops->set_default_reset_request. Secondly, run reset with ae_dev->ops->reset_event. Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Yixian Liu <liuyixian@huawei.com> > Jason > >
On Wed, Jan 16, 2019 at 08:32:30PM +0800, tanxiaofei wrote: > > On 2019/1/16 6:03, Jason Gunthorpe wrote: > > On Sat, Jan 12, 2019 at 08:21:38PM +0800, tanxiaofei wrote: > >> Hi Jason, > >> > >> On 2019/1/12 5:25, Jason Gunthorpe wrote: > >>> On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote: > >>>> AEQ overflow will be reported by hardware when too many > >>>> asynchronous events occurred but not be handled in time. > >>>> Normally, AEQ overflow error is not easy to occur. Once > >>>> happened, we have to do physical function reset to recover. > >>>> PF reset is implemented in two steps. Firstly, set reset > >>>> level with ae_dev->ops->set_default_reset_request. > >>>> Secondly, run reset with ae_dev->ops->reset_event. > >>>> > >>>> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> > >>>> Signed-off-by: Yixian Liu <liuyixian@huawei.com> > >>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++ > >>>> 1 file changed, 11 insertions(+) > >>> > >>> Why should this be a -rc patch? It doesn't look like it meets the > >>> requires for a stable kernel, or fixing something introduced in the > >>> merge window. > >>> > >>> This looks like a new feature to me. > >> > >> I think we could take this as a bug. Because the device can't > >> continue working, if we don't handle the AEQ overflow error. And the > >> code is simple and doesn't bring any unstable factor. Then i take > >> it as -rc patch. > > > > recovery from hardware failures is a new feature. > > > > If this is fixing a bug in the existing recovery, or is a major > > problem that many users are hitting, then explain in the commit > > comment. > > > Hi Jason, > > Thank you very much. > How about change the commit log as following? > > RDMA/hns: fix the issue of handling AEQ overflow for hip08 > > AEQ overflow will be reported by hardware when too many > asynchronous events occurred but not be handled in time. > > Currently, we just do the print when AEQ overflow happened. > This is not right. We should also do physical function reset, > otherwise, the device can't continue working. > > PF reset is implemented in two steps. Firstly, set reset > level with ae_dev->ops->set_default_reset_request. > Secondly, run reset with ae_dev->ops->reset_event. This still doesn't explain why this is for -rc. New features are not for the -rc tree. You'd need to explain how/why that this is a widespread a critical problem in your user base Jason
Hi Jason, On 2019/1/19 3:58, Jason Gunthorpe wrote: > On Wed, Jan 16, 2019 at 08:32:30PM +0800, tanxiaofei wrote: >> >> On 2019/1/16 6:03, Jason Gunthorpe wrote: >>> On Sat, Jan 12, 2019 at 08:21:38PM +0800, tanxiaofei wrote: >>>> Hi Jason, >>>> >>>> On 2019/1/12 5:25, Jason Gunthorpe wrote: >>>>> On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote: >>>>>> AEQ overflow will be reported by hardware when too many >>>>>> asynchronous events occurred but not be handled in time. >>>>>> Normally, AEQ overflow error is not easy to occur. Once >>>>>> happened, we have to do physical function reset to recover. >>>>>> PF reset is implemented in two steps. Firstly, set reset >>>>>> level with ae_dev->ops->set_default_reset_request. >>>>>> Secondly, run reset with ae_dev->ops->reset_event. >>>>>> >>>>>> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com> >>>>>> Signed-off-by: Yixian Liu <liuyixian@huawei.com> >>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++ >>>>>> 1 file changed, 11 insertions(+) >>>>> >>>>> Why should this be a -rc patch? It doesn't look like it meets the >>>>> requires for a stable kernel, or fixing something introduced in the >>>>> merge window. >>>>> >>>>> This looks like a new feature to me. >>>> >>>> I think we could take this as a bug. Because the device can't >>>> continue working, if we don't handle the AEQ overflow error. And the >>>> code is simple and doesn't bring any unstable factor. Then i take >>>> it as -rc patch. >>> >>> recovery from hardware failures is a new feature. >>> >>> If this is fixing a bug in the existing recovery, or is a major >>> problem that many users are hitting, then explain in the commit >>> comment. >>> >> Hi Jason, >> >> Thank you very much. >> How about change the commit log as following? >> >> RDMA/hns: fix the issue of handling AEQ overflow for hip08 >> >> AEQ overflow will be reported by hardware when too many >> asynchronous events occurred but not be handled in time. >> >> Currently, we just do the print when AEQ overflow happened. >> This is not right. We should also do physical function reset, >> otherwise, the device can't continue working. >> >> PF reset is implemented in two steps. Firstly, set reset >> level with ae_dev->ops->set_default_reset_request. >> Secondly, run reset with ae_dev->ops->reset_event. > > This still doesn't explain why this is for -rc. > > New features are not for the -rc tree. > > You'd need to explain how/why that this is a widespread a critical > problem in your user base > Maybe this patch is not suitable for the -rc branch. Hmm, I will resend this patch based on -next. Many thanks. > Jason > > . >
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c index 3a66945..317539a 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c @@ -4692,11 +4692,22 @@ static irqreturn_t hns_roce_v2_msix_interrupt_abn(int irq, void *dev_id) int_en = roce_read(hr_dev, ROCEE_VF_ABN_INT_EN_REG); if (roce_get_bit(int_st, HNS_ROCE_V2_VF_INT_ST_AEQ_OVERFLOW_S)) { + struct pci_dev *pdev = hr_dev->pci_dev; + struct hnae3_ae_dev *ae_dev = pci_get_drvdata(pdev); + const struct hnae3_ae_ops *ops = ae_dev->ops; + dev_err(dev, "AEQ overflow!\n"); roce_set_bit(int_st, HNS_ROCE_V2_VF_INT_ST_AEQ_OVERFLOW_S, 1); roce_write(hr_dev, ROCEE_VF_ABN_INT_ST_REG, int_st); + /* Set reset level for reset_event() */ + if (ops->set_default_reset_request) + ops->set_default_reset_request(ae_dev, + HNAE3_FUNC_RESET); + if (ops->reset_event) + ops->reset_event(pdev, NULL); + roce_set_bit(int_en, HNS_ROCE_V2_VF_ABN_INT_EN_S, 1); roce_write(hr_dev, ROCEE_VF_ABN_INT_EN_REG, int_en);