diff mbox series

[resend,v3,rdma-rc,1/1] RDMA/hns: Add the process of AEQ overflow for hip08

Message ID 1546997746-50323-1-git-send-email-tanxiaofei@huawei.com (mailing list archive)
State Superseded
Headers show
Series [resend,v3,rdma-rc,1/1] RDMA/hns: Add the process of AEQ overflow for hip08 | expand

Commit Message

Xiaofei Tan Jan. 9, 2019, 1:35 a.m. UTC
AEQ overflow will be reported by hardware when too many
asynchronous events occurred but not be handled in time.
Normally, AEQ overflow error is not easy to occur. Once
happened, we have to do physical function reset to recover.
PF reset is implemented in two steps. Firstly, set reset
level with ae_dev->ops->set_default_reset_request.
Secondly, run reset with ae_dev->ops->reset_event.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Jason Gunthorpe Jan. 11, 2019, 9:25 p.m. UTC | #1
On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote:
> AEQ overflow will be reported by hardware when too many
> asynchronous events occurred but not be handled in time.
> Normally, AEQ overflow error is not easy to occur. Once
> happened, we have to do physical function reset to recover.
> PF reset is implemented in two steps. Firstly, set reset
> level with ae_dev->ops->set_default_reset_request.
> Secondly, run reset with ae_dev->ops->reset_event.
> 
> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
> Signed-off-by: Yixian Liu <liuyixian@huawei.com>
> ---
>  drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)

Why should this be a -rc patch? It doesn't look like it meets the
requires for a stable kernel, or fixing something introduced in the
merge window.

This looks like a new feature to me.

Jason
Xiaofei Tan Jan. 12, 2019, 12:21 p.m. UTC | #2
Hi Jason,

On 2019/1/12 5:25, Jason Gunthorpe wrote:
> On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote:
>> AEQ overflow will be reported by hardware when too many
>> asynchronous events occurred but not be handled in time.
>> Normally, AEQ overflow error is not easy to occur. Once
>> happened, we have to do physical function reset to recover.
>> PF reset is implemented in two steps. Firstly, set reset
>> level with ae_dev->ops->set_default_reset_request.
>> Secondly, run reset with ae_dev->ops->reset_event.
>>
>> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
>> Signed-off-by: Yixian Liu <liuyixian@huawei.com>
>> ---
>>  drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
> 
> Why should this be a -rc patch? It doesn't look like it meets the
> requires for a stable kernel, or fixing something introduced in the
> merge window.
> 
> This looks like a new feature to me.
> 

I think we could take this as a bug. Because the device can't continue working, if we don't handle
the AEQ overflow error. And the code is simple and doesn't bring any unstable factor.
Then i take it as -rc patch.
Jason Gunthorpe Jan. 15, 2019, 10:03 p.m. UTC | #3
On Sat, Jan 12, 2019 at 08:21:38PM +0800, tanxiaofei wrote:
> Hi Jason,
> 
> On 2019/1/12 5:25, Jason Gunthorpe wrote:
> > On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote:
> >> AEQ overflow will be reported by hardware when too many
> >> asynchronous events occurred but not be handled in time.
> >> Normally, AEQ overflow error is not easy to occur. Once
> >> happened, we have to do physical function reset to recover.
> >> PF reset is implemented in two steps. Firstly, set reset
> >> level with ae_dev->ops->set_default_reset_request.
> >> Secondly, run reset with ae_dev->ops->reset_event.
> >>
> >> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
> >> Signed-off-by: Yixian Liu <liuyixian@huawei.com>
> >>  drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++
> >>  1 file changed, 11 insertions(+)
> > 
> > Why should this be a -rc patch? It doesn't look like it meets the
> > requires for a stable kernel, or fixing something introduced in the
> > merge window.
> > 
> > This looks like a new feature to me.
>
> I think we could take this as a bug. Because the device can't
> continue working, if we don't handle the AEQ overflow error. And the
> code is simple and doesn't bring any unstable factor.  Then i take
> it as -rc patch.

recovery from hardware failures is a new feature.

If this is fixing a bug in the existing recovery, or is a major
problem that many users are hitting, then explain in the commit
comment.

Jason
Xiaofei Tan Jan. 16, 2019, 12:32 p.m. UTC | #4
On 2019/1/16 6:03, Jason Gunthorpe wrote:
> On Sat, Jan 12, 2019 at 08:21:38PM +0800, tanxiaofei wrote:
>> Hi Jason,
>>
>> On 2019/1/12 5:25, Jason Gunthorpe wrote:
>>> On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote:
>>>> AEQ overflow will be reported by hardware when too many
>>>> asynchronous events occurred but not be handled in time.
>>>> Normally, AEQ overflow error is not easy to occur. Once
>>>> happened, we have to do physical function reset to recover.
>>>> PF reset is implemented in two steps. Firstly, set reset
>>>> level with ae_dev->ops->set_default_reset_request.
>>>> Secondly, run reset with ae_dev->ops->reset_event.
>>>>
>>>> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
>>>> Signed-off-by: Yixian Liu <liuyixian@huawei.com>
>>>>  drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++
>>>>  1 file changed, 11 insertions(+)
>>>
>>> Why should this be a -rc patch? It doesn't look like it meets the
>>> requires for a stable kernel, or fixing something introduced in the
>>> merge window.
>>>
>>> This looks like a new feature to me.
>>
>> I think we could take this as a bug. Because the device can't
>> continue working, if we don't handle the AEQ overflow error. And the
>> code is simple and doesn't bring any unstable factor.  Then i take
>> it as -rc patch.
> 
> recovery from hardware failures is a new feature.
> 
> If this is fixing a bug in the existing recovery, or is a major
> problem that many users are hitting, then explain in the commit
> comment.
> 
Hi Jason,

Thank you very much.
How about change the commit log as following?

RDMA/hns: fix the issue of handling AEQ overflow for hip08

AEQ overflow will be reported by hardware when too many
asynchronous events occurred but not be handled in time.

Currently, we just do the print when AEQ overflow happened.
This is not right. We should also do physical function reset,
otherwise, the device can't continue working.

PF reset is implemented in two steps. Firstly, set reset
level with ae_dev->ops->set_default_reset_request.
Secondly, run reset with ae_dev->ops->reset_event.

Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Yixian Liu <liuyixian@huawei.com>

> Jason
> 
>
Jason Gunthorpe Jan. 18, 2019, 7:58 p.m. UTC | #5
On Wed, Jan 16, 2019 at 08:32:30PM +0800, tanxiaofei wrote:
> 
> On 2019/1/16 6:03, Jason Gunthorpe wrote:
> > On Sat, Jan 12, 2019 at 08:21:38PM +0800, tanxiaofei wrote:
> >> Hi Jason,
> >>
> >> On 2019/1/12 5:25, Jason Gunthorpe wrote:
> >>> On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote:
> >>>> AEQ overflow will be reported by hardware when too many
> >>>> asynchronous events occurred but not be handled in time.
> >>>> Normally, AEQ overflow error is not easy to occur. Once
> >>>> happened, we have to do physical function reset to recover.
> >>>> PF reset is implemented in two steps. Firstly, set reset
> >>>> level with ae_dev->ops->set_default_reset_request.
> >>>> Secondly, run reset with ae_dev->ops->reset_event.
> >>>>
> >>>> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
> >>>> Signed-off-by: Yixian Liu <liuyixian@huawei.com>
> >>>>  drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++
> >>>>  1 file changed, 11 insertions(+)
> >>>
> >>> Why should this be a -rc patch? It doesn't look like it meets the
> >>> requires for a stable kernel, or fixing something introduced in the
> >>> merge window.
> >>>
> >>> This looks like a new feature to me.
> >>
> >> I think we could take this as a bug. Because the device can't
> >> continue working, if we don't handle the AEQ overflow error. And the
> >> code is simple and doesn't bring any unstable factor.  Then i take
> >> it as -rc patch.
> > 
> > recovery from hardware failures is a new feature.
> > 
> > If this is fixing a bug in the existing recovery, or is a major
> > problem that many users are hitting, then explain in the commit
> > comment.
> > 
> Hi Jason,
> 
> Thank you very much.
> How about change the commit log as following?
> 
> RDMA/hns: fix the issue of handling AEQ overflow for hip08
> 
> AEQ overflow will be reported by hardware when too many
> asynchronous events occurred but not be handled in time.
> 
> Currently, we just do the print when AEQ overflow happened.
> This is not right. We should also do physical function reset,
> otherwise, the device can't continue working.
> 
> PF reset is implemented in two steps. Firstly, set reset
> level with ae_dev->ops->set_default_reset_request.
> Secondly, run reset with ae_dev->ops->reset_event.

This still doesn't explain why this is for -rc.

New features are not for the -rc tree.

You'd need to explain how/why that this is a widespread a critical
problem in your user base

Jason
Xiaofei Tan Jan. 19, 2019, 6:16 a.m. UTC | #6
Hi Jason,

On 2019/1/19 3:58, Jason Gunthorpe wrote:
> On Wed, Jan 16, 2019 at 08:32:30PM +0800, tanxiaofei wrote:
>>
>> On 2019/1/16 6:03, Jason Gunthorpe wrote:
>>> On Sat, Jan 12, 2019 at 08:21:38PM +0800, tanxiaofei wrote:
>>>> Hi Jason,
>>>>
>>>> On 2019/1/12 5:25, Jason Gunthorpe wrote:
>>>>> On Wed, Jan 09, 2019 at 09:35:46AM +0800, Xiaofei Tan wrote:
>>>>>> AEQ overflow will be reported by hardware when too many
>>>>>> asynchronous events occurred but not be handled in time.
>>>>>> Normally, AEQ overflow error is not easy to occur. Once
>>>>>> happened, we have to do physical function reset to recover.
>>>>>> PF reset is implemented in two steps. Firstly, set reset
>>>>>> level with ae_dev->ops->set_default_reset_request.
>>>>>> Secondly, run reset with ae_dev->ops->reset_event.
>>>>>>
>>>>>> Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
>>>>>> Signed-off-by: Yixian Liu <liuyixian@huawei.com>
>>>>>>  drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 11 +++++++++++
>>>>>>  1 file changed, 11 insertions(+)
>>>>>
>>>>> Why should this be a -rc patch? It doesn't look like it meets the
>>>>> requires for a stable kernel, or fixing something introduced in the
>>>>> merge window.
>>>>>
>>>>> This looks like a new feature to me.
>>>>
>>>> I think we could take this as a bug. Because the device can't
>>>> continue working, if we don't handle the AEQ overflow error. And the
>>>> code is simple and doesn't bring any unstable factor.  Then i take
>>>> it as -rc patch.
>>>
>>> recovery from hardware failures is a new feature.
>>>
>>> If this is fixing a bug in the existing recovery, or is a major
>>> problem that many users are hitting, then explain in the commit
>>> comment.
>>>
>> Hi Jason,
>>
>> Thank you very much.
>> How about change the commit log as following?
>>
>> RDMA/hns: fix the issue of handling AEQ overflow for hip08
>>
>> AEQ overflow will be reported by hardware when too many
>> asynchronous events occurred but not be handled in time.
>>
>> Currently, we just do the print when AEQ overflow happened.
>> This is not right. We should also do physical function reset,
>> otherwise, the device can't continue working.
>>
>> PF reset is implemented in two steps. Firstly, set reset
>> level with ae_dev->ops->set_default_reset_request.
>> Secondly, run reset with ae_dev->ops->reset_event.
> 
> This still doesn't explain why this is for -rc.
> 
> New features are not for the -rc tree.
> 
> You'd need to explain how/why that this is a widespread a critical
> problem in your user base
> 

Maybe this patch is not suitable for the -rc branch.
Hmm, I will resend this patch based on -next. Many thanks.

> Jason
> 
> .
>
diff mbox series

Patch

diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index 3a66945..317539a 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -4692,11 +4692,22 @@  static irqreturn_t hns_roce_v2_msix_interrupt_abn(int irq, void *dev_id)
 	int_en = roce_read(hr_dev, ROCEE_VF_ABN_INT_EN_REG);
 
 	if (roce_get_bit(int_st, HNS_ROCE_V2_VF_INT_ST_AEQ_OVERFLOW_S)) {
+		struct pci_dev *pdev = hr_dev->pci_dev;
+		struct hnae3_ae_dev *ae_dev = pci_get_drvdata(pdev);
+		const struct hnae3_ae_ops *ops = ae_dev->ops;
+
 		dev_err(dev, "AEQ overflow!\n");
 
 		roce_set_bit(int_st, HNS_ROCE_V2_VF_INT_ST_AEQ_OVERFLOW_S, 1);
 		roce_write(hr_dev, ROCEE_VF_ABN_INT_ST_REG, int_st);
 
+		/* Set reset level for reset_event() */
+		if (ops->set_default_reset_request)
+			ops->set_default_reset_request(ae_dev,
+						       HNAE3_FUNC_RESET);
+		if (ops->reset_event)
+			ops->reset_event(pdev, NULL);
+
 		roce_set_bit(int_en, HNS_ROCE_V2_VF_ABN_INT_EN_S, 1);
 		roce_write(hr_dev, ROCEE_VF_ABN_INT_EN_REG, int_en);