diff mbox series

[net-next,2/2] net: hns3: add vf fault process in hns3 ras

Message ID 20230113020829.48451-3-lanhao@huawei.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series add some vf fault detect patch for hns | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1 this patch: 1
netdev/cc_maintainers warning 3 maintainers not CCed: yuehaibing@huawei.com huangguangbin2@huawei.com lipeng321@huawei.com
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5 this patch: 5
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 210 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Hao Lan Jan. 13, 2023, 2:08 a.m. UTC
From: Jie Wang <wangjie125@huawei.com>

Currently hns3 driver supports vf fault detect feature. Several ras caused
by VF resources don't need to do PF function reset for recovery. The driver
only needs to reset the specified VF.

So this patch adds process in ras module. New process will get detailed
information about ras and do the most correct measures based on these
accurate information.

Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Hao Lan <lanhao@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
 .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
 .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
 .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
 .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
 .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
 6 files changed, 115 insertions(+), 6 deletions(-)

Comments

Leon Romanovsky Jan. 13, 2023, 6:51 a.m. UTC | #1
On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
> From: Jie Wang <wangjie125@huawei.com>
> 
> Currently hns3 driver supports vf fault detect feature. Several ras caused
> by VF resources don't need to do PF function reset for recovery. The driver
> only needs to reset the specified VF.
> 
> So this patch adds process in ras module. New process will get detailed
> information about ras and do the most correct measures based on these
> accurate information.
> 
> Signed-off-by: Jie Wang <wangjie125@huawei.com>
> Signed-off-by: Hao Lan <lanhao@huawei.com>
> ---
>  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
>  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
>  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
>  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
>  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
>  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
>  6 files changed, 115 insertions(+), 6 deletions(-)

Why is it good idea to reset VF from PF?
What will happen with driver bound to this VF?
Shouldn't PCI recovery handle it?

Thanks
wangjie (L) Jan. 17, 2023, 7:04 a.m. UTC | #2
On 2023/1/13 14:51, Leon Romanovsky wrote:
> On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
>> From: Jie Wang <wangjie125@huawei.com>
>>
>> Currently hns3 driver supports vf fault detect feature. Several ras caused
>> by VF resources don't need to do PF function reset for recovery. The driver
>> only needs to reset the specified VF.
>>
>> So this patch adds process in ras module. New process will get detailed
>> information about ras and do the most correct measures based on these
>> accurate information.
>>
>> Signed-off-by: Jie Wang <wangjie125@huawei.com>
>> Signed-off-by: Hao Lan <lanhao@huawei.com>
>> ---
>>  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
>>  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
>>  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
>>  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
>>  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
>>  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
>>  6 files changed, 115 insertions(+), 6 deletions(-)
>
> Why is it good idea to reset VF from PF?
> What will happen with driver bound to this VF?
> Shouldn't PCI recovery handle it?
>
> Thanks
> .
PF doesn't reset VF directly. These VF faults are detected by hardware,
and only reported to PF. PF get the VF id from firmware, then notify the 
VF that it needs reset. VF will do reset after receive the request.

These hardware faults are not standard PCI ras event, so we prefer to 
use MSIx path.
>
Paolo Abeni Jan. 17, 2023, 8:04 a.m. UTC | #3
Hello,

On Fri, 2023-01-13 at 10:08 +0800, Hao Lan wrote:
[...]

> +static void hclge_get_vf_fault_bitmap(struct hclge_desc *desc,
> +				      unsigned long *bitmap)
> +{
> +#define HCLGE_FIR_FAULT_BYTES	24
> +#define HCLGE_SEC_FAULT_BYTES	8
> +
> +	u8 *buff;
> +
> +	memcpy(bitmap, desc[0].data, HCLGE_FIR_FAULT_BYTES);
> +	buff = (u8 *)bitmap + HCLGE_FIR_FAULT_BYTES;
> +	memcpy(buff, desc[1].data, HCLGE_SEC_FAULT_BYTES);
> +}

The above works under the assumption that:

	HCLGE_FIR_FAULT_BYTES + HCLGE_SEC_FAULT_BYTES == BITS_TO_BYTES(HCLGE_VPORT_NUM)

I think it's better to enforce such condition at build time with a
BUILD_BUG_ON(), to avoid future issues.

Also I think Leon still deserve a reply to one of his questions,
specifically: What will happen (at recovery time) with driver bound to
this VF?

Thanks!

Paolo
Leon Romanovsky Jan. 17, 2023, 11:21 a.m. UTC | #4
On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
> 
> 
> On 2023/1/13 14:51, Leon Romanovsky wrote:
> > On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
> > > From: Jie Wang <wangjie125@huawei.com>
> > > 
> > > Currently hns3 driver supports vf fault detect feature. Several ras caused
> > > by VF resources don't need to do PF function reset for recovery. The driver
> > > only needs to reset the specified VF.
> > > 
> > > So this patch adds process in ras module. New process will get detailed
> > > information about ras and do the most correct measures based on these
> > > accurate information.
> > > 
> > > Signed-off-by: Jie Wang <wangjie125@huawei.com>
> > > Signed-off-by: Hao Lan <lanhao@huawei.com>
> > > ---
> > >  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
> > >  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
> > >  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
> > >  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
> > >  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
> > >  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
> > >  6 files changed, 115 insertions(+), 6 deletions(-)
> > 
> > Why is it good idea to reset VF from PF?
> > What will happen with driver bound to this VF?
> > Shouldn't PCI recovery handle it?
> > 
> > Thanks
> > .
> PF doesn't reset VF directly. These VF faults are detected by hardware,
> and only reported to PF. PF get the VF id from firmware, then notify the VF
> that it needs reset. VF will do reset after receive the request.

This description isn't aligned with the code. You are issuing
hclge_func_reset_cmd() command which will reset VF, while notification
are handled by hclge_func_reset_notify_vf().

It also doesn't make any sense to send notification event to VF through
FW while the goal is to recover from stuck FW in that VF.

> 
> These hardware faults are not standard PCI ras event, so we prefer to use
> MSIx path.

What is different here?

> >
wangjie (L) Jan. 18, 2023, 12:34 p.m. UTC | #5
On 2023/1/17 19:21, Leon Romanovsky wrote:
> On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
>>
>>
>> On 2023/1/13 14:51, Leon Romanovsky wrote:
>>> On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
>>>> From: Jie Wang <wangjie125@huawei.com>
>>>>
>>>> Currently hns3 driver supports vf fault detect feature. Several ras caused
>>>> by VF resources don't need to do PF function reset for recovery. The driver
>>>> only needs to reset the specified VF.
>>>>
>>>> So this patch adds process in ras module. New process will get detailed
>>>> information about ras and do the most correct measures based on these
>>>> accurate information.
>>>>
>>>> Signed-off-by: Jie Wang <wangjie125@huawei.com>
>>>> Signed-off-by: Hao Lan <lanhao@huawei.com>
>>>> ---
>>>>  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
>>>>  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
>>>>  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
>>>>  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
>>>>  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
>>>>  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
>>>>  6 files changed, 115 insertions(+), 6 deletions(-)
>>>
>>> Why is it good idea to reset VF from PF?
>>> What will happen with driver bound to this VF?
>>> Shouldn't PCI recovery handle it?
>>>
>>> Thanks
>>> .
>> PF doesn't reset VF directly. These VF faults are detected by hardware,
>> and only reported to PF. PF get the VF id from firmware, then notify the VF
>> that it needs reset. VF will do reset after receive the request.
>
> This description isn't aligned with the code. You are issuing
> hclge_func_reset_cmd() command which will reset VF, while notification
> are handled by hclge_func_reset_notify_vf().
>
> It also doesn't make any sense to send notification event to VF through
> FW while the goal is to recover from stuck FW in that VF.
>
Yes, I misunderstand the hclge_func_reset_notify_vf and
hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform
the VF for recovery. I will fix and retest it in V2.

This patch is used to recover specific vf hardware errors, for example the
tx queue configuration exceptions. It make sense in these cases for the
firmware is still working properly and can do the recovery rightly.
>>
>> These hardware faults are not standard PCI ras event, so we prefer to use
>> MSIx path.
>
> What is different here?
>
These hardware faults are reported by MSIx interrupts instead of PCI ras
path.

Thanks!
>>>
> .
>
wangjie (L) Jan. 18, 2023, 12:36 p.m. UTC | #6
On 2023/1/17 16:04, Paolo Abeni wrote:
> Hello,
>
> On Fri, 2023-01-13 at 10:08 +0800, Hao Lan wrote:
> [...]
>
>> +static void hclge_get_vf_fault_bitmap(struct hclge_desc *desc,
>> +				      unsigned long *bitmap)
>> +{
>> +#define HCLGE_FIR_FAULT_BYTES	24
>> +#define HCLGE_SEC_FAULT_BYTES	8
>> +
>> +	u8 *buff;
>> +
>> +	memcpy(bitmap, desc[0].data, HCLGE_FIR_FAULT_BYTES);
>> +	buff = (u8 *)bitmap + HCLGE_FIR_FAULT_BYTES;
>> +	memcpy(buff, desc[1].data, HCLGE_SEC_FAULT_BYTES);
>> +}
>
> The above works under the assumption that:
>
> 	HCLGE_FIR_FAULT_BYTES + HCLGE_SEC_FAULT_BYTES == BITS_TO_BYTES(HCLGE_VPORT_NUM)
>
> I think it's better to enforce such condition at build time with a
> BUILD_BUG_ON(), to avoid future issues.
>
> Also I think Leon still deserve a reply to one of his questions,
> specifically: What will happen (at recovery time) with driver bound to
> this VF?
>
> Thanks!
>
> Paolo
>
As discussed in reply to Leon, it missed to notify VF driver to prepare for
the reset in V1, So I will add notify to VF driver in v2 to make sure the
VF driver will init to normal at recovery time.

Thanks!
> .
>
wangjie (L) Jan. 18, 2023, 12:36 p.m. UTC | #7
On 2023/1/17 16:04, Paolo Abeni wrote:
> Hello,
>
> On Fri, 2023-01-13 at 10:08 +0800, Hao Lan wrote:
> [...]
>
>> +static void hclge_get_vf_fault_bitmap(struct hclge_desc *desc,
>> +				      unsigned long *bitmap)
>> +{
>> +#define HCLGE_FIR_FAULT_BYTES	24
>> +#define HCLGE_SEC_FAULT_BYTES	8
>> +
>> +	u8 *buff;
>> +
>> +	memcpy(bitmap, desc[0].data, HCLGE_FIR_FAULT_BYTES);
>> +	buff = (u8 *)bitmap + HCLGE_FIR_FAULT_BYTES;
>> +	memcpy(buff, desc[1].data, HCLGE_SEC_FAULT_BYTES);
>> +}
>
> The above works under the assumption that:
>
> 	HCLGE_FIR_FAULT_BYTES + HCLGE_SEC_FAULT_BYTES == BITS_TO_BYTES(HCLGE_VPORT_NUM)
>
> I think it's better to enforce such condition at build time with a
> BUILD_BUG_ON(), to avoid future issues.
>

OK, I will add it in v2

> Also I think Leon still deserve a reply to one of his questions,
> specifically: What will happen (at recovery time) with driver bound to
> this VF?
>
> Thanks!
>
> Paolo
>
> .
>
Leon Romanovsky Jan. 20, 2023, 5:12 p.m. UTC | #8
On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote:
> 
> 
> On 2023/1/17 19:21, Leon Romanovsky wrote:
> > On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
> > > 
> > > 
> > > On 2023/1/13 14:51, Leon Romanovsky wrote:
> > > > On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
> > > > > From: Jie Wang <wangjie125@huawei.com>
> > > > > 
> > > > > Currently hns3 driver supports vf fault detect feature. Several ras caused
> > > > > by VF resources don't need to do PF function reset for recovery. The driver
> > > > > only needs to reset the specified VF.
> > > > > 
> > > > > So this patch adds process in ras module. New process will get detailed
> > > > > information about ras and do the most correct measures based on these
> > > > > accurate information.
> > > > > 
> > > > > Signed-off-by: Jie Wang <wangjie125@huawei.com>
> > > > > Signed-off-by: Hao Lan <lanhao@huawei.com>
> > > > > ---
> > > > >  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
> > > > >  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
> > > > >  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
> > > > >  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
> > > > >  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
> > > > >  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
> > > > >  6 files changed, 115 insertions(+), 6 deletions(-)
> > > > 
> > > > Why is it good idea to reset VF from PF?
> > > > What will happen with driver bound to this VF?
> > > > Shouldn't PCI recovery handle it?
> > > > 
> > > > Thanks
> > > > .
> > > PF doesn't reset VF directly. These VF faults are detected by hardware,
> > > and only reported to PF. PF get the VF id from firmware, then notify the VF
> > > that it needs reset. VF will do reset after receive the request.
> > 
> > This description isn't aligned with the code. You are issuing
> > hclge_func_reset_cmd() command which will reset VF, while notification
> > are handled by hclge_func_reset_notify_vf().
> > 
> > It also doesn't make any sense to send notification event to VF through
> > FW while the goal is to recover from stuck FW in that VF.
> > 
> Yes, I misunderstand the hclge_func_reset_notify_vf and
> hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform
> the VF for recovery. I will fix and retest it in V2.
> 
> This patch is used to recover specific vf hardware errors, for example the
> tx queue configuration exceptions. It make sense in these cases for the
> firmware is still working properly and can do the recovery rightly.

If FW is operational and knows about failure, why can't FW do recovery
internally to that VF without PF involvement?

Thanks
wangjie (L) Jan. 31, 2023, 12:04 p.m. UTC | #9
On 2023/1/21 1:12, Leon Romanovsky wrote:
> On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote:
>>
>>
>> On 2023/1/17 19:21, Leon Romanovsky wrote:
>>> On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
>>>>
>>>>
>>>> On 2023/1/13 14:51, Leon Romanovsky wrote:
>>>>> On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
>>>>>> From: Jie Wang <wangjie125@huawei.com>
>>>>>>
>>>>>> Currently hns3 driver supports vf fault detect feature. Several ras caused
>>>>>> by VF resources don't need to do PF function reset for recovery. The driver
>>>>>> only needs to reset the specified VF.
>>>>>>
>>>>>> So this patch adds process in ras module. New process will get detailed
>>>>>> information about ras and do the most correct measures based on these
>>>>>> accurate information.
>>>>>>
>>>>>> Signed-off-by: Jie Wang <wangjie125@huawei.com>
>>>>>> Signed-off-by: Hao Lan <lanhao@huawei.com>
>>>>>> ---
>>>>>>  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
>>>>>>  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
>>>>>>  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
>>>>>>  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
>>>>>>  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
>>>>>>  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
>>>>>>  6 files changed, 115 insertions(+), 6 deletions(-)
>>>>>
>>>>> Why is it good idea to reset VF from PF?
>>>>> What will happen with driver bound to this VF?
>>>>> Shouldn't PCI recovery handle it?
>>>>>
>>>>> Thanks
>>>>> .
>>>> PF doesn't reset VF directly. These VF faults are detected by hardware,
>>>> and only reported to PF. PF get the VF id from firmware, then notify the VF
>>>> that it needs reset. VF will do reset after receive the request.
>>>
>>> This description isn't aligned with the code. You are issuing
>>> hclge_func_reset_cmd() command which will reset VF, while notification
>>> are handled by hclge_func_reset_notify_vf().
>>>
>>> It also doesn't make any sense to send notification event to VF through
>>> FW while the goal is to recover from stuck FW in that VF.
>>>
>> Yes, I misunderstand the hclge_func_reset_notify_vf and
>> hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform
>> the VF for recovery. I will fix and retest it in V2.
>>
>> This patch is used to recover specific vf hardware errors, for example the
>> tx queue configuration exceptions. It make sense in these cases for the
>> firmware is still working properly and can do the recovery rightly.
>
> If FW is operational and knows about failure, why can't FW do recovery
> internally to that VF without PF involvement?
I'm sorry to reply so late because I took a vacation. If firmware reset VF
hardware directly without notify the running VF driver, it will cause VF
driver works abnormal.

Thanks
>
> Thanks
> .
>
Leon Romanovsky Jan. 31, 2023, 1:24 p.m. UTC | #10
On Tue, Jan 31, 2023 at 08:04:14PM +0800, wangjie (L) wrote:
> 
> 
> On 2023/1/21 1:12, Leon Romanovsky wrote:
> > On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote:
> > > 
> > > 
> > > On 2023/1/17 19:21, Leon Romanovsky wrote:
> > > > On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
> > > > > 
> > > > > 
> > > > > On 2023/1/13 14:51, Leon Romanovsky wrote:
> > > > > > On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
> > > > > > > From: Jie Wang <wangjie125@huawei.com>
> > > > > > > 
> > > > > > > Currently hns3 driver supports vf fault detect feature. Several ras caused
> > > > > > > by VF resources don't need to do PF function reset for recovery. The driver
> > > > > > > only needs to reset the specified VF.
> > > > > > > 
> > > > > > > So this patch adds process in ras module. New process will get detailed
> > > > > > > information about ras and do the most correct measures based on these
> > > > > > > accurate information.
> > > > > > > 
> > > > > > > Signed-off-by: Jie Wang <wangjie125@huawei.com>
> > > > > > > Signed-off-by: Hao Lan <lanhao@huawei.com>
> > > > > > > ---
> > > > > > >  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
> > > > > > >  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
> > > > > > >  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
> > > > > > >  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
> > > > > > >  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
> > > > > > >  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
> > > > > > >  6 files changed, 115 insertions(+), 6 deletions(-)
> > > > > > 
> > > > > > Why is it good idea to reset VF from PF?
> > > > > > What will happen with driver bound to this VF?
> > > > > > Shouldn't PCI recovery handle it?
> > > > > > 
> > > > > > Thanks
> > > > > > .
> > > > > PF doesn't reset VF directly. These VF faults are detected by hardware,
> > > > > and only reported to PF. PF get the VF id from firmware, then notify the VF
> > > > > that it needs reset. VF will do reset after receive the request.
> > > > 
> > > > This description isn't aligned with the code. You are issuing
> > > > hclge_func_reset_cmd() command which will reset VF, while notification
> > > > are handled by hclge_func_reset_notify_vf().
> > > > 
> > > > It also doesn't make any sense to send notification event to VF through
> > > > FW while the goal is to recover from stuck FW in that VF.
> > > > 
> > > Yes, I misunderstand the hclge_func_reset_notify_vf and
> > > hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform
> > > the VF for recovery. I will fix and retest it in V2.
> > > 
> > > This patch is used to recover specific vf hardware errors, for example the
> > > tx queue configuration exceptions. It make sense in these cases for the
> > > firmware is still working properly and can do the recovery rightly.
> > 
> > If FW is operational and knows about failure, why can't FW do recovery
> > internally to that VF without PF involvement?
> I'm sorry to reply so late because I took a vacation. If firmware reset VF
> hardware directly without notify the running VF driver, it will cause VF
> driver works abnormal.

mlx5 health recovery code proves that it is possible to do.
Even in your case, FW can notify VF without PF in the middle.

Thanks

> 
> Thanks
> > 
> > Thanks
> > .
> >
wangjie (L) Feb. 2, 2023, 1:08 p.m. UTC | #11
On 2023/1/31 21:24, Leon Romanovsky wrote:
> On Tue, Jan 31, 2023 at 08:04:14PM +0800, wangjie (L) wrote:
>>
>>
>> On 2023/1/21 1:12, Leon Romanovsky wrote:
>>> On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote:
>>>>
>>>>
>>>> On 2023/1/17 19:21, Leon Romanovsky wrote:
>>>>> On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
>>>>>>
>>>>>>
>>>>>> On 2023/1/13 14:51, Leon Romanovsky wrote:
>>>>>>> On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
>>>>>>>> From: Jie Wang <wangjie125@huawei.com>
>>>>>>>>
>>>>>>>> Currently hns3 driver supports vf fault detect feature. Several ras caused
>>>>>>>> by VF resources don't need to do PF function reset for recovery. The driver
>>>>>>>> only needs to reset the specified VF.
>>>>>>>>
>>>>>>>> So this patch adds process in ras module. New process will get detailed
>>>>>>>> information about ras and do the most correct measures based on these
>>>>>>>> accurate information.
>>>>>>>>
>>>>>>>> Signed-off-by: Jie Wang <wangjie125@huawei.com>
>>>>>>>> Signed-off-by: Hao Lan <lanhao@huawei.com>
>>>>>>>> ---
>>>>>>>>  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
>>>>>>>>  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
>>>>>>>>  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
>>>>>>>>  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
>>>>>>>>  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
>>>>>>>>  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
>>>>>>>>  6 files changed, 115 insertions(+), 6 deletions(-)
>>>>>>>
>>>>>>> Why is it good idea to reset VF from PF?
>>>>>>> What will happen with driver bound to this VF?
>>>>>>> Shouldn't PCI recovery handle it?
>>>>>>>
>>>>>>> Thanks
>>>>>>> .
>>>>>> PF doesn't reset VF directly. These VF faults are detected by hardware,
>>>>>> and only reported to PF. PF get the VF id from firmware, then notify the VF
>>>>>> that it needs reset. VF will do reset after receive the request.
>>>>>
>>>>> This description isn't aligned with the code. You are issuing
>>>>> hclge_func_reset_cmd() command which will reset VF, while notification
>>>>> are handled by hclge_func_reset_notify_vf().
>>>>>
>>>>> It also doesn't make any sense to send notification event to VF through
>>>>> FW while the goal is to recover from stuck FW in that VF.
>>>>>
>>>> Yes, I misunderstand the hclge_func_reset_notify_vf and
>>>> hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform
>>>> the VF for recovery. I will fix and retest it in V2.
>>>>
>>>> This patch is used to recover specific vf hardware errors, for example the
>>>> tx queue configuration exceptions. It make sense in these cases for the
>>>> firmware is still working properly and can do the recovery rightly.
>>>
>>> If FW is operational and knows about failure, why can't FW do recovery
>>> internally to that VF without PF involvement?
>> I'm sorry to reply so late because I took a vacation. If firmware reset VF
>> hardware directly without notify the running VF driver, it will cause VF
>> driver works abnormal.
>
> mlx5 health recovery code proves that it is possible to do.
> Even in your case, FW can notify VF without PF in the middle.
>
> Thanks
>
These faults only report to PF in hns3 devices, even if devlink health is
used in hns3 driver, these faults also need to report to PF.

Thanks
>>
>> Thanks
>>>
>>> Thanks
>>> .
>>>
> .
>
Leon Romanovsky Feb. 6, 2023, 1:36 p.m. UTC | #12
On Thu, Feb 02, 2023 at 09:08:37PM +0800, wangjie (L) wrote:
> 
> 
> On 2023/1/31 21:24, Leon Romanovsky wrote:
> > On Tue, Jan 31, 2023 at 08:04:14PM +0800, wangjie (L) wrote:
> > > 
> > > 
> > > On 2023/1/21 1:12, Leon Romanovsky wrote:
> > > > On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote:
> > > > > 
> > > > > 
> > > > > On 2023/1/17 19:21, Leon Romanovsky wrote:
> > > > > > On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 2023/1/13 14:51, Leon Romanovsky wrote:
> > > > > > > > On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote:
> > > > > > > > > From: Jie Wang <wangjie125@huawei.com>
> > > > > > > > > 
> > > > > > > > > Currently hns3 driver supports vf fault detect feature. Several ras caused
> > > > > > > > > by VF resources don't need to do PF function reset for recovery. The driver
> > > > > > > > > only needs to reset the specified VF.
> > > > > > > > > 
> > > > > > > > > So this patch adds process in ras module. New process will get detailed
> > > > > > > > > information about ras and do the most correct measures based on these
> > > > > > > > > accurate information.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Jie Wang <wangjie125@huawei.com>
> > > > > > > > > Signed-off-by: Hao Lan <lanhao@huawei.com>
> > > > > > > > > ---
> > > > > > > > >  drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   1 +
> > > > > > > > >  .../hns3/hns3_common/hclge_comm_cmd.h         |   1 +
> > > > > > > > >  .../hisilicon/hns3/hns3pf/hclge_err.c         | 113 +++++++++++++++++-
> > > > > > > > >  .../hisilicon/hns3/hns3pf/hclge_err.h         |   2 +
> > > > > > > > >  .../hisilicon/hns3/hns3pf/hclge_main.c        |   3 +-
> > > > > > > > >  .../hisilicon/hns3/hns3pf/hclge_main.h        |   1 +
> > > > > > > > >  6 files changed, 115 insertions(+), 6 deletions(-)
> > > > > > > > 
> > > > > > > > Why is it good idea to reset VF from PF?
> > > > > > > > What will happen with driver bound to this VF?
> > > > > > > > Shouldn't PCI recovery handle it?
> > > > > > > > 
> > > > > > > > Thanks
> > > > > > > > .
> > > > > > > PF doesn't reset VF directly. These VF faults are detected by hardware,
> > > > > > > and only reported to PF. PF get the VF id from firmware, then notify the VF
> > > > > > > that it needs reset. VF will do reset after receive the request.
> > > > > > 
> > > > > > This description isn't aligned with the code. You are issuing
> > > > > > hclge_func_reset_cmd() command which will reset VF, while notification
> > > > > > are handled by hclge_func_reset_notify_vf().
> > > > > > 
> > > > > > It also doesn't make any sense to send notification event to VF through
> > > > > > FW while the goal is to recover from stuck FW in that VF.
> > > > > > 
> > > > > Yes, I misunderstand the hclge_func_reset_notify_vf and
> > > > > hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform
> > > > > the VF for recovery. I will fix and retest it in V2.
> > > > > 
> > > > > This patch is used to recover specific vf hardware errors, for example the
> > > > > tx queue configuration exceptions. It make sense in these cases for the
> > > > > firmware is still working properly and can do the recovery rightly.
> > > > 
> > > > If FW is operational and knows about failure, why can't FW do recovery
> > > > internally to that VF without PF involvement?
> > > I'm sorry to reply so late because I took a vacation. If firmware reset VF
> > > hardware directly without notify the running VF driver, it will cause VF
> > > driver works abnormal.
> > 
> > mlx5 health recovery code proves that it is possible to do.
> > Even in your case, FW can notify VF without PF in the middle.
> > 
> > Thanks
> > 
> These faults only report to PF in hns3 devices, even if devlink health is
> used in hns3 driver, these faults also need to report to PF.

I don't think that it is correct, but let's see your v2 without VF reset
and with PF notifications.

Thanks

> 
> Thanks
> > > 
> > > Thanks
> > > > 
> > > > Thanks
> > > > .
> > > > 
> > .
> >
diff mbox series

Patch

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index cd85c360335d..76a6230ccfee 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -265,6 +265,7 @@  enum hnae3_reset_type {
 	HNAE3_GLOBAL_RESET,
 	HNAE3_IMP_RESET,
 	HNAE3_NONE_RESET,
+	HNAE3_VF_EXP_RESET,
 	HNAE3_MAX_RESET,
 };
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h
index ca3692dc2848..8f823cdc0543 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h
@@ -92,6 +92,7 @@  enum hclge_opcode_type {
 	HCLGE_OPC_DFX_SSU_REG_2		= 0x004F,
 
 	HCLGE_OPC_QUERY_DEV_SPECS	= 0x0050,
+	HCLGE_OPC_GET_QUEUE_ERR_VF      = 0x0067,
 
 	/* MAC command */
 	HCLGE_OPC_CONFIG_MAC_MODE	= 0x0301,
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
index 6efd768cc07c..cf8f3882304a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
@@ -1299,10 +1299,12 @@  static const struct hclge_hw_type_id hclge_hw_type_id_st[] = {
 		.msg = "tqp_int_ecc_error"
 	}, {
 		.type_id = PF_ABNORMAL_INT_ERROR,
-		.msg = "pf_abnormal_int_error"
+		.msg = "pf_abnormal_int_error",
+		.cause_by_vf = true
 	}, {
 		.type_id = MPF_ABNORMAL_INT_ERROR,
-		.msg = "mpf_abnormal_int_error"
+		.msg = "mpf_abnormal_int_error",
+		.cause_by_vf = true
 	}, {
 		.type_id = COMMON_ERROR,
 		.msg = "common_error"
@@ -2757,7 +2759,7 @@  void hclge_handle_occurred_error(struct hclge_dev *hdev)
 		hclge_handle_error_info_log(ae_dev);
 }
 
-static void
+static bool
 hclge_handle_error_type_reg_log(struct device *dev,
 				struct hclge_mod_err_info *mod_info,
 				struct hclge_type_reg_err_info *type_reg_info)
@@ -2768,6 +2770,7 @@  hclge_handle_error_type_reg_log(struct device *dev,
 	u8 mod_id, total_module, type_id, total_type, i, is_ras;
 	u8 index_module = MODULE_NONE;
 	u8 index_type = NONE_ERROR;
+	bool cause_by_vf = false;
 
 	mod_id = mod_info->mod_id;
 	type_id = type_reg_info->type_id & HCLGE_ERR_TYPE_MASK;
@@ -2786,6 +2789,7 @@  hclge_handle_error_type_reg_log(struct device *dev,
 	for (i = 0; i < total_type; i++) {
 		if (type_id == hclge_hw_type_id_st[i].type_id) {
 			index_type = i;
+			cause_by_vf = hclge_hw_type_id_st[i].cause_by_vf;
 			break;
 		}
 	}
@@ -2803,6 +2807,8 @@  hclge_handle_error_type_reg_log(struct device *dev,
 	dev_err(dev, "reg_value:\n");
 	for (i = 0; i < type_reg_info->reg_num; i++)
 		dev_err(dev, "0x%08x\n", type_reg_info->hclge_reg[i]);
+
+	return cause_by_vf;
 }
 
 static void hclge_handle_error_module_log(struct hnae3_ae_dev *ae_dev,
@@ -2813,6 +2819,7 @@  static void hclge_handle_error_module_log(struct hnae3_ae_dev *ae_dev,
 	struct device *dev = &hdev->pdev->dev;
 	struct hclge_mod_err_info *mod_info;
 	struct hclge_sum_err_info *sum_info;
+	bool cause_by_vf = false;
 	u8 mod_num, err_num, i;
 	u32 offset = 0;
 
@@ -2841,12 +2848,16 @@  static void hclge_handle_error_module_log(struct hnae3_ae_dev *ae_dev,
 
 			type_reg_info = (struct hclge_type_reg_err_info *)
 					    &buf[offset++];
-			hclge_handle_error_type_reg_log(dev, mod_info,
-							type_reg_info);
+			if (hclge_handle_error_type_reg_log(dev, mod_info,
+							    type_reg_info))
+				cause_by_vf = true;
 
 			offset += type_reg_info->reg_num;
 		}
 	}
+
+	if (hnae3_ae_dev_vf_fault_supported(hdev->ae_dev) && cause_by_vf)
+		set_bit(HNAE3_VF_EXP_RESET, &ae_dev->hw_err_reset_req);
 }
 
 static int hclge_query_all_err_bd_num(struct hclge_dev *hdev, u32 *bd_num)
@@ -2938,3 +2949,95 @@  int hclge_handle_error_info_log(struct hnae3_ae_dev *ae_dev)
 out:
 	return ret;
 }
+
+static bool hclge_reset_vf_in_bitmap(struct hclge_dev *hdev,
+				     unsigned long *bitmap)
+{
+	struct hclge_vport *vport;
+	bool exist_set = false;
+	int func_id;
+	int ret;
+
+	func_id = find_first_bit(bitmap, HCLGE_VPORT_NUM);
+	if (func_id == PF_VPORT_ID)
+		return false;
+
+	while (func_id != HCLGE_VPORT_NUM) {
+		vport = hclge_get_vf_vport(hdev,
+					   func_id - HCLGE_VF_VPORT_START_NUM);
+		if (!vport) {
+			dev_err(&hdev->pdev->dev, "invalid func id(%d)\n",
+				func_id);
+			return false;
+		}
+
+		dev_info(&hdev->pdev->dev, "do function %d recovery.", func_id);
+
+		ret = hclge_reset_tqp(&vport->nic);
+		if (ret) {
+			dev_err(&hdev->pdev->dev,
+				"failed to reset tqp, ret = %d.", ret);
+			return false;
+		}
+
+		ret = hclge_func_reset_cmd(hdev, func_id);
+		if (ret) {
+			dev_err(&hdev->pdev->dev,
+				"failed to reset func %d, ret = %d.",
+				func_id, ret);
+			return false;
+		}
+
+		exist_set = true;
+		clear_bit(func_id, bitmap);
+		func_id = find_first_bit(bitmap, HCLGE_VPORT_NUM);
+	}
+
+	return exist_set;
+}
+
+static void hclge_get_vf_fault_bitmap(struct hclge_desc *desc,
+				      unsigned long *bitmap)
+{
+#define HCLGE_FIR_FAULT_BYTES	24
+#define HCLGE_SEC_FAULT_BYTES	8
+
+	u8 *buff;
+
+	memcpy(bitmap, desc[0].data, HCLGE_FIR_FAULT_BYTES);
+	buff = (u8 *)bitmap + HCLGE_FIR_FAULT_BYTES;
+	memcpy(buff, desc[1].data, HCLGE_SEC_FAULT_BYTES);
+}
+
+int hclge_handle_vf_queue_err_ras(struct hclge_dev *hdev)
+{
+	unsigned long vf_fault_bitmap[BITS_TO_LONGS(HCLGE_VPORT_NUM)];
+	struct hclge_desc desc[2];
+	bool cause_by_vf = false;
+	int ret;
+
+	if (!hnae3_ae_dev_vf_fault_supported(hdev->ae_dev) ||
+	    !test_and_clear_bit(HNAE3_VF_EXP_RESET,
+				&hdev->ae_dev->hw_err_reset_req))
+		return 0;
+
+	hclge_comm_cmd_setup_basic_desc(&desc[0], HCLGE_OPC_GET_QUEUE_ERR_VF,
+					true);
+	desc[0].flag |= cpu_to_le16(HCLGE_COMM_CMD_FLAG_NEXT);
+	hclge_comm_cmd_setup_basic_desc(&desc[1], HCLGE_OPC_GET_QUEUE_ERR_VF,
+					true);
+
+	ret = hclge_comm_cmd_send(&hdev->hw.hw, desc, 2);
+	if (ret) {
+		dev_err(&hdev->pdev->dev,
+			"failed to get vf bitmap, ret = %d.\n", ret);
+		return ret;
+	}
+	hclge_get_vf_fault_bitmap(desc, vf_fault_bitmap);
+
+	cause_by_vf = hclge_reset_vf_in_bitmap(hdev, vf_fault_bitmap);
+	if (cause_by_vf)
+		hdev->ae_dev->hw_err_reset_req = 0;
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
index 86be6fb32990..68b738affa66 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
@@ -196,6 +196,7 @@  struct hclge_hw_module_id {
 struct hclge_hw_type_id {
 	enum hclge_err_type_list type_id;
 	const char *msg;
+	bool cause_by_vf; /* indicate the error may from vf exception */
 };
 
 struct hclge_sum_err_info {
@@ -228,4 +229,5 @@  int hclge_handle_hw_msix_error(struct hclge_dev *hdev,
 			       unsigned long *reset_requests);
 int hclge_handle_error_info_log(struct hnae3_ae_dev *ae_dev);
 int hclge_handle_mac_tnl(struct hclge_dev *hdev);
+int hclge_handle_vf_queue_err_ras(struct hclge_dev *hdev);
 #endif
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 07ad5f35219e..a1f1f72db8a6 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -3520,7 +3520,7 @@  static int hclge_get_status(struct hnae3_handle *handle)
 	return hdev->hw.mac.link;
 }
 
-static struct hclge_vport *hclge_get_vf_vport(struct hclge_dev *hdev, int vf)
+struct hclge_vport *hclge_get_vf_vport(struct hclge_dev *hdev, int vf)
 {
 	if (!pci_num_vf(hdev->pdev)) {
 		dev_err(&hdev->pdev->dev,
@@ -4559,6 +4559,7 @@  static void hclge_handle_err_recovery(struct hclge_dev *hdev)
 	if (hclge_find_error_source(hdev)) {
 		hclge_handle_error_info_log(ae_dev);
 		hclge_handle_mac_tnl(hdev);
+		hclge_handle_vf_queue_err_ras(hdev);
 	}
 
 	hclge_handle_err_reset_request(hdev);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
index 13f23d606e77..73cfc26f5389 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
@@ -1148,4 +1148,5 @@  int hclge_dbg_dump_rst_info(struct hclge_dev *hdev, char *buf, int len);
 int hclge_push_vf_link_status(struct hclge_vport *vport);
 int hclge_enable_vport_vlan_filter(struct hclge_vport *vport, bool request_en);
 int hclge_mac_update_stats(struct hclge_dev *hdev);
+struct hclge_vport *hclge_get_vf_vport(struct hclge_dev *hdev, int vf);
 #endif