Message ID | 20230113020829.48451-3-lanhao@huawei.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | add some vf fault detect patch for hns | expand |
On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote: > From: Jie Wang <wangjie125@huawei.com> > > Currently hns3 driver supports vf fault detect feature. Several ras caused > by VF resources don't need to do PF function reset for recovery. The driver > only needs to reset the specified VF. > > So this patch adds process in ras module. New process will get detailed > information about ras and do the most correct measures based on these > accurate information. > > Signed-off-by: Jie Wang <wangjie125@huawei.com> > Signed-off-by: Hao Lan <lanhao@huawei.com> > --- > drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + > .../hns3/hns3_common/hclge_comm_cmd.h | 1 + > .../hisilicon/hns3/hns3pf/hclge_err.c | 113 +++++++++++++++++- > .../hisilicon/hns3/hns3pf/hclge_err.h | 2 + > .../hisilicon/hns3/hns3pf/hclge_main.c | 3 +- > .../hisilicon/hns3/hns3pf/hclge_main.h | 1 + > 6 files changed, 115 insertions(+), 6 deletions(-) Why is it good idea to reset VF from PF? What will happen with driver bound to this VF? Shouldn't PCI recovery handle it? Thanks
On 2023/1/13 14:51, Leon Romanovsky wrote: > On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote: >> From: Jie Wang <wangjie125@huawei.com> >> >> Currently hns3 driver supports vf fault detect feature. Several ras caused >> by VF resources don't need to do PF function reset for recovery. The driver >> only needs to reset the specified VF. >> >> So this patch adds process in ras module. New process will get detailed >> information about ras and do the most correct measures based on these >> accurate information. >> >> Signed-off-by: Jie Wang <wangjie125@huawei.com> >> Signed-off-by: Hao Lan <lanhao@huawei.com> >> --- >> drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + >> .../hns3/hns3_common/hclge_comm_cmd.h | 1 + >> .../hisilicon/hns3/hns3pf/hclge_err.c | 113 +++++++++++++++++- >> .../hisilicon/hns3/hns3pf/hclge_err.h | 2 + >> .../hisilicon/hns3/hns3pf/hclge_main.c | 3 +- >> .../hisilicon/hns3/hns3pf/hclge_main.h | 1 + >> 6 files changed, 115 insertions(+), 6 deletions(-) > > Why is it good idea to reset VF from PF? > What will happen with driver bound to this VF? > Shouldn't PCI recovery handle it? > > Thanks > . PF doesn't reset VF directly. These VF faults are detected by hardware, and only reported to PF. PF get the VF id from firmware, then notify the VF that it needs reset. VF will do reset after receive the request. These hardware faults are not standard PCI ras event, so we prefer to use MSIx path. >
Hello, On Fri, 2023-01-13 at 10:08 +0800, Hao Lan wrote: [...] > +static void hclge_get_vf_fault_bitmap(struct hclge_desc *desc, > + unsigned long *bitmap) > +{ > +#define HCLGE_FIR_FAULT_BYTES 24 > +#define HCLGE_SEC_FAULT_BYTES 8 > + > + u8 *buff; > + > + memcpy(bitmap, desc[0].data, HCLGE_FIR_FAULT_BYTES); > + buff = (u8 *)bitmap + HCLGE_FIR_FAULT_BYTES; > + memcpy(buff, desc[1].data, HCLGE_SEC_FAULT_BYTES); > +} The above works under the assumption that: HCLGE_FIR_FAULT_BYTES + HCLGE_SEC_FAULT_BYTES == BITS_TO_BYTES(HCLGE_VPORT_NUM) I think it's better to enforce such condition at build time with a BUILD_BUG_ON(), to avoid future issues. Also I think Leon still deserve a reply to one of his questions, specifically: What will happen (at recovery time) with driver bound to this VF? Thanks! Paolo
On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote: > > > On 2023/1/13 14:51, Leon Romanovsky wrote: > > On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote: > > > From: Jie Wang <wangjie125@huawei.com> > > > > > > Currently hns3 driver supports vf fault detect feature. Several ras caused > > > by VF resources don't need to do PF function reset for recovery. The driver > > > only needs to reset the specified VF. > > > > > > So this patch adds process in ras module. New process will get detailed > > > information about ras and do the most correct measures based on these > > > accurate information. > > > > > > Signed-off-by: Jie Wang <wangjie125@huawei.com> > > > Signed-off-by: Hao Lan <lanhao@huawei.com> > > > --- > > > drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + > > > .../hns3/hns3_common/hclge_comm_cmd.h | 1 + > > > .../hisilicon/hns3/hns3pf/hclge_err.c | 113 +++++++++++++++++- > > > .../hisilicon/hns3/hns3pf/hclge_err.h | 2 + > > > .../hisilicon/hns3/hns3pf/hclge_main.c | 3 +- > > > .../hisilicon/hns3/hns3pf/hclge_main.h | 1 + > > > 6 files changed, 115 insertions(+), 6 deletions(-) > > > > Why is it good idea to reset VF from PF? > > What will happen with driver bound to this VF? > > Shouldn't PCI recovery handle it? > > > > Thanks > > . > PF doesn't reset VF directly. These VF faults are detected by hardware, > and only reported to PF. PF get the VF id from firmware, then notify the VF > that it needs reset. VF will do reset after receive the request. This description isn't aligned with the code. You are issuing hclge_func_reset_cmd() command which will reset VF, while notification are handled by hclge_func_reset_notify_vf(). It also doesn't make any sense to send notification event to VF through FW while the goal is to recover from stuck FW in that VF. > > These hardware faults are not standard PCI ras event, so we prefer to use > MSIx path. What is different here? > >
On 2023/1/17 19:21, Leon Romanovsky wrote: > On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote: >> >> >> On 2023/1/13 14:51, Leon Romanovsky wrote: >>> On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote: >>>> From: Jie Wang <wangjie125@huawei.com> >>>> >>>> Currently hns3 driver supports vf fault detect feature. Several ras caused >>>> by VF resources don't need to do PF function reset for recovery. The driver >>>> only needs to reset the specified VF. >>>> >>>> So this patch adds process in ras module. New process will get detailed >>>> information about ras and do the most correct measures based on these >>>> accurate information. >>>> >>>> Signed-off-by: Jie Wang <wangjie125@huawei.com> >>>> Signed-off-by: Hao Lan <lanhao@huawei.com> >>>> --- >>>> drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + >>>> .../hns3/hns3_common/hclge_comm_cmd.h | 1 + >>>> .../hisilicon/hns3/hns3pf/hclge_err.c | 113 +++++++++++++++++- >>>> .../hisilicon/hns3/hns3pf/hclge_err.h | 2 + >>>> .../hisilicon/hns3/hns3pf/hclge_main.c | 3 +- >>>> .../hisilicon/hns3/hns3pf/hclge_main.h | 1 + >>>> 6 files changed, 115 insertions(+), 6 deletions(-) >>> >>> Why is it good idea to reset VF from PF? >>> What will happen with driver bound to this VF? >>> Shouldn't PCI recovery handle it? >>> >>> Thanks >>> . >> PF doesn't reset VF directly. These VF faults are detected by hardware, >> and only reported to PF. PF get the VF id from firmware, then notify the VF >> that it needs reset. VF will do reset after receive the request. > > This description isn't aligned with the code. You are issuing > hclge_func_reset_cmd() command which will reset VF, while notification > are handled by hclge_func_reset_notify_vf(). > > It also doesn't make any sense to send notification event to VF through > FW while the goal is to recover from stuck FW in that VF. > Yes, I misunderstand the hclge_func_reset_notify_vf and hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform the VF for recovery. I will fix and retest it in V2. This patch is used to recover specific vf hardware errors, for example the tx queue configuration exceptions. It make sense in these cases for the firmware is still working properly and can do the recovery rightly. >> >> These hardware faults are not standard PCI ras event, so we prefer to use >> MSIx path. > > What is different here? > These hardware faults are reported by MSIx interrupts instead of PCI ras path. Thanks! >>> > . >
On 2023/1/17 16:04, Paolo Abeni wrote: > Hello, > > On Fri, 2023-01-13 at 10:08 +0800, Hao Lan wrote: > [...] > >> +static void hclge_get_vf_fault_bitmap(struct hclge_desc *desc, >> + unsigned long *bitmap) >> +{ >> +#define HCLGE_FIR_FAULT_BYTES 24 >> +#define HCLGE_SEC_FAULT_BYTES 8 >> + >> + u8 *buff; >> + >> + memcpy(bitmap, desc[0].data, HCLGE_FIR_FAULT_BYTES); >> + buff = (u8 *)bitmap + HCLGE_FIR_FAULT_BYTES; >> + memcpy(buff, desc[1].data, HCLGE_SEC_FAULT_BYTES); >> +} > > The above works under the assumption that: > > HCLGE_FIR_FAULT_BYTES + HCLGE_SEC_FAULT_BYTES == BITS_TO_BYTES(HCLGE_VPORT_NUM) > > I think it's better to enforce such condition at build time with a > BUILD_BUG_ON(), to avoid future issues. > > Also I think Leon still deserve a reply to one of his questions, > specifically: What will happen (at recovery time) with driver bound to > this VF? > > Thanks! > > Paolo > As discussed in reply to Leon, it missed to notify VF driver to prepare for the reset in V1, So I will add notify to VF driver in v2 to make sure the VF driver will init to normal at recovery time. Thanks! > . >
On 2023/1/17 16:04, Paolo Abeni wrote: > Hello, > > On Fri, 2023-01-13 at 10:08 +0800, Hao Lan wrote: > [...] > >> +static void hclge_get_vf_fault_bitmap(struct hclge_desc *desc, >> + unsigned long *bitmap) >> +{ >> +#define HCLGE_FIR_FAULT_BYTES 24 >> +#define HCLGE_SEC_FAULT_BYTES 8 >> + >> + u8 *buff; >> + >> + memcpy(bitmap, desc[0].data, HCLGE_FIR_FAULT_BYTES); >> + buff = (u8 *)bitmap + HCLGE_FIR_FAULT_BYTES; >> + memcpy(buff, desc[1].data, HCLGE_SEC_FAULT_BYTES); >> +} > > The above works under the assumption that: > > HCLGE_FIR_FAULT_BYTES + HCLGE_SEC_FAULT_BYTES == BITS_TO_BYTES(HCLGE_VPORT_NUM) > > I think it's better to enforce such condition at build time with a > BUILD_BUG_ON(), to avoid future issues. > OK, I will add it in v2 > Also I think Leon still deserve a reply to one of his questions, > specifically: What will happen (at recovery time) with driver bound to > this VF? > > Thanks! > > Paolo > > . >
On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote: > > > On 2023/1/17 19:21, Leon Romanovsky wrote: > > On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote: > > > > > > > > > On 2023/1/13 14:51, Leon Romanovsky wrote: > > > > On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote: > > > > > From: Jie Wang <wangjie125@huawei.com> > > > > > > > > > > Currently hns3 driver supports vf fault detect feature. Several ras caused > > > > > by VF resources don't need to do PF function reset for recovery. The driver > > > > > only needs to reset the specified VF. > > > > > > > > > > So this patch adds process in ras module. New process will get detailed > > > > > information about ras and do the most correct measures based on these > > > > > accurate information. > > > > > > > > > > Signed-off-by: Jie Wang <wangjie125@huawei.com> > > > > > Signed-off-by: Hao Lan <lanhao@huawei.com> > > > > > --- > > > > > drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + > > > > > .../hns3/hns3_common/hclge_comm_cmd.h | 1 + > > > > > .../hisilicon/hns3/hns3pf/hclge_err.c | 113 +++++++++++++++++- > > > > > .../hisilicon/hns3/hns3pf/hclge_err.h | 2 + > > > > > .../hisilicon/hns3/hns3pf/hclge_main.c | 3 +- > > > > > .../hisilicon/hns3/hns3pf/hclge_main.h | 1 + > > > > > 6 files changed, 115 insertions(+), 6 deletions(-) > > > > > > > > Why is it good idea to reset VF from PF? > > > > What will happen with driver bound to this VF? > > > > Shouldn't PCI recovery handle it? > > > > > > > > Thanks > > > > . > > > PF doesn't reset VF directly. These VF faults are detected by hardware, > > > and only reported to PF. PF get the VF id from firmware, then notify the VF > > > that it needs reset. VF will do reset after receive the request. > > > > This description isn't aligned with the code. You are issuing > > hclge_func_reset_cmd() command which will reset VF, while notification > > are handled by hclge_func_reset_notify_vf(). > > > > It also doesn't make any sense to send notification event to VF through > > FW while the goal is to recover from stuck FW in that VF. > > > Yes, I misunderstand the hclge_func_reset_notify_vf and > hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform > the VF for recovery. I will fix and retest it in V2. > > This patch is used to recover specific vf hardware errors, for example the > tx queue configuration exceptions. It make sense in these cases for the > firmware is still working properly and can do the recovery rightly. If FW is operational and knows about failure, why can't FW do recovery internally to that VF without PF involvement? Thanks
On 2023/1/21 1:12, Leon Romanovsky wrote: > On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote: >> >> >> On 2023/1/17 19:21, Leon Romanovsky wrote: >>> On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote: >>>> >>>> >>>> On 2023/1/13 14:51, Leon Romanovsky wrote: >>>>> On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote: >>>>>> From: Jie Wang <wangjie125@huawei.com> >>>>>> >>>>>> Currently hns3 driver supports vf fault detect feature. Several ras caused >>>>>> by VF resources don't need to do PF function reset for recovery. The driver >>>>>> only needs to reset the specified VF. >>>>>> >>>>>> So this patch adds process in ras module. New process will get detailed >>>>>> information about ras and do the most correct measures based on these >>>>>> accurate information. >>>>>> >>>>>> Signed-off-by: Jie Wang <wangjie125@huawei.com> >>>>>> Signed-off-by: Hao Lan <lanhao@huawei.com> >>>>>> --- >>>>>> drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + >>>>>> .../hns3/hns3_common/hclge_comm_cmd.h | 1 + >>>>>> .../hisilicon/hns3/hns3pf/hclge_err.c | 113 +++++++++++++++++- >>>>>> .../hisilicon/hns3/hns3pf/hclge_err.h | 2 + >>>>>> .../hisilicon/hns3/hns3pf/hclge_main.c | 3 +- >>>>>> .../hisilicon/hns3/hns3pf/hclge_main.h | 1 + >>>>>> 6 files changed, 115 insertions(+), 6 deletions(-) >>>>> >>>>> Why is it good idea to reset VF from PF? >>>>> What will happen with driver bound to this VF? >>>>> Shouldn't PCI recovery handle it? >>>>> >>>>> Thanks >>>>> . >>>> PF doesn't reset VF directly. These VF faults are detected by hardware, >>>> and only reported to PF. PF get the VF id from firmware, then notify the VF >>>> that it needs reset. VF will do reset after receive the request. >>> >>> This description isn't aligned with the code. You are issuing >>> hclge_func_reset_cmd() command which will reset VF, while notification >>> are handled by hclge_func_reset_notify_vf(). >>> >>> It also doesn't make any sense to send notification event to VF through >>> FW while the goal is to recover from stuck FW in that VF. >>> >> Yes, I misunderstand the hclge_func_reset_notify_vf and >> hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform >> the VF for recovery. I will fix and retest it in V2. >> >> This patch is used to recover specific vf hardware errors, for example the >> tx queue configuration exceptions. It make sense in these cases for the >> firmware is still working properly and can do the recovery rightly. > > If FW is operational and knows about failure, why can't FW do recovery > internally to that VF without PF involvement? I'm sorry to reply so late because I took a vacation. If firmware reset VF hardware directly without notify the running VF driver, it will cause VF driver works abnormal. Thanks > > Thanks > . >
On Tue, Jan 31, 2023 at 08:04:14PM +0800, wangjie (L) wrote: > > > On 2023/1/21 1:12, Leon Romanovsky wrote: > > On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote: > > > > > > > > > On 2023/1/17 19:21, Leon Romanovsky wrote: > > > > On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote: > > > > > > > > > > > > > > > On 2023/1/13 14:51, Leon Romanovsky wrote: > > > > > > On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote: > > > > > > > From: Jie Wang <wangjie125@huawei.com> > > > > > > > > > > > > > > Currently hns3 driver supports vf fault detect feature. Several ras caused > > > > > > > by VF resources don't need to do PF function reset for recovery. The driver > > > > > > > only needs to reset the specified VF. > > > > > > > > > > > > > > So this patch adds process in ras module. New process will get detailed > > > > > > > information about ras and do the most correct measures based on these > > > > > > > accurate information. > > > > > > > > > > > > > > Signed-off-by: Jie Wang <wangjie125@huawei.com> > > > > > > > Signed-off-by: Hao Lan <lanhao@huawei.com> > > > > > > > --- > > > > > > > drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + > > > > > > > .../hns3/hns3_common/hclge_comm_cmd.h | 1 + > > > > > > > .../hisilicon/hns3/hns3pf/hclge_err.c | 113 +++++++++++++++++- > > > > > > > .../hisilicon/hns3/hns3pf/hclge_err.h | 2 + > > > > > > > .../hisilicon/hns3/hns3pf/hclge_main.c | 3 +- > > > > > > > .../hisilicon/hns3/hns3pf/hclge_main.h | 1 + > > > > > > > 6 files changed, 115 insertions(+), 6 deletions(-) > > > > > > > > > > > > Why is it good idea to reset VF from PF? > > > > > > What will happen with driver bound to this VF? > > > > > > Shouldn't PCI recovery handle it? > > > > > > > > > > > > Thanks > > > > > > . > > > > > PF doesn't reset VF directly. These VF faults are detected by hardware, > > > > > and only reported to PF. PF get the VF id from firmware, then notify the VF > > > > > that it needs reset. VF will do reset after receive the request. > > > > > > > > This description isn't aligned with the code. You are issuing > > > > hclge_func_reset_cmd() command which will reset VF, while notification > > > > are handled by hclge_func_reset_notify_vf(). > > > > > > > > It also doesn't make any sense to send notification event to VF through > > > > FW while the goal is to recover from stuck FW in that VF. > > > > > > > Yes, I misunderstand the hclge_func_reset_notify_vf and > > > hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform > > > the VF for recovery. I will fix and retest it in V2. > > > > > > This patch is used to recover specific vf hardware errors, for example the > > > tx queue configuration exceptions. It make sense in these cases for the > > > firmware is still working properly and can do the recovery rightly. > > > > If FW is operational and knows about failure, why can't FW do recovery > > internally to that VF without PF involvement? > I'm sorry to reply so late because I took a vacation. If firmware reset VF > hardware directly without notify the running VF driver, it will cause VF > driver works abnormal. mlx5 health recovery code proves that it is possible to do. Even in your case, FW can notify VF without PF in the middle. Thanks > > Thanks > > > > Thanks > > . > >
On 2023/1/31 21:24, Leon Romanovsky wrote: > On Tue, Jan 31, 2023 at 08:04:14PM +0800, wangjie (L) wrote: >> >> >> On 2023/1/21 1:12, Leon Romanovsky wrote: >>> On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote: >>>> >>>> >>>> On 2023/1/17 19:21, Leon Romanovsky wrote: >>>>> On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote: >>>>>> >>>>>> >>>>>> On 2023/1/13 14:51, Leon Romanovsky wrote: >>>>>>> On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote: >>>>>>>> From: Jie Wang <wangjie125@huawei.com> >>>>>>>> >>>>>>>> Currently hns3 driver supports vf fault detect feature. Several ras caused >>>>>>>> by VF resources don't need to do PF function reset for recovery. The driver >>>>>>>> only needs to reset the specified VF. >>>>>>>> >>>>>>>> So this patch adds process in ras module. New process will get detailed >>>>>>>> information about ras and do the most correct measures based on these >>>>>>>> accurate information. >>>>>>>> >>>>>>>> Signed-off-by: Jie Wang <wangjie125@huawei.com> >>>>>>>> Signed-off-by: Hao Lan <lanhao@huawei.com> >>>>>>>> --- >>>>>>>> drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + >>>>>>>> .../hns3/hns3_common/hclge_comm_cmd.h | 1 + >>>>>>>> .../hisilicon/hns3/hns3pf/hclge_err.c | 113 +++++++++++++++++- >>>>>>>> .../hisilicon/hns3/hns3pf/hclge_err.h | 2 + >>>>>>>> .../hisilicon/hns3/hns3pf/hclge_main.c | 3 +- >>>>>>>> .../hisilicon/hns3/hns3pf/hclge_main.h | 1 + >>>>>>>> 6 files changed, 115 insertions(+), 6 deletions(-) >>>>>>> >>>>>>> Why is it good idea to reset VF from PF? >>>>>>> What will happen with driver bound to this VF? >>>>>>> Shouldn't PCI recovery handle it? >>>>>>> >>>>>>> Thanks >>>>>>> . >>>>>> PF doesn't reset VF directly. These VF faults are detected by hardware, >>>>>> and only reported to PF. PF get the VF id from firmware, then notify the VF >>>>>> that it needs reset. VF will do reset after receive the request. >>>>> >>>>> This description isn't aligned with the code. You are issuing >>>>> hclge_func_reset_cmd() command which will reset VF, while notification >>>>> are handled by hclge_func_reset_notify_vf(). >>>>> >>>>> It also doesn't make any sense to send notification event to VF through >>>>> FW while the goal is to recover from stuck FW in that VF. >>>>> >>>> Yes, I misunderstand the hclge_func_reset_notify_vf and >>>> hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform >>>> the VF for recovery. I will fix and retest it in V2. >>>> >>>> This patch is used to recover specific vf hardware errors, for example the >>>> tx queue configuration exceptions. It make sense in these cases for the >>>> firmware is still working properly and can do the recovery rightly. >>> >>> If FW is operational and knows about failure, why can't FW do recovery >>> internally to that VF without PF involvement? >> I'm sorry to reply so late because I took a vacation. If firmware reset VF >> hardware directly without notify the running VF driver, it will cause VF >> driver works abnormal. > > mlx5 health recovery code proves that it is possible to do. > Even in your case, FW can notify VF without PF in the middle. > > Thanks > These faults only report to PF in hns3 devices, even if devlink health is used in hns3 driver, these faults also need to report to PF. Thanks >> >> Thanks >>> >>> Thanks >>> . >>> > . >
On Thu, Feb 02, 2023 at 09:08:37PM +0800, wangjie (L) wrote: > > > On 2023/1/31 21:24, Leon Romanovsky wrote: > > On Tue, Jan 31, 2023 at 08:04:14PM +0800, wangjie (L) wrote: > > > > > > > > > On 2023/1/21 1:12, Leon Romanovsky wrote: > > > > On Wed, Jan 18, 2023 at 08:34:03PM +0800, wangjie (L) wrote: > > > > > > > > > > > > > > > On 2023/1/17 19:21, Leon Romanovsky wrote: > > > > > > On Tue, Jan 17, 2023 at 03:04:15PM +0800, wangjie (L) wrote: > > > > > > > > > > > > > > > > > > > > > On 2023/1/13 14:51, Leon Romanovsky wrote: > > > > > > > > On Fri, Jan 13, 2023 at 10:08:29AM +0800, Hao Lan wrote: > > > > > > > > > From: Jie Wang <wangjie125@huawei.com> > > > > > > > > > > > > > > > > > > Currently hns3 driver supports vf fault detect feature. Several ras caused > > > > > > > > > by VF resources don't need to do PF function reset for recovery. The driver > > > > > > > > > only needs to reset the specified VF. > > > > > > > > > > > > > > > > > > So this patch adds process in ras module. New process will get detailed > > > > > > > > > information about ras and do the most correct measures based on these > > > > > > > > > accurate information. > > > > > > > > > > > > > > > > > > Signed-off-by: Jie Wang <wangjie125@huawei.com> > > > > > > > > > Signed-off-by: Hao Lan <lanhao@huawei.com> > > > > > > > > > --- > > > > > > > > > drivers/net/ethernet/hisilicon/hns3/hnae3.h | 1 + > > > > > > > > > .../hns3/hns3_common/hclge_comm_cmd.h | 1 + > > > > > > > > > .../hisilicon/hns3/hns3pf/hclge_err.c | 113 +++++++++++++++++- > > > > > > > > > .../hisilicon/hns3/hns3pf/hclge_err.h | 2 + > > > > > > > > > .../hisilicon/hns3/hns3pf/hclge_main.c | 3 +- > > > > > > > > > .../hisilicon/hns3/hns3pf/hclge_main.h | 1 + > > > > > > > > > 6 files changed, 115 insertions(+), 6 deletions(-) > > > > > > > > > > > > > > > > Why is it good idea to reset VF from PF? > > > > > > > > What will happen with driver bound to this VF? > > > > > > > > Shouldn't PCI recovery handle it? > > > > > > > > > > > > > > > > Thanks > > > > > > > > . > > > > > > > PF doesn't reset VF directly. These VF faults are detected by hardware, > > > > > > > and only reported to PF. PF get the VF id from firmware, then notify the VF > > > > > > > that it needs reset. VF will do reset after receive the request. > > > > > > > > > > > > This description isn't aligned with the code. You are issuing > > > > > > hclge_func_reset_cmd() command which will reset VF, while notification > > > > > > are handled by hclge_func_reset_notify_vf(). > > > > > > > > > > > > It also doesn't make any sense to send notification event to VF through > > > > > > FW while the goal is to recover from stuck FW in that VF. > > > > > > > > > > > Yes, I misunderstand the hclge_func_reset_notify_vf and > > > > > hclge_func_reset_cmd. It should use hclge_func_reset_notify_vf to inform > > > > > the VF for recovery. I will fix and retest it in V2. > > > > > > > > > > This patch is used to recover specific vf hardware errors, for example the > > > > > tx queue configuration exceptions. It make sense in these cases for the > > > > > firmware is still working properly and can do the recovery rightly. > > > > > > > > If FW is operational and knows about failure, why can't FW do recovery > > > > internally to that VF without PF involvement? > > > I'm sorry to reply so late because I took a vacation. If firmware reset VF > > > hardware directly without notify the running VF driver, it will cause VF > > > driver works abnormal. > > > > mlx5 health recovery code proves that it is possible to do. > > Even in your case, FW can notify VF without PF in the middle. > > > > Thanks > > > These faults only report to PF in hns3 devices, even if devlink health is > used in hns3 driver, these faults also need to report to PF. I don't think that it is correct, but let's see your v2 without VF reset and with PF notifications. Thanks > > Thanks > > > > > > Thanks > > > > > > > > Thanks > > > > . > > > > > > . > >
diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index cd85c360335d..76a6230ccfee 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -265,6 +265,7 @@ enum hnae3_reset_type { HNAE3_GLOBAL_RESET, HNAE3_IMP_RESET, HNAE3_NONE_RESET, + HNAE3_VF_EXP_RESET, HNAE3_MAX_RESET, }; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h index ca3692dc2848..8f823cdc0543 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_common/hclge_comm_cmd.h @@ -92,6 +92,7 @@ enum hclge_opcode_type { HCLGE_OPC_DFX_SSU_REG_2 = 0x004F, HCLGE_OPC_QUERY_DEV_SPECS = 0x0050, + HCLGE_OPC_GET_QUEUE_ERR_VF = 0x0067, /* MAC command */ HCLGE_OPC_CONFIG_MAC_MODE = 0x0301, diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c index 6efd768cc07c..cf8f3882304a 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c @@ -1299,10 +1299,12 @@ static const struct hclge_hw_type_id hclge_hw_type_id_st[] = { .msg = "tqp_int_ecc_error" }, { .type_id = PF_ABNORMAL_INT_ERROR, - .msg = "pf_abnormal_int_error" + .msg = "pf_abnormal_int_error", + .cause_by_vf = true }, { .type_id = MPF_ABNORMAL_INT_ERROR, - .msg = "mpf_abnormal_int_error" + .msg = "mpf_abnormal_int_error", + .cause_by_vf = true }, { .type_id = COMMON_ERROR, .msg = "common_error" @@ -2757,7 +2759,7 @@ void hclge_handle_occurred_error(struct hclge_dev *hdev) hclge_handle_error_info_log(ae_dev); } -static void +static bool hclge_handle_error_type_reg_log(struct device *dev, struct hclge_mod_err_info *mod_info, struct hclge_type_reg_err_info *type_reg_info) @@ -2768,6 +2770,7 @@ hclge_handle_error_type_reg_log(struct device *dev, u8 mod_id, total_module, type_id, total_type, i, is_ras; u8 index_module = MODULE_NONE; u8 index_type = NONE_ERROR; + bool cause_by_vf = false; mod_id = mod_info->mod_id; type_id = type_reg_info->type_id & HCLGE_ERR_TYPE_MASK; @@ -2786,6 +2789,7 @@ hclge_handle_error_type_reg_log(struct device *dev, for (i = 0; i < total_type; i++) { if (type_id == hclge_hw_type_id_st[i].type_id) { index_type = i; + cause_by_vf = hclge_hw_type_id_st[i].cause_by_vf; break; } } @@ -2803,6 +2807,8 @@ hclge_handle_error_type_reg_log(struct device *dev, dev_err(dev, "reg_value:\n"); for (i = 0; i < type_reg_info->reg_num; i++) dev_err(dev, "0x%08x\n", type_reg_info->hclge_reg[i]); + + return cause_by_vf; } static void hclge_handle_error_module_log(struct hnae3_ae_dev *ae_dev, @@ -2813,6 +2819,7 @@ static void hclge_handle_error_module_log(struct hnae3_ae_dev *ae_dev, struct device *dev = &hdev->pdev->dev; struct hclge_mod_err_info *mod_info; struct hclge_sum_err_info *sum_info; + bool cause_by_vf = false; u8 mod_num, err_num, i; u32 offset = 0; @@ -2841,12 +2848,16 @@ static void hclge_handle_error_module_log(struct hnae3_ae_dev *ae_dev, type_reg_info = (struct hclge_type_reg_err_info *) &buf[offset++]; - hclge_handle_error_type_reg_log(dev, mod_info, - type_reg_info); + if (hclge_handle_error_type_reg_log(dev, mod_info, + type_reg_info)) + cause_by_vf = true; offset += type_reg_info->reg_num; } } + + if (hnae3_ae_dev_vf_fault_supported(hdev->ae_dev) && cause_by_vf) + set_bit(HNAE3_VF_EXP_RESET, &ae_dev->hw_err_reset_req); } static int hclge_query_all_err_bd_num(struct hclge_dev *hdev, u32 *bd_num) @@ -2938,3 +2949,95 @@ int hclge_handle_error_info_log(struct hnae3_ae_dev *ae_dev) out: return ret; } + +static bool hclge_reset_vf_in_bitmap(struct hclge_dev *hdev, + unsigned long *bitmap) +{ + struct hclge_vport *vport; + bool exist_set = false; + int func_id; + int ret; + + func_id = find_first_bit(bitmap, HCLGE_VPORT_NUM); + if (func_id == PF_VPORT_ID) + return false; + + while (func_id != HCLGE_VPORT_NUM) { + vport = hclge_get_vf_vport(hdev, + func_id - HCLGE_VF_VPORT_START_NUM); + if (!vport) { + dev_err(&hdev->pdev->dev, "invalid func id(%d)\n", + func_id); + return false; + } + + dev_info(&hdev->pdev->dev, "do function %d recovery.", func_id); + + ret = hclge_reset_tqp(&vport->nic); + if (ret) { + dev_err(&hdev->pdev->dev, + "failed to reset tqp, ret = %d.", ret); + return false; + } + + ret = hclge_func_reset_cmd(hdev, func_id); + if (ret) { + dev_err(&hdev->pdev->dev, + "failed to reset func %d, ret = %d.", + func_id, ret); + return false; + } + + exist_set = true; + clear_bit(func_id, bitmap); + func_id = find_first_bit(bitmap, HCLGE_VPORT_NUM); + } + + return exist_set; +} + +static void hclge_get_vf_fault_bitmap(struct hclge_desc *desc, + unsigned long *bitmap) +{ +#define HCLGE_FIR_FAULT_BYTES 24 +#define HCLGE_SEC_FAULT_BYTES 8 + + u8 *buff; + + memcpy(bitmap, desc[0].data, HCLGE_FIR_FAULT_BYTES); + buff = (u8 *)bitmap + HCLGE_FIR_FAULT_BYTES; + memcpy(buff, desc[1].data, HCLGE_SEC_FAULT_BYTES); +} + +int hclge_handle_vf_queue_err_ras(struct hclge_dev *hdev) +{ + unsigned long vf_fault_bitmap[BITS_TO_LONGS(HCLGE_VPORT_NUM)]; + struct hclge_desc desc[2]; + bool cause_by_vf = false; + int ret; + + if (!hnae3_ae_dev_vf_fault_supported(hdev->ae_dev) || + !test_and_clear_bit(HNAE3_VF_EXP_RESET, + &hdev->ae_dev->hw_err_reset_req)) + return 0; + + hclge_comm_cmd_setup_basic_desc(&desc[0], HCLGE_OPC_GET_QUEUE_ERR_VF, + true); + desc[0].flag |= cpu_to_le16(HCLGE_COMM_CMD_FLAG_NEXT); + hclge_comm_cmd_setup_basic_desc(&desc[1], HCLGE_OPC_GET_QUEUE_ERR_VF, + true); + + ret = hclge_comm_cmd_send(&hdev->hw.hw, desc, 2); + if (ret) { + dev_err(&hdev->pdev->dev, + "failed to get vf bitmap, ret = %d.\n", ret); + return ret; + } + hclge_get_vf_fault_bitmap(desc, vf_fault_bitmap); + + cause_by_vf = hclge_reset_vf_in_bitmap(hdev, vf_fault_bitmap); + if (cause_by_vf) + hdev->ae_dev->hw_err_reset_req = 0; + + return 0; +} diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h index 86be6fb32990..68b738affa66 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h @@ -196,6 +196,7 @@ struct hclge_hw_module_id { struct hclge_hw_type_id { enum hclge_err_type_list type_id; const char *msg; + bool cause_by_vf; /* indicate the error may from vf exception */ }; struct hclge_sum_err_info { @@ -228,4 +229,5 @@ int hclge_handle_hw_msix_error(struct hclge_dev *hdev, unsigned long *reset_requests); int hclge_handle_error_info_log(struct hnae3_ae_dev *ae_dev); int hclge_handle_mac_tnl(struct hclge_dev *hdev); +int hclge_handle_vf_queue_err_ras(struct hclge_dev *hdev); #endif diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 07ad5f35219e..a1f1f72db8a6 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -3520,7 +3520,7 @@ static int hclge_get_status(struct hnae3_handle *handle) return hdev->hw.mac.link; } -static struct hclge_vport *hclge_get_vf_vport(struct hclge_dev *hdev, int vf) +struct hclge_vport *hclge_get_vf_vport(struct hclge_dev *hdev, int vf) { if (!pci_num_vf(hdev->pdev)) { dev_err(&hdev->pdev->dev, @@ -4559,6 +4559,7 @@ static void hclge_handle_err_recovery(struct hclge_dev *hdev) if (hclge_find_error_source(hdev)) { hclge_handle_error_info_log(ae_dev); hclge_handle_mac_tnl(hdev); + hclge_handle_vf_queue_err_ras(hdev); } hclge_handle_err_reset_request(hdev); diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index 13f23d606e77..73cfc26f5389 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -1148,4 +1148,5 @@ int hclge_dbg_dump_rst_info(struct hclge_dev *hdev, char *buf, int len); int hclge_push_vf_link_status(struct hclge_vport *vport); int hclge_enable_vport_vlan_filter(struct hclge_vport *vport, bool request_en); int hclge_mac_update_stats(struct hclge_dev *hdev); +struct hclge_vport *hclge_get_vf_vport(struct hclge_dev *hdev, int vf); #endif