Message ID | 20211123142402.26936-1-liangwenpeng@huawei.com (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Jason Gunthorpe |
Headers | show |
Series | [for-rc] RDMA/hns: Fix the error of destroying resources in hw reseting phase | expand |
On Tue, Nov 23, 2021 at 10:24:02PM +0800, Wenpeng Liang wrote: > From: Yangyang Li <liyangyang20@huawei.com> > > When hns_roce_v2_destroy_qp() is called, the brief calling process of the > driver is as follows: > > ...... > hns_roce_v2_destroy_qp > hns_roce_v2_qp_modify > hns_roce_cmd_mbox > hns_roce_qp_destroy > > If hns_roce_cmd_mbox() detects that the hardware is being reset during > the execution of the hns_roce_cmd_mbox(), the driver will not be able > to get the return value from the hardware (the firmware cannot respond > to the driver's mailbox during the hardware reset phase). The driver > needs to wait for the hardware reset to complete before continuing to > execute hns_roce_qp_destroy(), otherwise it may happen that the driver > releases the resources but the hardware is still accessing. In order to > fix this problem, HNS RoCE needs to add a piece of code to wait for the > hardware reset to complete. > > The original interface get_hw_reset_stat() is the instantaneous state > of the hardware reset, which cannot accurately reflect whether the > hardware reset is completed, so it needs to be replaced with the > ae_dev_reset_cnt interface. > > The sign that the hardware reset is complete is that the return value > of the ae_dev_reset_cnt interface is greater than the original value > reset_cnt recorded by the driver. > > Fixes: 6a04aed6afae ("RDMA/hns: Fix the chip hanging caused by sending mailbox&CMQ during reset") > Signed-off-by: Yangyang Li <liyangyang20@huawei.com> > Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com> > --- > drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 12 +++++++++++- > 1 file changed, 11 insertions(+), 1 deletion(-) And what about the other fix? Should we take both of them or only one? https://lore.kernel.org/all/20211123084809.37318-1-liangwenpeng@huawei.com Thanks > > diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c > index ae14329c619c..bbfa1332dedc 100644 > --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c > +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c > @@ -33,6 +33,7 @@ > #include <linux/acpi.h> > #include <linux/etherdevice.h> > #include <linux/interrupt.h> > +#include <linux/iopoll.h> > #include <linux/kernel.h> > #include <linux/types.h> > #include <net/addrconf.h> > @@ -1050,9 +1051,14 @@ static u32 hns_roce_v2_cmd_hw_resetting(struct hns_roce_dev *hr_dev, > unsigned long instance_stage, > unsigned long reset_stage) > { > +#define HW_RESET_TIMEOUT_US 1000000 > +#define HW_RESET_SLEEP_US 1000 > + > struct hns_roce_v2_priv *priv = hr_dev->priv; > struct hnae3_handle *handle = priv->handle; > const struct hnae3_ae_ops *ops = handle->ae_algo->ops; > + unsigned long val; > + int ret; > > /* When hardware reset is detected, we should stop sending mailbox&cmq& > * doorbell to hardware. If now in .init_instance() function, we should > @@ -1064,7 +1070,11 @@ static u32 hns_roce_v2_cmd_hw_resetting(struct hns_roce_dev *hr_dev, > * again. > */ > hr_dev->dis_db = true; > - if (!ops->get_hw_reset_stat(handle)) > + > + ret = read_poll_timeout(ops->ae_dev_reset_cnt, val, > + val > hr_dev->reset_cnt, HW_RESET_SLEEP_US, > + HW_RESET_TIMEOUT_US, false, handle); > + if (!ret) > hr_dev->is_reset = true; > > if (!hr_dev->is_reset || reset_stage == HNS_ROCE_STATE_RST_INIT || > -- > 2.33.0 >
On 2021/11/25 21:17, Leon Romanovsky wrote: > On Tue, Nov 23, 2021 at 10:24:02PM +0800, Wenpeng Liang wrote: >> From: Yangyang Li <liyangyang20@huawei.com> >> >> When hns_roce_v2_destroy_qp() is called, the brief calling process of the >> driver is as follows: >> >> ...... >> hns_roce_v2_destroy_qp >> hns_roce_v2_qp_modify >> hns_roce_cmd_mbox >> hns_roce_qp_destroy >> >> If hns_roce_cmd_mbox() detects that the hardware is being reset during >> the execution of the hns_roce_cmd_mbox(), the driver will not be able >> to get the return value from the hardware (the firmware cannot respond >> to the driver's mailbox during the hardware reset phase). The driver >> needs to wait for the hardware reset to complete before continuing to >> execute hns_roce_qp_destroy(), otherwise it may happen that the driver >> releases the resources but the hardware is still accessing. In order to >> fix this problem, HNS RoCE needs to add a piece of code to wait for the >> hardware reset to complete. >> >> The original interface get_hw_reset_stat() is the instantaneous state >> of the hardware reset, which cannot accurately reflect whether the >> hardware reset is completed, so it needs to be replaced with the >> ae_dev_reset_cnt interface. >> >> The sign that the hardware reset is complete is that the return value >> of the ae_dev_reset_cnt interface is greater than the original value >> reset_cnt recorded by the driver. >> >> Fixes: 6a04aed6afae ("RDMA/hns: Fix the chip hanging caused by sending mailbox&CMQ during reset") >> Signed-off-by: Yangyang Li <liyangyang20@huawei.com> >> Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com> >> --- >> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 12 +++++++++++- >> 1 file changed, 11 insertions(+), 1 deletion(-) > > And what about the other fix? > Should we take both of them or only one? > https://lore.kernel.org/all/20211123084809.37318-1-liangwenpeng@huawei.com > > Thanks > These two fixes are independent of each other. Thanks Wenpeng >> >> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c >> index ae14329c619c..bbfa1332dedc 100644 >> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c >> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c >> @@ -33,6 +33,7 @@ >> #include <linux/acpi.h> >> #include <linux/etherdevice.h> >> #include <linux/interrupt.h> >> +#include <linux/iopoll.h> >> #include <linux/kernel.h> >> #include <linux/types.h> >> #include <net/addrconf.h> >> @@ -1050,9 +1051,14 @@ static u32 hns_roce_v2_cmd_hw_resetting(struct hns_roce_dev *hr_dev, >> unsigned long instance_stage, >> unsigned long reset_stage) >> { >> +#define HW_RESET_TIMEOUT_US 1000000 >> +#define HW_RESET_SLEEP_US 1000 >> + >> struct hns_roce_v2_priv *priv = hr_dev->priv; >> struct hnae3_handle *handle = priv->handle; >> const struct hnae3_ae_ops *ops = handle->ae_algo->ops; >> + unsigned long val; >> + int ret; >> >> /* When hardware reset is detected, we should stop sending mailbox&cmq& >> * doorbell to hardware. If now in .init_instance() function, we should >> @@ -1064,7 +1070,11 @@ static u32 hns_roce_v2_cmd_hw_resetting(struct hns_roce_dev *hr_dev, >> * again. >> */ >> hr_dev->dis_db = true; >> - if (!ops->get_hw_reset_stat(handle)) >> + >> + ret = read_poll_timeout(ops->ae_dev_reset_cnt, val, >> + val > hr_dev->reset_cnt, HW_RESET_SLEEP_US, >> + HW_RESET_TIMEOUT_US, false, handle); >> + if (!ret) >> hr_dev->is_reset = true; >> >> if (!hr_dev->is_reset || reset_stage == HNS_ROCE_STATE_RST_INIT || >> -- >> 2.33.0 >> > . >
On Tue, Nov 23, 2021 at 10:24:02PM +0800, Wenpeng Liang wrote: > From: Yangyang Li <liyangyang20@huawei.com> > > When hns_roce_v2_destroy_qp() is called, the brief calling process of the > driver is as follows: > > ...... > hns_roce_v2_destroy_qp > hns_roce_v2_qp_modify > hns_roce_cmd_mbox > hns_roce_qp_destroy > > If hns_roce_cmd_mbox() detects that the hardware is being reset during > the execution of the hns_roce_cmd_mbox(), the driver will not be able > to get the return value from the hardware (the firmware cannot respond > to the driver's mailbox during the hardware reset phase). The driver > needs to wait for the hardware reset to complete before continuing to > execute hns_roce_qp_destroy(), otherwise it may happen that the driver > releases the resources but the hardware is still accessing. In order to > fix this problem, HNS RoCE needs to add a piece of code to wait for the > hardware reset to complete. > > The original interface get_hw_reset_stat() is the instantaneous state > of the hardware reset, which cannot accurately reflect whether the > hardware reset is completed, so it needs to be replaced with the > ae_dev_reset_cnt interface. > > The sign that the hardware reset is complete is that the return value > of the ae_dev_reset_cnt interface is greater than the original value > reset_cnt recorded by the driver. > > Fixes: 6a04aed6afae ("RDMA/hns: Fix the chip hanging caused by sending mailbox&CMQ during reset") > Signed-off-by: Yangyang Li <liyangyang20@huawei.com> > Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com> > --- > drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 12 +++++++++++- > 1 file changed, 11 insertions(+), 1 deletion(-) Applied to for-rc, thanks Jason
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c index ae14329c619c..bbfa1332dedc 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c @@ -33,6 +33,7 @@ #include <linux/acpi.h> #include <linux/etherdevice.h> #include <linux/interrupt.h> +#include <linux/iopoll.h> #include <linux/kernel.h> #include <linux/types.h> #include <net/addrconf.h> @@ -1050,9 +1051,14 @@ static u32 hns_roce_v2_cmd_hw_resetting(struct hns_roce_dev *hr_dev, unsigned long instance_stage, unsigned long reset_stage) { +#define HW_RESET_TIMEOUT_US 1000000 +#define HW_RESET_SLEEP_US 1000 + struct hns_roce_v2_priv *priv = hr_dev->priv; struct hnae3_handle *handle = priv->handle; const struct hnae3_ae_ops *ops = handle->ae_algo->ops; + unsigned long val; + int ret; /* When hardware reset is detected, we should stop sending mailbox&cmq& * doorbell to hardware. If now in .init_instance() function, we should @@ -1064,7 +1070,11 @@ static u32 hns_roce_v2_cmd_hw_resetting(struct hns_roce_dev *hr_dev, * again. */ hr_dev->dis_db = true; - if (!ops->get_hw_reset_stat(handle)) + + ret = read_poll_timeout(ops->ae_dev_reset_cnt, val, + val > hr_dev->reset_cnt, HW_RESET_SLEEP_US, + HW_RESET_TIMEOUT_US, false, handle); + if (!ret) hr_dev->is_reset = true; if (!hr_dev->is_reset || reset_stage == HNS_ROCE_STATE_RST_INIT ||