Message ID | 20230809134339.698074-1-manishc@marvell.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | [v2,net] qede: fix firmware halt over suspend and resume |
On Wed, Aug 09, 2023 at 07:13:39PM +0530, Manish Chopra wrote:
> While performing certain power-off sequences, PCI drivers are
> called to suspend and resume their underlying devices through
> PCI PM (power management) interface. However this NIC hardware
> does not support PCI PM suspend/resume operations so system wide
> suspend/resume leads to bad MFW (management firmware) state which
> causes various follow-up errors in driver when communicating with
> the device/firmware afterwards.
>
> To fix this driver implements PCI PM suspend handler to indicate
> unsupported operation to the PCI subsystem explicitly, thus avoiding
> system to go into suspended/standby mode.
>
> Fixes: 2950219d87b0 ("qede: Add basic network device support")
> Cc: David Miller <davem@davemloft.net>
> Signed-off-by: Manish Chopra <manishc@marvell.com>
> Signed-off-by: Alok Prasad <palok@marvell.com>
> ---
> V1->V2:
> * Replace SIMPLE_DEV_PM_OPS with DEFINE_SIMPLE_DEV_PM_OPS

Thanks!

Reviewed-by: Simon Horman <horms@kernel.org>

On Wed, 9 Aug 2023 19:13:39 +0530 Manish Chopra wrote:
> While performing certain power-off sequences, PCI drivers are
> called to suspend and resume their underlying devices through
> PCI PM (power management) interface. However this NIC hardware
> does not support PCI PM suspend/resume operations so system wide
> suspend/resume leads to bad MFW (management firmware) state which
> causes various follow-up errors in driver when communicating with
> the device/firmware afterwards.

Does the FW end up recovering? That could still be preferable to
rejecting suspend altogether. Reject is a big hammer, I'm a bit
worried it will cause a regression in stable.

> To fix this driver implements PCI PM suspend handler to indicate
> unsupported operation to the PCI subsystem explicitly, thus avoiding
> system to go into suspended/standby mode.
>
> Fixes: 2950219d87b0 ("qede: Add basic network device support")
> Cc: David Miller <davem@davemloft.net>
> Signed-off-by: Manish Chopra <manishc@marvell.com>
> Signed-off-by: Alok Prasad <palok@marvell.com>
> ---
> V1->V2:
> * Replace SIMPLE_DEV_PM_OPS with DEFINE_SIMPLE_DEV_PM_OPS
> ---
>  drivers/net/ethernet/qlogic/qede/qede_main.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
> index d57e52a97f85..18ae7af1764c 100644
> --- a/drivers/net/ethernet/qlogic/qede/qede_main.c
> +++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
> @@ -177,6 +177,18 @@ static int qede_sriov_configure(struct pci_dev *pdev, int num_vfs_param)
>  }
>  #endif
>
> +static int __maybe_unused qede_suspend(struct device *dev)
> +{
> +	if (!dev)
> +		return -ENODEV;

Can dev really be NULL here? That wouldn't make sense, what's the
driver supposed to do in such case?

> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Friday, August 11, 2023 6:17 AM
> To: Manish Chopra <manishc@marvell.com>
> Cc: netdev@vger.kernel.org; Ariel Elior <aelior@marvell.com>; Alok Prasad
> <palok@marvell.com>; Nilesh Javali <njavali@marvell.com>; Saurav Kashyap
> <skashyap@marvell.com>; jmeneghi@redhat.com; yuval.mintz@qlogic.com;
> Sudarsana Reddy Kalluru <skalluru@marvell.com>; pabeni@redhat.com;
> edumazet@google.com; horms@kernel.org; David Miller
> <davem@davemloft.net>
> Subject: [EXT] Re: [PATCH v2 net] qede: fix firmware halt over suspend and
> resume
>
> External Email
>
> ----------------------------------------------------------------------
> On Wed, 9 Aug 2023 19:13:39 +0530 Manish Chopra wrote:
> > While performing certain power-off sequences, PCI drivers are called
> > to suspend and resume their underlying devices through PCI PM (power
> > management) interface. However this NIC hardware does not support PCI
> > PM suspend/resume operations so system wide suspend/resume leads to
> > bad MFW (management firmware) state which causes various follow-up
> > errors in driver when communicating with the device/firmware
> > afterwards.
>
> Does the FW end up recovering? That could still be preferable to rejecting
> suspend altogether. Reject is a big hammer, I'm a bit worried it will cause a
> regression in stable.

Yes, By adding the driver's suspend handler with explicit error returned
to PCI subsystem prevents the system wide suspend and does not impact the
device/FW at all. It keeps them operational as they were before.

>
> > To fix this driver implements PCI PM suspend handler to indicate
> > unsupported operation to the PCI subsystem explicitly, thus avoiding
> > system to go into suspended/standby mode.
> >
> > Fixes: 2950219d87b0 ("qede: Add basic network device support")
> > Cc: David Miller <davem@davemloft.net>
> > Signed-off-by: Manish Chopra <manishc@marvell.com>
> > Signed-off-by: Alok Prasad <palok@marvell.com>
> > ---
> > V1->V2:
> > * Replace SIMPLE_DEV_PM_OPS with DEFINE_SIMPLE_DEV_PM_OPS
> > ---
> > drivers/net/ethernet/qlogic/qede/qede_main.c | 13 +++++++++++++
> > 1 file changed, 13 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c
> > b/drivers/net/ethernet/qlogic/qede/qede_main.c
> > index d57e52a97f85..18ae7af1764c 100644
> > --- a/drivers/net/ethernet/qlogic/qede/qede_main.c
> > +++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
> > @@ -177,6 +177,18 @@ static int qede_sriov_configure(struct pci_dev
> > *pdev, int num_vfs_param) } #endif
> >
> > +static int __maybe_unused qede_suspend(struct device *dev) {
> > +	if (!dev)
> > +		return -ENODEV;
>
> Can dev really be NULL here? That wouldn't make sense, what's the driver
> supposed to do in such case?

It's not supposed to be NULL here assuming caller must be validating it
way before. I just put it for sanity. I will remove it.

> --
> pw-bot: cr

On Fri, 11 Aug 2023 09:31:15 +0000 Manish Chopra wrote:
> > Does the FW end up recovering? That could still be preferable to rejecting
> > suspend altogether. Reject is a big hammer, I'm a bit worried it will cause a
> > regression in stable.
>
> Yes, By adding the driver's suspend handler with explicit error returned
> to PCI subsystem prevents the system wide suspend and does not impact the
> device/FW at all. It keeps them operational as they were before.

I'm asking about recovery without this patch, not with it.
That should be evident from the text I'm replying under.

> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Saturday, August 12, 2023 3:15 AM
> To: Manish Chopra <manishc@marvell.com>
> Cc: netdev@vger.kernel.org; Ariel Elior <aelior@marvell.com>; Alok Prasad
> <palok@marvell.com>; Nilesh Javali <njavali@marvell.com>; Saurav Kashyap
> <skashyap@marvell.com>; jmeneghi@redhat.com; yuval.mintz@qlogic.com;
> Sudarsana Reddy Kalluru <skalluru@marvell.com>; pabeni@redhat.com;
> edumazet@google.com; horms@kernel.org; David Miller
> <davem@davemloft.net>
> Subject: Re: [EXT] Re: [PATCH v2 net] qede: fix firmware halt over suspend and
> resume
>
> On Fri, 11 Aug 2023 09:31:15 +0000 Manish Chopra wrote:
> > > Does the FW end up recovering? That could still be preferable to
> > > rejecting suspend altogether. Reject is a big hammer, I'm a bit
> > > worried it will cause a regression in stable.
> >
> > Yes, By adding the driver's suspend handler with explicit error
> > returned to PCI subsystem prevents the system wide suspend and does
> > not impact the device/FW at all. It keeps them operational as they were
> before.
>
> I'm asking about recovery without this patch, not with it.
> That should be evident from the text I'm replying under.

Nope, It does not recover. We have to power cycle the system to recover.

On Mon, 14 Aug 2023 10:24:52 +0000 Manish Chopra wrote:
> > I'm asking about recovery without this patch, not with it.
> > That should be evident from the text I'm replying under.
>
> Nope, It does not recover. We have to power cycle the system to recover.

Alright, please state that in the commit message and drop
the unnecessary NULL check for v2.

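For readers following the review, the handler with the NULL check dropped might look roughly like the sketch below. It is a userspace mock and not the actual follow-up patch: `struct device` and `dev_info()` here are simplified stand-ins (assumptions) for the kernel definitions, and `__maybe_unused` is omitted so the snippet compiles outside the kernel tree.

```c
#include <errno.h>
#include <stdio.h>

/* Stand-in for the kernel's struct device (assumption, illustration only). */
struct device {
	const char *name;
};

/* Stand-in for the kernel's dev_info(); the real one is a printf-style macro. */
static void dev_info(const struct device *dev, const char *msg)
{
	fprintf(stderr, "%s: %s", dev->name, msg);
}

/*
 * Suspend handler without the NULL check questioned in review:
 * it only logs and reports that suspend is not supported.
 */
static int qede_suspend(struct device *dev)
{
	dev_info(dev, "Device does not support suspend operation\n");

	return -EOPNOTSUPP;
}
```

The handler never dereferences anything beyond what `dev_info()` needs, which is why the check on `dev` was judged unnecessary in the review.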
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index d57e52a97f85..18ae7af1764c 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -177,6 +177,18 @@ static int qede_sriov_configure(struct pci_dev *pdev, int num_vfs_param)
 }
 #endif
 
+static int __maybe_unused qede_suspend(struct device *dev)
+{
+	if (!dev)
+		return -ENODEV;
+
+	dev_info(dev, "Device does not support suspend operation\n");
+
+	return -EOPNOTSUPP;
+}
+
+static DEFINE_SIMPLE_DEV_PM_OPS(qede_pm_ops, qede_suspend, NULL);
+
 static const struct pci_error_handlers qede_err_handler = {
 	.error_detected = qede_io_error_detected,
 };
@@ -191,6 +203,7 @@ static struct pci_driver qede_pci_driver = {
 	.sriov_configure = qede_sriov_configure,
 #endif
 	.err_handler = &qede_err_handler,
+	.driver.pm = &qede_pm_ops,
 };
 
 static struct qed_eth_cb_ops qede_ll_ops = {
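
The effect the patch relies on is that an error returned from one device's suspend callback aborts the system-wide transition, so the NIC and its firmware are never touched. The toy model below illustrates that idea only; it is not the kernel PM core, and `toy_system_suspend()`, `nic_suspend()`, `ok_suspend()` and the `struct device` layout are all hypothetical names invented for this sketch.

```c
#include <errno.h>
#include <stddef.h>

struct device;
typedef int (*suspend_fn)(struct device *dev);

/* Toy device: a name, an optional suspend callback, and a state flag. */
struct device {
	const char *name;
	suspend_fn suspend;	/* NULL means the device needs no handling */
	int suspended;
};

/* Models a driver that rejects suspend, as the qede patch does. */
static int nic_suspend(struct device *dev)
{
	(void)dev;
	return -EOPNOTSUPP;
}

/* Models a driver that suspends successfully. */
static int ok_suspend(struct device *dev)
{
	dev->suspended = 1;
	return 0;
}

/*
 * Toy system-wide suspend: call each callback in order and stop at the
 * first error, undoing the devices that were already suspended.
 * Returns 0 on success or the first negative errno.
 */
static int toy_system_suspend(struct device *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int err = devs[i].suspend ? devs[i].suspend(&devs[i]) : 0;

		if (err) {
			while (i-- > 0)
				devs[i].suspended = 0;	/* roll back ("resume") */
			return err;
		}
	}
	return 0;
}
```

With a two-entry array holding a device that uses `ok_suspend` followed by one that uses `nic_suspend`, `toy_system_suspend()` returns `-EOPNOTSUPP` and leaves the first device un-suspended, which mirrors the "big hammer" trade-off raised in the review: one unsupported device blocks suspend for the whole system.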