Message ID | 20200930070537.30982-5-haifeng.zhao@intel.com (mailing list archive) |
---|---|
State | Superseded, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | Fix DPC hotplug race and enhance error handling | expand |
On 9/30/2020 3:05 AM, Ethan Zhao wrote: > When uncorrectable error happens, AER driver and DPC driver interrupt > handlers likely call > > pcie_do_recovery() > ->pci_walk_bus() > ->report_frozen_detected() > > with pci_channel_io_frozen the same time. We need some more data on this. If DPC is supported by HW, errors should be triggered by DPC not AER. If I remember right, there is a register that tells which AER errors should be handled by DPC.
[+cc Sinan] On Wed, Sep 30, 2020 at 03:05:36AM -0400, Ethan Zhao wrote: > When uncorrectable error happens, AER driver and DPC driver interrupt > handlers likely call > > pcie_do_recovery() > ->pci_walk_bus() > ->report_frozen_detected() > > with pci_channel_io_frozen the same time. > If pci_dev_set_io_state() return true even if the original state is > pci_channel_io_frozen, that will cause AER or DPC handler re-enter > the error detecting and recovery procedure one after another. > The result is the recovery flow mixed between AER and DPC. > So simplify the pci_dev_set_io_state() function to only return true > when dev->error_state is changed. > > Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com> > Tested-by: Wen Jin <wen.jin@intel.com> > Tested-by: Shanshan Zhang <ShanshanX.Zhang@intel.com> > Reviewed-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> > Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> > --- > Changnes: > v2: revise description and code according to suggestion from Andy. > v3: change code to simpler. > v4: no change. > v5: no change. > v6: no change. > > drivers/pci/pci.h | 37 +++++-------------------------------- > 1 file changed, 5 insertions(+), 32 deletions(-) > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > index 455b32187abd..f2beeaeda321 100644 > --- a/drivers/pci/pci.h > +++ b/drivers/pci/pci.h > @@ -359,39 +359,12 @@ struct pci_sriov { > static inline bool pci_dev_set_io_state(struct pci_dev *dev, > pci_channel_state_t new) > { > - bool changed = false; > - > device_lock_assert(&dev->dev); > - switch (new) { > - case pci_channel_io_perm_failure: > - switch (dev->error_state) { > - case pci_channel_io_frozen: > - case pci_channel_io_normal: > - case pci_channel_io_perm_failure: > - changed = true; > - break; > - } > - break; > - case pci_channel_io_frozen: > - switch (dev->error_state) { > - case pci_channel_io_frozen: > - case pci_channel_io_normal: > - changed = true; > - break; > - } > - break; > - case pci_channel_io_normal: > - switch (dev->error_state) { > - case pci_channel_io_frozen: > - case pci_channel_io_normal: > - changed = true; > - break; > - } > - break; > - } > - if (changed) > - dev->error_state = new; > - return changed; > + if (dev->error_state == new) > + return false; > + > + dev->error_state = new; > + return true; > } IIUC this changes the behavior of the function, but it's difficult to analyze because it does a lot of simplification at the same time. Please consider the following, which is intended to simplify the function while preserving the behavior (but please verify; it's been a long time since I looked at this). Then maybe see how your patch could be done on top of this? Alternatively, come up with your own simplification patch + the functionality change. commit 983d9b1f8177 ("PCI/ERR: Simplify pci_dev_set_io_state()") Author: Bjorn Helgaas <bhelgaas@google.com> Date: Tue May 19 12:28:57 2020 -0500 PCI/ERR: Simplify pci_dev_set_io_state() Truth table: requested new state current ------------------------------------------ state normal frozen perm_failure ------------ + ------------- ------------- ------------ normal | normal frozen perm_failure frozen | normal frozen perm_failure perm_failure | perm_failure* perm_failure* perm_failure * "not changed", returns false No functional change intended. diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 6d3f75867106..81408552f7c9 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -358,39 +358,21 @@ struct pci_sriov { static inline bool pci_dev_set_io_state(struct pci_dev *dev, pci_channel_state_t new) { - bool changed = false; - device_lock_assert(&dev->dev); - switch (new) { - case pci_channel_io_perm_failure: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - case pci_channel_io_perm_failure: - changed = true; - break; - } - break; - case pci_channel_io_frozen: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - case pci_channel_io_normal: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; + + /* Can always put a device in perm_failure state */ + if (new == pci_channel_io_perm_failure) { + dev->error_state == pci_channel_io_perm_failure; + return true; } - if (changed) - dev->error_state = new; - return changed; + + /* If already in perm_failure, can't set to normal or frozen */ + if (dev->error_state == pci_channel_io_perm_failure) + return false; + + /* Can always change normal to frozen or vice versa */ + dev->error_state = new; + return true; } static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
Sinan, On Sat, Oct 3, 2020 at 12:08 AM Sinan Kaya <okaya@kernel.org> wrote: > > On 9/30/2020 3:05 AM, Ethan Zhao wrote: > > When uncorrectable error happens, AER driver and DPC driver interrupt > > handlers likely call > > > > pcie_do_recovery() > > ->pci_walk_bus() > > ->report_frozen_detected() > > > > with pci_channel_io_frozen the same time. > > We need some more data on this. If DPC is supported by HW, errors > should be triggered by DPC not AER. > > If I remember right, there is a register that tells which AER errors > should be handled by DPC. When uncorrectable errors happen, non-fatal severity level, AER and DPC could be triggered at the same time. Thanks, Ethan >
Bjorn, On Sat, Oct 3, 2020 at 1:29 AM Bjorn Helgaas <helgaas@kernel.org> wrote: > > [+cc Sinan] > > On Wed, Sep 30, 2020 at 03:05:36AM -0400, Ethan Zhao wrote: > > When uncorrectable error happens, AER driver and DPC driver interrupt > > handlers likely call > > > > pcie_do_recovery() > > ->pci_walk_bus() > > ->report_frozen_detected() > > > > with pci_channel_io_frozen the same time. > > If pci_dev_set_io_state() return true even if the original state is > > pci_channel_io_frozen, that will cause AER or DPC handler re-enter > > the error detecting and recovery procedure one after another. > > The result is the recovery flow mixed between AER and DPC. > > So simplify the pci_dev_set_io_state() function to only return true > > when dev->error_state is changed. > > > > Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com> > > Tested-by: Wen Jin <wen.jin@intel.com> > > Tested-by: Shanshan Zhang <ShanshanX.Zhang@intel.com> > > Reviewed-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> > > Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> > > --- > > Changnes: > > v2: revise description and code according to suggestion from Andy. > > v3: change code to simpler. > > v4: no change. > > v5: no change. > > v6: no change. > > > > drivers/pci/pci.h | 37 +++++-------------------------------- > > 1 file changed, 5 insertions(+), 32 deletions(-) > > > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > > index 455b32187abd..f2beeaeda321 100644 > > --- a/drivers/pci/pci.h > > +++ b/drivers/pci/pci.h > > @@ -359,39 +359,12 @@ struct pci_sriov { > > static inline bool pci_dev_set_io_state(struct pci_dev *dev, > > pci_channel_state_t new) > > { > > - bool changed = false; > > - > > device_lock_assert(&dev->dev); > > - switch (new) { > > - case pci_channel_io_perm_failure: > > - switch (dev->error_state) { > > - case pci_channel_io_frozen: > > - case pci_channel_io_normal: > > - case pci_channel_io_perm_failure: > > - changed = true; > > - break; > > - } > > - break; > > - case pci_channel_io_frozen: > > - switch (dev->error_state) { > > - case pci_channel_io_frozen: > > - case pci_channel_io_normal: > > - changed = true; > > - break; > > - } > > - break; > > - case pci_channel_io_normal: > > - switch (dev->error_state) { > > - case pci_channel_io_frozen: > > - case pci_channel_io_normal: > > - changed = true; > > - break; > > - } > > - break; > > - } > > - if (changed) > > - dev->error_state = new; > > - return changed; > > + if (dev->error_state == new) > > + return false; > > + > > + dev->error_state = new; > > + return true; > > } > > IIUC this changes the behavior of the function, but it's difficult to > analyze because it does a lot of simplification at the same time. > > Please consider the following, which is intended to simplify the > function while preserving the behavior (but please verify; it's been a > long time since I looked at this). Then maybe see how your patch > could be done on top of this? > > Alternatively, come up with your own simplification patch + the > functionality change. > > > commit 983d9b1f8177 ("PCI/ERR: Simplify pci_dev_set_io_state()") > Author: Bjorn Helgaas <bhelgaas@google.com> > Date: Tue May 19 12:28:57 2020 -0500 > > PCI/ERR: Simplify pci_dev_set_io_state() > > Truth table: > > requested new state > current ------------------------------------------ > state normal frozen perm_failure > ------------ + ------------- ------------- ------------ > normal | normal frozen perm_failure > frozen | normal frozen perm_failure > perm_failure | perm_failure* perm_failure* perm_failure > > * "not changed", returns false > > No functional change intended. > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > index 6d3f75867106..81408552f7c9 100644 > --- a/drivers/pci/pci.h > +++ b/drivers/pci/pci.h > @@ -358,39 +358,21 @@ struct pci_sriov { > static inline bool pci_dev_set_io_state(struct pci_dev *dev, > pci_channel_state_t new) > { > - bool changed = false; > - > device_lock_assert(&dev->dev); > - switch (new) { > - case pci_channel_io_perm_failure: > - switch (dev->error_state) { > - case pci_channel_io_frozen: > - case pci_channel_io_normal: > - case pci_channel_io_perm_failure: > - changed = true; > - break; > - } > - break; > - case pci_channel_io_frozen: > - switch (dev->error_state) { > - case pci_channel_io_frozen: > - case pci_channel_io_normal: > - changed = true; > - break; > - } > - break; > - case pci_channel_io_normal: > - switch (dev->error_state) { > - case pci_channel_io_frozen: > - case pci_channel_io_normal: > - changed = true; > - break; > - } > - break; > + > + /* Can always put a device in perm_failure state */ > + if (new == pci_channel_io_perm_failure) { > + dev->error_state == pci_channel_io_perm_failure; > + return true; > } > - if (changed) > - dev->error_state = new; > - return changed; > + > + /* If already in perm_failure, can't set to normal or frozen */ > + if (dev->error_state == pci_channel_io_perm_failure) > + return false; This line goes to the first. > + > + /* Can always change normal to frozen or vice versa */ > + dev->error_state = new; > + return true; > } Per code grepping, there are two kinds of cases pci_dev_set_io_state() are called. 1. doesn't care about the return value, just set it to pci_chnnel_io_perm_failure. and later handling depends on dev->error_state, whatever the value changed or not. pciehp_unconfigure_device() -> pci_walk_bus() -> pci_dev_set_disconnected() ->pci_dev_set_io_state() with pci_chnnel_io_perm_failure 2. cares about the return value or the dev->error_state is changed or not. and the later handling also depends on if the value of dev->error_state is changed or not. pcie_do_recovery() ->pci_walk_bus() ->report_frozen_detected() ->report_error_detected() ->pci_dev_set_io_state() with pci_channel_io_frozen. This case, if the original value was pci_channel_io_frozen, no need to set it again and say it is *changed*. ----- return true only it is really changed. pcie_do_recovery() ->pci_walk_bus() ->report_resume() ->pcI_dev_set_ios_state() with pci_channel_io_normal. The same. pcie_do_recovery() ->pci_walk_bus() ->report_normal_detected() with pci_channel_io_normal The same. I will combat the 'Truth Table' to keep the original functions and let it return true only when it really changes to avoid re-enter the later handling process. Will send v7 for your review. Thanks, Ethan > > static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 455b32187abd..f2beeaeda321 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -359,39 +359,12 @@ struct pci_sriov { static inline bool pci_dev_set_io_state(struct pci_dev *dev, pci_channel_state_t new) { - bool changed = false; - device_lock_assert(&dev->dev); - switch (new) { - case pci_channel_io_perm_failure: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - case pci_channel_io_perm_failure: - changed = true; - break; - } - break; - case pci_channel_io_frozen: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - case pci_channel_io_normal: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - } - if (changed) - dev->error_state = new; - return changed; + if (dev->error_state == new) + return false; + + dev->error_state = new; + return true; } static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)