Message ID | 1424976648.17078.1.camel@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Imre Deak <imre.deak@intel.com> writes: > Attached is the proposed fix for this issue. > > --Imre > > From 5c23657bc168db12a1ba100ab65fabd305c89c8a Mon Sep 17 00:00:00 2001 > From: Imre Deak <imre.deak@intel.com> > Date: Thu, 26 Feb 2015 18:38:53 +0200 > Subject: [PATCH] drm/i915: gm45: work around hang during hibernation I can confirm that this patch solves the problem for me. Thanks. Bjørn
On Thu, Feb 26, 2015 at 08:50:48PM +0200, Imre Deak wrote: > On to, 2015-02-26 at 10:34 +0100, Bjørn Mork wrote: > > Imre Deak <imre.deak@intel.com> writes: > > > > >> That patch fixes the problem, with only pci_set_power_state commented > > >> out. Do you still want me to try with pci_disable_device() commented > > >> out as well? > > > > > > No, but it would help if you could still try the two attached patch > > > separately, without any of the previous workarounds. Based on the > > > result, we'll follow up with a fix that adds for your specific platform > > > either of the attached workarounds or simply avoids putting the device > > > into D3 (corresponding to the patch you already tried). > > > > None of those patches made any difference. The laptop still hangs at > > power-off. > > > > Not really surprising, is it? Previous testing shows that the hang > > occurs at the pci_set_power_state(drm_dev->pdev, PCI_D3hot) call in the > > poweroff_late hook. It is hard to see how replacing it by an attempt to > > set D3cold, or adding any call after this point, could possibly change > > anything. The system is stil hanging at the pci_set_power_state() call. > > Judging from the blinking LED, I assume that it's not > pci_set_power_state() that hangs the machine, but the hang happens in > BIOS code. > > > The generic pci-driver code will put the i915 device into PCI_D3hot for > > you, won't it? Why do you need to duplicate that in the driver, > > *knowing* that doing so breaks (at least some) systems? > > Letting the pci core put the device into D3 wouldn't get rid of the problem. > It's putting the device into D3 in the first place what causes it. > > > I honestly don't think this "let's try some random code" is the proper > > way to fix this bug (or any other bug for that matter). You need to > > start understanding the code you write, and the first step is by > > actually explaining the changes you make. > > We have a good understanding about the issue: the BIOS on your platform > does something unexpected behind the back of the driver/kernel. In that > sense the patches were not random, for instance the first one is based on > an existing workaround for an issue that resembles quite a lot yours, see > pci_pm_poweroff_noirq(). > > > I also believe that you completely miss the fact that this bug has > > survived a full release cycle before you became aware of it, and the > > implications this has wrt other affected systems/users. I assume you > > understand that my system isn't one-of-a-kind, This means that there are > > other affected users with identical/similar systems. Now, if none of > > those users reported the bug to you (we all know why: Linux kernel > > development is currently limited by the available testing resources, NOT > > by the available developer resources), then how do you know that there > > aren't a number of other systems affected as well? > > > > Let me answer that for you: You don't. > > > > Which is why you must explain the mechanism triggering the bug, proving > > that it is a chipset/system specific issue. Because that's the only way > > you will *know* that you have solved the problem not only for me, but for > > all affected users. > > > > IMHO, the only safe and sane solution at the moment is the revert patch > > I posted. It's a simple fix, reverting back to the *known* working > > state before this regression was introduced. > > > > Then you can start over from there, trying to implement this properly. > > The current way is the proper one that we want for the generic case. The issue > on your platform is the exception, so working around that is a sensible choice. > > Attached is the proposed fix for this issue. > > --Imre > > From 5c23657bc168db12a1ba100ab65fabd305c89c8a Mon Sep 17 00:00:00 2001 > From: Imre Deak <imre.deak@intel.com> > Date: Thu, 26 Feb 2015 18:38:53 +0200 > Subject: [PATCH] drm/i915: gm45: work around hang during hibernation > MIME-Version: 1.0 > Content-Type: text/plain; charset=UTF-8 > Content-Transfer-Encoding: 8bit > > Bjørn reported that his machine hang during hibernation and eventually > bisected the problem to the following commit: > > commit da2bc1b9db3351addd293e5b82757efe1f77ed1d > Author: Imre Deak <imre.deak@intel.com> > Date: Thu Oct 23 19:23:26 2014 +0300 > > drm/i915: add poweroff_late handler > > The problem seems to be that after the kernel puts the device into D3 > the BIOS still tries to access it, or otherwise assumes that it's in D0. > This is clearly bogus, since ACPI mandates that devices are put into D3 > by the OSPM if they are not wake-up sources. In the future we want to > unify more of the driver's runtime and system suspend paths, for example > by skipping all the system suspend/hibernation hooks if the device is > runtime suspended already. Accordingly for all other platforms the goal > is still to properly power down the device during hibernation. > > Reference: http://lists.freedesktop.org/archives/intel-gfx/2015-February/060633.html > Reported-and-bisected-by: Bjørn Mork <bjorn@mork.no> > Signed-off-by: Imre Deak <imre.deak@intel.com> > --- > drivers/gpu/drm/i915/i915_drv.c | 27 ++++++++++++++++++++++----- > 1 file changed, 22 insertions(+), 5 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c > index 4badb23..67d212b 100644 > --- a/drivers/gpu/drm/i915/i915_drv.c > +++ b/drivers/gpu/drm/i915/i915_drv.c > @@ -637,7 +637,7 @@ static int i915_drm_suspend(struct drm_device *dev) > return 0; > } > > -static int i915_drm_suspend_late(struct drm_device *drm_dev) > +static int i915_drm_suspend_late(struct drm_device *drm_dev, bool hibernation) > { > struct drm_i915_private *dev_priv = drm_dev->dev_private; > int ret; > @@ -651,7 +651,14 @@ static int i915_drm_suspend_late(struct drm_device *drm_dev) > } > > pci_disable_device(drm_dev->pdev); > - pci_set_power_state(drm_dev->pdev, PCI_D3hot); > + /* > + * During hibernation on some GM45 platforms the BIOS may try to access > + * the device even though it's already in D3 and hang the machine. So > + * leave the device in D0 on those platforms and hope the BIOS will > + * power down the device properly. > + */ > + if (!(hibernation && IS_GM45(dev_priv))) > + pci_set_power_state(drm_dev->pdev, PCI_D3hot); If we see more of these troublesome machines we might need to apply an even bigger hammer like gen < 5 or so. But as long as there's just 1 report I think this is the right approach. Bjorn, as soon as we have your tested-by that this work we can apply this and shovel it through the backporting machinery. Thanks, Daniel > > return 0; > } > @@ -677,7 +684,7 @@ int i915_suspend_legacy(struct drm_device *dev, pm_message_t state) > if (error) > return error; > > - return i915_drm_suspend_late(dev); > + return i915_drm_suspend_late(dev, false); > } > > static int i915_drm_resume(struct drm_device *dev) > @@ -965,7 +972,17 @@ static int i915_pm_suspend_late(struct device *dev) > if (drm_dev->switch_power_state == DRM_SWITCH_POWER_OFF) > return 0; > > - return i915_drm_suspend_late(drm_dev); > + return i915_drm_suspend_late(drm_dev, false); > +} > + > +static int i915_pm_poweroff_late(struct device *dev) > +{ > + struct drm_device *drm_dev = dev_to_i915(dev)->dev; > + > + if (drm_dev->switch_power_state == DRM_SWITCH_POWER_OFF) > + return 0; > + > + return i915_drm_suspend_late(drm_dev, true); > } > > static int i915_pm_resume_early(struct device *dev) > @@ -1535,7 +1552,7 @@ static const struct dev_pm_ops i915_pm_ops = { > .thaw_early = i915_pm_resume_early, > .thaw = i915_pm_resume, > .poweroff = i915_pm_suspend, > - .poweroff_late = i915_pm_suspend_late, > + .poweroff_late = i915_pm_poweroff_late, > .restore_early = i915_pm_resume_early, > .restore = i915_pm_resume, > > -- > 2.1.0 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
On Thu, Feb 26, 2015 at 08:50:48PM +0200, Imre Deak wrote: > On to, 2015-02-26 at 10:34 +0100, Bjørn Mork wrote: > > Imre Deak <imre.deak@intel.com> writes: > > > > >> That patch fixes the problem, with only pci_set_power_state commented > > >> out. Do you still want me to try with pci_disable_device() commented > > >> out as well? > > > > > > No, but it would help if you could still try the two attached patch > > > separately, without any of the previous workarounds. Based on the > > > result, we'll follow up with a fix that adds for your specific platform > > > either of the attached workarounds or simply avoids putting the device > > > into D3 (corresponding to the patch you already tried). > > > > None of those patches made any difference. The laptop still hangs at > > power-off. > > > > Not really surprising, is it? Previous testing shows that the hang > > occurs at the pci_set_power_state(drm_dev->pdev, PCI_D3hot) call in the > > poweroff_late hook. It is hard to see how replacing it by an attempt to > > set D3cold, or adding any call after this point, could possibly change > > anything. The system is stil hanging at the pci_set_power_state() call. > > Judging from the blinking LED, I assume that it's not > pci_set_power_state() that hangs the machine, but the hang happens in > BIOS code. > > > The generic pci-driver code will put the i915 device into PCI_D3hot for > > you, won't it? Why do you need to duplicate that in the driver, > > *knowing* that doing so breaks (at least some) systems? > > Letting the pci core put the device into D3 wouldn't get rid of the problem. > It's putting the device into D3 in the first place what causes it. > > > I honestly don't think this "let's try some random code" is the proper > > way to fix this bug (or any other bug for that matter). You need to > > start understanding the code you write, and the first step is by > > actually explaining the changes you make. > > We have a good understanding about the issue: the BIOS on your platform > does something unexpected behind the back of the driver/kernel. In that > sense the patches were not random, for instance the first one is based on > an existing workaround for an issue that resembles quite a lot yours, see > pci_pm_poweroff_noirq(). > > > I also believe that you completely miss the fact that this bug has > > survived a full release cycle before you became aware of it, and the > > implications this has wrt other affected systems/users. I assume you > > understand that my system isn't one-of-a-kind, This means that there are > > other affected users with identical/similar systems. Now, if none of > > those users reported the bug to you (we all know why: Linux kernel > > development is currently limited by the available testing resources, NOT > > by the available developer resources), then how do you know that there > > aren't a number of other systems affected as well? > > > > Let me answer that for you: You don't. > > > > Which is why you must explain the mechanism triggering the bug, proving > > that it is a chipset/system specific issue. Because that's the only way > > you will *know* that you have solved the problem not only for me, but for > > all affected users. > > > > IMHO, the only safe and sane solution at the moment is the revert patch > > I posted. It's a simple fix, reverting back to the *known* working > > state before this regression was introduced. > > > > Then you can start over from there, trying to implement this properly. > > The current way is the proper one that we want for the generic case. The issue > on your platform is the exception, so working around that is a sensible choice. > > Attached is the proposed fix for this issue. > > --Imre > > From 5c23657bc168db12a1ba100ab65fabd305c89c8a Mon Sep 17 00:00:00 2001 > From: Imre Deak <imre.deak@intel.com> > Date: Thu, 26 Feb 2015 18:38:53 +0200 > Subject: [PATCH] drm/i915: gm45: work around hang during hibernation > MIME-Version: 1.0 > Content-Type: text/plain; charset=UTF-8 > Content-Transfer-Encoding: 8bit > > Bjørn reported that his machine hang during hibernation and eventually > bisected the problem to the following commit: > > commit da2bc1b9db3351addd293e5b82757efe1f77ed1d > Author: Imre Deak <imre.deak@intel.com> > Date: Thu Oct 23 19:23:26 2014 +0300 > > drm/i915: add poweroff_late handler > > The problem seems to be that after the kernel puts the device into D3 > the BIOS still tries to access it, or otherwise assumes that it's in D0. > This is clearly bogus, since ACPI mandates that devices are put into D3 > by the OSPM if they are not wake-up sources. In the future we want to > unify more of the driver's runtime and system suspend paths, for example > by skipping all the system suspend/hibernation hooks if the device is > runtime suspended already. Accordingly for all other platforms the goal > is still to properly power down the device during hibernation. > > Reference: http://lists.freedesktop.org/archives/intel-gfx/2015-February/060633.html > Reported-and-bisected-by: Bjørn Mork <bjorn@mork.no> > Signed-off-by: Imre Deak <imre.deak@intel.com> > --- > drivers/gpu/drm/i915/i915_drv.c | 27 ++++++++++++++++++++++----- > 1 file changed, 22 insertions(+), 5 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c > index 4badb23..67d212b 100644 > --- a/drivers/gpu/drm/i915/i915_drv.c > +++ b/drivers/gpu/drm/i915/i915_drv.c > @@ -637,7 +637,7 @@ static int i915_drm_suspend(struct drm_device *dev) > return 0; > } > > -static int i915_drm_suspend_late(struct drm_device *drm_dev) > +static int i915_drm_suspend_late(struct drm_device *drm_dev, bool hibernation) > { > struct drm_i915_private *dev_priv = drm_dev->dev_private; > int ret; > @@ -651,7 +651,14 @@ static int i915_drm_suspend_late(struct drm_device *drm_dev) > } > > pci_disable_device(drm_dev->pdev); > - pci_set_power_state(drm_dev->pdev, PCI_D3hot); > + /* > + * During hibernation on some GM45 platforms the BIOS may try to access > + * the device even though it's already in D3 and hang the machine. So > + * leave the device in D0 on those platforms and hope the BIOS will > + * power down the device properly. Please include the model of the known bad machine in this comment, to help future archaeologists. > + */ > + if (!(hibernation && IS_GM45(dev_priv))) > + pci_set_power_state(drm_dev->pdev, PCI_D3hot); > > return 0; > } > @@ -677,7 +684,7 @@ int i915_suspend_legacy(struct drm_device *dev, pm_message_t state) > if (error) > return error; > > - return i915_drm_suspend_late(dev); > + return i915_drm_suspend_late(dev, false); > } > > static int i915_drm_resume(struct drm_device *dev) > @@ -965,7 +972,17 @@ static int i915_pm_suspend_late(struct device *dev) > if (drm_dev->switch_power_state == DRM_SWITCH_POWER_OFF) > return 0; > > - return i915_drm_suspend_late(drm_dev); > + return i915_drm_suspend_late(drm_dev, false); > +} > + > +static int i915_pm_poweroff_late(struct device *dev) > +{ > + struct drm_device *drm_dev = dev_to_i915(dev)->dev; > + > + if (drm_dev->switch_power_state == DRM_SWITCH_POWER_OFF) > + return 0; > + > + return i915_drm_suspend_late(drm_dev, true); > } > > static int i915_pm_resume_early(struct device *dev) > @@ -1535,7 +1552,7 @@ static const struct dev_pm_ops i915_pm_ops = { > .thaw_early = i915_pm_resume_early, > .thaw = i915_pm_resume, > .poweroff = i915_pm_suspend, > - .poweroff_late = i915_pm_suspend_late, > + .poweroff_late = i915_pm_poweroff_late, > .restore_early = i915_pm_resume_early, > .restore = i915_pm_resume, > > -- > 2.1.0 >
Ville Syrjälä <ville.syrjala@linux.intel.com> writes: >> @@ -651,7 +651,14 @@ static int i915_drm_suspend_late(struct drm_device *drm_dev) >> } >> >> pci_disable_device(drm_dev->pdev); >> - pci_set_power_state(drm_dev->pdev, PCI_D3hot); >> + /* >> + * During hibernation on some GM45 platforms the BIOS may try to access >> + * the device even though it's already in D3 and hang the machine. So >> + * leave the device in D0 on those platforms and hope the BIOS will >> + * power down the device properly. > > Please include the model of the known bad machine in this comment, to > help future archaeologists. Here are some details: bjorn@nemi:~$ grep . /sys/class/dmi/id/{bios,product}* 2>/dev/null /sys/class/dmi/id/bios_date:12/19/2011 /sys/class/dmi/id/bios_vendor:LENOVO /sys/class/dmi/id/bios_version:6EET55WW (3.15 ) /sys/class/dmi/id/product_name:2776LEG /sys/class/dmi/id/product_version:ThinkPad X301 Please let me know if you need some other data. Bjørn
On Thu, Feb 26, 2015 at 09:01:27PM +0100, Daniel Vetter wrote: > On Thu, Feb 26, 2015 at 08:50:48PM +0200, Imre Deak wrote: [snip] > > > > The problem seems to be that after the kernel puts the device into D3 > > the BIOS still tries to access it, or otherwise assumes that it's in D0. > > This is clearly bogus, since ACPI mandates that devices are put into D3 > > by the OSPM if they are not wake-up sources. In the future we want to > > unify more of the driver's runtime and system suspend paths, for example > > by skipping all the system suspend/hibernation hooks if the device is > > runtime suspended already. Accordingly for all other platforms the goal > > is still to properly power down the device during hibernation. [snip] > > If we see more of these troublesome machines we might need to apply an > even bigger hammer like gen < 5 or so. But as long as there's just 1 > report I think this is the right approach. > > Bjorn, as soon as we have your tested-by that this work we can apply this > and shovel it through the backporting machinery. Isn't there a risk that this will negatively impact machines using gen4 that *don't* have a buggy BIOS? Wouldn't a quirk tied to the laptop or BIOS version be better suited -- or possibly a parameter that can be passed to the driver, which would make it easier to test if others suffering from similar symptoms on other systems suffer from the same issue or not? Just my 2¢. Kind regards, David Weinehall
On Fri, 2015-02-27 at 14:15 +0200, David Weinehall wrote: > On Thu, Feb 26, 2015 at 09:01:27PM +0100, Daniel Vetter wrote: > > On Thu, Feb 26, 2015 at 08:50:48PM +0200, Imre Deak wrote: > > [snip] > > > > > > The problem seems to be that after the kernel puts the device into D3 > > > the BIOS still tries to access it, or otherwise assumes that it's in D0. > > > This is clearly bogus, since ACPI mandates that devices are put into D3 > > > by the OSPM if they are not wake-up sources. In the future we want to > > > unify more of the driver's runtime and system suspend paths, for example > > > by skipping all the system suspend/hibernation hooks if the device is > > > runtime suspended already. Accordingly for all other platforms the goal > > > is still to properly power down the device during hibernation. > > [snip] > > > > If we see more of these troublesome machines we might need to apply an > > even bigger hammer like gen < 5 or so. But as long as there's just 1 > > report I think this is the right approach. > > > > Bjorn, as soon as we have your tested-by that this work we can apply this > > and shovel it through the backporting machinery. > > Isn't there a risk that this will negatively impact machines using gen4 > that *don't* have a buggy BIOS? Wouldn't a quirk tied to the laptop > or BIOS version be better suited -- or possibly a parameter that can be > passed to the driver, which would make it easier to test if others > suffering from similar symptoms on other systems suffer from the same > issue or not? > > Just my 2¢. We've checked today with Ville a few machines we've found from GEN2 to GEN5+. There was one Thinkpad x61s (GEN4) where I could reproduce the exact same problem and get rid of it using the same workaround. All the others were non-Lenovo (including GEN4) mobile and desktop platforms and none of these had the problem. I also tried it on a Lenovo IVB ultrabook, which also works fine. So the current theory is that it's related to GEN4 Thinkpad BIOSes, and so we need to change the condition to match that at least. --Imre
On Fri, Feb 27, 2015 at 08:23:41PM +0200, Imre Deak wrote: > We've checked today with Ville a few machines we've found from GEN2 to > GEN5+. There was one Thinkpad x61s (GEN4) where I could reproduce the > exact same problem and get rid of it using the same workaround. All the > others were non-Lenovo (including GEN4) mobile and desktop platforms and > none of these had the problem. I also tried it on a Lenovo IVB > ultrabook, which also works fine. So the current theory is that it's > related to GEN4 Thinkpad BIOSes, and so we need to change the condition > to match that at least. If you need more ThinkPads to test with I have a few at home :) Kind regards, David
From 5c23657bc168db12a1ba100ab65fabd305c89c8a Mon Sep 17 00:00:00 2001 From: Imre Deak <imre.deak@intel.com> Date: Thu, 26 Feb 2015 18:38:53 +0200 Subject: [PATCH] drm/i915: gm45: work around hang during hibernation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Bjørn reported that his machine hang during hibernation and eventually bisected the problem to the following commit: commit da2bc1b9db3351addd293e5b82757efe1f77ed1d Author: Imre Deak <imre.deak@intel.com> Date: Thu Oct 23 19:23:26 2014 +0300 drm/i915: add poweroff_late handler The problem seems to be that after the kernel puts the device into D3 the BIOS still tries to access it, or otherwise assumes that it's in D0. This is clearly bogus, since ACPI mandates that devices are put into D3 by the OSPM if they are not wake-up sources. In the future we want to unify more of the driver's runtime and system suspend paths, for example by skipping all the system suspend/hibernation hooks if the device is runtime suspended already. Accordingly for all other platforms the goal is still to properly power down the device during hibernation. Reference: http://lists.freedesktop.org/archives/intel-gfx/2015-February/060633.html Reported-and-bisected-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: Imre Deak <imre.deak@intel.com> --- drivers/gpu/drm/i915/i915_drv.c | 27 ++++++++++++++++++++++----- 1 file changed, 22 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index 4badb23..67d212b 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -637,7 +637,7 @@ static int i915_drm_suspend(struct drm_device *dev) return 0; } -static int i915_drm_suspend_late(struct drm_device *drm_dev) +static int i915_drm_suspend_late(struct drm_device *drm_dev, bool hibernation) { struct drm_i915_private *dev_priv = drm_dev->dev_private; int ret; @@ -651,7 +651,14 @@ static int i915_drm_suspend_late(struct drm_device *drm_dev) } pci_disable_device(drm_dev->pdev); - pci_set_power_state(drm_dev->pdev, PCI_D3hot); + /* + * During hibernation on some GM45 platforms the BIOS may try to access + * the device even though it's already in D3 and hang the machine. So + * leave the device in D0 on those platforms and hope the BIOS will + * power down the device properly. + */ + if (!(hibernation && IS_GM45(dev_priv))) + pci_set_power_state(drm_dev->pdev, PCI_D3hot); return 0; } @@ -677,7 +684,7 @@ int i915_suspend_legacy(struct drm_device *dev, pm_message_t state) if (error) return error; - return i915_drm_suspend_late(dev); + return i915_drm_suspend_late(dev, false); } static int i915_drm_resume(struct drm_device *dev) @@ -965,7 +972,17 @@ static int i915_pm_suspend_late(struct device *dev) if (drm_dev->switch_power_state == DRM_SWITCH_POWER_OFF) return 0; - return i915_drm_suspend_late(drm_dev); + return i915_drm_suspend_late(drm_dev, false); +} + +static int i915_pm_poweroff_late(struct device *dev) +{ + struct drm_device *drm_dev = dev_to_i915(dev)->dev; + + if (drm_dev->switch_power_state == DRM_SWITCH_POWER_OFF) + return 0; + + return i915_drm_suspend_late(drm_dev, true); } static int i915_pm_resume_early(struct device *dev) @@ -1535,7 +1552,7 @@ static const struct dev_pm_ops i915_pm_ops = { .thaw_early = i915_pm_resume_early, .thaw = i915_pm_resume, .poweroff = i915_pm_suspend, - .poweroff_late = i915_pm_suspend_late, + .poweroff_late = i915_pm_poweroff_late, .restore_early = i915_pm_resume_early, .restore = i915_pm_resume, -- 2.1.0