Message ID | dd120a30cb769c93af8973cae41f61831d17e04b.1464103767.git.lukas@wunner.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, May 24, 2016 at 06:03:27PM +0200, Lukas Wunner wrote: > nouveau_drm_load() calls pm_runtime_put() if nouveau_runtime_pm != 0, > but nouveau_drm_unload() calls pm_runtime_get_sync() unconditionally. > We therefore leak a runtime pm ref whenever nouveau is loaded with > runpm=0 and then unloaded. The GPU will subsequently never runtime > suspend even if nouveau is loaded again with runpm=1. > > Fix by taking the runtime pm ref under the same condition that it was > released on driver load. > > Fixes: 5addcf0a5f0f ("nouveau: add runtime PM support (v0.9)") > Cc: Dave Airlie <airlied@redhat.com> > Reported-by: Karol Herbst <nouveau@karolherbst.de> > Tested-by: Karol Herbst <nouveau@karolherbst.de> > Signed-off-by: Lukas Wunner <lukas@wunner.de> Looks good, I tested this scenario: ru(){ cat /sys/bus/pci/devices/0000\:01:00.0/power/runtime_usage;} ru # reports 1 modprobe nouveau runpm=0 ru # reports 2 rmmod nouveau ru # reports 1 Without runpm=0 the count drops to 0 in the second step and stays 0 in the third step. After applying patch 2/9, this correctly reports 1 as expected (this is the same as manually setting power/control to on). Peter > --- > drivers/gpu/drm/nouveau/nouveau_drm.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c > index 11f8dd9..faf7438 100644 > --- a/drivers/gpu/drm/nouveau/nouveau_drm.c > +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c > @@ -498,7 +498,10 @@ nouveau_drm_unload(struct drm_device *dev) > { > struct nouveau_drm *drm = nouveau_drm(dev); > > - pm_runtime_get_sync(dev->dev); > + if (nouveau_runtime_pm != 0) { > + pm_runtime_get_sync(dev->dev); > + } > + > nouveau_fbcon_fini(dev); > nouveau_accel_fini(drm); > nouveau_hwmon_fini(dev); > -- > 2.8.1 > > _______________________________________________ > Nouveau mailing list > Nouveau@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/nouveau
Hi Peter, On Fri, May 27, 2016 at 03:07:33AM +0200, Peter Wu wrote: > On Tue, May 24, 2016 at 06:03:27PM +0200, Lukas Wunner wrote: > > nouveau_drm_load() calls pm_runtime_put() if nouveau_runtime_pm != 0, > > but nouveau_drm_unload() calls pm_runtime_get_sync() unconditionally. > > We therefore leak a runtime pm ref whenever nouveau is loaded with > > runpm=0 and then unloaded. The GPU will subsequently never runtime > > suspend even if nouveau is loaded again with runpm=1. > > > > Fix by taking the runtime pm ref under the same condition that it was > > released on driver load. > > > > Fixes: 5addcf0a5f0f ("nouveau: add runtime PM support (v0.9)") > > Cc: Dave Airlie <airlied@redhat.com> > > Reported-by: Karol Herbst <nouveau@karolherbst.de> > > Tested-by: Karol Herbst <nouveau@karolherbst.de> > > Signed-off-by: Lukas Wunner <lukas@wunner.de> > > Looks good, I tested this scenario: > > ru(){ cat /sys/bus/pci/devices/0000\:01:00.0/power/runtime_usage;} > ru # reports 1 > modprobe nouveau runpm=0 > ru # reports 2 > rmmod nouveau > ru # reports 1 > > Without runpm=0 the count drops to 0 in the second step and stays 0 in > the third step. After applying patch 2/9, this correctly reports 1 as > expected (this is the same as manually setting power/control to on). How exactly did you reach the situation where the root port didn't wake up when you tried to load nouveau again? (IRC conversation this week.) What's happening is, the PCI core will keep unbound devices (i.e., without driver) in D0 but the runtime status is allowed to change to "suspended". So it'll appear to the kernel as if it was suspended but in reality it stays in D0. Once runtime pm for PCIe ports gets merged, the root port above the GPU will indeed go to D3 in such a situation because the check pm_children_suspended() (called from rpm_check_suspend_allowed()) returns true. I'm not sure if this is desirable or not. If we keep unbound devices in D0, should we allow ports above them to go to D3? In any case, when nouveau is loaded again, local_pci_probe() will call pm_runtime_get_sync(), which will implicitly set the runtime status to "active" and which should also wake parents. So how did you ever reach a point where you loaded nouveau and the root port stayed asleep? Clearly we have a bug there, question is where. This shouldn't work only if pm_runtime_forbid() was called on driver unload. Thanks for the extensive testing! Lukas > > Peter > > > --- > > drivers/gpu/drm/nouveau/nouveau_drm.c | 5 ++++- > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c > > index 11f8dd9..faf7438 100644 > > --- a/drivers/gpu/drm/nouveau/nouveau_drm.c > > +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c > > @@ -498,7 +498,10 @@ nouveau_drm_unload(struct drm_device *dev) > > { > > struct nouveau_drm *drm = nouveau_drm(dev); > > > > - pm_runtime_get_sync(dev->dev); > > + if (nouveau_runtime_pm != 0) { > > + pm_runtime_get_sync(dev->dev); > > + } > > + > > nouveau_fbcon_fini(dev); > > nouveau_accel_fini(drm); > > nouveau_hwmon_fini(dev); > > -- > > 2.8.1 > > > > _______________________________________________ > > Nouveau mailing list > > Nouveau@lists.freedesktop.org > > https://lists.freedesktop.org/mailman/listinfo/nouveau > > -- > Kind regards, > Peter Wu > https://lekensteyn.nl
On Sun, May 29, 2016 at 05:50:06PM +0200, Lukas Wunner wrote: > Hi Peter, > > On Fri, May 27, 2016 at 03:07:33AM +0200, Peter Wu wrote: > > On Tue, May 24, 2016 at 06:03:27PM +0200, Lukas Wunner wrote: > > > nouveau_drm_load() calls pm_runtime_put() if nouveau_runtime_pm != 0, > > > but nouveau_drm_unload() calls pm_runtime_get_sync() unconditionally. > > > We therefore leak a runtime pm ref whenever nouveau is loaded with > > > runpm=0 and then unloaded. The GPU will subsequently never runtime > > > suspend even if nouveau is loaded again with runpm=1. > > > > > > Fix by taking the runtime pm ref under the same condition that it was > > > released on driver load. > > > > > > Fixes: 5addcf0a5f0f ("nouveau: add runtime PM support (v0.9)") > > > Cc: Dave Airlie <airlied@redhat.com> > > > Reported-by: Karol Herbst <nouveau@karolherbst.de> > > > Tested-by: Karol Herbst <nouveau@karolherbst.de> > > > Signed-off-by: Lukas Wunner <lukas@wunner.de> > > > > Looks good, I tested this scenario: > > > > ru(){ cat /sys/bus/pci/devices/0000\:01:00.0/power/runtime_usage;} > > ru # reports 1 > > modprobe nouveau runpm=0 > > ru # reports 2 > > rmmod nouveau > > ru # reports 1 > > > > Without runpm=0 the count drops to 0 in the second step and stays 0 in > > the third step. After applying patch 2/9, this correctly reports 1 as > > expected (this is the same as manually setting power/control to on). > > How exactly did you reach the situation where the root port didn't wake > up when you tried to load nouveau again? (IRC conversation this week.) Ensure that the pci/pm patches are applied, then: 0. Unload nouveau (I have blacklisted it for testing). 1. Enable rpm for the root port and children (control = auto). 2. Verify in the kernel logs that the devices are sleeping: pcieport 0000:00:01.0: power state changed by ACPI to D3cold 3. (Optional, to rule out issues with delays:) Disable rpm for the Nvidia device (control = on). 4. modprobe nouveau. The above test with v4.6 + 4 pci/pm patches (8b71f565) gives: 50.245795 MXM: GUID detected in BIOS 50.245948 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492 50.246016 ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) 50.246044 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492 50.246110 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F 50.246256 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) 50.246289 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F 50.246443 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) 50.246457 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F 50.246932 pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported 50.247005 VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle 50.247084 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PG00._ON] at AML address ffffc9000001086e length 11D 50.390140 pcieport 0000:00:01.0: power state changed by ACPI to D0 50.491893 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0._DSW] at AML address ffffc90000010a2d length 1D 50.492285 pcieport 0000:00:01.0: PME# disabled 50.492583 nouveau 0000:01:00.0: unknown chipset (ffffffff) 50.492687 nouveau: probe of 0000:01:00.0 failed with error -12 50.501990 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0._S0W] at AML address ffffc90000010a8e length 2 50.502403 pcieport 0000:00:01.0: PME# enabled 50.502601 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0._DSW] at AML address ffffc90000010a2d length 1D 50.513005 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PG00._OFF] at AML address ffffc90000010994 length 6D 50.533258 pcieport 0000:00:01.0: power state changed by ACPI to D3cold (Note that this patch is not included.) When nouveau is operating normally, I see that _PS0 is also called (which does not happen above). If you think that mixing power resources with DSM causes this issue, I also tried to apply my power resources work for nouveau but it gives the same problem: 20.183306 MXM: GUID detected in BIOS 20.183606 ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) 20.184158 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) 20.184547 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) 20.185152 pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported 20.185351 VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle 20.185384 nouveau: detected PR support, will not use DSM 20.185552 nouveau 0000:01:00.0: enabling device (0000 -> 0003) 20.185873 nouveau 0000:01:00.0: unknown chipset (ffffffff) 20.185946 nouveau: probe of 0000:01:00.0 failed with error -12 > What's happening is, the PCI core will keep unbound devices (i.e., > without driver) in D0 but the runtime status is allowed to change > to "suspended". So it'll appear to the kernel as if it was suspended > but in reality it stays in D0. > > Once runtime pm for PCIe ports gets merged, the root port above the > GPU will indeed go to D3 in such a situation because the check > pm_children_suspended() (called from rpm_check_suspend_allowed()) > returns true. > > I'm not sure if this is desirable or not. If we keep unbound devices > in D0, should we allow ports above them to go to D3? Maybe Rafael (linux-pm / linux-pci) can answer this question better? The comments in local_pci_probe, pci_pm_runtime_suspend and pci_pm_runtime_resume suggest that unbound devices are assumed in D0 which is apparently not the case when runtime PM is enabled. > In any case, when nouveau is loaded again, local_pci_probe() will > call pm_runtime_get_sync(), which will implicitly set the runtime > status to "active" and which should also wake parents. So how did > you ever reach a point where you loaded nouveau and the root port > stayed asleep? Clearly we have a bug there, question is where. > This shouldn't work only if pm_runtime_forbid() was called on > driver unload. > > Thanks for the extensive testing! > Lukas Both devices (root port and Nvidia) were resumed, but somehow the Nvidia card was not fully initialized/ready (as you can see in the above logs). Peter > > > > Peter > > > > > --- > > > drivers/gpu/drm/nouveau/nouveau_drm.c | 5 ++++- > > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c > > > index 11f8dd9..faf7438 100644 > > > --- a/drivers/gpu/drm/nouveau/nouveau_drm.c > > > +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c > > > @@ -498,7 +498,10 @@ nouveau_drm_unload(struct drm_device *dev) > > > { > > > struct nouveau_drm *drm = nouveau_drm(dev); > > > > > > - pm_runtime_get_sync(dev->dev); > > > + if (nouveau_runtime_pm != 0) { > > > + pm_runtime_get_sync(dev->dev); > > > + } > > > + > > > nouveau_fbcon_fini(dev); > > > nouveau_accel_fini(drm); > > > nouveau_hwmon_fini(dev); > > > -- > > > 2.8.1 > > > > > > _______________________________________________ > > > Nouveau mailing list > > > Nouveau@lists.freedesktop.org > > > https://lists.freedesktop.org/mailman/listinfo/nouveau
On Mon, May 30, 2016 at 07:03:46PM +0200, Peter Wu wrote: > On Sun, May 29, 2016 at 05:50:06PM +0200, Lukas Wunner wrote: > > How exactly did you reach the situation where the root port didn't wake > > up when you tried to load nouveau again? (IRC conversation this week.) > > Ensure that the pci/pm patches are applied, then: > > 0. Unload nouveau (I have blacklisted it for testing). > 1. Enable rpm for the root port and children (control = auto). > 2. Verify in the kernel logs that the devices are sleeping: > pcieport 0000:00:01.0: power state changed by ACPI to D3cold > 3. (Optional, to rule out issues with delays:) Disable rpm for the > Nvidia device (control = on). > 4. modprobe nouveau. > > The above test with v4.6 + 4 pci/pm patches (8b71f565) gives: > > 50.245795 MXM: GUID detected in BIOS > 50.245948 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492 > 50.246016 ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) > 50.246044 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492 > 50.246110 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F > 50.246256 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) > 50.246289 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F > 50.246443 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) > 50.246457 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F > 50.246932 pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported > 50.247005 VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle > 50.247084 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PG00._ON] at AML address ffffc9000001086e length 11D > 50.390140 pcieport 0000:00:01.0: power state changed by ACPI to D0 > 50.491893 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0._DSW] at AML address ffffc90000010a2d length 1D > 50.492285 pcieport 0000:00:01.0: PME# disabled > 50.492583 nouveau 0000:01:00.0: unknown chipset (ffffffff) > 50.492687 nouveau: probe of 0000:01:00.0 failed with error -12 I've tested this on a MacBook Pro, which does not have ACPI _PR3 methods for the root port to which the discrete GPU is attached. The port can thus only suspend to D3hot, not D3cold. Even without patch [2/9], when unloading nouveau and letting the root port go to D3hot, the port is subsequently correctly resumed to D0 when reloading nouveau. So the issue that you're seeing without patch [2/9] seems to be specific to Optimus/_PR3 machines. If possible you should try to get it working without patch [2/9] because that patch is really optional (as I've written in the commit message). I'm totally unfamiliar with Optimus but maybe lspci could help to debug this? Lukas
On Tue, May 31, 2016 at 01:34:43PM +0200, Lukas Wunner wrote: > On Mon, May 30, 2016 at 07:03:46PM +0200, Peter Wu wrote: > > On Sun, May 29, 2016 at 05:50:06PM +0200, Lukas Wunner wrote: > > > How exactly did you reach the situation where the root port didn't wake > > > up when you tried to load nouveau again? (IRC conversation this week.) > > > > Ensure that the pci/pm patches are applied, then: > > > > 0. Unload nouveau (I have blacklisted it for testing). > > 1. Enable rpm for the root port and children (control = auto). > > 2. Verify in the kernel logs that the devices are sleeping: > > pcieport 0000:00:01.0: power state changed by ACPI to D3cold > > 3. (Optional, to rule out issues with delays:) Disable rpm for the > > Nvidia device (control = on). > > 4. modprobe nouveau. > > > > The above test with v4.6 + 4 pci/pm patches (8b71f565) gives: > > > > 50.245795 MXM: GUID detected in BIOS > > 50.245948 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492 > > 50.246016 ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) > > 50.246044 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492 > > 50.246110 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F > > 50.246256 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) > > 50.246289 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F > > 50.246443 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) > > 50.246457 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F > > 50.246932 pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported > > 50.247005 VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle > > 50.247084 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PG00._ON] at AML address ffffc9000001086e length 11D > > 50.390140 pcieport 0000:00:01.0: power state changed by ACPI to D0 > > 50.491893 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0._DSW] at AML address ffffc90000010a2d length 1D > > 50.492285 pcieport 0000:00:01.0: PME# disabled > > 50.492583 nouveau 0000:01:00.0: unknown chipset (ffffffff) > > 50.492687 nouveau: probe of 0000:01:00.0 failed with error -12 > > I've tested this on a MacBook Pro, which does not have ACPI _PR3 > methods for the root port to which the discrete GPU is attached. > The port can thus only suspend to D3hot, not D3cold. > > Even without patch [2/9], when unloading nouveau and letting the > root port go to D3hot, the port is subsequently correctly resumed > to D0 when reloading nouveau. > > So the issue that you're seeing without patch [2/9] seems to be > specific to Optimus/_PR3 machines. If possible you should try to > get it working without patch [2/9] because that patch is really > optional (as I've written in the commit message). I'm totally > unfamiliar with Optimus but maybe lspci could help to debug this? Without 2/9 I can prevent the issue by writing "on" to /sys/bus/pci/devices/0000:00:01.0/power/control (the PCIe port), but that effectively gives the same result as applying 2/9. The problem occurs when the power is lost (by putting the PCIe port in D3cold). Maybe it is a bug in the PCI core that does not re-initialize devices under the port, but since a workaround is available (2/9), I will focus on other issues first. Maybe it is worth to mention this issue in the commit message for 2/9 though.
diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c index 11f8dd9..faf7438 100644 --- a/drivers/gpu/drm/nouveau/nouveau_drm.c +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c @@ -498,7 +498,10 @@ nouveau_drm_unload(struct drm_device *dev) { struct nouveau_drm *drm = nouveau_drm(dev); - pm_runtime_get_sync(dev->dev); + if (nouveau_runtime_pm != 0) { + pm_runtime_get_sync(dev->dev); + } + nouveau_fbcon_fini(dev); nouveau_accel_fini(drm); nouveau_hwmon_fini(dev);