Message ID | 20211110210307.1172004-1-kai.vehmanen@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ALSA: hda: fix general protection fault in azx_runtime_idle | expand |
On Wed, 10 Nov 2021 22:03:07 +0100, Kai Vehmanen wrote: > > Fix a corner case between PCI device driver remove callback and > runtime PM idle callback. > > Following sequence of events can happen: > - at azx_create, context is allocated with devm_kzalloc() and > stored as pci_set_drvdata() > - user-space requests to unbind audio driver > - dd.c:__device_release_driver() calls PCI remove > - pci-driver.c:pci_device_remove() calls the audio > driver azx_remove() callback and this is completed > - pci-driver.c:pm_runtime_put_sync() leads to a call > to rpm_idle() which again calls azx_runtime_idle() > - the azx context object, as returned by dev_get_drvdata(), > is no longer valid > -> access fault in azx_runtime_idle when executing > struct snd_card *card = dev_get_drvdata(dev); > chip = card->private_data; > if (chip->disabled || hda->init_failed) > > This was discovered by i915_module_load test with 5.15.0 based > linux-next tree. > > Example log caught by i915_module_load test with linux-next > https://intel-gfx-ci.01.org/tree/linux-next/ > > <4> [264.038232] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b73f0: 0000 [#1] PREEMPT SMP NOPTI > <4> [264.038248] CPU: 0 PID: 5374 Comm: i915_module_loa Not tainted 5.15.0-next-20211109-gc8109c2ba35e-next-20211109 #1 > [...] > <4> [264.038267] RIP: 0010:azx_runtime_idle+0x12/0x60 [snd_hda_intel] > [...] > <4> [264.038355] Call Trace: > <4> [264.038359] <TASK> > <4> [264.038362] __rpm_callback+0x3d/0x110 > <4> [264.038371] rpm_idle+0x27f/0x380 > <4> [264.038376] __pm_runtime_idle+0x3b/0x100 > <4> [264.038382] pci_device_remove+0x6d/0xa0 > <4> [264.038388] device_release_driver_internal+0xef/0x1e0 > <4> [264.038395] unbind_store+0xeb/0x120 > <4> [264.038400] kernfs_fop_write_iter+0x11a/0x1c0 > > Fix the issue by setting drvdata to NULL at end of azx_remove(). > > Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> > --- > sound/pci/hda/hda_intel.c | 1 + > 1 file changed, 1 insertion(+) > > Some non-persistent direct links showing the bug trigger on > different platforms with linux-next 20211109: > - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-tgl-1115g4/igt@i915_module_load@reload.html > - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-jsl-1/igt@i915_module_load@reload.html > > Notably with 20211110 linux-next, the bug does not trigger: > - https://intel-gfx-ci.01.org/tree/linux-next/next-20211110/fi-tgl-1115g4/igt@i915_module_load@reload.html Is this the case with CONFIG_DEBUG_KOBJECT_RELEASE? This would be the only logical explanation I can think of for now. In anyway, the code change itself looks good, so I took the fix now. thanks, Takashi
Hey, On Wed, 10 Nov 2021, Takashi Iwai wrote: > On Wed, 10 Nov 2021 22:03:07 +0100, Kai Vehmanen wrote: > > Fix a corner case between PCI device driver remove callback and > > runtime PM idle callback. [...] > > Some non-persistent direct links showing the bug trigger on > > different platforms with linux-next 20211109: > > - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-tgl-1115g4/igt@i915_module_load@reload.html > > - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-jsl-1/igt@i915_module_load@reload.html > > > > Notably with 20211110 linux-next, the bug does not trigger: > > - https://intel-gfx-ci.01.org/tree/linux-next/next-20211110/fi-tgl-1115g4/igt@i915_module_load@reload.html > > Is this the case with CONFIG_DEBUG_KOBJECT_RELEASE? > This would be the only logical explanation I can think of for now. hmm, that doesn't seem to be used. Here's a link to kconfig used in the failing CI run: https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/kconfig.txt It's still a bit odd, especially given Scott just reported the other HDA related regression in 5.15 today. The two issues don't seem to be related though, although both are fixed by clearing drvdata (but in different places of hda_intel.c). I'll try to run some more tests tomorrow. The fix should be good in any case, but it would be interesting to understand better what change made this more (?) likely to hit than before. This is not a new test and the problem happens on fairly old platforms, so something has changed. Br, Kai
On Wed, 10 Nov 2021 23:15:40 +0100, Kai Vehmanen wrote: > > Hey, > > On Wed, 10 Nov 2021, Takashi Iwai wrote: > > > On Wed, 10 Nov 2021 22:03:07 +0100, Kai Vehmanen wrote: > > > Fix a corner case between PCI device driver remove callback and > > > runtime PM idle callback. > [...] > > > Some non-persistent direct links showing the bug trigger on > > > different platforms with linux-next 20211109: > > > - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-tgl-1115g4/igt@i915_module_load@reload.html > > > - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-jsl-1/igt@i915_module_load@reload.html > > > > > > Notably with 20211110 linux-next, the bug does not trigger: > > > - https://intel-gfx-ci.01.org/tree/linux-next/next-20211110/fi-tgl-1115g4/igt@i915_module_load@reload.html > > > > Is this the case with CONFIG_DEBUG_KOBJECT_RELEASE? > > This would be the only logical explanation I can think of for now. > > hmm, that doesn't seem to be used. Here's a link to kconfig used in the > failing CI run: > https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/kconfig.txt OK, then it's not due to the delayed release, but the cause should be the same, I suppose. > It's still a bit odd, especially given Scott just reported the other HDA > related regression in 5.15 today. The two issues don't seem to be related > though, although both are fixed by clearing drvdata (but in different > places of hda_intel.c). I don't think it's the same issue, rather a coincidence of the timing. There have been many changes in 5.15, after all :) > I'll try to run some more tests tomorrow. The fix should be good in any > case, but it would be interesting to understand better what change made > this more (?) likely to hit than before. This is not a new test and the > problem happens on fairly old platforms, so something has changed. A potential problem with the current code is that it doesn't disable the runtime PM at the release procedure. Could you try the patch below? You can put WARN_ON(!chip) at azx_runtime_idle(), too, for catching the invalid runtime call. thanks, Takashi --- a/sound/pci/hda/hda_intel.c +++ b/sound/pci/hda/hda_intel.c @@ -1347,8 +1347,13 @@ static void azx_free(struct azx *chip) if (hda->freed) return; - if (azx_has_pm_runtime(chip) && chip->running) + if (azx_has_pm_runtime(chip) && chip->running) { pm_runtime_get_noresume(&pci->dev); + pm_runtime_forbid(&pci->dev); + pm_runtime_dont_use_autosuspend(&pci->dev); + pm_runtime_disable(&pci->dev); + } + chip->running = 0; azx_del_card_list(chip); @@ -2320,6 +2325,7 @@ static int azx_probe_continue(struct azx *chip) set_default_power_save(chip); if (azx_has_pm_runtime(chip)) { + pm_runtime_enable(&pci->dev); pm_runtime_use_autosuspend(&pci->dev); pm_runtime_allow(&pci->dev); pm_runtime_put_autosuspend(&pci->dev);
Hi, On Thu, 11 Nov 2021, Takashi Iwai wrote: > A potential problem with the current code is that it doesn't disable > the runtime PM at the release procedure. Could you try the patch > below? You can put WARN_ON(!chip) at azx_runtime_idle(), too, for > catching the invalid runtime call. [...] > --- a/sound/pci/hda/hda_intel.c > +++ b/sound/pci/hda/hda_intel.c > @@ -1347,8 +1347,13 @@ static void azx_free(struct azx *chip) > if (hda->freed) > return; > > - if (azx_has_pm_runtime(chip) && chip->running) > + if (azx_has_pm_runtime(chip) && chip->running) { > pm_runtime_get_noresume(&pci->dev); > + pm_runtime_forbid(&pci->dev); > + pm_runtime_dont_use_autosuspend(&pci->dev); > + pm_runtime_disable(&pci->dev); > + } > + > chip->running = 0; Tested with next-20211019 (first next tag where I've seen test failures) and your patch, and this seems to do the trick. I didn't have my drvdata patch included when I ran the test. No rpm_idle() calls anymore after azx_remove(), so the bug is not hit. > azx_del_card_list(chip); > @@ -2320,6 +2325,7 @@ static int azx_probe_continue(struct azx *chip) > set_default_power_save(chip); > > if (azx_has_pm_runtime(chip)) { > + pm_runtime_enable(&pci->dev); > pm_runtime_use_autosuspend(&pci->dev); This does generate warnings [ 13.495059] snd_hda_intel 0000:00:1f.3: Unbalanced pm_runtime_enable! And later [ 54.770701] Enabling runtime PM for inactive device (0000:00:1f.3) with active children [ 54.770718] WARNING: CPU: 0 PID: 10 at drivers/base/power/runtime.c:1439 pm_runtime_enable+0x98/0xb0 Adding a "pm_runtime_set_active(&pci->dev)" to both azx_free() and azx_probe_continue() seems to help and fix still works. Br, Kai
On Thu, 11 Nov 2021 18:39:36 +0100, Kai Vehmanen wrote: > > Hi, > > On Thu, 11 Nov 2021, Takashi Iwai wrote: > > > A potential problem with the current code is that it doesn't disable > > the runtime PM at the release procedure. Could you try the patch > > below? You can put WARN_ON(!chip) at azx_runtime_idle(), too, for > > catching the invalid runtime call. > [...] > > --- a/sound/pci/hda/hda_intel.c > > +++ b/sound/pci/hda/hda_intel.c > > @@ -1347,8 +1347,13 @@ static void azx_free(struct azx *chip) > > if (hda->freed) > > return; > > > > - if (azx_has_pm_runtime(chip) && chip->running) > > + if (azx_has_pm_runtime(chip) && chip->running) { > > pm_runtime_get_noresume(&pci->dev); > > + pm_runtime_forbid(&pci->dev); > > + pm_runtime_dont_use_autosuspend(&pci->dev); > > + pm_runtime_disable(&pci->dev); > > + } > > + > > chip->running = 0; > > Tested with next-20211019 (first next tag where I've seen test failures) > and your patch, and this seems to do the trick. I didn't have my drvdata > patch included when I ran the test. No rpm_idle() calls > anymore after azx_remove(), so the bug is not hit. So far, so good... > > azx_del_card_list(chip); > > @@ -2320,6 +2325,7 @@ static int azx_probe_continue(struct azx *chip) > > set_default_power_save(chip); > > > > if (azx_has_pm_runtime(chip)) { > > + pm_runtime_enable(&pci->dev); > > pm_runtime_use_autosuspend(&pci->dev); > > This does generate warnings > [ 13.495059] snd_hda_intel 0000:00:1f.3: Unbalanced pm_runtime_enable! > > And later > [ 54.770701] Enabling runtime PM for inactive device (0000:00:1f.3) with active children > [ 54.770718] WARNING: CPU: 0 PID: 10 at drivers/base/power/runtime.c:1439 pm_runtime_enable+0x98/0xb0 > > Adding a "pm_runtime_set_active(&pci->dev)" to both azx_free() and > azx_probe_continue() seems to help and fix still works. Ah yes, I was confused as if it were already called in hdac_device.c, but this was about the HD-audio bus controller, not the codec. Below is the revised one. Takashi -- 8< -- From: Takashi Iwai <tiwai@suse.de> Subject: [PATCH] ALSA: hda: intel: More comprehensive PM runtime setup for controller driver Currently we haven't explicitly enable and allow/forbid the runtime PM at the probe and the remove phases of HD-audio controller driver, and this was the reason of a GPF mentioned in the commit e81478bbe7a1 ("ALSA: hda: fix general protection fault in azx_runtime_idle"); namely, even after the resources are released, the runtime PM might be still invoked by the bound graphics driver during the remove of the controller driver. Although we've fixed it by clearing the drvdata reference, it'd be also better to cover the runtime PM issue more properly. This patch adds a few more pm_runtime_*() calls at the probe and the remove time for setting and cleaning up the runtime PM. Particularly, now more explicitly pm_runtime_enable() and _disable() get called as well as pm_runtime_forbid() call at the remove callback, so that a use-after-free should be avoided. Reported-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> --- sound/pci/hda/hda_intel.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c index fe51163f2d82..45e85180048c 100644 --- a/sound/pci/hda/hda_intel.c +++ b/sound/pci/hda/hda_intel.c @@ -1347,8 +1347,14 @@ static void azx_free(struct azx *chip) if (hda->freed) return; - if (azx_has_pm_runtime(chip) && chip->running) + if (azx_has_pm_runtime(chip) && chip->running) { pm_runtime_get_noresume(&pci->dev); + pm_runtime_disable(&pci->dev); + pm_runtime_set_suspended(&pci->dev); + pm_runtime_forbid(&pci->dev); + pm_runtime_dont_use_autosuspend(&pci->dev); + } + chip->running = 0; azx_del_card_list(chip); @@ -2322,6 +2328,8 @@ static int azx_probe_continue(struct azx *chip) if (azx_has_pm_runtime(chip)) { pm_runtime_use_autosuspend(&pci->dev); pm_runtime_allow(&pci->dev); + pm_runtime_set_active(&pci->dev); + pm_runtime_enable(&pci->dev); pm_runtime_put_autosuspend(&pci->dev); }
Hi, On Fri, 12 Nov 2021, Takashi Iwai wrote: > On Thu, 11 Nov 2021 18:39:36 +0100, Kai Vehmanen wrote: > > And later > > [ 54.770701] Enabling runtime PM for inactive device (0000:00:1f.3) with active children > > [ 54.770718] WARNING: CPU: 0 PID: 10 at drivers/base/power/runtime.c:1439 pm_runtime_enable+0x98/0xb0 > > > > Adding a "pm_runtime_set_active(&pci->dev)" to both azx_free() and > > azx_probe_continue() seems to help and fix still works. > > Ah yes, I was confused as if it were already called in hdac_device.c, > but this was about the HD-audio bus controller, not the codec. > > Below is the revised one. [...] > Currently we haven't explicitly enable and allow/forbid the runtime PM > at the probe and the remove phases of HD-audio controller driver, and > this was the reason of a GPF mentioned in the commit e81478bbe7a1 > ("ALSA: hda: fix general protection fault in azx_runtime_idle"); > namely, even after the resources are released, the runtime PM might be > still invoked by the bound graphics driver during the remove of the > controller driver. Although we've fixed it by clearing the drvdata > reference, it'd be also better to cover the runtime PM issue more > properly. > > This patch adds a few more pm_runtime_*() calls at the probe and the > remove time for setting and cleaning up the runtime PM. Particularly, > now more explicitly pm_runtime_enable() and _disable() get called as > well as pm_runtime_forbid() call at the remove callback, so that a > use-after-free should be avoided. > > Reported-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> > Signed-off-by: Takashi Iwai <tiwai@suse.de> ack, tested this and no warnings anymore. Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Tested-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Br, Kai
On Fri, 12 Nov 2021 13:27:34 +0100, Kai Vehmanen wrote: > > Hi, > > On Fri, 12 Nov 2021, Takashi Iwai wrote: > > > On Thu, 11 Nov 2021 18:39:36 +0100, Kai Vehmanen wrote: > > > And later > > > [ 54.770701] Enabling runtime PM for inactive device (0000:00:1f.3) with active children > > > [ 54.770718] WARNING: CPU: 0 PID: 10 at drivers/base/power/runtime.c:1439 pm_runtime_enable+0x98/0xb0 > > > > > > Adding a "pm_runtime_set_active(&pci->dev)" to both azx_free() and > > > azx_probe_continue() seems to help and fix still works. > > > > Ah yes, I was confused as if it were already called in hdac_device.c, > > but this was about the HD-audio bus controller, not the codec. > > > > Below is the revised one. > [...] > > Currently we haven't explicitly enable and allow/forbid the runtime PM > > at the probe and the remove phases of HD-audio controller driver, and > > this was the reason of a GPF mentioned in the commit e81478bbe7a1 > > ("ALSA: hda: fix general protection fault in azx_runtime_idle"); > > namely, even after the resources are released, the runtime PM might be > > still invoked by the bound graphics driver during the remove of the > > controller driver. Although we've fixed it by clearing the drvdata > > reference, it'd be also better to cover the runtime PM issue more > > properly. > > > > This patch adds a few more pm_runtime_*() calls at the probe and the > > remove time for setting and cleaning up the runtime PM. Particularly, > > now more explicitly pm_runtime_enable() and _disable() get called as > > well as pm_runtime_forbid() call at the remove callback, so that a > > use-after-free should be avoided. > > > > Reported-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> > > Signed-off-by: Takashi Iwai <tiwai@suse.de> > > ack, tested this and no warnings anymore. > > Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> > Tested-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> OK, I'll submit and merge the patch. As the original problem was fixed by NULL checks, I'd push this for 5.16, though. thanks, Takashi
diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c index 7762718cf429..b90c817e3f6f 100644 --- a/sound/pci/hda/hda_intel.c +++ b/sound/pci/hda/hda_intel.c @@ -2364,6 +2364,7 @@ static void azx_remove(struct pci_dev *pci) cancel_delayed_work_sync(&hda->probe_work); device_lock(&pci->dev); + pci_set_drvdata(pci, NULL); snd_card_free(card); } }
Fix a corner case between PCI device driver remove callback and runtime PM idle callback. Following sequence of events can happen: - at azx_create, context is allocated with devm_kzalloc() and stored as pci_set_drvdata() - user-space requests to unbind audio driver - dd.c:__device_release_driver() calls PCI remove - pci-driver.c:pci_device_remove() calls the audio driver azx_remove() callback and this is completed - pci-driver.c:pm_runtime_put_sync() leads to a call to rpm_idle() which again calls azx_runtime_idle() - the azx context object, as returned by dev_get_drvdata(), is no longer valid -> access fault in azx_runtime_idle when executing struct snd_card *card = dev_get_drvdata(dev); chip = card->private_data; if (chip->disabled || hda->init_failed) This was discovered by i915_module_load test with 5.15.0 based linux-next tree. Example log caught by i915_module_load test with linux-next https://intel-gfx-ci.01.org/tree/linux-next/ <4> [264.038232] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b73f0: 0000 [#1] PREEMPT SMP NOPTI <4> [264.038248] CPU: 0 PID: 5374 Comm: i915_module_loa Not tainted 5.15.0-next-20211109-gc8109c2ba35e-next-20211109 #1 [...] <4> [264.038267] RIP: 0010:azx_runtime_idle+0x12/0x60 [snd_hda_intel] [...] <4> [264.038355] Call Trace: <4> [264.038359] <TASK> <4> [264.038362] __rpm_callback+0x3d/0x110 <4> [264.038371] rpm_idle+0x27f/0x380 <4> [264.038376] __pm_runtime_idle+0x3b/0x100 <4> [264.038382] pci_device_remove+0x6d/0xa0 <4> [264.038388] device_release_driver_internal+0xef/0x1e0 <4> [264.038395] unbind_store+0xeb/0x120 <4> [264.038400] kernfs_fop_write_iter+0x11a/0x1c0 Fix the issue by setting drvdata to NULL at end of azx_remove(). Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> --- sound/pci/hda/hda_intel.c | 1 + 1 file changed, 1 insertion(+) Some non-persistent direct links showing the bug trigger on different platforms with linux-next 20211109: - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-tgl-1115g4/igt@i915_module_load@reload.html - https://intel-gfx-ci.01.org/tree/linux-next/next-20211109/fi-jsl-1/igt@i915_module_load@reload.html Notably with 20211110 linux-next, the bug does not trigger: - https://intel-gfx-ci.01.org/tree/linux-next/next-20211110/fi-tgl-1115g4/igt@i915_module_load@reload.html base-commit: 6322ec8d0de924cf9672b23c1b5052afafc2f03b