Message ID | 20240725120729.59788-2-pstanner@redhat.com
---|---
State | Accepted
Commit | 8dea7ce8b09161497f5f54cf8034edd03518c6c2
Series | PCI: Fix devres regression in pci_intx()
Can we please fix this to never silently redirect to a managed version, and instead add a proper pcim_intx() and use that where people actually want to use the crazy devres voodoo? Doing this silently underneath will always create problems.
On Thu, 2024-07-25 at 07:26 -0700, Christoph Hellwig wrote:
> Can we please fix this to never silently redirect to a managed
> version

It is not the fix or the recent changes (which the fix is for) to PCI
devres that is doing that. pci_intx() has been "silently
redirect[ing] to a managed version" for 15 years.

The changes merged into v6.11 attempted to keep this behavior exactly
identical as a preparation for later cleanups. The fix here only
corrects the position of the redirection to where the "crazy devres
voodoo" had always been:

void pci_intx(struct pci_dev *pdev, int enable)
{
	u16 pci_command, new;

	pci_read_config_word(pdev, PCI_COMMAND, &pci_command);

	if (enable)
		new = pci_command & ~PCI_COMMAND_INTX_DISABLE;
	else
		new = pci_command | PCI_COMMAND_INTX_DISABLE;

	if (new != pci_command) {
		struct pci_devres *dr;

		pci_write_config_word(pdev, PCI_COMMAND, new);

		/* voodoo_begin */
		dr = find_pci_dr(pdev);
		if (dr && !dr->restore_intx) {
			dr->restore_intx = 1;
			dr->orig_intx = !enable;
		}
		/* voodoo_end */
	}
}
EXPORT_SYMBOL_GPL(pci_intx);

> and add a proper pcim_intx instead

That has already been done. pcim_intx() sits in drivers/pci/devres.c

> and use that where people actually
> want to use the crazy devres voodoo instead? Doing this silently
> underneath will always create problems.

That's precisely what all my work is about. The hybrid nature of
pci_intx(), pci_set_mwi() and all pci*request*() functions needs to
be removed.

However, that will take a while, because the APIs are partly ossified
and every user that relies on implicit crazy devres voodoo has to be
identified and then ported to *explicit* half-crazy devres voodoo.

More details here:
https://lore.kernel.org/all/20240613115032.29098-1-pstanner@redhat.com/

P.
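[Editor's note: to illustrate the porting target Philipp describes, here is a minimal sketch of the *explicit* managed style from a driver's probe path. It assumes pcim_intx() is exported for driver use (at the time of this mail it still lived privately in drivers/pci/devres.c); the driver name and body are made up.]

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int ret;

	/* Managed enable: undone automatically on driver detach. */
	ret = pcim_enable_device(pdev);
	if (ret)
		return ret;

	/*
	 * Explicitly managed INTX: devres restores the previous state
	 * on detach. No silent redirection involved -- the caller can
	 * see the devres semantics in the function name.
	 */
	ret = pcim_intx(pdev, 1);
	if (ret)
		return ret;

	return 0;
}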
> That's precisely what all my work is about. The hybrid nature of
> pci_intx(), pci_set_mwi() and all pci*request*() functions needs to
> be removed.

Ok. Thanks for doing the work.
[+cc Christoph]

On Thu, Jul 25, 2024 at 02:07:30PM +0200, Philipp Stanner wrote:
> pci_intx() is a function that becomes managed if pcim_enable_device()
> has been called in advance. Commit 25216afc9db5 ("PCI: Add managed
> pcim_intx()") changed this behavior so that pci_intx() always leads to
> creation of a separate device resource for itself, whereas earlier, a
> shared resource was used for all PCI devres operations.
>
> Unfortunately, pci_intx() seems to be used in some drivers' remove()
> paths; in the managed case this causes a device resource to be created
> on driver detach.
>
> Fix the regression by only redirecting pci_intx() to its managed twin
> pcim_intx() if the pci_command changes.
>
> Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
> Reported-by: Damien Le Moal <dlemoal@kernel.org>
> Closes: https://lore.kernel.org/all/b8f4ba97-84fc-4b7e-ba1a-99de2d9f0118@kernel.org/
> Signed-off-by: Philipp Stanner <pstanner@redhat.com>

Applied to for-linus for v6.11, thanks!

> ---
> Alright, I reproduced this with QEMU as Damien described and this here
> fixes the issue on my side. Feedback welcome. Thank you very much,
> Damien.
>
> It seems that this might yet again be the issue of drivers not being
> aware that pci_intx() might become managed, so they use it in their
> unwind path (rightfully so; there probably was no alternative back
> then).
>
> That will make the long term cleanup difficult. But I think this for now
> is the most elegant possible workaround.
> ---
>  drivers/pci/pci.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e3a49f66982d..ffaaca0978cb 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4477,12 +4477,6 @@ void pci_intx(struct pci_dev *pdev, int enable)
>  {
>  	u16 pci_command, new;
>
> -	/* Preserve the "hybrid" behavior for backwards compatibility */
> -	if (pci_is_managed(pdev)) {
> -		WARN_ON_ONCE(pcim_intx(pdev, enable) != 0);
> -		return;
> -	}
> -
>  	pci_read_config_word(pdev, PCI_COMMAND, &pci_command);
>
>  	if (enable)
> @@ -4490,8 +4484,15 @@ void pci_intx(struct pci_dev *pdev, int enable)
>  	else
>  		new = pci_command | PCI_COMMAND_INTX_DISABLE;
>
> -	if (new != pci_command)
> +	if (new != pci_command) {
> +		/* Preserve the "hybrid" behavior for backwards compatibility */
> +		if (pci_is_managed(pdev)) {
> +			WARN_ON_ONCE(pcim_intx(pdev, enable) != 0);
> +			return;
> +		}
> +
>  		pci_write_config_word(pdev, PCI_COMMAND, new);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(pci_intx);
>
> --
> 2.45.2
On 7/25/24 21:07, Philipp Stanner wrote:
> pci_intx() is a function that becomes managed if pcim_enable_device()
> has been called in advance. Commit 25216afc9db5 ("PCI: Add managed
> pcim_intx()") changed this behavior so that pci_intx() always leads to
> creation of a separate device resource for itself, whereas earlier, a
> shared resource was used for all PCI devres operations.
>
> Unfortunately, pci_intx() seems to be used in some drivers' remove()
> paths; in the managed case this causes a device resource to be created
> on driver detach.
>
> Fix the regression by only redirecting pci_intx() to its managed twin
> pcim_intx() if the pci_command changes.
>
> Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
> Reported-by: Damien Le Moal <dlemoal@kernel.org>
> Closes: https://lore.kernel.org/all/b8f4ba97-84fc-4b7e-ba1a-99de2d9f0118@kernel.org/
> Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> ---
> Alright, I reproduced this with QEMU as Damien described and this here
> fixes the issue on my side. Feedback welcome. Thank you very much,
> Damien.

This works and is cleaner than what I had :)

Tested-by: Damien Le Moal <dlemoal@kernel.org>

> It seems that this might yet again be the issue of drivers not being
> aware that pci_intx() might become managed, so they use it in their
> unwind path (rightfully so; there probably was no alternative back
> then).

At least for the ahci driver, with which I found the issue, what is odd
is that there is only a single call to pci_intx(), to *enable* intx,
and that call is in the probe path. With QEMU, this call is not even
done, as the QEMU AHCI device supports MSI.

Adding a WARN_ON(!enable) at the beginning of pci_intx(), we can see
that what happens is that during device probe, we get this backtrace:

[   34.658988] WARNING: CPU: 0 PID: 999 at drivers/pci/pci.c:4480 pci_intx+0x7f/0xc0
[   34.660799] Modules linked in: ahci(+) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace netfs]
[   34.673784] CPU: 0 UID: 0 PID: 999 Comm: modprobe Tainted: G        W          6.10.0+ #1948
[   34.675961] Tainted: [W]=WARN
[   34.676891] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[   34.679197] RIP: 0010:pci_intx+0x7f/0xc0
[   34.680348] Code: b7 d2 be 04 00 00 00 48 89 df e8 0c 84 ff ff 48 8b 44 24 08 65 48 2b 04 25 28 00 00 00 756
[   34.685015] RSP: 0018:ffffb60f40e2f7f0 EFLAGS: 00010246
[   34.686436] RAX: 0000000000000000 RBX: ffff9dbb81237000 RCX: ffffb60f40e2f64c
[   34.688294] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9dbb81237000
[   34.690120] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[   34.691986] R10: ffff9dbb88883538 R11: 0000000000000001 R12: 0000000000000001
[   34.693687] R13: ffff9dbb812370c8 R14: ffff9dbb86eeaa00 R15: 0000000000000000
[   34.695140] FS:  00007f9d81465740(0000) GS:ffff9dbcf7c00000(0000) knlGS:0000000000000000
[   34.696884] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   34.698107] CR2: 00007ffc786ed8b8 CR3: 00000001088da000 CR4: 0000000000350ef0
[   34.699562] Call Trace:
[   34.700215]  <TASK>
[   34.700802]  ? pci_intx+0x7f/0xc0
[   34.701607]  ? __warn.cold+0xa5/0x13c
[   34.702448]  ? pci_intx+0x7f/0xc0
[   34.703257]  ? report_bug+0xff/0x140
[   34.704105]  ? handle_bug+0x3a/0x70
[   34.704938]  ? exc_invalid_op+0x17/0x70
[   34.705826]  ? asm_exc_invalid_op+0x1a/0x20
[   34.706593]  ? pci_intx+0x7f/0xc0
[   34.707086]  msi_capability_init+0x35a/0x370
[   34.707723]  __pci_enable_msi_range+0x187/0x240
[   34.708356]  pci_alloc_irq_vectors_affinity+0xc4/0x110
[   34.709058]  ahci_init_one+0x6ec/0xcc0 [ahci]
[   34.709692]  ? __pm_runtime_resume+0x58/0x90
[   34.710311]  local_pci_probe+0x45/0x90
[   34.710865]  pci_device_probe+0xbb/0x230
[   34.711433]  really_probe+0xcc/0x350
[   34.711976]  ? pm_runtime_barrier+0x54/0x90
[   34.712569]  ? __pfx___driver_attach+0x10/0x10
[   34.713206]  __driver_probe_device+0x78/0x110
[   34.713837]  driver_probe_device+0x1f/0xa0
[   34.714427]  __driver_attach+0xbe/0x1d0
[   34.715001]  bus_for_each_dev+0x92/0xe0
[   34.715563]  bus_add_driver+0x115/0x200
[   34.716136]  driver_register+0x72/0xd0
[   34.716704]  ? __pfx_ahci_pci_driver_init+0x10/0x10 [ahci]
[   34.717457]  do_one_initcall+0x76/0x3a0
[   34.718036]  do_init_module+0x60/0x210
[   34.718616]  init_module_from_file+0x86/0xc0
[   34.719243]  idempotent_init_module+0x127/0x2c0
[   34.719913]  __x64_sys_finit_module+0x5e/0xb0
[   34.720546]  do_syscall_64+0x7d/0x160
[   34.721100]  ? srso_return_thunk+0x5/0x5f
[   34.721695]  ? do_syscall_64+0x89/0x160
[   34.722258]  ? srso_return_thunk+0x5/0x5f
[   34.722846]  ? do_sys_openat2+0x9c/0xe0
[   34.723421]  ? srso_return_thunk+0x5/0x5f
[   34.724012]  ? syscall_exit_to_user_mode+0x64/0x1f0
[   34.724703]  ? srso_return_thunk+0x5/0x5f
[   34.725293]  ? do_syscall_64+0x89/0x160
[   34.725883]  ? srso_return_thunk+0x5/0x5f
[   34.726467]  ? syscall_exit_to_user_mode+0x64/0x1f0
[   34.727159]  ? srso_return_thunk+0x5/0x5f
[   34.727764]  ? do_syscall_64+0x89/0x160
[   34.728341]  ? srso_return_thunk+0x5/0x5f
[   34.728937]  ? exc_page_fault+0x6c/0x200
[   34.729511]  ? srso_return_thunk+0x5/0x5f
[   34.730109]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   34.730837] RIP: 0033:0x7f9d80d281dd
[   34.731375] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d8
[   34.733796] RSP: 002b:00007ffc786f0898 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   34.734894] RAX: ffffffffffffffda RBX: 00005617347f09a0 RCX: 00007f9d80d281dd
[   34.735850] RDX: 0000000000000000 RSI: 0000561715fe5e79 RDI: 0000000000000003
[   34.736812] RBP: 00007ffc786f0950 R08: 00007f9d80df6b20 R09: 00007ffc786f08e0
[   34.737769] R10: 00005617347f13e0 R11: 0000000000000246 R12: 0000561715fe5e79
[   34.738725] R13: 0000000000040000 R14: 00005617347f8990 R15: 00005617347f8b20
[   34.739695]  </TASK>
[   34.740075] ---[ end trace 0000000000000000 ]---

So it is msi_capability_init() that is the problem: that function calls
pci_intx_for_msi(dev, 0), which then calls pci_intx(dev, 0), thus
creating the intx devres for the device despite the driver code not
touching intx at all. The driver is fine! It is MSI touching INTX that
is messing things up.

That said, I do not see that as an issue in itself. What I fail to
understand though is why that intx devres is not deleted on device
teardown. I think this may have something to do with the fact that
pcim_intx() always does "res->orig_intx = !enable;", that is, it
assumes that if there is a call to pcim_intx(dev, 0), then it is
because intx was enabled already, which I do not think is true for most
drivers... So we end up with INTX being wrongly enabled on device
teardown by pcim_intx_restore(), and because of that, the intx resource
is not deleted?

Re-enabling intx on teardown is wrong I think, but that still does not
explain why that resource is not deleted. I fail to see why.
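[Editor's note: to make the "res->orig_intx = !enable" point above concrete, the following is a condensed paraphrase (not a verbatim copy) of the managed INTX logic introduced by 25216afc9db5. get_or_create_intx_devres() is a name used later in this thread; the exact control flow around the raw config write is assumed.]

struct pcim_intx_devres {
	int orig_intx;
};

static void pcim_intx_restore(struct device *dev, void *data)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	struct pcim_intx_devres *res = data;

	/*
	 * Devres release callback: restore the recorded state on driver
	 * detach. Because pcim_intx() records the opposite of 'enable'
	 * (below) rather than the real pre-bind state, a lone
	 * pcim_intx(dev, 0) call leads to INTX being *enabled* here.
	 */
	pci_intx(pdev, res->orig_intx);
}

int pcim_intx(struct pci_dev *pdev, int enable)
{
	struct pcim_intx_devres *res;

	/* May allocate on first use -- relevant to the atomic-context
	 * discussion later in this thread. */
	res = get_or_create_intx_devres(&pdev->dev);
	if (!res)
		return -ENOMEM;

	res->orig_intx = !enable;	/* assumes the caller toggles the state */
	pci_intx(pdev, enable);		/* simplified: the real code performs
					 * the raw PCI_COMMAND update here */

	return 0;
}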
On Fri, 2024-07-26 at 09:19 +0900, Damien Le Moal wrote:
> On 7/25/24 21:07, Philipp Stanner wrote:
> > pci_intx() is a function that becomes managed if pcim_enable_device()
> > has been called in advance. Commit 25216afc9db5 ("PCI: Add managed
> > pcim_intx()") changed this behavior so that pci_intx() always leads
> > to creation of a separate device resource for itself, whereas
> > earlier, a shared resource was used for all PCI devres operations.
> >
> > Unfortunately, pci_intx() seems to be used in some drivers' remove()
> > paths; in the managed case this causes a device resource to be
> > created on driver detach.
> >
> > Fix the regression by only redirecting pci_intx() to its managed
> > twin pcim_intx() if the pci_command changes.
> >
> > Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
> > Reported-by: Damien Le Moal <dlemoal@kernel.org>
> > Closes: https://lore.kernel.org/all/b8f4ba97-84fc-4b7e-ba1a-99de2d9f0118@kernel.org/
> > Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> > ---
> > Alright, I reproduced this with QEMU as Damien described and this
> > here fixes the issue on my side. Feedback welcome. Thank you very
> > much, Damien.
>
> This works and is cleaner than what I had :)

The fundamental idea is mostly identical to yours – you likely just
didn't see it because your attention was directed towards the code in
devres.c ;)

> Tested-by: Damien Le Moal <dlemoal@kernel.org>

You might wanna ping Bjorn about that in case he didn't see.

> > It seems that this might yet again be the issue of drivers not being
> > aware that pci_intx() might become managed, so they use it in their
> > unwind path (rightfully so; there probably was no alternative back
> > then).
>
> At least for the ahci driver, with which I found the issue, what is
> odd is that there is only a single call to pci_intx()

There is only a single _directly visible_ call :]

> to *enable* intx, and that call is in the probe path. With QEMU, this
> call is not even done, as the QEMU AHCI device supports MSI.

Hmm, MSI...
> Adding a WARN_ON(!enable) at the beginning of pci_intx(), we can see
> that what happens is that during device probe, we get this backtrace:
>
> [   34.658988] WARNING: CPU: 0 PID: 999 at drivers/pci/pci.c:4480 pci_intx+0x7f/0xc0
> [... full backtrace snipped; see Damien's mail above ...]
> [   34.740075] ---[ end trace 0000000000000000 ]---
>
> So it is msi_capability_init() that is the problem: that function
> calls pci_intx_for_msi(dev, 0), which then calls pci_intx(dev, 0),
> thus creating the intx devres for the device despite the driver code
> not touching intx at all.

Nope, that is not the problem – as you correctly point out below, that
device resource should be destroyed again.

> The driver is fine! It is MSI touching INTX that is messing things up.

Yes, many drivers are more or less innocent in regards to all of that.
As Christoph rightfully pointed out, an API should never behave like
that and do _implicit_ magic behind your back. That's the entire
philosophy of the C language.

> That said, I do not see that as an issue in itself. What I fail to
> understand though is why that intx devres is not deleted on device
> teardown. I think this may have something to do with the fact that
> pcim_intx() always does "res->orig_intx = !enable;", that is, it
> assumes that if there is a call to pcim_intx(dev, 0), then it is
> because intx was enabled already, which I do not think is true for
> most drivers... So we end up with INTX being wrongly enabled on device
> teardown by pcim_intx_restore(), and because of that, the intx
> resource is not deleted?

Spoiler: The device resources that have initially been created do get
deleted. Devres works as intended. The issue is that the forces of evil
invoke pci_intx() through another path, hidden behind an API, through
another devres callback.

So the device resource never gets deleted because it is *created* on
driver detach, when devres already ran.

> Re-enabling intx on teardown is wrong I think, but that still does not
> explain why that resource is not deleted. I fail to see why.

You came very close to the truth ;)

With some help from my favorite coworker I did some tracing today and
found this when doing `rmmod ahci`:

 => pci_intx
 => pci_msi_shutdown
 => pci_disable_msi
 => devres_release_all
 => device_unbind_cleanup
 => device_release_driver_internal
 => driver_detach
 => bus_remove_driver
 => pci_unregister_driver
 => __do_sys_delete_module
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

The SYSCALL is my `rmmod`.

As you can see, pci_intx() is invoked indirectly through
pci_disable_msi() – which gets invoked by devres, which is precisely
one reason why you could not find the suspicious pci_intx() call in the
ahci code base.

Now the question is: Who set up that devres callback which indirectly
calls pci_intx()?

It is indeed MSI, here in msi/msi.c:

static void pcim_msi_release(void *pcidev)
{
	struct pci_dev *dev = pcidev;

	dev->is_msi_managed = false;
	pci_free_irq_vectors(dev); // <-- calls pci_disable_msi(), which calls pci_intx(), which re-registers yet another devres callback
}

/*
 * Needs to be separate from pcim_release to prevent an ordering problem

==> Oh, here they even had a warning about that interacting with devres somehow...

 * vs. msi_device_data_release() in the MSI core code.
 */
static int pcim_setup_msi_release(struct pci_dev *dev)
{
	int ret;

	if (!pci_is_managed(dev) || dev->is_msi_managed)
		return 0;

	ret = devm_add_action(&dev->dev, pcim_msi_release, dev);
	if (ret)
		return ret;

	dev->is_msi_managed = true;
	return 0;
}

I don't know enough about AHCI to see where exactly it jumps into
these, but a candidate would be:
 * pci_enable_msi(), called among others in acard-ahci.c

Another path is:
 1. ahci_init_one() calls
 2. ahci_init_msi() calls
 3. pci_alloc_irq_vectors() calls
 4. pci_alloc_irq_vectors_affinity() calls
 5. __pci_enable_msi_range() OR __pci_enable_msix_range() call
 6. pci_setup_msi_context() calls
 7. pcim_setup_msi_release() which registers the callback to pci_intx()

Ha!

I think I earned myself a Friday evening beer now 8-)

Now the interesting question will be how the heck we're supposed to
clean that up.

Another interesting question is: Did that only work by coincidence
during the last 15 years, or is it by design that the check in
pci_intx():

	if (new != pci_command)

only evaluates to true if we are not in a detach path.

If it were coincidence, it would not have caused faults as it did now
with my recent work, because the old version did not allocate in
pci_intx().

But it could certainly have been racy and might run into a UAF since
the old pci_intx() would have worked on memory that is also managed by
devres, but has been registered at a different place. I guess that is
what that comment in the MSI code quoted above is hinting at.

Christoph indeed rightfully called it voodoo ^^

Cheers,
P.
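[Editor's note: the trap Philipp traced above generalizes beyond MSI: a devres release callback must never, directly or indirectly, register a new devres entry. A toy sketch with hypothetical names, not code from this thread -- devres_release_all() walks the resource list once on detach, so an entry added during that walk has nothing left to release it.]

static void release_b(void *data)
{
	/* Never runs when registered from within release_a(). */
}

static void release_a(void *data)
{
	struct device *dev = data;

	/*
	 * BUG: executed from devres_release_all(); the new action is
	 * appended after teardown has already started -- exactly what
	 * happens when pcim_msi_release() ends up in pci_intx() and a
	 * fresh INTX devres is created on driver detach.
	 */
	WARN_ON(devm_add_action(dev, release_b, dev));
}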
On Fri, Jul 26, 2024 at 08:43:12PM +0200, pstanner@redhat.com wrote:
> On Fri, 2024-07-26 at 09:19 +0900, Damien Le Moal wrote:
> > Tested-by: Damien Le Moal <dlemoal@kernel.org>
>
> You might wanna ping Bjorn about that in case he didn't see.

I amended the commit to add this, thanks Damien!
On 7/27/24 03:43, pstanner@redhat.com wrote:
> > That said, I do not see that as an issue in itself. What I fail to
> > understand though is why that intx devres is not deleted on device
> > teardown. I think this may have something to do with the fact that
> > pcim_intx() always does "res->orig_intx = !enable;", that is, it
> > assumes that if there is a call to pcim_intx(dev, 0), then it is
> > because intx was enabled already, which I do not think is true for
> > most drivers... So we end up with INTX being wrongly enabled on
> > device teardown by pcim_intx_restore(), and because of that, the
> > intx resource is not deleted?
>
> Spoiler: The device resources that have initially been created do get
> deleted. Devres works as intended. The issue is that the forces of
> evil invoke pci_intx() through another path, hidden behind an API,
> through another devres callback.
>
> So the device resource never gets deleted because it is *created* on
> driver detach, when devres already ran.

That explains the issue :)

> > Re-enabling intx on teardown is wrong I think, but that still does
> > not explain why that resource is not deleted. I fail to see why.
>
> You came very close to the truth ;)
>
> With some help from my favorite coworker I did some tracing today and
> found this when doing `rmmod ahci`:
>
>  => pci_intx
>  => pci_msi_shutdown
>  => pci_disable_msi
>  => devres_release_all
>  => device_unbind_cleanup
>  => device_release_driver_internal
>  => driver_detach
>  => bus_remove_driver
>  => pci_unregister_driver
>  => __do_sys_delete_module
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
>
> The SYSCALL is my `rmmod`.
>
> As you can see, pci_intx() is invoked indirectly through
> pci_disable_msi() – which gets invoked by devres, which is precisely
> one reason why you could not find the suspicious pci_intx() call in
> the ahci code base.
>
> Now the question is: Who set up that devres callback which indirectly
> calls pci_intx()?
>
> It is indeed MSI, here in msi/msi.c:
>
> [... pcim_msi_release() and pcim_setup_msi_release() quoted above ...]
>
> I don't know enough about AHCI to see where exactly it jumps into
> these, but a candidate would be:
>  * pci_enable_msi(), called among others in acard-ahci.c
>
> Another path is:
>  1. ahci_init_one() calls
>  2. ahci_init_msi() calls
>  3. pci_alloc_irq_vectors() calls
>  4. pci_alloc_irq_vectors_affinity() calls
>  5. __pci_enable_msi_range() OR __pci_enable_msix_range() call
>  6. pci_setup_msi_context() calls
>  7. pcim_setup_msi_release() which registers the callback to pci_intx()

ahci_init_one() is the function used by the default AHCI driver
(ahci.ko), so this path is correct.

> Ha!
>
> I think I earned myself a Friday evening beer now 8-)

I was way ahead of you :)

> Now the interesting question will be how the heck we're supposed to
> clean that up.

If pcim_intx() always gets called on device release, AND MSI/MSI-X are
also managed interrupts with a devres action on release, I would say
that pcim_msi_release() should NOT lead to a path calling
pci_intx(dev, 0), as that would create the intx devres again. But given
the comment you point out above, it seems that there are some ordering
constraints between disabling MSI and INTX? If that is the case, then
maybe this indeed will be tricky.

> Another interesting question is: Did that only work by coincidence
> during the last 15 years, or is it by design that the check in
> pci_intx():
>
> 	if (new != pci_command)
>
> only evaluates to true if we are not in a detach path.

I would say that it evaluates to true for any device using intx, which
tends to be rare these days. In such a case, enabling on device attach
and disabling on detach would lead to new != pci_command, including in
the hidden intx disable path triggered by disabling MSI. But the devres
being recreated on detach definitely seems to depend on the disabling
order though. So we may have been lucky indeed.

> If it were coincidence, it would not have caused faults as it did now
> with my recent work, because the old version did not allocate in
> pci_intx().
>
> But it could certainly have been racy and might run into a UAF since
> the old pci_intx() would have worked on memory that is also managed
> by devres, but has been registered at a different place. I guess that
> is what that comment in the MSI code quoted above is hinting at.
>
> Christoph indeed rightfully called it voodoo ^^
>
> Cheers,
> P.
On Mon, 2024-07-29 at 20:29 +0900, Damien Le Moal wrote:
> On 7/27/24 03:43, pstanner@redhat.com wrote:
> [... quoted analysis, trace and MSI code snipped; see above ...]
>
> ahci_init_one() is the function used by the default AHCI driver
> (ahci.ko), so this path is correct.
>
> > Ha!
> >
> > I think I earned myself a Friday evening beer now 8-)
>
> I was way ahead of you :)
>
> > Now the interesting question will be how the heck we're supposed to
> > clean that up.
>
> If pcim_intx() always gets called on device release, AND MSI/MSI-X
> are also managed interrupts with a devres action on release, I would
> say that pcim_msi_release() should NOT lead to a path calling
> pci_intx(dev, 0), as that would create the intx devres again.

Yes, that is obviously wrong – the thing is that we cannot remove the
devres aspect of pci_intx() as long as the other callers (especially
drivers) rely on it.

And, most notably, we (= I) don't know if *MSI* relies on the devres
aspect being present in pci_intx() in MSI's cleanup path. Looking at
what pci_intx_for_msi() is good for and where it's used might give a
clue.

@Krzysztof: Aren't you the MSI expert? Any idea what this might be all
about?

There are definitely some hacks one could do to remove the devres
aspect from pci_intx() _just for MSI_, for example:
 * set struct pci_dev.is_managed = false in the disable path in MSI
 * Use yet another function, pci_intx_unmanaged(), in MSI
   (a sketch of this option follows after this mail)

It's obviously work and all that, and I don't think one should even
try it without an understanding of what MSI is supposed to do.

> But given the comment you point out above, it seems that there are
> some ordering constraints between disabling MSI and INTX? If that is
> the case, then maybe this indeed will be tricky.

No, I don't think that's related. I think the ordering issue should
have disappeared now in v6.11 (although I wouldn't bet my hand on it).

The issue was that until and including v6.10, struct pci_devres bundled
everything needed by several managed PCI functions into just that one
struct. It is handled by one creator function and one cleanup function.
(Should have never been implemented that way, since it violates the
basic concepts of devres, but here we are; should, would, could...)

I presume the author was then about to register that separate cleanup
devres callback in MSI and discovered that it would be racy if that MSI
code uses the same struct pci_devres, which is administrated by a
different function. It could have been freed already when the MSI
callback awakes -> UAF.

> > Another interesting question is: Did that only work by coincidence
> > during the last 15 years, or is it by design that the check in
> > pci_intx():
> >
> > 	if (new != pci_command)
> >
> > only evaluates to true if we are not in a detach path.
>
> I would say that it evaluates to true for any device using intx,
> which tends to be rare these days. In such a case, enabling on device
> attach and disabling on detach would lead to new != pci_command,
> including in the hidden intx disable path triggered by disabling MSI.
> But the devres being recreated on detach definitely seems to depend
> on the disabling order though. So we may have been lucky indeed.

Hm. I hope you are right. If not, then other drivers would experience
the same regression you found in AHCI.

Let's keep our eyes open...

P.
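[Editor's note: a hedged sketch of the pci_intx_unmanaged() hack mentioned above. pci_intx_unmanaged() is only a name floated in the mail; pci_intx_for_msi() and PCI_DEV_FLAGS_MSI_INTX_DISABLE_BUG exist in mainline, but the body shown is a paraphrase. The point is that MSI teardown would no longer reach the devres-aware pci_intx(), so no resource could be created behind devres_release_all()'s back.]

/* Hypothetical: the raw PCI_COMMAND update from pci_intx(), minus the
 * devres voodoo block quoted earlier in this thread. */
void pci_intx_unmanaged(struct pci_dev *pdev, int enable);

void pci_intx_for_msi(struct pci_dev *dev, int enable)
{
	/* Paraphrased quirk check from drivers/pci/msi/msi.c */
	if (!(dev->dev_flags & PCI_DEV_FLAGS_MSI_INTX_DISABLE_BUG))
		pci_intx_unmanaged(dev, enable);
}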
On Thu, 25 Jul 2024 14:07:30 +0200
Philipp Stanner <pstanner@redhat.com> wrote:
> pci_intx() is a function that becomes managed if pcim_enable_device()
> has been called in advance. Commit 25216afc9db5 ("PCI: Add managed
> pcim_intx()") changed this behavior so that pci_intx() always leads to
> creation of a separate device resource for itself, whereas earlier, a
> shared resource was used for all PCI devres operations.
>
> Unfortunately, pci_intx() seems to be used in some drivers' remove()
> paths; in the managed case this causes a device resource to be created
> on driver detach.
>
> Fix the regression by only redirecting pci_intx() to its managed twin
> pcim_intx() if the pci_command changes.
>
> Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")

I'm seeing another issue from this, which is maybe a more general
problem with managed mode. In my case I'm using vfio-pci to assign an
ahci controller to a VM.

ahci_init_one() calls pcim_enable_device() which sets is_managed =
true. I notice that nothing ever sets is_managed to false. Therefore
now when I call pci_intx() from vfio-pci under spinlock, I get a
lockdep warning as I now go through the pcim_intx() code after
25216afc9db5, since the previous driver was managed.

It seems like we should be setting is_managed to false in the driver
release path, right? Thanks,

Alex
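[Editor's note: a minimal sketch of the fix direction Alex suggests, under two assumptions: that pcim_disable_device() is, as Philipp explains below, the devres callback counterpart of pcim_enable_device(), and that the 'pinned' check mirrors the current function. The body is illustrative, not the actual patch.]

static void pcim_disable_device(void *pdev_raw)
{
	struct pci_dev *pdev = pdev_raw;

	if (!pdev->pinned)
		pci_disable_device(pdev);

	/*
	 * Proposed: the device stops being "managed" once the managed
	 * driver unbinds, so the next driver (e.g. vfio-pci) gets the
	 * plain, non-allocating pci_intx() behavior it expects.
	 */
	pdev->is_managed = false;
}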
On Tue, 2024-09-03 at 09:44 -0600, Alex Williamson wrote:
> On Thu, 25 Jul 2024 14:07:30 +0200
> Philipp Stanner <pstanner@redhat.com> wrote:
> [... commit message quoted above ...]
>
> I'm seeing another issue from this, which is maybe a more general
> problem with managed mode. In my case I'm using vfio-pci to assign an
> ahci controller to a VM.

"In my case" doesn't mean OOT, does it? I can't fully follow.

> ahci_init_one() calls pcim_enable_device() which sets is_managed =
> true. I notice that nothing ever sets is_managed to false. Therefore
> now when I call pci_intx() from vfio-pci under spinlock, I get a
> lockdep warning

I suppose you see the lockdep warning because the new pcim_intx() can
now allocate, whereas before 25216afc9db5 it was pcim_enable_device()
which allocated *everything* related to PCI devres.

> as I now go through the pcim_intx() code after 25216afc9db5

You always went through pcim_intx()'s logic. The issue seems to be that
the allocation step was moved.

> since the previous driver was managed.

What do you mean by "previous driver"?

> It seems like we should be setting is_managed to false in the driver
> release path, right?

So the issue seems to be that the same struct pci_dev can be used by
different drivers, is that correct?

If so, I think that can be addressed through having
pcim_disable_device() set is_managed to false as you suggest.

Another solution one could at least consider would be to use GFP_ATOMIC
for the allocation in get_or_create_intx_devres(). I suppose your
solution is the better one, though.

P.

> Thanks,
>
> Alex
On 2024/09/04 16:06, Philipp Stanner wrote:
> On Tue, 2024-09-03 at 09:44 -0600, Alex Williamson wrote:
> > [... quoted report snipped; see Alex's mail above ...]
> > since the previous driver was managed.
>
> What do you mean by "previous driver"?

The AHCI driver... When attaching a PCI dev to vfio to e.g. passthrough
to a VM, the device driver must first be unbound and the device bound
to vfio-pci. So we switch from the ahci/libata driver to vfio. When
vfio tries to enable intx with is_managed still true from the use of
the device by ahci, problems happen.

> > It seems like we should be setting is_managed to false in the
> > driver release path, right?
>
> So the issue seems to be that the same struct pci_dev can be used by
> different drivers, is that correct?
>
> If so, I think that can be addressed through having
> pcim_disable_device() set is_managed to false as you suggest.
>
> Another solution one could at least consider would be to use
> GFP_ATOMIC for the allocation in get_or_create_intx_devres().

If it is allowed to call pci_intx() under a spin_lock, then we need
GFP_ATOMIC. If not, then vfio-pci needs to move the call out of the
spinlock. Either solution must be implemented regardless of the fix to
set is_managed to false.

So what context is allowed to call pci_intx()? The current kdoc comment
does not say...
On Wed, 04 Sep 2024 09:06:37 +0200
Philipp Stanner <pstanner@redhat.com> wrote:
> On Tue, 2024-09-03 at 09:44 -0600, Alex Williamson wrote:
> > [... quoted report snipped; see above ...]
> > I'm seeing another issue from this, which is maybe a more general
> > problem with managed mode. In my case I'm using vfio-pci to assign
> > an ahci controller to a VM.
>
> "In my case" doesn't mean OOT, does it? I can't fully follow.

"OOT" Out Of Tree? No, "In my case" is simply introducing the scenario
in which I see the issue. vfio-pci is an in-tree driver used to attach
devices to userspace drivers, such as QEMU. The ahci driver is loaded
during system boot, setting the is_managed flag. The ahci driver is
then unbound from the device and the vfio-pci driver is bound. The
vfio-pci driver provides a uAPI for userspace drivers to operate a
device in an IOMMU-protected context.

> > ahci_init_one() calls pcim_enable_device() which sets is_managed =
> > true. I notice that nothing ever sets is_managed to false.
> > Therefore now when I call pci_intx() from vfio-pci under spinlock,
> > I get a lockdep warning
>
> I suppose you see the lockdep warning because the new pcim_intx() can
> now allocate, whereas before 25216afc9db5 it was pcim_enable_device()
> which allocated *everything* related to PCI devres.
>
> > as I now go through the pcim_intx() code after 25216afc9db5
>
> You always went through pcim_intx()'s logic. The issue seems to be
> that the allocation step was moved.

Unintentionally, yes, I believe so. vfio-pci is not a managed, devres
driver and therefore had no expectation of using the managed code path.

> > since the previous driver was managed.
>
> What do you mean by "previous driver"?

As noted, the ahci driver is first bound to the device at boot,
unbound, and the vfio-pci driver bound to the device. The ahci driver
is the previous driver.

> > It seems like we should be setting is_managed to false in the
> > driver release path, right?
>
> So the issue seems to be that the same struct pci_dev can be used by
> different drivers, is that correct?

Yes, and more generically, the driver release should undo everything
that has been configured by the driver probe.

> If so, I think that can be addressed through having
> pcim_disable_device() set is_managed to false as you suggest.

If that's sufficient and drivers only call pcim_disable_device() in
their release function. I also note that f748a07a0b64 ("PCI: Remove
legacy pcim_release()") claims that:

  Thanks to preceding cleanup steps, pcim_release() is now not needed
  anymore and can be replaced by pcim_disable_device(), which is the
  exact counterpart to pcim_enable_device().

However, that's not accurate, as pcim_enable_device() adds a devm
action, unconditionally calls pci_enable_device() and sets is_managed
to true. If we assume pcim_pin_device() is a valid concept, don't we
still need to remove the devm action as well?

> Another solution one could at least consider would be to use
> GFP_ATOMIC for the allocation in get_or_create_intx_devres().

If we look at what pci_intx() does without devres, it's simply reading
and setting or clearing a bit in config space. I can attest that a
driver author would have no expectation that such a function allocates
memory, and there are scenarios where we want to call this with
interrupts disabled, such as within an interrupt context. So, TBH, it
might make sense to consider whether an allocation in this path is
appropriate at all, but I'm obviously no expert in devres.

> I suppose your solution is the better one, though.

I see you've posted a patch, I'll test it as soon as I'm able. Thanks,

Alex
On Wed, 2024-09-04 at 06:57 -0600, Alex Williamson wrote:
> On Wed, 04 Sep 2024 09:06:37 +0200
> Philipp Stanner <pstanner@redhat.com> wrote:
> > [... quoted exchange snipped; see above ...]
>
> "OOT" Out Of Tree? No, "In my case" is simply introducing the
> scenario in which I see the issue. vfio-pci is an in-tree driver used
> to attach devices to userspace drivers, such as QEMU. The ahci driver
> is loaded during system boot, setting the is_managed flag. The ahci
> driver is then unbound from the device and the vfio-pci driver is
> bound. The vfio-pci driver provides a uAPI for userspace drivers to
> operate a device in an IOMMU-protected context.

Alright, thx for the clarification.

> > You always went through pcim_intx()'s logic. The issue seems to be
> > that the allocation step was moved.
>
> Unintentionally, yes, I believe so. vfio-pci is not a managed, devres
> driver and therefore had no expectation of using the managed code
> path.

Yes, I agree this needs to be fixed through the solution you proposed.

> > So the issue seems to be that the same struct pci_dev can be used
> > by different drivers, is that correct?
>
> Yes, and more generically, the driver release should undo everything
> that has been configured by the driver probe.
>
> > If so, I think that can be addressed through having
> > pcim_disable_device() set is_managed to false as you suggest.
>
> If that's sufficient and drivers only call pcim_disable_device() in
> their release function.

pcim_disable_device() is not intended to be used directly by drivers.
It's basically the devres callback for pcim_enable_device() and is
called in everyone's release path automatically by devres. (I agree
that the naming is not superb.)

> I also note that f748a07a0b64 ("PCI: Remove legacy pcim_release()")
> claims that:
>
>   Thanks to preceding cleanup steps, pcim_release() is now not needed
>   anymore and can be replaced by pcim_disable_device(), which is the
>   exact counterpart to pcim_enable_device().
>
> However, that's not accurate, as pcim_enable_device() adds a devm
> action, unconditionally calls pci_enable_device() and sets is_managed
> to true.

It's not accurate with regard to is_managed. The rest is fine IMO. The
devres callback shall be added, and the unconditional call to
pci_enable_device() is also desired.

> If we assume pcim_pin_device() is a valid concept, don't we still
> need to remove the devm action as well?

No. As pcim_disable_device() is the very devres callback, it does not
need to remove itself. Devres calls it once and then it's gone.

However, pcim_pin_device() IMO is not a valid concept. Whoever wants
such behavior shall IMO use pci_enable_device() and
pci_disable_device() manually. I proposed removing it here:
https://lore.kernel.org/all/20240822073815.12365-2-pstanner@redhat.com/

(Note that previously it could not be removed because
pcim_enable_device() also allocated all the stuff needed by
pci_request_region() etc.)

> > Another solution one could at least consider would be to use
> > GFP_ATOMIC for the allocation in get_or_create_intx_devres().
>
> If we look at what pci_intx() does without devres, it's simply
> reading and setting or clearing a bit in config space. I can attest
> that a driver author would have no expectation that such a function
> allocates memory

A driver author would not have any expectation of a function implicitly
doing anything with devres. That's the reason why I did all this work
in the first place: to, ultimately, remove this hybrid behavior from
all pci_ functions, so that devres functions are clearly marked with
pcim_.

That is, however, not that easy, because everyone who uses
pcim_enable_device() in combination with pci_intx() first has to be
ported to a pci_intx() version that has nothing to do with devres.

> and there are scenarios where we want to call this with interrupts
> disabled, such as within an interrupt context. So, TBH, it might make
> sense to consider whether an allocation in this path is appropriate
> at all, but I'm obviously no expert in devres.

The entire hybrid nature of pci_intx() should be removed. I'm working
on that. The hybrid nature was always there, but the allocation was
not. It would be removed later together with the hybrid devres usage.

> I suppose your solution is the better one, though.
>
> I see you've posted a patch, I'll test it as soon as I'm able.

If it works, then from what I understand that should solve those issues
for now, until we can phase out pci_intx() usage that relies on the
device resource.

---

Btw., I just looked into the old code and found that this one also
already had a similar half-bug with is_managed. It never sets it to
false again, but pci_intx() runs into:

static struct pci_devres *find_pci_dr(struct pci_dev *pdev)
{
	if (pci_is_managed(pdev))
		return devres_find(&pdev->dev, pcim_release, NULL, NULL);

	return NULL;
}

So in your case pci_is_managed() would have also been true, and the
only reason no problem occurred is that devres_find() doesn't find the
device resource of the previous driver anymore, so pci_intx() won't use
that memory.

---

Thanks for debugging,
P.

> Thanks,
>
> Alex
On Wed, 2024-09-04 at 17:25 +0900, Damien Le Moal wrote:
> On 2024/09/04 16:06, Philipp Stanner wrote:
> > On Tue, 2024-09-03 at 09:44 -0600, Alex Williamson wrote:
> > > On Thu, 25 Jul 2024 14:07:30 +0200
> > > Philipp Stanner <pstanner@redhat.com> wrote:
> > >
> > > > pci_intx() is a function that becomes managed if
> > > > pcim_enable_device() has been called in advance. [...]
> > > >
> > > > Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
> > >
> > > I'm seeing another issue from this, which is maybe a more general
> > > problem with managed mode. In my case I'm using vfio-pci to assign
> > > an ahci controller to a VM.
> >
> > "In my case" doesn't mean OOT, does it? I can't fully follow.
> >
> > > ahci_init_one() calls pcim_enable_device() which sets is_managed
> > > = true. I notice that nothing ever sets is_managed to false.
> > > Therefore now when I call pci_intx() from vfio-pci under
> > > spinlock, I get a lockdep warning
> >
> > I suppose you see the lockdep warning because the new pcim_intx()
> > can now allocate, whereas before 25216afc9db5 it was
> > pcim_enable_device() which allocated *everything* related to PCI
> > devres.
> >
> > > as I now go through pcim_intx() code after 25216afc9db5
> >
> > You always went through pcim_intx()'s logic. The issue seems to be
> > that the allocation step was moved.
> >
> > > since the previous driver was managed.
> >
> > What do you mean by "previous driver"?
>
> The AHCI driver... When attaching a PCI dev to vfio to e.g.
> passthrough to a VM, the device driver must first be unbound and the
> device bound to vfio-pci. So we switch from the ahci/libata driver to
> vfio. When vfio tries to enable INTx with is_managed still true from
> the use of the device by ahci, problems happen.
>
> > > It seems like we should be setting is_managed to false in the
> > > driver release path, right?
> >
> > So the issue seems to be that the same struct pci_dev can be used
> > by different drivers, is that correct?
> >
> > If so, I think that can be addressed through having
> > pcim_disable_device() set is_managed to false as you suggest.
> >
> > Another solution one could at least consider would be to use
> > GFP_ATOMIC for the allocation in get_or_create_intx_devres().
>
> If it is allowed to call pci_intx() under a spin_lock, then we need
> GFP_ATOMIC. If not, then vfio-pci needs to move the call out of the
> spinlock.

If vfio-pci can get rid of pci_intx() altogether, that might be a good
thing. As far as I understood Andy Shevchenko, pci_intx() is outdated.
There's only a handful of users anyway.

Best solution would be to avoid GFP_ATOMIC and see first if setting
is_managed = false solves the reported problem for now. Other
problematic users should hopefully be found through lockdep, too.
Though I think they are unlikely to occur.

> Either solution must be implemented regardless of the fix to set
> is_managed to false.

Yes

> So what context is allowed to call pci_intx() ? The current kdoc
> comment does not say...

The old pci_intx() did not allocate. It only calls
pci_read_config_word() and pci_write_config_word(). If those cannot
block etc., it should be safe from any context. Though I'd like to
hear from one of the maintainers about it.

The new version allocates, if pcim_enable_device() was called, when it
runs for the first time. That first run would then be illegal in
must-not-sleep contexts.

P.
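To illustrate the context problem described above, a hypothetical
caller (lock and pdev are placeholders, not code from any driver named
in this thread):

unsigned long flags;

spin_lock_irqsave(&lock, flags);	/* atomic context */
/*
 * If pcim_enable_device() ran earlier, the first pci_intx() call
 * allocates its devres entry with GFP_KERNEL, which may sleep:
 * that is the lockdep splat Alex reports.
 */
pci_intx(pdev, 0);
spin_unlock_irqrestore(&lock, flags);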
On Wed, 04 Sep 2024 15:37:25 +0200
Philipp Stanner <pstanner@redhat.com> wrote:
> On Wed, 2024-09-04 at 17:25 +0900, Damien Le Moal wrote:
> > On 2024/09/04 16:06, Philipp Stanner wrote:
> > > On Tue, 2024-09-03 at 09:44 -0600, Alex Williamson wrote:
[...]
> If vfio-pci can get rid of pci_intx() altogether, that might be a
> good thing. As far as I understood Andy Shevchenko, pci_intx() is
> outdated. There's only a handful of users anyway.

What's the alternative? vfio-pci has a potentially unique requirement
here: we don't know how to handle the device interrupt, we only
forward it to the userspace driver. As a level triggered interrupt,
INTx will continue to assert until that userspace driver handles the
device. That's obviously unacceptable from a host perspective, so INTx
is masked at the device via pci_intx() where available, or at the
interrupt controller otherwise. The API with the userspace driver
requires that driver to unmask the interrupt, again resulting in a
call to pci_intx() or unmasking the interrupt controller, in order to
receive further interrupts from the device. Thanks,

Alex
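A simplified sketch of the mask/unmask handshake Alex describes; the
struct, field, and function names are assumptions for illustration,
not the actual vfio-pci code. Note that it calls pci_intx() under a
spinlock, which is exactly the pattern the devres allocation now
breaks:

struct my_vdev {			/* hypothetical container */
	struct pci_dev *pdev;
	struct eventfd_ctx *trigger;	/* signals the userspace driver */
	spinlock_t irqlock;
};

static irqreturn_t my_intx_handler(int irq, void *data)
{
	struct my_vdev *vdev = data;

	/* mask at the device so the level-triggered IRQ stops asserting */
	spin_lock(&vdev->irqlock);
	pci_intx(vdev->pdev, 0);
	spin_unlock(&vdev->irqlock);

	eventfd_signal(vdev->trigger);	/* forward to userspace */
	return IRQ_HANDLED;
}

/* called when the userspace driver asks to unmask (e.g. via ioctl) */
static void my_intx_unmask(struct my_vdev *vdev)
{
	unsigned long flags;

	spin_lock_irqsave(&vdev->irqlock, flags);
	pci_intx(vdev->pdev, 1);	/* receive further interrupts */
	spin_unlock_irqrestore(&vdev->irqlock, flags);
}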
Wed, Sep 04, 2024 at 12:07:21PM -0600, Alex Williamson kirjoitti:
> On Wed, 04 Sep 2024 15:37:25 +0200
> Philipp Stanner <pstanner@redhat.com> wrote:
> > On Wed, 2024-09-04 at 17:25 +0900, Damien Le Moal wrote:

...

> > If vfio-pci can get rid of pci_intx() altogether, that might be a
> > good thing. As far as I understood Andy Shevchenko, pci_intx() is
> > outdated. There's only a handful of users anyway.
>
> What's the alternative?

From an API perspective, pci_alloc_irq_vectors() & Co should be used.

> vfio-pci has a potentially unique requirement here: we don't know
> how to handle the device interrupt, we only forward it to the
> userspace driver. As a level triggered interrupt, INTx will continue
> to assert until that userspace driver handles the device. That's
> obviously unacceptable from a host perspective, so INTx is masked at
> the device via pci_intx() where available, or at the interrupt
> controller otherwise. The API with the userspace driver requires
> that driver to unmask the interrupt, again resulting in a call to
> pci_intx() or unmasking the interrupt controller, in order to
> receive further interrupts from the device. Thanks,

I briefly read the discussion, and if I understand it correctly, the
problem here is in the flow: when the above-mentioned API is being
called. Hence it's a design (or architectural) level issue, and
changing a call from foo() to bar() won't magically make the problem
go away. But I might be mistaken.
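For reference, a minimal usage sketch of the API Andy points to
(my_handler, my_driver, and vdev are placeholders). It covers vector
allocation and teardown, not runtime device-level masking:

int ret, nvec;

/* request exactly one vector of any supported type (MSI-X/MSI/INTx) */
nvec = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
if (nvec < 0)
	return nvec;

ret = request_irq(pci_irq_vector(pdev, 0), my_handler, IRQF_SHARED,
		  "my_driver", vdev);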
On Wed, 4 Sep 2024 23:24:53 +0300
Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
> Wed, Sep 04, 2024 at 12:07:21PM -0600, Alex Williamson kirjoitti:
> > On Wed, 04 Sep 2024 15:37:25 +0200
> > Philipp Stanner <pstanner@redhat.com> wrote:
> > > On Wed, 2024-09-04 at 17:25 +0900, Damien Le Moal wrote:
>
> ...
>
> > > If vfio-pci can get rid of pci_intx() altogether, that might be
> > > a good thing. As far as I understood Andy Shevchenko, pci_intx()
> > > is outdated. There's only a handful of users anyway.
> >
> > What's the alternative?
>
> From an API perspective, pci_alloc_irq_vectors() & Co should be used.

We can't replace a device level INTx control with a vector allocation
function.

> > vfio-pci has a potentially unique requirement here: [...]
>
> I briefly read the discussion, and if I understand it correctly, the
> problem here is in the flow: when the above-mentioned API is being
> called. Hence it's a design (or architectural) level issue, and
> changing a call from foo() to bar() won't magically make the problem
> go away. But I might be mistaken.

Certainly from a vector allocation standpoint we can change to
whatever is preferred, but the direct INTx manipulation functions are
a different thing entirely, and afaik there's nothing else that can
replace them at a low level, nor can we just get rid of our calls to
pci_intx(). Thanks,

Alex
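The "device level INTx control" in question boils down to one bit in
the PCI command register; a simplified illustration (ignoring the
read-modify-write race that the real helper avoids):

u16 cmd;

pci_read_config_word(pdev, PCI_COMMAND, &cmd);
pci_write_config_word(pdev, PCI_COMMAND, cmd | PCI_COMMAND_INTX_DISABLE);  /* mask */
pci_write_config_word(pdev, PCI_COMMAND, cmd & ~PCI_COMMAND_INTX_DISABLE); /* unmask */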
On 2024/09/05 6:10, Alex Williamson wrote:
> On Wed, 4 Sep 2024 23:24:53 +0300
> Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
>> Wed, Sep 04, 2024 at 12:07:21PM -0600, Alex Williamson kirjoitti:
[...]
>> From an API perspective, pci_alloc_irq_vectors() & Co should be
>> used.
>
> We can't replace a device level INTx control with a vector
> allocation function.
>
[...]
> Certainly from a vector allocation standpoint we can change to
> whatever is preferred, but the direct INTx manipulation functions
> are a different thing entirely, and afaik there's nothing else that
> can replace them at a low level, nor can we just get rid of our
> calls to pci_intx(). Thanks,

But can these calls be moved out of the spinlock context? If not, then
we need to clarify that pci_intx() can be called from any context,
which will require changing to GFP_ATOMIC for the resource allocation,
even if the use case cannot trigger the allocation. This is needed to
ensure the correctness of pci_intx() use. Frankly, I am surprised that
the might_sleep() splat you got was not already reported before
(fuzzing and static analyzers might eventually catch that, though).

The other solution would be a version of pci_intx() that has a gfp
flags argument, to allow callers to use the right gfp flags for the
call context.

>
> Alex
>
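A sketch of Damien's GFP_ATOMIC option applied to the devres helper
mentioned earlier in the thread; this is simplified from
drivers/pci/devres.c, so treat the details as assumptions:

static struct pcim_intx_devres *get_or_create_intx_devres(struct device *dev)
{
	struct pcim_intx_devres *res;

	res = devres_find(dev, pcim_intx_restore, NULL, NULL);
	if (res)
		return res;

	/*
	 * GFP_ATOMIC would make the first pci_intx() call safe under a
	 * spinlock, at the price of dipping into the atomic reserves.
	 */
	res = devres_alloc(pcim_intx_restore, sizeof(*res), GFP_ATOMIC);
	if (res)
		devres_add(dev, res);

	return res;
}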
On Thu, 5 Sep 2024 09:33:35 +0900
Damien Le Moal <dlemoal@kernel.org> wrote:
> On 2024/09/05 6:10, Alex Williamson wrote:
[...]
> But can these calls be moved out of the spinlock context? If not,
> then we need to clarify that pci_intx() can be called from any
> context, which will require changing to GFP_ATOMIC for the resource
> allocation, even if the use case cannot trigger the allocation. This
> is needed to ensure the correctness of pci_intx() use. Frankly, I am
> surprised that the might_sleep() splat you got was not already
> reported before (fuzzing and static analyzers might eventually catch
> that, though).
>
> The other solution would be a version of pci_intx() that has a gfp
> flags argument, to allow callers to use the right gfp flags for the
> call context.

In vfio-pci we're trying to achieve mutual exclusion of the device
interrupt masking between IRQ context and userspace context, so the
problem really does not lend itself to pulling the pci_intx() call out
of an atomic context.

I'll also note again that from a non-devres perspective, pci_intx() is
only setting or clearing a bit in the command register, so it's a
hefty imposition to restrict the function in general for an allocation
in the devres path. Thanks,

Alex
On Thu, 2024-09-05 at 09:33 +0900, Damien Le Moal wrote:
> On 2024/09/05 6:10, Alex Williamson wrote:
[...]
> But can these calls be moved out of the spinlock context? If not,
> then we need to clarify that pci_intx() can be called from any
> context, which will require changing to GFP_ATOMIC for the resource
> allocation, even if the use case cannot trigger the allocation. This
> is needed to ensure the correctness of pci_intx() use.

We could do that, I guess. As I keep saying, it's not intended to have
pci_intx() allocate _permanently_. This is a temporary situation.
pci_intx() should have neither devres nor allocation.

> Frankly, I am surprised that the might_sleep() splat you got was not
> already reported before (fuzzing and static analyzers might
> eventually catch that, though).

It's a super rare situation:
 * pci_intx() has very few callers
 * It only allocates if pcim_enable_device() instead of
   pci_enable_device() ran.
 * It only allocates when it's called for the FIRST TIME
 * All of the above is only a problem while you hold a lock

> The other solution would be a version of pci_intx() that has a gfp
> flags argument, to allow callers to use the right gfp flags for the
> call context.

I don't think that's a good idea. As I said, I want to clean up all
that in the mid-term.

As a matter of fact, there is already __pcim_intx() in pci/devres.c, a
pure unmanaged pci_intx(), as a means to split and then clean up the
APIs.

One path towards getting the hybrid behavior out of pci_intx() could
be to rename __pcim_intx() to pci_intx_unmanaged() and port everyone
who uses pci_enable_device() + pci_intx() to that version. That would
be better than having a third version with a gfp_t argument.

P.
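A sketch of the proposed split; pci_intx_unmanaged() is the name
proposed above and not yet in the tree, and this assumes __pcim_intx()
is made callable outside devres.c:

void pci_intx_unmanaged(struct pci_dev *pdev, int enable)
{
	__pcim_intx(pdev, enable);	/* pure PCI_COMMAND toggle, no devres */
}
EXPORT_SYMBOL_GPL(pci_intx_unmanaged);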
On 9/5/24 16:13, Philipp Stanner wrote:
> On Thu, 2024-09-05 at 09:33 +0900, Damien Le Moal wrote:
[...]
> We could do that, I guess. As I keep saying, it's not intended to
> have pci_intx() allocate _permanently_. This is a temporary
> situation. pci_intx() should have neither devres nor allocation.
>
>> Frankly, I am surprised that the might_sleep() splat you got was
>> not already reported before (fuzzing and static analyzers might
>> eventually catch that, though).
>
> It's a super rare situation:
>  * pci_intx() has very few callers
>  * It only allocates if pcim_enable_device() instead of
>    pci_enable_device() ran.
>  * It only allocates when it's called for the FIRST TIME
>  * All of the above is only a problem while you hold a lock
>
>> The other solution would be a version of pci_intx() that has a gfp
>> flags argument, to allow callers to use the right gfp flags for the
>> call context.
>
> I don't think that's a good idea. As I said, I want to clean up all
> that in the mid-term.
>
> As a matter of fact, there is already __pcim_intx() in pci/devres.c,
> a pure unmanaged pci_intx(), as a means to split and then clean up
> the APIs.

Yeah. That naming is in fact confusing. __pcim_intx() should really be
named pci_intx()...

> One path towards getting the hybrid behavior out of pci_intx() could
> be to rename __pcim_intx() to pci_intx_unmanaged() and port everyone
> who uses pci_enable_device() + pci_intx() to that version. That
> would be better than having a third version with a gfp_t argument.

Sounds good. But ideally, all users that rely on the managed variant
should be converted to use pcim_intx(), and pci_intx() changed to not
call into devres. But that may be too much code churn (I have not
checked).
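The conversion Damien suggests, sketched for a hypothetical driver
(my_probe is a placeholder) that relies on the managed behavior:

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int ret;

	ret = pcim_enable_device(pdev);		/* managed enable */
	if (ret)
		return ret;

	/* explicit managed call; automatically undone on driver unbind */
	return pcim_intx(pdev, 1);
}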
On Fri, 2024-09-06 at 09:37 +0900, Damien Le Moal wrote:
> On 9/5/24 16:13, Philipp Stanner wrote:
[...]
> > As a matter of fact, there is already __pcim_intx() in
> > pci/devres.c, a pure unmanaged pci_intx(), as a means to split and
> > then clean up the APIs.
>
> Yeah. That naming is in fact confusing. __pcim_intx() should really
> be named pci_intx()...
>
> > One path towards getting the hybrid behavior out of pci_intx()
> > could be to rename __pcim_intx() to pci_intx_unmanaged() and port
> > everyone who uses pci_enable_device() + pci_intx() to that
> > version. That would be better than having a third version with a
> > gfp_t argument.
>
> Sounds good. But ideally, all users that rely on the managed variant
> should be converted to use pcim_intx(), and pci_intx() changed to
> not call into devres. But that may be too much code churn (I have
> not checked).

Exactly that is the plan. I looked into it yesterday, and my idea is
to publish __pcim_intx() under the name pci_intx_unmanaged(), then
port everyone who doesn't rely on managed behavior to that function,
and then port everyone who does rely on it to pcim_intx() (as you
suggest).

Afterwards you can remove pci_intx_unmanaged() again and also remove
the hybrid behavior from pci_intx().

It's doable with not that much code. Getting it merged might be
politically difficult, though.

The only thing I'm really a bit afraid of is the pci_intx() user in
pci/msi/ – MSI calls that through yet another implicit devres call,
pcim_msi_release(). I *suspect* that pci_intx() in MSI can just be
made pci_intx_unmanaged(), but that should be checked carefully. The
doubly implicit devres magic caused some trouble in the past already,
as we had discussed here [1].

P.

[1] https://lore.kernel.org/all/ee44ea7ac760e73edad3f20b30b4d2fff66c1a85.camel@redhat.com/
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e3a49f66982d..ffaaca0978cb 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4477,12 +4477,6 @@ void pci_intx(struct pci_dev *pdev, int enable)
 {
 	u16 pci_command, new;
 
-	/* Preserve the "hybrid" behavior for backwards compatibility */
-	if (pci_is_managed(pdev)) {
-		WARN_ON_ONCE(pcim_intx(pdev, enable) != 0);
-		return;
-	}
-
 	pci_read_config_word(pdev, PCI_COMMAND, &pci_command);
 
 	if (enable)
@@ -4490,8 +4484,15 @@ void pci_intx(struct pci_dev *pdev, int enable)
 	else
 		new = pci_command | PCI_COMMAND_INTX_DISABLE;
 
-	if (new != pci_command)
+	if (new != pci_command) {
+		/* Preserve the "hybrid" behavior for backwards compatibility */
+		if (pci_is_managed(pdev)) {
+			WARN_ON_ONCE(pcim_intx(pdev, enable) != 0);
+			return;
+		}
+
 		pci_write_config_word(pdev, PCI_COMMAND, new);
+	}
 }
 EXPORT_SYMBOL_GPL(pci_intx);
pci_intx() is a function that becomes managed if pcim_enable_device()
has been called in advance. Commit 25216afc9db5 ("PCI: Add managed
pcim_intx()") changed this behavior so that pci_intx() always leads to
creation of a separate device resource for itself, whereas earlier, a
shared resource was used for all PCI devres operations.

Unfortunately, pci_intx() seems to be used in some drivers' remove()
paths; in the managed case this causes a device resource to be created
on driver detach.

Fix the regression by only redirecting pci_intx() to its managed twin
pcim_intx() if the pci_command changes.

Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
Reported-by: Damien Le Moal <dlemoal@kernel.org>
Closes: https://lore.kernel.org/all/b8f4ba97-84fc-4b7e-ba1a-99de2d9f0118@kernel.org/
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
---
Alright, I reproduced this with QEMU as Damien described and this here
fixes the issue on my side. Feedback welcome. Thank you very much,
Damien.

It seems that this might yet again be the issue of drivers not being
aware that pci_intx() might become managed, so they use it in their
unwind path (rightfully so; there probably was no alternative back
then).

That will make the long-term cleanup difficult. But I think this for
now is the most elegant possible workaround.
---
 drivers/pci/pci.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)