Message ID | 20230726-doe-fix-v1-1-af07e614d4dd@intel.com (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | PCI/DOE: Fix destroy_work_on_stack() race | expand |
On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote: > The following debug object splat was observed in testing. [...] > This occurs because destroy_work_on_stack() was called after signaling > the completion in the calling thread. This creates a race between > destroy_work_on_stack() and the task->work struct going of scope in the > pci_doe(). > > Signal the work complete after destroying the work struct. This is safe > because signal_task_complete() is the final thing the work item does and > the workqueue code is careful not to access the work struct after. > > Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y") > Cc: Lukas Wunner <lukas@wunner.de> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Lukas Wunner <lukas@wunner.de> Thanks for catching this. The offending commit abf04be0e707 was applied by Dan. Not sure if that means he's going to apply this fix as well? Would require an ack from Bjorn in that case. Or Bjorn applies it. Thanks, Lukas
On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote: > The following debug object splat was observed in testing. > > [ 14.061937] ------------[ cut here ]------------ > [ 14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510 > [ 14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0 > ... > [ 14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work > [ 14.083485] RIP: 0010:debug_print_object+0x7d/0xb0 > ... > [ 14.116231] Call Trace: > [ 14.117652] <TASK> > [ 14.118958] ? debug_print_object+0x7d/0xb0 > [ 14.120782] ? __warn+0x7d/0x130 > [ 14.122399] ? debug_print_object+0x7d/0xb0 > [ 14.123746] ? report_bug+0x18d/0x1c0 > [ 14.125025] ? handle_bug+0x3c/0x80 > [ 14.126506] ? exc_invalid_op+0x13/0x60 > [ 14.127796] ? asm_exc_invalid_op+0x16/0x20 > [ 14.129380] ? debug_print_object+0x7d/0xb0 > [ 14.130688] ? debug_print_object+0x7d/0xb0 > [ 14.131997] ? __pfx_doe_statemachine_work+0x10/0x10 > [ 14.133597] debug_object_free.part.0+0x11b/0x150 > [ 14.134940] doe_statemachine_work+0x45e/0x510 > [ 14.136348] process_one_work+0x1d4/0x3c0 > ... > [ 14.161484] </TASK> > [ 14.162434] ---[ end trace 0000000000000000 ]--- > > This occurs because destroy_work_on_stack() was called after signaling > the completion in the calling thread. This creates a race between > destroy_work_on_stack() and the task->work struct going of scope in the > pci_doe(). > > Signal the work complete after destroying the work struct. This is safe > because signal_task_complete() is the final thing the work item does and > the workqueue code is careful not to access the work struct after. > > Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y") > Cc: Lukas Wunner <lukas@wunner.de> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Dan, let me know if you'd rather have me take this. > --- > drivers/pci/doe.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > index 1b97a5ab71a9..e3aab5edaf70 100644 > --- a/drivers/pci/doe.c > +++ b/drivers/pci/doe.c > @@ -293,8 +293,8 @@ static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *tas > static void signal_task_complete(struct pci_doe_task *task, int rv) > { > task->rv = rv; > - task->complete(task); > destroy_work_on_stack(&task->work); > + task->complete(task); > } > > static void signal_task_abort(struct pci_doe_task *task, int rv) > > --- > base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f > change-id: 20230726-doe-fix-f57943f9ea82 > > Best regards, > -- > Ira Weiny <ira.weiny@intel.com> >
Bjorn Helgaas wrote: > On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote: > > The following debug object splat was observed in testing. > > > > [ 14.061937] ------------[ cut here ]------------ > > [ 14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510 > > [ 14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0 > > ... > > [ 14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work > > [ 14.083485] RIP: 0010:debug_print_object+0x7d/0xb0 > > ... > > [ 14.116231] Call Trace: > > [ 14.117652] <TASK> > > [ 14.118958] ? debug_print_object+0x7d/0xb0 > > [ 14.120782] ? __warn+0x7d/0x130 > > [ 14.122399] ? debug_print_object+0x7d/0xb0 > > [ 14.123746] ? report_bug+0x18d/0x1c0 > > [ 14.125025] ? handle_bug+0x3c/0x80 > > [ 14.126506] ? exc_invalid_op+0x13/0x60 > > [ 14.127796] ? asm_exc_invalid_op+0x16/0x20 > > [ 14.129380] ? debug_print_object+0x7d/0xb0 > > [ 14.130688] ? debug_print_object+0x7d/0xb0 > > [ 14.131997] ? __pfx_doe_statemachine_work+0x10/0x10 > > [ 14.133597] debug_object_free.part.0+0x11b/0x150 > > [ 14.134940] doe_statemachine_work+0x45e/0x510 > > [ 14.136348] process_one_work+0x1d4/0x3c0 > > ... > > [ 14.161484] </TASK> > > [ 14.162434] ---[ end trace 0000000000000000 ]--- > > > > This occurs because destroy_work_on_stack() was called after signaling > > the completion in the calling thread. This creates a race between > > destroy_work_on_stack() and the task->work struct going of scope in the > > pci_doe(). > > > > Signal the work complete after destroying the work struct. This is safe > > because signal_task_complete() is the final thing the work item does and > > the workqueue code is careful not to access the work struct after. > > > > Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y") > > Cc: Lukas Wunner <lukas@wunner.de> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > Acked-by: Bjorn Helgaas <bhelgaas@google.com> > Dan, let me know if you'd rather have me take this. I have nothing else in the DOE area at this time, so I would be happy if you took it. It is self contained to drivers/pci/. Much appreciated. Acked-by: Dan Williams <dan.j.williams@intel.com>
On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote: > The following debug object splat was observed in testing. > > [ 14.061937] ------------[ cut here ]------------ > [ 14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510 > [ 14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0 > ... > [ 14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work > [ 14.083485] RIP: 0010:debug_print_object+0x7d/0xb0 > ... > [ 14.116231] Call Trace: > [ 14.117652] <TASK> > [ 14.118958] ? debug_print_object+0x7d/0xb0 > [ 14.120782] ? __warn+0x7d/0x130 > [ 14.122399] ? debug_print_object+0x7d/0xb0 > [ 14.123746] ? report_bug+0x18d/0x1c0 > [ 14.125025] ? handle_bug+0x3c/0x80 > [ 14.126506] ? exc_invalid_op+0x13/0x60 > [ 14.127796] ? asm_exc_invalid_op+0x16/0x20 > [ 14.129380] ? debug_print_object+0x7d/0xb0 > [ 14.130688] ? debug_print_object+0x7d/0xb0 > [ 14.131997] ? __pfx_doe_statemachine_work+0x10/0x10 > [ 14.133597] debug_object_free.part.0+0x11b/0x150 > [ 14.134940] doe_statemachine_work+0x45e/0x510 > [ 14.136348] process_one_work+0x1d4/0x3c0 > ... > [ 14.161484] </TASK> > [ 14.162434] ---[ end trace 0000000000000000 ]--- > > This occurs because destroy_work_on_stack() was called after signaling > the completion in the calling thread. This creates a race between > destroy_work_on_stack() and the task->work struct going of scope in the > pci_doe(). > > Signal the work complete after destroying the work struct. This is safe > because signal_task_complete() is the final thing the work item does and > the workqueue code is careful not to access the work struct after. > > Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y") > Cc: Lukas Wunner <lukas@wunner.de> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> Thanks, applied to pci/misc with Lukas' reviewed-by and Dan's ack for v6.6. I edited out the timestamps and some of the call trace from the splat because they didn't seem relevant. > --- > drivers/pci/doe.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > index 1b97a5ab71a9..e3aab5edaf70 100644 > --- a/drivers/pci/doe.c > +++ b/drivers/pci/doe.c > @@ -293,8 +293,8 @@ static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *tas > static void signal_task_complete(struct pci_doe_task *task, int rv) > { > task->rv = rv; > - task->complete(task); > destroy_work_on_stack(&task->work); > + task->complete(task); > } > > static void signal_task_abort(struct pci_doe_task *task, int rv) > > --- > base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f > change-id: 20230726-doe-fix-f57943f9ea82 > > Best regards, > -- > Ira Weiny <ira.weiny@intel.com> >
Bjorn Helgaas wrote: > On Wed, Jul 26, 2023 at 11:29:42AM -0700, Ira Weiny wrote: > > The following debug object splat was observed in testing. > > > > [ 14.061937] ------------[ cut here ]------------ > > [ 14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510 > > [ 14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0 > > ... > > [ 14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work > > [ 14.083485] RIP: 0010:debug_print_object+0x7d/0xb0 > > ... > > [ 14.116231] Call Trace: > > [ 14.117652] <TASK> > > [ 14.118958] ? debug_print_object+0x7d/0xb0 > > [ 14.120782] ? __warn+0x7d/0x130 > > [ 14.122399] ? debug_print_object+0x7d/0xb0 > > [ 14.123746] ? report_bug+0x18d/0x1c0 > > [ 14.125025] ? handle_bug+0x3c/0x80 > > [ 14.126506] ? exc_invalid_op+0x13/0x60 > > [ 14.127796] ? asm_exc_invalid_op+0x16/0x20 > > [ 14.129380] ? debug_print_object+0x7d/0xb0 > > [ 14.130688] ? debug_print_object+0x7d/0xb0 > > [ 14.131997] ? __pfx_doe_statemachine_work+0x10/0x10 > > [ 14.133597] debug_object_free.part.0+0x11b/0x150 > > [ 14.134940] doe_statemachine_work+0x45e/0x510 > > [ 14.136348] process_one_work+0x1d4/0x3c0 > > ... > > [ 14.161484] </TASK> > > [ 14.162434] ---[ end trace 0000000000000000 ]--- > > > > This occurs because destroy_work_on_stack() was called after signaling > > the completion in the calling thread. This creates a race between > > destroy_work_on_stack() and the task->work struct going of scope in the > > pci_doe(). > > > > Signal the work complete after destroying the work struct. This is safe > > because signal_task_complete() is the final thing the work item does and > > the workqueue code is careful not to access the work struct after. > > > > Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y") > > Cc: Lukas Wunner <lukas@wunner.de> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > Thanks, applied to pci/misc with Lukas' reviewed-by and Dan's ack for > v6.6. I edited out the timestamps and some of the call trace from the > splat because they didn't seem relevant. > Thanks. I'll make sure to remove the timestamps in future. Ira
diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c index 1b97a5ab71a9..e3aab5edaf70 100644 --- a/drivers/pci/doe.c +++ b/drivers/pci/doe.c @@ -293,8 +293,8 @@ static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *tas static void signal_task_complete(struct pci_doe_task *task, int rv) { task->rv = rv; - task->complete(task); destroy_work_on_stack(&task->work); + task->complete(task); } static void signal_task_abort(struct pci_doe_task *task, int rv)
The following debug object splat was observed in testing. [ 14.061937] ------------[ cut here ]------------ [ 14.063899] ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510 [ 14.067480] WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0 ... [ 14.080951] Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work [ 14.083485] RIP: 0010:debug_print_object+0x7d/0xb0 ... [ 14.116231] Call Trace: [ 14.117652] <TASK> [ 14.118958] ? debug_print_object+0x7d/0xb0 [ 14.120782] ? __warn+0x7d/0x130 [ 14.122399] ? debug_print_object+0x7d/0xb0 [ 14.123746] ? report_bug+0x18d/0x1c0 [ 14.125025] ? handle_bug+0x3c/0x80 [ 14.126506] ? exc_invalid_op+0x13/0x60 [ 14.127796] ? asm_exc_invalid_op+0x16/0x20 [ 14.129380] ? debug_print_object+0x7d/0xb0 [ 14.130688] ? debug_print_object+0x7d/0xb0 [ 14.131997] ? __pfx_doe_statemachine_work+0x10/0x10 [ 14.133597] debug_object_free.part.0+0x11b/0x150 [ 14.134940] doe_statemachine_work+0x45e/0x510 [ 14.136348] process_one_work+0x1d4/0x3c0 ... [ 14.161484] </TASK> [ 14.162434] ---[ end trace 0000000000000000 ]--- This occurs because destroy_work_on_stack() was called after signaling the completion in the calling thread. This creates a race between destroy_work_on_stack() and the task->work struct going of scope in the pci_doe(). Signal the work complete after destroying the work struct. This is safe because signal_task_complete() is the final thing the work item does and the workqueue code is careful not to access the work struct after. Fixes: abf04be0e707 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y") Cc: Lukas Wunner <lukas@wunner.de> Signed-off-by: Ira Weiny <ira.weiny@intel.com> --- drivers/pci/doe.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f change-id: 20230726-doe-fix-f57943f9ea82 Best regards,