diff mbox series

[net] net: dsa: microchip: Added the condition for scheduling ksz_mib_read_work

Message ID 20211008084348.7306-1-arun.ramadoss@microchip.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net] net: dsa: microchip: Added the condition for scheduling ksz_mib_read_work | expand

Checks

Context Check Description
netdev/cover_letter success Single patches do not need cover letters
netdev/fixes_present success Fixes tag present in non-next series
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/cc_maintainers success CCed 10 of 10 maintainers
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Fixes tag looks correct
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 20 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/header_inline success No static functions without inline keyword in header files

Commit Message

Arun Ramadoss Oct. 8, 2021, 8:43 a.m. UTC
When the ksz module is installed and removed using rmmod, kernel crashes
with null pointer dereferrence error. During rmmod, ksz_switch_remove
function tries to cancel the mib_read_workqueue using
cancel_delayed_work_sync routine.

At the end of  mib_read_workqueue execution, it again reschedule the
workqueue unconditionally. Due to which queue rescheduled after
mib_interval, during this execution it tries to access dp->slave. But
the slave is unregistered in the ksz_switch_remove function. Hence
kernel crashes.

To avoid this crash, before canceling the workqueue, resetted the
mib_interval to 0. In the work queue execution, it schedules the
workqueue next time only if the mib_interval is non zero.

Fixes: 469b390e1ba3 ("net: dsa: microchip: use delayed_work instead of timer + work")
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
---
 drivers/net/dsa/microchip/ksz_common.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Andrew Lunn Oct. 8, 2021, 1:58 p.m. UTC | #1
On Fri, Oct 08, 2021 at 02:13:48PM +0530, Arun Ramadoss wrote:
> When the ksz module is installed and removed using rmmod, kernel crashes
> with null pointer dereferrence error. During rmmod, ksz_switch_remove
> function tries to cancel the mib_read_workqueue using
> cancel_delayed_work_sync routine.
> 
> At the end of  mib_read_workqueue execution, it again reschedule the
> workqueue unconditionally. Due to which queue rescheduled after
> mib_interval, during this execution it tries to access dp->slave. But
> the slave is unregistered in the ksz_switch_remove function. Hence
> kernel crashes.

Something not correct here.

https://www.kernel.org/doc/html/latest/core-api/workqueue.html?highlight=delayed_work#c.cancel_delayed_work_sync

This is cancel_work_sync() for delayed works.

and

https://www.kernel.org/doc/html/latest/core-api/workqueue.html?highlight=delayed_work#c.cancel_work_sync

This function can be used even if the work re-queues itself or
migrates to another workqueue.

Maybe the real problem is a missing call to destroy_worker()?

      Andrew
Jakub Kicinski Oct. 8, 2021, 6:34 p.m. UTC | #2
On Fri, 8 Oct 2021 15:58:16 +0200 Andrew Lunn wrote:
> On Fri, Oct 08, 2021 at 02:13:48PM +0530, Arun Ramadoss wrote:
> > When the ksz module is installed and removed using rmmod, kernel crashes
> > with null pointer dereferrence error. During rmmod, ksz_switch_remove
> > function tries to cancel the mib_read_workqueue using
> > cancel_delayed_work_sync routine.
> > 
> > At the end of  mib_read_workqueue execution, it again reschedule the
> > workqueue unconditionally. Due to which queue rescheduled after
> > mib_interval, during this execution it tries to access dp->slave. But
> > the slave is unregistered in the ksz_switch_remove function. Hence
> > kernel crashes.  
> 
> Something not correct here.
> 
> https://www.kernel.org/doc/html/latest/core-api/workqueue.html?highlight=delayed_work#c.cancel_delayed_work_sync
> 
> This is cancel_work_sync() for delayed works.
> 
> and
> 
> https://www.kernel.org/doc/html/latest/core-api/workqueue.html?highlight=delayed_work#c.cancel_work_sync
> 
> This function can be used even if the work re-queues itself or
> migrates to another workqueue.
> 
> Maybe the real problem is a missing call to destroy_worker()?

Also the cancel_delayed_work_sync() is suspiciously early in the remove
flow. There is a schedule_work call in ksz_mac_link_down() which may 
schedule the work back in. That'd also explain why the patch helps since
ksz_mac_link_down() only schedules if (dev->mib_read_interval).
Arun Ramadoss Oct. 11, 2021, 9:41 a.m. UTC | #3
On Fri, 2021-10-08 at 11:34 -0700, Jakub Kicinski wrote:
> EXTERNAL EMAIL: Do not click links or open attachments unless you
> know the content is safe
> 
> On Fri, 8 Oct 2021 15:58:16 +0200 Andrew Lunn wrote:
> > On Fri, Oct 08, 2021 at 02:13:48PM +0530, Arun Ramadoss wrote:
> > > When the ksz module is installed and removed using rmmod, kernel
> > > crashes
> > > with null pointer dereferrence error. During rmmod,
> > > ksz_switch_remove
> > > function tries to cancel the mib_read_workqueue using
> > > cancel_delayed_work_sync routine.
> > > 
> > > At the end of  mib_read_workqueue execution, it again reschedule
> > > the
> > > workqueue unconditionally. Due to which queue rescheduled after
> > > mib_interval, during this execution it tries to access dp->slave. 
> > > But
> > > the slave is unregistered in the ksz_switch_remove function.
> > > Hence
> > > kernel crashes.
> > 
> > Something not correct here.
> > 
> > 
https://www.kernel.org/doc/html/latest/core-api/workqueue.html?highlight=delayed_work#c.cancel_delayed_work_sync
> > 
> > This is cancel_work_sync() for delayed works.
> > 
> > and
> > 
> > 
https://www.kernel.org/doc/html/latest/core-api/workqueue.html?highlight=delayed_work#c.cancel_work_sync
> > 
> > This function can be used even if the work re-queues itself or
> > migrates to another workqueue.
> > 
> > Maybe the real problem is a missing call to destroy_worker()?
> 
> Also the cancel_delayed_work_sync() is suspiciously early in the
> remove
> flow. There is a schedule_work call in ksz_mac_link_down() which may
> schedule the work back in. That'd also explain why the patch helps
> since
> ksz_mac_link_down() only schedules if (dev->mib_read_interval).
In this patch, I did two things. Added the if condition for
rescheduling the queue and other is resetted the mib_read_interval to
zero.
As per the cancel_delayed_queue_sync() functionality, Now I tried rmod
after removing the if condition for resheduling the queue,kernel didn't
crash. So, concluded that it is not rearm in ksz_mib_read_work  is
causing the problem but it is due to scheduling in the
ksz_mac_link_down function. This function is called due to the
dsa_unregister_switch. Due to resetting of the mib_read_interval to
zero in switch_remove, the queue is not scheduled in mac_link_down, so
kernel didn't crash.

And also, as per suggestion on cancel_delayed_work_sync() is
suspiciously placed in switch_remove. I undo this patch, and tried just
by moving the canceling of delayed_work after the dsa_unregister_switch
function. As expected dsa_unregister_switch calls the
ksz_mac_link_down, which schedules the mib_read_work. Now, when
cancel_delayed_work_sync is called, it cancels all the workqueue. As a
result, module is removed successfully without kernel crash.

Can I send the updated patch as v1 or new patch with updated commit
message and description.
Jakub Kicinski Oct. 11, 2021, 1:45 p.m. UTC | #4
On Mon, 11 Oct 2021 09:41:55 +0000 Arun.Ramadoss@microchip.com wrote:
> On Fri, 2021-10-08 at 11:34 -0700, Jakub Kicinski wrote:
> > Also the cancel_delayed_work_sync() is suspiciously early in the
> > remove
> > flow. There is a schedule_work call in ksz_mac_link_down() which may
> > schedule the work back in. That'd also explain why the patch helps
> > since
> > ksz_mac_link_down() only schedules if (dev->mib_read_interval).  
> In this patch, I did two things. Added the if condition for
> rescheduling the queue and other is resetted the mib_read_interval to
> zero.
> As per the cancel_delayed_queue_sync() functionality, Now I tried rmod
> after removing the if condition for resheduling the queue,kernel didn't
> crash. So, concluded that it is not rearm in ksz_mib_read_work  is
> causing the problem but it is due to scheduling in the
> ksz_mac_link_down function. This function is called due to the
> dsa_unregister_switch. Due to resetting of the mib_read_interval to
> zero in switch_remove, the queue is not scheduled in mac_link_down, so
> kernel didn't crash.
> 
> And also, as per suggestion on cancel_delayed_work_sync() is
> suspiciously placed in switch_remove. I undo this patch, and tried just
> by moving the canceling of delayed_work after the dsa_unregister_switch
> function. As expected dsa_unregister_switch calls the
> ksz_mac_link_down, which schedules the mib_read_work. Now, when
> cancel_delayed_work_sync is called, it cancels all the workqueue. As a
> result, module is removed successfully without kernel crash.
> 
> Can I send the updated patch as v1 or new patch with updated commit
> message and description.

Please send a patch with just the second chunk (zeroing
mib_read_interval), you can mark it as v2.
diff mbox series

Patch

diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c
index 1542bfb8b5e5..ffc8e6fb300a 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -94,7 +94,8 @@  static void ksz_mib_read_work(struct work_struct *work)
 		mutex_unlock(&mib->cnt_mutex);
 	}
 
-	schedule_delayed_work(&dev->mib_read, dev->mib_read_interval);
+	if (dev->mib_read_interval)
+		schedule_delayed_work(&dev->mib_read, dev->mib_read_interval);
 }
 
 void ksz_init_mib_timer(struct ksz_device *dev)
@@ -449,8 +450,10 @@  EXPORT_SYMBOL(ksz_switch_register);
 void ksz_switch_remove(struct ksz_device *dev)
 {
 	/* timer started */
-	if (dev->mib_read_interval)
+	if (dev->mib_read_interval) {
+		dev->mib_read_interval = 0;
 		cancel_delayed_work_sync(&dev->mib_read);
+	}
 
 	dev->dev_ops->exit(dev);
 	dsa_unregister_switch(dev->ds);