Message ID | 20241022172153.217890-1-jdamato@fastly.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [RFC,iwl-net] e1000: Hold RTNL when e1000_down can be called | expand |
On Tue, Oct 22, 2024 at 05:21:53PM +0000, Joe Damato wrote: > e1000_down calls netif_queue_set_napi, which assumes that RTNL is held. > > There are a few paths for e1000_down to be called in e1000 where RTNL is > not currently being held: > - e1000_shutdown (pci shutdown) > - e1000_suspend (power management) > - e1000_reinit_locked (via e1000_reset_task delayed work) > > Hold RTNL in two places to fix this issue: > - e1000_reset_task > - __e1000_shutdown (which is called from both e1000_shutdown and > e1000_suspend). It looks like there's one other spot I missed: e1000_io_error_detected (pci error handler) which should also hold rtnl_lock: + if (netif_running(netdev)) { + rtnl_lock(); e1000_down(adapter); + rtnl_unlock(); + } I can send that update in the v2, but I'll wait to see if Intel has suggestions on the below. > The other paths which call e1000_down seemingly hold RTNL and are OK: > - e1000_close (ndo_stop) > - e1000_change_mtu (ndo_change_mtu) > > I'm submitting this is as an RFC because: > - the e1000_reinit_locked issue appears very similar to commit > 21f857f0321d ("e1000e: add rtnl_lock() to e1000_reset_task"), which > fixes a similar issue in e1000e > > however > > - adding rtnl to e1000_reinit_locked seemingly conflicts with an > earlier e1000 commit b2f963bfaeba ("e1000: fix lockdep warning in > e1000_reset_task"). > > Hopefully Intel can weigh in and shed some light on the correct way to > go. > > Fixes: 8f7ff18a5ec7 ("e1000: Link NAPI instances to queues and IRQs") > Reported-by: Dmitry Antipov <dmantipov@yandex.ru> > Closes: https://lore.kernel.org/netdev/8cf62307-1965-46a0-a411-ff0080090ff9@yandex.ru/ > Signed-off-by: Joe Damato <jdamato@fastly.com>
On Tue, Oct 22, 2024 at 01:00:47PM -0700, Joe Damato wrote: > On Tue, Oct 22, 2024 at 05:21:53PM +0000, Joe Damato wrote: > > e1000_down calls netif_queue_set_napi, which assumes that RTNL is held. > > > > There are a few paths for e1000_down to be called in e1000 where RTNL is > > not currently being held: > > - e1000_shutdown (pci shutdown) > > - e1000_suspend (power management) > > - e1000_reinit_locked (via e1000_reset_task delayed work) > > > > Hold RTNL in two places to fix this issue: > > - e1000_reset_task > > - __e1000_shutdown (which is called from both e1000_shutdown and > > e1000_suspend). > > It looks like there's one other spot I missed: > > e1000_io_error_detected (pci error handler) which should also hold > rtnl_lock: > > + if (netif_running(netdev)) { > + rtnl_lock(); > e1000_down(adapter); > + rtnl_unlock(); > + } > > I can send that update in the v2, but I'll wait to see if Intel has suggestions > on the below. > > > The other paths which call e1000_down seemingly hold RTNL and are OK: > > - e1000_close (ndo_stop) > > - e1000_change_mtu (ndo_change_mtu) > > > > I'm submitting this is as an RFC because: > > - the e1000_reinit_locked issue appears very similar to commit > > 21f857f0321d ("e1000e: add rtnl_lock() to e1000_reset_task"), which > > fixes a similar issue in e1000e > > > > however > > > > - adding rtnl to e1000_reinit_locked seemingly conflicts with an > > earlier e1000 commit b2f963bfaeba ("e1000: fix lockdep warning in > > e1000_reset_task"). > > > > Hopefully Intel can weigh in and shed some light on the correct way to > > go. Regarding the above locations where rtnl_lock may need to be held, comparing to other intel drivers: - e1000_reset_task: it appears that igc, igb, and e100e all hold rtnl_lock in their reset_task functions, so I think adding an rtnl_lock / rtnl_unlock to e1000_reset_task should be OK, despite the existence of commit b2f963bfaeba ("e1000: fix lockdep warning in e1000_reset_task"). - e1000_io_error_detected: - e1000e temporarily obtains and drops rtnl in e1000e_pm_freeze - ixgbe holds rtnl in the same path (toward the bottom of ixgbe_io_error_detected) - igb does NOT hold rtnl in this path (as far as I can tell) - it was suggested in another thread to hold rtnl in this path for igc [1]. Given that it will be added to igc and is held in this same path in e1000e and ixgbe, I think it is safe to add it for e1000, as well. - e1000_shutdown: - igb holds rtnl in the same path, - e1000e temporarily holds it in this path (via e1000e_pm_freeze) - ixgbe holds rtnl in the same path So based on the recommendation for igc [1], and the precedent set in the other Intel drivers in most cases (except igb and the io_error path), I think adding rtnl to all 3 locations described above is correct. Please let me know if you all agree. Thanks for reviewing this. [1]: https://lore.kernel.org/netdev/40242f59-139a-4b45-8949-1210039f881b@intel.com/
On 10/22/2024 1:00 PM, Joe Damato wrote: > On Tue, Oct 22, 2024 at 05:21:53PM +0000, Joe Damato wrote: >> e1000_down calls netif_queue_set_napi, which assumes that RTNL is held. >> >> There are a few paths for e1000_down to be called in e1000 where RTNL is >> not currently being held: >> - e1000_shutdown (pci shutdown) >> - e1000_suspend (power management) >> - e1000_reinit_locked (via e1000_reset_task delayed work) >> >> Hold RTNL in two places to fix this issue: >> - e1000_reset_task >> - __e1000_shutdown (which is called from both e1000_shutdown and >> e1000_suspend). > > It looks like there's one other spot I missed: > > e1000_io_error_detected (pci error handler) which should also hold > rtnl_lock: > > + if (netif_running(netdev)) { > + rtnl_lock(); > e1000_down(adapter); > + rtnl_unlock(); > + } > > I can send that update in the v2, but I'll wait to see if Intel has suggestions > on the below. > >> The other paths which call e1000_down seemingly hold RTNL and are OK: >> - e1000_close (ndo_stop) >> - e1000_change_mtu (ndo_change_mtu) >> >> I'm submitting this is as an RFC because: >> - the e1000_reinit_locked issue appears very similar to commit >> 21f857f0321d ("e1000e: add rtnl_lock() to e1000_reset_task"), which >> fixes a similar issue in e1000e >> >> however >> >> - adding rtnl to e1000_reinit_locked seemingly conflicts with an >> earlier e1000 commit b2f963bfaeba ("e1000: fix lockdep warning in >> e1000_reset_task"). >> >> Hopefully Intel can weigh in and shed some light on the correct way to >> go. >> From my review, I think we need the RTNL lock around this function. The deadlocks mentions in the fix lockdep patch appear to be due to having an *extra* lock which could then cause issues. >> Fixes: 8f7ff18a5ec7 ("e1000: Link NAPI instances to queues and IRQs") >> Reported-by: Dmitry Antipov <dmantipov@yandex.ru> >> Closes: https://lore.kernel.org/netdev/8cf62307-1965-46a0-a411-ff0080090ff9@yandex.ru/ >> Signed-off-by: Joe Damato <jdamato@fastly.com>
On 10/22/2024 2:12 PM, Joe Damato wrote: > On Tue, Oct 22, 2024 at 01:00:47PM -0700, Joe Damato wrote: >> On Tue, Oct 22, 2024 at 05:21:53PM +0000, Joe Damato wrote: >>> e1000_down calls netif_queue_set_napi, which assumes that RTNL is held. >>> >>> There are a few paths for e1000_down to be called in e1000 where RTNL is >>> not currently being held: >>> - e1000_shutdown (pci shutdown) >>> - e1000_suspend (power management) >>> - e1000_reinit_locked (via e1000_reset_task delayed work) >>> >>> Hold RTNL in two places to fix this issue: >>> - e1000_reset_task >>> - __e1000_shutdown (which is called from both e1000_shutdown and >>> e1000_suspend). >> >> It looks like there's one other spot I missed: >> >> e1000_io_error_detected (pci error handler) which should also hold >> rtnl_lock: >> >> + if (netif_running(netdev)) { >> + rtnl_lock(); >> e1000_down(adapter); >> + rtnl_unlock(); >> + } >> >> I can send that update in the v2, but I'll wait to see if Intel has suggestions >> on the below. >> >>> The other paths which call e1000_down seemingly hold RTNL and are OK: >>> - e1000_close (ndo_stop) >>> - e1000_change_mtu (ndo_change_mtu) >>> >>> I'm submitting this is as an RFC because: >>> - the e1000_reinit_locked issue appears very similar to commit >>> 21f857f0321d ("e1000e: add rtnl_lock() to e1000_reset_task"), which >>> fixes a similar issue in e1000e >>> >>> however >>> >>> - adding rtnl to e1000_reinit_locked seemingly conflicts with an >>> earlier e1000 commit b2f963bfaeba ("e1000: fix lockdep warning in >>> e1000_reset_task"). >>> >>> Hopefully Intel can weigh in and shed some light on the correct way to >>> go. > > Regarding the above locations where rtnl_lock may need to be held, > comparing to other intel drivers: > > - e1000_reset_task: it appears that igc, igb, and e100e all hold > rtnl_lock in their reset_task functions, so I think adding an > rtnl_lock / rtnl_unlock to e1000_reset_task should be OK, > despite the existence of commit b2f963bfaeba ("e1000: fix > lockdep warning in e1000_reset_task"). > > - e1000_io_error_detected: > - e1000e temporarily obtains and drops rtnl in > e1000e_pm_freeze > - ixgbe holds rtnl in the same path (toward the bottom of > ixgbe_io_error_detected) > - igb does NOT hold rtnl in this path (as far as I can tell) > - it was suggested in another thread to hold rtnl in this path > for igc [1]. > > Given that it will be added to igc and is held in this same > path in e1000e and ixgbe, I think it is safe to add it for > e1000, as well. > > - e1000_shutdown: > - igb holds rtnl in the same path, > - e1000e temporarily holds it in this path (via > e1000e_pm_freeze) > - ixgbe holds rtnl in the same path > > So based on the recommendation for igc [1], and the precedent set in > the other Intel drivers in most cases (except igb and the io_error > path), I think adding rtnl to all 3 locations described above is > correct. > > Please let me know if you all agree. Thanks for reviewing this. > > [1]: https://lore.kernel.org/netdev/40242f59-139a-4b45-8949-1210039f881b@intel.com/ I agree with this assessment.
On Tue, Oct 22, 2024 at 02:15:27PM -0700, Jacob Keller wrote: > > > On 10/22/2024 2:12 PM, Joe Damato wrote: > > On Tue, Oct 22, 2024 at 01:00:47PM -0700, Joe Damato wrote: > >> On Tue, Oct 22, 2024 at 05:21:53PM +0000, Joe Damato wrote: > >>> e1000_down calls netif_queue_set_napi, which assumes that RTNL is held. > >>> > >>> There are a few paths for e1000_down to be called in e1000 where RTNL is > >>> not currently being held: > >>> - e1000_shutdown (pci shutdown) > >>> - e1000_suspend (power management) > >>> - e1000_reinit_locked (via e1000_reset_task delayed work) > >>> > >>> Hold RTNL in two places to fix this issue: > >>> - e1000_reset_task > >>> - __e1000_shutdown (which is called from both e1000_shutdown and > >>> e1000_suspend). > >> > >> It looks like there's one other spot I missed: > >> > >> e1000_io_error_detected (pci error handler) which should also hold > >> rtnl_lock: > >> > >> + if (netif_running(netdev)) { > >> + rtnl_lock(); > >> e1000_down(adapter); > >> + rtnl_unlock(); > >> + } > >> > >> I can send that update in the v2, but I'll wait to see if Intel has suggestions > >> on the below. > >> > >>> The other paths which call e1000_down seemingly hold RTNL and are OK: > >>> - e1000_close (ndo_stop) > >>> - e1000_change_mtu (ndo_change_mtu) > >>> > >>> I'm submitting this is as an RFC because: > >>> - the e1000_reinit_locked issue appears very similar to commit > >>> 21f857f0321d ("e1000e: add rtnl_lock() to e1000_reset_task"), which > >>> fixes a similar issue in e1000e > >>> > >>> however > >>> > >>> - adding rtnl to e1000_reinit_locked seemingly conflicts with an > >>> earlier e1000 commit b2f963bfaeba ("e1000: fix lockdep warning in > >>> e1000_reset_task"). > >>> > >>> Hopefully Intel can weigh in and shed some light on the correct way to > >>> go. > > > > Regarding the above locations where rtnl_lock may need to be held, > > comparing to other intel drivers: > > > > - e1000_reset_task: it appears that igc, igb, and e100e all hold > > rtnl_lock in their reset_task functions, so I think adding an > > rtnl_lock / rtnl_unlock to e1000_reset_task should be OK, > > despite the existence of commit b2f963bfaeba ("e1000: fix > > lockdep warning in e1000_reset_task"). > > > > - e1000_io_error_detected: > > - e1000e temporarily obtains and drops rtnl in > > e1000e_pm_freeze > > - ixgbe holds rtnl in the same path (toward the bottom of > > ixgbe_io_error_detected) > > - igb does NOT hold rtnl in this path (as far as I can tell) > > - it was suggested in another thread to hold rtnl in this path > > for igc [1]. > > > > Given that it will be added to igc and is held in this same > > path in e1000e and ixgbe, I think it is safe to add it for > > e1000, as well. > > > > - e1000_shutdown: > > - igb holds rtnl in the same path, > > - e1000e temporarily holds it in this path (via > > e1000e_pm_freeze) > > - ixgbe holds rtnl in the same path > > > > So based on the recommendation for igc [1], and the precedent set in > > the other Intel drivers in most cases (except igb and the io_error > > path), I think adding rtnl to all 3 locations described above is > > correct. > > > > Please let me know if you all agree. Thanks for reviewing this. > > > > > [1]: > https://lore.kernel.org/netdev/40242f59-139a-4b45-8949-1210039f881b@intel.com/ > > I agree with this assessment. Thanks for taking a look. I will send an official iwl-net PATCH with these changes once the 24 hour timer has expired.
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c index 4de9b156b2be..9ed99c75d59e 100644 --- a/drivers/net/ethernet/intel/e1000/e1000_main.c +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c @@ -3509,7 +3509,9 @@ static void e1000_reset_task(struct work_struct *work) container_of(work, struct e1000_adapter, reset_task); e_err(drv, "Reset adapter\n"); + rtnl_lock(); e1000_reinit_locked(adapter); + rtnl_unlock(); } /** @@ -5074,7 +5076,9 @@ static int __e1000_shutdown(struct pci_dev *pdev, bool *enable_wake) usleep_range(10000, 20000); WARN_ON(test_bit(__E1000_RESETTING, &adapter->flags)); + rtnl_lock(); e1000_down(adapter); + rtnl_unlock(); } status = er32(STATUS);
e1000_down calls netif_queue_set_napi, which assumes that RTNL is held. There are a few paths for e1000_down to be called in e1000 where RTNL is not currently being held: - e1000_shutdown (pci shutdown) - e1000_suspend (power management) - e1000_reinit_locked (via e1000_reset_task delayed work) Hold RTNL in two places to fix this issue: - e1000_reset_task - __e1000_shutdown (which is called from both e1000_shutdown and e1000_suspend). The other paths which call e1000_down seemingly hold RTNL and are OK: - e1000_close (ndo_stop) - e1000_change_mtu (ndo_change_mtu) I'm submitting this is as an RFC because: - the e1000_reinit_locked issue appears very similar to commit 21f857f0321d ("e1000e: add rtnl_lock() to e1000_reset_task"), which fixes a similar issue in e1000e however - adding rtnl to e1000_reinit_locked seemingly conflicts with an earlier e1000 commit b2f963bfaeba ("e1000: fix lockdep warning in e1000_reset_task"). Hopefully Intel can weigh in and shed some light on the correct way to go. Fixes: 8f7ff18a5ec7 ("e1000: Link NAPI instances to queues and IRQs") Reported-by: Dmitry Antipov <dmantipov@yandex.ru> Closes: https://lore.kernel.org/netdev/8cf62307-1965-46a0-a411-ff0080090ff9@yandex.ru/ Signed-off-by: Joe Damato <jdamato@fastly.com> --- drivers/net/ethernet/intel/e1000/e1000_main.c | 4 ++++ 1 file changed, 4 insertions(+) base-commit: d811ac148f0afd2f3f7e1cd7f54de8da973ec5e3