Message ID | 20221122121048.776643-1-yangyingliang@huawei.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] net: devlink: fix UAF in devlink_compat_running_version() | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Clearly marked for net |
netdev/fixes_present | success | Fixes tag present in non-next series |
netdev/subject_prefix | success | Link |
netdev/cover_letter | success | Single patches do not need cover letters |
netdev/patch_count | success | Link |
netdev/header_inline | success | No static functions without inline keyword in header files |
netdev/build_32bit | success | Errors and warnings before: 2 this patch: 2 |
netdev/cc_maintainers | success | CCed 6 of 6 maintainers |
netdev/build_clang | success | Errors and warnings before: 5 this patch: 5 |
netdev/module_param | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Signed-off-by tag matches author and committer |
netdev/check_selftest | success | No net selftest shell script |
netdev/verify_fixes | success | Fixes tag looks correct |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 2 this patch: 2 |
netdev/checkpatch | warning | WARNING: Statements should start on a tabstop |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/source_inline | success | Was 0 now: 0 |
On Tue, Nov 22, 2022 at 08:10:48PM +0800, Yang Yingliang wrote: > I got a UAF report as following when doing fault injection test: > > BUG: KASAN: use-after-free in devlink_compat_running_version+0x5b9/0x6a0 > Read of size 8 at addr ffff88810ac591f0 by task systemd-udevd/458 > > CPU: 2 PID: 458 Comm: systemd-udevd Not tainted 6.1.0-rc5-00155-ga9d2b54dd4e7-dirty #1359 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 > Call Trace: > <TASK> > kasan_report+0x90/0x190 > devlink_compat_running_version+0x5b9/0x6a0 > dev_ethtool+0x2ca/0x340 > dev_ioctl+0x16c/0xff0 > sock_do_ioctl+0x1ae/0x220 > > Allocated by task 456: > kasan_save_stack+0x1c/0x40 > kasan_set_track+0x21/0x30 > __kasan_kmalloc+0x7e/0x90 > __kmalloc+0x59/0x1b0 > devlink_alloc_ns+0xf7/0xa10 > nsim_drv_probe+0xa8/0x150 [netdevsim] > really_probe+0x1ff/0x660 > > Freed by task 456: > kasan_save_stack+0x1c/0x40 > kasan_set_track+0x21/0x30 > kasan_save_free_info+0x2a/0x50 > __kasan_slab_free+0x102/0x190 > __kmem_cache_free+0xca/0x400 > nsim_drv_probe.cold.31+0x2af/0xf62 [netdevsim] > really_probe+0x1ff/0x660 > > It happened like this: > > processor A processor B > nsim_drv_probe() > devlink_alloc_ns() > nsim_dev_port_add_all() > __nsim_dev_port_add() // add eth1 successful > dev_ethtool() > ethtool_get_drvinfo(eth1) > netdev_to_devlink_get(eth1) > devlink_try_get() // get devlink here > __nsim_dev_port_add() // add eth2 failed, goto error > devlink_free() // it's called in the error path > devlink_compat_running_version() <- causes UAF > devlink_register() // it's in normal path, not called yet > > There is two ports to add in nsim_dev_port_add_all(), if first > port is added successful, the devlink is visable and can be get > by devlink_try_get() on another cpu, but it is not registered > yet. And then the second port is failed to added, in the error > path, devlink_free() is called, at last it causes UAF. > > Add check in devlink_try_get(), if the 'devlink' is not registered > it returns NULL to avoid UAF like this case. > > Fixes: a62fdbbe9403 ("netdevsim: implement ndo_get_devlink_port") > Reported-by: Zhengchao Shao <shaozhengchao@huawei.com> > Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> > --- > net/core/devlink.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/net/core/devlink.c b/net/core/devlink.c > index 89baa7c0938b..6453ac0558fb 100644 > --- a/net/core/devlink.c > +++ b/net/core/devlink.c > @@ -250,6 +250,9 @@ void devlink_put(struct devlink *devlink) > > struct devlink *__must_check devlink_try_get(struct devlink *devlink) > { > + if (!xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED)) > + return NULL; > + Please fix nsim instead of devlink. Thanks > if (refcount_inc_not_zero(&devlink->refcount)) > return devlink; > return NULL; > -- > 2.25.1 >
Hi, On 2022/11/22 22:32, Leon Romanovsky wrote: > On Tue, Nov 22, 2022 at 08:10:48PM +0800, Yang Yingliang wrote: >> I got a UAF report as following when doing fault injection test: >> >> BUG: KASAN: use-after-free in devlink_compat_running_version+0x5b9/0x6a0 >> Read of size 8 at addr ffff88810ac591f0 by task systemd-udevd/458 >> >> CPU: 2 PID: 458 Comm: systemd-udevd Not tainted 6.1.0-rc5-00155-ga9d2b54dd4e7-dirty #1359 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 >> Call Trace: >> <TASK> >> kasan_report+0x90/0x190 >> devlink_compat_running_version+0x5b9/0x6a0 >> dev_ethtool+0x2ca/0x340 >> dev_ioctl+0x16c/0xff0 >> sock_do_ioctl+0x1ae/0x220 >> >> Allocated by task 456: >> kasan_save_stack+0x1c/0x40 >> kasan_set_track+0x21/0x30 >> __kasan_kmalloc+0x7e/0x90 >> __kmalloc+0x59/0x1b0 >> devlink_alloc_ns+0xf7/0xa10 >> nsim_drv_probe+0xa8/0x150 [netdevsim] >> really_probe+0x1ff/0x660 >> >> Freed by task 456: >> kasan_save_stack+0x1c/0x40 >> kasan_set_track+0x21/0x30 >> kasan_save_free_info+0x2a/0x50 >> __kasan_slab_free+0x102/0x190 >> __kmem_cache_free+0xca/0x400 >> nsim_drv_probe.cold.31+0x2af/0xf62 [netdevsim] >> really_probe+0x1ff/0x660 >> >> It happened like this: >> >> processor A processor B >> nsim_drv_probe() >> devlink_alloc_ns() >> nsim_dev_port_add_all() >> __nsim_dev_port_add() // add eth1 successful >> dev_ethtool() >> ethtool_get_drvinfo(eth1) >> netdev_to_devlink_get(eth1) >> devlink_try_get() // get devlink here >> __nsim_dev_port_add() // add eth2 failed, goto error >> devlink_free() // it's called in the error path >> devlink_compat_running_version() <- causes UAF >> devlink_register() // it's in normal path, not called yet >> >> There is two ports to add in nsim_dev_port_add_all(), if first >> port is added successful, the devlink is visable and can be get >> by devlink_try_get() on another cpu, but it is not registered >> yet. And then the second port is failed to added, in the error >> path, devlink_free() is called, at last it causes UAF. >> >> Add check in devlink_try_get(), if the 'devlink' is not registered >> it returns NULL to avoid UAF like this case. >> >> Fixes: a62fdbbe9403 ("netdevsim: implement ndo_get_devlink_port") >> Reported-by: Zhengchao Shao <shaozhengchao@huawei.com> >> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> >> --- >> net/core/devlink.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/net/core/devlink.c b/net/core/devlink.c >> index 89baa7c0938b..6453ac0558fb 100644 >> --- a/net/core/devlink.c >> +++ b/net/core/devlink.c >> @@ -250,6 +250,9 @@ void devlink_put(struct devlink *devlink) >> >> struct devlink *__must_check devlink_try_get(struct devlink *devlink) >> { >> + if (!xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED)) >> + return NULL; >> + > Please fix nsim instead of devlink. I think if devlink is not registered, it can not be get and used, but there is no API for driver to check this, can I introduce a new helper name devlink_is_registered() for driver using. Thanks, Yang > > Thanks > >> if (refcount_inc_not_zero(&devlink->refcount)) >> return devlink; >> return NULL; >> -- >> 2.25.1 >> > .
On Tue, Nov 22, 2022 at 11:25:31PM +0800, Yang Yingliang wrote: > Hi, > > On 2022/11/22 22:32, Leon Romanovsky wrote: > > On Tue, Nov 22, 2022 at 08:10:48PM +0800, Yang Yingliang wrote: > > > I got a UAF report as following when doing fault injection test: > > > > > > BUG: KASAN: use-after-free in devlink_compat_running_version+0x5b9/0x6a0 > > > Read of size 8 at addr ffff88810ac591f0 by task systemd-udevd/458 > > > > > > CPU: 2 PID: 458 Comm: systemd-udevd Not tainted 6.1.0-rc5-00155-ga9d2b54dd4e7-dirty #1359 > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 > > > Call Trace: > > > <TASK> > > > kasan_report+0x90/0x190 > > > devlink_compat_running_version+0x5b9/0x6a0 > > > dev_ethtool+0x2ca/0x340 > > > dev_ioctl+0x16c/0xff0 > > > sock_do_ioctl+0x1ae/0x220 > > > > > > Allocated by task 456: > > > kasan_save_stack+0x1c/0x40 > > > kasan_set_track+0x21/0x30 > > > __kasan_kmalloc+0x7e/0x90 > > > __kmalloc+0x59/0x1b0 > > > devlink_alloc_ns+0xf7/0xa10 > > > nsim_drv_probe+0xa8/0x150 [netdevsim] > > > really_probe+0x1ff/0x660 > > > > > > Freed by task 456: > > > kasan_save_stack+0x1c/0x40 > > > kasan_set_track+0x21/0x30 > > > kasan_save_free_info+0x2a/0x50 > > > __kasan_slab_free+0x102/0x190 > > > __kmem_cache_free+0xca/0x400 > > > nsim_drv_probe.cold.31+0x2af/0xf62 [netdevsim] > > > really_probe+0x1ff/0x660 > > > > > > It happened like this: > > > > > > processor A processor B > > > nsim_drv_probe() > > > devlink_alloc_ns() > > > nsim_dev_port_add_all() > > > __nsim_dev_port_add() // add eth1 successful > > > dev_ethtool() > > > ethtool_get_drvinfo(eth1) > > > netdev_to_devlink_get(eth1) > > > devlink_try_get() // get devlink here > > > __nsim_dev_port_add() // add eth2 failed, goto error > > > devlink_free() // it's called in the error path > > > devlink_compat_running_version() <- causes UAF > > > devlink_register() // it's in normal path, not called yet > > > > > > There is two ports to add in nsim_dev_port_add_all(), if first > > > port is added successful, the devlink is visable and can be get > > > by devlink_try_get() on another cpu, but it is not registered > > > yet. And then the second port is failed to added, in the error > > > path, devlink_free() is called, at last it causes UAF. > > > > > > Add check in devlink_try_get(), if the 'devlink' is not registered > > > it returns NULL to avoid UAF like this case. > > > > > > Fixes: a62fdbbe9403 ("netdevsim: implement ndo_get_devlink_port") > > > Reported-by: Zhengchao Shao <shaozhengchao@huawei.com> > > > Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> > > > --- > > > net/core/devlink.c | 3 +++ > > > 1 file changed, 3 insertions(+) > > > > > > diff --git a/net/core/devlink.c b/net/core/devlink.c > > > index 89baa7c0938b..6453ac0558fb 100644 > > > --- a/net/core/devlink.c > > > +++ b/net/core/devlink.c > > > @@ -250,6 +250,9 @@ void devlink_put(struct devlink *devlink) > > > struct devlink *__must_check devlink_try_get(struct devlink *devlink) > > > { > > > + if (!xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED)) > > > + return NULL; > > > + > > Please fix nsim instead of devlink. > I think if devlink is not registered, it can not be get and used, but there > is no API for driver to check this, can I introduce a new helper name > devlink_is_registered() for driver using. There is no need in such API as driver calls to devlink_register() and as such it knows when devlink is registered. This UAF is nsim specific issue. Real devices have single .probe() routine with serialized registration flow. None of them will use devlink_is_registered() call. Thanks > > Thanks, > Yang > > > > Thanks > > > > > if (refcount_inc_not_zero(&devlink->refcount)) > > > return devlink; > > > return NULL; > > > -- > > > 2.25.1 > > > > > .
On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote: > > > Please fix nsim instead of devlink. > > I think if devlink is not registered, it can not be get and used, but there > > is no API for driver to check this, can I introduce a new helper name > > devlink_is_registered() for driver using. > > There is no need in such API as driver calls to devlink_register() and > as such it knows when devlink is registered. > > This UAF is nsim specific issue. Real devices have single .probe() > routine with serialized registration flow. None of them will use > devlink_is_registered() call. Agreed, the fix is to move the register call back. Something along the lines of the untested patch below? Yang Yingliang would you be able to turn that into a real patch? diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c index e14686594a71..26602d5fe0a2 100644 --- a/drivers/net/netdevsim/dev.c +++ b/drivers/net/netdevsim/dev.c @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) err = devlink_params_register(devlink, nsim_devlink_params, ARRAY_SIZE(nsim_devlink_params)); if (err) - goto err_dl_unregister; + goto err_resource_unregister; nsim_devlink_set_params_init_values(nsim_dev, devlink); + /* here, because params API still expect devlink to be unregistered */ + devl_register(devlink); + err = nsim_dev_dummy_region_init(nsim_dev, devlink); if (err) - goto err_params_unregister; + goto err_dl_unregister; err = nsim_dev_traps_init(devlink); if (err) @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY; devlink_set_features(devlink, DEVLINK_F_RELOAD); devl_unlock(devlink); - devlink_register(devlink); return 0; err_hwstats_exit: @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) nsim_dev_traps_exit(devlink); err_dummy_region_exit: nsim_dev_dummy_region_exit(nsim_dev); -err_params_unregister: +err_dl_unregister: + devl_unregister(devlink); devlink_params_unregister(devlink, nsim_devlink_params, ARRAY_SIZE(nsim_devlink_params)); -err_dl_unregister: +err_resource_unregister: devl_resources_unregister(devlink); err_vfc_free: kfree(nsim_dev->vfconfigs); diff --git a/include/net/devlink.h b/include/net/devlink.h index 074a79b8933f..e0f13100fc6b 100644 --- a/include/net/devlink.h +++ b/include/net/devlink.h @@ -1593,6 +1593,14 @@ void devlink_set_features(struct devlink *devlink, u64 features); void devlink_register(struct devlink *devlink); void devlink_unregister(struct devlink *devlink); void devlink_free(struct devlink *devlink); + +void devl_register(struct devlink *devlink) +{ + devl_assert_locked(devlink); + /* doesn't actually take the instance lock */ + devlink_register(devlink); +} + void devlink_port_init(struct devlink *devlink, struct devlink_port *devlink_port); void devlink_port_fini(struct devlink_port *devlink_port);
On 2022/11/23 4:27, Jakub Kicinski wrote: > On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote: >>>> Please fix nsim instead of devlink. >>> I think if devlink is not registered, it can not be get and used, but there >>> is no API for driver to check this, can I introduce a new helper name >>> devlink_is_registered() for driver using. >> There is no need in such API as driver calls to devlink_register() and >> as such it knows when devlink is registered. >> >> This UAF is nsim specific issue. Real devices have single .probe() >> routine with serialized registration flow. None of them will use >> devlink_is_registered() call. > Agreed, the fix is to move the register call back. > Something along the lines of the untested patch below? > Yang Yingliang would you be able to turn that into a real patch? OK. Thanks, Yang > > diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c > index e14686594a71..26602d5fe0a2 100644 > --- a/drivers/net/netdevsim/dev.c > +++ b/drivers/net/netdevsim/dev.c > @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > err = devlink_params_register(devlink, nsim_devlink_params, > ARRAY_SIZE(nsim_devlink_params)); > if (err) > - goto err_dl_unregister; > + goto err_resource_unregister; > nsim_devlink_set_params_init_values(nsim_dev, devlink); > > + /* here, because params API still expect devlink to be unregistered */ > + devl_register(devlink); > + > err = nsim_dev_dummy_region_init(nsim_dev, devlink); > if (err) > - goto err_params_unregister; > + goto err_dl_unregister; > > err = nsim_dev_traps_init(devlink); > if (err) > @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY; > devlink_set_features(devlink, DEVLINK_F_RELOAD); > devl_unlock(devlink); > - devlink_register(devlink); > return 0; > > err_hwstats_exit: > @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > nsim_dev_traps_exit(devlink); > err_dummy_region_exit: > nsim_dev_dummy_region_exit(nsim_dev); > -err_params_unregister: > +err_dl_unregister: > + devl_unregister(devlink); > devlink_params_unregister(devlink, nsim_devlink_params, > ARRAY_SIZE(nsim_devlink_params)); > -err_dl_unregister: > +err_resource_unregister: > devl_resources_unregister(devlink); > err_vfc_free: > kfree(nsim_dev->vfconfigs); > diff --git a/include/net/devlink.h b/include/net/devlink.h > index 074a79b8933f..e0f13100fc6b 100644 > --- a/include/net/devlink.h > +++ b/include/net/devlink.h > @@ -1593,6 +1593,14 @@ void devlink_set_features(struct devlink *devlink, u64 features); > void devlink_register(struct devlink *devlink); > void devlink_unregister(struct devlink *devlink); > void devlink_free(struct devlink *devlink); > + > +void devl_register(struct devlink *devlink) > +{ > + devl_assert_locked(devlink); > + /* doesn't actually take the instance lock */ > + devlink_register(devlink); > +} > + > void devlink_port_init(struct devlink *devlink, > struct devlink_port *devlink_port); > void devlink_port_fini(struct devlink_port *devlink_port); > .
On 2022/11/23 4:27, Jakub Kicinski wrote: > On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote: >>>> Please fix nsim instead of devlink. >>> I think if devlink is not registered, it can not be get and used, but there >>> is no API for driver to check this, can I introduce a new helper name >>> devlink_is_registered() for driver using. >> There is no need in such API as driver calls to devlink_register() and >> as such it knows when devlink is registered. >> >> This UAF is nsim specific issue. Real devices have single .probe() >> routine with serialized registration flow. None of them will use >> devlink_is_registered() call. > Agreed, the fix is to move the register call back. > Something along the lines of the untested patch below? > Yang Yingliang would you be able to turn that into a real patch? > > diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c > index e14686594a71..26602d5fe0a2 100644 > --- a/drivers/net/netdevsim/dev.c > +++ b/drivers/net/netdevsim/dev.c > @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > err = devlink_params_register(devlink, nsim_devlink_params, > ARRAY_SIZE(nsim_devlink_params)); > if (err) > - goto err_dl_unregister; > + goto err_resource_unregister; > nsim_devlink_set_params_init_values(nsim_dev, devlink); > > + /* here, because params API still expect devlink to be unregistered */ > + devl_register(devlink); > + devlink_set_features() called at last in probe() also needs devlink is not registered. > err = nsim_dev_dummy_region_init(nsim_dev, devlink); > if (err) > - goto err_params_unregister; > + goto err_dl_unregister; > > err = nsim_dev_traps_init(devlink); > if (err) > @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY; > devlink_set_features(devlink, DEVLINK_F_RELOAD); > devl_unlock(devlink); > - devlink_register(devlink); > return 0; > > err_hwstats_exit: > @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > nsim_dev_traps_exit(devlink); > err_dummy_region_exit: > nsim_dev_dummy_region_exit(nsim_dev); > -err_params_unregister: > +err_dl_unregister: > + devl_unregister(devlink); It races with dev_ethtool(): dev_ethtool devlink_try_get() nsim_drv_probe devl_lock() devl_lock() devlink_unregister() devlink_put() wait_for_completion() <- the refcount is got in dev_ethtool, it causes ABBA deadlock > devlink_params_unregister(devlink, nsim_devlink_params, > ARRAY_SIZE(nsim_devlink_params)); > -err_dl_unregister: > +err_resource_unregister: > devl_resources_unregister(devlink); > err_vfc_free: > kfree(nsim_dev->vfconfigs); > diff --git a/include/net/devlink.h b/include/net/devlink.h > index 074a79b8933f..e0f13100fc6b 100644 > --- a/include/net/devlink.h > +++ b/include/net/devlink.h > @@ -1593,6 +1593,14 @@ void devlink_set_features(struct devlink *devlink, u64 features); > void devlink_register(struct devlink *devlink); > void devlink_unregister(struct devlink *devlink); > void devlink_free(struct devlink *devlink); > + > +void devl_register(struct devlink *devlink) > +{ > + devl_assert_locked(devlink); > + /* doesn't actually take the instance lock */ > + devlink_register(devlink); > +} > + > void devlink_port_init(struct devlink *devlink, > struct devlink_port *devlink_port); > void devlink_port_fini(struct devlink_port *devlink_port); How about the following changes, it's tested ok in my case: diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c index a7880c7ce94c..0ec4590c26a0 100644 --- a/drivers/net/netdevsim/dev.c +++ b/drivers/net/netdevsim/dev.c @@ -1640,6 +1640,8 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) kfree(nsim_dev->vfconfigs); err_devlink_unlock: devl_unlock(devlink); + devl_put(devlink); + devl_wait_for_comp(devlink); devlink_free(devlink); dev_set_drvdata(&nsim_bus_dev->dev, NULL); return err; diff --git a/include/net/devlink.h b/include/net/devlink.h index ba6b8b094943..dc8ddf942c4d 100644 --- a/include/net/devlink.h +++ b/include/net/devlink.h @@ -1563,6 +1563,7 @@ static inline struct devlink *devlink_alloc(const struct devlink_ops *ops, void devlink_set_features(struct devlink *devlink, u64 features); void devlink_register(struct devlink *devlink); void devlink_unregister(struct devlink *devlink); +void devl_wait_for_comp(struct devlink *devlink); void devlink_free(struct devlink *devlink); void devlink_port_init(struct devlink *devlink, struct devlink_port *devlink_port); @@ -1903,4 +1904,6 @@ devlink_compat_switch_id_get(struct net_device *dev, #endif +void devl_put(struct devlink *devlink); + #endif /* _NET_DEVLINK_H_ */ diff --git a/net/core/devlink.c b/net/core/devlink.c index 89baa7c0938b..67c78cf37bac 100644 --- a/net/core/devlink.c +++ b/net/core/devlink.c @@ -248,6 +248,12 @@ void devlink_put(struct devlink *devlink) call_rcu(&devlink->rcu, __devlink_put_rcu); } +void devl_put(struct devlink *devlink) +{ + devlink_put(devlink); +} +EXPORT_SYMBOL_GPL(devl_put); + struct devlink *__must_check devlink_try_get(struct devlink *devlink) { if (refcount_inc_not_zero(&devlink->refcount)) @@ -9788,6 +9794,12 @@ void devlink_unregister(struct devlink *devlink) } EXPORT_SYMBOL_GPL(devlink_unregister); +void devl_wait_for_comp(struct devlink *devlink) +{ + wait_for_completion(&devlink->comp); +} +EXPORT_SYMBOL_GPL(devl_wait_for_comp); + /** * devlink_free - Free devlink instance resources * > .
On Wed, Nov 23, 2022 at 02:40:24PM +0800, Yang Yingliang wrote: > > On 2022/11/23 4:27, Jakub Kicinski wrote: > > On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote: > > > > > Please fix nsim instead of devlink. > > > > I think if devlink is not registered, it can not be get and used, but there > > > > is no API for driver to check this, can I introduce a new helper name > > > > devlink_is_registered() for driver using. > > > There is no need in such API as driver calls to devlink_register() and > > > as such it knows when devlink is registered. > > > > > > This UAF is nsim specific issue. Real devices have single .probe() > > > routine with serialized registration flow. None of them will use > > > devlink_is_registered() call. > > Agreed, the fix is to move the register call back. > > Something along the lines of the untested patch below? > > Yang Yingliang would you be able to turn that into a real patch? > > > > diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c > > index e14686594a71..26602d5fe0a2 100644 > > --- a/drivers/net/netdevsim/dev.c > > +++ b/drivers/net/netdevsim/dev.c > > @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > > err = devlink_params_register(devlink, nsim_devlink_params, > > ARRAY_SIZE(nsim_devlink_params)); > > if (err) > > - goto err_dl_unregister; > > + goto err_resource_unregister; > > nsim_devlink_set_params_init_values(nsim_dev, devlink); > > + /* here, because params API still expect devlink to be unregistered */ > > + devl_register(devlink); > > + > devlink_set_features() called at last in probe() also needs devlink is not > registered. > > err = nsim_dev_dummy_region_init(nsim_dev, devlink); > > if (err) > > - goto err_params_unregister; > > + goto err_dl_unregister; > > err = nsim_dev_traps_init(devlink); > > if (err) > > @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > > nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY; > > devlink_set_features(devlink, DEVLINK_F_RELOAD); > > devl_unlock(devlink); > > - devlink_register(devlink); > > return 0; > > err_hwstats_exit: > > @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > > nsim_dev_traps_exit(devlink); > > err_dummy_region_exit: > > nsim_dev_dummy_region_exit(nsim_dev); > > -err_params_unregister: > > +err_dl_unregister: > > + devl_unregister(devlink); > It races with dev_ethtool(): > dev_ethtool > devlink_try_get() > nsim_drv_probe > devl_lock() > devl_lock() > devlink_unregister() > devlink_put() > wait_for_completion() <- the refcount is > got in dev_ethtool, it causes ABBA deadlock But all these races are nsim specific ones. Can you please explain why devlink.[c|h] MUST be changed and nsim can't be fixed? Thanks
On 2022/11/23 15:41, Leon Romanovsky wrote: > On Wed, Nov 23, 2022 at 02:40:24PM +0800, Yang Yingliang wrote: >> On 2022/11/23 4:27, Jakub Kicinski wrote: >>> On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote: >>>>>> Please fix nsim instead of devlink. >>>>> I think if devlink is not registered, it can not be get and used, but there >>>>> is no API for driver to check this, can I introduce a new helper name >>>>> devlink_is_registered() for driver using. >>>> There is no need in such API as driver calls to devlink_register() and >>>> as such it knows when devlink is registered. >>>> >>>> This UAF is nsim specific issue. Real devices have single .probe() >>>> routine with serialized registration flow. None of them will use >>>> devlink_is_registered() call. >>> Agreed, the fix is to move the register call back. >>> Something along the lines of the untested patch below? >>> Yang Yingliang would you be able to turn that into a real patch? >>> >>> diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c >>> index e14686594a71..26602d5fe0a2 100644 >>> --- a/drivers/net/netdevsim/dev.c >>> +++ b/drivers/net/netdevsim/dev.c >>> @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) >>> err = devlink_params_register(devlink, nsim_devlink_params, >>> ARRAY_SIZE(nsim_devlink_params)); >>> if (err) >>> - goto err_dl_unregister; >>> + goto err_resource_unregister; >>> nsim_devlink_set_params_init_values(nsim_dev, devlink); >>> + /* here, because params API still expect devlink to be unregistered */ >>> + devl_register(devlink); >>> + >> devlink_set_features() called at last in probe() also needs devlink is not >> registered. >>> err = nsim_dev_dummy_region_init(nsim_dev, devlink); >>> if (err) >>> - goto err_params_unregister; >>> + goto err_dl_unregister; >>> err = nsim_dev_traps_init(devlink); >>> if (err) >>> @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) >>> nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY; >>> devlink_set_features(devlink, DEVLINK_F_RELOAD); >>> devl_unlock(devlink); >>> - devlink_register(devlink); >>> return 0; >>> err_hwstats_exit: >>> @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) >>> nsim_dev_traps_exit(devlink); >>> err_dummy_region_exit: >>> nsim_dev_dummy_region_exit(nsim_dev); >>> -err_params_unregister: >>> +err_dl_unregister: >>> + devl_unregister(devlink); >> It races with dev_ethtool(): >> dev_ethtool >> devlink_try_get() >> nsim_drv_probe >> devl_lock() >> devl_lock() >> devlink_unregister() >> devlink_put() >> wait_for_completion() <- the refcount is >> got in dev_ethtool, it causes ABBA deadlock > But all these races are nsim specific ones. > Can you please explain why devlink.[c|h] MUST be changed and nsim can't > be fixed? I used the fix code proposed by Jakub, but it didn't work correctly, so I tried to correct and improve it, and need some devlink helper. Anyway, it is a nsim problem, if we want fix this without touch devlink, I think we can add a 'registered' field in struct nsim_dev, and it can be checked in nsim_get_devlink_port() like this: --- a/drivers/net/netdevsim/netdev.c +++ b/drivers/net/netdevsim/netdev.c @@ -242,6 +242,9 @@ static struct devlink_port *nsim_get_devlink_port(struct net_device *dev) { struct netdevsim *ns = netdev_priv(dev); + if (!ns->nsim_dev->devlink_registered) + return NULL; + return &ns->nsim_dev_port->devlink_port; } Thanks, Yang > > Thanks > .
On Wed, Nov 23, 2022 at 04:34:49PM +0800, Yang Yingliang wrote: > > On 2022/11/23 15:41, Leon Romanovsky wrote: > > On Wed, Nov 23, 2022 at 02:40:24PM +0800, Yang Yingliang wrote: > > > On 2022/11/23 4:27, Jakub Kicinski wrote: > > > > On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote: > > > > > > > Please fix nsim instead of devlink. > > > > > > I think if devlink is not registered, it can not be get and used, but there > > > > > > is no API for driver to check this, can I introduce a new helper name > > > > > > devlink_is_registered() for driver using. > > > > > There is no need in such API as driver calls to devlink_register() and > > > > > as such it knows when devlink is registered. > > > > > > > > > > This UAF is nsim specific issue. Real devices have single .probe() > > > > > routine with serialized registration flow. None of them will use > > > > > devlink_is_registered() call. > > > > Agreed, the fix is to move the register call back. > > > > Something along the lines of the untested patch below? > > > > Yang Yingliang would you be able to turn that into a real patch? > > > > > > > > diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c > > > > index e14686594a71..26602d5fe0a2 100644 > > > > --- a/drivers/net/netdevsim/dev.c > > > > +++ b/drivers/net/netdevsim/dev.c > > > > @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > > > > err = devlink_params_register(devlink, nsim_devlink_params, > > > > ARRAY_SIZE(nsim_devlink_params)); > > > > if (err) > > > > - goto err_dl_unregister; > > > > + goto err_resource_unregister; > > > > nsim_devlink_set_params_init_values(nsim_dev, devlink); > > > > + /* here, because params API still expect devlink to be unregistered */ > > > > + devl_register(devlink); > > > > + > > > devlink_set_features() called at last in probe() also needs devlink is not > > > registered. > > > > err = nsim_dev_dummy_region_init(nsim_dev, devlink); > > > > if (err) > > > > - goto err_params_unregister; > > > > + goto err_dl_unregister; > > > > err = nsim_dev_traps_init(devlink); > > > > if (err) > > > > @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > > > > nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY; > > > > devlink_set_features(devlink, DEVLINK_F_RELOAD); > > > > devl_unlock(devlink); > > > > - devlink_register(devlink); > > > > return 0; > > > > err_hwstats_exit: > > > > @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > > > > nsim_dev_traps_exit(devlink); > > > > err_dummy_region_exit: > > > > nsim_dev_dummy_region_exit(nsim_dev); > > > > -err_params_unregister: > > > > +err_dl_unregister: > > > > + devl_unregister(devlink); > > > It races with dev_ethtool(): > > > dev_ethtool > > > devlink_try_get() > > > nsim_drv_probe > > > devl_lock() > > > devl_lock() > > > devlink_unregister() > > > devlink_put() > > > wait_for_completion() <- the refcount is > > > got in dev_ethtool, it causes ABBA deadlock > > But all these races are nsim specific ones. > > Can you please explain why devlink.[c|h] MUST be changed and nsim can't > > be fixed? > I used the fix code proposed by Jakub, but it didn't work correctly, so > I tried to correct and improve it, and need some devlink helper. > > Anyway, it is a nsim problem, if we want fix this without touch devlink, > I think we can add a 'registered' field in struct nsim_dev, and it can be > checked in nsim_get_devlink_port() like this: > --- a/drivers/net/netdevsim/netdev.c > +++ b/drivers/net/netdevsim/netdev.c > @@ -242,6 +242,9 @@ static struct devlink_port *nsim_get_devlink_port(struct > net_device *dev) > { > struct netdevsim *ns = netdev_priv(dev); > > + if (!ns->nsim_dev->devlink_registered) > + return NULL; Something like that, but you need to make sure that this check is protected by some sort of the locking. > + > return &ns->nsim_dev_port->devlink_port; > } > > Thanks, > Yang > > > > Thanks > > .
On Wed, Nov 23, 2022 at 04:34:49PM +0800, Yang Yingliang wrote: > > On 2022/11/23 15:41, Leon Romanovsky wrote: > > On Wed, Nov 23, 2022 at 02:40:24PM +0800, Yang Yingliang wrote: > > > On 2022/11/23 4:27, Jakub Kicinski wrote: > > > > On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote: > > > > > > > Please fix nsim instead of devlink. > > > > > > I think if devlink is not registered, it can not be get and used, but there > > > > > > is no API for driver to check this, can I introduce a new helper name > > > > > > devlink_is_registered() for driver using. > > > > > There is no need in such API as driver calls to devlink_register() and > > > > > as such it knows when devlink is registered. > > > > > > > > > > This UAF is nsim specific issue. Real devices have single .probe() > > > > > routine with serialized registration flow. None of them will use > > > > > devlink_is_registered() call. > > > > Agreed, the fix is to move the register call back. > > > > Something along the lines of the untested patch below? > > > > Yang Yingliang would you be able to turn that into a real patch? > > > > > > > > diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c > > > > index e14686594a71..26602d5fe0a2 100644 > > > > --- a/drivers/net/netdevsim/dev.c > > > > +++ b/drivers/net/netdevsim/dev.c > > > > @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > > > > err = devlink_params_register(devlink, nsim_devlink_params, > > > > ARRAY_SIZE(nsim_devlink_params)); > > > > if (err) > > > > - goto err_dl_unregister; > > > > + goto err_resource_unregister; > > > > nsim_devlink_set_params_init_values(nsim_dev, devlink); > > > > + /* here, because params API still expect devlink to be unregistered */ > > > > + devl_register(devlink); > > > > + > > > devlink_set_features() called at last in probe() also needs devlink is not > > > registered. > > > > err = nsim_dev_dummy_region_init(nsim_dev, devlink); > > > > if (err) > > > > - goto err_params_unregister; > > > > + goto err_dl_unregister; > > > > err = nsim_dev_traps_init(devlink); > > > > if (err) > > > > @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > > > > nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY; > > > > devlink_set_features(devlink, DEVLINK_F_RELOAD); > > > > devl_unlock(devlink); > > > > - devlink_register(devlink); > > > > return 0; > > > > err_hwstats_exit: > > > > @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) > > > > nsim_dev_traps_exit(devlink); > > > > err_dummy_region_exit: > > > > nsim_dev_dummy_region_exit(nsim_dev); > > > > -err_params_unregister: > > > > +err_dl_unregister: > > > > + devl_unregister(devlink); > > > It races with dev_ethtool(): > > > dev_ethtool > > > devlink_try_get() > > > nsim_drv_probe > > > devl_lock() > > > devl_lock() > > > devlink_unregister() > > > devlink_put() > > > wait_for_completion() <- the refcount is > > > got in dev_ethtool, it causes ABBA deadlock > > But all these races are nsim specific ones. > > Can you please explain why devlink.[c|h] MUST be changed and nsim can't > > be fixed? > I used the fix code proposed by Jakub, but it didn't work correctly, so > I tried to correct and improve it, and need some devlink helper. > > Anyway, it is a nsim problem, if we want fix this without touch devlink, > I think we can add a 'registered' field in struct nsim_dev, and it can be > checked in nsim_get_devlink_port() like this: I read the discussion and it's not clear to me why this is a netdevsim specific problem. The fundamental problem seems to be that it is possible to hold a reference on a devlink instance before it's registered and that devlink_free() will free the instance regardless of its current reference count because it expects devlink_unregister() to block. In this case, the instance was never registered, so devlink_unregister() is not called. ethtool was able to get a reference on the devlink instance before it was registered because netdevsim registers its netdevs before registering its devlink instance. However, netdevsim is not the only one doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially others do the same thing. When you think about it, it's strange that it's even possible for ethtool to reach the driver when the netdev used in the request is long gone, but it's not holding a reference on the netdev (it's holding a reference on the devlink instance instead) and devlink_compat_running_version() is called without RTNL.
On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote: > > I used the fix code proposed by Jakub, but it didn't work correctly, so > > I tried to correct and improve it, and need some devlink helper. > > > > Anyway, it is a nsim problem, if we want fix this without touch devlink, > > I think we can add a 'registered' field in struct nsim_dev, and it can be > > checked in nsim_get_devlink_port() like this: > > I read the discussion and it's not clear to me why this is a netdevsim > specific problem. The fundamental problem seems to be that it is > possible to hold a reference on a devlink instance before it's > registered and that devlink_free() will free the instance regardless of > its current reference count because it expects devlink_unregister() to > block. In this case, the instance was never registered, so > devlink_unregister() is not called. > > ethtool was able to get a reference on the devlink instance before it > was registered because netdevsim registers its netdevs before > registering its devlink instance. However, netdevsim is not the only one > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially > others do the same thing. > > When you think about it, it's strange that it's even possible for > ethtool to reach the driver when the netdev used in the request is long > gone, but it's not holding a reference on the netdev (it's holding a > reference on the devlink instance instead) and > devlink_compat_running_version() is called without RTNL. Indeed. We did a bit of a flip-flop with the devlink locking rules and the fact that the instance is reachable before it is registered is a leftover from a previous restructuring :( Hence my preference to get rid of the ordering at the driver level than to try to patch it up in the code. Dunno if that's convincing.
On Wed, 23 Nov 2022 14:40:24 +0800 Yang Yingliang wrote: > > if (err) > > - goto err_dl_unregister; > > + goto err_resource_unregister; > > nsim_devlink_set_params_init_values(nsim_dev, devlink); > > > > + /* here, because params API still expect devlink to be unregistered */ > > + devl_register(devlink); > > + > devlink_set_features() called at last in probe() also needs devlink is > not registered. You can move the devlink_set_features() up. It's also a leftover, it was preventing reload from happening before probe has finished. Now the instance is locked until probe is done so there is no race.
On Wed, 23 Nov 2022 14:40:24 +0800 Yang Yingliang wrote: > > +err_dl_unregister: > > + devl_unregister(devlink); > It races with dev_ethtool(): > dev_ethtool > devlink_try_get() > nsim_drv_probe > devl_lock() > devl_lock() > devlink_unregister() > devlink_put() > wait_for_completion() <- the refcount > is got in dev_ethtool, it causes ABBA deadlock Yeah.. so my original design for the locking had a "devlink_is_alive()" check for this exact reason: https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/ and the devlink structure was properly refcounted (devlink_put() calls devlink_free() when the last reference is released). Pure references then need to check if the instance is still alive after locking it. Which is fine, it should only happen in core code. I think we should go back to that idea. The waiting for references is a nightmare in the netdev code.
On Thu, Nov 24, 2022, at 04:18, Jakub Kicinski wrote: > On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote: >> > I used the fix code proposed by Jakub, but it didn't work correctly, so >> > I tried to correct and improve it, and need some devlink helper. >> > >> > Anyway, it is a nsim problem, if we want fix this without touch devlink, >> > I think we can add a 'registered' field in struct nsim_dev, and it can be >> > checked in nsim_get_devlink_port() like this: >> >> I read the discussion and it's not clear to me why this is a netdevsim >> specific problem. The fundamental problem seems to be that it is >> possible to hold a reference on a devlink instance before it's >> registered and that devlink_free() will free the instance regardless of >> its current reference count because it expects devlink_unregister() to >> block. In this case, the instance was never registered, so >> devlink_unregister() is not called. >> >> ethtool was able to get a reference on the devlink instance before it >> was registered because netdevsim registers its netdevs before >> registering its devlink instance. However, netdevsim is not the only one >> doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially >> others do the same thing. >> >> When you think about it, it's strange that it's even possible for >> ethtool to reach the driver when the netdev used in the request is long >> gone, but it's not holding a reference on the netdev (it's holding a >> reference on the devlink instance instead) and >> devlink_compat_running_version() is called without RTNL. > > Indeed. We did a bit of a flip-flop with the devlink locking rules > and the fact that the instance is reachable before it is registered > is a leftover from a previous restructuring :( > > Hence my preference to get rid of the ordering at the driver level > than to try to patch it up in the code. Dunno if that's convincing. Before you are doing that, let's remind why we do ordering: 1. It was in happy times when some commands were locked and some not. It caused to very exciting things like controlled uaf, e.t.c. 2. We wanted to provide an access from user space after everything devlink related is initialized. 3. Allow sane debuggability as some drivers called to devlink_register and unregister in random places. The locking (item 1) was fixed. Sent from mobile.
On 2022/11/24 10:47, Jakub Kicinski wrote: > On Wed, 23 Nov 2022 14:40:24 +0800 Yang Yingliang wrote: >>> +err_dl_unregister: >>> + devl_unregister(devlink); >> It races with dev_ethtool(): >> dev_ethtool >> devlink_try_get() >> nsim_drv_probe >> devl_lock() >> devl_lock() >> devlink_unregister() >> devlink_put() >> wait_for_completion() <- the refcount >> is got in dev_ethtool, it causes ABBA deadlock > Yeah.. so my original design for the locking had a "devlink_is_alive()" > check for this exact reason: > > https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/ > > and the devlink structure was properly refcounted (devlink_put() calls > devlink_free() when the last reference is released). > > Pure references then need to check if the instance is still alive > after locking it. Which is fine, it should only happen in core code. > > I think we should go back to that idea. But Leon disagree to change devlink code. I think this problem occurs in the drivers that have multiple ports(netdev): In some drivers (e.g. mlx5) , one net device uses one devlink instance(see mlx5e_probe()), the instance can not be get until the device is register, in this case, it won't cause UAF. But in some other drivers(e.g. netdevsim, funeth) multiple ports(net devices) use one devlink instance. If first one is register successful, the instance is visible and can be get through netdev, meanwhile, the second port register failed and goto free the devlink that used by first port(netdevice). So can we fix this in every single driver. Thanks, Yang > > The waiting for references is a nightmare in the netdev code. > > .
On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote: > On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote: > > > I used the fix code proposed by Jakub, but it didn't work correctly, so > > > I tried to correct and improve it, and need some devlink helper. > > > > > > Anyway, it is a nsim problem, if we want fix this without touch devlink, > > > I think we can add a 'registered' field in struct nsim_dev, and it can be > > > checked in nsim_get_devlink_port() like this: > > > > I read the discussion and it's not clear to me why this is a netdevsim > > specific problem. The fundamental problem seems to be that it is > > possible to hold a reference on a devlink instance before it's > > registered and that devlink_free() will free the instance regardless of > > its current reference count because it expects devlink_unregister() to > > block. In this case, the instance was never registered, so > > devlink_unregister() is not called. > > > > ethtool was able to get a reference on the devlink instance before it > > was registered because netdevsim registers its netdevs before > > registering its devlink instance. However, netdevsim is not the only one > > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially > > others do the same thing. > > > > When you think about it, it's strange that it's even possible for > > ethtool to reach the driver when the netdev used in the request is long > > gone, but it's not holding a reference on the netdev (it's holding a > > reference on the devlink instance instead) and > > devlink_compat_running_version() is called without RTNL. > > Indeed. We did a bit of a flip-flop with the devlink locking rules > and the fact that the instance is reachable before it is registered > is a leftover from a previous restructuring :( > > Hence my preference to get rid of the ordering at the driver level > than to try to patch it up in the code. Dunno if that's convincing. I don't have a good solution, but changing all the drivers to register their netdevs after the devlink instance is going to be quite painful and too big for 'net'. I feel like the main motivation for this is the ethtool compat stuff, which is not very convincing IMO. I'm quite happy with the current flow where drivers call devlink_register() at the end of their probe. Regarding a solution for the current crash, assuming we agree it's not a netdevsim specific problem, I think the current fix [1] is OK. Note that while it fixes the crash, it potentially creates other (less severe) problems. After user space receives RTM_NEWLINK notification it will need to wait for a certain period of time before issuing 'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I guess it's not a big deal for drivers that only register one netdev since they will very quickly follow with devlink_register(), but the race window is larger for drivers that need to register many netdevs, for either physical switch or eswitch ports. Long term, we either need to find a way to make the ethtool compat stuff work correctly or just get rid of it and have affected drivers implement the relevant ethtool operations instead of relying on devlink. [1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/
Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote: >On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote: >> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote: >> > > I used the fix code proposed by Jakub, but it didn't work correctly, so >> > > I tried to correct and improve it, and need some devlink helper. >> > > >> > > Anyway, it is a nsim problem, if we want fix this without touch devlink, >> > > I think we can add a 'registered' field in struct nsim_dev, and it can be >> > > checked in nsim_get_devlink_port() like this: >> > >> > I read the discussion and it's not clear to me why this is a netdevsim >> > specific problem. The fundamental problem seems to be that it is >> > possible to hold a reference on a devlink instance before it's >> > registered and that devlink_free() will free the instance regardless of >> > its current reference count because it expects devlink_unregister() to >> > block. In this case, the instance was never registered, so >> > devlink_unregister() is not called. >> > >> > ethtool was able to get a reference on the devlink instance before it >> > was registered because netdevsim registers its netdevs before >> > registering its devlink instance. However, netdevsim is not the only one >> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially >> > others do the same thing. >> > >> > When you think about it, it's strange that it's even possible for >> > ethtool to reach the driver when the netdev used in the request is long >> > gone, but it's not holding a reference on the netdev (it's holding a >> > reference on the devlink instance instead) and >> > devlink_compat_running_version() is called without RTNL. >> >> Indeed. We did a bit of a flip-flop with the devlink locking rules >> and the fact that the instance is reachable before it is registered >> is a leftover from a previous restructuring :( >> >> Hence my preference to get rid of the ordering at the driver level >> than to try to patch it up in the code. Dunno if that's convincing. > >I don't have a good solution, but changing all the drivers to register >their netdevs after the devlink instance is going to be quite painful >and too big for 'net'. I feel like the main motivation for this is the >ethtool compat stuff, which is not very convincing IMO. I'm quite happy >with the current flow where drivers call devlink_register() at the end >of their probe. > >Regarding a solution for the current crash, assuming we agree it's not a >netdevsim specific problem, I think the current fix [1] is OK. Note that >while it fixes the crash, it potentially creates other (less severe) >problems. After user space receives RTM_NEWLINK notification it will >need to wait for a certain period of time before issuing >'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I >guess it's not a big deal for drivers that only register one netdev >since they will very quickly follow with devlink_register(), but the >race window is larger for drivers that need to register many netdevs, >for either physical switch or eswitch ports. > >Long term, we either need to find a way to make the ethtool compat stuff >work correctly or just get rid of it and have affected drivers implement >the relevant ethtool operations instead of relying on devlink. > >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/ I just had a call with Ido. We both think that this might be a good solution for -net to avoid the use after free. For net-next, we eventually should change driver init flows to register devlink instance first and only after that register devlink_port and related netdevice. The ordering is important for the userspace app. For example the init flow: <- RTnetlink new netdev event app sees devlink_port handle in IFLA_DEVLINK_PORT -> query devlink instance using this handle <- ENODEV The instance is not registered yet. So we need to make sure all devlink_port_register() calls are happening after devlink_register(). This is aligned with the original flow before devlink_register() was moved by Leon. Also it is aligned with devlink reload and devlink port split flows.
Tue, Nov 22, 2022 at 01:10:48PM CET, yangyingliang@huawei.com wrote: [...] > >Fixes: a62fdbbe9403 ("netdevsim: implement ndo_get_devlink_port") >Reported-by: Zhengchao Shao <shaozhengchao@huawei.com> >Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> >--- > net/core/devlink.c | 3 +++ > 1 file changed, 3 insertions(+) > >diff --git a/net/core/devlink.c b/net/core/devlink.c >index 89baa7c0938b..6453ac0558fb 100644 >--- a/net/core/devlink.c >+++ b/net/core/devlink.c >@@ -250,6 +250,9 @@ void devlink_put(struct devlink *devlink) > > struct devlink *__must_check devlink_try_get(struct devlink *devlink) > { >+ if (!xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED)) ^ You have an extra space here. Checkpatch would warn you. >+ return NULL; >+ > if (refcount_inc_not_zero(&devlink->refcount)) > return devlink; > return NULL; >-- >2.25.1 >
On Mon, Nov 28, 2022 at 10:58:58AM +0100, Jiri Pirko wrote: > Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote: > >On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote: > >> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote: > >> > > I used the fix code proposed by Jakub, but it didn't work correctly, so > >> > > I tried to correct and improve it, and need some devlink helper. > >> > > > >> > > Anyway, it is a nsim problem, if we want fix this without touch devlink, > >> > > I think we can add a 'registered' field in struct nsim_dev, and it can be > >> > > checked in nsim_get_devlink_port() like this: > >> > > >> > I read the discussion and it's not clear to me why this is a netdevsim > >> > specific problem. The fundamental problem seems to be that it is > >> > possible to hold a reference on a devlink instance before it's > >> > registered and that devlink_free() will free the instance regardless of > >> > its current reference count because it expects devlink_unregister() to > >> > block. In this case, the instance was never registered, so > >> > devlink_unregister() is not called. > >> > > >> > ethtool was able to get a reference on the devlink instance before it > >> > was registered because netdevsim registers its netdevs before > >> > registering its devlink instance. However, netdevsim is not the only one > >> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially > >> > others do the same thing. > >> > > >> > When you think about it, it's strange that it's even possible for > >> > ethtool to reach the driver when the netdev used in the request is long > >> > gone, but it's not holding a reference on the netdev (it's holding a > >> > reference on the devlink instance instead) and > >> > devlink_compat_running_version() is called without RTNL. > >> > >> Indeed. We did a bit of a flip-flop with the devlink locking rules > >> and the fact that the instance is reachable before it is registered > >> is a leftover from a previous restructuring :( > >> > >> Hence my preference to get rid of the ordering at the driver level > >> than to try to patch it up in the code. Dunno if that's convincing. > > > >I don't have a good solution, but changing all the drivers to register > >their netdevs after the devlink instance is going to be quite painful > >and too big for 'net'. I feel like the main motivation for this is the > >ethtool compat stuff, which is not very convincing IMO. I'm quite happy > >with the current flow where drivers call devlink_register() at the end > >of their probe. > > > >Regarding a solution for the current crash, assuming we agree it's not a > >netdevsim specific problem, I think the current fix [1] is OK. Note that > >while it fixes the crash, it potentially creates other (less severe) > >problems. After user space receives RTM_NEWLINK notification it will > >need to wait for a certain period of time before issuing > >'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I > >guess it's not a big deal for drivers that only register one netdev > >since they will very quickly follow with devlink_register(), but the > >race window is larger for drivers that need to register many netdevs, > >for either physical switch or eswitch ports. > > > >Long term, we either need to find a way to make the ethtool compat stuff > >work correctly or just get rid of it and have affected drivers implement > >the relevant ethtool operations instead of relying on devlink. > > > >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/ > > I just had a call with Ido. We both think that this might be a good > solution for -net to avoid the use after free. > > For net-next, we eventually should change driver init flows to register > devlink instance first and only after that register devlink_port and > related netdevice. The ordering is important for the userspace app. For > example the init flow: > <- RTnetlink new netdev event > app sees devlink_port handle in IFLA_DEVLINK_PORT > -> query devlink instance using this handle > <- ENODEV > > The instance is not registered yet. This is supposed to be handled by devlink_notify_register() which sends "delayed" notifications after devlink_register() is called. Unless something is broken, the scenario above shouldn't happen. > > So we need to make sure all devlink_port_register() calls are happening > after devlink_register(). This is aligned with the original flow before > devlink_register() was moved by Leon. Also it is aligned with devlink > reload and devlink port split flows. > I don't know what it means. Thanks
Mon, Nov 28, 2022 at 12:50:15PM CET, leon@kernel.org wrote: >On Mon, Nov 28, 2022 at 10:58:58AM +0100, Jiri Pirko wrote: >> Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote: >> >On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote: >> >> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote: >> >> > > I used the fix code proposed by Jakub, but it didn't work correctly, so >> >> > > I tried to correct and improve it, and need some devlink helper. >> >> > > >> >> > > Anyway, it is a nsim problem, if we want fix this without touch devlink, >> >> > > I think we can add a 'registered' field in struct nsim_dev, and it can be >> >> > > checked in nsim_get_devlink_port() like this: >> >> > >> >> > I read the discussion and it's not clear to me why this is a netdevsim >> >> > specific problem. The fundamental problem seems to be that it is >> >> > possible to hold a reference on a devlink instance before it's >> >> > registered and that devlink_free() will free the instance regardless of >> >> > its current reference count because it expects devlink_unregister() to >> >> > block. In this case, the instance was never registered, so >> >> > devlink_unregister() is not called. >> >> > >> >> > ethtool was able to get a reference on the devlink instance before it >> >> > was registered because netdevsim registers its netdevs before >> >> > registering its devlink instance. However, netdevsim is not the only one >> >> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially >> >> > others do the same thing. >> >> > >> >> > When you think about it, it's strange that it's even possible for >> >> > ethtool to reach the driver when the netdev used in the request is long >> >> > gone, but it's not holding a reference on the netdev (it's holding a >> >> > reference on the devlink instance instead) and >> >> > devlink_compat_running_version() is called without RTNL. >> >> >> >> Indeed. We did a bit of a flip-flop with the devlink locking rules >> >> and the fact that the instance is reachable before it is registered >> >> is a leftover from a previous restructuring :( >> >> >> >> Hence my preference to get rid of the ordering at the driver level >> >> than to try to patch it up in the code. Dunno if that's convincing. >> > >> >I don't have a good solution, but changing all the drivers to register >> >their netdevs after the devlink instance is going to be quite painful >> >and too big for 'net'. I feel like the main motivation for this is the >> >ethtool compat stuff, which is not very convincing IMO. I'm quite happy >> >with the current flow where drivers call devlink_register() at the end >> >of their probe. >> > >> >Regarding a solution for the current crash, assuming we agree it's not a >> >netdevsim specific problem, I think the current fix [1] is OK. Note that >> >while it fixes the crash, it potentially creates other (less severe) >> >problems. After user space receives RTM_NEWLINK notification it will >> >need to wait for a certain period of time before issuing >> >'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I >> >guess it's not a big deal for drivers that only register one netdev >> >since they will very quickly follow with devlink_register(), but the >> >race window is larger for drivers that need to register many netdevs, >> >for either physical switch or eswitch ports. >> > >> >Long term, we either need to find a way to make the ethtool compat stuff >> >work correctly or just get rid of it and have affected drivers implement >> >the relevant ethtool operations instead of relying on devlink. >> > >> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/ >> >> I just had a call with Ido. We both think that this might be a good >> solution for -net to avoid the use after free. >> >> For net-next, we eventually should change driver init flows to register >> devlink instance first and only after that register devlink_port and >> related netdevice. The ordering is important for the userspace app. For >> example the init flow: >> <- RTnetlink new netdev event >> app sees devlink_port handle in IFLA_DEVLINK_PORT >> -> query devlink instance using this handle >> <- ENODEV >> >> The instance is not registered yet. > >This is supposed to be handled by devlink_notify_register() which sends >"delayed" notifications after devlink_register() is called. > >Unless something is broken, the scenario above shouldn't happen. Nope, RTnetlink message for new netdev is not handled by that. It is sent right away. > >> >> So we need to make sure all devlink_port_register() calls are happening >> after devlink_register(). This is aligned with the original flow before >> devlink_register() was moved by Leon. Also it is aligned with devlink >> reload and devlink port split flows. >> > >I don't know what it means. What I mean is that during port split, devlink instance is registered. During port creation and removal during reload, devlink instance is registered. We should maintain the same ordering during init/fini I believe. > >Thanks
On Mon, 28 Nov 2022 10:58:58 +0100 Jiri Pirko wrote: > >Long term, we either need to find a way to make the ethtool compat stuff > >work correctly or just get rid of it and have affected drivers implement > >the relevant ethtool operations instead of relying on devlink. > > > >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/ > > I just had a call with Ido. We both think that this might be a good > solution for -net to avoid the use after free. > > For net-next, we eventually should change driver init flows to register > devlink instance first and only after that register devlink_port and > related netdevice. The ordering is important for the userspace app. For > example the init flow: > <- RTnetlink new netdev event > app sees devlink_port handle in IFLA_DEVLINK_PORT > -> query devlink instance using this handle > <- ENODEV > > The instance is not registered yet. > > So we need to make sure all devlink_port_register() calls are happening > after devlink_register(). This is aligned with the original flow before > devlink_register() was moved by Leon. Also it is aligned with devlink > reload and devlink port split flows. Cool. Do you also agree with doing proper refcounting for the devlink instance struct and the liveness check after locking the instance?
Mon, Nov 28, 2022 at 07:20:43PM CET, kuba@kernel.org wrote: >On Mon, 28 Nov 2022 10:58:58 +0100 Jiri Pirko wrote: >> >Long term, we either need to find a way to make the ethtool compat stuff >> >work correctly or just get rid of it and have affected drivers implement >> >the relevant ethtool operations instead of relying on devlink. >> > >> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/ >> >> I just had a call with Ido. We both think that this might be a good >> solution for -net to avoid the use after free. >> >> For net-next, we eventually should change driver init flows to register >> devlink instance first and only after that register devlink_port and >> related netdevice. The ordering is important for the userspace app. For >> example the init flow: >> <- RTnetlink new netdev event >> app sees devlink_port handle in IFLA_DEVLINK_PORT >> -> query devlink instance using this handle >> <- ENODEV >> >> The instance is not registered yet. >> >> So we need to make sure all devlink_port_register() calls are happening >> after devlink_register(). This is aligned with the original flow before >> devlink_register() was moved by Leon. Also it is aligned with devlink >> reload and devlink port split flows. > >Cool. Do you also agree with doing proper refcounting for the devlink >instance struct and the liveness check after locking the instance? Could you elaborate a bit more? I missed that in the thread and can't find it. Why do we need it?
On Mon, Nov 28, 2022 at 02:52:00PM +0100, Jiri Pirko wrote: > Mon, Nov 28, 2022 at 12:50:15PM CET, leon@kernel.org wrote: > >On Mon, Nov 28, 2022 at 10:58:58AM +0100, Jiri Pirko wrote: > >> Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote: > >> >On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote: > >> >> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote: > >> >> > > I used the fix code proposed by Jakub, but it didn't work correctly, so > >> >> > > I tried to correct and improve it, and need some devlink helper. > >> >> > > > >> >> > > Anyway, it is a nsim problem, if we want fix this without touch devlink, > >> >> > > I think we can add a 'registered' field in struct nsim_dev, and it can be > >> >> > > checked in nsim_get_devlink_port() like this: > >> >> > > >> >> > I read the discussion and it's not clear to me why this is a netdevsim > >> >> > specific problem. The fundamental problem seems to be that it is > >> >> > possible to hold a reference on a devlink instance before it's > >> >> > registered and that devlink_free() will free the instance regardless of > >> >> > its current reference count because it expects devlink_unregister() to > >> >> > block. In this case, the instance was never registered, so > >> >> > devlink_unregister() is not called. > >> >> > > >> >> > ethtool was able to get a reference on the devlink instance before it > >> >> > was registered because netdevsim registers its netdevs before > >> >> > registering its devlink instance. However, netdevsim is not the only one > >> >> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially > >> >> > others do the same thing. > >> >> > > >> >> > When you think about it, it's strange that it's even possible for > >> >> > ethtool to reach the driver when the netdev used in the request is long > >> >> > gone, but it's not holding a reference on the netdev (it's holding a > >> >> > reference on the devlink instance instead) and > >> >> > devlink_compat_running_version() is called without RTNL. > >> >> > >> >> Indeed. We did a bit of a flip-flop with the devlink locking rules > >> >> and the fact that the instance is reachable before it is registered > >> >> is a leftover from a previous restructuring :( > >> >> > >> >> Hence my preference to get rid of the ordering at the driver level > >> >> than to try to patch it up in the code. Dunno if that's convincing. > >> > > >> >I don't have a good solution, but changing all the drivers to register > >> >their netdevs after the devlink instance is going to be quite painful > >> >and too big for 'net'. I feel like the main motivation for this is the > >> >ethtool compat stuff, which is not very convincing IMO. I'm quite happy > >> >with the current flow where drivers call devlink_register() at the end > >> >of their probe. > >> > > >> >Regarding a solution for the current crash, assuming we agree it's not a > >> >netdevsim specific problem, I think the current fix [1] is OK. Note that > >> >while it fixes the crash, it potentially creates other (less severe) > >> >problems. After user space receives RTM_NEWLINK notification it will > >> >need to wait for a certain period of time before issuing > >> >'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I > >> >guess it's not a big deal for drivers that only register one netdev > >> >since they will very quickly follow with devlink_register(), but the > >> >race window is larger for drivers that need to register many netdevs, > >> >for either physical switch or eswitch ports. > >> > > >> >Long term, we either need to find a way to make the ethtool compat stuff > >> >work correctly or just get rid of it and have affected drivers implement > >> >the relevant ethtool operations instead of relying on devlink. > >> > > >> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/ > >> > >> I just had a call with Ido. We both think that this might be a good > >> solution for -net to avoid the use after free. > >> > >> For net-next, we eventually should change driver init flows to register > >> devlink instance first and only after that register devlink_port and > >> related netdevice. The ordering is important for the userspace app. For > >> example the init flow: > >> <- RTnetlink new netdev event > >> app sees devlink_port handle in IFLA_DEVLINK_PORT > >> -> query devlink instance using this handle > >> <- ENODEV > >> > >> The instance is not registered yet. > > > >This is supposed to be handled by devlink_notify_register() which sends > >"delayed" notifications after devlink_register() is called. > > > >Unless something is broken, the scenario above shouldn't happen. > > Nope, RTnetlink message for new netdev is not handled by that. It is > sent right away. And why don't you fix your new commit dca56c3038c3 ("net: expose devlink port over rtnetlink") to do not return devlink instance unless it is registered? Why is it correct to expose devlink port with not ready to use devlink instance? Thanks
Tue, Nov 29, 2022 at 09:44:48AM CET, leon@kernel.org wrote: >On Mon, Nov 28, 2022 at 02:52:00PM +0100, Jiri Pirko wrote: >> Mon, Nov 28, 2022 at 12:50:15PM CET, leon@kernel.org wrote: >> >On Mon, Nov 28, 2022 at 10:58:58AM +0100, Jiri Pirko wrote: >> >> Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote: >> >> >On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote: >> >> >> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote: >> >> >> > > I used the fix code proposed by Jakub, but it didn't work correctly, so >> >> >> > > I tried to correct and improve it, and need some devlink helper. >> >> >> > > >> >> >> > > Anyway, it is a nsim problem, if we want fix this without touch devlink, >> >> >> > > I think we can add a 'registered' field in struct nsim_dev, and it can be >> >> >> > > checked in nsim_get_devlink_port() like this: >> >> >> > >> >> >> > I read the discussion and it's not clear to me why this is a netdevsim >> >> >> > specific problem. The fundamental problem seems to be that it is >> >> >> > possible to hold a reference on a devlink instance before it's >> >> >> > registered and that devlink_free() will free the instance regardless of >> >> >> > its current reference count because it expects devlink_unregister() to >> >> >> > block. In this case, the instance was never registered, so >> >> >> > devlink_unregister() is not called. >> >> >> > >> >> >> > ethtool was able to get a reference on the devlink instance before it >> >> >> > was registered because netdevsim registers its netdevs before >> >> >> > registering its devlink instance. However, netdevsim is not the only one >> >> >> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially >> >> >> > others do the same thing. >> >> >> > >> >> >> > When you think about it, it's strange that it's even possible for >> >> >> > ethtool to reach the driver when the netdev used in the request is long >> >> >> > gone, but it's not holding a reference on the netdev (it's holding a >> >> >> > reference on the devlink instance instead) and >> >> >> > devlink_compat_running_version() is called without RTNL. >> >> >> >> >> >> Indeed. We did a bit of a flip-flop with the devlink locking rules >> >> >> and the fact that the instance is reachable before it is registered >> >> >> is a leftover from a previous restructuring :( >> >> >> >> >> >> Hence my preference to get rid of the ordering at the driver level >> >> >> than to try to patch it up in the code. Dunno if that's convincing. >> >> > >> >> >I don't have a good solution, but changing all the drivers to register >> >> >their netdevs after the devlink instance is going to be quite painful >> >> >and too big for 'net'. I feel like the main motivation for this is the >> >> >ethtool compat stuff, which is not very convincing IMO. I'm quite happy >> >> >with the current flow where drivers call devlink_register() at the end >> >> >of their probe. >> >> > >> >> >Regarding a solution for the current crash, assuming we agree it's not a >> >> >netdevsim specific problem, I think the current fix [1] is OK. Note that >> >> >while it fixes the crash, it potentially creates other (less severe) >> >> >problems. After user space receives RTM_NEWLINK notification it will >> >> >need to wait for a certain period of time before issuing >> >> >'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I >> >> >guess it's not a big deal for drivers that only register one netdev >> >> >since they will very quickly follow with devlink_register(), but the >> >> >race window is larger for drivers that need to register many netdevs, >> >> >for either physical switch or eswitch ports. >> >> > >> >> >Long term, we either need to find a way to make the ethtool compat stuff >> >> >work correctly or just get rid of it and have affected drivers implement >> >> >the relevant ethtool operations instead of relying on devlink. >> >> > >> >> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/ >> >> >> >> I just had a call with Ido. We both think that this might be a good >> >> solution for -net to avoid the use after free. >> >> >> >> For net-next, we eventually should change driver init flows to register >> >> devlink instance first and only after that register devlink_port and >> >> related netdevice. The ordering is important for the userspace app. For >> >> example the init flow: >> >> <- RTnetlink new netdev event >> >> app sees devlink_port handle in IFLA_DEVLINK_PORT >> >> -> query devlink instance using this handle >> >> <- ENODEV >> >> >> >> The instance is not registered yet. >> > >> >This is supposed to be handled by devlink_notify_register() which sends >> >"delayed" notifications after devlink_register() is called. >> > >> >Unless something is broken, the scenario above shouldn't happen. >> >> Nope, RTnetlink message for new netdev is not handled by that. It is >> sent right away. > >And why don't you fix your new commit dca56c3038c3 ("net: expose devlink port over rtnetlink") >to do not return devlink instance unless it is registered? > >Why is it correct to expose devlink port with not ready to use devlink >instance? It is not, but: Devlink port which is "parent" of the netdev is registered. The netdev is created with devlink_port registered and that it guaranteed to not change during netdev lifetime. Therefore, it would be weird to have 2 RTnetlink events: 1. event of netdev being created without devlink port 2. event of netdev with devlink port If that is what you suggest. I'm working on a patchset that is making sure that the flow is always 1) devlink_register & netlink event 2) devlink_port_register & netlink event 3) netdev_register & netlink event Always the same. That means during init, during reload, during port split.
On Tue, Nov 29, 2022 at 10:05:10AM +0100, Jiri Pirko wrote: > Tue, Nov 29, 2022 at 09:44:48AM CET, leon@kernel.org wrote: > >On Mon, Nov 28, 2022 at 02:52:00PM +0100, Jiri Pirko wrote: > >> Mon, Nov 28, 2022 at 12:50:15PM CET, leon@kernel.org wrote: > >> >On Mon, Nov 28, 2022 at 10:58:58AM +0100, Jiri Pirko wrote: > >> >> Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote: > >> >> >On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote: > >> >> >> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote: > >> >> >> > > I used the fix code proposed by Jakub, but it didn't work correctly, so > >> >> >> > > I tried to correct and improve it, and need some devlink helper. > >> >> >> > > > >> >> >> > > Anyway, it is a nsim problem, if we want fix this without touch devlink, > >> >> >> > > I think we can add a 'registered' field in struct nsim_dev, and it can be > >> >> >> > > checked in nsim_get_devlink_port() like this: > >> >> >> > > >> >> >> > I read the discussion and it's not clear to me why this is a netdevsim > >> >> >> > specific problem. The fundamental problem seems to be that it is > >> >> >> > possible to hold a reference on a devlink instance before it's > >> >> >> > registered and that devlink_free() will free the instance regardless of > >> >> >> > its current reference count because it expects devlink_unregister() to > >> >> >> > block. In this case, the instance was never registered, so > >> >> >> > devlink_unregister() is not called. > >> >> >> > > >> >> >> > ethtool was able to get a reference on the devlink instance before it > >> >> >> > was registered because netdevsim registers its netdevs before > >> >> >> > registering its devlink instance. However, netdevsim is not the only one > >> >> >> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially > >> >> >> > others do the same thing. > >> >> >> > > >> >> >> > When you think about it, it's strange that it's even possible for > >> >> >> > ethtool to reach the driver when the netdev used in the request is long > >> >> >> > gone, but it's not holding a reference on the netdev (it's holding a > >> >> >> > reference on the devlink instance instead) and > >> >> >> > devlink_compat_running_version() is called without RTNL. > >> >> >> > >> >> >> Indeed. We did a bit of a flip-flop with the devlink locking rules > >> >> >> and the fact that the instance is reachable before it is registered > >> >> >> is a leftover from a previous restructuring :( > >> >> >> > >> >> >> Hence my preference to get rid of the ordering at the driver level > >> >> >> than to try to patch it up in the code. Dunno if that's convincing. > >> >> > > >> >> >I don't have a good solution, but changing all the drivers to register > >> >> >their netdevs after the devlink instance is going to be quite painful > >> >> >and too big for 'net'. I feel like the main motivation for this is the > >> >> >ethtool compat stuff, which is not very convincing IMO. I'm quite happy > >> >> >with the current flow where drivers call devlink_register() at the end > >> >> >of their probe. > >> >> > > >> >> >Regarding a solution for the current crash, assuming we agree it's not a > >> >> >netdevsim specific problem, I think the current fix [1] is OK. Note that > >> >> >while it fixes the crash, it potentially creates other (less severe) > >> >> >problems. After user space receives RTM_NEWLINK notification it will > >> >> >need to wait for a certain period of time before issuing > >> >> >'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I > >> >> >guess it's not a big deal for drivers that only register one netdev > >> >> >since they will very quickly follow with devlink_register(), but the > >> >> >race window is larger for drivers that need to register many netdevs, > >> >> >for either physical switch or eswitch ports. > >> >> > > >> >> >Long term, we either need to find a way to make the ethtool compat stuff > >> >> >work correctly or just get rid of it and have affected drivers implement > >> >> >the relevant ethtool operations instead of relying on devlink. > >> >> > > >> >> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/ > >> >> > >> >> I just had a call with Ido. We both think that this might be a good > >> >> solution for -net to avoid the use after free. > >> >> > >> >> For net-next, we eventually should change driver init flows to register > >> >> devlink instance first and only after that register devlink_port and > >> >> related netdevice. The ordering is important for the userspace app. For > >> >> example the init flow: > >> >> <- RTnetlink new netdev event > >> >> app sees devlink_port handle in IFLA_DEVLINK_PORT > >> >> -> query devlink instance using this handle > >> >> <- ENODEV > >> >> > >> >> The instance is not registered yet. > >> > > >> >This is supposed to be handled by devlink_notify_register() which sends > >> >"delayed" notifications after devlink_register() is called. > >> > > >> >Unless something is broken, the scenario above shouldn't happen. > >> > >> Nope, RTnetlink message for new netdev is not handled by that. It is > >> sent right away. > > > >And why don't you fix your new commit dca56c3038c3 ("net: expose devlink port over rtnetlink") > >to do not return devlink instance unless it is registered? > > > >Why is it correct to expose devlink port with not ready to use devlink > >instance? > > It is not, but: > Devlink port which is "parent" of the netdev is registered. The netdev > is created with devlink_port registered and that it guaranteed to not > change during netdev lifetime. Therefore, it would be weird to have 2 > RTnetlink events: > 1. event of netdev being created without devlink port > 2. event of netdev with devlink port > If that is what you suggest. Yes, something like that in similar way to IFLA_EVENT_NOTIFY_PEERS. > > I'm working on a patchset that is making sure that the flow is always > 1) devlink_register & netlink event > 2) devlink_port_register & netlink event > 3) netdev_register & netlink event > > Always the same. That means during init, during reload, during port > split. The thing that worries me is unclear lifetime of these devlink objects. devlink_port is aligned with netdev lifetime, devlink is with device one. It is very strange to me that we have netdev ready without device underneath ready. Can you please document the lifetime models and how all these objects interact? Thanks
Tue, Nov 29, 2022 at 12:20:01PM CET, leon@kernel.org wrote: >On Tue, Nov 29, 2022 at 10:05:10AM +0100, Jiri Pirko wrote: >> Tue, Nov 29, 2022 at 09:44:48AM CET, leon@kernel.org wrote: >> >On Mon, Nov 28, 2022 at 02:52:00PM +0100, Jiri Pirko wrote: >> >> Mon, Nov 28, 2022 at 12:50:15PM CET, leon@kernel.org wrote: >> >> >On Mon, Nov 28, 2022 at 10:58:58AM +0100, Jiri Pirko wrote: >> >> >> Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote: >> >> >> >On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote: >> >> >> >> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote: >> >> >> >> > > I used the fix code proposed by Jakub, but it didn't work correctly, so >> >> >> >> > > I tried to correct and improve it, and need some devlink helper. >> >> >> >> > > >> >> >> >> > > Anyway, it is a nsim problem, if we want fix this without touch devlink, >> >> >> >> > > I think we can add a 'registered' field in struct nsim_dev, and it can be >> >> >> >> > > checked in nsim_get_devlink_port() like this: >> >> >> >> > >> >> >> >> > I read the discussion and it's not clear to me why this is a netdevsim >> >> >> >> > specific problem. The fundamental problem seems to be that it is >> >> >> >> > possible to hold a reference on a devlink instance before it's >> >> >> >> > registered and that devlink_free() will free the instance regardless of >> >> >> >> > its current reference count because it expects devlink_unregister() to >> >> >> >> > block. In this case, the instance was never registered, so >> >> >> >> > devlink_unregister() is not called. >> >> >> >> > >> >> >> >> > ethtool was able to get a reference on the devlink instance before it >> >> >> >> > was registered because netdevsim registers its netdevs before >> >> >> >> > registering its devlink instance. However, netdevsim is not the only one >> >> >> >> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially >> >> >> >> > others do the same thing. >> >> >> >> > >> >> >> >> > When you think about it, it's strange that it's even possible for >> >> >> >> > ethtool to reach the driver when the netdev used in the request is long >> >> >> >> > gone, but it's not holding a reference on the netdev (it's holding a >> >> >> >> > reference on the devlink instance instead) and >> >> >> >> > devlink_compat_running_version() is called without RTNL. >> >> >> >> >> >> >> >> Indeed. We did a bit of a flip-flop with the devlink locking rules >> >> >> >> and the fact that the instance is reachable before it is registered >> >> >> >> is a leftover from a previous restructuring :( >> >> >> >> >> >> >> >> Hence my preference to get rid of the ordering at the driver level >> >> >> >> than to try to patch it up in the code. Dunno if that's convincing. >> >> >> > >> >> >> >I don't have a good solution, but changing all the drivers to register >> >> >> >their netdevs after the devlink instance is going to be quite painful >> >> >> >and too big for 'net'. I feel like the main motivation for this is the >> >> >> >ethtool compat stuff, which is not very convincing IMO. I'm quite happy >> >> >> >with the current flow where drivers call devlink_register() at the end >> >> >> >of their probe. >> >> >> > >> >> >> >Regarding a solution for the current crash, assuming we agree it's not a >> >> >> >netdevsim specific problem, I think the current fix [1] is OK. Note that >> >> >> >while it fixes the crash, it potentially creates other (less severe) >> >> >> >problems. After user space receives RTM_NEWLINK notification it will >> >> >> >need to wait for a certain period of time before issuing >> >> >> >'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I >> >> >> >guess it's not a big deal for drivers that only register one netdev >> >> >> >since they will very quickly follow with devlink_register(), but the >> >> >> >race window is larger for drivers that need to register many netdevs, >> >> >> >for either physical switch or eswitch ports. >> >> >> > >> >> >> >Long term, we either need to find a way to make the ethtool compat stuff >> >> >> >work correctly or just get rid of it and have affected drivers implement >> >> >> >the relevant ethtool operations instead of relying on devlink. >> >> >> > >> >> >> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/ >> >> >> >> >> >> I just had a call with Ido. We both think that this might be a good >> >> >> solution for -net to avoid the use after free. >> >> >> >> >> >> For net-next, we eventually should change driver init flows to register >> >> >> devlink instance first and only after that register devlink_port and >> >> >> related netdevice. The ordering is important for the userspace app. For >> >> >> example the init flow: >> >> >> <- RTnetlink new netdev event >> >> >> app sees devlink_port handle in IFLA_DEVLINK_PORT >> >> >> -> query devlink instance using this handle >> >> >> <- ENODEV >> >> >> >> >> >> The instance is not registered yet. >> >> > >> >> >This is supposed to be handled by devlink_notify_register() which sends >> >> >"delayed" notifications after devlink_register() is called. >> >> > >> >> >Unless something is broken, the scenario above shouldn't happen. >> >> >> >> Nope, RTnetlink message for new netdev is not handled by that. It is >> >> sent right away. >> > >> >And why don't you fix your new commit dca56c3038c3 ("net: expose devlink port over rtnetlink") >> >to do not return devlink instance unless it is registered? >> > >> >Why is it correct to expose devlink port with not ready to use devlink >> >instance? >> >> It is not, but: >> Devlink port which is "parent" of the netdev is registered. The netdev >> is created with devlink_port registered and that it guaranteed to not >> change during netdev lifetime. Therefore, it would be weird to have 2 >> RTnetlink events: >> 1. event of netdev being created without devlink port >> 2. event of netdev with devlink port >> If that is what you suggest. > >Yes, something like that in similar way to IFLA_EVENT_NOTIFY_PEERS. > >> >> I'm working on a patchset that is making sure that the flow is always >> 1) devlink_register & netlink event >> 2) devlink_port_register & netlink event >> 3) netdev_register & netlink event >> >> Always the same. That means during init, during reload, during port >> split. > >The thing that worries me is unclear lifetime of these devlink objects. >devlink_port is aligned with netdev lifetime, devlink is with device one. Correct. Therefore it makes sense to create device object and propagate the existence to the user, then create port object with netdevice and propagate those. > >It is very strange to me that we have netdev ready without device >underneath ready. > >Can you please document the lifetime models and how all these objects >interact? Yep. On top of that, I will put an assert into devlink_port_register to check devlink instance is registered. > >Thanks
On Tue, 29 Nov 2022 09:31:40 +0100 Jiri Pirko wrote: > >Cool. Do you also agree with doing proper refcounting for the devlink > >instance struct and the liveness check after locking the instance? > > Could you elaborate a bit more? I missed that in the thread and can't > find it. Why do we need it? Look at the __devlink_free() and changes to devlink_compat_flash_update() here: https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/ The model I had in mind (a year ago when it all started) was that the driver takes the devlink instance lock around its entire init path, including the registration of the instance. This way the devlink instance is never visible "half initialized". I mean - it's "visible" as in you can see a notification over netlink before init is done but you can't access it until the init in the driver is completed and it releases the instance lock. For that to work and to avoid ordering issues with netdev we need to allow unregistering a devlink instance before all references are gone. So we atomically look up and take a reference on a devlink instance. Then take its lock. Then under the instance lock we check if it's still registered. For devlink core that's not a problem it's all hidden in the helpers.
On Tue, Nov 29, 2022 at 06:18:26PM -0800, Jakub Kicinski wrote: > On Tue, 29 Nov 2022 09:31:40 +0100 Jiri Pirko wrote: > > >Cool. Do you also agree with doing proper refcounting for the devlink > > >instance struct and the liveness check after locking the instance? > > > > Could you elaborate a bit more? I missed that in the thread and can't > > find it. Why do we need it? > > Look at the __devlink_free() and changes > to devlink_compat_flash_update() here: > > https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/ > > The model I had in mind (a year ago when it all started) was that > the driver takes the devlink instance lock around its entire init path, > including the registration of the instance. This way the devlink > instance is never visible "half initialized". I mean - it's "visible" > as in you can see a notification over netlink before init is done but > you can't access it until the init in the driver is completed and it > releases the instance lock. In parallel thread, Jiri wanted to avoid this situation of netlink notifications for not-visible yet object. He gave as an example devlink_port which is advertised without devlink being ready. Thanks
Wed, Nov 30, 2022 at 09:54:06AM CET, leon@kernel.org wrote: >On Tue, Nov 29, 2022 at 06:18:26PM -0800, Jakub Kicinski wrote: >> On Tue, 29 Nov 2022 09:31:40 +0100 Jiri Pirko wrote: >> > >Cool. Do you also agree with doing proper refcounting for the devlink >> > >instance struct and the liveness check after locking the instance? >> > >> > Could you elaborate a bit more? I missed that in the thread and can't >> > find it. Why do we need it? >> >> Look at the __devlink_free() and changes >> to devlink_compat_flash_update() here: >> >> https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/ >> >> The model I had in mind (a year ago when it all started) was that >> the driver takes the devlink instance lock around its entire init path, >> including the registration of the instance. This way the devlink >> instance is never visible "half initialized". I mean - it's "visible" >> as in you can see a notification over netlink before init is done but >> you can't access it until the init in the driver is completed and it >> releases the instance lock. > >In parallel thread, Jiri wanted to avoid this situation of netlink >notifications for not-visible yet object. He gave as an example >devlink_port which is advertised without devlink being ready. To be honest, I'm lost tracking what you point at. I never suggested to send notification of devlink_port before devlink notification is sent previously. Perhaps you misread that. > >Thanks
Wed, Nov 30, 2022 at 03:18:26AM CET, kuba@kernel.org wrote: >On Tue, 29 Nov 2022 09:31:40 +0100 Jiri Pirko wrote: >> >Cool. Do you also agree with doing proper refcounting for the devlink >> >instance struct and the liveness check after locking the instance? >> >> Could you elaborate a bit more? I missed that in the thread and can't >> find it. Why do we need it? > >Look at the __devlink_free() and changes >to devlink_compat_flash_update() here: > >https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/ **) I see. With the change I suggest, meaning doing devlink_port_register/unregister() and netdev_register/unregister only for registered devlink instance, you don't need this at all. When you hit this compat callback, the netdevice is there and therefore devlink instance is registered for sure. > >The model I had in mind (a year ago when it all started) was that >the driver takes the devlink instance lock around its entire init path, >including the registration of the instance. This way the devlink >instance is never visible "half initialized". I mean - it's "visible" >as in you can see a notification over netlink before init is done but >you can't access it until the init in the driver is completed and it >releases the instance lock. What is "half-initialized"? Take devlink reload flow for instance. There are multiple things removed/readded, like devlink_port and related netdevice. No problem there. I think that we really need to go a step back and put the devlink_register at the point of init flow where all things that are "static" during register lifetime are initialized, the rest would be initialized later on. This would make things aligned with devlink_reload() and would make possible to share init/fini/reload code in drivers. > >For that to work and to avoid ordering issues with netdev we need to >allow unregistering a devlink instance before all references are gone. References of what? devlink instance, put by devlink_put()? > >So we atomically look up and take a reference on a devlink instance. >Then take its lock. Then under the instance lock we check if it's still >registered. As mentioned above (**), I don't think this is needed. > >For devlink core that's not a problem it's all hidden in the helpers.
On Wed, 30 Nov 2022 10:54:06 +0200 Leon Romanovsky wrote: > In parallel thread, Jiri wanted to avoid this situation of netlink > notifications for not-visible yet object. He gave as an example > devlink_port which is advertised without devlink being ready. To be clear the exact ordering matters much less in my scheme, because the user space caller will wait for the instance lock before checking if the instance is registered. So an early user request will just sleep on the instance lock until init is done.
On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote: > >Look at the __devlink_free() and changes > >to devlink_compat_flash_update() here: > > > >https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/ > > **) > I see. With the change I suggest, meaning doing > devlink_port_register/unregister() and netdev_register/unregister only > for registered devlink instance, you don't need this at all. When you > hit this compat callback, the netdevice is there and therefore devlink > instance is registered for sure. If you move devlink registration up it has to be under the instance lock, otherwise we're back to reload problems. That implies unregister should be under the lock too. But then we can't wait for refs in unregister. Perhaps I don't understand the suggestion. > >The model I had in mind (a year ago when it all started) was that > >the driver takes the devlink instance lock around its entire init path, > >including the registration of the instance. This way the devlink > >instance is never visible "half initialized". I mean - it's "visible" > >as in you can see a notification over netlink before init is done but > >you can't access it until the init in the driver is completed and it > >releases the instance lock. > > What is "half-initialized"? Take devlink reload flow for instance. There > are multiple things removed/readded, like devlink_port and related > netdevice. No problem there. Yes, but reload is under the instance lock, so nothing can mess with a device in a transitional state. > I think that we really need to go a step back and put the > devlink_register at the point of init flow where all things that are > "static" during register lifetime are initialized, the rest would be > initialized later on. This would make things aligned with > devlink_reload() and would make possible to share init/fini/reload > code in drivers. Yes, I agree that the move should be done but I don't think its sufficient. > >For that to work and to avoid ordering issues with netdev we need to > >allow unregistering a devlink instance before all references are gone. > > References of what? devlink instance, put by devlink_put()? Yes. > >So we atomically look up and take a reference on a devlink instance. > >Then take its lock. Then under the instance lock we check if it's still > >registered. > > As mentioned above (**), I don't think this is needed. But it is, please just let me do it and make the bugs stop
Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote: >On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote: >> >Look at the __devlink_free() and changes >> >to devlink_compat_flash_update() here: >> > >> >https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/ >> >> **) >> I see. With the change I suggest, meaning doing >> devlink_port_register/unregister() and netdev_register/unregister only >> for registered devlink instance, you don't need this at all. When you >> hit this compat callback, the netdevice is there and therefore devlink >> instance is registered for sure. > >If you move devlink registration up it has to be under the instance >lock, otherwise we're back to reload problems. That implies unregister >should be under the lock too. But then we can't wait for refs in >unregister. Perhaps I don't understand the suggestion. I unlock for register and for the rest of the init I lock again. > >> >The model I had in mind (a year ago when it all started) was that >> >the driver takes the devlink instance lock around its entire init path, >> >including the registration of the instance. This way the devlink >> >instance is never visible "half initialized". I mean - it's "visible" >> >as in you can see a notification over netlink before init is done but >> >you can't access it until the init in the driver is completed and it >> >releases the instance lock. >> >> What is "half-initialized"? Take devlink reload flow for instance. There >> are multiple things removed/readded, like devlink_port and related >> netdevice. No problem there. > >Yes, but reload is under the instance lock, so nothing can mess with >a device in a transitional state. Sure, that is what I want to do too. To be under instance lock. > >> I think that we really need to go a step back and put the >> devlink_register at the point of init flow where all things that are >> "static" during register lifetime are initialized, the rest would be >> initialized later on. This would make things aligned with >> devlink_reload() and would make possible to share init/fini/reload >> code in drivers. > >Yes, I agree that the move should be done but I don't think its >sufficient. > >> >For that to work and to avoid ordering issues with netdev we need to >> >allow unregistering a devlink instance before all references are gone. >> >> References of what? devlink instance, put by devlink_put()? > >Yes. > >> >So we atomically look up and take a reference on a devlink instance. >> >Then take its lock. Then under the instance lock we check if it's still >> >registered. >> >> As mentioned above (**), I don't think this is needed. > >But it is, please just let me do it and make the bugs stop
On Wed, 30 Nov 2022 18:00:05 +0100 Jiri Pirko wrote: > Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote: > >On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote: > >> **) > >> I see. With the change I suggest, meaning doing > >> devlink_port_register/unregister() and netdev_register/unregister only > >> for registered devlink instance, you don't need this at all. When you > >> hit this compat callback, the netdevice is there and therefore devlink > >> instance is registered for sure. > > > >If you move devlink registration up it has to be under the instance > >lock, otherwise we're back to reload problems. That implies unregister > >should be under the lock too. But then we can't wait for refs in > >unregister. Perhaps I don't understand the suggestion. > > I unlock for register and for the rest of the init I lock again. The moment you register that instance callbacks can start coming. Leon move the register call last for a good reason - all drivers we looked at had bugs in handling init. We can come up with fixes in the drivers, flags, devlink_set_features() and all that sort of garbage until the day we die but let's not. The driver facing API should be simple - hold the lock around entire init. > >> What is "half-initialized"? Take devlink reload flow for instance. There > >> are multiple things removed/readded, like devlink_port and related > >> netdevice. No problem there. > > > >Yes, but reload is under the instance lock, so nothing can mess with > >a device in a transitional state. > > Sure, that is what I want to do too. To be under instance lock. I'm confused, you just said "I unlock for register". > >> As mentioned above (**), I don't think this is needed. > > > >But it is, please just let me do it and make the bugs stop
On Wed, Nov 30, 2022 at 09:20:42AM -0800, Jakub Kicinski wrote: > On Wed, 30 Nov 2022 18:00:05 +0100 Jiri Pirko wrote: > > Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote: > > >On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote: > > >> **) > > >> I see. With the change I suggest, meaning doing > > >> devlink_port_register/unregister() and netdev_register/unregister only > > >> for registered devlink instance, you don't need this at all. When you > > >> hit this compat callback, the netdevice is there and therefore devlink > > >> instance is registered for sure. > > > > > >If you move devlink registration up it has to be under the instance > > >lock, otherwise we're back to reload problems. That implies unregister > > >should be under the lock too. But then we can't wait for refs in > > >unregister. Perhaps I don't understand the suggestion. > > > > I unlock for register and for the rest of the init I lock again. > > The moment you register that instance callbacks can start coming. > Leon move the register call last for a good reason - all drivers > we looked at had bugs in handling init. Plus we had very cozy lock->unlock->lock sequences during devlink command execution, which caused to races between devlink calls and driver initialization. So I'm also interested to see what Jiri meant by saying "I unlock for register and for the rest of the init I lock again". Thanks
On 11/23/2022 11:18 AM, Ido Schimmel wrote: > On Wed, Nov 23, 2022 at 04:34:49PM +0800, Yang Yingliang wrote: >> >> On 2022/11/23 15:41, Leon Romanovsky wrote: >>> On Wed, Nov 23, 2022 at 02:40:24PM +0800, Yang Yingliang wrote: >>>> On 2022/11/23 4:27, Jakub Kicinski wrote: >>>>> On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote: >>>>>>>> Please fix nsim instead of devlink. >>>>>>> I think if devlink is not registered, it can not be get and used, but there >>>>>>> is no API for driver to check this, can I introduce a new helper name >>>>>>> devlink_is_registered() for driver using. >>>>>> There is no need in such API as driver calls to devlink_register() and >>>>>> as such it knows when devlink is registered. >>>>>> >>>>>> This UAF is nsim specific issue. Real devices have single .probe() >>>>>> routine with serialized registration flow. None of them will use >>>>>> devlink_is_registered() call. >>>>> Agreed, the fix is to move the register call back. >>>>> Something along the lines of the untested patch below? >>>>> Yang Yingliang would you be able to turn that into a real patch? >>>>> >>>>> diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c >>>>> index e14686594a71..26602d5fe0a2 100644 >>>>> --- a/drivers/net/netdevsim/dev.c >>>>> +++ b/drivers/net/netdevsim/dev.c >>>>> @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) >>>>> err = devlink_params_register(devlink, nsim_devlink_params, >>>>> ARRAY_SIZE(nsim_devlink_params)); >>>>> if (err) >>>>> - goto err_dl_unregister; >>>>> + goto err_resource_unregister; >>>>> nsim_devlink_set_params_init_values(nsim_dev, devlink); >>>>> + /* here, because params API still expect devlink to be unregistered */ >>>>> + devl_register(devlink); >>>>> + >>>> devlink_set_features() called at last in probe() also needs devlink is not >>>> registered. >>>>> err = nsim_dev_dummy_region_init(nsim_dev, devlink); >>>>> if (err) >>>>> - goto err_params_unregister; >>>>> + goto err_dl_unregister; >>>>> err = nsim_dev_traps_init(devlink); >>>>> if (err) >>>>> @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) >>>>> nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY; >>>>> devlink_set_features(devlink, DEVLINK_F_RELOAD); >>>>> devl_unlock(devlink); >>>>> - devlink_register(devlink); >>>>> return 0; >>>>> err_hwstats_exit: >>>>> @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev) >>>>> nsim_dev_traps_exit(devlink); >>>>> err_dummy_region_exit: >>>>> nsim_dev_dummy_region_exit(nsim_dev); >>>>> -err_params_unregister: >>>>> +err_dl_unregister: >>>>> + devl_unregister(devlink); >>>> It races with dev_ethtool(): >>>> dev_ethtool >>>> devlink_try_get() >>>> nsim_drv_probe >>>> devl_lock() >>>> devl_lock() >>>> devlink_unregister() >>>> devlink_put() >>>> wait_for_completion() <- the refcount is >>>> got in dev_ethtool, it causes ABBA deadlock >>> But all these races are nsim specific ones. >>> Can you please explain why devlink.[c|h] MUST be changed and nsim can't >>> be fixed? >> I used the fix code proposed by Jakub, but it didn't work correctly, so >> I tried to correct and improve it, and need some devlink helper. >> >> Anyway, it is a nsim problem, if we want fix this without touch devlink, >> I think we can add a 'registered' field in struct nsim_dev, and it can be >> checked in nsim_get_devlink_port() like this: > > I read the discussion and it's not clear to me why this is a netdevsim > specific problem. The fundamental problem seems to be that it is > possible to hold a reference on a devlink instance before it's > registered and that devlink_free() will free the instance regardless of > its current reference count because it expects devlink_unregister() to > block. In this case, the instance was never registered, so > devlink_unregister() is not called. > Shouldn't devlink_free never free until after all references drop? That's how typical reference count works. I guess its just assuming that unregister will prevent this but.... > ethtool was able to get a reference on the devlink instance before it > was registered because netdevsim registers its netdevs before > registering its devlink instance. However, netdevsim is not the only one > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially > others do the same thing. > We used to register devlink before netdev, but this was changed at some point because we wanted to register it late after things had been setup. > When you think about it, it's strange that it's even possible for > ethtool to reach the driver when the netdev used in the request is long > gone, but it's not holding a reference on the netdev (it's holding a > reference on the devlink instance instead) and > devlink_compat_running_version() is called without RTNL. Yea that seems weird. I would expect ethtool to get to devlink via netdev? (especially now with the way we changed netdev to hold the devlink?) Hmm.
Wed, Nov 30, 2022 at 06:20:42PM CET, kuba@kernel.org wrote: >On Wed, 30 Nov 2022 18:00:05 +0100 Jiri Pirko wrote: >> Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote: >> >On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote: >> >> **) >> >> I see. With the change I suggest, meaning doing >> >> devlink_port_register/unregister() and netdev_register/unregister only >> >> for registered devlink instance, you don't need this at all. When you >> >> hit this compat callback, the netdevice is there and therefore devlink >> >> instance is registered for sure. >> > >> >If you move devlink registration up it has to be under the instance >> >lock, otherwise we're back to reload problems. That implies unregister >> >should be under the lock too. But then we can't wait for refs in >> >unregister. Perhaps I don't understand the suggestion. >> >> I unlock for register and for the rest of the init I lock again. > >The moment you register that instance callbacks can start coming. >Leon move the register call last for a good reason - all drivers >we looked at had bugs in handling init. > >We can come up with fixes in the drivers, flags, devlink_set_features() >and all that sort of garbage until the day we die but let's not. >The driver facing API should be simple - hold the lock around entire >init. > >> >> What is "half-initialized"? Take devlink reload flow for instance. There >> >> are multiple things removed/readded, like devlink_port and related >> >> netdevice. No problem there. >> > >> >Yes, but reload is under the instance lock, so nothing can mess with >> >a device in a transitional state. >> >> Sure, that is what I want to do too. To be under instance lock. > >I'm confused, you just said "I unlock for register". Ah, right. I got your point now, you don't want the user to see half-init devlink objects. In reload, the secondhalf-uninit&seconfhalf-init happens atomically under instance lock, so the user sees the whole picture still. But is it a problem? For ports, I don't think so. For the other objects being removed-readded during reload, why do you think it is a problem? > >> >> As mentioned above (**), I don't think this is needed. >> > >> >But it is, please just let me do it and make the bugs stop
Wed, Nov 30, 2022 at 08:20:34PM CET, leon@kernel.org wrote: >On Wed, Nov 30, 2022 at 09:20:42AM -0800, Jakub Kicinski wrote: >> On Wed, 30 Nov 2022 18:00:05 +0100 Jiri Pirko wrote: >> > Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote: >> > >On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote: >> > >> **) >> > >> I see. With the change I suggest, meaning doing >> > >> devlink_port_register/unregister() and netdev_register/unregister only >> > >> for registered devlink instance, you don't need this at all. When you >> > >> hit this compat callback, the netdevice is there and therefore devlink >> > >> instance is registered for sure. >> > > >> > >If you move devlink registration up it has to be under the instance >> > >lock, otherwise we're back to reload problems. That implies unregister >> > >should be under the lock too. But then we can't wait for refs in >> > >unregister. Perhaps I don't understand the suggestion. >> > >> > I unlock for register and for the rest of the init I lock again. >> >> The moment you register that instance callbacks can start coming. >> Leon move the register call last for a good reason - all drivers >> we looked at had bugs in handling init. > >Plus we had very cozy lock->unlock->lock sequences during devlink >command execution, which caused to races between devlink calls >and driver initialization. So? Why do you think it is a problem? > >So I'm also interested to see what Jiri meant by saying "I unlock for >register and for the rest of the init I lock again". > >Thanks
On Thu, Dec 01, 2022 at 09:40:00AM +0100, Jiri Pirko wrote: > Wed, Nov 30, 2022 at 08:20:34PM CET, leon@kernel.org wrote: > >On Wed, Nov 30, 2022 at 09:20:42AM -0800, Jakub Kicinski wrote: > >> On Wed, 30 Nov 2022 18:00:05 +0100 Jiri Pirko wrote: > >> > Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote: > >> > >On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote: > >> > >> **) > >> > >> I see. With the change I suggest, meaning doing > >> > >> devlink_port_register/unregister() and netdev_register/unregister only > >> > >> for registered devlink instance, you don't need this at all. When you > >> > >> hit this compat callback, the netdevice is there and therefore devlink > >> > >> instance is registered for sure. > >> > > > >> > >If you move devlink registration up it has to be under the instance > >> > >lock, otherwise we're back to reload problems. That implies unregister > >> > >should be under the lock too. But then we can't wait for refs in > >> > >unregister. Perhaps I don't understand the suggestion. > >> > > >> > I unlock for register and for the rest of the init I lock again. > >> > >> The moment you register that instance callbacks can start coming. > >> Leon move the register call last for a good reason - all drivers > >> we looked at had bugs in handling init. > > > >Plus we had very cozy lock->unlock->lock sequences during devlink > >command execution, which caused to races between devlink calls > >and driver initialization. > > So? Why do you think it is a problem? We need to see the actual implementation. In general, once you unlock you can get other threads to change the state of your device. > > > > >So I'm also interested to see what Jiri meant by saying "I unlock for > >register and for the rest of the init I lock again". > > > >Thanks
Thu, Dec 01, 2022 at 11:05:38AM CET, leon@kernel.org wrote: >On Thu, Dec 01, 2022 at 09:40:00AM +0100, Jiri Pirko wrote: >> Wed, Nov 30, 2022 at 08:20:34PM CET, leon@kernel.org wrote: >> >On Wed, Nov 30, 2022 at 09:20:42AM -0800, Jakub Kicinski wrote: >> >> On Wed, 30 Nov 2022 18:00:05 +0100 Jiri Pirko wrote: >> >> > Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote: >> >> > >On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote: >> >> > >> **) >> >> > >> I see. With the change I suggest, meaning doing >> >> > >> devlink_port_register/unregister() and netdev_register/unregister only >> >> > >> for registered devlink instance, you don't need this at all. When you >> >> > >> hit this compat callback, the netdevice is there and therefore devlink >> >> > >> instance is registered for sure. >> >> > > >> >> > >If you move devlink registration up it has to be under the instance >> >> > >lock, otherwise we're back to reload problems. That implies unregister >> >> > >should be under the lock too. But then we can't wait for refs in >> >> > >unregister. Perhaps I don't understand the suggestion. >> >> > >> >> > I unlock for register and for the rest of the init I lock again. >> >> >> >> The moment you register that instance callbacks can start coming. >> >> Leon move the register call last for a good reason - all drivers >> >> we looked at had bugs in handling init. >> > >> >Plus we had very cozy lock->unlock->lock sequences during devlink >> >command execution, which caused to races between devlink calls >> >and driver initialization. >> >> So? Why do you think it is a problem? > >We need to see the actual implementation. In general, once you unlock >you can get other threads to change the state of your device. Sure, I still don't understand, why it would be a problem. > >> >> > >> >So I'm also interested to see what Jiri meant by saying "I unlock for >> >register and for the rest of the init I lock again". >> > >> >Thanks
diff --git a/net/core/devlink.c b/net/core/devlink.c index 89baa7c0938b..6453ac0558fb 100644 --- a/net/core/devlink.c +++ b/net/core/devlink.c @@ -250,6 +250,9 @@ void devlink_put(struct devlink *devlink) struct devlink *__must_check devlink_try_get(struct devlink *devlink) { + if (!xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED)) + return NULL; + if (refcount_inc_not_zero(&devlink->refcount)) return devlink; return NULL;
I got a UAF report as following when doing fault injection test: BUG: KASAN: use-after-free in devlink_compat_running_version+0x5b9/0x6a0 Read of size 8 at addr ffff88810ac591f0 by task systemd-udevd/458 CPU: 2 PID: 458 Comm: systemd-udevd Not tainted 6.1.0-rc5-00155-ga9d2b54dd4e7-dirty #1359 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> kasan_report+0x90/0x190 devlink_compat_running_version+0x5b9/0x6a0 dev_ethtool+0x2ca/0x340 dev_ioctl+0x16c/0xff0 sock_do_ioctl+0x1ae/0x220 Allocated by task 456: kasan_save_stack+0x1c/0x40 kasan_set_track+0x21/0x30 __kasan_kmalloc+0x7e/0x90 __kmalloc+0x59/0x1b0 devlink_alloc_ns+0xf7/0xa10 nsim_drv_probe+0xa8/0x150 [netdevsim] really_probe+0x1ff/0x660 Freed by task 456: kasan_save_stack+0x1c/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x2a/0x50 __kasan_slab_free+0x102/0x190 __kmem_cache_free+0xca/0x400 nsim_drv_probe.cold.31+0x2af/0xf62 [netdevsim] really_probe+0x1ff/0x660 It happened like this: processor A processor B nsim_drv_probe() devlink_alloc_ns() nsim_dev_port_add_all() __nsim_dev_port_add() // add eth1 successful dev_ethtool() ethtool_get_drvinfo(eth1) netdev_to_devlink_get(eth1) devlink_try_get() // get devlink here __nsim_dev_port_add() // add eth2 failed, goto error devlink_free() // it's called in the error path devlink_compat_running_version() <- causes UAF devlink_register() // it's in normal path, not called yet There is two ports to add in nsim_dev_port_add_all(), if first port is added successful, the devlink is visable and can be get by devlink_try_get() on another cpu, but it is not registered yet. And then the second port is failed to added, in the error path, devlink_free() is called, at last it causes UAF. Add check in devlink_try_get(), if the 'devlink' is not registered it returns NULL to avoid UAF like this case. Fixes: a62fdbbe9403 ("netdevsim: implement ndo_get_devlink_port") Reported-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> --- net/core/devlink.c | 3 +++ 1 file changed, 3 insertions(+)