diff mbox series

[net] net: devlink: fix UAF in devlink_compat_running_version()

Message ID 20221122121048.776643-1-yangyingliang@huawei.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net] net: devlink: fix UAF in devlink_compat_running_version() | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch warning WARNING: Statements should start on a tabstop
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Yang Yingliang Nov. 22, 2022, 12:10 p.m. UTC
I got a UAF report as following when doing fault injection test:

BUG: KASAN: use-after-free in devlink_compat_running_version+0x5b9/0x6a0
Read of size 8 at addr ffff88810ac591f0 by task systemd-udevd/458

CPU: 2 PID: 458 Comm: systemd-udevd Not tainted 6.1.0-rc5-00155-ga9d2b54dd4e7-dirty #1359
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
 <TASK>
 kasan_report+0x90/0x190
 devlink_compat_running_version+0x5b9/0x6a0
 dev_ethtool+0x2ca/0x340
 dev_ioctl+0x16c/0xff0
 sock_do_ioctl+0x1ae/0x220

Allocated by task 456:
 kasan_save_stack+0x1c/0x40
 kasan_set_track+0x21/0x30
 __kasan_kmalloc+0x7e/0x90
 __kmalloc+0x59/0x1b0
 devlink_alloc_ns+0xf7/0xa10
 nsim_drv_probe+0xa8/0x150 [netdevsim]
 really_probe+0x1ff/0x660

Freed by task 456:
 kasan_save_stack+0x1c/0x40
 kasan_set_track+0x21/0x30
 kasan_save_free_info+0x2a/0x50
 __kasan_slab_free+0x102/0x190
 __kmem_cache_free+0xca/0x400
 nsim_drv_probe.cold.31+0x2af/0xf62 [netdevsim]
 really_probe+0x1ff/0x660

It happened like this:

processor A							processor B
nsim_drv_probe()
  devlink_alloc_ns()
  nsim_dev_port_add_all()
    __nsim_dev_port_add() // add eth1 successful
								dev_ethtool()
								  ethtool_get_drvinfo(eth1)
								    netdev_to_devlink_get(eth1)
								      devlink_try_get() // get devlink here
    __nsim_dev_port_add() // add eth2 failed, goto error
      devlink_free() // it's called in the error path
								  devlink_compat_running_version() <- causes UAF
  devlink_register() // it's in normal path, not called yet

There is two ports to add in nsim_dev_port_add_all(), if first
port is added successful, the devlink is visable and can be get
by devlink_try_get() on another cpu, but it is not registered
yet. And then the second port is failed to added, in the error
path, devlink_free() is called, at last it causes UAF.

Add check in devlink_try_get(), if the 'devlink' is not registered
it returns NULL to avoid UAF like this case.

Fixes: a62fdbbe9403 ("netdevsim: implement ndo_get_devlink_port")
Reported-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 net/core/devlink.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Leon Romanovsky Nov. 22, 2022, 2:32 p.m. UTC | #1
On Tue, Nov 22, 2022 at 08:10:48PM +0800, Yang Yingliang wrote:
> I got a UAF report as following when doing fault injection test:
> 
> BUG: KASAN: use-after-free in devlink_compat_running_version+0x5b9/0x6a0
> Read of size 8 at addr ffff88810ac591f0 by task systemd-udevd/458
> 
> CPU: 2 PID: 458 Comm: systemd-udevd Not tainted 6.1.0-rc5-00155-ga9d2b54dd4e7-dirty #1359
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> Call Trace:
>  <TASK>
>  kasan_report+0x90/0x190
>  devlink_compat_running_version+0x5b9/0x6a0
>  dev_ethtool+0x2ca/0x340
>  dev_ioctl+0x16c/0xff0
>  sock_do_ioctl+0x1ae/0x220
> 
> Allocated by task 456:
>  kasan_save_stack+0x1c/0x40
>  kasan_set_track+0x21/0x30
>  __kasan_kmalloc+0x7e/0x90
>  __kmalloc+0x59/0x1b0
>  devlink_alloc_ns+0xf7/0xa10
>  nsim_drv_probe+0xa8/0x150 [netdevsim]
>  really_probe+0x1ff/0x660
> 
> Freed by task 456:
>  kasan_save_stack+0x1c/0x40
>  kasan_set_track+0x21/0x30
>  kasan_save_free_info+0x2a/0x50
>  __kasan_slab_free+0x102/0x190
>  __kmem_cache_free+0xca/0x400
>  nsim_drv_probe.cold.31+0x2af/0xf62 [netdevsim]
>  really_probe+0x1ff/0x660
> 
> It happened like this:
> 
> processor A							processor B
> nsim_drv_probe()
>   devlink_alloc_ns()
>   nsim_dev_port_add_all()
>     __nsim_dev_port_add() // add eth1 successful
> 								dev_ethtool()
> 								  ethtool_get_drvinfo(eth1)
> 								    netdev_to_devlink_get(eth1)
> 								      devlink_try_get() // get devlink here
>     __nsim_dev_port_add() // add eth2 failed, goto error
>       devlink_free() // it's called in the error path
> 								  devlink_compat_running_version() <- causes UAF
>   devlink_register() // it's in normal path, not called yet
> 
> There is two ports to add in nsim_dev_port_add_all(), if first
> port is added successful, the devlink is visable and can be get
> by devlink_try_get() on another cpu, but it is not registered
> yet. And then the second port is failed to added, in the error
> path, devlink_free() is called, at last it causes UAF.
> 
> Add check in devlink_try_get(), if the 'devlink' is not registered
> it returns NULL to avoid UAF like this case.
> 
> Fixes: a62fdbbe9403 ("netdevsim: implement ndo_get_devlink_port")
> Reported-by: Zhengchao Shao <shaozhengchao@huawei.com>
> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
> ---
>  net/core/devlink.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/core/devlink.c b/net/core/devlink.c
> index 89baa7c0938b..6453ac0558fb 100644
> --- a/net/core/devlink.c
> +++ b/net/core/devlink.c
> @@ -250,6 +250,9 @@ void devlink_put(struct devlink *devlink)
>  
>  struct devlink *__must_check devlink_try_get(struct devlink *devlink)
>  {
> +	 if (!xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED))
> +		return NULL;
> +

Please fix nsim instead of devlink.

Thanks

>  	if (refcount_inc_not_zero(&devlink->refcount))
>  		return devlink;
>  	return NULL;
> -- 
> 2.25.1
>
Yang Yingliang Nov. 22, 2022, 3:25 p.m. UTC | #2
Hi,

On 2022/11/22 22:32, Leon Romanovsky wrote:
> On Tue, Nov 22, 2022 at 08:10:48PM +0800, Yang Yingliang wrote:
>> I got a UAF report as following when doing fault injection test:
>>
>> BUG: KASAN: use-after-free in devlink_compat_running_version+0x5b9/0x6a0
>> Read of size 8 at addr ffff88810ac591f0 by task systemd-udevd/458
>>
>> CPU: 2 PID: 458 Comm: systemd-udevd Not tainted 6.1.0-rc5-00155-ga9d2b54dd4e7-dirty #1359
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
>> Call Trace:
>>   <TASK>
>>   kasan_report+0x90/0x190
>>   devlink_compat_running_version+0x5b9/0x6a0
>>   dev_ethtool+0x2ca/0x340
>>   dev_ioctl+0x16c/0xff0
>>   sock_do_ioctl+0x1ae/0x220
>>
>> Allocated by task 456:
>>   kasan_save_stack+0x1c/0x40
>>   kasan_set_track+0x21/0x30
>>   __kasan_kmalloc+0x7e/0x90
>>   __kmalloc+0x59/0x1b0
>>   devlink_alloc_ns+0xf7/0xa10
>>   nsim_drv_probe+0xa8/0x150 [netdevsim]
>>   really_probe+0x1ff/0x660
>>
>> Freed by task 456:
>>   kasan_save_stack+0x1c/0x40
>>   kasan_set_track+0x21/0x30
>>   kasan_save_free_info+0x2a/0x50
>>   __kasan_slab_free+0x102/0x190
>>   __kmem_cache_free+0xca/0x400
>>   nsim_drv_probe.cold.31+0x2af/0xf62 [netdevsim]
>>   really_probe+0x1ff/0x660
>>
>> It happened like this:
>>
>> processor A							processor B
>> nsim_drv_probe()
>>    devlink_alloc_ns()
>>    nsim_dev_port_add_all()
>>      __nsim_dev_port_add() // add eth1 successful
>> 								dev_ethtool()
>> 								  ethtool_get_drvinfo(eth1)
>> 								    netdev_to_devlink_get(eth1)
>> 								      devlink_try_get() // get devlink here
>>      __nsim_dev_port_add() // add eth2 failed, goto error
>>        devlink_free() // it's called in the error path
>> 								  devlink_compat_running_version() <- causes UAF
>>    devlink_register() // it's in normal path, not called yet
>>
>> There is two ports to add in nsim_dev_port_add_all(), if first
>> port is added successful, the devlink is visable and can be get
>> by devlink_try_get() on another cpu, but it is not registered
>> yet. And then the second port is failed to added, in the error
>> path, devlink_free() is called, at last it causes UAF.
>>
>> Add check in devlink_try_get(), if the 'devlink' is not registered
>> it returns NULL to avoid UAF like this case.
>>
>> Fixes: a62fdbbe9403 ("netdevsim: implement ndo_get_devlink_port")
>> Reported-by: Zhengchao Shao <shaozhengchao@huawei.com>
>> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
>> ---
>>   net/core/devlink.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/net/core/devlink.c b/net/core/devlink.c
>> index 89baa7c0938b..6453ac0558fb 100644
>> --- a/net/core/devlink.c
>> +++ b/net/core/devlink.c
>> @@ -250,6 +250,9 @@ void devlink_put(struct devlink *devlink)
>>   
>>   struct devlink *__must_check devlink_try_get(struct devlink *devlink)
>>   {
>> +	 if (!xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED))
>> +		return NULL;
>> +
> Please fix nsim instead of devlink.
I think if devlink is not registered, it can not be get and used, but there
is no API for driver to check this, can I introduce a new helper name
devlink_is_registered() for driver using.

Thanks,
Yang
>
> Thanks
>
>>   	if (refcount_inc_not_zero(&devlink->refcount))
>>   		return devlink;
>>   	return NULL;
>> -- 
>> 2.25.1
>>
> .
Leon Romanovsky Nov. 22, 2022, 7:04 p.m. UTC | #3
On Tue, Nov 22, 2022 at 11:25:31PM +0800, Yang Yingliang wrote:
> Hi,
> 
> On 2022/11/22 22:32, Leon Romanovsky wrote:
> > On Tue, Nov 22, 2022 at 08:10:48PM +0800, Yang Yingliang wrote:
> > > I got a UAF report as following when doing fault injection test:
> > > 
> > > BUG: KASAN: use-after-free in devlink_compat_running_version+0x5b9/0x6a0
> > > Read of size 8 at addr ffff88810ac591f0 by task systemd-udevd/458
> > > 
> > > CPU: 2 PID: 458 Comm: systemd-udevd Not tainted 6.1.0-rc5-00155-ga9d2b54dd4e7-dirty #1359
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> > > Call Trace:
> > >   <TASK>
> > >   kasan_report+0x90/0x190
> > >   devlink_compat_running_version+0x5b9/0x6a0
> > >   dev_ethtool+0x2ca/0x340
> > >   dev_ioctl+0x16c/0xff0
> > >   sock_do_ioctl+0x1ae/0x220
> > > 
> > > Allocated by task 456:
> > >   kasan_save_stack+0x1c/0x40
> > >   kasan_set_track+0x21/0x30
> > >   __kasan_kmalloc+0x7e/0x90
> > >   __kmalloc+0x59/0x1b0
> > >   devlink_alloc_ns+0xf7/0xa10
> > >   nsim_drv_probe+0xa8/0x150 [netdevsim]
> > >   really_probe+0x1ff/0x660
> > > 
> > > Freed by task 456:
> > >   kasan_save_stack+0x1c/0x40
> > >   kasan_set_track+0x21/0x30
> > >   kasan_save_free_info+0x2a/0x50
> > >   __kasan_slab_free+0x102/0x190
> > >   __kmem_cache_free+0xca/0x400
> > >   nsim_drv_probe.cold.31+0x2af/0xf62 [netdevsim]
> > >   really_probe+0x1ff/0x660
> > > 
> > > It happened like this:
> > > 
> > > processor A							processor B
> > > nsim_drv_probe()
> > >    devlink_alloc_ns()
> > >    nsim_dev_port_add_all()
> > >      __nsim_dev_port_add() // add eth1 successful
> > > 								dev_ethtool()
> > > 								  ethtool_get_drvinfo(eth1)
> > > 								    netdev_to_devlink_get(eth1)
> > > 								      devlink_try_get() // get devlink here
> > >      __nsim_dev_port_add() // add eth2 failed, goto error
> > >        devlink_free() // it's called in the error path
> > > 								  devlink_compat_running_version() <- causes UAF
> > >    devlink_register() // it's in normal path, not called yet
> > > 
> > > There is two ports to add in nsim_dev_port_add_all(), if first
> > > port is added successful, the devlink is visable and can be get
> > > by devlink_try_get() on another cpu, but it is not registered
> > > yet. And then the second port is failed to added, in the error
> > > path, devlink_free() is called, at last it causes UAF.
> > > 
> > > Add check in devlink_try_get(), if the 'devlink' is not registered
> > > it returns NULL to avoid UAF like this case.
> > > 
> > > Fixes: a62fdbbe9403 ("netdevsim: implement ndo_get_devlink_port")
> > > Reported-by: Zhengchao Shao <shaozhengchao@huawei.com>
> > > Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
> > > ---
> > >   net/core/devlink.c | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/net/core/devlink.c b/net/core/devlink.c
> > > index 89baa7c0938b..6453ac0558fb 100644
> > > --- a/net/core/devlink.c
> > > +++ b/net/core/devlink.c
> > > @@ -250,6 +250,9 @@ void devlink_put(struct devlink *devlink)
> > >   struct devlink *__must_check devlink_try_get(struct devlink *devlink)
> > >   {
> > > +	 if (!xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED))
> > > +		return NULL;
> > > +
> > Please fix nsim instead of devlink.
> I think if devlink is not registered, it can not be get and used, but there
> is no API for driver to check this, can I introduce a new helper name
> devlink_is_registered() for driver using.

There is no need in such API as driver calls to devlink_register() and
as such it knows when devlink is registered.

This UAF is nsim specific issue. Real devices have single .probe()
routine with serialized registration flow. None of them will use
devlink_is_registered() call.

Thanks

> 
> Thanks,
> Yang
> > 
> > Thanks
> > 
> > >   	if (refcount_inc_not_zero(&devlink->refcount))
> > >   		return devlink;
> > >   	return NULL;
> > > -- 
> > > 2.25.1
> > > 
> > .
Jakub Kicinski Nov. 22, 2022, 8:27 p.m. UTC | #4
On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote:
> > > Please fix nsim instead of devlink.  
> > I think if devlink is not registered, it can not be get and used, but there
> > is no API for driver to check this, can I introduce a new helper name
> > devlink_is_registered() for driver using.  
> 
> There is no need in such API as driver calls to devlink_register() and
> as such it knows when devlink is registered.
> 
> This UAF is nsim specific issue. Real devices have single .probe()
> routine with serialized registration flow. None of them will use
> devlink_is_registered() call.

Agreed, the fix is to move the register call back.
Something along the lines of the untested patch below?
Yang Yingliang would you be able to turn that into a real patch?

diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
index e14686594a71..26602d5fe0a2 100644
--- a/drivers/net/netdevsim/dev.c
+++ b/drivers/net/netdevsim/dev.c
@@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
 	err = devlink_params_register(devlink, nsim_devlink_params,
 				      ARRAY_SIZE(nsim_devlink_params));
 	if (err)
-		goto err_dl_unregister;
+		goto err_resource_unregister;
 	nsim_devlink_set_params_init_values(nsim_dev, devlink);
 
+	/* here, because params API still expect devlink to be unregistered */
+	devl_register(devlink);
+
 	err = nsim_dev_dummy_region_init(nsim_dev, devlink);
 	if (err)
-		goto err_params_unregister;
+		goto err_dl_unregister;
 
 	err = nsim_dev_traps_init(devlink);
 	if (err)
@@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
 	nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY;
 	devlink_set_features(devlink, DEVLINK_F_RELOAD);
 	devl_unlock(devlink);
-	devlink_register(devlink);
 	return 0;
 
 err_hwstats_exit:
@@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
 	nsim_dev_traps_exit(devlink);
 err_dummy_region_exit:
 	nsim_dev_dummy_region_exit(nsim_dev);
-err_params_unregister:
+err_dl_unregister:
+	devl_unregister(devlink);
 	devlink_params_unregister(devlink, nsim_devlink_params,
 				  ARRAY_SIZE(nsim_devlink_params));
-err_dl_unregister:
+err_resource_unregister:
 	devl_resources_unregister(devlink);
 err_vfc_free:
 	kfree(nsim_dev->vfconfigs);
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 074a79b8933f..e0f13100fc6b 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -1593,6 +1593,14 @@ void devlink_set_features(struct devlink *devlink, u64 features);
 void devlink_register(struct devlink *devlink);
 void devlink_unregister(struct devlink *devlink);
 void devlink_free(struct devlink *devlink);
+
+void devl_register(struct devlink *devlink)
+{
+	devl_assert_locked(devlink);
+	/* doesn't actually take the instance lock */
+	devlink_register(devlink);
+}
+
 void devlink_port_init(struct devlink *devlink,
 		       struct devlink_port *devlink_port);
 void devlink_port_fini(struct devlink_port *devlink_port);
Yang Yingliang Nov. 23, 2022, 1:50 a.m. UTC | #5
On 2022/11/23 4:27, Jakub Kicinski wrote:
> On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote:
>>>> Please fix nsim instead of devlink.
>>> I think if devlink is not registered, it can not be get and used, but there
>>> is no API for driver to check this, can I introduce a new helper name
>>> devlink_is_registered() for driver using.
>> There is no need in such API as driver calls to devlink_register() and
>> as such it knows when devlink is registered.
>>
>> This UAF is nsim specific issue. Real devices have single .probe()
>> routine with serialized registration flow. None of them will use
>> devlink_is_registered() call.
> Agreed, the fix is to move the register call back.
> Something along the lines of the untested patch below?
> Yang Yingliang would you be able to turn that into a real patch?
OK.

Thanks,
Yang
>
> diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
> index e14686594a71..26602d5fe0a2 100644
> --- a/drivers/net/netdevsim/dev.c
> +++ b/drivers/net/netdevsim/dev.c
> @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>   	err = devlink_params_register(devlink, nsim_devlink_params,
>   				      ARRAY_SIZE(nsim_devlink_params));
>   	if (err)
> -		goto err_dl_unregister;
> +		goto err_resource_unregister;
>   	nsim_devlink_set_params_init_values(nsim_dev, devlink);
>   
> +	/* here, because params API still expect devlink to be unregistered */
> +	devl_register(devlink);
> +
>   	err = nsim_dev_dummy_region_init(nsim_dev, devlink);
>   	if (err)
> -		goto err_params_unregister;
> +		goto err_dl_unregister;
>   
>   	err = nsim_dev_traps_init(devlink);
>   	if (err)
> @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>   	nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY;
>   	devlink_set_features(devlink, DEVLINK_F_RELOAD);
>   	devl_unlock(devlink);
> -	devlink_register(devlink);
>   	return 0;
>   
>   err_hwstats_exit:
> @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>   	nsim_dev_traps_exit(devlink);
>   err_dummy_region_exit:
>   	nsim_dev_dummy_region_exit(nsim_dev);
> -err_params_unregister:
> +err_dl_unregister:
> +	devl_unregister(devlink);
>   	devlink_params_unregister(devlink, nsim_devlink_params,
>   				  ARRAY_SIZE(nsim_devlink_params));
> -err_dl_unregister:
> +err_resource_unregister:
>   	devl_resources_unregister(devlink);
>   err_vfc_free:
>   	kfree(nsim_dev->vfconfigs);
> diff --git a/include/net/devlink.h b/include/net/devlink.h
> index 074a79b8933f..e0f13100fc6b 100644
> --- a/include/net/devlink.h
> +++ b/include/net/devlink.h
> @@ -1593,6 +1593,14 @@ void devlink_set_features(struct devlink *devlink, u64 features);
>   void devlink_register(struct devlink *devlink);
>   void devlink_unregister(struct devlink *devlink);
>   void devlink_free(struct devlink *devlink);
> +
> +void devl_register(struct devlink *devlink)
> +{
> +	devl_assert_locked(devlink);
> +	/* doesn't actually take the instance lock */
> +	devlink_register(devlink);
> +}
> +
>   void devlink_port_init(struct devlink *devlink,
>   		       struct devlink_port *devlink_port);
>   void devlink_port_fini(struct devlink_port *devlink_port);
> .
Yang Yingliang Nov. 23, 2022, 6:40 a.m. UTC | #6
On 2022/11/23 4:27, Jakub Kicinski wrote:
> On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote:
>>>> Please fix nsim instead of devlink.
>>> I think if devlink is not registered, it can not be get and used, but there
>>> is no API for driver to check this, can I introduce a new helper name
>>> devlink_is_registered() for driver using.
>> There is no need in such API as driver calls to devlink_register() and
>> as such it knows when devlink is registered.
>>
>> This UAF is nsim specific issue. Real devices have single .probe()
>> routine with serialized registration flow. None of them will use
>> devlink_is_registered() call.
> Agreed, the fix is to move the register call back.
> Something along the lines of the untested patch below?
> Yang Yingliang would you be able to turn that into a real patch?
>
> diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
> index e14686594a71..26602d5fe0a2 100644
> --- a/drivers/net/netdevsim/dev.c
> +++ b/drivers/net/netdevsim/dev.c
> @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>   	err = devlink_params_register(devlink, nsim_devlink_params,
>   				      ARRAY_SIZE(nsim_devlink_params));
>   	if (err)
> -		goto err_dl_unregister;
> +		goto err_resource_unregister;
>   	nsim_devlink_set_params_init_values(nsim_dev, devlink);
>   
> +	/* here, because params API still expect devlink to be unregistered */
> +	devl_register(devlink);
> +
devlink_set_features() called at last in probe() also needs devlink is 
not registered.
>   	err = nsim_dev_dummy_region_init(nsim_dev, devlink);
>   	if (err)
> -		goto err_params_unregister;
> +		goto err_dl_unregister;
>   
>   	err = nsim_dev_traps_init(devlink);
>   	if (err)
> @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>   	nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY;
>   	devlink_set_features(devlink, DEVLINK_F_RELOAD);
>   	devl_unlock(devlink);
> -	devlink_register(devlink);
>   	return 0;
>   
>   err_hwstats_exit:
> @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>   	nsim_dev_traps_exit(devlink);
>   err_dummy_region_exit:
>   	nsim_dev_dummy_region_exit(nsim_dev);
> -err_params_unregister:
> +err_dl_unregister:
> +	devl_unregister(devlink);
It races with dev_ethtool():
dev_ethtool
   devlink_try_get()
                                 nsim_drv_probe
                                 devl_lock()
     devl_lock()
                                 devlink_unregister()
                                   devlink_put()
                                   wait_for_completion() <- the refcount 
is got in dev_ethtool, it causes ABBA deadlock
>   	devlink_params_unregister(devlink, nsim_devlink_params,
>   				  ARRAY_SIZE(nsim_devlink_params));
> -err_dl_unregister:
> +err_resource_unregister:
>   	devl_resources_unregister(devlink);
>   err_vfc_free:
>   	kfree(nsim_dev->vfconfigs);
> diff --git a/include/net/devlink.h b/include/net/devlink.h
> index 074a79b8933f..e0f13100fc6b 100644
> --- a/include/net/devlink.h
> +++ b/include/net/devlink.h
> @@ -1593,6 +1593,14 @@ void devlink_set_features(struct devlink *devlink, u64 features);
>   void devlink_register(struct devlink *devlink);
>   void devlink_unregister(struct devlink *devlink);
>   void devlink_free(struct devlink *devlink);
> +
> +void devl_register(struct devlink *devlink)
> +{
> +	devl_assert_locked(devlink);
> +	/* doesn't actually take the instance lock */
> +	devlink_register(devlink);
> +}
> +
>   void devlink_port_init(struct devlink *devlink,
>   		       struct devlink_port *devlink_port);
>   void devlink_port_fini(struct devlink_port *devlink_port);

How about the following changes, it's tested ok in my case:

diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
index a7880c7ce94c..0ec4590c26a0 100644
--- a/drivers/net/netdevsim/dev.c
+++ b/drivers/net/netdevsim/dev.c
@@ -1640,6 +1640,8 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
      kfree(nsim_dev->vfconfigs);
  err_devlink_unlock:
      devl_unlock(devlink);
+    devl_put(devlink);
+    devl_wait_for_comp(devlink);
      devlink_free(devlink);
      dev_set_drvdata(&nsim_bus_dev->dev, NULL);
      return err;
diff --git a/include/net/devlink.h b/include/net/devlink.h
index ba6b8b094943..dc8ddf942c4d 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -1563,6 +1563,7 @@ static inline struct devlink *devlink_alloc(const 
struct devlink_ops *ops,
  void devlink_set_features(struct devlink *devlink, u64 features);
  void devlink_register(struct devlink *devlink);
  void devlink_unregister(struct devlink *devlink);
+void devl_wait_for_comp(struct devlink *devlink);
  void devlink_free(struct devlink *devlink);
  void devlink_port_init(struct devlink *devlink,
                 struct devlink_port *devlink_port);
@@ -1903,4 +1904,6 @@ devlink_compat_switch_id_get(struct net_device *dev,

  #endif

+void devl_put(struct devlink *devlink);
+
  #endif /* _NET_DEVLINK_H_ */
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 89baa7c0938b..67c78cf37bac 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -248,6 +248,12 @@ void devlink_put(struct devlink *devlink)
          call_rcu(&devlink->rcu, __devlink_put_rcu);
  }

+void devl_put(struct devlink *devlink)
+{
+    devlink_put(devlink);
+}
+EXPORT_SYMBOL_GPL(devl_put);
+
  struct devlink *__must_check devlink_try_get(struct devlink *devlink)
  {
      if (refcount_inc_not_zero(&devlink->refcount))
@@ -9788,6 +9794,12 @@ void devlink_unregister(struct devlink *devlink)
  }
  EXPORT_SYMBOL_GPL(devlink_unregister);

+void devl_wait_for_comp(struct devlink *devlink)
+{
+    wait_for_completion(&devlink->comp);
+}
+EXPORT_SYMBOL_GPL(devl_wait_for_comp);
+
  /**
   *    devlink_free - Free devlink instance resources
   *
> .
Leon Romanovsky Nov. 23, 2022, 7:41 a.m. UTC | #7
On Wed, Nov 23, 2022 at 02:40:24PM +0800, Yang Yingliang wrote:
> 
> On 2022/11/23 4:27, Jakub Kicinski wrote:
> > On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote:
> > > > > Please fix nsim instead of devlink.
> > > > I think if devlink is not registered, it can not be get and used, but there
> > > > is no API for driver to check this, can I introduce a new helper name
> > > > devlink_is_registered() for driver using.
> > > There is no need in such API as driver calls to devlink_register() and
> > > as such it knows when devlink is registered.
> > > 
> > > This UAF is nsim specific issue. Real devices have single .probe()
> > > routine with serialized registration flow. None of them will use
> > > devlink_is_registered() call.
> > Agreed, the fix is to move the register call back.
> > Something along the lines of the untested patch below?
> > Yang Yingliang would you be able to turn that into a real patch?
> > 
> > diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
> > index e14686594a71..26602d5fe0a2 100644
> > --- a/drivers/net/netdevsim/dev.c
> > +++ b/drivers/net/netdevsim/dev.c
> > @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
> >   	err = devlink_params_register(devlink, nsim_devlink_params,
> >   				      ARRAY_SIZE(nsim_devlink_params));
> >   	if (err)
> > -		goto err_dl_unregister;
> > +		goto err_resource_unregister;
> >   	nsim_devlink_set_params_init_values(nsim_dev, devlink);
> > +	/* here, because params API still expect devlink to be unregistered */
> > +	devl_register(devlink);
> > +
> devlink_set_features() called at last in probe() also needs devlink is not
> registered.
> >   	err = nsim_dev_dummy_region_init(nsim_dev, devlink);
> >   	if (err)
> > -		goto err_params_unregister;
> > +		goto err_dl_unregister;
> >   	err = nsim_dev_traps_init(devlink);
> >   	if (err)
> > @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
> >   	nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY;
> >   	devlink_set_features(devlink, DEVLINK_F_RELOAD);
> >   	devl_unlock(devlink);
> > -	devlink_register(devlink);
> >   	return 0;
> >   err_hwstats_exit:
> > @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
> >   	nsim_dev_traps_exit(devlink);
> >   err_dummy_region_exit:
> >   	nsim_dev_dummy_region_exit(nsim_dev);
> > -err_params_unregister:
> > +err_dl_unregister:
> > +	devl_unregister(devlink);
> It races with dev_ethtool():
> dev_ethtool
>   devlink_try_get()
>                                 nsim_drv_probe
>                                 devl_lock()
>     devl_lock()
>                                 devlink_unregister()
>                                   devlink_put()
>                                   wait_for_completion() <- the refcount is
> got in dev_ethtool, it causes ABBA deadlock

But all these races are nsim specific ones.
Can you please explain why devlink.[c|h] MUST be changed and nsim can't
be fixed?

Thanks
Yang Yingliang Nov. 23, 2022, 8:34 a.m. UTC | #8
On 2022/11/23 15:41, Leon Romanovsky wrote:
> On Wed, Nov 23, 2022 at 02:40:24PM +0800, Yang Yingliang wrote:
>> On 2022/11/23 4:27, Jakub Kicinski wrote:
>>> On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote:
>>>>>> Please fix nsim instead of devlink.
>>>>> I think if devlink is not registered, it can not be get and used, but there
>>>>> is no API for driver to check this, can I introduce a new helper name
>>>>> devlink_is_registered() for driver using.
>>>> There is no need in such API as driver calls to devlink_register() and
>>>> as such it knows when devlink is registered.
>>>>
>>>> This UAF is nsim specific issue. Real devices have single .probe()
>>>> routine with serialized registration flow. None of them will use
>>>> devlink_is_registered() call.
>>> Agreed, the fix is to move the register call back.
>>> Something along the lines of the untested patch below?
>>> Yang Yingliang would you be able to turn that into a real patch?
>>>
>>> diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
>>> index e14686594a71..26602d5fe0a2 100644
>>> --- a/drivers/net/netdevsim/dev.c
>>> +++ b/drivers/net/netdevsim/dev.c
>>> @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>>>    	err = devlink_params_register(devlink, nsim_devlink_params,
>>>    				      ARRAY_SIZE(nsim_devlink_params));
>>>    	if (err)
>>> -		goto err_dl_unregister;
>>> +		goto err_resource_unregister;
>>>    	nsim_devlink_set_params_init_values(nsim_dev, devlink);
>>> +	/* here, because params API still expect devlink to be unregistered */
>>> +	devl_register(devlink);
>>> +
>> devlink_set_features() called at last in probe() also needs devlink is not
>> registered.
>>>    	err = nsim_dev_dummy_region_init(nsim_dev, devlink);
>>>    	if (err)
>>> -		goto err_params_unregister;
>>> +		goto err_dl_unregister;
>>>    	err = nsim_dev_traps_init(devlink);
>>>    	if (err)
>>> @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>>>    	nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY;
>>>    	devlink_set_features(devlink, DEVLINK_F_RELOAD);
>>>    	devl_unlock(devlink);
>>> -	devlink_register(devlink);
>>>    	return 0;
>>>    err_hwstats_exit:
>>> @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>>>    	nsim_dev_traps_exit(devlink);
>>>    err_dummy_region_exit:
>>>    	nsim_dev_dummy_region_exit(nsim_dev);
>>> -err_params_unregister:
>>> +err_dl_unregister:
>>> +	devl_unregister(devlink);
>> It races with dev_ethtool():
>> dev_ethtool
>>    devlink_try_get()
>>                                  nsim_drv_probe
>>                                  devl_lock()
>>      devl_lock()
>>                                  devlink_unregister()
>>                                    devlink_put()
>>                                    wait_for_completion() <- the refcount is
>> got in dev_ethtool, it causes ABBA deadlock
> But all these races are nsim specific ones.
> Can you please explain why devlink.[c|h] MUST be changed and nsim can't
> be fixed?
I used the fix code proposed by Jakub, but it didn't work correctly, so
I tried to correct and improve it, and need some devlink helper.

Anyway, it is a nsim problem, if we want fix this without touch devlink,
I think we can add a 'registered' field in struct nsim_dev, and it can be
checked in nsim_get_devlink_port() like this:
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -242,6 +242,9 @@ static struct devlink_port 
*nsim_get_devlink_port(struct net_device *dev)
  {
         struct netdevsim *ns = netdev_priv(dev);

+       if (!ns->nsim_dev->devlink_registered)
+               return NULL;
+
         return &ns->nsim_dev_port->devlink_port;
  }

Thanks,
Yang
>
> Thanks
> .
Leon Romanovsky Nov. 23, 2022, 9:33 a.m. UTC | #9
On Wed, Nov 23, 2022 at 04:34:49PM +0800, Yang Yingliang wrote:
> 
> On 2022/11/23 15:41, Leon Romanovsky wrote:
> > On Wed, Nov 23, 2022 at 02:40:24PM +0800, Yang Yingliang wrote:
> > > On 2022/11/23 4:27, Jakub Kicinski wrote:
> > > > On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote:
> > > > > > > Please fix nsim instead of devlink.
> > > > > > I think if devlink is not registered, it can not be get and used, but there
> > > > > > is no API for driver to check this, can I introduce a new helper name
> > > > > > devlink_is_registered() for driver using.
> > > > > There is no need in such API as driver calls to devlink_register() and
> > > > > as such it knows when devlink is registered.
> > > > > 
> > > > > This UAF is nsim specific issue. Real devices have single .probe()
> > > > > routine with serialized registration flow. None of them will use
> > > > > devlink_is_registered() call.
> > > > Agreed, the fix is to move the register call back.
> > > > Something along the lines of the untested patch below?
> > > > Yang Yingliang would you be able to turn that into a real patch?
> > > > 
> > > > diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
> > > > index e14686594a71..26602d5fe0a2 100644
> > > > --- a/drivers/net/netdevsim/dev.c
> > > > +++ b/drivers/net/netdevsim/dev.c
> > > > @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
> > > >    	err = devlink_params_register(devlink, nsim_devlink_params,
> > > >    				      ARRAY_SIZE(nsim_devlink_params));
> > > >    	if (err)
> > > > -		goto err_dl_unregister;
> > > > +		goto err_resource_unregister;
> > > >    	nsim_devlink_set_params_init_values(nsim_dev, devlink);
> > > > +	/* here, because params API still expect devlink to be unregistered */
> > > > +	devl_register(devlink);
> > > > +
> > > devlink_set_features() called at last in probe() also needs devlink is not
> > > registered.
> > > >    	err = nsim_dev_dummy_region_init(nsim_dev, devlink);
> > > >    	if (err)
> > > > -		goto err_params_unregister;
> > > > +		goto err_dl_unregister;
> > > >    	err = nsim_dev_traps_init(devlink);
> > > >    	if (err)
> > > > @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
> > > >    	nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY;
> > > >    	devlink_set_features(devlink, DEVLINK_F_RELOAD);
> > > >    	devl_unlock(devlink);
> > > > -	devlink_register(devlink);
> > > >    	return 0;
> > > >    err_hwstats_exit:
> > > > @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
> > > >    	nsim_dev_traps_exit(devlink);
> > > >    err_dummy_region_exit:
> > > >    	nsim_dev_dummy_region_exit(nsim_dev);
> > > > -err_params_unregister:
> > > > +err_dl_unregister:
> > > > +	devl_unregister(devlink);
> > > It races with dev_ethtool():
> > > dev_ethtool
> > >    devlink_try_get()
> > >                                  nsim_drv_probe
> > >                                  devl_lock()
> > >      devl_lock()
> > >                                  devlink_unregister()
> > >                                    devlink_put()
> > >                                    wait_for_completion() <- the refcount is
> > > got in dev_ethtool, it causes ABBA deadlock
> > But all these races are nsim specific ones.
> > Can you please explain why devlink.[c|h] MUST be changed and nsim can't
> > be fixed?
> I used the fix code proposed by Jakub, but it didn't work correctly, so
> I tried to correct and improve it, and need some devlink helper.
> 
> Anyway, it is a nsim problem, if we want fix this without touch devlink,
> I think we can add a 'registered' field in struct nsim_dev, and it can be
> checked in nsim_get_devlink_port() like this:
> --- a/drivers/net/netdevsim/netdev.c
> +++ b/drivers/net/netdevsim/netdev.c
> @@ -242,6 +242,9 @@ static struct devlink_port *nsim_get_devlink_port(struct
> net_device *dev)
>  {
>         struct netdevsim *ns = netdev_priv(dev);
> 
> +       if (!ns->nsim_dev->devlink_registered)
> +               return NULL;

Something like that, but you need to make sure that this check is
protected by some sort of the locking.

> +
>         return &ns->nsim_dev_port->devlink_port;
>  }
> 
> Thanks,
> Yang
> > 
> > Thanks
> > .
Ido Schimmel Nov. 23, 2022, 7:18 p.m. UTC | #10
On Wed, Nov 23, 2022 at 04:34:49PM +0800, Yang Yingliang wrote:
> 
> On 2022/11/23 15:41, Leon Romanovsky wrote:
> > On Wed, Nov 23, 2022 at 02:40:24PM +0800, Yang Yingliang wrote:
> > > On 2022/11/23 4:27, Jakub Kicinski wrote:
> > > > On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote:
> > > > > > > Please fix nsim instead of devlink.
> > > > > > I think if devlink is not registered, it can not be get and used, but there
> > > > > > is no API for driver to check this, can I introduce a new helper name
> > > > > > devlink_is_registered() for driver using.
> > > > > There is no need in such API as driver calls to devlink_register() and
> > > > > as such it knows when devlink is registered.
> > > > > 
> > > > > This UAF is nsim specific issue. Real devices have single .probe()
> > > > > routine with serialized registration flow. None of them will use
> > > > > devlink_is_registered() call.
> > > > Agreed, the fix is to move the register call back.
> > > > Something along the lines of the untested patch below?
> > > > Yang Yingliang would you be able to turn that into a real patch?
> > > > 
> > > > diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
> > > > index e14686594a71..26602d5fe0a2 100644
> > > > --- a/drivers/net/netdevsim/dev.c
> > > > +++ b/drivers/net/netdevsim/dev.c
> > > > @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
> > > >    	err = devlink_params_register(devlink, nsim_devlink_params,
> > > >    				      ARRAY_SIZE(nsim_devlink_params));
> > > >    	if (err)
> > > > -		goto err_dl_unregister;
> > > > +		goto err_resource_unregister;
> > > >    	nsim_devlink_set_params_init_values(nsim_dev, devlink);
> > > > +	/* here, because params API still expect devlink to be unregistered */
> > > > +	devl_register(devlink);
> > > > +
> > > devlink_set_features() called at last in probe() also needs devlink is not
> > > registered.
> > > >    	err = nsim_dev_dummy_region_init(nsim_dev, devlink);
> > > >    	if (err)
> > > > -		goto err_params_unregister;
> > > > +		goto err_dl_unregister;
> > > >    	err = nsim_dev_traps_init(devlink);
> > > >    	if (err)
> > > > @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
> > > >    	nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY;
> > > >    	devlink_set_features(devlink, DEVLINK_F_RELOAD);
> > > >    	devl_unlock(devlink);
> > > > -	devlink_register(devlink);
> > > >    	return 0;
> > > >    err_hwstats_exit:
> > > > @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
> > > >    	nsim_dev_traps_exit(devlink);
> > > >    err_dummy_region_exit:
> > > >    	nsim_dev_dummy_region_exit(nsim_dev);
> > > > -err_params_unregister:
> > > > +err_dl_unregister:
> > > > +	devl_unregister(devlink);
> > > It races with dev_ethtool():
> > > dev_ethtool
> > >    devlink_try_get()
> > >                                  nsim_drv_probe
> > >                                  devl_lock()
> > >      devl_lock()
> > >                                  devlink_unregister()
> > >                                    devlink_put()
> > >                                    wait_for_completion() <- the refcount is
> > > got in dev_ethtool, it causes ABBA deadlock
> > But all these races are nsim specific ones.
> > Can you please explain why devlink.[c|h] MUST be changed and nsim can't
> > be fixed?
> I used the fix code proposed by Jakub, but it didn't work correctly, so
> I tried to correct and improve it, and need some devlink helper.
> 
> Anyway, it is a nsim problem, if we want fix this without touch devlink,
> I think we can add a 'registered' field in struct nsim_dev, and it can be
> checked in nsim_get_devlink_port() like this:

I read the discussion and it's not clear to me why this is a netdevsim
specific problem. The fundamental problem seems to be that it is
possible to hold a reference on a devlink instance before it's
registered and that devlink_free() will free the instance regardless of
its current reference count because it expects devlink_unregister() to
block. In this case, the instance was never registered, so
devlink_unregister() is not called.

ethtool was able to get a reference on the devlink instance before it
was registered because netdevsim registers its netdevs before
registering its devlink instance. However, netdevsim is not the only one
doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially
others do the same thing.

When you think about it, it's strange that it's even possible for
ethtool to reach the driver when the netdev used in the request is long
gone, but it's not holding a reference on the netdev (it's holding a
reference on the devlink instance instead) and
devlink_compat_running_version() is called without RTNL.
Jakub Kicinski Nov. 24, 2022, 2:18 a.m. UTC | #11
On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote:
> > I used the fix code proposed by Jakub, but it didn't work correctly, so
> > I tried to correct and improve it, and need some devlink helper.
> > 
> > Anyway, it is a nsim problem, if we want fix this without touch devlink,
> > I think we can add a 'registered' field in struct nsim_dev, and it can be
> > checked in nsim_get_devlink_port() like this:  
> 
> I read the discussion and it's not clear to me why this is a netdevsim
> specific problem. The fundamental problem seems to be that it is
> possible to hold a reference on a devlink instance before it's
> registered and that devlink_free() will free the instance regardless of
> its current reference count because it expects devlink_unregister() to
> block. In this case, the instance was never registered, so
> devlink_unregister() is not called.
> 
> ethtool was able to get a reference on the devlink instance before it
> was registered because netdevsim registers its netdevs before
> registering its devlink instance. However, netdevsim is not the only one
> doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially
> others do the same thing.
> 
> When you think about it, it's strange that it's even possible for
> ethtool to reach the driver when the netdev used in the request is long
> gone, but it's not holding a reference on the netdev (it's holding a
> reference on the devlink instance instead) and
> devlink_compat_running_version() is called without RTNL.

Indeed. We did a bit of a flip-flop with the devlink locking rules
and the fact that the instance is reachable before it is registered 
is a leftover from a previous restructuring :(

Hence my preference to get rid of the ordering at the driver level 
than to try to patch it up in the code. Dunno if that's convincing.
Jakub Kicinski Nov. 24, 2022, 2:20 a.m. UTC | #12
On Wed, 23 Nov 2022 14:40:24 +0800 Yang Yingliang wrote:
> >   	if (err)
> > -		goto err_dl_unregister;
> > +		goto err_resource_unregister;
> >   	nsim_devlink_set_params_init_values(nsim_dev, devlink);
> >   
> > +	/* here, because params API still expect devlink to be unregistered */
> > +	devl_register(devlink);
> > +  
> devlink_set_features() called at last in probe() also needs devlink is 
> not registered.

You can move the devlink_set_features() up. It's also a leftover,
it was preventing reload from happening before probe has finished.
Now the instance is locked until probe is done so there is no race.
Jakub Kicinski Nov. 24, 2022, 2:47 a.m. UTC | #13
On Wed, 23 Nov 2022 14:40:24 +0800 Yang Yingliang wrote:
> > +err_dl_unregister:
> > +	devl_unregister(devlink);  
> It races with dev_ethtool():
> dev_ethtool
>    devlink_try_get()
>                                  nsim_drv_probe
>                                  devl_lock()
>      devl_lock()
>                                  devlink_unregister()
>                                    devlink_put()
>                                    wait_for_completion() <- the refcount 
> is got in dev_ethtool, it causes ABBA deadlock

Yeah.. so my original design for the locking had a "devlink_is_alive()"
check for this exact reason:

https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/

and the devlink structure was properly refcounted (devlink_put() calls
devlink_free() when the last reference is released).

Pure references then need to check if the instance is still alive
after locking it. Which is fine, it should only happen in core code.

I think we should go back to that idea.

The waiting for references is a nightmare in the netdev code.
Leon Romanovsky Nov. 24, 2022, 5:56 a.m. UTC | #14
On Thu, Nov 24, 2022, at 04:18, Jakub Kicinski wrote:
> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote:
>> > I used the fix code proposed by Jakub, but it didn't work correctly, so
>> > I tried to correct and improve it, and need some devlink helper.
>> > 
>> > Anyway, it is a nsim problem, if we want fix this without touch devlink,
>> > I think we can add a 'registered' field in struct nsim_dev, and it can be
>> > checked in nsim_get_devlink_port() like this:  
>> 
>> I read the discussion and it's not clear to me why this is a netdevsim
>> specific problem. The fundamental problem seems to be that it is
>> possible to hold a reference on a devlink instance before it's
>> registered and that devlink_free() will free the instance regardless of
>> its current reference count because it expects devlink_unregister() to
>> block. In this case, the instance was never registered, so
>> devlink_unregister() is not called.
>> 
>> ethtool was able to get a reference on the devlink instance before it
>> was registered because netdevsim registers its netdevs before
>> registering its devlink instance. However, netdevsim is not the only one
>> doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially
>> others do the same thing.
>> 
>> When you think about it, it's strange that it's even possible for
>> ethtool to reach the driver when the netdev used in the request is long
>> gone, but it's not holding a reference on the netdev (it's holding a
>> reference on the devlink instance instead) and
>> devlink_compat_running_version() is called without RTNL.
>
> Indeed. We did a bit of a flip-flop with the devlink locking rules
> and the fact that the instance is reachable before it is registered 
> is a leftover from a previous restructuring :(
>
> Hence my preference to get rid of the ordering at the driver level 
> than to try to patch it up in the code. Dunno if that's convincing.

Before you are doing that, let's remind why we do ordering:
1. It was in happy times when some commands were locked and some not. It caused to very exciting things like controlled uaf, e.t.c.
2. We wanted to provide an access from user space after everything devlink related is initialized. 
3. Allow sane debuggability as some drivers called to devlink_register and unregister in random places.

The locking (item 1) was fixed.

Sent from mobile.
Yang Yingliang Nov. 24, 2022, 7:28 a.m. UTC | #15
On 2022/11/24 10:47, Jakub Kicinski wrote:
> On Wed, 23 Nov 2022 14:40:24 +0800 Yang Yingliang wrote:
>>> +err_dl_unregister:
>>> +	devl_unregister(devlink);
>> It races with dev_ethtool():
>> dev_ethtool
>>     devlink_try_get()
>>                                   nsim_drv_probe
>>                                   devl_lock()
>>       devl_lock()
>>                                   devlink_unregister()
>>                                     devlink_put()
>>                                     wait_for_completion() <- the refcount
>> is got in dev_ethtool, it causes ABBA deadlock
> Yeah.. so my original design for the locking had a "devlink_is_alive()"
> check for this exact reason:
>
> https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/
>
> and the devlink structure was properly refcounted (devlink_put() calls
> devlink_free() when the last reference is released).
>
> Pure references then need to check if the instance is still alive
> after locking it. Which is fine, it should only happen in core code.
>
> I think we should go back to that idea.
But Leon disagree to change devlink code.

I think this problem occurs in the drivers that have multiple ports(netdev):

In some drivers (e.g. mlx5) , one net device uses one devlink 
instance(see mlx5e_probe()),
the instance can not be get until the device is register, in this case, 
it won't cause UAF.

But in some other drivers(e.g. netdevsim, funeth) multiple ports(net 
devices) use one
devlink instance. If first one is register successful, the instance is 
visible and can be get
through netdev, meanwhile, the second port register failed and goto free 
the devlink
that used by first port(netdevice). So can we fix this in every single 
driver.

Thanks,
Yang
>
> The waiting for references is a nightmare in the netdev code.
>
> .
Ido Schimmel Nov. 28, 2022, 9:20 a.m. UTC | #16
On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote:
> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote:
> > > I used the fix code proposed by Jakub, but it didn't work correctly, so
> > > I tried to correct and improve it, and need some devlink helper.
> > > 
> > > Anyway, it is a nsim problem, if we want fix this without touch devlink,
> > > I think we can add a 'registered' field in struct nsim_dev, and it can be
> > > checked in nsim_get_devlink_port() like this:  
> > 
> > I read the discussion and it's not clear to me why this is a netdevsim
> > specific problem. The fundamental problem seems to be that it is
> > possible to hold a reference on a devlink instance before it's
> > registered and that devlink_free() will free the instance regardless of
> > its current reference count because it expects devlink_unregister() to
> > block. In this case, the instance was never registered, so
> > devlink_unregister() is not called.
> > 
> > ethtool was able to get a reference on the devlink instance before it
> > was registered because netdevsim registers its netdevs before
> > registering its devlink instance. However, netdevsim is not the only one
> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially
> > others do the same thing.
> > 
> > When you think about it, it's strange that it's even possible for
> > ethtool to reach the driver when the netdev used in the request is long
> > gone, but it's not holding a reference on the netdev (it's holding a
> > reference on the devlink instance instead) and
> > devlink_compat_running_version() is called without RTNL.
> 
> Indeed. We did a bit of a flip-flop with the devlink locking rules
> and the fact that the instance is reachable before it is registered 
> is a leftover from a previous restructuring :(
> 
> Hence my preference to get rid of the ordering at the driver level 
> than to try to patch it up in the code. Dunno if that's convincing.

I don't have a good solution, but changing all the drivers to register
their netdevs after the devlink instance is going to be quite painful
and too big for 'net'. I feel like the main motivation for this is the
ethtool compat stuff, which is not very convincing IMO. I'm quite happy
with the current flow where drivers call devlink_register() at the end
of their probe.

Regarding a solution for the current crash, assuming we agree it's not a
netdevsim specific problem, I think the current fix [1] is OK. Note that
while it fixes the crash, it potentially creates other (less severe)
problems. After user space receives RTM_NEWLINK notification it will
need to wait for a certain period of time before issuing
'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I
guess it's not a big deal for drivers that only register one netdev
since they will very quickly follow with devlink_register(), but the
race window is larger for drivers that need to register many netdevs,
for either physical switch or eswitch ports.

Long term, we either need to find a way to make the ethtool compat stuff
work correctly or just get rid of it and have affected drivers implement
the relevant ethtool operations instead of relying on devlink.

[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/
Jiri Pirko Nov. 28, 2022, 9:58 a.m. UTC | #17
Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote:
>On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote:
>> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote:
>> > > I used the fix code proposed by Jakub, but it didn't work correctly, so
>> > > I tried to correct and improve it, and need some devlink helper.
>> > > 
>> > > Anyway, it is a nsim problem, if we want fix this without touch devlink,
>> > > I think we can add a 'registered' field in struct nsim_dev, and it can be
>> > > checked in nsim_get_devlink_port() like this:  
>> > 
>> > I read the discussion and it's not clear to me why this is a netdevsim
>> > specific problem. The fundamental problem seems to be that it is
>> > possible to hold a reference on a devlink instance before it's
>> > registered and that devlink_free() will free the instance regardless of
>> > its current reference count because it expects devlink_unregister() to
>> > block. In this case, the instance was never registered, so
>> > devlink_unregister() is not called.
>> > 
>> > ethtool was able to get a reference on the devlink instance before it
>> > was registered because netdevsim registers its netdevs before
>> > registering its devlink instance. However, netdevsim is not the only one
>> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially
>> > others do the same thing.
>> > 
>> > When you think about it, it's strange that it's even possible for
>> > ethtool to reach the driver when the netdev used in the request is long
>> > gone, but it's not holding a reference on the netdev (it's holding a
>> > reference on the devlink instance instead) and
>> > devlink_compat_running_version() is called without RTNL.
>> 
>> Indeed. We did a bit of a flip-flop with the devlink locking rules
>> and the fact that the instance is reachable before it is registered 
>> is a leftover from a previous restructuring :(
>> 
>> Hence my preference to get rid of the ordering at the driver level 
>> than to try to patch it up in the code. Dunno if that's convincing.
>
>I don't have a good solution, but changing all the drivers to register
>their netdevs after the devlink instance is going to be quite painful
>and too big for 'net'. I feel like the main motivation for this is the
>ethtool compat stuff, which is not very convincing IMO. I'm quite happy
>with the current flow where drivers call devlink_register() at the end
>of their probe.
>
>Regarding a solution for the current crash, assuming we agree it's not a
>netdevsim specific problem, I think the current fix [1] is OK. Note that
>while it fixes the crash, it potentially creates other (less severe)
>problems. After user space receives RTM_NEWLINK notification it will
>need to wait for a certain period of time before issuing
>'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I
>guess it's not a big deal for drivers that only register one netdev
>since they will very quickly follow with devlink_register(), but the
>race window is larger for drivers that need to register many netdevs,
>for either physical switch or eswitch ports.
>
>Long term, we either need to find a way to make the ethtool compat stuff
>work correctly or just get rid of it and have affected drivers implement
>the relevant ethtool operations instead of relying on devlink.
>
>[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/

I just had a call with Ido. We both think that this might be a good
solution for -net to avoid the use after free.

For net-next, we eventually should change driver init flows to register
devlink instance first and only after that register devlink_port and
related netdevice. The ordering is important for the userspace app. For
example the init flow:
<- RTnetlink new netdev event
app sees devlink_port handle in IFLA_DEVLINK_PORT
-> query devlink instance using this handle
<- ENODEV

The instance is not registered yet.

So we need to make sure all devlink_port_register() calls are happening
after devlink_register(). This is aligned with the original flow before
devlink_register() was moved by Leon. Also it is aligned with devlink
reload and devlink port split flows.
Jiri Pirko Nov. 28, 2022, 10:01 a.m. UTC | #18
Tue, Nov 22, 2022 at 01:10:48PM CET, yangyingliang@huawei.com wrote:

[...]


>
>Fixes: a62fdbbe9403 ("netdevsim: implement ndo_get_devlink_port")
>Reported-by: Zhengchao Shao <shaozhengchao@huawei.com>
>Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
>---
> net/core/devlink.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/net/core/devlink.c b/net/core/devlink.c
>index 89baa7c0938b..6453ac0558fb 100644
>--- a/net/core/devlink.c
>+++ b/net/core/devlink.c
>@@ -250,6 +250,9 @@ void devlink_put(struct devlink *devlink)
> 
> struct devlink *__must_check devlink_try_get(struct devlink *devlink)
> {
>+	 if (!xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED))

        ^
You have an extra space here. Checkpatch would warn you.


>+		return NULL;
>+
> 	if (refcount_inc_not_zero(&devlink->refcount))
> 		return devlink;
> 	return NULL;
>-- 
>2.25.1
>
Leon Romanovsky Nov. 28, 2022, 11:50 a.m. UTC | #19
On Mon, Nov 28, 2022 at 10:58:58AM +0100, Jiri Pirko wrote:
> Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote:
> >On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote:
> >> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote:
> >> > > I used the fix code proposed by Jakub, but it didn't work correctly, so
> >> > > I tried to correct and improve it, and need some devlink helper.
> >> > > 
> >> > > Anyway, it is a nsim problem, if we want fix this without touch devlink,
> >> > > I think we can add a 'registered' field in struct nsim_dev, and it can be
> >> > > checked in nsim_get_devlink_port() like this:  
> >> > 
> >> > I read the discussion and it's not clear to me why this is a netdevsim
> >> > specific problem. The fundamental problem seems to be that it is
> >> > possible to hold a reference on a devlink instance before it's
> >> > registered and that devlink_free() will free the instance regardless of
> >> > its current reference count because it expects devlink_unregister() to
> >> > block. In this case, the instance was never registered, so
> >> > devlink_unregister() is not called.
> >> > 
> >> > ethtool was able to get a reference on the devlink instance before it
> >> > was registered because netdevsim registers its netdevs before
> >> > registering its devlink instance. However, netdevsim is not the only one
> >> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially
> >> > others do the same thing.
> >> > 
> >> > When you think about it, it's strange that it's even possible for
> >> > ethtool to reach the driver when the netdev used in the request is long
> >> > gone, but it's not holding a reference on the netdev (it's holding a
> >> > reference on the devlink instance instead) and
> >> > devlink_compat_running_version() is called without RTNL.
> >> 
> >> Indeed. We did a bit of a flip-flop with the devlink locking rules
> >> and the fact that the instance is reachable before it is registered 
> >> is a leftover from a previous restructuring :(
> >> 
> >> Hence my preference to get rid of the ordering at the driver level 
> >> than to try to patch it up in the code. Dunno if that's convincing.
> >
> >I don't have a good solution, but changing all the drivers to register
> >their netdevs after the devlink instance is going to be quite painful
> >and too big for 'net'. I feel like the main motivation for this is the
> >ethtool compat stuff, which is not very convincing IMO. I'm quite happy
> >with the current flow where drivers call devlink_register() at the end
> >of their probe.
> >
> >Regarding a solution for the current crash, assuming we agree it's not a
> >netdevsim specific problem, I think the current fix [1] is OK. Note that
> >while it fixes the crash, it potentially creates other (less severe)
> >problems. After user space receives RTM_NEWLINK notification it will
> >need to wait for a certain period of time before issuing
> >'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I
> >guess it's not a big deal for drivers that only register one netdev
> >since they will very quickly follow with devlink_register(), but the
> >race window is larger for drivers that need to register many netdevs,
> >for either physical switch or eswitch ports.
> >
> >Long term, we either need to find a way to make the ethtool compat stuff
> >work correctly or just get rid of it and have affected drivers implement
> >the relevant ethtool operations instead of relying on devlink.
> >
> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/
> 
> I just had a call with Ido. We both think that this might be a good
> solution for -net to avoid the use after free.
> 
> For net-next, we eventually should change driver init flows to register
> devlink instance first and only after that register devlink_port and
> related netdevice. The ordering is important for the userspace app. For
> example the init flow:
> <- RTnetlink new netdev event
> app sees devlink_port handle in IFLA_DEVLINK_PORT
> -> query devlink instance using this handle
> <- ENODEV
> 
> The instance is not registered yet.

This is supposed to be handled by devlink_notify_register() which sends
"delayed" notifications after devlink_register() is called.

Unless something is broken, the scenario above shouldn't happen.

> 
> So we need to make sure all devlink_port_register() calls are happening
> after devlink_register(). This is aligned with the original flow before
> devlink_register() was moved by Leon. Also it is aligned with devlink
> reload and devlink port split flows.
> 

I don't know what it means.

Thanks
Jiri Pirko Nov. 28, 2022, 1:52 p.m. UTC | #20
Mon, Nov 28, 2022 at 12:50:15PM CET, leon@kernel.org wrote:
>On Mon, Nov 28, 2022 at 10:58:58AM +0100, Jiri Pirko wrote:
>> Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote:
>> >On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote:
>> >> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote:
>> >> > > I used the fix code proposed by Jakub, but it didn't work correctly, so
>> >> > > I tried to correct and improve it, and need some devlink helper.
>> >> > > 
>> >> > > Anyway, it is a nsim problem, if we want fix this without touch devlink,
>> >> > > I think we can add a 'registered' field in struct nsim_dev, and it can be
>> >> > > checked in nsim_get_devlink_port() like this:  
>> >> > 
>> >> > I read the discussion and it's not clear to me why this is a netdevsim
>> >> > specific problem. The fundamental problem seems to be that it is
>> >> > possible to hold a reference on a devlink instance before it's
>> >> > registered and that devlink_free() will free the instance regardless of
>> >> > its current reference count because it expects devlink_unregister() to
>> >> > block. In this case, the instance was never registered, so
>> >> > devlink_unregister() is not called.
>> >> > 
>> >> > ethtool was able to get a reference on the devlink instance before it
>> >> > was registered because netdevsim registers its netdevs before
>> >> > registering its devlink instance. However, netdevsim is not the only one
>> >> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially
>> >> > others do the same thing.
>> >> > 
>> >> > When you think about it, it's strange that it's even possible for
>> >> > ethtool to reach the driver when the netdev used in the request is long
>> >> > gone, but it's not holding a reference on the netdev (it's holding a
>> >> > reference on the devlink instance instead) and
>> >> > devlink_compat_running_version() is called without RTNL.
>> >> 
>> >> Indeed. We did a bit of a flip-flop with the devlink locking rules
>> >> and the fact that the instance is reachable before it is registered 
>> >> is a leftover from a previous restructuring :(
>> >> 
>> >> Hence my preference to get rid of the ordering at the driver level 
>> >> than to try to patch it up in the code. Dunno if that's convincing.
>> >
>> >I don't have a good solution, but changing all the drivers to register
>> >their netdevs after the devlink instance is going to be quite painful
>> >and too big for 'net'. I feel like the main motivation for this is the
>> >ethtool compat stuff, which is not very convincing IMO. I'm quite happy
>> >with the current flow where drivers call devlink_register() at the end
>> >of their probe.
>> >
>> >Regarding a solution for the current crash, assuming we agree it's not a
>> >netdevsim specific problem, I think the current fix [1] is OK. Note that
>> >while it fixes the crash, it potentially creates other (less severe)
>> >problems. After user space receives RTM_NEWLINK notification it will
>> >need to wait for a certain period of time before issuing
>> >'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I
>> >guess it's not a big deal for drivers that only register one netdev
>> >since they will very quickly follow with devlink_register(), but the
>> >race window is larger for drivers that need to register many netdevs,
>> >for either physical switch or eswitch ports.
>> >
>> >Long term, we either need to find a way to make the ethtool compat stuff
>> >work correctly or just get rid of it and have affected drivers implement
>> >the relevant ethtool operations instead of relying on devlink.
>> >
>> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/
>> 
>> I just had a call with Ido. We both think that this might be a good
>> solution for -net to avoid the use after free.
>> 
>> For net-next, we eventually should change driver init flows to register
>> devlink instance first and only after that register devlink_port and
>> related netdevice. The ordering is important for the userspace app. For
>> example the init flow:
>> <- RTnetlink new netdev event
>> app sees devlink_port handle in IFLA_DEVLINK_PORT
>> -> query devlink instance using this handle
>> <- ENODEV
>> 
>> The instance is not registered yet.
>
>This is supposed to be handled by devlink_notify_register() which sends
>"delayed" notifications after devlink_register() is called.
>
>Unless something is broken, the scenario above shouldn't happen.

Nope, RTnetlink message for new netdev is not handled by that. It is
sent right away.


>
>> 
>> So we need to make sure all devlink_port_register() calls are happening
>> after devlink_register(). This is aligned with the original flow before
>> devlink_register() was moved by Leon. Also it is aligned with devlink
>> reload and devlink port split flows.
>> 
>
>I don't know what it means.

What I mean is that during port split, devlink instance is registered.
During port creation and removal during reload, devlink instance is
registered. We should maintain the same ordering during init/fini I
believe.


>
>Thanks
Jakub Kicinski Nov. 28, 2022, 6:20 p.m. UTC | #21
On Mon, 28 Nov 2022 10:58:58 +0100 Jiri Pirko wrote:
> >Long term, we either need to find a way to make the ethtool compat stuff
> >work correctly or just get rid of it and have affected drivers implement
> >the relevant ethtool operations instead of relying on devlink.
> >
> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/  
> 
> I just had a call with Ido. We both think that this might be a good
> solution for -net to avoid the use after free.
> 
> For net-next, we eventually should change driver init flows to register
> devlink instance first and only after that register devlink_port and
> related netdevice. The ordering is important for the userspace app. For
> example the init flow:
> <- RTnetlink new netdev event
> app sees devlink_port handle in IFLA_DEVLINK_PORT
> -> query devlink instance using this handle  
> <- ENODEV
> 
> The instance is not registered yet.
> 
> So we need to make sure all devlink_port_register() calls are happening
> after devlink_register(). This is aligned with the original flow before
> devlink_register() was moved by Leon. Also it is aligned with devlink
> reload and devlink port split flows.

Cool. Do you also agree with doing proper refcounting for the devlink
instance struct and the liveness check after locking the instance?
Jiri Pirko Nov. 29, 2022, 8:31 a.m. UTC | #22
Mon, Nov 28, 2022 at 07:20:43PM CET, kuba@kernel.org wrote:
>On Mon, 28 Nov 2022 10:58:58 +0100 Jiri Pirko wrote:
>> >Long term, we either need to find a way to make the ethtool compat stuff
>> >work correctly or just get rid of it and have affected drivers implement
>> >the relevant ethtool operations instead of relying on devlink.
>> >
>> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/  
>> 
>> I just had a call with Ido. We both think that this might be a good
>> solution for -net to avoid the use after free.
>> 
>> For net-next, we eventually should change driver init flows to register
>> devlink instance first and only after that register devlink_port and
>> related netdevice. The ordering is important for the userspace app. For
>> example the init flow:
>> <- RTnetlink new netdev event
>> app sees devlink_port handle in IFLA_DEVLINK_PORT
>> -> query devlink instance using this handle  
>> <- ENODEV
>> 
>> The instance is not registered yet.
>> 
>> So we need to make sure all devlink_port_register() calls are happening
>> after devlink_register(). This is aligned with the original flow before
>> devlink_register() was moved by Leon. Also it is aligned with devlink
>> reload and devlink port split flows.
>
>Cool. Do you also agree with doing proper refcounting for the devlink
>instance struct and the liveness check after locking the instance?

Could you elaborate a bit more? I missed that in the thread and can't
find it. Why do we need it?
Leon Romanovsky Nov. 29, 2022, 8:44 a.m. UTC | #23
On Mon, Nov 28, 2022 at 02:52:00PM +0100, Jiri Pirko wrote:
> Mon, Nov 28, 2022 at 12:50:15PM CET, leon@kernel.org wrote:
> >On Mon, Nov 28, 2022 at 10:58:58AM +0100, Jiri Pirko wrote:
> >> Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote:
> >> >On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote:
> >> >> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote:
> >> >> > > I used the fix code proposed by Jakub, but it didn't work correctly, so
> >> >> > > I tried to correct and improve it, and need some devlink helper.
> >> >> > > 
> >> >> > > Anyway, it is a nsim problem, if we want fix this without touch devlink,
> >> >> > > I think we can add a 'registered' field in struct nsim_dev, and it can be
> >> >> > > checked in nsim_get_devlink_port() like this:  
> >> >> > 
> >> >> > I read the discussion and it's not clear to me why this is a netdevsim
> >> >> > specific problem. The fundamental problem seems to be that it is
> >> >> > possible to hold a reference on a devlink instance before it's
> >> >> > registered and that devlink_free() will free the instance regardless of
> >> >> > its current reference count because it expects devlink_unregister() to
> >> >> > block. In this case, the instance was never registered, so
> >> >> > devlink_unregister() is not called.
> >> >> > 
> >> >> > ethtool was able to get a reference on the devlink instance before it
> >> >> > was registered because netdevsim registers its netdevs before
> >> >> > registering its devlink instance. However, netdevsim is not the only one
> >> >> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially
> >> >> > others do the same thing.
> >> >> > 
> >> >> > When you think about it, it's strange that it's even possible for
> >> >> > ethtool to reach the driver when the netdev used in the request is long
> >> >> > gone, but it's not holding a reference on the netdev (it's holding a
> >> >> > reference on the devlink instance instead) and
> >> >> > devlink_compat_running_version() is called without RTNL.
> >> >> 
> >> >> Indeed. We did a bit of a flip-flop with the devlink locking rules
> >> >> and the fact that the instance is reachable before it is registered 
> >> >> is a leftover from a previous restructuring :(
> >> >> 
> >> >> Hence my preference to get rid of the ordering at the driver level 
> >> >> than to try to patch it up in the code. Dunno if that's convincing.
> >> >
> >> >I don't have a good solution, but changing all the drivers to register
> >> >their netdevs after the devlink instance is going to be quite painful
> >> >and too big for 'net'. I feel like the main motivation for this is the
> >> >ethtool compat stuff, which is not very convincing IMO. I'm quite happy
> >> >with the current flow where drivers call devlink_register() at the end
> >> >of their probe.
> >> >
> >> >Regarding a solution for the current crash, assuming we agree it's not a
> >> >netdevsim specific problem, I think the current fix [1] is OK. Note that
> >> >while it fixes the crash, it potentially creates other (less severe)
> >> >problems. After user space receives RTM_NEWLINK notification it will
> >> >need to wait for a certain period of time before issuing
> >> >'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I
> >> >guess it's not a big deal for drivers that only register one netdev
> >> >since they will very quickly follow with devlink_register(), but the
> >> >race window is larger for drivers that need to register many netdevs,
> >> >for either physical switch or eswitch ports.
> >> >
> >> >Long term, we either need to find a way to make the ethtool compat stuff
> >> >work correctly or just get rid of it and have affected drivers implement
> >> >the relevant ethtool operations instead of relying on devlink.
> >> >
> >> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/
> >> 
> >> I just had a call with Ido. We both think that this might be a good
> >> solution for -net to avoid the use after free.
> >> 
> >> For net-next, we eventually should change driver init flows to register
> >> devlink instance first and only after that register devlink_port and
> >> related netdevice. The ordering is important for the userspace app. For
> >> example the init flow:
> >> <- RTnetlink new netdev event
> >> app sees devlink_port handle in IFLA_DEVLINK_PORT
> >> -> query devlink instance using this handle
> >> <- ENODEV
> >> 
> >> The instance is not registered yet.
> >
> >This is supposed to be handled by devlink_notify_register() which sends
> >"delayed" notifications after devlink_register() is called.
> >
> >Unless something is broken, the scenario above shouldn't happen.
> 
> Nope, RTnetlink message for new netdev is not handled by that. It is
> sent right away.

And why don't you fix your new commit dca56c3038c3 ("net: expose devlink port over rtnetlink")
to do not return devlink instance unless it is registered?

Why is it correct to expose devlink port with not ready to use devlink
instance?

Thanks
Jiri Pirko Nov. 29, 2022, 9:05 a.m. UTC | #24
Tue, Nov 29, 2022 at 09:44:48AM CET, leon@kernel.org wrote:
>On Mon, Nov 28, 2022 at 02:52:00PM +0100, Jiri Pirko wrote:
>> Mon, Nov 28, 2022 at 12:50:15PM CET, leon@kernel.org wrote:
>> >On Mon, Nov 28, 2022 at 10:58:58AM +0100, Jiri Pirko wrote:
>> >> Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote:
>> >> >On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote:
>> >> >> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote:
>> >> >> > > I used the fix code proposed by Jakub, but it didn't work correctly, so
>> >> >> > > I tried to correct and improve it, and need some devlink helper.
>> >> >> > > 
>> >> >> > > Anyway, it is a nsim problem, if we want fix this without touch devlink,
>> >> >> > > I think we can add a 'registered' field in struct nsim_dev, and it can be
>> >> >> > > checked in nsim_get_devlink_port() like this:  
>> >> >> > 
>> >> >> > I read the discussion and it's not clear to me why this is a netdevsim
>> >> >> > specific problem. The fundamental problem seems to be that it is
>> >> >> > possible to hold a reference on a devlink instance before it's
>> >> >> > registered and that devlink_free() will free the instance regardless of
>> >> >> > its current reference count because it expects devlink_unregister() to
>> >> >> > block. In this case, the instance was never registered, so
>> >> >> > devlink_unregister() is not called.
>> >> >> > 
>> >> >> > ethtool was able to get a reference on the devlink instance before it
>> >> >> > was registered because netdevsim registers its netdevs before
>> >> >> > registering its devlink instance. However, netdevsim is not the only one
>> >> >> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially
>> >> >> > others do the same thing.
>> >> >> > 
>> >> >> > When you think about it, it's strange that it's even possible for
>> >> >> > ethtool to reach the driver when the netdev used in the request is long
>> >> >> > gone, but it's not holding a reference on the netdev (it's holding a
>> >> >> > reference on the devlink instance instead) and
>> >> >> > devlink_compat_running_version() is called without RTNL.
>> >> >> 
>> >> >> Indeed. We did a bit of a flip-flop with the devlink locking rules
>> >> >> and the fact that the instance is reachable before it is registered 
>> >> >> is a leftover from a previous restructuring :(
>> >> >> 
>> >> >> Hence my preference to get rid of the ordering at the driver level 
>> >> >> than to try to patch it up in the code. Dunno if that's convincing.
>> >> >
>> >> >I don't have a good solution, but changing all the drivers to register
>> >> >their netdevs after the devlink instance is going to be quite painful
>> >> >and too big for 'net'. I feel like the main motivation for this is the
>> >> >ethtool compat stuff, which is not very convincing IMO. I'm quite happy
>> >> >with the current flow where drivers call devlink_register() at the end
>> >> >of their probe.
>> >> >
>> >> >Regarding a solution for the current crash, assuming we agree it's not a
>> >> >netdevsim specific problem, I think the current fix [1] is OK. Note that
>> >> >while it fixes the crash, it potentially creates other (less severe)
>> >> >problems. After user space receives RTM_NEWLINK notification it will
>> >> >need to wait for a certain period of time before issuing
>> >> >'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I
>> >> >guess it's not a big deal for drivers that only register one netdev
>> >> >since they will very quickly follow with devlink_register(), but the
>> >> >race window is larger for drivers that need to register many netdevs,
>> >> >for either physical switch or eswitch ports.
>> >> >
>> >> >Long term, we either need to find a way to make the ethtool compat stuff
>> >> >work correctly or just get rid of it and have affected drivers implement
>> >> >the relevant ethtool operations instead of relying on devlink.
>> >> >
>> >> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/
>> >> 
>> >> I just had a call with Ido. We both think that this might be a good
>> >> solution for -net to avoid the use after free.
>> >> 
>> >> For net-next, we eventually should change driver init flows to register
>> >> devlink instance first and only after that register devlink_port and
>> >> related netdevice. The ordering is important for the userspace app. For
>> >> example the init flow:
>> >> <- RTnetlink new netdev event
>> >> app sees devlink_port handle in IFLA_DEVLINK_PORT
>> >> -> query devlink instance using this handle
>> >> <- ENODEV
>> >> 
>> >> The instance is not registered yet.
>> >
>> >This is supposed to be handled by devlink_notify_register() which sends
>> >"delayed" notifications after devlink_register() is called.
>> >
>> >Unless something is broken, the scenario above shouldn't happen.
>> 
>> Nope, RTnetlink message for new netdev is not handled by that. It is
>> sent right away.
>
>And why don't you fix your new commit dca56c3038c3 ("net: expose devlink port over rtnetlink")
>to do not return devlink instance unless it is registered?
>
>Why is it correct to expose devlink port with not ready to use devlink
>instance?

It is not, but:
Devlink port which is "parent" of the netdev is registered. The netdev
is created with devlink_port registered and that it guaranteed to not
change during netdev lifetime. Therefore, it would be weird to have 2
RTnetlink events:
1. event of netdev being created without devlink port
2. event of netdev with devlink port
If that is what you suggest.

I'm working on a patchset that is making sure that the flow is always
1) devlink_register & netlink event
2) devlink_port_register & netlink event
3) netdev_register & netlink event

Always the same. That means during init, during reload, during port
split.
Leon Romanovsky Nov. 29, 2022, 11:20 a.m. UTC | #25
On Tue, Nov 29, 2022 at 10:05:10AM +0100, Jiri Pirko wrote:
> Tue, Nov 29, 2022 at 09:44:48AM CET, leon@kernel.org wrote:
> >On Mon, Nov 28, 2022 at 02:52:00PM +0100, Jiri Pirko wrote:
> >> Mon, Nov 28, 2022 at 12:50:15PM CET, leon@kernel.org wrote:
> >> >On Mon, Nov 28, 2022 at 10:58:58AM +0100, Jiri Pirko wrote:
> >> >> Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote:
> >> >> >On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote:
> >> >> >> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote:
> >> >> >> > > I used the fix code proposed by Jakub, but it didn't work correctly, so
> >> >> >> > > I tried to correct and improve it, and need some devlink helper.
> >> >> >> > > 
> >> >> >> > > Anyway, it is a nsim problem, if we want fix this without touch devlink,
> >> >> >> > > I think we can add a 'registered' field in struct nsim_dev, and it can be
> >> >> >> > > checked in nsim_get_devlink_port() like this:  
> >> >> >> > 
> >> >> >> > I read the discussion and it's not clear to me why this is a netdevsim
> >> >> >> > specific problem. The fundamental problem seems to be that it is
> >> >> >> > possible to hold a reference on a devlink instance before it's
> >> >> >> > registered and that devlink_free() will free the instance regardless of
> >> >> >> > its current reference count because it expects devlink_unregister() to
> >> >> >> > block. In this case, the instance was never registered, so
> >> >> >> > devlink_unregister() is not called.
> >> >> >> > 
> >> >> >> > ethtool was able to get a reference on the devlink instance before it
> >> >> >> > was registered because netdevsim registers its netdevs before
> >> >> >> > registering its devlink instance. However, netdevsim is not the only one
> >> >> >> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially
> >> >> >> > others do the same thing.
> >> >> >> > 
> >> >> >> > When you think about it, it's strange that it's even possible for
> >> >> >> > ethtool to reach the driver when the netdev used in the request is long
> >> >> >> > gone, but it's not holding a reference on the netdev (it's holding a
> >> >> >> > reference on the devlink instance instead) and
> >> >> >> > devlink_compat_running_version() is called without RTNL.
> >> >> >> 
> >> >> >> Indeed. We did a bit of a flip-flop with the devlink locking rules
> >> >> >> and the fact that the instance is reachable before it is registered 
> >> >> >> is a leftover from a previous restructuring :(
> >> >> >> 
> >> >> >> Hence my preference to get rid of the ordering at the driver level 
> >> >> >> than to try to patch it up in the code. Dunno if that's convincing.
> >> >> >
> >> >> >I don't have a good solution, but changing all the drivers to register
> >> >> >their netdevs after the devlink instance is going to be quite painful
> >> >> >and too big for 'net'. I feel like the main motivation for this is the
> >> >> >ethtool compat stuff, which is not very convincing IMO. I'm quite happy
> >> >> >with the current flow where drivers call devlink_register() at the end
> >> >> >of their probe.
> >> >> >
> >> >> >Regarding a solution for the current crash, assuming we agree it's not a
> >> >> >netdevsim specific problem, I think the current fix [1] is OK. Note that
> >> >> >while it fixes the crash, it potentially creates other (less severe)
> >> >> >problems. After user space receives RTM_NEWLINK notification it will
> >> >> >need to wait for a certain period of time before issuing
> >> >> >'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I
> >> >> >guess it's not a big deal for drivers that only register one netdev
> >> >> >since they will very quickly follow with devlink_register(), but the
> >> >> >race window is larger for drivers that need to register many netdevs,
> >> >> >for either physical switch or eswitch ports.
> >> >> >
> >> >> >Long term, we either need to find a way to make the ethtool compat stuff
> >> >> >work correctly or just get rid of it and have affected drivers implement
> >> >> >the relevant ethtool operations instead of relying on devlink.
> >> >> >
> >> >> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/
> >> >> 
> >> >> I just had a call with Ido. We both think that this might be a good
> >> >> solution for -net to avoid the use after free.
> >> >> 
> >> >> For net-next, we eventually should change driver init flows to register
> >> >> devlink instance first and only after that register devlink_port and
> >> >> related netdevice. The ordering is important for the userspace app. For
> >> >> example the init flow:
> >> >> <- RTnetlink new netdev event
> >> >> app sees devlink_port handle in IFLA_DEVLINK_PORT
> >> >> -> query devlink instance using this handle
> >> >> <- ENODEV
> >> >> 
> >> >> The instance is not registered yet.
> >> >
> >> >This is supposed to be handled by devlink_notify_register() which sends
> >> >"delayed" notifications after devlink_register() is called.
> >> >
> >> >Unless something is broken, the scenario above shouldn't happen.
> >> 
> >> Nope, RTnetlink message for new netdev is not handled by that. It is
> >> sent right away.
> >
> >And why don't you fix your new commit dca56c3038c3 ("net: expose devlink port over rtnetlink")
> >to do not return devlink instance unless it is registered?
> >
> >Why is it correct to expose devlink port with not ready to use devlink
> >instance?
> 
> It is not, but:
> Devlink port which is "parent" of the netdev is registered. The netdev
> is created with devlink_port registered and that it guaranteed to not
> change during netdev lifetime. Therefore, it would be weird to have 2
> RTnetlink events:
> 1. event of netdev being created without devlink port
> 2. event of netdev with devlink port
> If that is what you suggest.

Yes, something like that in similar way to IFLA_EVENT_NOTIFY_PEERS.

> 
> I'm working on a patchset that is making sure that the flow is always
> 1) devlink_register & netlink event
> 2) devlink_port_register & netlink event
> 3) netdev_register & netlink event
> 
> Always the same. That means during init, during reload, during port
> split.

The thing that worries me is unclear lifetime of these devlink objects.
devlink_port is aligned with netdev lifetime, devlink is with device one.

It is very strange to me that we have netdev ready without device
underneath ready.

Can you please document the lifetime models and how all these objects
interact?

Thanks
Jiri Pirko Nov. 29, 2022, 11:44 a.m. UTC | #26
Tue, Nov 29, 2022 at 12:20:01PM CET, leon@kernel.org wrote:
>On Tue, Nov 29, 2022 at 10:05:10AM +0100, Jiri Pirko wrote:
>> Tue, Nov 29, 2022 at 09:44:48AM CET, leon@kernel.org wrote:
>> >On Mon, Nov 28, 2022 at 02:52:00PM +0100, Jiri Pirko wrote:
>> >> Mon, Nov 28, 2022 at 12:50:15PM CET, leon@kernel.org wrote:
>> >> >On Mon, Nov 28, 2022 at 10:58:58AM +0100, Jiri Pirko wrote:
>> >> >> Mon, Nov 28, 2022 at 10:20:53AM CET, idosch@idosch.org wrote:
>> >> >> >On Wed, Nov 23, 2022 at 06:18:00PM -0800, Jakub Kicinski wrote:
>> >> >> >> On Wed, 23 Nov 2022 21:18:14 +0200 Ido Schimmel wrote:
>> >> >> >> > > I used the fix code proposed by Jakub, but it didn't work correctly, so
>> >> >> >> > > I tried to correct and improve it, and need some devlink helper.
>> >> >> >> > > 
>> >> >> >> > > Anyway, it is a nsim problem, if we want fix this without touch devlink,
>> >> >> >> > > I think we can add a 'registered' field in struct nsim_dev, and it can be
>> >> >> >> > > checked in nsim_get_devlink_port() like this:  
>> >> >> >> > 
>> >> >> >> > I read the discussion and it's not clear to me why this is a netdevsim
>> >> >> >> > specific problem. The fundamental problem seems to be that it is
>> >> >> >> > possible to hold a reference on a devlink instance before it's
>> >> >> >> > registered and that devlink_free() will free the instance regardless of
>> >> >> >> > its current reference count because it expects devlink_unregister() to
>> >> >> >> > block. In this case, the instance was never registered, so
>> >> >> >> > devlink_unregister() is not called.
>> >> >> >> > 
>> >> >> >> > ethtool was able to get a reference on the devlink instance before it
>> >> >> >> > was registered because netdevsim registers its netdevs before
>> >> >> >> > registering its devlink instance. However, netdevsim is not the only one
>> >> >> >> > doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially
>> >> >> >> > others do the same thing.
>> >> >> >> > 
>> >> >> >> > When you think about it, it's strange that it's even possible for
>> >> >> >> > ethtool to reach the driver when the netdev used in the request is long
>> >> >> >> > gone, but it's not holding a reference on the netdev (it's holding a
>> >> >> >> > reference on the devlink instance instead) and
>> >> >> >> > devlink_compat_running_version() is called without RTNL.
>> >> >> >> 
>> >> >> >> Indeed. We did a bit of a flip-flop with the devlink locking rules
>> >> >> >> and the fact that the instance is reachable before it is registered 
>> >> >> >> is a leftover from a previous restructuring :(
>> >> >> >> 
>> >> >> >> Hence my preference to get rid of the ordering at the driver level 
>> >> >> >> than to try to patch it up in the code. Dunno if that's convincing.
>> >> >> >
>> >> >> >I don't have a good solution, but changing all the drivers to register
>> >> >> >their netdevs after the devlink instance is going to be quite painful
>> >> >> >and too big for 'net'. I feel like the main motivation for this is the
>> >> >> >ethtool compat stuff, which is not very convincing IMO. I'm quite happy
>> >> >> >with the current flow where drivers call devlink_register() at the end
>> >> >> >of their probe.
>> >> >> >
>> >> >> >Regarding a solution for the current crash, assuming we agree it's not a
>> >> >> >netdevsim specific problem, I think the current fix [1] is OK. Note that
>> >> >> >while it fixes the crash, it potentially creates other (less severe)
>> >> >> >problems. After user space receives RTM_NEWLINK notification it will
>> >> >> >need to wait for a certain period of time before issuing
>> >> >> >'ETHTOOL_GDRVINFO' as otherwise it will not get the firmware version. I
>> >> >> >guess it's not a big deal for drivers that only register one netdev
>> >> >> >since they will very quickly follow with devlink_register(), but the
>> >> >> >race window is larger for drivers that need to register many netdevs,
>> >> >> >for either physical switch or eswitch ports.
>> >> >> >
>> >> >> >Long term, we either need to find a way to make the ethtool compat stuff
>> >> >> >work correctly or just get rid of it and have affected drivers implement
>> >> >> >the relevant ethtool operations instead of relying on devlink.
>> >> >> >
>> >> >> >[1] https://lore.kernel.org/netdev/20221122121048.776643-1-yangyingliang@huawei.com/
>> >> >> 
>> >> >> I just had a call with Ido. We both think that this might be a good
>> >> >> solution for -net to avoid the use after free.
>> >> >> 
>> >> >> For net-next, we eventually should change driver init flows to register
>> >> >> devlink instance first and only after that register devlink_port and
>> >> >> related netdevice. The ordering is important for the userspace app. For
>> >> >> example the init flow:
>> >> >> <- RTnetlink new netdev event
>> >> >> app sees devlink_port handle in IFLA_DEVLINK_PORT
>> >> >> -> query devlink instance using this handle
>> >> >> <- ENODEV
>> >> >> 
>> >> >> The instance is not registered yet.
>> >> >
>> >> >This is supposed to be handled by devlink_notify_register() which sends
>> >> >"delayed" notifications after devlink_register() is called.
>> >> >
>> >> >Unless something is broken, the scenario above shouldn't happen.
>> >> 
>> >> Nope, RTnetlink message for new netdev is not handled by that. It is
>> >> sent right away.
>> >
>> >And why don't you fix your new commit dca56c3038c3 ("net: expose devlink port over rtnetlink")
>> >to do not return devlink instance unless it is registered?
>> >
>> >Why is it correct to expose devlink port with not ready to use devlink
>> >instance?
>> 
>> It is not, but:
>> Devlink port which is "parent" of the netdev is registered. The netdev
>> is created with devlink_port registered and that it guaranteed to not
>> change during netdev lifetime. Therefore, it would be weird to have 2
>> RTnetlink events:
>> 1. event of netdev being created without devlink port
>> 2. event of netdev with devlink port
>> If that is what you suggest.
>
>Yes, something like that in similar way to IFLA_EVENT_NOTIFY_PEERS.
>
>> 
>> I'm working on a patchset that is making sure that the flow is always
>> 1) devlink_register & netlink event
>> 2) devlink_port_register & netlink event
>> 3) netdev_register & netlink event
>> 
>> Always the same. That means during init, during reload, during port
>> split.
>
>The thing that worries me is unclear lifetime of these devlink objects.
>devlink_port is aligned with netdev lifetime, devlink is with device one.

Correct. Therefore it makes sense to create device object and propagate
the existence to the user, then create port object with netdevice and
propagate those.


>
>It is very strange to me that we have netdev ready without device
>underneath ready.
>
>Can you please document the lifetime models and how all these objects
>interact?

Yep. On top of that, I will put an assert into devlink_port_register to
check devlink instance is registered.


>
>Thanks
Jakub Kicinski Nov. 30, 2022, 2:18 a.m. UTC | #27
On Tue, 29 Nov 2022 09:31:40 +0100 Jiri Pirko wrote:
> >Cool. Do you also agree with doing proper refcounting for the devlink
> >instance struct and the liveness check after locking the instance?  
> 
> Could you elaborate a bit more? I missed that in the thread and can't
> find it. Why do we need it?

Look at the __devlink_free() and changes 
to devlink_compat_flash_update() here:

https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/

The model I had in mind (a year ago when it all started) was that 
the driver takes the devlink instance lock around its entire init path,
including the registration of the instance. This way the devlink
instance is never visible "half initialized". I mean - it's "visible"
as in you can see a notification over netlink before init is done but
you can't access it until the init in the driver is completed and it
releases the instance lock.

For that to work and to avoid ordering issues with netdev we need to
allow unregistering a devlink instance before all references are gone.

So we atomically look up and take a reference on a devlink instance.
Then take its lock. Then under the instance lock we check if it's still
registered.

For devlink core that's not a problem it's all hidden in the helpers.
Leon Romanovsky Nov. 30, 2022, 8:54 a.m. UTC | #28
On Tue, Nov 29, 2022 at 06:18:26PM -0800, Jakub Kicinski wrote:
> On Tue, 29 Nov 2022 09:31:40 +0100 Jiri Pirko wrote:
> > >Cool. Do you also agree with doing proper refcounting for the devlink
> > >instance struct and the liveness check after locking the instance?  
> > 
> > Could you elaborate a bit more? I missed that in the thread and can't
> > find it. Why do we need it?
> 
> Look at the __devlink_free() and changes 
> to devlink_compat_flash_update() here:
> 
> https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/
> 
> The model I had in mind (a year ago when it all started) was that 
> the driver takes the devlink instance lock around its entire init path,
> including the registration of the instance. This way the devlink
> instance is never visible "half initialized". I mean - it's "visible"
> as in you can see a notification over netlink before init is done but
> you can't access it until the init in the driver is completed and it
> releases the instance lock.

In parallel thread, Jiri wanted to avoid this situation of netlink
notifications for not-visible yet object. He gave as an example
devlink_port which is advertised without devlink being ready.

Thanks
Jiri Pirko Nov. 30, 2022, 11:32 a.m. UTC | #29
Wed, Nov 30, 2022 at 09:54:06AM CET, leon@kernel.org wrote:
>On Tue, Nov 29, 2022 at 06:18:26PM -0800, Jakub Kicinski wrote:
>> On Tue, 29 Nov 2022 09:31:40 +0100 Jiri Pirko wrote:
>> > >Cool. Do you also agree with doing proper refcounting for the devlink
>> > >instance struct and the liveness check after locking the instance?  
>> > 
>> > Could you elaborate a bit more? I missed that in the thread and can't
>> > find it. Why do we need it?
>> 
>> Look at the __devlink_free() and changes 
>> to devlink_compat_flash_update() here:
>> 
>> https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/
>> 
>> The model I had in mind (a year ago when it all started) was that 
>> the driver takes the devlink instance lock around its entire init path,
>> including the registration of the instance. This way the devlink
>> instance is never visible "half initialized". I mean - it's "visible"
>> as in you can see a notification over netlink before init is done but
>> you can't access it until the init in the driver is completed and it
>> releases the instance lock.
>
>In parallel thread, Jiri wanted to avoid this situation of netlink
>notifications for not-visible yet object. He gave as an example
>devlink_port which is advertised without devlink being ready.

To be honest, I'm lost tracking what you point at. I never suggested to
send notification of devlink_port before devlink notification is sent
previously. Perhaps you misread that.

>
>Thanks
Jiri Pirko Nov. 30, 2022, 11:42 a.m. UTC | #30
Wed, Nov 30, 2022 at 03:18:26AM CET, kuba@kernel.org wrote:
>On Tue, 29 Nov 2022 09:31:40 +0100 Jiri Pirko wrote:
>> >Cool. Do you also agree with doing proper refcounting for the devlink
>> >instance struct and the liveness check after locking the instance?  
>> 
>> Could you elaborate a bit more? I missed that in the thread and can't
>> find it. Why do we need it?
>
>Look at the __devlink_free() and changes 
>to devlink_compat_flash_update() here:
>
>https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/

**)
I see. With the change I suggest, meaning doing
devlink_port_register/unregister() and netdev_register/unregister only
for registered devlink instance, you don't need this at all. When you
hit this compat callback, the netdevice is there and therefore devlink
instance is registered for sure.


>
>The model I had in mind (a year ago when it all started) was that 
>the driver takes the devlink instance lock around its entire init path,
>including the registration of the instance. This way the devlink
>instance is never visible "half initialized". I mean - it's "visible"
>as in you can see a notification over netlink before init is done but
>you can't access it until the init in the driver is completed and it
>releases the instance lock.

What is "half-initialized"? Take devlink reload flow for instance. There
are multiple things removed/readded, like devlink_port and related
netdevice. No problem there.

I think that we really need to go a step back and put the
devlink_register at the point of init flow where all things that are
"static" during register lifetime are initialized, the rest would be
initialized later on. This would make things aligned with
devlink_reload() and would make possible to share init/fini/reload
code in drivers.


>
>For that to work and to avoid ordering issues with netdev we need to
>allow unregistering a devlink instance before all references are gone.

References of what? devlink instance, put by devlink_put()?


>
>So we atomically look up and take a reference on a devlink instance.
>Then take its lock. Then under the instance lock we check if it's still
>registered.

As mentioned above (**), I don't think this is needed.


>
>For devlink core that's not a problem it's all hidden in the helpers.
Jakub Kicinski Nov. 30, 2022, 4:36 p.m. UTC | #31
On Wed, 30 Nov 2022 10:54:06 +0200 Leon Romanovsky wrote:
> In parallel thread, Jiri wanted to avoid this situation of netlink
> notifications for not-visible yet object. He gave as an example
> devlink_port which is advertised without devlink being ready.

To be clear the exact ordering matters much less in my scheme,
because the user space caller will wait for the instance lock
before checking if the instance is registered. So an early user 
request will just sleep on the instance lock until init is done.
Jakub Kicinski Nov. 30, 2022, 4:46 p.m. UTC | #32
On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote:
> >Look at the __devlink_free() and changes 
> >to devlink_compat_flash_update() here:
> >
> >https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/  
> 
> **)
> I see. With the change I suggest, meaning doing
> devlink_port_register/unregister() and netdev_register/unregister only
> for registered devlink instance, you don't need this at all. When you
> hit this compat callback, the netdevice is there and therefore devlink
> instance is registered for sure.

If you move devlink registration up it has to be under the instance
lock, otherwise we're back to reload problems. That implies unregister
should be under the lock too. But then we can't wait for refs in
unregister. Perhaps I don't understand the suggestion.

> >The model I had in mind (a year ago when it all started) was that 
> >the driver takes the devlink instance lock around its entire init path,
> >including the registration of the instance. This way the devlink
> >instance is never visible "half initialized". I mean - it's "visible"
> >as in you can see a notification over netlink before init is done but
> >you can't access it until the init in the driver is completed and it
> >releases the instance lock.  
> 
> What is "half-initialized"? Take devlink reload flow for instance. There
> are multiple things removed/readded, like devlink_port and related
> netdevice. No problem there.

Yes, but reload is under the instance lock, so nothing can mess with 
a device in a transitional state.

> I think that we really need to go a step back and put the
> devlink_register at the point of init flow where all things that are
> "static" during register lifetime are initialized, the rest would be
> initialized later on. This would make things aligned with
> devlink_reload() and would make possible to share init/fini/reload
> code in drivers.

Yes, I agree that the move should be done but I don't think its
sufficient.

> >For that to work and to avoid ordering issues with netdev we need to
> >allow unregistering a devlink instance before all references are gone.  
> 
> References of what? devlink instance, put by devlink_put()?

Yes.

> >So we atomically look up and take a reference on a devlink instance.
> >Then take its lock. Then under the instance lock we check if it's still
> >registered.  
> 
> As mentioned above (**), I don't think this is needed.

But it is, please just let me do it and make the bugs stop 
Jiri Pirko Nov. 30, 2022, 5 p.m. UTC | #33
Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote:
>On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote:
>> >Look at the __devlink_free() and changes 
>> >to devlink_compat_flash_update() here:
>> >
>> >https://lore.kernel.org/netdev/20211030231254.2477599-3-kuba@kernel.org/  
>> 
>> **)
>> I see. With the change I suggest, meaning doing
>> devlink_port_register/unregister() and netdev_register/unregister only
>> for registered devlink instance, you don't need this at all. When you
>> hit this compat callback, the netdevice is there and therefore devlink
>> instance is registered for sure.
>
>If you move devlink registration up it has to be under the instance
>lock, otherwise we're back to reload problems. That implies unregister
>should be under the lock too. But then we can't wait for refs in
>unregister. Perhaps I don't understand the suggestion.

I unlock for register and for the rest of the init I lock again.


>
>> >The model I had in mind (a year ago when it all started) was that 
>> >the driver takes the devlink instance lock around its entire init path,
>> >including the registration of the instance. This way the devlink
>> >instance is never visible "half initialized". I mean - it's "visible"
>> >as in you can see a notification over netlink before init is done but
>> >you can't access it until the init in the driver is completed and it
>> >releases the instance lock.  
>> 
>> What is "half-initialized"? Take devlink reload flow for instance. There
>> are multiple things removed/readded, like devlink_port and related
>> netdevice. No problem there.
>
>Yes, but reload is under the instance lock, so nothing can mess with 
>a device in a transitional state.

Sure, that is what I want to do too. To be under instance lock.

>
>> I think that we really need to go a step back and put the
>> devlink_register at the point of init flow where all things that are
>> "static" during register lifetime are initialized, the rest would be
>> initialized later on. This would make things aligned with
>> devlink_reload() and would make possible to share init/fini/reload
>> code in drivers.
>
>Yes, I agree that the move should be done but I don't think its
>sufficient.
>
>> >For that to work and to avoid ordering issues with netdev we need to
>> >allow unregistering a devlink instance before all references are gone.  
>> 
>> References of what? devlink instance, put by devlink_put()?
>
>Yes.
>
>> >So we atomically look up and take a reference on a devlink instance.
>> >Then take its lock. Then under the instance lock we check if it's still
>> >registered.  
>> 
>> As mentioned above (**), I don't think this is needed.
>
>But it is, please just let me do it and make the bugs stop 
Jakub Kicinski Nov. 30, 2022, 5:20 p.m. UTC | #34
On Wed, 30 Nov 2022 18:00:05 +0100 Jiri Pirko wrote:
> Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote:
> >On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote:  
> >> **)
> >> I see. With the change I suggest, meaning doing
> >> devlink_port_register/unregister() and netdev_register/unregister only
> >> for registered devlink instance, you don't need this at all. When you
> >> hit this compat callback, the netdevice is there and therefore devlink
> >> instance is registered for sure.  
> >
> >If you move devlink registration up it has to be under the instance
> >lock, otherwise we're back to reload problems. That implies unregister
> >should be under the lock too. But then we can't wait for refs in
> >unregister. Perhaps I don't understand the suggestion.  
> 
> I unlock for register and for the rest of the init I lock again.

The moment you register that instance callbacks can start coming.
Leon move the register call last for a good reason - all drivers
we looked at had bugs in handling init.

We can come up with fixes in the drivers, flags, devlink_set_features()
and all that sort of garbage until the day we die but let's not.
The driver facing API should be simple - hold the lock around entire
init.

> >> What is "half-initialized"? Take devlink reload flow for instance. There
> >> are multiple things removed/readded, like devlink_port and related
> >> netdevice. No problem there.  
> >
> >Yes, but reload is under the instance lock, so nothing can mess with 
> >a device in a transitional state.  
> 
> Sure, that is what I want to do too. To be under instance lock.

I'm confused, you just said "I unlock for register".

> >> As mentioned above (**), I don't think this is needed.  
> >
> >But it is, please just let me do it and make the bugs stop 
Leon Romanovsky Nov. 30, 2022, 7:20 p.m. UTC | #35
On Wed, Nov 30, 2022 at 09:20:42AM -0800, Jakub Kicinski wrote:
> On Wed, 30 Nov 2022 18:00:05 +0100 Jiri Pirko wrote:
> > Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote:
> > >On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote:  
> > >> **)
> > >> I see. With the change I suggest, meaning doing
> > >> devlink_port_register/unregister() and netdev_register/unregister only
> > >> for registered devlink instance, you don't need this at all. When you
> > >> hit this compat callback, the netdevice is there and therefore devlink
> > >> instance is registered for sure.  
> > >
> > >If you move devlink registration up it has to be under the instance
> > >lock, otherwise we're back to reload problems. That implies unregister
> > >should be under the lock too. But then we can't wait for refs in
> > >unregister. Perhaps I don't understand the suggestion.  
> > 
> > I unlock for register and for the rest of the init I lock again.
> 
> The moment you register that instance callbacks can start coming.
> Leon move the register call last for a good reason - all drivers
> we looked at had bugs in handling init.

Plus we had very cozy lock->unlock->lock sequences during devlink
command execution, which caused to races between devlink calls
and driver initialization.

So I'm also interested to see what Jiri meant by saying "I unlock for
register and for the rest of the init I lock again".

Thanks
Jacob Keller Nov. 30, 2022, 10:25 p.m. UTC | #36
On 11/23/2022 11:18 AM, Ido Schimmel wrote:
> On Wed, Nov 23, 2022 at 04:34:49PM +0800, Yang Yingliang wrote:
>>
>> On 2022/11/23 15:41, Leon Romanovsky wrote:
>>> On Wed, Nov 23, 2022 at 02:40:24PM +0800, Yang Yingliang wrote:
>>>> On 2022/11/23 4:27, Jakub Kicinski wrote:
>>>>> On Tue, 22 Nov 2022 21:04:29 +0200 Leon Romanovsky wrote:
>>>>>>>> Please fix nsim instead of devlink.
>>>>>>> I think if devlink is not registered, it can not be get and used, but there
>>>>>>> is no API for driver to check this, can I introduce a new helper name
>>>>>>> devlink_is_registered() for driver using.
>>>>>> There is no need in such API as driver calls to devlink_register() and
>>>>>> as such it knows when devlink is registered.
>>>>>>
>>>>>> This UAF is nsim specific issue. Real devices have single .probe()
>>>>>> routine with serialized registration flow. None of them will use
>>>>>> devlink_is_registered() call.
>>>>> Agreed, the fix is to move the register call back.
>>>>> Something along the lines of the untested patch below?
>>>>> Yang Yingliang would you be able to turn that into a real patch?
>>>>>
>>>>> diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
>>>>> index e14686594a71..26602d5fe0a2 100644
>>>>> --- a/drivers/net/netdevsim/dev.c
>>>>> +++ b/drivers/net/netdevsim/dev.c
>>>>> @@ -1566,12 +1566,15 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>>>>>     	err = devlink_params_register(devlink, nsim_devlink_params,
>>>>>     				      ARRAY_SIZE(nsim_devlink_params));
>>>>>     	if (err)
>>>>> -		goto err_dl_unregister;
>>>>> +		goto err_resource_unregister;
>>>>>     	nsim_devlink_set_params_init_values(nsim_dev, devlink);
>>>>> +	/* here, because params API still expect devlink to be unregistered */
>>>>> +	devl_register(devlink);
>>>>> +
>>>> devlink_set_features() called at last in probe() also needs devlink is not
>>>> registered.
>>>>>     	err = nsim_dev_dummy_region_init(nsim_dev, devlink);
>>>>>     	if (err)
>>>>> -		goto err_params_unregister;
>>>>> +		goto err_dl_unregister;
>>>>>     	err = nsim_dev_traps_init(devlink);
>>>>>     	if (err)
>>>>> @@ -1610,7 +1613,6 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>>>>>     	nsim_dev->esw_mode = DEVLINK_ESWITCH_MODE_LEGACY;
>>>>>     	devlink_set_features(devlink, DEVLINK_F_RELOAD);
>>>>>     	devl_unlock(devlink);
>>>>> -	devlink_register(devlink);
>>>>>     	return 0;
>>>>>     err_hwstats_exit:
>>>>> @@ -1629,10 +1631,11 @@ int nsim_drv_probe(struct nsim_bus_dev *nsim_bus_dev)
>>>>>     	nsim_dev_traps_exit(devlink);
>>>>>     err_dummy_region_exit:
>>>>>     	nsim_dev_dummy_region_exit(nsim_dev);
>>>>> -err_params_unregister:
>>>>> +err_dl_unregister:
>>>>> +	devl_unregister(devlink);
>>>> It races with dev_ethtool():
>>>> dev_ethtool
>>>>     devlink_try_get()
>>>>                                   nsim_drv_probe
>>>>                                   devl_lock()
>>>>       devl_lock()
>>>>                                   devlink_unregister()
>>>>                                     devlink_put()
>>>>                                     wait_for_completion() <- the refcount is
>>>> got in dev_ethtool, it causes ABBA deadlock
>>> But all these races are nsim specific ones.
>>> Can you please explain why devlink.[c|h] MUST be changed and nsim can't
>>> be fixed?
>> I used the fix code proposed by Jakub, but it didn't work correctly, so
>> I tried to correct and improve it, and need some devlink helper.
>>
>> Anyway, it is a nsim problem, if we want fix this without touch devlink,
>> I think we can add a 'registered' field in struct nsim_dev, and it can be
>> checked in nsim_get_devlink_port() like this:
> 
> I read the discussion and it's not clear to me why this is a netdevsim
> specific problem. The fundamental problem seems to be that it is
> possible to hold a reference on a devlink instance before it's
> registered and that devlink_free() will free the instance regardless of
> its current reference count because it expects devlink_unregister() to
> block. In this case, the instance was never registered, so
> devlink_unregister() is not called.
> 

Shouldn't devlink_free never free until after all references drop? 
That's how typical reference count works. I guess its just assuming that 
unregister will prevent this but....

> ethtool was able to get a reference on the devlink instance before it
> was registered because netdevsim registers its netdevs before
> registering its devlink instance. However, netdevsim is not the only one
> doing this: funeth, ice, prestera, mlx4, mlxsw, nfp and potentially
> others do the same thing.
> 

We used to register devlink before netdev, but this was changed at some 
point because we wanted to register it late after things had been setup.

> When you think about it, it's strange that it's even possible for
> ethtool to reach the driver when the netdev used in the request is long
> gone, but it's not holding a reference on the netdev (it's holding a
> reference on the devlink instance instead) and
> devlink_compat_running_version() is called without RTNL.

Yea that seems weird. I would expect ethtool to get to devlink via 
netdev? (especially now with the way we changed netdev to hold the devlink?)

Hmm.
Jiri Pirko Dec. 1, 2022, 8:39 a.m. UTC | #37
Wed, Nov 30, 2022 at 06:20:42PM CET, kuba@kernel.org wrote:
>On Wed, 30 Nov 2022 18:00:05 +0100 Jiri Pirko wrote:
>> Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote:
>> >On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote:  
>> >> **)
>> >> I see. With the change I suggest, meaning doing
>> >> devlink_port_register/unregister() and netdev_register/unregister only
>> >> for registered devlink instance, you don't need this at all. When you
>> >> hit this compat callback, the netdevice is there and therefore devlink
>> >> instance is registered for sure.  
>> >
>> >If you move devlink registration up it has to be under the instance
>> >lock, otherwise we're back to reload problems. That implies unregister
>> >should be under the lock too. But then we can't wait for refs in
>> >unregister. Perhaps I don't understand the suggestion.  
>> 
>> I unlock for register and for the rest of the init I lock again.
>
>The moment you register that instance callbacks can start coming.
>Leon move the register call last for a good reason - all drivers
>we looked at had bugs in handling init.
>
>We can come up with fixes in the drivers, flags, devlink_set_features()
>and all that sort of garbage until the day we die but let's not.
>The driver facing API should be simple - hold the lock around entire
>init.
>
>> >> What is "half-initialized"? Take devlink reload flow for instance. There
>> >> are multiple things removed/readded, like devlink_port and related
>> >> netdevice. No problem there.  
>> >
>> >Yes, but reload is under the instance lock, so nothing can mess with 
>> >a device in a transitional state.  
>> 
>> Sure, that is what I want to do too. To be under instance lock.
>
>I'm confused, you just said "I unlock for register".

Ah, right. I got your point now, you don't want the user to see
half-init devlink objects. In reload, the secondhalf-uninit&seconfhalf-init
happens atomically under instance lock, so the user sees the whole picture
still.

But is it a problem? For ports, I don't think so. For the other objects
being removed-readded during reload, why do you think it is a problem?


>
>> >> As mentioned above (**), I don't think this is needed.  
>> >
>> >But it is, please just let me do it and make the bugs stop 
Jiri Pirko Dec. 1, 2022, 8:40 a.m. UTC | #38
Wed, Nov 30, 2022 at 08:20:34PM CET, leon@kernel.org wrote:
>On Wed, Nov 30, 2022 at 09:20:42AM -0800, Jakub Kicinski wrote:
>> On Wed, 30 Nov 2022 18:00:05 +0100 Jiri Pirko wrote:
>> > Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote:
>> > >On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote:  
>> > >> **)
>> > >> I see. With the change I suggest, meaning doing
>> > >> devlink_port_register/unregister() and netdev_register/unregister only
>> > >> for registered devlink instance, you don't need this at all. When you
>> > >> hit this compat callback, the netdevice is there and therefore devlink
>> > >> instance is registered for sure.  
>> > >
>> > >If you move devlink registration up it has to be under the instance
>> > >lock, otherwise we're back to reload problems. That implies unregister
>> > >should be under the lock too. But then we can't wait for refs in
>> > >unregister. Perhaps I don't understand the suggestion.  
>> > 
>> > I unlock for register and for the rest of the init I lock again.
>> 
>> The moment you register that instance callbacks can start coming.
>> Leon move the register call last for a good reason - all drivers
>> we looked at had bugs in handling init.
>
>Plus we had very cozy lock->unlock->lock sequences during devlink
>command execution, which caused to races between devlink calls
>and driver initialization.

So? Why do you think it is a problem?

>
>So I'm also interested to see what Jiri meant by saying "I unlock for
>register and for the rest of the init I lock again".
>
>Thanks
Leon Romanovsky Dec. 1, 2022, 10:05 a.m. UTC | #39
On Thu, Dec 01, 2022 at 09:40:00AM +0100, Jiri Pirko wrote:
> Wed, Nov 30, 2022 at 08:20:34PM CET, leon@kernel.org wrote:
> >On Wed, Nov 30, 2022 at 09:20:42AM -0800, Jakub Kicinski wrote:
> >> On Wed, 30 Nov 2022 18:00:05 +0100 Jiri Pirko wrote:
> >> > Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote:
> >> > >On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote:  
> >> > >> **)
> >> > >> I see. With the change I suggest, meaning doing
> >> > >> devlink_port_register/unregister() and netdev_register/unregister only
> >> > >> for registered devlink instance, you don't need this at all. When you
> >> > >> hit this compat callback, the netdevice is there and therefore devlink
> >> > >> instance is registered for sure.  
> >> > >
> >> > >If you move devlink registration up it has to be under the instance
> >> > >lock, otherwise we're back to reload problems. That implies unregister
> >> > >should be under the lock too. But then we can't wait for refs in
> >> > >unregister. Perhaps I don't understand the suggestion.  
> >> > 
> >> > I unlock for register and for the rest of the init I lock again.
> >> 
> >> The moment you register that instance callbacks can start coming.
> >> Leon move the register call last for a good reason - all drivers
> >> we looked at had bugs in handling init.
> >
> >Plus we had very cozy lock->unlock->lock sequences during devlink
> >command execution, which caused to races between devlink calls
> >and driver initialization.
> 
> So? Why do you think it is a problem?

We need to see the actual implementation. In general, once you unlock
you can get other threads to change the state of your device.

> 
> >
> >So I'm also interested to see what Jiri meant by saying "I unlock for
> >register and for the rest of the init I lock again".
> >
> >Thanks
Jiri Pirko Dec. 1, 2022, 12:20 p.m. UTC | #40
Thu, Dec 01, 2022 at 11:05:38AM CET, leon@kernel.org wrote:
>On Thu, Dec 01, 2022 at 09:40:00AM +0100, Jiri Pirko wrote:
>> Wed, Nov 30, 2022 at 08:20:34PM CET, leon@kernel.org wrote:
>> >On Wed, Nov 30, 2022 at 09:20:42AM -0800, Jakub Kicinski wrote:
>> >> On Wed, 30 Nov 2022 18:00:05 +0100 Jiri Pirko wrote:
>> >> > Wed, Nov 30, 2022 at 05:46:59PM CET, kuba@kernel.org wrote:
>> >> > >On Wed, 30 Nov 2022 12:42:39 +0100 Jiri Pirko wrote:  
>> >> > >> **)
>> >> > >> I see. With the change I suggest, meaning doing
>> >> > >> devlink_port_register/unregister() and netdev_register/unregister only
>> >> > >> for registered devlink instance, you don't need this at all. When you
>> >> > >> hit this compat callback, the netdevice is there and therefore devlink
>> >> > >> instance is registered for sure.  
>> >> > >
>> >> > >If you move devlink registration up it has to be under the instance
>> >> > >lock, otherwise we're back to reload problems. That implies unregister
>> >> > >should be under the lock too. But then we can't wait for refs in
>> >> > >unregister. Perhaps I don't understand the suggestion.  
>> >> > 
>> >> > I unlock for register and for the rest of the init I lock again.
>> >> 
>> >> The moment you register that instance callbacks can start coming.
>> >> Leon move the register call last for a good reason - all drivers
>> >> we looked at had bugs in handling init.
>> >
>> >Plus we had very cozy lock->unlock->lock sequences during devlink
>> >command execution, which caused to races between devlink calls
>> >and driver initialization.
>> 
>> So? Why do you think it is a problem?
>
>We need to see the actual implementation. In general, once you unlock
>you can get other threads to change the state of your device.

Sure, I still don't understand, why it would be a problem.

>
>> 
>> >
>> >So I'm also interested to see what Jiri meant by saying "I unlock for
>> >register and for the rest of the init I lock again".
>> >
>> >Thanks
diff mbox series

Patch

diff --git a/net/core/devlink.c b/net/core/devlink.c
index 89baa7c0938b..6453ac0558fb 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -250,6 +250,9 @@  void devlink_put(struct devlink *devlink)
 
 struct devlink *__must_check devlink_try_get(struct devlink *devlink)
 {
+	 if (!xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED))
+		return NULL;
+
 	if (refcount_inc_not_zero(&devlink->refcount))
 		return devlink;
 	return NULL;