
[net-next,1/2] net/sched: act_api: rely on rcu in tcf_idr_check_alloc

Message ID 20231205153012.484687-2-pctammela@mojatatu.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series net/sched: optimizations around action binding and init

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1117 this patch: 1117
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 1143 this patch: 1143
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1144 this patch: 1144
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 77 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Pedro Tammela Dec. 5, 2023, 3:30 p.m. UTC
Instead of relying only on the idrinfo->lock mutex for
bind/alloc logic, rely on a combination of rcu + mutex + atomics
to better scale the case where multiple rtnl-less filters are
binding to the same action object.

Action binding happens when an action index is specified explicitly and
an action with such index exists. Example:
  tc actions add action drop index 1
  tc filter add ... matchall action drop index 1
  tc filter add ... matchall action drop index 1
  tc filter add ... matchall action drop index 1
  tc filter ls ...
     filter protocol all pref 49150 matchall chain 0 filter protocol all pref 49150 matchall chain 0 handle 0x1
     not_in_hw
           action order 1: gact action drop
            random type none pass val 0
            index 1 ref 4 bind 3

   filter protocol all pref 49151 matchall chain 0 filter protocol all pref 49151 matchall chain 0 handle 0x1
     not_in_hw
           action order 1: gact action drop
            random type none pass val 0
            index 1 ref 4 bind 3

   filter protocol all pref 49152 matchall chain 0 filter protocol all pref 49152 matchall chain 0 handle 0x1
     not_in_hw
           action order 1: gact action drop
            random type none pass val 0
            index 1 ref 4 bind 3

When no index is specified, as before, grab the mutex and allocate
the next available id in the idr. In this version, as opposed to before,
this is simplified to store the -EBUSY pointer directly instead of the
previous alloc + replace combination.
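
Condensed from the diff below (not a literal hunk), that simplification
boils down to:

	/* before: reserve the id with a NULL entry, then mark it busy
	 * (max is either *index or UINT_MAX depending on the branch)
	 */
	ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index, max, GFP_KERNEL);
	if (!ret)
		idr_replace(&idrinfo->action_idr, ERR_PTR(-EBUSY), *index);

	/* after: store the -EBUSY placeholder in a single step */
	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
			    GFP_KERNEL);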

When an index is specified, rely on rcu to find whether there's an object
at that index. If there's none, fall back to the above, serializing on the
mutex and reserving the specified id. If there is one, it's either an -EBUSY
pointer, in which case we just try again until it becomes an action, or an
actual action. Given the rcu guarantees, the action found could be dead,
therefore we bump the refcount only if it's not 0, handling the case where
it is in fact 0.

As bind and the action refcount are already atomics, these increments can
happen without the mutex protection while many tcf_idr_check_alloc calls
race to bind to the same action instance.
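
In code terms, the bind fast path described above condenses to the
following (excerpted and annotated from the patch below, under the
if (*index) branch):

	rcu_read_lock();
	p = idr_find(&idrinfo->action_idr, *index);
	if (IS_ERR(p)) {		/* -EBUSY placeholder still in place */
		rcu_read_unlock();
		goto again;		/* another task is initializing it */
	}
	if (!p) {			/* empty slot */
		rcu_read_unlock();
		goto new;		/* reserve the index under the mutex */
	}
	if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
		rcu_read_unlock();	/* found a dying action */
		return -ENOENT;
	}
	if (bind)			/* bindcnt/refcnt are already atomic */
		atomic_inc(&p->tcfa_bindcnt);
	*a = p;
	rcu_read_unlock();
	return 1;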

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
---
 net/sched/act_api.c | 56 +++++++++++++++++++++++++++------------------
 1 file changed, 34 insertions(+), 22 deletions(-)

Comments

Vlad Buslov Dec. 5, 2023, 6:34 p.m. UTC | #1
On Tue 05 Dec 2023 at 12:30, Pedro Tammela <pctammela@mojatatu.com> wrote:
> Instead of relying only on the idrinfo->lock mutex for
> bind/alloc logic, rely on a combination of rcu + mutex + atomics
> to better scale the case where multiple rtnl-less filters are
> binding to the same action object.
>
> Action binding happens when an action index is specified explicitly and
> an action exists which such index exists. Example:

Nit: the first sentence looks mangled, extra 'exists' word and probably
'which' should be 'with'.

>   tc actions add action drop index 1
>   tc filter add ... matchall action drop index 1
>   tc filter add ... matchall action drop index 1
>   tc filter add ... matchall action drop index 1
>   tc filter ls ...
>      filter protocol all pref 49150 matchall chain 0 filter protocol all pref 49150 matchall chain 0 handle 0x1
>      not_in_hw
>            action order 1: gact action drop
>             random type none pass val 0
>             index 1 ref 4 bind 3
>
>    filter protocol all pref 49151 matchall chain 0 filter protocol all pref 49151 matchall chain 0 handle 0x1
>      not_in_hw
>            action order 1: gact action drop
>             random type none pass val 0
>             index 1 ref 4 bind 3
>
>    filter protocol all pref 49152 matchall chain 0 filter protocol all pref 49152 matchall chain 0 handle 0x1
>      not_in_hw
>            action order 1: gact action drop
>             random type none pass val 0
>             index 1 ref 4 bind 3
>
> When no index is specified, as before, grab the mutex and allocate
> in the idr the next available id. In this version, as opposed to before,
> it's simplified to store the -EBUSY pointer instead of the previous
> alloc + replace combination.
>
> When an index is specified, rely on rcu to find if there's an object in
> such index. If there's none, fallback to the above, serializing on the
> mutex and reserving the specified id. If there's one, it can be an -EBUSY
> pointer, in which case we just try again until it's an action, or an action.
> Given the rcu guarantees, the action found could be dead and therefore
> we need to bump the refcount if it's not 0, handling the case it's
> in fact 0.
>
> As bind and the action refcount are already atomics, these increments can
> happen without the mutex protection while many tcf_idr_check_alloc race
> to bind to the same action instance.
>
> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
> ---
>  net/sched/act_api.c | 56 +++++++++++++++++++++++++++------------------
>  1 file changed, 34 insertions(+), 22 deletions(-)
>
> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> index abec5c45b5a4..79a044d2ae02 100644
> --- a/net/sched/act_api.c
> +++ b/net/sched/act_api.c
> @@ -824,43 +824,55 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
>  	struct tcf_idrinfo *idrinfo = tn->idrinfo;
>  	struct tc_action *p;
>  	int ret;
> +	u32 max;
>  
> -again:
> -	mutex_lock(&idrinfo->lock);
>  	if (*index) {
> +again:
> +		rcu_read_lock();
>  		p = idr_find(&idrinfo->action_idr, *index);
> +
>  		if (IS_ERR(p)) {
>  			/* This means that another process allocated
>  			 * index but did not assign the pointer yet.
>  			 */
> -			mutex_unlock(&idrinfo->lock);
> +			rcu_read_unlock();
>  			goto again;
>  		}
>  
> -		if (p) {
> -			refcount_inc(&p->tcfa_refcnt);
> -			if (bind)
> -				atomic_inc(&p->tcfa_bindcnt);
> -			*a = p;
> -			ret = 1;
> -		} else {
> -			*a = NULL;
> -			ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
> -					    *index, GFP_KERNEL);
> -			if (!ret)
> -				idr_replace(&idrinfo->action_idr,
> -					    ERR_PTR(-EBUSY), *index);
> +		if (!p) {
> +			/* Empty slot, try to allocate it */
> +			max = *index;
> +			rcu_read_unlock();
> +			goto new;
> +		}
> +
> +		if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
> +			/* Action was deleted in parallel */
> +			rcu_read_unlock();
> +			return -ENOENT;

The current version doesn't return ENOENT since it is synchronous. You are
now introducing basically a change to UAPI since users of this function
(individual actions) are not prepared to retry on ENOENT and will
propagate the error up the call chain. I guess you need to try to create
a new action with the specified index instead.

>  		}
> +
> +		if (bind)
> +			atomic_inc(&p->tcfa_bindcnt);
> +		*a = p;
> +
> +		rcu_read_unlock();
> +
> +		return 1;
>  	} else {
> +		/* Find a slot */
>  		*index = 1;
> -		*a = NULL;
> -		ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
> -				    UINT_MAX, GFP_KERNEL);
> -		if (!ret)
> -			idr_replace(&idrinfo->action_idr, ERR_PTR(-EBUSY),
> -				    *index);
> +		max = UINT_MAX;
>  	}
> +
> +new:
> +	*a = NULL;
> +
> +	mutex_lock(&idrinfo->lock);
> +	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
> +			    GFP_KERNEL);

What if multiple concurrent tasks didn't find the action by index with
rcu and get here, synchronizing on the idrinfo->lock? It looks like
after the one who got the lock first successfully allocates the index,
everyone else will fail (also propagating ENOSPC to the user). I guess
you need some mechanism to account for such a case and retry.

>  	mutex_unlock(&idrinfo->lock);
> +
>  	return ret;
>  }
>  EXPORT_SYMBOL(tcf_idr_check_alloc);
Pedro Tammela Dec. 5, 2023, 8:19 p.m. UTC | #2
On 05/12/2023 15:34, Vlad Buslov wrote:
> On Tue 05 Dec 2023 at 12:30, Pedro Tammela <pctammela@mojatatu.com> wrote:
>> Instead of relying only on the idrinfo->lock mutex for
>> bind/alloc logic, rely on a combination of rcu + mutex + atomics
>> to better scale the case where multiple rtnl-less filters are
>> binding to the same action object.
>>
>> Action binding happens when an action index is specified explicitly and
>> an action exists which such index exists. Example:
> 
> Nit: the first sentence looks mangled, extra 'exists' word and probably
> 'which' should be 'with'.
> 
>>    tc actions add action drop index 1
>>    tc filter add ... matchall action drop index 1
>>    tc filter add ... matchall action drop index 1
>>    tc filter add ... matchall action drop index 1
>>    tc filter ls ...
>>       filter protocol all pref 49150 matchall chain 0 filter protocol all pref 49150 matchall chain 0 handle 0x1
>>       not_in_hw
>>             action order 1: gact action drop
>>              random type none pass val 0
>>              index 1 ref 4 bind 3
>>
>>     filter protocol all pref 49151 matchall chain 0 filter protocol all pref 49151 matchall chain 0 handle 0x1
>>       not_in_hw
>>             action order 1: gact action drop
>>              random type none pass val 0
>>              index 1 ref 4 bind 3
>>
>>     filter protocol all pref 49152 matchall chain 0 filter protocol all pref 49152 matchall chain 0 handle 0x1
>>       not_in_hw
>>             action order 1: gact action drop
>>              random type none pass val 0
>>              index 1 ref 4 bind 3
>>
>> When no index is specified, as before, grab the mutex and allocate
>> in the idr the next available id. In this version, as opposed to before,
>> it's simplified to store the -EBUSY pointer instead of the previous
>> alloc + replace combination.
>>
>> When an index is specified, rely on rcu to find if there's an object in
>> such index. If there's none, fallback to the above, serializing on the
>> mutex and reserving the specified id. If there's one, it can be an -EBUSY
>> pointer, in which case we just try again until it's an action, or an action.
>> Given the rcu guarantees, the action found could be dead and therefore
>> we need to bump the refcount if it's not 0, handling the case it's
>> in fact 0.
>>
>> As bind and the action refcount are already atomics, these increments can
>> happen without the mutex protection while many tcf_idr_check_alloc race
>> to bind to the same action instance.
>>
>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
>> ---
>>   net/sched/act_api.c | 56 +++++++++++++++++++++++++++------------------
>>   1 file changed, 34 insertions(+), 22 deletions(-)
>>
>> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
>> index abec5c45b5a4..79a044d2ae02 100644
>> --- a/net/sched/act_api.c
>> +++ b/net/sched/act_api.c
>> @@ -824,43 +824,55 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
>>   	struct tcf_idrinfo *idrinfo = tn->idrinfo;
>>   	struct tc_action *p;
>>   	int ret;
>> +	u32 max;
>>   
>> -again:
>> -	mutex_lock(&idrinfo->lock);
>>   	if (*index) {
>> +again:
>> +		rcu_read_lock();
>>   		p = idr_find(&idrinfo->action_idr, *index);
>> +
>>   		if (IS_ERR(p)) {
>>   			/* This means that another process allocated
>>   			 * index but did not assign the pointer yet.
>>   			 */
>> -			mutex_unlock(&idrinfo->lock);
>> +			rcu_read_unlock();
>>   			goto again;
>>   		}
>>   
>> -		if (p) {
>> -			refcount_inc(&p->tcfa_refcnt);
>> -			if (bind)
>> -				atomic_inc(&p->tcfa_bindcnt);
>> -			*a = p;
>> -			ret = 1;
>> -		} else {
>> -			*a = NULL;
>> -			ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
>> -					    *index, GFP_KERNEL);
>> -			if (!ret)
>> -				idr_replace(&idrinfo->action_idr,
>> -					    ERR_PTR(-EBUSY), *index);
>> +		if (!p) {
>> +			/* Empty slot, try to allocate it */
>> +			max = *index;
>> +			rcu_read_unlock();
>> +			goto new;
>> +		}
>> +
>> +		if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
>> +			/* Action was deleted in parallel */
>> +			rcu_read_unlock();
>> +			return -ENOENT;
> 
> Current version doesn't return ENOENT since it is synchronous. You are
> now introducing basically a change to UAPI since users of this function
> (individual actions) are not prepared to retry on ENOENT and will
> propagate the error up the call chain. I guess you need to try to create
> a new action with specified index instead.

I see.
So you are saying that in the case where action foo is deleted and a
binding in parallel observes the deleted action, it should fall back to
trying to allocate the index.

We could goto again and hope that idr_find observes the idr index being
freed, in which case it falls back to action allocation; otherwise it
simply goes via the same path as before (jumping to 'again').

I don't see many problems here; it seems to converge in this scenario,
as it eventually turns into a race for action allocation (more below)
if you have an unfortunate delete with many bindings in flight.
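
Roughly, the dead-action branch would then turn into something like this
(illustrative sketch only, not tested):

	if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
		/* Action is being deleted in parallel: retry the lookup.
		 * Once the idr slot is actually freed, idr_find() returns
		 * NULL and we fall through to 'new' to allocate the index.
		 */
		rcu_read_unlock();
		goto again;
	}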

> 
>>   		}
>> +
>> +		if (bind)
>> +			atomic_inc(&p->tcfa_bindcnt);
>> +		*a = p;
>> +
>> +		rcu_read_unlock();
>> +
>> +		return 1;
>>   	} else {
>> +		/* Find a slot */
>>   		*index = 1;
>> -		*a = NULL;
>> -		ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
>> -				    UINT_MAX, GFP_KERNEL);
>> -		if (!ret)
>> -			idr_replace(&idrinfo->action_idr, ERR_PTR(-EBUSY),
>> -				    *index);
>> +		max = UINT_MAX;
>>   	}
>> +
>> +new:
>> +	*a = NULL;
>> +
>> +	mutex_lock(&idrinfo->lock);
>> +	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
>> +			    GFP_KERNEL);
> 
> What if multiple concurrent tasks didn't find the action by index with
> rcu and get here, synchronizing on the idrinfo->lock? It looks like
> after the one who got the lock first successfully allocates the index
> everyone else will fail (also propagating ENOSPACE to the user). 

Correct

> I guess you need some mechanism to account for such case and retry.

Ok, so if I'm binding and a free index is observed, which means "try
to allocate", and I get an ENOSPC after jumping to 'new', try again but
this time bind to the allocated action.

In this scenario, when we come back to 'again' we will wait until -EBUSY
is replaced with the real pointer. That seems like a big enough window
that any race to allocate from a binding would most probably end up in
this contention loop.
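
As a sketch of that retry (illustrative only; the max != UINT_MAX check is
just an assumed way to tell the explicit-index path apart):

	mutex_lock(&idrinfo->lock);
	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
			    GFP_KERNEL);
	mutex_unlock(&idrinfo->lock);

	/* Lost the race for an explicitly requested index: someone else
	 * reserved it first, so go back and try to bind to their action.
	 */
	if (ret == -ENOSPC && max != UINT_MAX)
		goto again;

	return ret;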

However, I think when we have these two retry mechanisms there's an
extremely small window for an infinite loop if an action delete is timed
just right, in between when the action pointer is found and when we grab
the tcfa_refcnt.

	idr_find (pointer)
	tcfa_refcnt (0)  <-------|
	again:                   |
	idr_find (free index!)   |
	new:                     |
	idr_alloc_u32 (ENOSPC)   |
	again:                   |
	idr_find (EBUSY)         |
	again:                   |
	idr_find (pointer)       |
	<evil delete happens>    |
	------->>>>--------------|

Another potential problem is that this will race with non-binding
actions, so the ENOSPC could actually come from another unrelated action.
A practical example would be a race between a binding to an 'action drop
index 1' and an 'action ok' allocation. Actually it's a general problem
and not particular to this case, but it seems like we could be
amplifying it.

I'm conflicted here. If I were to choose one of the two, I would pick
the action respawning, as to me it seems to converge much quicker and
removes the uapi change (my bad! :).

As for usability, if all of a sudden there's a huge influx of ENOSPC
errors because users are abusing 'tc filter add ... action index 1 ...'
in parallel _before_ actually creating the action object, the fix is to just:
tc actions add action index 1 ...
tc filter add ...

As tc has always supported
Vlad Buslov Dec. 6, 2023, 9:52 a.m. UTC | #3
On Tue 05 Dec 2023 at 17:19, Pedro Tammela <pctammela@mojatatu.com> wrote:
> On 05/12/2023 15:34, Vlad Buslov wrote:
>> On Tue 05 Dec 2023 at 12:30, Pedro Tammela <pctammela@mojatatu.com> wrote:
>>> Instead of relying only on the idrinfo->lock mutex for
>>> bind/alloc logic, rely on a combination of rcu + mutex + atomics
>>> to better scale the case where multiple rtnl-less filters are
>>> binding to the same action object.
>>>
>>> Action binding happens when an action index is specified explicitly and
>>> an action exists which such index exists. Example:
>> Nit: the first sentence looks mangled, extra 'exists' word and probably
>> 'which' should be 'with'.
>> 
>>>    tc actions add action drop index 1
>>>    tc filter add ... matchall action drop index 1
>>>    tc filter add ... matchall action drop index 1
>>>    tc filter add ... matchall action drop index 1
>>>    tc filter ls ...
>>>       filter protocol all pref 49150 matchall chain 0 filter protocol all pref 49150 matchall chain 0 handle 0x1
>>>       not_in_hw
>>>             action order 1: gact action drop
>>>              random type none pass val 0
>>>              index 1 ref 4 bind 3
>>>
>>>     filter protocol all pref 49151 matchall chain 0 filter protocol all pref 49151 matchall chain 0 handle 0x1
>>>       not_in_hw
>>>             action order 1: gact action drop
>>>              random type none pass val 0
>>>              index 1 ref 4 bind 3
>>>
>>>     filter protocol all pref 49152 matchall chain 0 filter protocol all pref 49152 matchall chain 0 handle 0x1
>>>       not_in_hw
>>>             action order 1: gact action drop
>>>              random type none pass val 0
>>>              index 1 ref 4 bind 3
>>>
>>> When no index is specified, as before, grab the mutex and allocate
>>> in the idr the next available id. In this version, as opposed to before,
>>> it's simplified to store the -EBUSY pointer instead of the previous
>>> alloc + replace combination.
>>>
>>> When an index is specified, rely on rcu to find if there's an object in
>>> such index. If there's none, fallback to the above, serializing on the
>>> mutex and reserving the specified id. If there's one, it can be an -EBUSY
>>> pointer, in which case we just try again until it's an action, or an action.
>>> Given the rcu guarantees, the action found could be dead and therefore
>>> we need to bump the refcount if it's not 0, handling the case it's
>>> in fact 0.
>>>
>>> As bind and the action refcount are already atomics, these increments can
>>> happen without the mutex protection while many tcf_idr_check_alloc race
>>> to bind to the same action instance.
>>>
>>> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
>>> ---
>>>   net/sched/act_api.c | 56 +++++++++++++++++++++++++++------------------
>>>   1 file changed, 34 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
>>> index abec5c45b5a4..79a044d2ae02 100644
>>> --- a/net/sched/act_api.c
>>> +++ b/net/sched/act_api.c
>>> @@ -824,43 +824,55 @@ int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
>>>   	struct tcf_idrinfo *idrinfo = tn->idrinfo;
>>>   	struct tc_action *p;
>>>   	int ret;
>>> +	u32 max;
>>>
>>> -again:
>>> -	mutex_lock(&idrinfo->lock);
>>>   	if (*index) {
>>> +again:
>>> +		rcu_read_lock();
>>>   		p = idr_find(&idrinfo->action_idr, *index);
>>> +
>>>   		if (IS_ERR(p)) {
>>>   			/* This means that another process allocated
>>>   			 * index but did not assign the pointer yet.
>>>   			 */
>>> -			mutex_unlock(&idrinfo->lock);
>>> +			rcu_read_unlock();
>>>   			goto again;
>>>   		}
>>>
>>> -		if (p) {
>>> -			refcount_inc(&p->tcfa_refcnt);
>>> -			if (bind)
>>> -				atomic_inc(&p->tcfa_bindcnt);
>>> -			*a = p;
>>> -			ret = 1;
>>> -		} else {
>>> -			*a = NULL;
>>> -			ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
>>> -					    *index, GFP_KERNEL);
>>> -			if (!ret)
>>> -				idr_replace(&idrinfo->action_idr,
>>> -					    ERR_PTR(-EBUSY), *index);
>>> +		if (!p) {
>>> +			/* Empty slot, try to allocate it */
>>> +			max = *index;
>>> +			rcu_read_unlock();
>>> +			goto new;
>>> +		}
>>> +
>>> +		if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
>>> +			/* Action was deleted in parallel */
>>> +			rcu_read_unlock();
>>> +			return -ENOENT;
>> Current version doesn't return ENOENT since it is synchronous. You are
>> now introducing basically a change to UAPI since users of this function
>> (individual actions) are not prepared to retry on ENOENT and will
>> propagate the error up the call chain. I guess you need to try to create
>> a new action with specified index instead.
>
> I see.
> So you are saying that in the case where action foo is deleted and a binding in
> parallel observes the deleted action, it should fallback into trying to allocate
> the index.

Correct.

>
> We could goto again and hope that idr_find will observe the idr index being
> freed, in which case it would fall back into action allocation if it does or
> simply go via the same path as before (jumping to 'again').
>
> I don't see much problems here, it seems to converge in this scenario
> as it eventually transforms into race for action allocation (more below) if you
> have an unfortunate delete with many bindings in flight.
>
>> 
>>>   		}
>>> +
>>> +		if (bind)
>>> +			atomic_inc(&p->tcfa_bindcnt);
>>> +		*a = p;
>>> +
>>> +		rcu_read_unlock();
>>> +
>>> +		return 1;
>>>   	} else {
>>> +		/* Find a slot */
>>>   		*index = 1;
>>> -		*a = NULL;
>>> -		ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
>>> -				    UINT_MAX, GFP_KERNEL);
>>> -		if (!ret)
>>> -			idr_replace(&idrinfo->action_idr, ERR_PTR(-EBUSY),
>>> -				    *index);
>>> +		max = UINT_MAX;
>>>   	}
>>> +
>>> +new:
>>> +	*a = NULL;
>>> +
>>> +	mutex_lock(&idrinfo->lock);
>>> +	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
>>> +			    GFP_KERNEL);
>> What if multiple concurrent tasks didn't find the action by index with
>> rcu and get here, synchronizing on the idrinfo->lock? It looks like
>> after the one who got the lock first successfully allocates the index
>> everyone else will fail (also propagating ENOSPACE to the user). 
>
> Correct
>
>> I guess you need some mechanism to account for such case and retry.
>
> Ok, so if I'm binding and it's observed a free index, which means "try to
> allocate" and I get a ENOSPC after jumping to new, try again but this time
> binding into the allocated action.
>
> In this scenario when we come back to 'again' we will wait until -EBUSY is
> replaced with the real pointer. Seems like a big enough window that any race for
> allocating from binding would most probably end up in this contention loop.
>
> However I think when we have these two retry mechanisms there's a extremely
> small window for an infinite loop if an action delete is timed just right, in
> between the action pointer is found and when we grab the tcfa_refcnt.
>
> 	idr_find (pointer)
> 	tcfa_refcnt (0)  <-------|
> 	again:                   |
> 	idr_find (free index!)   |
> 	new:                     |
> 	idr_alloc_u32 (ENOSPC)   |
> 	again:                   |
> 	idr_find (EBUSY)         |
> 	again:                   |
> 	idr_find (pointer)       |
> 	<evil delete happens>    |
> 	------->>>>--------------|

I'm not sure I'm following. Why would this sequence cause infinite loop?

>
> Another potential problem, is that this will race with non binding actions. So
> if the ENOSPC was actually from another unrelated action. A practical example
> would be a race between a binding to an 'action drop index 1' and an 'action ok'
> allocation. Actually it's a general problem and not particular to this case here
> but it seems like we could be amplifying it.
>
> I'm conflicted here. If I were to choose one of the two, I would pick the action
> respawing as to me it seems to converge much quicker and removes the uapi change
> (my bad! :).
>
> As for usability, if all of a sudden there's a huge influx of ENOSPC errors
> because users are abusing 'tc filter add ... action index 1 ...' in parallel
> _before_ actually creating the action object the fix is to just:
> tc actions add action index 1 ...
> tc filter add ...
>
> As tc has always supported
Pedro Tammela Dec. 8, 2023, 9:07 p.m. UTC | #4
On 06/12/2023 06:52, Vlad Buslov wrote:
>> Ok, so if I'm binding and it's observed a free index, which means "try to
>> allocate" and I get a ENOSPC after jumping to new, try again but this time
>> binding into the allocated action.
>>
>> In this scenario when we come back to 'again' we will wait until -EBUSY is
>> replaced with the real pointer. Seems like a big enough window that any race for
>> allocating from binding would most probably end up in this contention loop.
>>
>> However I think when we have these two retry mechanisms there's a extremely
>> small window for an infinite loop if an action delete is timed just right, in
>> between the action pointer is found and when we grab the tcfa_refcnt.
>>
>> 	idr_find (pointer)
>> 	tcfa_refcnt (0)  <-------|
>> 	again:                   |
>> 	idr_find (free index!)   |
>> 	new:                     |
>> 	idr_alloc_u32 (ENOSPC)   |
>> 	again:                   |
>> 	idr_find (EBUSY)         |
>> 	again:                   |
>> 	idr_find (pointer)       |
>> 	<evil delete happens>    |
>> 	------->>>>--------------|
> 
> I'm not sure I'm following. Why would this sequence cause infinite loop?
> 

Perhaps I was being overly paranoid. Taking another look, it seems that
not only would an evil delete be needed, but -EBUSY would also have to sit
in the action idr for a long time. I see it now; it looks like it converges.

I was wondering: instead of looping in 'again:' in either scenario you
presented, what if we return -EAGAIN and let the filter infrastructure
retry it under rtnl_lock()? At least it will give enough breathing room
for a call to schedule() to kick in if needed.
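
For illustration only, the dead-action branch would then look something
like this (assuming the cls/act callers do retry on -EAGAIN):

	if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
		/* Action deleted in parallel: bounce the request back to
		 * the cls/act API so it can retry under rtnl_lock(), giving
		 * schedule() a chance to run in between.
		 */
		rcu_read_unlock();
		return -EAGAIN;
	}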
Vlad Buslov Dec. 11, 2023, 4:10 p.m. UTC | #5
On Fri 08 Dec 2023 at 18:07, Pedro Tammela <pctammela@mojatatu.com> wrote:
> On 06/12/2023 06:52, Vlad Buslov wrote:
>>> Ok, so if I'm binding and it's observed a free index, which means "try to
>>> allocate" and I get a ENOSPC after jumping to new, try again but this time
>>> binding into the allocated action.
>>>
>>> In this scenario when we come back to 'again' we will wait until -EBUSY is
>>> replaced with the real pointer. Seems like a big enough window that any race for
>>> allocating from binding would most probably end up in this contention loop.
>>>
>>> However I think when we have these two retry mechanisms there's a extremely
>>> small window for an infinite loop if an action delete is timed just right, in
>>> between the action pointer is found and when we grab the tcfa_refcnt.
>>>
>>> 	idr_find (pointer)
>>> 	tcfa_refcnt (0)  <-------|
>>> 	again:                   |
>>> 	idr_find (free index!)   |
>>> 	new:                     |
>>> 	idr_alloc_u32 (ENOSPC)   |
>>> 	again:                   |
>>> 	idr_find (EBUSY)         |
>>> 	again:                   |
>>> 	idr_find (pointer)       |
>>> 	<evil delete happens>    |
>>> 	------->>>>--------------|
>> I'm not sure I'm following. Why would this sequence cause infinite loop?
>> 
>
> Perhaps I was being overly paranoid. Taking a look again it seems that not only
> an evil delete but also EBUSY must be in the action idr for a long time. I see
> it now, it looks like it converges.
>
> I was wondering if instead of looping in 'again:' in either scenarios you
> presented, what if we return -EAGAIN and let the filter infrastructure retry it
> under rtnl_lock()? At least will give enough breathing room for a call to
> schedule() to kick in if needed.

Sounds good, but you will need to ensure that both act and cls api
implementations properly retry on EAGAIN (looks like they do, but I only
gave it a cursory glance).

Patch

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index abec5c45b5a4..79a044d2ae02 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -824,43 +824,55 @@  int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
 	struct tcf_idrinfo *idrinfo = tn->idrinfo;
 	struct tc_action *p;
 	int ret;
+	u32 max;
 
-again:
-	mutex_lock(&idrinfo->lock);
 	if (*index) {
+again:
+		rcu_read_lock();
 		p = idr_find(&idrinfo->action_idr, *index);
+
 		if (IS_ERR(p)) {
 			/* This means that another process allocated
 			 * index but did not assign the pointer yet.
 			 */
-			mutex_unlock(&idrinfo->lock);
+			rcu_read_unlock();
 			goto again;
 		}
 
-		if (p) {
-			refcount_inc(&p->tcfa_refcnt);
-			if (bind)
-				atomic_inc(&p->tcfa_bindcnt);
-			*a = p;
-			ret = 1;
-		} else {
-			*a = NULL;
-			ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
-					    *index, GFP_KERNEL);
-			if (!ret)
-				idr_replace(&idrinfo->action_idr,
-					    ERR_PTR(-EBUSY), *index);
+		if (!p) {
+			/* Empty slot, try to allocate it */
+			max = *index;
+			rcu_read_unlock();
+			goto new;
+		}
+
+		if (!refcount_inc_not_zero(&p->tcfa_refcnt)) {
+			/* Action was deleted in parallel */
+			rcu_read_unlock();
+			return -ENOENT;
 		}
+
+		if (bind)
+			atomic_inc(&p->tcfa_bindcnt);
+		*a = p;
+
+		rcu_read_unlock();
+
+		return 1;
 	} else {
+		/* Find a slot */
 		*index = 1;
-		*a = NULL;
-		ret = idr_alloc_u32(&idrinfo->action_idr, NULL, index,
-				    UINT_MAX, GFP_KERNEL);
-		if (!ret)
-			idr_replace(&idrinfo->action_idr, ERR_PTR(-EBUSY),
-				    *index);
+		max = UINT_MAX;
 	}
+
+new:
+	*a = NULL;
+
+	mutex_lock(&idrinfo->lock);
+	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
+			    GFP_KERNEL);
 	mutex_unlock(&idrinfo->lock);
+
 	return ret;
 }
 EXPORT_SYMBOL(tcf_idr_check_alloc);