diff mbox series

[bpf-next,2/3] bpf: Overwrite the element in hash map atomically

Message ID 20250204082848.13471-3-hotforest@gmail.com (mailing list archive)
State New, archived
Delegated to: BPF
Series bpf: Overwrite the htab element atomically

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 13 of 13 maintainers
netdev/build_clang success Errors and warnings before: 2 this patch: 2
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 6 this patch: 6
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 20 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / GCC BPF
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-11 success Logs for aarch64-gcc / veristat-kernel
bpf/vmtest-bpf-next-VM_Test-12 success Logs for aarch64-gcc / veristat-meta
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / GCC BPF
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-18 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for s390x-gcc / veristat-kernel
bpf/vmtest-bpf-next-VM_Test-20 success Logs for s390x-gcc / veristat-meta
bpf/vmtest-bpf-next-VM_Test-21 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17-O2
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-17 / veristat-meta
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-17 / veristat-kernel
bpf/vmtest-bpf-next-VM_Test-44 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18-O2
bpf/vmtest-bpf-next-VM_Test-43 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-50 success Logs for x86_64-llvm-18 / veristat-kernel
bpf/vmtest-bpf-next-VM_Test-51 success Logs for x86_64-llvm-18 / veristat-meta
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-gcc / veristat-kernel / x86_64-gcc veristat_kernel
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-gcc / veristat-meta / x86_64-gcc veristat_meta
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / GCC BPF / GCC BPF
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-42 success Logs for x86_64-llvm-18 / GCC BPF / GCC BPF
bpf/vmtest-bpf-next-VM_Test-45 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-46 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-47 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-48 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-49 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / GCC BPF / GCC BPF

Commit Message

Hou Tao Feb. 4, 2025, 8:28 a.m. UTC
Currently, the update of an existing element in a hash map involves two
steps:
1) insert the new element at the head of the hash list
2) remove the old element

A concurrent lookup operation may therefore fail to find either the old
element or the new element if the lookup starts before the addition and
continues after the removal.

Therefore, replace the two-step update with an atomic replacement. After
the change, the update is atomic from the perspective of the lookup
operation: it will find either the old element or the new element.

Signed-off-by: Hou Tao <hotforest@gmail.com>
---
 kernel/bpf/hashtab.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)
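
For context, the series relies on a new nulls-aware replace helper added
by patch 1/3, which is not shown on this page. A minimal sketch of what
such a primitive could look like, modeled on hlist_replace_rcu() and
assuming the (new, old) argument order used at the call site below:

/* sketch only; the real helper is introduced by patch 1/3 of the series */
static inline void hlist_nulls_replace_rcu(struct hlist_nulls_node *new,
                                           struct hlist_nulls_node *old)
{
        struct hlist_nulls_node *next = old->next;

        new->next = next;
        WRITE_ONCE(new->pprev, old->pprev);
        /* publish: a concurrent reader sees either the old or the new node */
        rcu_assign_pointer(*(struct hlist_nulls_node __rcu **)new->pprev, new);
        if (!is_a_nulls(next))
                WRITE_ONCE(new->next->pprev, &new->next);
        WRITE_ONCE(old->pprev, LIST_POISON2);
}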

Comments

Hou Tao Feb. 5, 2025, 1:38 a.m. UTC | #1
+cc Cody Haas

Sorry for the resend. I sent the reply in the HTML format.

On 2/4/2025 4:28 PM, Hou Tao wrote:
> Currently, the update of existing element in hash map involves two
> steps:
> 1) insert the new element at the head of the hash list
> 2) remove the old element
> 
> It is possible that the concurrent lookup operation may fail to find
> either the old element or the new element if the lookup operation starts
> before the addition and continues after the removal.
> 
> Therefore, replacing the two-step update with an atomic update. After
> the change, the update will be atomic in the perspective of the lookup
> operation: it will either find the old element or the new element.
> 
> Signed-off-by: Hou Tao <hotforest@gmail.com>
> ---
>  kernel/bpf/hashtab.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 4a9eeb7aef85..a28b11ce74c6 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  		goto err;
>  	}
>  
> -	/* add new element to the head of the list, so that
> -	 * concurrent search will find it before old elem
> -	 */
> -	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> -	if (l_old) {
> -		hlist_nulls_del_rcu(&l_old->hash_node);
> +	if (!l_old) {
> +		hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> +	} else {
> +		/* Replace the old element atomically, so that
> +		 * concurrent search will find either the new element or
> +		 * the old element.
> +		 */
> +		hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
>  
>  		/* l_old has already been stashed in htab->extra_elems, free
>  		 * its special fields before it is available for reuse. Also
> 

After thinking about it a second time, the atomic list replacement on
the update side is enough to make the lookup operation always find the
existing element. However, due to the immediate reuse, the lookup may
find an unexpected value. Maybe we should disable the immediate reuse
for specific maps (e.g., htab of maps).
Toke Høiland-Jørgensen Feb. 6, 2025, 3:05 p.m. UTC | #2
Hou Tao <houtao@huaweicloud.com> writes:

> +cc Cody Haas
>
> Sorry for the resend. I sent the reply in the HTML format.
>
> On 2/4/2025 4:28 PM, Hou Tao wrote:
>> Currently, the update of existing element in hash map involves two
>> steps:
>> 1) insert the new element at the head of the hash list
>> 2) remove the old element
>> 
>> It is possible that the concurrent lookup operation may fail to find
>> either the old element or the new element if the lookup operation starts
>> before the addition and continues after the removal.
>> 
>> Therefore, replacing the two-step update with an atomic update. After
>> the change, the update will be atomic in the perspective of the lookup
>> operation: it will either find the old element or the new element.
>> 
>> Signed-off-by: Hou Tao <hotforest@gmail.com>
>> ---
>>  kernel/bpf/hashtab.c | 14 ++++++++------
>>  1 file changed, 8 insertions(+), 6 deletions(-)
>> 
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 4a9eeb7aef85..a28b11ce74c6 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>>  		goto err;
>>  	}
>>  
>> -	/* add new element to the head of the list, so that
>> -	 * concurrent search will find it before old elem
>> -	 */
>> -	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>> -	if (l_old) {
>> -		hlist_nulls_del_rcu(&l_old->hash_node);
>> +	if (!l_old) {
>> +		hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>> +	} else {
>> +		/* Replace the old element atomically, so that
>> +		 * concurrent search will find either the new element or
>> +		 * the old element.
>> +		 */
>> +		hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
>>  
>>  		/* l_old has already been stashed in htab->extra_elems, free
>>  		 * its special fields before it is available for reuse. Also
>> 
>
> After thinking about it the second time, the atomic list replacement on
> the update side is enough to make lookup operation always find the
> existing element. However, due to the immediate reuse, the lookup may
> find an unexpected value. Maybe we should disable the immediate reuse
> for specific map (e.g., htab of maps).

Hmm, in an RCU-protected data structure, reusing the memory before an
RCU grace period has elapsed is just as wrong as freeing it, isn't it?
I.e., the reuse logic should have some kind of call_rcu redirection to
be completely correct?

-Toke
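
For illustration, a minimal sketch of the call_rcu redirection Toke
describes, using hypothetical names (struct elem, elem_pool_put); the
current hashtab code instead returns elements to bpf_mem_alloc right
away rather than deferring reuse like this:

struct elem {
        struct hlist_nulls_node node;
        struct rcu_head rcu;
        /* key and value follow */
};

static void elem_reuse_rcu(struct rcu_head *head)
{
        struct elem *e = container_of(head, struct elem, rcu);

        elem_pool_put(e);       /* hypothetical free-list helper; the memory
                                 * only becomes reusable after a grace period
                                 */
}

static void elem_delete(struct elem *e)
{
        /* unlink first, then defer reuse until pre-existing readers are done */
        hlist_nulls_del_rcu(&e->node);
        call_rcu(&e->rcu, elem_reuse_rcu);
}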
Hou Tao Feb. 8, 2025, 10:16 a.m. UTC | #3
Hi Toke,

On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
> Hou Tao <houtao@huaweicloud.com> writes:
>
>> +cc Cody Haas
>>
>> Sorry for the resend. I sent the reply in the HTML format.
>>
>> On 2/4/2025 4:28 PM, Hou Tao wrote:
>>> Currently, the update of existing element in hash map involves two
>>> steps:
>>> 1) insert the new element at the head of the hash list
>>> 2) remove the old element
>>>
>>> It is possible that the concurrent lookup operation may fail to find
>>> either the old element or the new element if the lookup operation starts
>>> before the addition and continues after the removal.
>>>
>>> Therefore, replacing the two-step update with an atomic update. After
>>> the change, the update will be atomic in the perspective of the lookup
>>> operation: it will either find the old element or the new element.
>>>
>>> Signed-off-by: Hou Tao <hotforest@gmail.com>
>>> ---
>>>  kernel/bpf/hashtab.c | 14 ++++++++------
>>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>> index 4a9eeb7aef85..a28b11ce74c6 100644
>>> --- a/kernel/bpf/hashtab.c
>>> +++ b/kernel/bpf/hashtab.c
>>> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>>>  		goto err;
>>>  	}
>>>  
>>> -	/* add new element to the head of the list, so that
>>> -	 * concurrent search will find it before old elem
>>> -	 */
>>> -	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>> -	if (l_old) {
>>> -		hlist_nulls_del_rcu(&l_old->hash_node);
>>> +	if (!l_old) {
>>> +		hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>> +	} else {
>>> +		/* Replace the old element atomically, so that
>>> +		 * concurrent search will find either the new element or
>>> +		 * the old element.
>>> +		 */
>>> +		hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
>>>  
>>>  		/* l_old has already been stashed in htab->extra_elems, free
>>>  		 * its special fields before it is available for reuse. Also
>>>
>> After thinking about it the second time, the atomic list replacement on
>> the update side is enough to make lookup operation always find the
>> existing element. However, due to the immediate reuse, the lookup may
>> find an unexpected value. Maybe we should disable the immediate reuse
>> for specific map (e.g., htab of maps).
> Hmm, in an RCU-protected data structure, reusing the memory before an
> RCU grace period has elapsed is just as wrong as freeing it, isn't it?
> I.e., the reuse logic should have some kind of call_rcu redirection to
> be completely correct?

Not for all cases. There are SLAB_TYPESAFE_BY_RCU-typed slabs. For hash
maps, the reuse is also tricky (e.g., the goto again case in
lookup_nulls_elem_raw); however, it cannot prevent the lookup procedure
from returning an unexpected value. I had posted a patch set [1] to "fix"
that, but Alexei said it is "a known quirk". Here I am not sure whether
it is reasonable to disable the reuse for htab of maps only. I will post
a v2 for the patch set.

[1]:
https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
> -Toke
Alexei Starovoitov Feb. 26, 2025, 3:24 a.m. UTC | #4
On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi Toke,
>
> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
> > Hou Tao <houtao@huaweicloud.com> writes:
> >
> >> +cc Cody Haas
> >>
> >> Sorry for the resend. I sent the reply in the HTML format.
> >>
> >> On 2/4/2025 4:28 PM, Hou Tao wrote:
> >>> Currently, the update of existing element in hash map involves two
> >>> steps:
> >>> 1) insert the new element at the head of the hash list
> >>> 2) remove the old element
> >>>
> >>> It is possible that the concurrent lookup operation may fail to find
> >>> either the old element or the new element if the lookup operation starts
> >>> before the addition and continues after the removal.
> >>>
> >>> Therefore, replacing the two-step update with an atomic update. After
> >>> the change, the update will be atomic in the perspective of the lookup
> >>> operation: it will either find the old element or the new element.

I'm missing the point.
This "atomic" replacement doesn't really solve anything.
A lookup will see one element.
That element could be deleted by another thread.
The bucket lock and either a two-step update or a single step
don't change anything from the PoV of a bpf prog doing the lookup.

> >>>
> >>> Signed-off-by: Hou Tao <hotforest@gmail.com>
> >>> ---
> >>>  kernel/bpf/hashtab.c | 14 ++++++++------
> >>>  1 file changed, 8 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> >>> index 4a9eeb7aef85..a28b11ce74c6 100644
> >>> --- a/kernel/bpf/hashtab.c
> >>> +++ b/kernel/bpf/hashtab.c
> >>> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
> >>>             goto err;
> >>>     }
> >>>
> >>> -   /* add new element to the head of the list, so that
> >>> -    * concurrent search will find it before old elem
> >>> -    */
> >>> -   hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> >>> -   if (l_old) {
> >>> -           hlist_nulls_del_rcu(&l_old->hash_node);
> >>> +   if (!l_old) {
> >>> +           hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> >>> +   } else {
> >>> +           /* Replace the old element atomically, so that
> >>> +            * concurrent search will find either the new element or
> >>> +            * the old element.
> >>> +            */
> >>> +           hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
> >>>
> >>>             /* l_old has already been stashed in htab->extra_elems, free
> >>>              * its special fields before it is available for reuse. Also
> >>>
> >> After thinking about it the second time, the atomic list replacement on
> >> the update side is enough to make lookup operation always find the
> >> existing element. However, due to the immediate reuse, the lookup may
> >> find an unexpected value. Maybe we should disable the immediate reuse
> >> for specific map (e.g., htab of maps).
> > Hmm, in an RCU-protected data structure, reusing the memory before an
> > RCU grace period has elapsed is just as wrong as freeing it, isn't it?
> > I.e., the reuse logic should have some kind of call_rcu redirection to
> > be completely correct?
>
> Not for all cases. There is SLAB_TYPESAFE_BY_RCU-typed slab. For hash
> map, the reuse is also tricky (e.g., the goto again case in
> lookup_nulls_elem_raw), however it can not prevent the lookup procedure
> from returning unexpected value. I had post a patch set [1] to "fix"
> that, but Alexei said it is "a known quirk". Here I am not sure about
> whether it is reasonable to disable the reuse for htab of maps only. I
> will post a v2 for the patch set.
>
> [1]:
> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/

yes. we still have to keep prealloc as default for now :(
Eventually, once bpf_mem_alloc is replaced with a fully re-entrant
and safe kmalloc, we can do a fully re-entrant and safe
kfree_rcu. Then we can talk about closing this quirk.
Until then the prog has to deal with immediate reuse.
That has been the case for a decade already.
Hou Tao Feb. 26, 2025, 4:05 a.m. UTC | #5
Hi,

On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi Toke,
>>
>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>
>>>> +cc Cody Haas
>>>>
>>>> Sorry for the resend. I sent the reply in the HTML format.
>>>>
>>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
>>>>> Currently, the update of existing element in hash map involves two
>>>>> steps:
>>>>> 1) insert the new element at the head of the hash list
>>>>> 2) remove the old element
>>>>>
>>>>> It is possible that the concurrent lookup operation may fail to find
>>>>> either the old element or the new element if the lookup operation starts
>>>>> before the addition and continues after the removal.
>>>>>
>>>>> Therefore, replacing the two-step update with an atomic update. After
>>>>> the change, the update will be atomic in the perspective of the lookup
>>>>> operation: it will either find the old element or the new element.
> I'm missing the point.
> This "atomic" replacement doesn't really solve anything.
> lookup will see one element.
> That element could be deleted by another thread.
> bucket lock and either two step update or single step
> don't change anything from the pov of bpf prog doing lookup.

The point is that overwriting an existing element may lead concurrent
lookups to return ENOENT, as demonstrated by the added selftest, and the
patch tried to "fix" that. However, it seems using
hlist_nulls_replace_rcu() for the overwriting update is still not
enough: when the lookup procedure finds the old element, the old element
may already be being reused, the comparison of the map key may fail, and
the lookup procedure may still return ENOENT.
>
>>>>> Signed-off-by: Hou Tao <hotforest@gmail.com>
>>>>> ---
>>>>>  kernel/bpf/hashtab.c | 14 ++++++++------
>>>>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>>> index 4a9eeb7aef85..a28b11ce74c6 100644
>>>>> --- a/kernel/bpf/hashtab.c
>>>>> +++ b/kernel/bpf/hashtab.c
>>>>> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>>>>>             goto err;
>>>>>     }
>>>>>
>>>>> -   /* add new element to the head of the list, so that
>>>>> -    * concurrent search will find it before old elem
>>>>> -    */
>>>>> -   hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>>>> -   if (l_old) {
>>>>> -           hlist_nulls_del_rcu(&l_old->hash_node);
>>>>> +   if (!l_old) {
>>>>> +           hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>>>> +   } else {
>>>>> +           /* Replace the old element atomically, so that
>>>>> +            * concurrent search will find either the new element or
>>>>> +            * the old element.
>>>>> +            */
>>>>> +           hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
>>>>>
>>>>>             /* l_old has already been stashed in htab->extra_elems, free
>>>>>              * its special fields before it is available for reuse. Also
>>>>>
>>>> After thinking about it the second time, the atomic list replacement on
>>>> the update side is enough to make lookup operation always find the
>>>> existing element. However, due to the immediate reuse, the lookup may
>>>> find an unexpected value. Maybe we should disable the immediate reuse
>>>> for specific map (e.g., htab of maps).
>>> Hmm, in an RCU-protected data structure, reusing the memory before an
>>> RCU grace period has elapsed is just as wrong as freeing it, isn't it?
>>> I.e., the reuse logic should have some kind of call_rcu redirection to
>>> be completely correct?
>> Not for all cases. There is SLAB_TYPESAFE_BY_RCU-typed slab. For hash
>> map, the reuse is also tricky (e.g., the goto again case in
>> lookup_nulls_elem_raw), however it can not prevent the lookup procedure
>> from returning unexpected value. I had post a patch set [1] to "fix"
>> that, but Alexei said it is "a known quirk". Here I am not sure about
>> whether it is reasonable to disable the reuse for htab of maps only. I
>> will post a v2 for the patch set.
>>
>> [1]:
>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
> yes. we still have to keep prealloc as default for now :(
> Eventually bpf_mem_alloc is replaced with fully re-entrant
> and safe kmalloc, then we can do fully re-entrant and safe
> kfree_rcu. Then we can talk about closing this quirk.
> Until then the prog has to deal with immediate reuse.
> That was the case for a decade already.

I see. In v2 I will fall back to the original idea: adding a standalone
update procedure for htab of maps which atomically overwrites the
map_ptr, just like array of maps does.
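
As a rough illustration of that idea (not the actual v2 code), the
in-place overwrite could mirror what fd array maps do with xchg(); the
function name, the value offset and the map_fd_put_ptr() signature below
are assumptions:

/* sketch: called with the bucket lock held when the key already exists */
static long fd_htab_map_overwrite_in_place(struct bpf_map *map,
                                           struct htab_elem *l_old,
                                           void *new_inner_map)
{
        /* assume the value is stored right after the round_up()-ed key */
        void **inner = (void **)(l_old->key + round_up(map->key_size, 8));
        void *old_inner_map;

        old_inner_map = xchg(inner, new_inner_map);
        if (old_inner_map)
                map->ops->map_fd_put_ptr(map, old_inner_map, true);
        return 0;
}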
Alexei Starovoitov Feb. 26, 2025, 5:42 a.m. UTC | #6
On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
> > On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi Toke,
> >>
> >> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
> >>> Hou Tao <houtao@huaweicloud.com> writes:
> >>>
> >>>> +cc Cody Haas
> >>>>
> >>>> Sorry for the resend. I sent the reply in the HTML format.
> >>>>
> >>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
> >>>>> Currently, the update of existing element in hash map involves two
> >>>>> steps:
> >>>>> 1) insert the new element at the head of the hash list
> >>>>> 2) remove the old element
> >>>>>
> >>>>> It is possible that the concurrent lookup operation may fail to find
> >>>>> either the old element or the new element if the lookup operation starts
> >>>>> before the addition and continues after the removal.
> >>>>>
> >>>>> Therefore, replacing the two-step update with an atomic update. After
> >>>>> the change, the update will be atomic in the perspective of the lookup
> >>>>> operation: it will either find the old element or the new element.
> > I'm missing the point.
> > This "atomic" replacement doesn't really solve anything.
> > lookup will see one element.
> > That element could be deleted by another thread.
> > bucket lock and either two step update or single step
> > don't change anything from the pov of bpf prog doing lookup.
>
> The point is that overwriting an existed element may lead to concurrent
> lookups return ENOENT as demonstrated by the added selftest and the
> patch tried to "fix" that. However, it seems using
> hlist_nulls_replace_rcu() for the overwriting update is still not
> enough. Because when the lookup procedure found the old element, the old
> element may be reusing, the comparison of the map key may fail, and the
> lookup procedure may still return ENOENT.

you mean l_old == l_new ? I don't think it's possible
within htab_map_update_elem(),
but htab_map_delete_elem() doing hlist_nulls_del_rcu()
then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu,
and the just-deleted elem is reused and inserted
into another bucket.

I'm not sure whether this new hlist_nulls_replace_rcu()
primitive works with nulls logic.

So back to the problem statement..
Are you saying that adding new to a head while lookup is in the middle
is causing it to miss an element that
is supposed to be updated assuming atomicity of the update?
While now update_elem() is more like a sequence of delete + insert?

Hmm.

> I see. In v2 I will fallback to the original idea: adding a standalone
> update procedure for htab of maps in which it will atomically overwrite
> the map_ptr just like array of maps does.

hold on. is this only for hash-of-maps ?
How did that non-atomic update manifest in real production?
Zvi Effron Feb. 26, 2025, 11:17 p.m. UTC | #7
On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
> > > On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > >> Hi Toke,
> > >>
> > >> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
> > >>> Hou Tao <houtao@huaweicloud.com> writes:
> > >>>
> > >>>> +cc Cody Haas
> > >>>>
> > >>>> Sorry for the resend. I sent the reply in the HTML format.
> > >>>>
> > >>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
> > >>>>> Currently, the update of existing element in hash map involves two
> > >>>>> steps:
> > >>>>> 1) insert the new element at the head of the hash list
> > >>>>> 2) remove the old element
> > >>>>>
> > >>>>> It is possible that the concurrent lookup operation may fail to find
> > >>>>> either the old element or the new element if the lookup operation starts
> > >>>>> before the addition and continues after the removal.
> > >>>>>
> > >>>>> Therefore, replacing the two-step update with an atomic update. After
> > >>>>> the change, the update will be atomic in the perspective of the lookup
> > >>>>> operation: it will either find the old element or the new element.
> > > I'm missing the point.
> > > This "atomic" replacement doesn't really solve anything.
> > > lookup will see one element.
> > > That element could be deleted by another thread.
> > > bucket lock and either two step update or single step
> > > don't change anything from the pov of bpf prog doing lookup.
> >
> > The point is that overwriting an existed element may lead to concurrent
> > lookups return ENOENT as demonstrated by the added selftest and the
> > patch tried to "fix" that. However, it seems using
> > hlist_nulls_replace_rcu() for the overwriting update is still not
> > enough. Because when the lookup procedure found the old element, the old
> > element may be reusing, the comparison of the map key may fail, and the
> > lookup procedure may still return ENOENT.
>
> you mean l_old == l_new ? I don't think it's possible
> within htab_map_update_elem(),
> but htab_map_delete_elem() doing hlist_nulls_del_rcu()
> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu
> and just deleted elem is reused to be inserted
> into another bucket.
>
> I'm not sure whether this new hlist_nulls_replace_rcu()
> primitive works with nulls logic.
>
> So back to the problem statement..
> Are you saying that adding new to a head while lookup is in the middle
> is causing it to miss an element that
> is supposed to be updated assuming atomicity of the update?
> While now update_elem() is more like a sequence of delete + insert?
>
> Hmm.

Yes, exactly that. Because update_elem is actually a delete + insert (actually
an insert + delete, I think?), it is possible for a concurrent lookup to see no
element instead of either the old or new value.

>
> > I see. In v2 I will fallback to the original idea: adding a standalone
> > update procedure for htab of maps in which it will atomically overwrite
> > the map_ptr just like array of maps does.
>
> hold on. is this only for hash-of-maps ?

I believe this was also replicated for hash as well as hash-of-maps. Cody can
confirm, or use the reproducer he has to test that case.

> How that non-atomic update manifested in real production?
>

See [1] (in the cover letter for this series, also replicated below).

[1] : https://lore.kernel.org/xdp-newbies/07a365d8-2e66-2899-4298-b8b158a928fa@huaweicloud.com/T/#m06fcd687c6cfdbd0f9b643b227e69b479fc8c2f6
Hou Tao Feb. 27, 2025, 1:48 a.m. UTC | #8
Hi,

On 2/27/2025 7:17 AM, Zvi Effron wrote:
> On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>> Hi,
>>>
>>> On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
>>>> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>> Hi Toke,
>>>>>
>>>>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
>>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>>
>>>>>>> +cc Cody Haas
>>>>>>>
>>>>>>> Sorry for the resend. I sent the reply in the HTML format.
>>>>>>>
>>>>>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
>>>>>>>> Currently, the update of existing element in hash map involves two
>>>>>>>> steps:
>>>>>>>> 1) insert the new element at the head of the hash list
>>>>>>>> 2) remove the old element
>>>>>>>>
>>>>>>>> It is possible that the concurrent lookup operation may fail to find
>>>>>>>> either the old element or the new element if the lookup operation starts
>>>>>>>> before the addition and continues after the removal.
>>>>>>>>
>>>>>>>> Therefore, replacing the two-step update with an atomic update. After
>>>>>>>> the change, the update will be atomic in the perspective of the lookup
>>>>>>>> operation: it will either find the old element or the new element.
>>>> I'm missing the point.
>>>> This "atomic" replacement doesn't really solve anything.
>>>> lookup will see one element.
>>>> That element could be deleted by another thread.
>>>> bucket lock and either two step update or single step
>>>> don't change anything from the pov of bpf prog doing lookup.
>>> The point is that overwriting an existed element may lead to concurrent
>>> lookups return ENOENT as demonstrated by the added selftest and the
>>> patch tried to "fix" that. However, it seems using
>>> hlist_nulls_replace_rcu() for the overwriting update is still not
>>> enough. Because when the lookup procedure found the old element, the old
>>> element may be reusing, the comparison of the map key may fail, and the
>>> lookup procedure may still return ENOENT.
>> you mean l_old == l_new ? I don't think it's possible
>> within htab_map_update_elem(),
>> but htab_map_delete_elem() doing hlist_nulls_del_rcu()
>> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu
>> and just deleted elem is reused to be inserted
>> into another bucket.

No. I mean the following interleaving, in which the lookup procedure
finds the old element and tries to match its key, one update procedure
has already deleted the old element, and another update procedure is
reusing the old element:

lookup procedure A
A: find the old element (instead of the new element)

              update procedure B
              B: delete the old element
              update procedure C on the same CPU:
              C: reuse the old element (overwrite its key and insert it
                 into the same bucket)

A: the key mismatches, so -ENOENT is returned.

It can be reproduced by increasing ctx.loop in the selftest.
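
(The selftest itself is not part of this mail. For illustration only, a
user-space sketch of the same kind of race; the map setup and the thread
bodies below are made up and are not the actual selftest code.)

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <bpf/bpf.h>

static int map_fd;
static volatile int stop;

static void *updater(void *arg)
{
        __u32 key = 0;
        __u64 val = 0;

        while (!stop)   /* keep overwriting the same, existing key */
                bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);
        return NULL;
}

static void *looker(void *arg)
{
        __u32 key = 0;
        __u64 val;

        while (!stop)   /* the key always exists, so any miss shows the race */
                if (bpf_map_lookup_elem(map_fd, &key, &val))
                        fprintf(stderr, "lookup missed an existing key\n");
        return NULL;
}

int main(void)
{
        __u32 key = 0;
        __u64 val = 0;
        pthread_t t1, t2;

        map_fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, sizeof(key),
                                sizeof(val), 1, NULL);
        bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);

        pthread_create(&t1, NULL, updater, NULL);
        pthread_create(&t2, NULL, looker, NULL);
        sleep(5);
        stop = 1;
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}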
>>
>> I'm not sure whether this new hlist_nulls_replace_rcu()
>> primitive works with nulls logic.
>>
>> So back to the problem statement..
>> Are you saying that adding new to a head while lookup is in the middle
>> is causing it to miss an element that
>> is supposed to be updated assuming atomicity of the update?
>> While now update_elem() is more like a sequence of delete + insert?
>>
>> Hmm.
> Yes, exactly that. Because update_elem is actually a delete + insert (actually
> an insert + delete, I think?), it is possible for a concurrent lookup to see no
> element instead of either the old or new value.

Yep.
>
>>> I see. In v2 I will fallback to the original idea: adding a standalone
>>> update procedure for htab of maps in which it will atomically overwrite
>>> the map_ptr just like array of maps does.
>> hold on. is this only for hash-of-maps ?
> I believe this was also replicated for hash as well as hash-of-maps. Cody can
> confirm, or use the reproducer he has to test that case.

The fix for hash-of-maps will be much simpler and the returned map
pointer will be correct. For other types of hash map, besides switching
to hlist_nulls_replace_rcu(), I think we also need to detect the reuse
of an element during the loop: if the element has been reused, the
lookup should restart from the head again. However, even with the above
two fixes, the value returned by the lookup procedure for other types of
hash map may still be incorrect (as shown a long time ago [1]), so I
think that for other types of map the atomic update doesn't matter too
much.

[1]:
https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
>
>> How that non-atomic update manifested in real production?
>>
> See [1] (in the cover letter for this series, also replicated below).
>
> [1] : https://lore.kernel.org/xdp-newbies/07a365d8-2e66-2899-4298-b8b158a928fa@huaweicloud.com/T/#m06fcd687c6cfdbd0f9b643b227e69b479fc8c2f6
Alexei Starovoitov Feb. 27, 2025, 1:59 a.m. UTC | #9
On Wed, Feb 26, 2025 at 5:48 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/27/2025 7:17 AM, Zvi Effron wrote:
> > On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>> Hi,
> >>>
> >>> On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
> >>>> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>>> Hi Toke,
> >>>>>
> >>>>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
> >>>>>> Hou Tao <houtao@huaweicloud.com> writes:
> >>>>>>
> >>>>>>> +cc Cody Haas
> >>>>>>>
> >>>>>>> Sorry for the resend. I sent the reply in the HTML format.
> >>>>>>>
> >>>>>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
> >>>>>>>> Currently, the update of existing element in hash map involves two
> >>>>>>>> steps:
> >>>>>>>> 1) insert the new element at the head of the hash list
> >>>>>>>> 2) remove the old element
> >>>>>>>>
> >>>>>>>> It is possible that the concurrent lookup operation may fail to find
> >>>>>>>> either the old element or the new element if the lookup operation starts
> >>>>>>>> before the addition and continues after the removal.
> >>>>>>>>
> >>>>>>>> Therefore, replacing the two-step update with an atomic update. After
> >>>>>>>> the change, the update will be atomic in the perspective of the lookup
> >>>>>>>> operation: it will either find the old element or the new element.
> >>>> I'm missing the point.
> >>>> This "atomic" replacement doesn't really solve anything.
> >>>> lookup will see one element.
> >>>> That element could be deleted by another thread.
> >>>> bucket lock and either two step update or single step
> >>>> don't change anything from the pov of bpf prog doing lookup.
> >>> The point is that overwriting an existed element may lead to concurrent
> >>> lookups return ENOENT as demonstrated by the added selftest and the
> >>> patch tried to "fix" that. However, it seems using
> >>> hlist_nulls_replace_rcu() for the overwriting update is still not
> >>> enough. Because when the lookup procedure found the old element, the old
> >>> element may be reusing, the comparison of the map key may fail, and the
> >>> lookup procedure may still return ENOENT.
> >> you mean l_old == l_new ? I don't think it's possible
> >> within htab_map_update_elem(),
> >> but htab_map_delete_elem() doing hlist_nulls_del_rcu()
> >> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu
> >> and just deleted elem is reused to be inserted
> >> into another bucket.
>
> No. I mean the following procedure in which the lookup procedure finds
> the old element and tries to match its key, one update procedure has
> already deleted the old element, and another update procedure is reusing
> the old element:
>
> lookup procedure A
> A: find the old element (instead of the new old)
>
>               update procedure B
>               B: delete the old element
>               update procedure C on the same CPU:
>               C: reuse the old element (overwrite its key and insert in
> the same bucket)
>
> A: the key is mismatched and return -ENOENT.

This is fine. It's just normal reuse.
Orthogonal to 'update as insert+delete' issue.

> It can be reproduced by increasing ctx.loop in the selftest.
> >>
> >> I'm not sure whether this new hlist_nulls_replace_rcu()
> >> primitive works with nulls logic.
> >>
> >> So back to the problem statement..
> >> Are you saying that adding new to a head while lookup is in the middle
> >> is causing it to miss an element that
> >> is supposed to be updated assuming atomicity of the update?
> >> While now update_elem() is more like a sequence of delete + insert?
> >>
> >> Hmm.
> > Yes, exactly that. Because update_elem is actually a delete + insert (actually
> > an insert + delete, I think?), it is possible for a concurrent lookup to see no
> > element instead of either the old or new value.
>
> Yep.
> >
> >>> I see. In v2 I will fallback to the original idea: adding a standalone
> >>> update procedure for htab of maps in which it will atomically overwrite
> >>> the map_ptr just like array of maps does.
> >> hold on. is this only for hash-of-maps ?
> > I believe this was also replicated for hash as well as hash-of-maps. Cody can
> > confirm, or use the reproducer he has to test that case.
>
> The fix for hash-of-maps will be much simpler and the returned map
> pointer will be correct. For other types of hash map, beside switching
> to hlist_nulls_replace_rcu(),

It's been a long time since I looked into rcu_nulls details.
Pls help me understand why this new replace_rcu_nulls()
is correct from the nulls PoV.
If it is, then this patch set may be the right answer to the non-atomic update.

And for the future, please please focus on "why" part in
the cover letter and commit logs instead of "what".

Since the only thing I got from the log was:
"Currently, the update is not atomic
because the overwrite of existing element happens in a two-steps way,
but the support of atomic update is feasible".

"is feasible" doesn't explain "why".

Link to xdp-newbie question is nice for additional context,
but reviewers should not need to go and read some thread somewhere
to understand "why" part.
All of it should be in the commit log.

> map may still be incorrect (as shown long time ago [1]), so I think
> maybe for other types of map, the atomic update doesn't matter too much.
>
> [1]:
> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/

A thread from 3 years ago ?! Sorry, it's not helpful to ask
people to page-in such an old context with lots of follow ups
that may or may not be relevant today.
Hou Tao Feb. 27, 2025, 2:43 a.m. UTC | #10
Hi,

On 2/27/2025 9:59 AM, Alexei Starovoitov wrote:
> On Wed, Feb 26, 2025 at 5:48 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 2/27/2025 7:17 AM, Zvi Effron wrote:
>>> On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>> Hi,
>>>>>
>>>>> On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
>>>>>> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>>>> Hi Toke,
>>>>>>>
>>>>>>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
>>>>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>>>>
>>>>>>>>> +cc Cody Haas
>>>>>>>>>
>>>>>>>>> Sorry for the resend. I sent the reply in the HTML format.
>>>>>>>>>
>>>>>>>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
>>>>>>>>>> Currently, the update of existing element in hash map involves two
>>>>>>>>>> steps:
>>>>>>>>>> 1) insert the new element at the head of the hash list
>>>>>>>>>> 2) remove the old element
>>>>>>>>>>
>>>>>>>>>> It is possible that the concurrent lookup operation may fail to find
>>>>>>>>>> either the old element or the new element if the lookup operation starts
>>>>>>>>>> before the addition and continues after the removal.
>>>>>>>>>>
>>>>>>>>>> Therefore, replacing the two-step update with an atomic update. After
>>>>>>>>>> the change, the update will be atomic in the perspective of the lookup
>>>>>>>>>> operation: it will either find the old element or the new element.
>>>>>> I'm missing the point.
>>>>>> This "atomic" replacement doesn't really solve anything.
>>>>>> lookup will see one element.
>>>>>> That element could be deleted by another thread.
>>>>>> bucket lock and either two step update or single step
>>>>>> don't change anything from the pov of bpf prog doing lookup.
>>>>> The point is that overwriting an existed element may lead to concurrent
>>>>> lookups return ENOENT as demonstrated by the added selftest and the
>>>>> patch tried to "fix" that. However, it seems using
>>>>> hlist_nulls_replace_rcu() for the overwriting update is still not
>>>>> enough. Because when the lookup procedure found the old element, the old
>>>>> element may be reusing, the comparison of the map key may fail, and the
>>>>> lookup procedure may still return ENOENT.
>>>> you mean l_old == l_new ? I don't think it's possible
>>>> within htab_map_update_elem(),
>>>> but htab_map_delete_elem() doing hlist_nulls_del_rcu()
>>>> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu
>>>> and just deleted elem is reused to be inserted
>>>> into another bucket.
>> No. I mean the following procedure in which the lookup procedure finds
>> the old element and tries to match its key, one update procedure has
>> already deleted the old element, and another update procedure is reusing
>> the old element:
>>
>> lookup procedure A
>> A: find the old element (instead of the new old)
>>
>>               update procedure B
>>               B: delete the old element
>>               update procedure C on the same CPU:
>>               C: reuse the old element (overwrite its key and insert in
>> the same bucket)
>>
>> A: the key is mismatched and return -ENOENT.
> This is fine. It's just normal reuse.
> Orthogonal to 'update as insert+delete' issue.

OK. However, it will break the lookup procedure because the caller
expects it to return a valid result instead of -ENOENT.
>
>> It can be reproduced by increasing ctx.loop in the selftest.
>>>> I'm not sure whether this new hlist_nulls_replace_rcu()
>>>> primitive works with nulls logic.
>>>>
>>>> So back to the problem statement..
>>>> Are you saying that adding new to a head while lookup is in the middle
>>>> is causing it to miss an element that
>>>> is supposed to be updated assuming atomicity of the update?
>>>> While now update_elem() is more like a sequence of delete + insert?
>>>>
>>>> Hmm.
>>> Yes, exactly that. Because update_elem is actually a delete + insert (actually
>>> an insert + delete, I think?), it is possible for a concurrent lookup to see no
>>> element instead of either the old or new value.
>> Yep.
>>>>> I see. In v2 I will fallback to the original idea: adding a standalone
>>>>> update procedure for htab of maps in which it will atomically overwrite
>>>>> the map_ptr just like array of maps does.
>>>> hold on. is this only for hash-of-maps ?
>>> I believe this was also replicated for hash as well as hash-of-maps. Cody can
>>> confirm, or use the reproducer he has to test that case.
>> The fix for hash-of-maps will be much simpler and the returned map
>> pointer will be correct. For other types of hash map, beside switching
>> to hlist_nulls_replace_rcu(),
> It's been a long time since I looked into rcu_nulls details.
> Pls help me understand that this new replace_rcu_nulls()
> is correct from nulls pov,
> If it is then this patch set may be the right answer to non-atomic update.

If I understand correctly, only the manipulations of the ->first and
->next pointers need to take care of the nulls pointer.
>
> And for the future, please please focus on "why" part in
> the cover letter and commit logs instead of "what".
>
> Since the only thing I got from the log was:
> "Currently, the update is not atomic
> because the overwrite of existing element happens in a two-steps way,
> but the support of atomic update is feasible".
>
> "is feasible" doesn't explain "why".
>
> Link to xdp-newbie question is nice for additional context,
> but reviewers should not need to go and read some thread somewhere
> to understand "why" part.
> All of it should be in the commit log.

OK. My original thought was that this is a reported problem, so an extra
link would be enough. Will try to add more context next time.
>
>> map may still be incorrect (as shown long time ago [1]), so I think
>> maybe for other types of map, the atomic update doesn't matter too much.
>>
>> [1]:
>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
> A thread from 3 years ago ?! Sorry, it's not helpful to ask
> people to page-in such an old context with lots of follow ups
> that may or may not be relevant today.
Let me reuse part of the diagram above to explain how the lookup
procedure may return an incorrect value:

lookup procedure A
A: find the old element (instead of the new element)

              update procedure B
              B: delete the old element
              update procedure C on the same CPU:

A: the key matches, so return the value in the element

              C: reuse the old element (overwrite its key and value)

A: read the value (it is incorrect, because the element has been reused
   and overwritten)
Alexei Starovoitov Feb. 27, 2025, 3:17 a.m. UTC | #11
On Wed, Feb 26, 2025 at 6:43 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> >>
> >> lookup procedure A
> >> A: find the old element (instead of the new old)
> >>
> >>               update procedure B
> >>               B: delete the old element
> >>               update procedure C on the same CPU:
> >>               C: reuse the old element (overwrite its key and insert in
> >> the same bucket)
> >>
> >> A: the key is mismatched and return -ENOENT.
> > This is fine. It's just normal reuse.
> > Orthogonal to 'update as insert+delete' issue.
>
> OK. However, it will break the lookup procedure because it expects it
> will return an valid result instead of -ENOENT.

What do you mean 'breaks the lookup' ?
lookup_elem_raw() matches the hash and then memcmp()s the key;
if the element is reused anything can happen.
Either it succeeds in memcmp() and returns an elem,
or it miscompares in memcmp().
Both are expected, because elems are reused in place.

And this behavior is expected and not broken,
because a bpf prog that does a lookup on one cpu and deletes
the same element on the other cpu is asking for trouble.
bpf infra guarantees the safety of the kernel.
It doesn't guarantee that bpf progs are sane.

> > It's been a long time since I looked into rcu_nulls details.
> > Pls help me understand that this new replace_rcu_nulls()
> > is correct from nulls pov,
> > If it is then this patch set may be the right answer to non-atomic update.
>
> If I understand correctly, only the manipulations of ->first pointer and
> ->next pointer need to take care of nulls pointer.

hmm. I feel we're still talking past each other.
See if (get_nulls_value() == ...) in lookup_nulls_elem_raw().
It's there because of reuse. The lookup can start in one bucket
and finish in another.
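
For reference, that lookup loop looks roughly like this (paraphrased
from kernel/bpf/hashtab.c; details may differ):

static struct htab_elem *lookup_nulls_elem_raw(struct hlist_nulls_head *head,
                                               u32 hash, void *key,
                                               u32 key_size, u32 n_buckets)
{
        struct hlist_nulls_node *n;
        struct htab_elem *l;

again:
        hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
                if (l->hash == hash && !memcmp(&l->key, key, key_size))
                        return l;

        /* The nulls value encodes the bucket index: if an element was reused
         * and moved to another bucket while we walked the list, the walk ends
         * on the wrong nulls marker and the search restarts from the head.
         */
        if (get_nulls_value(n) != (hash & (n_buckets - 1)))
                goto again;

        return NULL;
}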

> >
> > And for the future, please please focus on "why" part in
> > the cover letter and commit logs instead of "what".
> >
> > Since the only thing I got from the log was:
> > "Currently, the update is not atomic
> > because the overwrite of existing element happens in a two-steps way,
> > but the support of atomic update is feasible".
> >
> > "is feasible" doesn't explain "why".
> >
> > Link to xdp-newbie question is nice for additional context,
> > but reviewers should not need to go and read some thread somewhere
> > to understand "why" part.
> > All of it should be in the commit log.
>
> OK. My original thought is that is a reported problem, so an extra link
> will be enough. Will try to add more context next time.
> >
> >> map may still be incorrect (as shown long time ago [1]), so I think
> >> maybe for other types of map, the atomic update doesn't matter too much.
> >>
> >> [1]:
> >> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
> > A thread from 3 years ago ?! Sorry, it's not helpful to ask
> > people to page-in such an old context with lots of follow ups
> > that may or may not be relevant today.
> Let me reuse part of the diagram above to explain how does the lookup
> procedure may return a incorrect value:
>
> lookup procedure A
> A: find the old element (instead of the new element)
>
>
>               update procedure B
>               B: delete the old element
>               update procedure C on the same CPU:
>
>
> A: the key is matched and return the value in the element
>
>               C: reuse the old element (overwrite its key and value)
>
> A: read the value (it is incorrect, because it has been reused and
> overwritten)

... and it's fine. It's by design. It's the element reuse behavior.

Long ago the hashmap had two modes: prealloc (default) and
NO_PREALLOC (call_rcu + kfree).

The call_rcu part was there to make things safe.
The memory cannot be kfree-ed to the kernel until an RCU GP.
With bpf_mem_alloc, hashmap elements are freed back to bpf_ma
right away. The hashmap is doing bpf_mem_cache_free()
(instead of bpf_mem_cache_free_rcu()) because users need speed.
So since 2022 both prealloc and no_prealloc reuse elements.
We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP
that will use the _rcu() flavor of freeing into bpf_ma,
but it has to have a strong reason.
And as soon as we add it, the default with prealloc would need
to use call_rcu() too, right?
And that becomes a nightmare, since a bpf prog can easily DoS the system.
Even if we use bpf_mem_cache_free_rcu() only, the DoS is a concern.
Unlike new things like bpf_obj_new/obj_drop, the hashmap
is unpriv, so the concerns are drastically different.
Hou Tao Feb. 27, 2025, 4:08 a.m. UTC | #12
Hi,

On 2/27/2025 11:17 AM, Alexei Starovoitov wrote:
> On Wed, Feb 26, 2025 at 6:43 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> lookup procedure A
>>>> A: find the old element (instead of the new old)
>>>>
>>>>               update procedure B
>>>>               B: delete the old element
>>>>               update procedure C on the same CPU:
>>>>               C: reuse the old element (overwrite its key and insert in
>>>> the same bucket)
>>>>
>>>> A: the key is mismatched and return -ENOENT.
>>> This is fine. It's just normal reuse.
>>> Orthogonal to 'update as insert+delete' issue.
>> OK. However, it will break the lookup procedure because it expects it
>> will return an valid result instead of -ENOENT.
> What do you mean 'breaks the lookup' ?
> lookup_elem_raw() matches hash, and then it memcmp(key),
> if the element is reused anything can happen.
> Either it succeeds in memcmp() and returns an elem,
> or miscompares in memcmp().
> Both are expected, because elems are reused in place.
>
> And this behavior is expected and not-broken,
> because bpf prog that does lookup on one cpu and deletes
> the same element on the other cpu is asking for trouble.

It seems I misled you in the above example. It is also possible to do a
lookup on one CPU while updating the same element on another CPU. Does
such concurrency also look insane?

lookup procedure A
A: find the old element (instead of the new element)

              update procedure B
              B: overwrite the same element and free the old element
              update procedure C on the same CPU:
              C: reuse the old element (overwrite its key and insert it
                 into the same bucket)

A: the key mismatches, so -ENOENT is returned.

> bpf infra guarantees the safety of the kernel.
> It doesn't guarantee that bpf progs are sane.
>
>>> It's been a long time since I looked into rcu_nulls details.
>>> Pls help me understand that this new replace_rcu_nulls()
>>> is correct from nulls pov,
>>> If it is then this patch set may be the right answer to non-atomic update.
>> If I understand correctly, only the manipulations of ->first pointer and
>> ->next pointer need to take care of nulls pointer.
> hmm. I feel we're still talking past each other.
> See if (get_nulls_value() == ...) in lookup_nulls_elem_raw().
> It's there because of reuse. The lookup can start in one bucket
> and finish in another.

Yes, I noticed that. However, it happens when the deleted element is
reused and inserted into a different bucket, right? For
replace_rcu_nulls(), the reuse of the old element is possible only after
replace_rcu_nulls() completes, so I think it is irrelevant.

> If it is then this patch set may be the right answer to non-atomic update.

I also didn't follow that. Due to the immediate reuse, the lookup
procedure may still return -ENOENT when there is concurrent overwriting
of the same element, as shown above, so I think it only reduces the
possibility. I still prefer to implement a separate update procedure for
htab of maps to fix the atomic update problem completely for htab of
maps first.


>
>>> And for the future, please please focus on "why" part in
>>> the cover letter and commit logs instead of "what".
>>>
>>> Since the only thing I got from the log was:
>>> "Currently, the update is not atomic
>>> because the overwrite of existing element happens in a two-steps way,
>>> but the support of atomic update is feasible".
>>>
>>> "is feasible" doesn't explain "why".
>>>
>>> Link to xdp-newbie question is nice for additional context,
>>> but reviewers should not need to go and read some thread somewhere
>>> to understand "why" part.
>>> All of it should be in the commit log.
>> OK. My original thought is that is a reported problem, so an extra link
>> will be enough. Will try to add more context next time.
>>>> map may still be incorrect (as shown long time ago [1]), so I think
>>>> maybe for other types of map, the atomic update doesn't matter too much.
>>>>
>>>> [1]:
>>>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
>>> A thread from 3 years ago ?! Sorry, it's not helpful to ask
>>> people to page-in such an old context with lots of follow ups
>>> that may or may not be relevant today.
>> Let me reuse part of the diagram above to explain how the lookup
>> procedure may return an incorrect value:
>>
>> lookup procedure A
>> A: find the old element (instead of the new element)
>>
>>
>>               update procedure B
>>               B: delete the old element
>>               update procedure C on the same CPU:
>>
>>
>> A: the key is matched and return the value in the element
>>
>>               C: reuse the old element (overwrite its key and value)
>>
>> A: read the value (it is incorrect, because it has been reused and
>> overwritten)
> ... and it's fine. It's by design. It's an element reuse behavior.
>
> Long ago hashmap had two modes: prealloc (default) and
> NO_PREALLOC (call_rcu + kfree)
>
> The call_rcu part was there to make things safe.
> The memory cannot be kfree-ed to the kernel until RCU GP.
> With bpf_mem_alloc hashmap elements are freed back to bpf_ma
> right away. Hashmap is doing bpf_mem_cache_free()
> (instead of bpf_mem_cache_free_rcu()) because users need speed.
> So since 2022 both prealloc and no_prealloc reuse elements.
> We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP
> that will use _rcu() flavor of freeing into bpf_ma,
> but it has to have a strong reason.
> And soon as we add it the default with prealloc would need
> to use call_rcu() too, right?
> and that becomes nightmare, since bpf prog can easily DoS the system.
> Even if we use bpf_mem_cache_free_rcu() only, the DoS is a concern.
> Unlike new things like bpf_obj_new/obj_drop the hashmap
> is unpriv, so concerns are drastically different.
>
> .

I see. Thanks for the detailed explanation. That is part of the reason
why I proposed adding a seq-counter to the htab element and checking it
during lookup, so that at least the lookup will not return -ENOENT when
racing with a concurrent overwrite of the same element.
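
A rough sketch of the seq-counter idea (everything below is
hypothetical, including the seq field and the helper name): the updater
makes the counter odd while the key/value are being overwritten, and
the lookup samples it around the key compare, restarting the bucket
walk whenever it observes a change instead of reporting -ENOENT.

/* Hypothetical sketch of a seq-counter-aware lookup; l->seq does not
 * exist in the current htab_elem. A real implementation would also
 * need to bound the number of restarts.
 */
static struct htab_elem *lookup_elem_seq(struct hlist_nulls_head *head,
					 u32 hash, void *key, u32 key_size)
{
	struct hlist_nulls_node *n;
	struct htab_elem *l;
	unsigned int seq;

again:
	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node) {
		if (l->hash != hash)
			continue;
		seq = smp_load_acquire(&l->seq);
		if (seq & 1)
			goto again;		/* overwrite in flight */
		if (memcmp(&l->key, key, key_size)) {
			smp_rmb();
			if (READ_ONCE(l->seq) != seq)
				goto again;	/* key was rewritten */
			continue;		/* genuine mismatch */
		}
		smp_rmb();
		if (READ_ONCE(l->seq) != seq)
			goto again;		/* element was reused */
		return l;
	}
	return NULL;
}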
Nick Zavaritsky March 6, 2025, 10:22 a.m. UTC | #13
> On 27. Feb 2025, at 04:17, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Wed, Feb 26, 2025 at 6:43 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> 
>>>> 
>>>> lookup procedure A
>>>> A: find the old element (instead of the new element)
>>>> 
>>>>              update procedure B
>>>>              B: delete the old element
>>>>              update procedure C on the same CPU:
>>>>              C: reuse the old element (overwrite its key and insert in
>>>> the same bucket)
>>>> 
>>>> A: the key is mismatched and return -ENOENT.
>>> This is fine. It's just normal reuse.
>>> Orthogonal to 'update as insert+delete' issue.
>> 
>> OK. However, it will break the lookup procedure because the caller
>> expects a valid result instead of -ENOENT.
> 
> What do you mean 'breaks the lookup' ?
> lookup_elem_raw() matches hash, and then it memcmp(key),
> if the element is reused anything can happen.
> Either it succeeds in memcmp() and returns an elem,
> or miscompares in memcmp().
> Both are expected, because elems are reused in place.
> 
> And this behavior is expected and not-broken,
> because bpf prog that does lookup on one cpu and deletes
> the same element on the other cpu is asking for trouble.
> bpf infra guarantees the safety of the kernel.
> It doesn't guarantee that bpf progs are sane.
> 
>>> It's been a long time since I looked into rcu_nulls details.
>>> Pls help me understand that this new replace_rcu_nulls()
>>> is correct from nulls pov,
>>> If it is then this patch set may be the right answer to non-atomic update.
>> 
>> If I understand correctly, only the manipulations of the ->first and
>> ->next pointers need to take care of the nulls pointer.
> 
> hmm. I feel we're still talking past each other.
> See if (get_nulls_value() == ...) in lookup_nulls_elem_raw().
> It's there because of reuse. The lookup can start in one bucket
> and finish in another.
> 
>>> 
>>> And for the future, please please focus on "why" part in
>>> the cover letter and commit logs instead of "what".
>>> 
>>> Since the only thing I got from the log was:
>>> "Currently, the update is not atomic
>>> because the overwrite of existing element happens in a two-steps way,
>>> but the support of atomic update is feasible".
>>> 
>>> "is feasible" doesn't explain "why".
>>> 
>>> Link to xdp-newbie question is nice for additional context,
>>> but reviewers should not need to go and read some thread somewhere
>>> to understand "why" part.
>>> All of it should be in the commit log.
>> 
>> OK. My original thought was that this is a reported problem, so an
>> extra link would be enough. Will try to add more context next time.
>>> 
>>>> map may still be incorrect (as shown long time ago [1]), so I think
>>>> maybe for other types of map, the atomic update doesn't matter too much.
>>>> 
>>>> [1]:
>>>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
>>> A thread from 3 years ago ?! Sorry, it's not helpful to ask
>>> people to page-in such an old context with lots of follow ups
>>> that may or may not be relevant today.
>> Let me reuse part of the diagram above to explain how the lookup
>> procedure may return an incorrect value:
>> 
>> lookup procedure A
>> A: find the old element (instead of the new element)
>> 
>> 
>>              update procedure B
>>              B: delete the old element
>>              update procedure C on the same CPU:
>> 
>> 
>> A: the key is matched and return the value in the element
>> 
>>              C: reuse the old element (overwrite its key and value)
>> 
>> A: read the value (it is incorrect, because it has been reused and
>> overwritten)
> 
> ... and it's fine. It's by design. It's an element reuse behavior.
> 
> Long ago hashmap had two modes: prealloc (default) and
> NO_PREALLOC (call_rcu + kfree)
> 
> The call_rcu part was there to make things safe.
> The memory cannot be kfree-ed to the kernel until RCU GP.
> With bpf_mem_alloc hashmap elements are freed back to bpf_ma
> right away. Hashmap is doing bpf_mem_cache_free()

We (emnify.com) missed this change and kept writing code under the
assumption that NO_PREALLOC implies RCU.

Is there something we can do as of today to reduce the likelihood of an
item getting reused immediately? We are concerned with lookups yielding
bogus results when racing with updates. Worse, a program could corrupt
an unrelated item when writing via a pointer obtained from lookup.

You wrote that "lookup on one cpu and deletes the same element on the
other cpu is asking for trouble". This puzzles me, since user space
updating a map while (say) a TC program is consulting the map to make a
routing decision looks like a supported and widespread use case.

For us, implications vary in severity, e.g.:
 - 1-in-1e? packet mis-delivered (bogus lookup: LOW)
 - a tenant getting billed for a packet of another tenant delivered via
   satellite and costing USD 0.10 (writing into unrelated item: LOW)
 - a network flow state corrupted (writing into unrelated item: MEDIUM)

We need to audit our code to ensure that e.g. a flow state getting
corrupted self-corrects and the damage doesn’t spread.

It would be nice if we as eBPF users could decide whether we wish to
live dangerously or prefer to trade speed for safety, on a case-by-case
basis.
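
One partial, application-level mitigation along those lines (names and
value layout below are purely illustrative, and it only narrows the
window, it does not close it): keep a copy of the key, or a tag derived
from it, inside the value, and re-check it after the lookup before
trusting or writing through the returned pointer.

/* Illustrative sketch: re-validate the element after the lookup so a
 * value that has been reused for a different key since the lookup
 * started is less likely to be read or written through.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct flow_state {
	__u32 flow_key;		/* copy of the map key, written by the updater */
	__u64 packets;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, __u32);
	__type(value, struct flow_state);
} flows SEC(".maps");

SEC("tc")
int handle_packet(struct __sk_buff *skb)
{
	__u32 key = skb->hash;	/* illustrative flow key */
	struct flow_state *st;

	st = bpf_map_lookup_elem(&flows, &key);
	if (!st)
		return 0;		/* TC_ACT_OK */

	if (st->flow_key != key)	/* element was reused, don't touch it */
		return 0;

	__sync_fetch_and_add(&st->packets, 1);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";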

> (instead of bpf_mem_cache_free_rcu()) because users need speed.
> So since 2022 both prealloc and no_prealloc reuse elements.
> We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP
> that will use _rcu() flavor of freeing into bpf_ma,
> but it has to have a strong reason.
> And soon as we add it the default with prealloc would need
> to use call_rcu() too, right?
> and that becomes nightmare, since bpf prog can easily DoS the system.
> Even if we use bpf_mem_cache_free_rcu() only, the DoS is a concern.
> Unlike new things like bpf_obj_new/obj_drop the hashmap
> is unpriv, so concerns are drastically different.

Would it be an option to gate F_REUSE_AFTER_RCU under CAP_BPF?

It looks like sleepable programs would need to call bpf_rcu_read_lock()
explicitly to reap the benefits, but why not.
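
For what it's worth, a sketch of that pattern under the assumption that
such a flag existed (the map_flags line is hypothetical; the
bpf_rcu_read_lock()/bpf_rcu_read_unlock() kfuncs themselves do exist):

/* Sketch of a sleepable program bracketing the lookup and the use of
 * the returned pointer with explicit RCU read-side markers.
 * BPF_F_REUSE_AFTER_RCU_GP is a hypothetical flag, not a real one.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u32);
	__type(value, __u64);
	/* __uint(map_flags, BPF_F_REUSE_AFTER_RCU_GP);  hypothetical */
} counters SEC(".maps");

SEC("fentry.s/do_sys_openat2")
int BPF_PROG(count_openat2)
{
	__u32 key = 0;
	__u64 *val;

	bpf_rcu_read_lock();
	val = bpf_map_lookup_elem(&counters, &key);
	if (val)
		__sync_fetch_and_add(val, 1);
	bpf_rcu_read_unlock();
	return 0;
}

char LICENSE[] SEC("license") = "GPL";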
diff mbox series

Patch

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 4a9eeb7aef85..a28b11ce74c6 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1179,12 +1179,14 @@  static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 		goto err;
 	}
 
-	/* add new element to the head of the list, so that
-	 * concurrent search will find it before old elem
-	 */
-	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
-	if (l_old) {
-		hlist_nulls_del_rcu(&l_old->hash_node);
+	if (!l_old) {
+		hlist_nulls_add_head_rcu(&l_new->hash_node, head);
+	} else {
+		/* Replace the old element atomically, so that
+		 * concurrent search will find either the new element or
+		 * the old element.
+		 */
+		hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
 
 		/* l_old has already been stashed in htab->extra_elems, free
 		 * its special fields before it is available for reuse. Also