Message ID | 20250204082848.13471-3-hotforest@gmail.com (mailing list archive) |
---|---
State | New, archived |
Delegated to: | BPF |
Series | bpf: Overwrite the htab element atomically |
+cc Cody Haas

Sorry for the resend. I sent the reply in HTML format.

On 2/4/2025 4:28 PM, Hou Tao wrote:
> Currently, the update of an existing element in a hash map involves two
> steps:
> 1) insert the new element at the head of the hash list
> 2) remove the old element
>
> It is possible that a concurrent lookup operation may fail to find
> either the old element or the new element if the lookup starts
> before the addition and continues after the removal.
>
> Therefore, replace the two-step update with an atomic update. After
> the change, the update will be atomic from the perspective of the
> lookup operation: it will find either the old element or the new
> element.
>
> Signed-off-by: Hou Tao <hotforest@gmail.com>
> ---
>  kernel/bpf/hashtab.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 4a9eeb7aef85..a28b11ce74c6 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  		goto err;
>  	}
>
> -	/* add new element to the head of the list, so that
> -	 * concurrent search will find it before old elem
> -	 */
> -	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> -	if (l_old) {
> -		hlist_nulls_del_rcu(&l_old->hash_node);
> +	if (!l_old) {
> +		hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> +	} else {
> +		/* Replace the old element atomically, so that
> +		 * concurrent search will find either the new element or
> +		 * the old element.
> +		 */
> +		hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
>
>  	/* l_old has already been stashed in htab->extra_elems, free
>  	 * its special fields before it is available for reuse. Also

After thinking about it a second time, the atomic list replacement on the
update side is enough to make the lookup operation always find the
existing element.
However, due to the immediate reuse, the lookup may find an unexpected value. Maybe we should disable the immediate reuse for specific maps (e.g., htab of maps).
Hou Tao <houtao@huaweicloud.com> writes: > +cc Cody Haas > > Sorry for the resend. I sent the reply in the HTML format. > > On 2/4/2025 4:28 PM, Hou Tao wrote: >> Currently, the update of existing element in hash map involves two >> steps: >> 1) insert the new element at the head of the hash list >> 2) remove the old element >> >> It is possible that the concurrent lookup operation may fail to find >> either the old element or the new element if the lookup operation starts >> before the addition and continues after the removal. >> >> Therefore, replacing the two-step update with an atomic update. After >> the change, the update will be atomic in the perspective of the lookup >> operation: it will either find the old element or the new element. >> >> Signed-off-by: Hou Tao <hotforest@gmail.com> >> --- >> kernel/bpf/hashtab.c | 14 ++++++++------ >> 1 file changed, 8 insertions(+), 6 deletions(-) >> >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c >> index 4a9eeb7aef85..a28b11ce74c6 100644 >> --- a/kernel/bpf/hashtab.c >> +++ b/kernel/bpf/hashtab.c >> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value, >> goto err; >> } >> >> - /* add new element to the head of the list, so that >> - * concurrent search will find it before old elem >> - */ >> - hlist_nulls_add_head_rcu(&l_new->hash_node, head); >> - if (l_old) { >> - hlist_nulls_del_rcu(&l_old->hash_node); >> + if (!l_old) { >> + hlist_nulls_add_head_rcu(&l_new->hash_node, head); >> + } else { >> + /* Replace the old element atomically, so that >> + * concurrent search will find either the new element or >> + * the old element. >> + */ >> + hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node); >> >> /* l_old has already been stashed in htab->extra_elems, free >> * its special fields before it is available for reuse. 
Also
>>
> After thinking about it the second time, the atomic list replacement on
> the update side is enough to make lookup operation always find the
> existing element. However, due to the immediate reuse, the lookup may
> find an unexpected value. Maybe we should disable the immediate reuse
> for specific map (e.g., htab of maps).

Hmm, in an RCU-protected data structure, reusing the memory before an
RCU grace period has elapsed is just as wrong as freeing it, isn't it?
I.e., the reuse logic should have some kind of call_rcu redirection to
be completely correct?

-Toke
Hi Toke, On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote: > Hou Tao <houtao@huaweicloud.com> writes: > >> +cc Cody Haas >> >> Sorry for the resend. I sent the reply in the HTML format. >> >> On 2/4/2025 4:28 PM, Hou Tao wrote: >>> Currently, the update of existing element in hash map involves two >>> steps: >>> 1) insert the new element at the head of the hash list >>> 2) remove the old element >>> >>> It is possible that the concurrent lookup operation may fail to find >>> either the old element or the new element if the lookup operation starts >>> before the addition and continues after the removal. >>> >>> Therefore, replacing the two-step update with an atomic update. After >>> the change, the update will be atomic in the perspective of the lookup >>> operation: it will either find the old element or the new element. >>> >>> Signed-off-by: Hou Tao <hotforest@gmail.com> >>> --- >>> kernel/bpf/hashtab.c | 14 ++++++++------ >>> 1 file changed, 8 insertions(+), 6 deletions(-) >>> >>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c >>> index 4a9eeb7aef85..a28b11ce74c6 100644 >>> --- a/kernel/bpf/hashtab.c >>> +++ b/kernel/bpf/hashtab.c >>> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value, >>> goto err; >>> } >>> >>> - /* add new element to the head of the list, so that >>> - * concurrent search will find it before old elem >>> - */ >>> - hlist_nulls_add_head_rcu(&l_new->hash_node, head); >>> - if (l_old) { >>> - hlist_nulls_del_rcu(&l_old->hash_node); >>> + if (!l_old) { >>> + hlist_nulls_add_head_rcu(&l_new->hash_node, head); >>> + } else { >>> + /* Replace the old element atomically, so that >>> + * concurrent search will find either the new element or >>> + * the old element. >>> + */ >>> + hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node); >>> >>> /* l_old has already been stashed in htab->extra_elems, free >>> * its special fields before it is available for reuse. 
Also
>>>
>> After thinking about it the second time, the atomic list replacement on
>> the update side is enough to make lookup operation always find the
>> existing element. However, due to the immediate reuse, the lookup may
>> find an unexpected value. Maybe we should disable the immediate reuse
>> for specific map (e.g., htab of maps).
> Hmm, in an RCU-protected data structure, reusing the memory before an
> RCU grace period has elapsed is just as wrong as freeing it, isn't it?
> I.e., the reuse logic should have some kind of call_rcu redirection to
> be completely correct?

Not for all cases. There is the SLAB_TYPESAFE_BY_RCU-typed slab. For hash
maps, the reuse is also tricky (e.g., the "goto again" case in
lookup_nulls_elem_raw); however, it cannot prevent the lookup procedure
from returning an unexpected value. I had posted a patch set [1] to "fix"
that, but Alexei said it is "a known quirk". Here I am not sure whether it
is reasonable to disable the reuse for htab of maps only. I will post a v2
for the patch set.

[1]:
https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/

> -Toke
On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi Toke, > > On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote: > > Hou Tao <houtao@huaweicloud.com> writes: > > > >> +cc Cody Haas > >> > >> Sorry for the resend. I sent the reply in the HTML format. > >> > >> On 2/4/2025 4:28 PM, Hou Tao wrote: > >>> Currently, the update of existing element in hash map involves two > >>> steps: > >>> 1) insert the new element at the head of the hash list > >>> 2) remove the old element > >>> > >>> It is possible that the concurrent lookup operation may fail to find > >>> either the old element or the new element if the lookup operation starts > >>> before the addition and continues after the removal. > >>> > >>> Therefore, replacing the two-step update with an atomic update. After > >>> the change, the update will be atomic in the perspective of the lookup > >>> operation: it will either find the old element or the new element. I'm missing the point. This "atomic" replacement doesn't really solve anything. lookup will see one element. That element could be deleted by another thread. bucket lock and either two step update or single step don't change anything from the pov of bpf prog doing lookup. 
> >>> > >>> Signed-off-by: Hou Tao <hotforest@gmail.com> > >>> --- > >>> kernel/bpf/hashtab.c | 14 ++++++++------ > >>> 1 file changed, 8 insertions(+), 6 deletions(-) > >>> > >>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c > >>> index 4a9eeb7aef85..a28b11ce74c6 100644 > >>> --- a/kernel/bpf/hashtab.c > >>> +++ b/kernel/bpf/hashtab.c > >>> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value, > >>> goto err; > >>> } > >>> > >>> - /* add new element to the head of the list, so that > >>> - * concurrent search will find it before old elem > >>> - */ > >>> - hlist_nulls_add_head_rcu(&l_new->hash_node, head); > >>> - if (l_old) { > >>> - hlist_nulls_del_rcu(&l_old->hash_node); > >>> + if (!l_old) { > >>> + hlist_nulls_add_head_rcu(&l_new->hash_node, head); > >>> + } else { > >>> + /* Replace the old element atomically, so that > >>> + * concurrent search will find either the new element or > >>> + * the old element. > >>> + */ > >>> + hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node); > >>> > >>> /* l_old has already been stashed in htab->extra_elems, free > >>> * its special fields before it is available for reuse. Also > >>> > >> After thinking about it the second time, the atomic list replacement on > >> the update side is enough to make lookup operation always find the > >> existing element. However, due to the immediate reuse, the lookup may > >> find an unexpected value. Maybe we should disable the immediate reuse > >> for specific map (e.g., htab of maps). > > Hmm, in an RCU-protected data structure, reusing the memory before an > > RCU grace period has elapsed is just as wrong as freeing it, isn't it? > > I.e., the reuse logic should have some kind of call_rcu redirection to > > be completely correct? > > Not for all cases. There is SLAB_TYPESAFE_BY_RCU-typed slab. 
For hash
> map, the reuse is also tricky (e.g., the goto again case in
> lookup_nulls_elem_raw), however it can not prevent the lookup procedure
> from returning unexpected value. I had post a patch set [1] to "fix"
> that, but Alexei said it is "a known quirk". Here I am not sure about
> whether it is reasonable to disable the reuse for htab of maps only. I
> will post a v2 for the patch set.
>
> [1]:
> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/

yes. we still have to keep prealloc as the default for now :(
Eventually bpf_mem_alloc is replaced with a fully re-entrant and safe
kmalloc, then we can do a fully re-entrant and safe kfree_rcu. Then we
can talk about closing this quirk. Until then the prog has to deal with
immediate reuse. That has been the case for a decade already.
Hi,

On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi Toke,
>>
>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>
>>>> +cc Cody Haas
>>>>
>>>> Sorry for the resend. I sent the reply in the HTML format.
>>>>
>>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
>>>>> Currently, the update of existing element in hash map involves two
>>>>> steps:
>>>>> 1) insert the new element at the head of the hash list
>>>>> 2) remove the old element
>>>>>
>>>>> It is possible that the concurrent lookup operation may fail to find
>>>>> either the old element or the new element if the lookup operation starts
>>>>> before the addition and continues after the removal.
>>>>>
>>>>> Therefore, replacing the two-step update with an atomic update. After
>>>>> the change, the update will be atomic in the perspective of the lookup
>>>>> operation: it will either find the old element or the new element.
> I'm missing the point.
> This "atomic" replacement doesn't really solve anything.
> lookup will see one element.
> That element could be deleted by another thread.
> bucket lock and either two step update or single step
> don't change anything from the pov of bpf prog doing lookup.

The point is that overwriting an existing element may lead concurrent
lookups to return ENOENT, as demonstrated by the added selftest, and the
patch tried to "fix" that. However, it seems using
hlist_nulls_replace_rcu() for the overwriting update is still not enough:
when the lookup procedure finds the old element, the old element may be
in the middle of being reused, the comparison of the map key may fail,
and the lookup procedure may still return ENOENT.
> >>>>> Signed-off-by: Hou Tao <hotforest@gmail.com> >>>>> --- >>>>> kernel/bpf/hashtab.c | 14 ++++++++------ >>>>> 1 file changed, 8 insertions(+), 6 deletions(-) >>>>> >>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c >>>>> index 4a9eeb7aef85..a28b11ce74c6 100644 >>>>> --- a/kernel/bpf/hashtab.c >>>>> +++ b/kernel/bpf/hashtab.c >>>>> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value, >>>>> goto err; >>>>> } >>>>> >>>>> - /* add new element to the head of the list, so that >>>>> - * concurrent search will find it before old elem >>>>> - */ >>>>> - hlist_nulls_add_head_rcu(&l_new->hash_node, head); >>>>> - if (l_old) { >>>>> - hlist_nulls_del_rcu(&l_old->hash_node); >>>>> + if (!l_old) { >>>>> + hlist_nulls_add_head_rcu(&l_new->hash_node, head); >>>>> + } else { >>>>> + /* Replace the old element atomically, so that >>>>> + * concurrent search will find either the new element or >>>>> + * the old element. >>>>> + */ >>>>> + hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node); >>>>> >>>>> /* l_old has already been stashed in htab->extra_elems, free >>>>> * its special fields before it is available for reuse. Also >>>>> >>>> After thinking about it the second time, the atomic list replacement on >>>> the update side is enough to make lookup operation always find the >>>> existing element. However, due to the immediate reuse, the lookup may >>>> find an unexpected value. Maybe we should disable the immediate reuse >>>> for specific map (e.g., htab of maps). >>> Hmm, in an RCU-protected data structure, reusing the memory before an >>> RCU grace period has elapsed is just as wrong as freeing it, isn't it? >>> I.e., the reuse logic should have some kind of call_rcu redirection to >>> be completely correct? >> Not for all cases. There is SLAB_TYPESAFE_BY_RCU-typed slab. 
For hash
>> map, the reuse is also tricky (e.g., the goto again case in
>> lookup_nulls_elem_raw), however it can not prevent the lookup procedure
>> from returning unexpected value. I had post a patch set [1] to "fix"
>> that, but Alexei said it is "a known quirk". Here I am not sure about
>> whether it is reasonable to disable the reuse for htab of maps only. I
>> will post a v2 for the patch set.
>>
>> [1]:
>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
> yes. we still have to keep prealloc as default for now :(
> Eventually bpf_mem_alloc is replaced with fully re-entrant
> and safe kmalloc, then we can do fully re-entrant and safe
> kfree_rcu. Then we can talk about closing this quirk.
> Until then the prog has to deal with immediate reuse.
> That was the case for a decade already.

I see. In v2 I will fall back to the original idea: adding a standalone
update procedure for htab of maps in which it will atomically overwrite
the map_ptr, just like array of maps does.
On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 2/26/2025 11:24 AM, Alexei Starovoitov wrote: > > On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote: > >> Hi Toke, > >> > >> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote: > >>> Hou Tao <houtao@huaweicloud.com> writes: > >>> > >>>> +cc Cody Haas > >>>> > >>>> Sorry for the resend. I sent the reply in the HTML format. > >>>> > >>>> On 2/4/2025 4:28 PM, Hou Tao wrote: > >>>>> Currently, the update of existing element in hash map involves two > >>>>> steps: > >>>>> 1) insert the new element at the head of the hash list > >>>>> 2) remove the old element > >>>>> > >>>>> It is possible that the concurrent lookup operation may fail to find > >>>>> either the old element or the new element if the lookup operation starts > >>>>> before the addition and continues after the removal. > >>>>> > >>>>> Therefore, replacing the two-step update with an atomic update. After > >>>>> the change, the update will be atomic in the perspective of the lookup > >>>>> operation: it will either find the old element or the new element. > > I'm missing the point. > > This "atomic" replacement doesn't really solve anything. > > lookup will see one element. > > That element could be deleted by another thread. > > bucket lock and either two step update or single step > > don't change anything from the pov of bpf prog doing lookup. > > The point is that overwriting an existed element may lead to concurrent > lookups return ENOENT as demonstrated by the added selftest and the > patch tried to "fix" that. However, it seems using > hlist_nulls_replace_rcu() for the overwriting update is still not > enough. Because when the lookup procedure found the old element, the old > element may be reusing, the comparison of the map key may fail, and the > lookup procedure may still return ENOENT. you mean l_old == l_new ? 
I don't think it's possible within htab_map_update_elem(), but
htab_map_delete_elem() doing hlist_nulls_del_rcu() then free_htab_elem,
htab_map_update_elem, alloc, hlist_nulls_add_head_rcu, and the
just-deleted elem is reused to be inserted into another bucket.

I'm not sure whether this new hlist_nulls_replace_rcu() primitive works
with the nulls logic.

So back to the problem statement.. Are you saying that adding the new
element to the head while a lookup is in the middle is causing it to
miss an element that is supposed to be updated, assuming atomicity of
the update? While now update_elem() is more like a sequence of
delete + insert?

Hmm.

> I see. In v2 I will fallback to the original idea: adding a standalone
> update procedure for htab of maps in which it will atomically overwrite
> the map_ptr just like array of maps does.

hold on. is this only for hash-of-maps ?
How has that non-atomic update manifested in real production?
On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote: > > > > Hi, > > > > On 2/26/2025 11:24 AM, Alexei Starovoitov wrote: > > > On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote: > > >> Hi Toke, > > >> > > >> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote: > > >>> Hou Tao <houtao@huaweicloud.com> writes: > > >>> > > >>>> +cc Cody Haas > > >>>> > > >>>> Sorry for the resend. I sent the reply in the HTML format. > > >>>> > > >>>> On 2/4/2025 4:28 PM, Hou Tao wrote: > > >>>>> Currently, the update of existing element in hash map involves two > > >>>>> steps: > > >>>>> 1) insert the new element at the head of the hash list > > >>>>> 2) remove the old element > > >>>>> > > >>>>> It is possible that the concurrent lookup operation may fail to find > > >>>>> either the old element or the new element if the lookup operation starts > > >>>>> before the addition and continues after the removal. > > >>>>> > > >>>>> Therefore, replacing the two-step update with an atomic update. After > > >>>>> the change, the update will be atomic in the perspective of the lookup > > >>>>> operation: it will either find the old element or the new element. > > > I'm missing the point. > > > This "atomic" replacement doesn't really solve anything. > > > lookup will see one element. > > > That element could be deleted by another thread. > > > bucket lock and either two step update or single step > > > don't change anything from the pov of bpf prog doing lookup. > > > > The point is that overwriting an existed element may lead to concurrent > > lookups return ENOENT as demonstrated by the added selftest and the > > patch tried to "fix" that. However, it seems using > > hlist_nulls_replace_rcu() for the overwriting update is still not > > enough. 
Because when the lookup procedure found the old
> > element may be reusing, the comparison of the map key may fail, and the
> > lookup procedure may still return ENOENT.
>
> you mean l_old == l_new ? I don't think it's possible
> within htab_map_update_elem(),
> but htab_map_delete_elem() doing hlist_nulls_del_rcu()
> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu
> and just deleted elem is reused to be inserted
> into another bucket.
>
> I'm not sure whether this new hlist_nulls_replace_rcu()
> primitive works with nulls logic.
>
> So back to the problem statement..
> Are you saying that adding new to a head while lookup is in the middle
> is causing it to miss an element that
> is supposed to be updated assuming atomicity of the update?
> While now update_elem() is more like a sequence of delete + insert?
>
> Hmm.

Yes, exactly that. Because update_elem is actually a delete + insert
(actually an insert + delete, I think?), it is possible for a concurrent
lookup to see no element instead of either the old or new value.

> > I see. In v2 I will fallback to the original idea: adding a standalone
> > update procedure for htab of maps in which it will atomically overwrite
> > the map_ptr just like array of maps does.
>
> hold on. is this only for hash-of-maps ?

I believe this was also replicated for hash as well as hash-of-maps. Cody
can confirm, or use the reproducer he has to test that case.

> How that non-atomic update manifested in real production?

See [1] (in the cover letter for this series, also replicated below).

[1]: https://lore.kernel.org/xdp-newbies/07a365d8-2e66-2899-4298-b8b158a928fa@huaweicloud.com/T/#m06fcd687c6cfdbd0f9b643b227e69b479fc8c2f6
Hi, On 2/27/2025 7:17 AM, Zvi Effron wrote: > On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: >> On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote: >>> Hi, >>> >>> On 2/26/2025 11:24 AM, Alexei Starovoitov wrote: >>>> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote: >>>>> Hi Toke, >>>>> >>>>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote: >>>>>> Hou Tao <houtao@huaweicloud.com> writes: >>>>>> >>>>>>> +cc Cody Haas >>>>>>> >>>>>>> Sorry for the resend. I sent the reply in the HTML format. >>>>>>> >>>>>>> On 2/4/2025 4:28 PM, Hou Tao wrote: >>>>>>>> Currently, the update of existing element in hash map involves two >>>>>>>> steps: >>>>>>>> 1) insert the new element at the head of the hash list >>>>>>>> 2) remove the old element >>>>>>>> >>>>>>>> It is possible that the concurrent lookup operation may fail to find >>>>>>>> either the old element or the new element if the lookup operation starts >>>>>>>> before the addition and continues after the removal. >>>>>>>> >>>>>>>> Therefore, replacing the two-step update with an atomic update. After >>>>>>>> the change, the update will be atomic in the perspective of the lookup >>>>>>>> operation: it will either find the old element or the new element. >>>> I'm missing the point. >>>> This "atomic" replacement doesn't really solve anything. >>>> lookup will see one element. >>>> That element could be deleted by another thread. >>>> bucket lock and either two step update or single step >>>> don't change anything from the pov of bpf prog doing lookup. >>> The point is that overwriting an existed element may lead to concurrent >>> lookups return ENOENT as demonstrated by the added selftest and the >>> patch tried to "fix" that. However, it seems using >>> hlist_nulls_replace_rcu() for the overwriting update is still not >>> enough. 
Because when the lookup procedure found the old
>>> element may be reusing, the comparison of the map key may fail, and the
>>> lookup procedure may still return ENOENT.
>> you mean l_old == l_new ? I don't think it's possible
>> within htab_map_update_elem(),
>> but htab_map_delete_elem() doing hlist_nulls_del_rcu()
>> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu
>> and just deleted elem is reused to be inserted
>> into another bucket.

No. I mean the following procedure, in which lookup procedure A finds the
old element and tries to match its key, update procedure B has already
deleted the old element, and update procedure C is reusing the old
element:

    lookup procedure A
        A: find the old element (instead of the new one)

    update procedure B
        B: delete the old element
    update procedure C on the same CPU:
        C: reuse the old element (overwrite its key and insert it in
           the same bucket)

        A: the key is mismatched, so the lookup returns -ENOENT.

It can be reproduced by increasing ctx.loop in the selftest.

>>
>> I'm not sure whether this new hlist_nulls_replace_rcu()
>> primitive works with nulls logic.
>>
>> So back to the problem statement..
>> Are you saying that adding new to a head while lookup is in the middle
>> is causing it to miss an element that
>> is supposed to be updated assuming atomicity of the update?
>> While now update_elem() is more like a sequence of delete + insert?
>>
>> Hmm.
> Yes, exactly that. Because update_elem is actually a delete + insert
> (actually an insert + delete, I think?), it is possible for a concurrent
> lookup to see no element instead of either the old or new value.

Yep.

>
>>> I see. In v2 I will fallback to the original idea: adding a standalone
>>> update procedure for htab of maps in which it will atomically overwrite
>>> the map_ptr just like array of maps does.
>> hold on. is this only for hash-of-maps ?
> I believe this was also replicated for hash as well as hash-of-maps.
Cody can
> confirm, or use the reproducer he has to test that case.

The fix for hash-of-maps will be much simpler, and the returned map
pointer will be correct. For other types of hash map, besides switching
to hlist_nulls_replace_rcu(), I think we also need to detect the reuse of
an element during the loop: if the element has been reused, the lookup
should restart from the head again. However, even with the above two
fixes, the value returned by the lookup procedure for other types of hash
map may still be incorrect (as shown a long time ago [1]), so I think
maybe for other types of map, the atomic update doesn't matter too much.

[1]:
https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/

>
>> How that non-atomic update manifested in real production?
>>
> See [1] (in the cover letter for this series, also replicated below).
>
> [1]: https://lore.kernel.org/xdp-newbies/07a365d8-2e66-2899-4298-b8b158a928fa@huaweicloud.com/T/#m06fcd687c6cfdbd0f9b643b227e69b479fc8c2f6
On Wed, Feb 26, 2025 at 5:48 PM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 2/27/2025 7:17 AM, Zvi Effron wrote: > > On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > >> On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote: > >>> Hi, > >>> > >>> On 2/26/2025 11:24 AM, Alexei Starovoitov wrote: > >>>> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote: > >>>>> Hi Toke, > >>>>> > >>>>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote: > >>>>>> Hou Tao <houtao@huaweicloud.com> writes: > >>>>>> > >>>>>>> +cc Cody Haas > >>>>>>> > >>>>>>> Sorry for the resend. I sent the reply in the HTML format. > >>>>>>> > >>>>>>> On 2/4/2025 4:28 PM, Hou Tao wrote: > >>>>>>>> Currently, the update of existing element in hash map involves two > >>>>>>>> steps: > >>>>>>>> 1) insert the new element at the head of the hash list > >>>>>>>> 2) remove the old element > >>>>>>>> > >>>>>>>> It is possible that the concurrent lookup operation may fail to find > >>>>>>>> either the old element or the new element if the lookup operation starts > >>>>>>>> before the addition and continues after the removal. > >>>>>>>> > >>>>>>>> Therefore, replacing the two-step update with an atomic update. After > >>>>>>>> the change, the update will be atomic in the perspective of the lookup > >>>>>>>> operation: it will either find the old element or the new element. > >>>> I'm missing the point. > >>>> This "atomic" replacement doesn't really solve anything. > >>>> lookup will see one element. > >>>> That element could be deleted by another thread. > >>>> bucket lock and either two step update or single step > >>>> don't change anything from the pov of bpf prog doing lookup. > >>> The point is that overwriting an existed element may lead to concurrent > >>> lookups return ENOENT as demonstrated by the added selftest and the > >>> patch tried to "fix" that. 
However, it seems using > >>> hlist_nulls_replace_rcu() for the overwriting update is still not > >>> enough. Because when the lookup procedure found the old element, the old > >>> element may be reusing, the comparison of the map key may fail, and the > >>> lookup procedure may still return ENOENT. > >> you mean l_old == l_new ? I don't think it's possible > >> within htab_map_update_elem(), > >> but htab_map_delete_elem() doing hlist_nulls_del_rcu() > >> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu > >> and just deleted elem is reused to be inserted > >> into another bucket. > > No. I mean the following procedure in which the lookup procedure finds > the old element and tries to match its key, one update procedure has > already deleted the old element, and another update procedure is reusing > the old element: > > lookup procedure A > A: find the old element (instead of the new old) > > update procedure B > B: delete the old element > update procedure C on the same CPU: > C: reuse the old element (overwrite its key and insert in > the same bucket) > > A: the key is mismatched and return -ENOENT. This is fine. It's just normal reuse. Orthogonal to 'update as insert+delete' issue. > It can be reproduced by increasing ctx.loop in the selftest. > >> > >> I'm not sure whether this new hlist_nulls_replace_rcu() > >> primitive works with nulls logic. > >> > >> So back to the problem statement.. > >> Are you saying that adding new to a head while lookup is in the middle > >> is causing it to miss an element that > >> is supposed to be updated assuming atomicity of the update? > >> While now update_elem() is more like a sequence of delete + insert? > >> > >> Hmm. > > Yes, exactly that. Because update_elem is actually a delete + insert (actually > > an insert + delete, I think?), it is possible for a concurrent lookup to see no > > element instead of either the old or new value. > > Yep. > > > >>> I see. 
In v2 I will fallback to the original idea: adding a standalone
> >>> update procedure for htab of maps in which it will atomically overwrite
> >>> the map_ptr just like array of maps does.
> >> hold on. is this only for hash-of-maps ?
> > I believe this was also replicated for hash as well as hash-of-maps. Cody can
> > confirm, or use the reproducer he has to test that case.
>
> The fix for hash-of-maps will be much simpler and the returned map
> pointer will be correct. For other types of hash map, beside switching
> to hlist_nulls_replace_rcu(),

It's been a long time since I looked into rcu_nulls details.
Pls help me understand whether this new replace_rcu_nulls() is correct
from a nulls pov. If it is, then this patch set may be the right answer
to the non-atomic update.

And for the future, please please focus on the "why" part in the cover
letter and commit logs instead of the "what". The only thing I got from
the log was: "Currently, the update is not atomic because the overwrite
of existing element happens in a two-steps way, but the support of
atomic update is feasible". "is feasible" doesn't explain "why". The
link to the xdp-newbies question is nice for additional context, but
reviewers should not need to go and read some thread somewhere to
understand the "why" part. All of it should be in the commit log.

> map may still be incorrect (as shown long time ago [1]), so I think
> maybe for other types of map, the atomic update doesn't matter too much.
>
> [1]:
> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/

A thread from 3 years ago?! Sorry, it's not helpful to ask people to
page in such an old context with lots of follow-ups that may or may not
be relevant today.
Hi, On 2/27/2025 9:59 AM, Alexei Starovoitov wrote: > On Wed, Feb 26, 2025 at 5:48 PM Hou Tao <houtao@huaweicloud.com> wrote: >> Hi, >> >> On 2/27/2025 7:17 AM, Zvi Effron wrote: >>> On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov >>> <alexei.starovoitov@gmail.com> wrote: >>>> On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote: >>>>> Hi, >>>>> >>>>> On 2/26/2025 11:24 AM, Alexei Starovoitov wrote: >>>>>> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote: >>>>>>> Hi Toke, >>>>>>> >>>>>>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote: >>>>>>>> Hou Tao <houtao@huaweicloud.com> writes: >>>>>>>> >>>>>>>>> +cc Cody Haas >>>>>>>>> >>>>>>>>> Sorry for the resend. I sent the reply in the HTML format. >>>>>>>>> >>>>>>>>> On 2/4/2025 4:28 PM, Hou Tao wrote: >>>>>>>>>> Currently, the update of existing element in hash map involves two >>>>>>>>>> steps: >>>>>>>>>> 1) insert the new element at the head of the hash list >>>>>>>>>> 2) remove the old element >>>>>>>>>> >>>>>>>>>> It is possible that the concurrent lookup operation may fail to find >>>>>>>>>> either the old element or the new element if the lookup operation starts >>>>>>>>>> before the addition and continues after the removal. >>>>>>>>>> >>>>>>>>>> Therefore, replacing the two-step update with an atomic update. After >>>>>>>>>> the change, the update will be atomic in the perspective of the lookup >>>>>>>>>> operation: it will either find the old element or the new element. >>>>>> I'm missing the point. >>>>>> This "atomic" replacement doesn't really solve anything. >>>>>> lookup will see one element. >>>>>> That element could be deleted by another thread. >>>>>> bucket lock and either two step update or single step >>>>>> don't change anything from the pov of bpf prog doing lookup. >>>>> The point is that overwriting an existed element may lead to concurrent >>>>> lookups return ENOENT as demonstrated by the added selftest and the >>>>> patch tried to "fix" that. 
However, it seems using >>>>> hlist_nulls_replace_rcu() for the overwriting update is still not >>>>> enough. Because when the lookup procedure found the old element, the old >>>>> element may be reusing, the comparison of the map key may fail, and the >>>>> lookup procedure may still return ENOENT. >>>> you mean l_old == l_new ? I don't think it's possible >>>> within htab_map_update_elem(), >>>> but htab_map_delete_elem() doing hlist_nulls_del_rcu() >>>> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu >>>> and just deleted elem is reused to be inserted >>>> into another bucket. >> No. I mean the following procedure in which the lookup procedure finds >> the old element and tries to match its key, one update procedure has >> already deleted the old element, and another update procedure is reusing >> the old element: >> >> lookup procedure A >> A: find the old element (instead of the new old) >> >> update procedure B >> B: delete the old element >> update procedure C on the same CPU: >> C: reuse the old element (overwrite its key and insert in >> the same bucket) >> >> A: the key is mismatched and return -ENOENT. > This is fine. It's just normal reuse. > Orthogonal to 'update as insert+delete' issue. OK. However, it will break the lookup procedure because it expects it will return a valid result instead of -ENOENT. > >> It can be reproduced by increasing ctx.loop in the selftest. >>>> I'm not sure whether this new hlist_nulls_replace_rcu() >>>> primitive works with nulls logic. >>>> >>>> So back to the problem statement.. >>>> Are you saying that adding new to a head while lookup is in the middle >>>> is causing it to miss an element that >>>> is supposed to be updated assuming atomicity of the update? >>>> While now update_elem() is more like a sequence of delete + insert? >>>> >>>> Hmm. >>> Yes, exactly that.
Because update_elem is actually a delete + insert (actually >>> an insert + delete, I think?), it is possible for a concurrent lookup to see no >>> element instead of either the old or new value. >> Yep. >>>>> I see. In v2 I will fallback to the original idea: adding a standalone >>>>> update procedure for htab of maps in which it will atomically overwrite >>>>> the map_ptr just like array of maps does. >>>> hold on. is this only for hash-of-maps ? >>> I believe this was also replicated for hash as well as hash-of-maps. Cody can >>> confirm, or use the reproducer he has to test that case. >> The fix for hash-of-maps will be much simpler and the returned map >> pointer will be correct. For other types of hash map, beside switching >> to hlist_nulls_replace_rcu(), > It's been a long time since I looked into rcu_nulls details. > Pls help me understand that this new replace_rcu_nulls() > is correct from nulls pov, > If it is then this patch set may be the right answer to non-atomic update. If I understand correctly, only the manipulations of the ->first pointer and the ->next pointer need to take care of the nulls pointer. > > And for the future, please please focus on "why" part in > the cover letter and commit logs instead of "what". > > Since the only thing I got from the log was: > "Currently, the update is not atomic > because the overwrite of existing element happens in a two-steps way, > but the support of atomic update is feasible". > > "is feasible" doesn't explain "why". > > Link to xdp-newbie question is nice for additional context, > but reviewers should not need to go and read some thread somewhere > to understand "why" part. > All of it should be in the commit log. OK. My original thought was that it is a reported problem, so an extra link would be enough. I will try to add more context next time. > >> map may still be incorrect (as shown long time ago [1]), so I think >> maybe for other types of map, the atomic update doesn't matter too much.
>> >> [1]: >> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/ > A thread from 3 years ago ?! Sorry, it's not helpful to ask > people to page-in such an old context with lots of follow ups > that may or may not be relevant today. Let me reuse part of the diagram above to explain how the lookup procedure may return an incorrect value:

lookup procedure A
A: find the old element (instead of the new element)

update procedure B
B: delete the old element
update procedure C on the same CPU:

A: the key is matched and return the value in the element

C: reuse the old element (overwrite its key and value)

A: read the value (it is incorrect, because it has been reused and overwritten)
On Wed, Feb 26, 2025 at 6:43 PM Hou Tao <houtao@huaweicloud.com> wrote: > > >> > >> lookup procedure A > >> A: find the old element (instead of the new old) > >> > >> update procedure B > >> B: delete the old element > >> update procedure C on the same CPU: > >> C: reuse the old element (overwrite its key and insert in > >> the same bucket) > >> > >> A: the key is mismatched and return -ENOENT. > > This is fine. It's just normal reuse. > > Orthogonal to 'update as insert+delete' issue. > > OK. However, it will break the lookup procedure because it expects it > will return an valid result instead of -ENOENT. What do you mean 'breaks the lookup' ? lookup_elem_raw() matches hash, and then it memcmp(key), if the element is reused anything can happen. Either it succeeds in memcmp() and returns an elem, or miscompares in memcmp(). Both are expected, because elems are reused in place. And this behavior is expected and not-broken, because bpf prog that does lookup on one cpu and deletes the same element on the other cpu is asking for trouble. bpf infra guarantees the safety of the kernel. It doesn't guarantee that bpf progs are sane. > > It's been a long time since I looked into rcu_nulls details. > > Pls help me understand that this new replace_rcu_nulls() > > is correct from nulls pov, > > If it is then this patch set may be the right answer to non-atomic update. > > If I understand correctly, only the manipulations of ->first pointer and > ->next pointer need to take care of nulls pointer. hmm. I feel we're still talking past each other. See if (get_nulls_value() == ...) in lookup_nulls_elem_raw(). It's there because of reuse. The lookup can start in one bucket and finish in another. > > > > And for the future, please please focus on "why" part in > > the cover letter and commit logs instead of "what". 
> > > > Since the only thing I got from the log was: > > "Currently, the update is not atomic > > because the overwrite of existing element happens in a two-steps way, > > but the support of atomic update is feasible". > > > > "is feasible" doesn't explain "why". > > > > Link to xdp-newbie question is nice for additional context, > > but reviewers should not need to go and read some thread somewhere > > to understand "why" part. > > All of it should be in the commit log. > > OK. My original thought is that is a reported problem, so an extra link > will be enough. Will try to add more context next time. > > > >> map may still be incorrect (as shown long time ago [1]), so I think > >> maybe for other types of map, the atomic update doesn't matter too much. > >> > >> [1]: > >> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/ > > A thread from 3 years ago ?! Sorry, it's not helpful to ask > > people to page-in such an old context with lots of follow ups > > that may or may not be relevant today. > Let me reuse part of the diagram above to explain how does the lookup > procedure may return a incorrect value: > > lookup procedure A > A: find the old element (instead of the new element) > > > update procedure B > B: delete the old element > update procedure C on the same CPU: > > > A: the key is matched and return the value in the element > > C: reuse the old element (overwrite its key and value) > > A: read the value (it is incorrect, because it has been reused and > overwritten) ... and it's fine. It's by design. It's an element reuse behavior. Long ago hashmap had two modes: prealloc (default) and NO_PREALLOC (call_rcu + kfree) The call_rcu part was there to make things safe. The memory cannot be kfree-ed to the kernel until RCU GP. With bpf_mem_alloc hashmap elements are freed back to bpf_ma right away. Hashmap is doing bpf_mem_cache_free() (instead of bpf_mem_cache_free_rcu()) because users need speed. 
So since 2022 both prealloc and no_prealloc reuse elements. We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP that will use _rcu() flavor of freeing into bpf_ma, but it has to have a strong reason. And as soon as we add it the default with prealloc would need to use call_rcu() too, right? and that becomes a nightmare, since bpf prog can easily DoS the system. Even if we use bpf_mem_cache_free_rcu() only, the DoS is a concern. Unlike new things like bpf_obj_new/obj_drop the hashmap is unpriv, so concerns are drastically different.
Hi, On 2/27/2025 11:17 AM, Alexei Starovoitov wrote: > On Wed, Feb 26, 2025 at 6:43 PM Hou Tao <houtao@huaweicloud.com> wrote: >>>> lookup procedure A >>>> A: find the old element (instead of the new old) >>>> >>>> update procedure B >>>> B: delete the old element >>>> update procedure C on the same CPU: >>>> C: reuse the old element (overwrite its key and insert in >>>> the same bucket) >>>> >>>> A: the key is mismatched and return -ENOENT. >>> This is fine. It's just normal reuse. >>> Orthogonal to 'update as insert+delete' issue. >> OK. However, it will break the lookup procedure because it expects it >> will return a valid result instead of -ENOENT. > What do you mean 'breaks the lookup' ? > lookup_elem_raw() matches hash, and then it memcmp(key), > if the element is reused anything can happen. > Either it succeeds in memcmp() and returns an elem, > or miscompares in memcmp(). > Both are expected, because elems are reused in place. > > And this behavior is expected and not-broken, > because bpf prog that does lookup on one cpu and deletes > the same element on the other cpu is asking for trouble. It seems I misled you in the above example. It is also possible to do a lookup on one CPU while updating the same element on another CPU. Does such concurrency also look insane?

lookup procedure A
A: find the old element (instead of the new element)

update procedure B
B: overwrite the same element and free the old element
update procedure C on the same CPU:
C: reuse the old element (overwrite its key and insert in the same bucket)

A: the key is mismatched and return -ENOENT.

> bpf infra guarantees the safety of the kernel. > It doesn't guarantee that bpf progs are sane. > >>> It's been a long time since I looked into rcu_nulls details. >>> Pls help me understand that this new replace_rcu_nulls() >>> is correct from nulls pov, >>> If it is then this patch set may be the right answer to non-atomic update.
>> If I understand correctly, only the manipulations of ->first pointer and >> ->next pointer need to take care of nulls pointer. > hmm. I feel we're still talking past each other. > See if (get_nulls_value() == ...) in lookup_nulls_elem_raw(). > It's there because of reuse. The lookup can start in one bucket > and finish in another. Yes. I noticed that. However, it happens when the deleted element is reused and inserted into a different bucket, right? For replace_rcu_nulls(), the reuse of the old element is possible only after replace_rcu_nulls() completes, so I think it is irrelevant. >If it is then this patch set may be the right answer to non-atomic update. I didn't follow that either. Due to the immediate reuse, the lookup procedure may still return -ENOENT when there is concurrent overwriting of the same element as shown above, so I think it only reduces the possibility. I still prefer to first implement a separate update procedure for htab of maps to fix the atomic update problem completely for htab of maps. > >>> And for the future, please please focus on "why" part in >>> the cover letter and commit logs instead of "what". >>> >>> Since the only thing I got from the log was: >>> "Currently, the update is not atomic >>> because the overwrite of existing element happens in a two-steps way, >>> but the support of atomic update is feasible". >>> >>> "is feasible" doesn't explain "why". >>> >>> Link to xdp-newbie question is nice for additional context, >>> but reviewers should not need to go and read some thread somewhere >>> to understand "why" part. >>> All of it should be in the commit log. >> OK. My original thought is that is a reported problem, so an extra link >> will be enough. Will try to add more context next time. >>>> map may still be incorrect (as shown long time ago [1]), so I think >>>> maybe for other types of map, the atomic update doesn't matter too much.
>>>> >>>> [1]: >>>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/ >>> A thread from 3 years ago ?! Sorry, it's not helpful to ask >>> people to page-in such an old context with lots of follow ups >>> that may or may not be relevant today. >> Let me reuse part of the diagram above to explain how does the lookup >> procedure may return a incorrect value: >> >> lookup procedure A >> A: find the old element (instead of the new element) >> >> >> update procedure B >> B: delete the old element >> update procedure C on the same CPU: >> >> >> A: the key is matched and return the value in the element >> >> C: reuse the old element (overwrite its key and value) >> >> A: read the value (it is incorrect, because it has been reused and >> overwritten) > ... and it's fine. It's by design. It's an element reuse behavior. > > Long ago hashmap had two modes: prealloc (default) and > NO_PREALLOC (call_rcu + kfree) > > The call_rcu part was there to make things safe. > The memory cannot be kfree-ed to the kernel until RCU GP. > With bpf_mem_alloc hashmap elements are freed back to bpf_ma > right away. Hashmap is doing bpf_mem_cache_free() > (instead of bpf_mem_cache_free_rcu()) because users need speed. > So since 2022 both prealloc and no_prealloc reuse elements. > We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP > that will use _rcu() flavor of freeing into bpf_ma, > but it has to have a strong reason. > And soon as we add it the default with prealloc would need > to use call_rcu() too, right? > and that becomes nightmare, since bpf prog can easily DoS the system. > Even if we use bpf_mem_cache_free_rcu() only, the DoS is a concern. > Unlike new things like bpf_obj_new/obj_drop the hashmap > is unpriv, so concerns are drastically different. > > . I see. Thanks for the detailed explanation. 
And that is part of the reason why I proposed adding a seq-counter to the htab element and checking the seq-counter during lookup, so that at least the lookup will not return -ENOENT when racing with a concurrent overwriting update.
> On 27. Feb 2025, at 04:17, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, Feb 26, 2025 at 6:43 PM Hou Tao <houtao@huaweicloud.com> wrote: >> >>>> >>>> lookup procedure A >>>> A: find the old element (instead of the new old) >>>> >>>> update procedure B >>>> B: delete the old element >>>> update procedure C on the same CPU: >>>> C: reuse the old element (overwrite its key and insert in >>>> the same bucket) >>>> >>>> A: the key is mismatched and return -ENOENT. >>> This is fine. It's just normal reuse. >>> Orthogonal to 'update as insert+delete' issue. >> >> OK. However, it will break the lookup procedure because it expects it >> will return an valid result instead of -ENOENT. > > What do you mean 'breaks the lookup' ? > lookup_elem_raw() matches hash, and then it memcmp(key), > if the element is reused anything can happen. > Either it succeeds in memcmp() and returns an elem, > or miscompares in memcmp(). > Both are expected, because elems are reused in place. > > And this behavior is expected and not-broken, > because bpf prog that does lookup on one cpu and deletes > the same element on the other cpu is asking for trouble. > bpf infra guarantees the safety of the kernel. > It doesn't guarantee that bpf progs are sane. > >>> It's been a long time since I looked into rcu_nulls details. >>> Pls help me understand that this new replace_rcu_nulls() >>> is correct from nulls pov, >>> If it is then this patch set may be the right answer to non-atomic update. >> >> If I understand correctly, only the manipulations of ->first pointer and >> ->next pointer need to take care of nulls pointer. > > hmm. I feel we're still talking past each other. > See if (get_nulls_value() == ...) in lookup_nulls_elem_raw(). > It's there because of reuse. The lookup can start in one bucket > and finish in another. > >>> >>> And for the future, please please focus on "why" part in >>> the cover letter and commit logs instead of "what". 
>>> >>> Since the only thing I got from the log was: >>> "Currently, the update is not atomic >>> because the overwrite of existing element happens in a two-steps way, >>> but the support of atomic update is feasible". >>> >>> "is feasible" doesn't explain "why". >>> >>> Link to xdp-newbie question is nice for additional context, >>> but reviewers should not need to go and read some thread somewhere >>> to understand "why" part. >>> All of it should be in the commit log. >> >> OK. My original thought is that is a reported problem, so an extra link >> will be enough. Will try to add more context next time. >>> >>>> map may still be incorrect (as shown long time ago [1]), so I think >>>> maybe for other types of map, the atomic update doesn't matter too much. >>>> >>>> [1]: >>>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/ >>> A thread from 3 years ago ?! Sorry, it's not helpful to ask >>> people to page-in such an old context with lots of follow ups >>> that may or may not be relevant today. >> Let me reuse part of the diagram above to explain how does the lookup >> procedure may return a incorrect value: >> >> lookup procedure A >> A: find the old element (instead of the new element) >> >> >> update procedure B >> B: delete the old element >> update procedure C on the same CPU: >> >> >> A: the key is matched and return the value in the element >> >> C: reuse the old element (overwrite its key and value) >> >> A: read the value (it is incorrect, because it has been reused and >> overwritten) > > ... and it's fine. It's by design. It's an element reuse behavior. > > Long ago hashmap had two modes: prealloc (default) and > NO_PREALLOC (call_rcu + kfree) > > The call_rcu part was there to make things safe. > The memory cannot be kfree-ed to the kernel until RCU GP. > With bpf_mem_alloc hashmap elements are freed back to bpf_ma > right away. 
Hashmap is doing bpf_mem_cache_free() We (emnify.com) missed this change and kept writing code under the assumption that NO_PREALLOC implies rcu. Is there something we can do as of today to reduce the likelihood of an item getting reused immediately? We are concerned with lookups yielding bogus results when racing with updates. Worse, a program could corrupt an unrelated item when writing via a pointer obtained from lookup. You wrote that "lookup on one cpu and deletes the same element on the other cpu is asking for trouble." It puzzles me since user space updating a map while (say) a TC program is consulting the map to make a routing decision looks like a supported and widespread use case. For us, implications vary in severity, e.g.:

- 1-in-1e? packet mis-delivered (bogus lookup: LOW)
- a tenant getting billed for a packet of another tenant delivered via satellite and costing USD 0.10 (writing into unrelated item: LOW)
- a network flow state corrupted (writing into unrelated item: MEDIUM)

We need to audit our code to ensure that e.g. a flow state getting corrupted self-corrects and the damage doesn't spread. It would be nice if we as eBPF users could decide whether we wish to live dangerously or prefer to trade speed for safety, on a case-by-case basis. > (instead of bpf_mem_cache_free_rcu()) because users need speed. > So since 2022 both prealloc and no_prealloc reuse elements. > We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP > that will use _rcu() flavor of freeing into bpf_ma, > but it has to have a strong reason. > And soon as we add it the default with prealloc would need > to use call_rcu() too, right? > and that becomes nightmare, since bpf prog can easily DoS the system. > Even if we use bpf_mem_cache_free_rcu() only, the DoS is a concern. > Unlike new things like bpf_obj_new/obj_drop the hashmap > is unpriv, so concerns are drastically different. Would it be an option to gate F_REUSE_AFTER_RCU_GP under CAP_BPF?
It looks like sleepable programs would need to call bpf_rcu_read_lock() explicitly to reap the benefits, but why not.
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 4a9eeb7aef85..a28b11ce74c6 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 		goto err;
 	}
 
-	/* add new element to the head of the list, so that
-	 * concurrent search will find it before old elem
-	 */
-	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
-	if (l_old) {
-		hlist_nulls_del_rcu(&l_old->hash_node);
+	if (!l_old) {
+		hlist_nulls_add_head_rcu(&l_new->hash_node, head);
+	} else {
+		/* Replace the old element atomically, so that
+		 * concurrent search will find either the new element or
+		 * the old element.
+		 */
+		hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
 
 		/* l_old has already been stashed in htab->extra_elems, free
 		 * its special fields before it is available for reuse. Also
Currently, the update of an existing element in a hash map involves two steps:
1) insert the new element at the head of the hash list
2) remove the old element

It is possible that a concurrent lookup operation may fail to find either the old element or the new element if the lookup starts before the addition and continues after the removal.

Therefore, replace the two-step update with an atomic update. After the change, the update will be atomic from the perspective of the lookup operation: it will find either the old element or the new element.

Signed-off-by: Hou Tao <hotforest@gmail.com>
---
 kernel/bpf/hashtab.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)