Message ID | 20230711150906.466434-1-mmpgouride@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Series | [v2] docs/RCU: Bring back smp_wmb() |
On Tue, Jul 11, 2023 at 03:09:06PM +0000, Alan Huang wrote:
> The objects are allocated with SLAB_TYPESAFE_BY_RCU, and there is
> n->next = first within hlist_add_head_rcu() before rcu_assign_pointer(),
> which modifies obj->obj_node.next. There may be readers holding the
> reference of obj in lockless_lookup, and when the updater modifies ->next,
> readers can see the change immediately because of SLAB_TYPESAFE_BY_RCU.
>
> There are two memory orderings required in the insertion algorithm:
> we need to make sure obj->key is updated before obj->obj_node.next
> and obj->refcnt; atomic_set_release() is not enough to provide the
> required memory barrier.
>
> Signed-off-by: Alan Huang <mmpgouride@gmail.com>

This is an interesting one!!!

Now I am having a hard time believing that the smp_rmb() suffices.

> ---
> Changelog:
>  v1 -> v2: Use _ONCE to protect obj->key.
>
>  Documentation/RCU/rculist_nulls.rst | 21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst
> index 21e40fcc08de..2a9f5a63d334 100644
> --- a/Documentation/RCU/rculist_nulls.rst
> +++ b/Documentation/RCU/rculist_nulls.rst
> @@ -47,7 +47,7 @@ objects, which is having below type.
>        * reuse these object before the RCU grace period, we
>        * must check key after getting the reference on object
>        */
> -     if (obj->key != key) { // not the object we expected
> +     if (READ_ONCE(obj->key) != key) { // not the object we expected
>        put_ref(obj);
>        rcu_read_unlock();
>        goto begin;
> @@ -64,10 +64,10 @@ but a version with an additional memory barrier (smp_rmb())
>   {
>     struct hlist_node *node, *next;
>     for (pos = rcu_dereference((head)->first);
> -        pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
> +        pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) &&

Suppose that lockless_lookup() is delayed just before fetching pos->next,
and that there were 17 more nodes to search in the list.

Then consider the following sequence of events:

o	The updater deletes this same node and kmem_cache_free()s it.

o	Another updater kmem_cache_alloc()s that same memory and
	inserts it into an empty hash chain with a different key.

o	Then lockless_lookup() fetches pos->next and sees a NULL pointer,
	thus failing to search the remaining 17 nodes in the list,
	one of which had the desired key value.

o	The lookup algorithm resumes and sees the NULL return from
	lockless_lookup(), and ends up with a NULL obj.

And this happens even with the strongest possible ordering
everywhere.

OK, yes, it is late on Friday.  So what am I missing here?

Independent of that, does hlist_add_head_rcu() need to replace its
"n->next = first" with "WRITE_ONCE(n->next, first)"?

							Thanx, Paul

>          ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; });
>        pos = rcu_dereference(next))
> -      if (obj->key == key)
> +      if (READ_ONCE(obj->key) == key)
>         return obj;
>   return NULL;
> }
> @@ -111,8 +111,13 @@ detect the fact that it missed following items in original chain.
>    */
>   obj = kmem_cache_alloc(...);
>   lock_chain(); // typically a spin_lock()
> - obj->key = key;
> - atomic_set_release(&obj->refcnt, 1); // key before refcnt
> + WRITE_ONCE(obj->key, key);
> + /*
> +  * We need to make sure obj->key is updated before obj->obj_node.next
> +  * and obj->refcnt.
> +  */
> + smp_wmb();
> + atomic_set(&obj->refcnt, 1);
>   hlist_add_head_rcu(&obj->obj_node, list);
>   unlock_chain(); // typically a spin_unlock()
>
> @@ -165,12 +170,12 @@ Note that using hlist_nulls means the type of 'obj_node' field of
> begin:
>   rcu_read_lock();
>   hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) {
> -   if (obj->key == key) {
> +   if (READ_ONCE(obj->key) == key) {
>       if (!try_get_ref(obj)) { // might fail for free objects
>         rcu_read_unlock();
>         goto begin;
>       }
> -     if (obj->key != key) { // not the object we expected
> +     if (READ_ONCE(obj->key) != key) { // not the object we expected
>         put_ref(obj);
>         rcu_read_unlock();
>         goto begin;
> @@ -206,7 +211,7 @@ hlist_add_head_rcu().
>    */
>   obj = kmem_cache_alloc(cachep);
>   lock_chain(); // typically a spin_lock()
> - obj->key = key;
> + WRITE_ONCE(obj->key, key);
>   atomic_set_release(&obj->refcnt, 1); // key before refcnt
>   /*
>    * insert obj in RCU way (readers might be traversing chain)
> --
> 2.34.1
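For readers following the "n->next = first" question above, here is a simplified
sketch of hlist_add_head_rcu() as it appears in include/linux/rculist.h around the
time of this thread (paraphrased; exact details may vary by kernel version). The
point is that the ->next store is a plain assignment made before the
rcu_assign_pointer() that publishes the node, and with SLAB_TYPESAFE_BY_RCU a
reader still holding a pointer into the recycled object can observe that store:

    static inline void hlist_add_head_rcu(struct hlist_node *n,
                                          struct hlist_head *h)
    {
            struct hlist_node *first = h->first;

            n->next = first;              /* plain store discussed above */
            n->pprev = &h->first;
            /* Publishes n; orders the stores above before the pointer update. */
            rcu_assign_pointer(hlist_first_rcu(h), n);
            if (first)
                    first->pprev = &n->next;
    }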
> On Jul 15, 2023, at 07:23, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> And this happens even with the strongest possible ordering
> everywhere.
>
> OK, yes, it is late on Friday.  So what am I missing here?

You missed nothing!

The lockless_lookup should not be a function, but a macro like
hlist_for_each_entry_rcu.

> Independent of that, does hlist_add_head_rcu() need to replace its
> "n->next = first" with "WRITE_ONCE(n->next, first)"?

I think users who want to use hlist with SLAB_TYPESAFE_BY_RCU should use
rculist_nulls?

I didn't find a case using hlist with SLAB_TYPESAFE_BY_RCU, but I did find
a case using list with SLAB_TYPESAFE_BY_RCU in drivers/gpu/drm/i915; that
driver also doesn't use _ONCE on the fields of the objects allocated with
SLAB_TYPESAFE_BY_RCU.
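To see the window Paul describes, it may help to view the reader from the quoted
hunks as one unit. This is the documentation's lockless_lookup() reassembled as a
sketch (the head/key parameters and the declarations of pos and obj are added here
for readability; struct obj is the document's example type), with comments marking
where an SLAB_TYPESAFE_BY_RCU reuse strands the reader:

    static struct obj *lockless_lookup(struct hlist_head *head, u32 key)
    {
            struct hlist_node *pos, *next;
            struct obj *obj;

            for (pos = rcu_dereference(head->first);
                 /*
                  * If the object containing 'pos' was freed and recycled into
                  * a different chain, this ->next load follows that other
                  * chain (or sees NULL), silently skipping the remainder of
                  * the original chain.
                  */
                 pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) &&
                 ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; });
                 pos = rcu_dereference(next))
                    if (READ_ONCE(obj->key) == key)
                            return obj;
            return NULL;    /* can be a false "not found" in the scenario above */
    }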
On Sat, Jul 15, 2023 at 08:50:23AM +0800, Alan Huang wrote:
>> OK, yes, it is late on Friday.  So what am I missing here?
>
> You missed nothing!
>
> The lockless_lookup should not be a function, but a macro like
> hlist_for_each_entry_rcu.

How would you fix this using a macro?

>> Independent of that, does hlist_add_head_rcu() need to replace its
>> "n->next = first" with "WRITE_ONCE(n->next, first)"?
>
> I think users who want to use hlist with SLAB_TYPESAFE_BY_RCU should use
> rculist_nulls?

I believe that you are correct.  Would you like to propose a patch, or
would you rather I put something together?  My current thought is to
keep the examples, but to show why the one with smp_rmb() is broken.

> I didn't find a case using hlist with SLAB_TYPESAFE_BY_RCU, but I did find
> a case using list with SLAB_TYPESAFE_BY_RCU in drivers/gpu/drm/i915; that
> driver also doesn't use _ONCE on the fields of the objects allocated with
> SLAB_TYPESAFE_BY_RCU.

Feel free to send them a patch, though I cannot speak for their
reception of it.

							Thanx, Paul
> On Jul 16, 2023, at 01:19, Paul E. McKenney <paulmck@kernel.org> wrote:
>
>> The lockless_lookup should not be a function, but a macro like
>> hlist_for_each_entry_rcu.
>
> How would you fix this using a macro?

With additional detection code. A moved object (in another chain) will have
a different slot. (I have sent patch v3.)

>> I think users who want to use hlist with SLAB_TYPESAFE_BY_RCU should use
>> rculist_nulls?
>
> I believe that you are correct.  Would you like to propose a patch, or
> would you rather I put something together?  My current thought is to

Feel free to add.

One thing I think would be useful is to tell readers where the 'next' is.
The document mentions 'next' many times, but it's hard for me, as a reader,
to realize that the 'next' is within hlist_add_head_rcu(). (I have no idea
where to put the hint.)

> keep the examples, but to show why the one with smp_rmb() is broken.

I think the example needs to be fixed. :)
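Alan's v3 patch is not quoted in this thread, so the following is only a rough,
hypothetical sketch of the kind of detection he describes; the names obj_hashfn(),
table and slot are invented here for illustration, and try_get_ref()/put_ref() are
the document's example helpers. The idea is that a recycled object now hashes to a
different slot, which the reader can notice and use to restart:

    /*
     * Hypothetical sketch, not the actual v3 code: restart the lookup when a
     * visited object no longer hashes to the slot being searched, meaning it
     * was freed and recycled into another chain.
     */
    begin:
            rcu_read_lock();
            hlist_for_each_entry_rcu(obj, &table[slot], obj_node) {
                    if (obj_hashfn(READ_ONCE(obj->key)) != slot) {
                            /* obj moved: we may be walking the wrong chain */
                            rcu_read_unlock();
                            goto begin;
                    }
                    if (READ_ONCE(obj->key) != key)
                            continue;
                    if (!try_get_ref(obj)) {        /* might fail for free objects */
                            rcu_read_unlock();
                            goto begin;
                    }
                    if (READ_ONCE(obj->key) != key) {  /* re-check after getting the ref */
                            put_ref(obj);
                            rcu_read_unlock();
                            goto begin;
                    }
                    rcu_read_unlock();
                    return obj;
            }
            rcu_read_unlock();
            return NULL;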
On Sun, Jul 16, 2023 at 07:21:28PM +0800, Alan Huang wrote:
>> keep the examples, but to show why the one with smp_rmb() is broken.
>
> I think the example needs to be fixed. :)

Even better!  I will take a look, but in the meantime, would you be
interested in updating the wording to explain how the back-pointer works?

(Looks similar to the is_a_nulls() pointer, but in each element instead of
just at the end.  One advantage is the ability to detect a move mid-list,
though that is not a big deal in well-tuned hash tables, which tend to
have short hash chains.  The need to move elements to the front of the
destination list remains, though in both cases only if it has been less
than a grace period since the last move.)

							Thanx, Paul
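For readers unfamiliar with the nulls scheme Paul contrasts with, this is a lightly
condensed sketch of the hlist_nulls lookup pattern from the very document under
discussion, Documentation/RCU/rculist_nulls.rst (obj, key, head, slot and
try_get_ref()/put_ref() are the document's example names; the document has the
authoritative version). Each chain ends in a "nulls" marker encoding the slot
number, so a reader dragged onto another chain sees an unexpected value and restarts:

    begin:
            rcu_read_lock();
            hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) {
                    if (READ_ONCE(obj->key) == key) {
                            if (!try_get_ref(obj)) {   /* might fail for free objects */
                                    rcu_read_unlock();
                                    goto begin;
                            }
                            if (READ_ONCE(obj->key) != key) {  /* not the object we expected */
                                    put_ref(obj);
                                    rcu_read_unlock();
                                    goto begin;
                            }
                            goto out;
                    }
            }
            /*
             * The traversal ended on a nulls marker.  If it is not the marker
             * for this slot, the reader moved to another chain: restart.
             */
            if (get_nulls_value(node) != slot) {
                    rcu_read_unlock();
                    goto begin;
            }
            obj = NULL;
    out:
            rcu_read_unlock();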
> On Jul 18, 2023, at 00:02, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> Even better!  I will take a look, but in the meantime, would you be
> interested in updating the wording to explain how the back-pointer works?

Which document needs to be updated?

And is there anything that I can refer to? It's the first time I have ever
heard about it.

> (Looks similar to the is_a_nulls() pointer, but in each element instead of
> just at the end.  One advantage is the ability to detect a move mid-list,
> though that is not a big deal in well-tuned hash tables, which tend to
> have short hash chains.  The need to move elements to the front of the
> destination list remains, though in both cases only if it has been less
> than a grace period since the last move.)

Looks like I need to learn it first. :)
On Tue, Jul 18, 2023 at 01:53:10AM +0800, Alan Huang wrote:
>> Even better!  I will take a look, but in the meantime, would you be
>> interested in updating the wording to explain how the back-pointer works?
>
> Which document needs to be updated?
>
> And is there anything that I can refer to? It's the first time I have ever
> heard about it.

Documentation/RCU/rculist_nulls.rst, the one that you are updating.

There admittedly isn't a whole lot of commentary.

> Looks like I need to learn it first. :)

Well, you wrote the code, so...  ;-)

							Thanx, Paul
> On Jul 18, 2023, at 03:06, Paul E. McKenney <paulmck@kernel.org> wrote:
>
>> Looks like I need to learn it first. :)
>
> Well, you wrote the code, so...  ;-)

If I understand correctly, it works only for 64-bit machines?

And the number of slots of the hash table will be limited?
On Wed, Jul 19, 2023 at 12:08:33AM +0800, Alan Huang wrote: > > > 2023年7月18日 03:06,Paul E. McKenney <paulmck@kernel.org> 写道: > > > > On Tue, Jul 18, 2023 at 01:53:10AM +0800, Alan Huang wrote: > >>> 2023年7月18日 00:02,Paul E. McKenney <paulmck@kernel.org> 写道: > >>> On Sun, Jul 16, 2023 at 07:21:28PM +0800, Alan Huang wrote: > >>>>> 2023年7月16日 01:19,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>> On Sat, Jul 15, 2023 at 08:50:23AM +0800, Alan Huang wrote: > >>>>>>> 2023年7月15日 07:23,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>>>> On Tue, Jul 11, 2023 at 03:09:06PM +0000, Alan Huang wrote: > >>>>>>>> The objects are allocated with SLAB_TYPESAFE_BY_RCU, and there is > >>>>>>>> n->next = first within hlist_add_head_rcu() before rcu_assign_pointer(), > >>>>>>>> which modifies obj->obj_node.next. There may be readers holding the > >>>>>>>> reference of obj in lockless_lookup, and when updater modifies ->next, > >>>>>>>> readers can see the change immediately because ofSLAB_TYPESAFE_BY_RCU. > >>>>>>>> > >>>>>>>> There are two memory ordering required in the insertion algorithm, > >>>>>>>> we need to make sure obj->key is updated before obj->obj_node.next > >>>>>>>> and obj->refcnt, atomic_set_release is not enough to provide the > >>>>>>>> required memory barrier. > >>>>>>>> > >>>>>>>> Signed-off-by: Alan Huang <mmpgouride@gmail.com> > >>>>>>> > >>>>>>> This is an interesting one!!! > >>>>>>> > >>>>>>> Now I am having a hard time believing that the smp_rmb() suffices. > >>>>>>> > >>>>>>>> --- > >>>>>>>> Changelog: > >>>>>>>> v1 -> v2: Use _ONCE to protect obj->key. > >>>>>>>> > >>>>>>>> Documentation/RCU/rculist_nulls.rst | 21 +++++++++++++-------- > >>>>>>>> 1 file changed, 13 insertions(+), 8 deletions(-) > >>>>>>>> > >>>>>>>> diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst > >>>>>>>> index 21e40fcc08de..2a9f5a63d334 100644 > >>>>>>>> --- a/Documentation/RCU/rculist_nulls.rst > >>>>>>>> +++ b/Documentation/RCU/rculist_nulls.rst > >>>>>>>> @@ -47,7 +47,7 @@ objects, which is having below type. > >>>>>>>> * reuse these object before the RCU grace period, we > >>>>>>>> * must check key after getting the reference on object > >>>>>>>> */ > >>>>>>>> - if (obj->key != key) { // not the object we expected > >>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected > >>>>>>>> put_ref(obj); > >>>>>>>> rcu_read_unlock(); > >>>>>>>> goto begin; > >>>>>>>> @@ -64,10 +64,10 @@ but a version with an additional memory barrier (smp_rmb()) > >>>>>>>> { > >>>>>>>> struct hlist_node *node, *next; > >>>>>>>> for (pos = rcu_dereference((head)->first); > >>>>>>>> - pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) && > >>>>>>>> + pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) && > >>>>>>> > >>>>>>> Suppose that lockless_lookup() is delayed just before fetching pos->next, > >>>>>>> and that there were 17 more node to search in the list. > >>>>>>> > >>>>>>> Then consider the following sequence of events: > >>>>>>> > >>>>>>> o The updater deletes this same node and kmem_cache_free()s it. > >>>>>>> > >>>>>>> o Another updater kmem_cache_alloc()s that same memory and > >>>>>>> inserts it into an empty hash chain with a different key. > >>>>>>> > >>>>>>> o Then lockless_lookup() fetches pos->next and sees a NULL pointer, > >>>>>>> thus failing to search the remaining 17 nodes in the list, > >>>>>>> one of which had the desired key value. 
> >>>>>>> > >>>>>>> o The lookup algorithm resumes and sees the NULL return from > >>>>>>> lockless_lookup(), and ends up with a NULL obj. > >>>>>>> > >>>>>>> And this happens even with the strongest possible ordering > >>>>>>> everywhere. > >>>>>>> > >>>>>>> OK, yes, it is late on Friday. So what am I missing here? > >>>>>> > >>>>>> You missed nothing! > >>>>>> > >>>>>> The lockless_lockup should not be a function, but a macro like hlist_for_each_entry_rcu. > >>>>> > >>>>> How would you fix this using a macro? > >>>> > >>>> With additional detection code. A moved object (in another chain) will have a different slot. > >>>> (I have sent patch v3. ) > >>>> > >>>>> > >>>>>>> Independent of that, does hlist_add_head_rcu() need to replace its > >>>>>>> "n->next = first" with "WRITE_ONCE(n->next, first)"? > >>>>>> > >>>>>> I think users who want to use hlist with SLAB_TYPESAFE_BY_RCU should use rculist_nulls? > >>>>> > >>>>> I believe that you are correct. Would you like to propose a patch, or > >>>>> would you rather I put something together? My current thought is to > >>>> > >>>> Feel free to add. > >>>> > >>>> One thing I think would be useful is to tell readers where the ‘next' is. > >>>> The document mentions ’next’ many times, but it’s hard for me, as a reader, to realize that > >>>> the ‘next' is within hlist_add_head_rcu(). (I have no idea where to put the hint.) > >>>> > >>>> > >>>>> keep the examples, but to show why the one with smp_rmb() is broken. > >>>> > >>>> I think the example needs to be fixed. :) > >>> > >>> Even better! I will take a look, but in the meantime, would you be > >>> interested in updating the wording to explain how the back-pointer works? > >> > >> Which document needs to be updated? > >> And is there anything that I can refer to? It’s the first time I have ever heard about it. > > > > Documentation/RCU/rculist_nulls.rst, the one that you are updating. > > > > There admittedly isn't a whole lot of commentary. > > > >>> (Looks similar to the is_a_nulls() pointer, but in each element instead of > >>> just at the end. One advantage is the ability to detect a move mid-list, > >>> though that is not a big deal in well-tuned hash tables, which tend to > >>> have short hash chains. The need to move elements to the front of the > >>> destination list remains, though in both cases only if it has been less > >>> than a grace period since the last move.) > >> > >> Looks like that I need to learn it first. :) > > > > Well, you wrote the code, so... ;-) > > If I understand correctly, it works only for 64-bit machines? > > And the number of slots of the hash table will be limited? You are asking about the is_a_nulls() value? If so, it works on both 32-bit and 64-bit machines. They each have enough bits for the nulls value to cover all possible two-byte objects in the full address space. If that wasn't what you were asking, please help me with your question. Thanx, Paul > > Thanx, Paul > > > >>> Thanx, Paul > >>> > >>>>>> I didn’t find a case using hlist with SLAB_TYPESAFE_BY_RCU, but I did find a case using list > >>>>>> with SLAB_TYPESAFE_BY_RCU in drivers/gpu/drm/i915, the driver also doesn’t use _ONCE > >>>>>> on the fields of the objects allocated with SLAB_TYPESAFE_BY_RCU. > >>>>> > >>>>> Feel free to send them a patch, though I cannot speak for their > >>>>> reception of it. 
> >>>>> > >>>>> Thanx, Paul > >>>>> > >>>>>>> Thanx, Paul > >>>>>>> > >>>>>>>> ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; }); > >>>>>>>> pos = rcu_dereference(next)) > >>>>>>>> - if (obj->key == key) > >>>>>>>> + if (READ_ONCE(obj->key) == key) > >>>>>>>> return obj; > >>>>>>>> return NULL; > >>>>>>>> } > >>>>>>>> @@ -111,8 +111,13 @@ detect the fact that it missed following items in original chain. > >>>>>>>> */ > >>>>>>>> obj = kmem_cache_alloc(...); > >>>>>>>> lock_chain(); // typically a spin_lock() > >>>>>>>> - obj->key = key; > >>>>>>>> - atomic_set_release(&obj->refcnt, 1); // key before refcnt > >>>>>>>> + WRITE_ONCE(obj->key, key); > >>>>>>>> + /* > >>>>>>>> + * We need to make sure obj->key is updated before obj->obj_node.next > >>>>>>>> + * and obj->refcnt. > >>>>>>>> + */ > >>>>>>>> + smp_wmb(); > >>>>>>>> + atomic_set(&obj->refcnt, 1); > >>>>>>>> hlist_add_head_rcu(&obj->obj_node, list); > >>>>>>>> unlock_chain(); // typically a spin_unlock() > >>>>>>>> > >>>>>>>> @@ -165,12 +170,12 @@ Note that using hlist_nulls means the type of 'obj_node' field of > >>>>>>>> begin: > >>>>>>>> rcu_read_lock(); > >>>>>>>> hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) { > >>>>>>>> - if (obj->key == key) { > >>>>>>>> + if (READ_ONCE(obj->key) == key) { > >>>>>>>> if (!try_get_ref(obj)) { // might fail for free objects > >>>>>>>> rcu_read_unlock(); > >>>>>>>> goto begin; > >>>>>>>> } > >>>>>>>> - if (obj->key != key) { // not the object we expected > >>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected > >>>>>>>> put_ref(obj); > >>>>>>>> rcu_read_unlock(); > >>>>>>>> goto begin; > >>>>>>>> @@ -206,7 +211,7 @@ hlist_add_head_rcu(). > >>>>>>>> */ > >>>>>>>> obj = kmem_cache_alloc(cachep); > >>>>>>>> lock_chain(); // typically a spin_lock() > >>>>>>>> - obj->key = key; > >>>>>>>> + WRITE_ONCE(obj->key, key); > >>>>>>>> atomic_set_release(&obj->refcnt, 1); // key before refcnt > >>>>>>>> /* > >>>>>>>> * insert obj in RCU way (readers might be traversing chain) > >>>>>>>> -- > >>>>>>>> 2.34.1 > >
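The nulls value being discussed here is the end marker of an hlist_nulls chain: each bucket head is initialized with its own slot number encoded in that marker, so a lookup that gets dragged onto another chain by object reuse finishes on an unexpected nulls value and can restart. A minimal sketch in the style of the final lookup example in rculist_nulls.rst, reusing that document's hypothetical helpers (table[], slot, try_get_ref() and put_ref()):

  head = &table[slot];
  begin:
  rcu_read_lock();
  hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) {
    if (READ_ONCE(obj->key) == key) {
      if (!try_get_ref(obj)) {          // might fail for free objects
        rcu_read_unlock();
        goto begin;
      }
      if (READ_ONCE(obj->key) != key) { // not the object we expected
        put_ref(obj);
        rcu_read_unlock();
        goto begin;
      }
      goto out;
    }
  }
  /*
   * The chain ended on a nulls marker.  If it encodes a slot other
   * than the one we started from, some object we traversed was
   * recycled into another chain, so restart the lookup.
   */
  if (get_nulls_value(node) != slot) {
    rcu_read_unlock();
    goto begin;
  }
  obj = NULL;
  out:
  rcu_read_unlock();

Because the nulls value is just an unsigned long shifted into the pointer, this works the same way on 32-bit and 64-bit machines.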
> 2023年7月22日 上午6:49,Paul E. McKenney <paulmck@kernel.org> 写道: > > On Wed, Jul 19, 2023 at 12:08:33AM +0800, Alan Huang wrote: >> >>> 2023年7月18日 03:06,Paul E. McKenney <paulmck@kernel.org> 写道: >>> >>> On Tue, Jul 18, 2023 at 01:53:10AM +0800, Alan Huang wrote: >>>>> 2023年7月18日 00:02,Paul E. McKenney <paulmck@kernel.org> 写道: >>>>> On Sun, Jul 16, 2023 at 07:21:28PM +0800, Alan Huang wrote: >>>>>>> 2023年7月16日 01:19,Paul E. McKenney <paulmck@kernel.org> 写道: >>>>>>> On Sat, Jul 15, 2023 at 08:50:23AM +0800, Alan Huang wrote: >>>>>>>>> 2023年7月15日 07:23,Paul E. McKenney <paulmck@kernel.org> 写道: >>>>>>>>> On Tue, Jul 11, 2023 at 03:09:06PM +0000, Alan Huang wrote: >>>>>>>>>> The objects are allocated with SLAB_TYPESAFE_BY_RCU, and there is >>>>>>>>>> n->next = first within hlist_add_head_rcu() before rcu_assign_pointer(), >>>>>>>>>> which modifies obj->obj_node.next. There may be readers holding the >>>>>>>>>> reference of obj in lockless_lookup, and when updater modifies ->next, >>>>>>>>>> readers can see the change immediately because ofSLAB_TYPESAFE_BY_RCU. >>>>>>>>>> >>>>>>>>>> There are two memory ordering required in the insertion algorithm, >>>>>>>>>> we need to make sure obj->key is updated before obj->obj_node.next >>>>>>>>>> and obj->refcnt, atomic_set_release is not enough to provide the >>>>>>>>>> required memory barrier. >>>>>>>>>> >>>>>>>>>> Signed-off-by: Alan Huang <mmpgouride@gmail.com> >>>>>>>>> >>>>>>>>> This is an interesting one!!! >>>>>>>>> >>>>>>>>> Now I am having a hard time believing that the smp_rmb() suffices. >>>>>>>>> >>>>>>>>>> --- >>>>>>>>>> Changelog: >>>>>>>>>> v1 -> v2: Use _ONCE to protect obj->key. >>>>>>>>>> >>>>>>>>>> Documentation/RCU/rculist_nulls.rst | 21 +++++++++++++-------- >>>>>>>>>> 1 file changed, 13 insertions(+), 8 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst >>>>>>>>>> index 21e40fcc08de..2a9f5a63d334 100644 >>>>>>>>>> --- a/Documentation/RCU/rculist_nulls.rst >>>>>>>>>> +++ b/Documentation/RCU/rculist_nulls.rst >>>>>>>>>> @@ -47,7 +47,7 @@ objects, which is having below type. >>>>>>>>>> * reuse these object before the RCU grace period, we >>>>>>>>>> * must check key after getting the reference on object >>>>>>>>>> */ >>>>>>>>>> - if (obj->key != key) { // not the object we expected >>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected >>>>>>>>>> put_ref(obj); >>>>>>>>>> rcu_read_unlock(); >>>>>>>>>> goto begin; >>>>>>>>>> @@ -64,10 +64,10 @@ but a version with an additional memory barrier (smp_rmb()) >>>>>>>>>> { >>>>>>>>>> struct hlist_node *node, *next; >>>>>>>>>> for (pos = rcu_dereference((head)->first); >>>>>>>>>> - pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) && >>>>>>>>>> + pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) && >>>>>>>>> >>>>>>>>> Suppose that lockless_lookup() is delayed just before fetching pos->next, >>>>>>>>> and that there were 17 more node to search in the list. >>>>>>>>> >>>>>>>>> Then consider the following sequence of events: >>>>>>>>> >>>>>>>>> o The updater deletes this same node and kmem_cache_free()s it. >>>>>>>>> >>>>>>>>> o Another updater kmem_cache_alloc()s that same memory and >>>>>>>>> inserts it into an empty hash chain with a different key. >>>>>>>>> >>>>>>>>> o Then lockless_lookup() fetches pos->next and sees a NULL pointer, >>>>>>>>> thus failing to search the remaining 17 nodes in the list, >>>>>>>>> one of which had the desired key value. 
>>>>>>>>> >>>>>>>>> o The lookup algorithm resumes and sees the NULL return from >>>>>>>>> lockless_lookup(), and ends up with a NULL obj. >>>>>>>>> >>>>>>>>> And this happens even with the strongest possible ordering >>>>>>>>> everywhere. >>>>>>>>> >>>>>>>>> OK, yes, it is late on Friday. So what am I missing here? >>>>>>>> >>>>>>>> You missed nothing! >>>>>>>> >>>>>>>> The lockless_lockup should not be a function, but a macro like hlist_for_each_entry_rcu. >>>>>>> >>>>>>> How would you fix this using a macro? >>>>>> >>>>>> With additional detection code. A moved object (in another chain) will have a different slot. >>>>>> (I have sent patch v3. ) >>>>>> >>>>>>> >>>>>>>>> Independent of that, does hlist_add_head_rcu() need to replace its >>>>>>>>> "n->next = first" with "WRITE_ONCE(n->next, first)"? >>>>>>>> >>>>>>>> I think users who want to use hlist with SLAB_TYPESAFE_BY_RCU should use rculist_nulls? >>>>>>> >>>>>>> I believe that you are correct. Would you like to propose a patch, or >>>>>>> would you rather I put something together? My current thought is to >>>>>> >>>>>> Feel free to add. >>>>>> >>>>>> One thing I think would be useful is to tell readers where the ‘next' is. >>>>>> The document mentions ’next’ many times, but it’s hard for me, as a reader, to realize that >>>>>> the ‘next' is within hlist_add_head_rcu(). (I have no idea where to put the hint.) >>>>>> >>>>>> >>>>>>> keep the examples, but to show why the one with smp_rmb() is broken. >>>>>> >>>>>> I think the example needs to be fixed. :) >>>>> >>>>> Even better! I will take a look, but in the meantime, would you be >>>>> interested in updating the wording to explain how the back-pointer works? >>>> >>>> Which document needs to be updated? >>>> And is there anything that I can refer to? It’s the first time I have ever heard about it. >>> >>> Documentation/RCU/rculist_nulls.rst, the one that you are updating. >>> >>> There admittedly isn't a whole lot of commentary. >>> >>>>> (Looks similar to the is_a_nulls() pointer, but in each element instead of >>>>> just at the end. One advantage is the ability to detect a move mid-list, >>>>> though that is not a big deal in well-tuned hash tables, which tend to >>>>> have short hash chains. The need to move elements to the front of the >>>>> destination list remains, though in both cases only if it has been less >>>>> than a grace period since the last move.) >>>> >>>> Looks like that I need to learn it first. :) >>> >>> Well, you wrote the code, so... ;-) >> >> If I understand correctly, it works only for 64-bit machines? >> >> And the number of slots of the hash table will be limited? > > You are asking about the is_a_nulls() value? If so, it works on both > 32-bit and 64-bit machines. They each have enough bits for the nulls > value to cover all possible two-byte objects in the full address space. > > If that wasn't what you were asking, please help me with your question. I’m asking the ‘back-pointer’, which "Looks similar to the is_a_nulls() pointer, but in each element instead of just at the end” If I understand correctly, we need to store the nulls in the upper unused bits. And we only have several bits unused. One advantage I can think of is that it will improve the performance once we have something like the Intel Upper Address Ignore (UAI), which also works only for 64-bit machine. 
> > Thanx, Paul > >>> Thanx, Paul >>> >>>>> Thanx, Paul >>>>> >>>>>>>> I didn’t find a case using hlist with SLAB_TYPESAFE_BY_RCU, but I did find a case using list >>>>>>>> with SLAB_TYPESAFE_BY_RCU in drivers/gpu/drm/i915, the driver also doesn’t use _ONCE >>>>>>>> on the fields of the objects allocated with SLAB_TYPESAFE_BY_RCU. >>>>>>> >>>>>>> Feel free to send them a patch, though I cannot speak for their >>>>>>> reception of it. >>>>>>> >>>>>>> Thanx, Paul >>>>>>> >>>>>>>>> Thanx, Paul >>>>>>>>> >>>>>>>>>> ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; }); >>>>>>>>>> pos = rcu_dereference(next)) >>>>>>>>>> - if (obj->key == key) >>>>>>>>>> + if (READ_ONCE(obj->key) == key) >>>>>>>>>> return obj; >>>>>>>>>> return NULL; >>>>>>>>>> } >>>>>>>>>> @@ -111,8 +111,13 @@ detect the fact that it missed following items in original chain. >>>>>>>>>> */ >>>>>>>>>> obj = kmem_cache_alloc(...); >>>>>>>>>> lock_chain(); // typically a spin_lock() >>>>>>>>>> - obj->key = key; >>>>>>>>>> - atomic_set_release(&obj->refcnt, 1); // key before refcnt >>>>>>>>>> + WRITE_ONCE(obj->key, key); >>>>>>>>>> + /* >>>>>>>>>> + * We need to make sure obj->key is updated before obj->obj_node.next >>>>>>>>>> + * and obj->refcnt. >>>>>>>>>> + */ >>>>>>>>>> + smp_wmb(); >>>>>>>>>> + atomic_set(&obj->refcnt, 1); >>>>>>>>>> hlist_add_head_rcu(&obj->obj_node, list); >>>>>>>>>> unlock_chain(); // typically a spin_unlock() >>>>>>>>>> >>>>>>>>>> @@ -165,12 +170,12 @@ Note that using hlist_nulls means the type of 'obj_node' field of >>>>>>>>>> begin: >>>>>>>>>> rcu_read_lock(); >>>>>>>>>> hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) { >>>>>>>>>> - if (obj->key == key) { >>>>>>>>>> + if (READ_ONCE(obj->key) == key) { >>>>>>>>>> if (!try_get_ref(obj)) { // might fail for free objects >>>>>>>>>> rcu_read_unlock(); >>>>>>>>>> goto begin; >>>>>>>>>> } >>>>>>>>>> - if (obj->key != key) { // not the object we expected >>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected >>>>>>>>>> put_ref(obj); >>>>>>>>>> rcu_read_unlock(); >>>>>>>>>> goto begin; >>>>>>>>>> @@ -206,7 +211,7 @@ hlist_add_head_rcu(). >>>>>>>>>> */ >>>>>>>>>> obj = kmem_cache_alloc(cachep); >>>>>>>>>> lock_chain(); // typically a spin_lock() >>>>>>>>>> - obj->key = key; >>>>>>>>>> + WRITE_ONCE(obj->key, key); >>>>>>>>>> atomic_set_release(&obj->refcnt, 1); // key before refcnt >>>>>>>>>> /* >>>>>>>>>> * insert obj in RCU way (readers might be traversing chain) >>>>>>>>>> -- >>>>>>>>>> 2.34.1
On Sat, Jul 22, 2023 at 09:02:46AM +0800, Alan Huang wrote: > > > 2023年7月22日 上午6:49,Paul E. McKenney <paulmck@kernel.org> 写道: > > > > On Wed, Jul 19, 2023 at 12:08:33AM +0800, Alan Huang wrote: > >> > >>> 2023年7月18日 03:06,Paul E. McKenney <paulmck@kernel.org> 写道: > >>> > >>> On Tue, Jul 18, 2023 at 01:53:10AM +0800, Alan Huang wrote: > >>>>> 2023年7月18日 00:02,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>> On Sun, Jul 16, 2023 at 07:21:28PM +0800, Alan Huang wrote: > >>>>>>> 2023年7月16日 01:19,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>>>> On Sat, Jul 15, 2023 at 08:50:23AM +0800, Alan Huang wrote: > >>>>>>>>> 2023年7月15日 07:23,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>>>>>> On Tue, Jul 11, 2023 at 03:09:06PM +0000, Alan Huang wrote: > >>>>>>>>>> The objects are allocated with SLAB_TYPESAFE_BY_RCU, and there is > >>>>>>>>>> n->next = first within hlist_add_head_rcu() before rcu_assign_pointer(), > >>>>>>>>>> which modifies obj->obj_node.next. There may be readers holding the > >>>>>>>>>> reference of obj in lockless_lookup, and when updater modifies ->next, > >>>>>>>>>> readers can see the change immediately because ofSLAB_TYPESAFE_BY_RCU. > >>>>>>>>>> > >>>>>>>>>> There are two memory ordering required in the insertion algorithm, > >>>>>>>>>> we need to make sure obj->key is updated before obj->obj_node.next > >>>>>>>>>> and obj->refcnt, atomic_set_release is not enough to provide the > >>>>>>>>>> required memory barrier. > >>>>>>>>>> > >>>>>>>>>> Signed-off-by: Alan Huang <mmpgouride@gmail.com> > >>>>>>>>> > >>>>>>>>> This is an interesting one!!! > >>>>>>>>> > >>>>>>>>> Now I am having a hard time believing that the smp_rmb() suffices. > >>>>>>>>> > >>>>>>>>>> --- > >>>>>>>>>> Changelog: > >>>>>>>>>> v1 -> v2: Use _ONCE to protect obj->key. > >>>>>>>>>> > >>>>>>>>>> Documentation/RCU/rculist_nulls.rst | 21 +++++++++++++-------- > >>>>>>>>>> 1 file changed, 13 insertions(+), 8 deletions(-) > >>>>>>>>>> > >>>>>>>>>> diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst > >>>>>>>>>> index 21e40fcc08de..2a9f5a63d334 100644 > >>>>>>>>>> --- a/Documentation/RCU/rculist_nulls.rst > >>>>>>>>>> +++ b/Documentation/RCU/rculist_nulls.rst > >>>>>>>>>> @@ -47,7 +47,7 @@ objects, which is having below type. > >>>>>>>>>> * reuse these object before the RCU grace period, we > >>>>>>>>>> * must check key after getting the reference on object > >>>>>>>>>> */ > >>>>>>>>>> - if (obj->key != key) { // not the object we expected > >>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected > >>>>>>>>>> put_ref(obj); > >>>>>>>>>> rcu_read_unlock(); > >>>>>>>>>> goto begin; > >>>>>>>>>> @@ -64,10 +64,10 @@ but a version with an additional memory barrier (smp_rmb()) > >>>>>>>>>> { > >>>>>>>>>> struct hlist_node *node, *next; > >>>>>>>>>> for (pos = rcu_dereference((head)->first); > >>>>>>>>>> - pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) && > >>>>>>>>>> + pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) && > >>>>>>>>> > >>>>>>>>> Suppose that lockless_lookup() is delayed just before fetching pos->next, > >>>>>>>>> and that there were 17 more node to search in the list. > >>>>>>>>> > >>>>>>>>> Then consider the following sequence of events: > >>>>>>>>> > >>>>>>>>> o The updater deletes this same node and kmem_cache_free()s it. > >>>>>>>>> > >>>>>>>>> o Another updater kmem_cache_alloc()s that same memory and > >>>>>>>>> inserts it into an empty hash chain with a different key. 
> >>>>>>>>> > >>>>>>>>> o Then lockless_lookup() fetches pos->next and sees a NULL pointer, > >>>>>>>>> thus failing to search the remaining 17 nodes in the list, > >>>>>>>>> one of which had the desired key value. > >>>>>>>>> > >>>>>>>>> o The lookup algorithm resumes and sees the NULL return from > >>>>>>>>> lockless_lookup(), and ends up with a NULL obj. > >>>>>>>>> > >>>>>>>>> And this happens even with the strongest possible ordering > >>>>>>>>> everywhere. > >>>>>>>>> > >>>>>>>>> OK, yes, it is late on Friday. So what am I missing here? > >>>>>>>> > >>>>>>>> You missed nothing! > >>>>>>>> > >>>>>>>> The lockless_lockup should not be a function, but a macro like hlist_for_each_entry_rcu. > >>>>>>> > >>>>>>> How would you fix this using a macro? > >>>>>> > >>>>>> With additional detection code. A moved object (in another chain) will have a different slot. > >>>>>> (I have sent patch v3. ) > >>>>>> > >>>>>>> > >>>>>>>>> Independent of that, does hlist_add_head_rcu() need to replace its > >>>>>>>>> "n->next = first" with "WRITE_ONCE(n->next, first)"? > >>>>>>>> > >>>>>>>> I think users who want to use hlist with SLAB_TYPESAFE_BY_RCU should use rculist_nulls? > >>>>>>> > >>>>>>> I believe that you are correct. Would you like to propose a patch, or > >>>>>>> would you rather I put something together? My current thought is to > >>>>>> > >>>>>> Feel free to add. > >>>>>> > >>>>>> One thing I think would be useful is to tell readers where the ‘next' is. > >>>>>> The document mentions ’next’ many times, but it’s hard for me, as a reader, to realize that > >>>>>> the ‘next' is within hlist_add_head_rcu(). (I have no idea where to put the hint.) > >>>>>> > >>>>>> > >>>>>>> keep the examples, but to show why the one with smp_rmb() is broken. > >>>>>> > >>>>>> I think the example needs to be fixed. :) > >>>>> > >>>>> Even better! I will take a look, but in the meantime, would you be > >>>>> interested in updating the wording to explain how the back-pointer works? > >>>> > >>>> Which document needs to be updated? > >>>> And is there anything that I can refer to? It’s the first time I have ever heard about it. > >>> > >>> Documentation/RCU/rculist_nulls.rst, the one that you are updating. > >>> > >>> There admittedly isn't a whole lot of commentary. > >>> > >>>>> (Looks similar to the is_a_nulls() pointer, but in each element instead of > >>>>> just at the end. One advantage is the ability to detect a move mid-list, > >>>>> though that is not a big deal in well-tuned hash tables, which tend to > >>>>> have short hash chains. The need to move elements to the front of the > >>>>> destination list remains, though in both cases only if it has been less > >>>>> than a grace period since the last move.) > >>>> > >>>> Looks like that I need to learn it first. :) > >>> > >>> Well, you wrote the code, so... ;-) > >> > >> If I understand correctly, it works only for 64-bit machines? > >> > >> And the number of slots of the hash table will be limited? > > > > You are asking about the is_a_nulls() value? If so, it works on both > > 32-bit and 64-bit machines. They each have enough bits for the nulls > > value to cover all possible two-byte objects in the full address space. > > > > If that wasn't what you were asking, please help me with your question. > > I’m asking the ‘back-pointer’, which > > "Looks similar to the is_a_nulls() pointer, but in each element instead of > just at the end” > > If I understand correctly, we need to store the nulls in the upper unused bits. 
> > And we only have several bits unused. One advantage I can think of is that it > will improve the performance once we have something like the Intel Upper Address Ignore (UAI), > which also works only for 64-bit machine. Ah, I was using the word "similar" very loosely. Not "implemented in a manner similar to is_a_nulls(), but rather "serves roughly the same function as is_a_nulls()". You should be able to just have the pointer to the bucket in each element, and just compare for each element. No need for extra bits, though UAI, like its pre-existing ARM counterpart, might also help catch use-after-free bugs when resizing the hash table. Thanx, Paul > >>>>>>>> I didn’t find a case using hlist with SLAB_TYPESAFE_BY_RCU, but I did find a case using list > >>>>>>>> with SLAB_TYPESAFE_BY_RCU in drivers/gpu/drm/i915, the driver also doesn’t use _ONCE > >>>>>>>> on the fields of the objects allocated with SLAB_TYPESAFE_BY_RCU. > >>>>>>> > >>>>>>> Feel free to send them a patch, though I cannot speak for their > >>>>>>> reception of it. > >>>>>>> > >>>>>>> Thanx, Paul > >>>>>>> > >>>>>>>>> Thanx, Paul > >>>>>>>>> > >>>>>>>>>> ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; }); > >>>>>>>>>> pos = rcu_dereference(next)) > >>>>>>>>>> - if (obj->key == key) > >>>>>>>>>> + if (READ_ONCE(obj->key) == key) > >>>>>>>>>> return obj; > >>>>>>>>>> return NULL; > >>>>>>>>>> } > >>>>>>>>>> @@ -111,8 +111,13 @@ detect the fact that it missed following items in original chain. > >>>>>>>>>> */ > >>>>>>>>>> obj = kmem_cache_alloc(...); > >>>>>>>>>> lock_chain(); // typically a spin_lock() > >>>>>>>>>> - obj->key = key; > >>>>>>>>>> - atomic_set_release(&obj->refcnt, 1); // key before refcnt > >>>>>>>>>> + WRITE_ONCE(obj->key, key); > >>>>>>>>>> + /* > >>>>>>>>>> + * We need to make sure obj->key is updated before obj->obj_node.next > >>>>>>>>>> + * and obj->refcnt. > >>>>>>>>>> + */ > >>>>>>>>>> + smp_wmb(); > >>>>>>>>>> + atomic_set(&obj->refcnt, 1); > >>>>>>>>>> hlist_add_head_rcu(&obj->obj_node, list); > >>>>>>>>>> unlock_chain(); // typically a spin_unlock() > >>>>>>>>>> > >>>>>>>>>> @@ -165,12 +170,12 @@ Note that using hlist_nulls means the type of 'obj_node' field of > >>>>>>>>>> begin: > >>>>>>>>>> rcu_read_lock(); > >>>>>>>>>> hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) { > >>>>>>>>>> - if (obj->key == key) { > >>>>>>>>>> + if (READ_ONCE(obj->key) == key) { > >>>>>>>>>> if (!try_get_ref(obj)) { // might fail for free objects > >>>>>>>>>> rcu_read_unlock(); > >>>>>>>>>> goto begin; > >>>>>>>>>> } > >>>>>>>>>> - if (obj->key != key) { // not the object we expected > >>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected > >>>>>>>>>> put_ref(obj); > >>>>>>>>>> rcu_read_unlock(); > >>>>>>>>>> goto begin; > >>>>>>>>>> @@ -206,7 +211,7 @@ hlist_add_head_rcu(). > >>>>>>>>>> */ > >>>>>>>>>> obj = kmem_cache_alloc(cachep); > >>>>>>>>>> lock_chain(); // typically a spin_lock() > >>>>>>>>>> - obj->key = key; > >>>>>>>>>> + WRITE_ONCE(obj->key, key); > >>>>>>>>>> atomic_set_release(&obj->refcnt, 1); // key before refcnt > >>>>>>>>>> /* > >>>>>>>>>> * insert obj in RCU way (readers might be traversing chain) > >>>>>>>>>> -- > >>>>>>>>>> 2.34.1 >
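A minimal sketch of the bucket back-pointer being suggested here, under the assumption of a hypothetical per-object field (named bucket below) that the updater records under the chain lock before publishing the object; lock_chain(), try_get_ref() and put_ref() are the same placeholders used by the document's examples. A reader that lands on a recycled object belonging to another chain sees a mismatching back-pointer and restarts:

  struct object {
    struct hlist_node obj_node;
    atomic_t refcnt;
    unsigned int key;
    struct hlist_head *bucket;  /* hypothetical back-pointer to the owning chain */
  };

  /* Insertion, under the chain lock. */
  obj = kmem_cache_alloc(cachep, GFP_KERNEL);
  lock_chain();                   // typically a spin_lock()
  WRITE_ONCE(obj->key, key);
  WRITE_ONCE(obj->bucket, head);  /* record which chain now owns the object */
  smp_wmb();                      /* key and bucket before ->next and refcnt */
  atomic_set(&obj->refcnt, 1);
  hlist_add_head_rcu(&obj->obj_node, head);
  unlock_chain();                 // typically a spin_unlock()

  /* Lookup: compare the back-pointer on every element visited. */
  begin:
  rcu_read_lock();
  hlist_for_each_entry_rcu(obj, head, obj_node) {
    if (READ_ONCE(obj->bucket) != head) { /* recycled into another chain, restart */
      rcu_read_unlock();
      goto begin;
    }
    if (READ_ONCE(obj->key) == key) {
      if (!try_get_ref(obj)) {            // might fail for free objects
        rcu_read_unlock();
        goto begin;
      }
      if (READ_ONCE(obj->key) != key) {   // not the object we expected
        put_ref(obj);
        rcu_read_unlock();
        goto begin;
      }
      goto out;
    }
  }
  obj = NULL;
  out:
  rcu_read_unlock();

This only sketches the detection idea; as with the nulls scheme, the ordering among ->bucket, ->key, ->next and ->refcnt still needs the barriers debated earlier in the thread, and moved elements still have to be re-inserted at the head of the destination chain.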
> 2023年7月25日 00:09,Paul E. McKenney <paulmck@kernel.org> 写道: > > On Sat, Jul 22, 2023 at 09:02:46AM +0800, Alan Huang wrote: >> >>> 2023年7月22日 上午6:49,Paul E. McKenney <paulmck@kernel.org> 写道: >>> >>> On Wed, Jul 19, 2023 at 12:08:33AM +0800, Alan Huang wrote: >>>> >>>>> 2023年7月18日 03:06,Paul E. McKenney <paulmck@kernel.org> 写道: >>>>> >>>>> On Tue, Jul 18, 2023 at 01:53:10AM +0800, Alan Huang wrote: >>>>>>> 2023年7月18日 00:02,Paul E. McKenney <paulmck@kernel.org> 写道: >>>>>>> On Sun, Jul 16, 2023 at 07:21:28PM +0800, Alan Huang wrote: >>>>>>>>> 2023年7月16日 01:19,Paul E. McKenney <paulmck@kernel.org> 写道: >>>>>>>>> On Sat, Jul 15, 2023 at 08:50:23AM +0800, Alan Huang wrote: >>>>>>>>>>> 2023年7月15日 07:23,Paul E. McKenney <paulmck@kernel.org> 写道: >>>>>>>>>>> On Tue, Jul 11, 2023 at 03:09:06PM +0000, Alan Huang wrote: >>>>>>>>>>>> The objects are allocated with SLAB_TYPESAFE_BY_RCU, and there is >>>>>>>>>>>> n->next = first within hlist_add_head_rcu() before rcu_assign_pointer(), >>>>>>>>>>>> which modifies obj->obj_node.next. There may be readers holding the >>>>>>>>>>>> reference of obj in lockless_lookup, and when updater modifies ->next, >>>>>>>>>>>> readers can see the change immediately because ofSLAB_TYPESAFE_BY_RCU. >>>>>>>>>>>> >>>>>>>>>>>> There are two memory ordering required in the insertion algorithm, >>>>>>>>>>>> we need to make sure obj->key is updated before obj->obj_node.next >>>>>>>>>>>> and obj->refcnt, atomic_set_release is not enough to provide the >>>>>>>>>>>> required memory barrier. >>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: Alan Huang <mmpgouride@gmail.com> >>>>>>>>>>> >>>>>>>>>>> This is an interesting one!!! >>>>>>>>>>> >>>>>>>>>>> Now I am having a hard time believing that the smp_rmb() suffices. >>>>>>>>>>> >>>>>>>>>>>> --- >>>>>>>>>>>> Changelog: >>>>>>>>>>>> v1 -> v2: Use _ONCE to protect obj->key. >>>>>>>>>>>> >>>>>>>>>>>> Documentation/RCU/rculist_nulls.rst | 21 +++++++++++++-------- >>>>>>>>>>>> 1 file changed, 13 insertions(+), 8 deletions(-) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst >>>>>>>>>>>> index 21e40fcc08de..2a9f5a63d334 100644 >>>>>>>>>>>> --- a/Documentation/RCU/rculist_nulls.rst >>>>>>>>>>>> +++ b/Documentation/RCU/rculist_nulls.rst >>>>>>>>>>>> @@ -47,7 +47,7 @@ objects, which is having below type. >>>>>>>>>>>> * reuse these object before the RCU grace period, we >>>>>>>>>>>> * must check key after getting the reference on object >>>>>>>>>>>> */ >>>>>>>>>>>> - if (obj->key != key) { // not the object we expected >>>>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected >>>>>>>>>>>> put_ref(obj); >>>>>>>>>>>> rcu_read_unlock(); >>>>>>>>>>>> goto begin; >>>>>>>>>>>> @@ -64,10 +64,10 @@ but a version with an additional memory barrier (smp_rmb()) >>>>>>>>>>>> { >>>>>>>>>>>> struct hlist_node *node, *next; >>>>>>>>>>>> for (pos = rcu_dereference((head)->first); >>>>>>>>>>>> - pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) && >>>>>>>>>>>> + pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) && >>>>>>>>>>> >>>>>>>>>>> Suppose that lockless_lookup() is delayed just before fetching pos->next, >>>>>>>>>>> and that there were 17 more node to search in the list. >>>>>>>>>>> >>>>>>>>>>> Then consider the following sequence of events: >>>>>>>>>>> >>>>>>>>>>> o The updater deletes this same node and kmem_cache_free()s it. 
>>>>>>>>>>> >>>>>>>>>>> o Another updater kmem_cache_alloc()s that same memory and >>>>>>>>>>> inserts it into an empty hash chain with a different key. >>>>>>>>>>> >>>>>>>>>>> o Then lockless_lookup() fetches pos->next and sees a NULL pointer, >>>>>>>>>>> thus failing to search the remaining 17 nodes in the list, >>>>>>>>>>> one of which had the desired key value. >>>>>>>>>>> >>>>>>>>>>> o The lookup algorithm resumes and sees the NULL return from >>>>>>>>>>> lockless_lookup(), and ends up with a NULL obj. >>>>>>>>>>> >>>>>>>>>>> And this happens even with the strongest possible ordering >>>>>>>>>>> everywhere. >>>>>>>>>>> >>>>>>>>>>> OK, yes, it is late on Friday. So what am I missing here? >>>>>>>>>> >>>>>>>>>> You missed nothing! >>>>>>>>>> >>>>>>>>>> The lockless_lockup should not be a function, but a macro like hlist_for_each_entry_rcu. >>>>>>>>> >>>>>>>>> How would you fix this using a macro? >>>>>>>> >>>>>>>> With additional detection code. A moved object (in another chain) will have a different slot. >>>>>>>> (I have sent patch v3. ) >>>>>>>> >>>>>>>>> >>>>>>>>>>> Independent of that, does hlist_add_head_rcu() need to replace its >>>>>>>>>>> "n->next = first" with "WRITE_ONCE(n->next, first)"? >>>>>>>>>> >>>>>>>>>> I think users who want to use hlist with SLAB_TYPESAFE_BY_RCU should use rculist_nulls? >>>>>>>>> >>>>>>>>> I believe that you are correct. Would you like to propose a patch, or >>>>>>>>> would you rather I put something together? My current thought is to >>>>>>>> >>>>>>>> Feel free to add. >>>>>>>> >>>>>>>> One thing I think would be useful is to tell readers where the ‘next' is. >>>>>>>> The document mentions ’next’ many times, but it’s hard for me, as a reader, to realize that >>>>>>>> the ‘next' is within hlist_add_head_rcu(). (I have no idea where to put the hint.) >>>>>>>> >>>>>>>> >>>>>>>>> keep the examples, but to show why the one with smp_rmb() is broken. >>>>>>>> >>>>>>>> I think the example needs to be fixed. :) >>>>>>> >>>>>>> Even better! I will take a look, but in the meantime, would you be >>>>>>> interested in updating the wording to explain how the back-pointer works? >>>>>> >>>>>> Which document needs to be updated? >>>>>> And is there anything that I can refer to? It’s the first time I have ever heard about it. >>>>> >>>>> Documentation/RCU/rculist_nulls.rst, the one that you are updating. >>>>> >>>>> There admittedly isn't a whole lot of commentary. >>>>> >>>>>>> (Looks similar to the is_a_nulls() pointer, but in each element instead of >>>>>>> just at the end. One advantage is the ability to detect a move mid-list, >>>>>>> though that is not a big deal in well-tuned hash tables, which tend to >>>>>>> have short hash chains. The need to move elements to the front of the >>>>>>> destination list remains, though in both cases only if it has been less >>>>>>> than a grace period since the last move.) >>>>>> >>>>>> Looks like that I need to learn it first. :) >>>>> >>>>> Well, you wrote the code, so... ;-) >>>> >>>> If I understand correctly, it works only for 64-bit machines? >>>> >>>> And the number of slots of the hash table will be limited? >>> >>> You are asking about the is_a_nulls() value? If so, it works on both >>> 32-bit and 64-bit machines. They each have enough bits for the nulls >>> value to cover all possible two-byte objects in the full address space. >>> >>> If that wasn't what you were asking, please help me with your question. 
>> >> I’m asking the ‘back-pointer’, which >> >> "Looks similar to the is_a_nulls() pointer, but in each element instead of >> just at the end” >> >> If I understand correctly, we need to store the nulls in the upper unused bits. >> >> And we only have several bits unused. One advantage I can think of is that it >> will improve the performance once we have something like the Intel Upper Address Ignore (UAI), >> which also works only for 64-bit machine. > > Ah, I was using the word "similar" very loosely. Not "implemented in > a manner similar to is_a_nulls(), but rather "serves roughly the same > function as is_a_nulls()". I got it. > > You should be able to just have the pointer to the bucket in each > element, and just compare for each element. No need for extra bits, It might have the same functionality if we put the nulls value in each element. > though UAI, like its pre-existing ARM counterpart, might also help catch > use-after-free bugs when resizing the hash table. > > Thanx, Paul > >>>>>>>>>> I didn’t find a case using hlist with SLAB_TYPESAFE_BY_RCU, but I did find a case using list >>>>>>>>>> with SLAB_TYPESAFE_BY_RCU in drivers/gpu/drm/i915, the driver also doesn’t use _ONCE >>>>>>>>>> on the fields of the objects allocated with SLAB_TYPESAFE_BY_RCU. >>>>>>>>> >>>>>>>>> Feel free to send them a patch, though I cannot speak for their >>>>>>>>> reception of it. >>>>>>>>> >>>>>>>>> Thanx, Paul >>>>>>>>> >>>>>>>>>>> Thanx, Paul >>>>>>>>>>> >>>>>>>>>>>> ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; }); >>>>>>>>>>>> pos = rcu_dereference(next)) >>>>>>>>>>>> - if (obj->key == key) >>>>>>>>>>>> + if (READ_ONCE(obj->key) == key) >>>>>>>>>>>> return obj; >>>>>>>>>>>> return NULL; >>>>>>>>>>>> } >>>>>>>>>>>> @@ -111,8 +111,13 @@ detect the fact that it missed following items in original chain. >>>>>>>>>>>> */ >>>>>>>>>>>> obj = kmem_cache_alloc(...); >>>>>>>>>>>> lock_chain(); // typically a spin_lock() >>>>>>>>>>>> - obj->key = key; >>>>>>>>>>>> - atomic_set_release(&obj->refcnt, 1); // key before refcnt >>>>>>>>>>>> + WRITE_ONCE(obj->key, key); >>>>>>>>>>>> + /* >>>>>>>>>>>> + * We need to make sure obj->key is updated before obj->obj_node.next >>>>>>>>>>>> + * and obj->refcnt. >>>>>>>>>>>> + */ >>>>>>>>>>>> + smp_wmb(); >>>>>>>>>>>> + atomic_set(&obj->refcnt, 1); >>>>>>>>>>>> hlist_add_head_rcu(&obj->obj_node, list); >>>>>>>>>>>> unlock_chain(); // typically a spin_unlock() >>>>>>>>>>>> >>>>>>>>>>>> @@ -165,12 +170,12 @@ Note that using hlist_nulls means the type of 'obj_node' field of >>>>>>>>>>>> begin: >>>>>>>>>>>> rcu_read_lock(); >>>>>>>>>>>> hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) { >>>>>>>>>>>> - if (obj->key == key) { >>>>>>>>>>>> + if (READ_ONCE(obj->key) == key) { >>>>>>>>>>>> if (!try_get_ref(obj)) { // might fail for free objects >>>>>>>>>>>> rcu_read_unlock(); >>>>>>>>>>>> goto begin; >>>>>>>>>>>> } >>>>>>>>>>>> - if (obj->key != key) { // not the object we expected >>>>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected >>>>>>>>>>>> put_ref(obj); >>>>>>>>>>>> rcu_read_unlock(); >>>>>>>>>>>> goto begin; >>>>>>>>>>>> @@ -206,7 +211,7 @@ hlist_add_head_rcu(). 
>>>>>>>>>>>> */ >>>>>>>>>>>> obj = kmem_cache_alloc(cachep); >>>>>>>>>>>> lock_chain(); // typically a spin_lock() >>>>>>>>>>>> - obj->key = key; >>>>>>>>>>>> + WRITE_ONCE(obj->key, key); >>>>>>>>>>>> atomic_set_release(&obj->refcnt, 1); // key before refcnt >>>>>>>>>>>> /* >>>>>>>>>>>> * insert obj in RCU way (readers might be traversing chain) >>>>>>>>>>>> -- >>>>>>>>>>>> 2.34.1
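A sketch of that per-element variant, assuming a hypothetical field (slot_nr below) that stores the owning slot number in each element instead of a bucket pointer; the per-element comparison during the walk stays the same, only the stored value changes:

  struct object {
    struct hlist_node obj_node;
    atomic_t refcnt;
    unsigned int key;
    unsigned long slot_nr;  /* hypothetical: owning slot number kept in each element */
  };

  /* Inside the lockless walk of bucket number 'slot': */
  if (READ_ONCE(obj->slot_nr) != slot) { /* element now belongs to another chain */
    rcu_read_unlock();
    goto begin;
  }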
On Sat, Jul 29, 2023 at 11:17:31PM +0800, Alan Huang wrote: > > > 2023年7月25日 00:09,Paul E. McKenney <paulmck@kernel.org> 写道: > > > > On Sat, Jul 22, 2023 at 09:02:46AM +0800, Alan Huang wrote: > >> > >>> 2023年7月22日 上午6:49,Paul E. McKenney <paulmck@kernel.org> 写道: > >>> > >>> On Wed, Jul 19, 2023 at 12:08:33AM +0800, Alan Huang wrote: > >>>> > >>>>> 2023年7月18日 03:06,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>> > >>>>> On Tue, Jul 18, 2023 at 01:53:10AM +0800, Alan Huang wrote: > >>>>>>> 2023年7月18日 00:02,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>>>> On Sun, Jul 16, 2023 at 07:21:28PM +0800, Alan Huang wrote: > >>>>>>>>> 2023年7月16日 01:19,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>>>>>> On Sat, Jul 15, 2023 at 08:50:23AM +0800, Alan Huang wrote: > >>>>>>>>>>> 2023年7月15日 07:23,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>>>>>>>> On Tue, Jul 11, 2023 at 03:09:06PM +0000, Alan Huang wrote: > >>>>>>>>>>>> The objects are allocated with SLAB_TYPESAFE_BY_RCU, and there is > >>>>>>>>>>>> n->next = first within hlist_add_head_rcu() before rcu_assign_pointer(), > >>>>>>>>>>>> which modifies obj->obj_node.next. There may be readers holding the > >>>>>>>>>>>> reference of obj in lockless_lookup, and when updater modifies ->next, > >>>>>>>>>>>> readers can see the change immediately because ofSLAB_TYPESAFE_BY_RCU. > >>>>>>>>>>>> > >>>>>>>>>>>> There are two memory ordering required in the insertion algorithm, > >>>>>>>>>>>> we need to make sure obj->key is updated before obj->obj_node.next > >>>>>>>>>>>> and obj->refcnt, atomic_set_release is not enough to provide the > >>>>>>>>>>>> required memory barrier. > >>>>>>>>>>>> > >>>>>>>>>>>> Signed-off-by: Alan Huang <mmpgouride@gmail.com> > >>>>>>>>>>> > >>>>>>>>>>> This is an interesting one!!! > >>>>>>>>>>> > >>>>>>>>>>> Now I am having a hard time believing that the smp_rmb() suffices. > >>>>>>>>>>> > >>>>>>>>>>>> --- > >>>>>>>>>>>> Changelog: > >>>>>>>>>>>> v1 -> v2: Use _ONCE to protect obj->key. > >>>>>>>>>>>> > >>>>>>>>>>>> Documentation/RCU/rculist_nulls.rst | 21 +++++++++++++-------- > >>>>>>>>>>>> 1 file changed, 13 insertions(+), 8 deletions(-) > >>>>>>>>>>>> > >>>>>>>>>>>> diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst > >>>>>>>>>>>> index 21e40fcc08de..2a9f5a63d334 100644 > >>>>>>>>>>>> --- a/Documentation/RCU/rculist_nulls.rst > >>>>>>>>>>>> +++ b/Documentation/RCU/rculist_nulls.rst > >>>>>>>>>>>> @@ -47,7 +47,7 @@ objects, which is having below type. > >>>>>>>>>>>> * reuse these object before the RCU grace period, we > >>>>>>>>>>>> * must check key after getting the reference on object > >>>>>>>>>>>> */ > >>>>>>>>>>>> - if (obj->key != key) { // not the object we expected > >>>>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected > >>>>>>>>>>>> put_ref(obj); > >>>>>>>>>>>> rcu_read_unlock(); > >>>>>>>>>>>> goto begin; > >>>>>>>>>>>> @@ -64,10 +64,10 @@ but a version with an additional memory barrier (smp_rmb()) > >>>>>>>>>>>> { > >>>>>>>>>>>> struct hlist_node *node, *next; > >>>>>>>>>>>> for (pos = rcu_dereference((head)->first); > >>>>>>>>>>>> - pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) && > >>>>>>>>>>>> + pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) && > >>>>>>>>>>> > >>>>>>>>>>> Suppose that lockless_lookup() is delayed just before fetching pos->next, > >>>>>>>>>>> and that there were 17 more node to search in the list. 
> >>>>>>>>>>> > >>>>>>>>>>> Then consider the following sequence of events: > >>>>>>>>>>> > >>>>>>>>>>> o The updater deletes this same node and kmem_cache_free()s it. > >>>>>>>>>>> > >>>>>>>>>>> o Another updater kmem_cache_alloc()s that same memory and > >>>>>>>>>>> inserts it into an empty hash chain with a different key. > >>>>>>>>>>> > >>>>>>>>>>> o Then lockless_lookup() fetches pos->next and sees a NULL pointer, > >>>>>>>>>>> thus failing to search the remaining 17 nodes in the list, > >>>>>>>>>>> one of which had the desired key value. > >>>>>>>>>>> > >>>>>>>>>>> o The lookup algorithm resumes and sees the NULL return from > >>>>>>>>>>> lockless_lookup(), and ends up with a NULL obj. > >>>>>>>>>>> > >>>>>>>>>>> And this happens even with the strongest possible ordering > >>>>>>>>>>> everywhere. > >>>>>>>>>>> > >>>>>>>>>>> OK, yes, it is late on Friday. So what am I missing here? > >>>>>>>>>> > >>>>>>>>>> You missed nothing! > >>>>>>>>>> > >>>>>>>>>> The lockless_lockup should not be a function, but a macro like hlist_for_each_entry_rcu. > >>>>>>>>> > >>>>>>>>> How would you fix this using a macro? > >>>>>>>> > >>>>>>>> With additional detection code. A moved object (in another chain) will have a different slot. > >>>>>>>> (I have sent patch v3. ) > >>>>>>>> > >>>>>>>>> > >>>>>>>>>>> Independent of that, does hlist_add_head_rcu() need to replace its > >>>>>>>>>>> "n->next = first" with "WRITE_ONCE(n->next, first)"? > >>>>>>>>>> > >>>>>>>>>> I think users who want to use hlist with SLAB_TYPESAFE_BY_RCU should use rculist_nulls? > >>>>>>>>> > >>>>>>>>> I believe that you are correct. Would you like to propose a patch, or > >>>>>>>>> would you rather I put something together? My current thought is to > >>>>>>>> > >>>>>>>> Feel free to add. > >>>>>>>> > >>>>>>>> One thing I think would be useful is to tell readers where the ‘next' is. > >>>>>>>> The document mentions ’next’ many times, but it’s hard for me, as a reader, to realize that > >>>>>>>> the ‘next' is within hlist_add_head_rcu(). (I have no idea where to put the hint.) > >>>>>>>> > >>>>>>>> > >>>>>>>>> keep the examples, but to show why the one with smp_rmb() is broken. > >>>>>>>> > >>>>>>>> I think the example needs to be fixed. :) > >>>>>>> > >>>>>>> Even better! I will take a look, but in the meantime, would you be > >>>>>>> interested in updating the wording to explain how the back-pointer works? > >>>>>> > >>>>>> Which document needs to be updated? > >>>>>> And is there anything that I can refer to? It’s the first time I have ever heard about it. > >>>>> > >>>>> Documentation/RCU/rculist_nulls.rst, the one that you are updating. > >>>>> > >>>>> There admittedly isn't a whole lot of commentary. > >>>>> > >>>>>>> (Looks similar to the is_a_nulls() pointer, but in each element instead of > >>>>>>> just at the end. One advantage is the ability to detect a move mid-list, > >>>>>>> though that is not a big deal in well-tuned hash tables, which tend to > >>>>>>> have short hash chains. The need to move elements to the front of the > >>>>>>> destination list remains, though in both cases only if it has been less > >>>>>>> than a grace period since the last move.) > >>>>>> > >>>>>> Looks like that I need to learn it first. :) > >>>>> > >>>>> Well, you wrote the code, so... ;-) > >>>> > >>>> If I understand correctly, it works only for 64-bit machines? > >>>> > >>>> And the number of slots of the hash table will be limited? > >>> > >>> You are asking about the is_a_nulls() value? 
If so, it works on both > >>> 32-bit and 64-bit machines. They each have enough bits for the nulls > >>> value to cover all possible two-byte objects in the full address space. > >>> > >>> If that wasn't what you were asking, please help me with your question. > >> > >> I’m asking the ‘back-pointer’, which > >> > >> "Looks similar to the is_a_nulls() pointer, but in each element instead of > >> just at the end” > >> > >> If I understand correctly, we need to store the nulls in the upper unused bits. > >> > >> And we only have several bits unused. One advantage I can think of is that it > >> will improve the performance once we have something like the Intel Upper Address Ignore (UAI), > >> which also works only for 64-bit machine. > > > > Ah, I was using the word "similar" very loosely. Not "implemented in > > a manner similar to is_a_nulls(), but rather "serves roughly the same > > function as is_a_nulls()". > > I got it. > > > > > You should be able to just have the pointer to the bucket in each > > element, and just compare for each element. No need for extra bits, > > It might have the same functionality if we put the nulls value in each element. There are some strange things that are easier to do with a nulls value than with an explicit pointer, but yes. (For example, if for some odd reason you wanted to group the hash buckets into classes within which moving a reader to another bucket was OK. Why would you want to do this? I have no idea!) Thanx, Paul > > though UAI, like its pre-existing ARM counterpart, might also help catch > > use-after-free bugs when resizing the hash table. > > > > Thanx, Paul > > > >>>>>>>>>> I didn’t find a case using hlist with SLAB_TYPESAFE_BY_RCU, but I did find a case using list > >>>>>>>>>> with SLAB_TYPESAFE_BY_RCU in drivers/gpu/drm/i915, the driver also doesn’t use _ONCE > >>>>>>>>>> on the fields of the objects allocated with SLAB_TYPESAFE_BY_RCU. > >>>>>>>>> > >>>>>>>>> Feel free to send them a patch, though I cannot speak for their > >>>>>>>>> reception of it. > >>>>>>>>> > >>>>>>>>> Thanx, Paul > >>>>>>>>> > >>>>>>>>>>> Thanx, Paul > >>>>>>>>>>> > >>>>>>>>>>>> ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; }); > >>>>>>>>>>>> pos = rcu_dereference(next)) > >>>>>>>>>>>> - if (obj->key == key) > >>>>>>>>>>>> + if (READ_ONCE(obj->key) == key) > >>>>>>>>>>>> return obj; > >>>>>>>>>>>> return NULL; > >>>>>>>>>>>> } > >>>>>>>>>>>> @@ -111,8 +111,13 @@ detect the fact that it missed following items in original chain. > >>>>>>>>>>>> */ > >>>>>>>>>>>> obj = kmem_cache_alloc(...); > >>>>>>>>>>>> lock_chain(); // typically a spin_lock() > >>>>>>>>>>>> - obj->key = key; > >>>>>>>>>>>> - atomic_set_release(&obj->refcnt, 1); // key before refcnt > >>>>>>>>>>>> + WRITE_ONCE(obj->key, key); > >>>>>>>>>>>> + /* > >>>>>>>>>>>> + * We need to make sure obj->key is updated before obj->obj_node.next > >>>>>>>>>>>> + * and obj->refcnt. 
> >>>>>>>>>>>> + */ > >>>>>>>>>>>> + smp_wmb(); > >>>>>>>>>>>> + atomic_set(&obj->refcnt, 1); > >>>>>>>>>>>> hlist_add_head_rcu(&obj->obj_node, list); > >>>>>>>>>>>> unlock_chain(); // typically a spin_unlock() > >>>>>>>>>>>> > >>>>>>>>>>>> @@ -165,12 +170,12 @@ Note that using hlist_nulls means the type of 'obj_node' field of > >>>>>>>>>>>> begin: > >>>>>>>>>>>> rcu_read_lock(); > >>>>>>>>>>>> hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) { > >>>>>>>>>>>> - if (obj->key == key) { > >>>>>>>>>>>> + if (READ_ONCE(obj->key) == key) { > >>>>>>>>>>>> if (!try_get_ref(obj)) { // might fail for free objects > >>>>>>>>>>>> rcu_read_unlock(); > >>>>>>>>>>>> goto begin; > >>>>>>>>>>>> } > >>>>>>>>>>>> - if (obj->key != key) { // not the object we expected > >>>>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected > >>>>>>>>>>>> put_ref(obj); > >>>>>>>>>>>> rcu_read_unlock(); > >>>>>>>>>>>> goto begin; > >>>>>>>>>>>> @@ -206,7 +211,7 @@ hlist_add_head_rcu(). > >>>>>>>>>>>> */ > >>>>>>>>>>>> obj = kmem_cache_alloc(cachep); > >>>>>>>>>>>> lock_chain(); // typically a spin_lock() > >>>>>>>>>>>> - obj->key = key; > >>>>>>>>>>>> + WRITE_ONCE(obj->key, key); > >>>>>>>>>>>> atomic_set_release(&obj->refcnt, 1); // key before refcnt > >>>>>>>>>>>> /* > >>>>>>>>>>>> * insert obj in RCU way (readers might be traversing chain) > >>>>>>>>>>>> -- > >>>>>>>>>>>> 2.34.1 > >
> 2023年8月1日 上午4:06,Paul E. McKenney <paulmck@kernel.org> 写道: > > On Sat, Jul 29, 2023 at 11:17:31PM +0800, Alan Huang wrote: >> >>> 2023年7月25日 00:09,Paul E. McKenney <paulmck@kernel.org> 写道: >>> >>> On Sat, Jul 22, 2023 at 09:02:46AM +0800, Alan Huang wrote: >>>> >>>>> 2023年7月22日 上午6:49,Paul E. McKenney <paulmck@kernel.org> 写道: >>>>> >>>>> On Wed, Jul 19, 2023 at 12:08:33AM +0800, Alan Huang wrote: >>>>>> >>>>>>> 2023年7月18日 03:06,Paul E. McKenney <paulmck@kernel.org> 写道: >>>>>>> >>>>>>> On Tue, Jul 18, 2023 at 01:53:10AM +0800, Alan Huang wrote: >>>>>>>>> 2023年7月18日 00:02,Paul E. McKenney <paulmck@kernel.org> 写道: >>>>>>>>> On Sun, Jul 16, 2023 at 07:21:28PM +0800, Alan Huang wrote: >>>>>>>>>>> 2023年7月16日 01:19,Paul E. McKenney <paulmck@kernel.org> 写道: >>>>>>>>>>> On Sat, Jul 15, 2023 at 08:50:23AM +0800, Alan Huang wrote: >>>>>>>>>>>>> 2023年7月15日 07:23,Paul E. McKenney <paulmck@kernel.org> 写道: >>>>>>>>>>>>> On Tue, Jul 11, 2023 at 03:09:06PM +0000, Alan Huang wrote: >>>>>>>>>>>>>> The objects are allocated with SLAB_TYPESAFE_BY_RCU, and there is >>>>>>>>>>>>>> n->next = first within hlist_add_head_rcu() before rcu_assign_pointer(), >>>>>>>>>>>>>> which modifies obj->obj_node.next. There may be readers holding the >>>>>>>>>>>>>> reference of obj in lockless_lookup, and when updater modifies ->next, >>>>>>>>>>>>>> readers can see the change immediately because ofSLAB_TYPESAFE_BY_RCU. >>>>>>>>>>>>>> >>>>>>>>>>>>>> There are two memory ordering required in the insertion algorithm, >>>>>>>>>>>>>> we need to make sure obj->key is updated before obj->obj_node.next >>>>>>>>>>>>>> and obj->refcnt, atomic_set_release is not enough to provide the >>>>>>>>>>>>>> required memory barrier. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Signed-off-by: Alan Huang <mmpgouride@gmail.com> >>>>>>>>>>>>> >>>>>>>>>>>>> This is an interesting one!!! >>>>>>>>>>>>> >>>>>>>>>>>>> Now I am having a hard time believing that the smp_rmb() suffices. >>>>>>>>>>>>> >>>>>>>>>>>>>> --- >>>>>>>>>>>>>> Changelog: >>>>>>>>>>>>>> v1 -> v2: Use _ONCE to protect obj->key. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Documentation/RCU/rculist_nulls.rst | 21 +++++++++++++-------- >>>>>>>>>>>>>> 1 file changed, 13 insertions(+), 8 deletions(-) >>>>>>>>>>>>>> >>>>>>>>>>>>>> diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst >>>>>>>>>>>>>> index 21e40fcc08de..2a9f5a63d334 100644 >>>>>>>>>>>>>> --- a/Documentation/RCU/rculist_nulls.rst >>>>>>>>>>>>>> +++ b/Documentation/RCU/rculist_nulls.rst >>>>>>>>>>>>>> @@ -47,7 +47,7 @@ objects, which is having below type. >>>>>>>>>>>>>> * reuse these object before the RCU grace period, we >>>>>>>>>>>>>> * must check key after getting the reference on object >>>>>>>>>>>>>> */ >>>>>>>>>>>>>> - if (obj->key != key) { // not the object we expected >>>>>>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected >>>>>>>>>>>>>> put_ref(obj); >>>>>>>>>>>>>> rcu_read_unlock(); >>>>>>>>>>>>>> goto begin; >>>>>>>>>>>>>> @@ -64,10 +64,10 @@ but a version with an additional memory barrier (smp_rmb()) >>>>>>>>>>>>>> { >>>>>>>>>>>>>> struct hlist_node *node, *next; >>>>>>>>>>>>>> for (pos = rcu_dereference((head)->first); >>>>>>>>>>>>>> - pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) && >>>>>>>>>>>>>> + pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) && >>>>>>>>>>>>> >>>>>>>>>>>>> Suppose that lockless_lookup() is delayed just before fetching pos->next, >>>>>>>>>>>>> and that there were 17 more node to search in the list. 
>>>>>>>>>>>>> >>>>>>>>>>>>> Then consider the following sequence of events: >>>>>>>>>>>>> >>>>>>>>>>>>> o The updater deletes this same node and kmem_cache_free()s it. >>>>>>>>>>>>> >>>>>>>>>>>>> o Another updater kmem_cache_alloc()s that same memory and >>>>>>>>>>>>> inserts it into an empty hash chain with a different key. >>>>>>>>>>>>> >>>>>>>>>>>>> o Then lockless_lookup() fetches pos->next and sees a NULL pointer, >>>>>>>>>>>>> thus failing to search the remaining 17 nodes in the list, >>>>>>>>>>>>> one of which had the desired key value. >>>>>>>>>>>>> >>>>>>>>>>>>> o The lookup algorithm resumes and sees the NULL return from >>>>>>>>>>>>> lockless_lookup(), and ends up with a NULL obj. >>>>>>>>>>>>> >>>>>>>>>>>>> And this happens even with the strongest possible ordering >>>>>>>>>>>>> everywhere. >>>>>>>>>>>>> >>>>>>>>>>>>> OK, yes, it is late on Friday. So what am I missing here? >>>>>>>>>>>> >>>>>>>>>>>> You missed nothing! >>>>>>>>>>>> >>>>>>>>>>>> The lockless_lockup should not be a function, but a macro like hlist_for_each_entry_rcu. >>>>>>>>>>> >>>>>>>>>>> How would you fix this using a macro? >>>>>>>>>> >>>>>>>>>> With additional detection code. A moved object (in another chain) will have a different slot. >>>>>>>>>> (I have sent patch v3. ) >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>> Independent of that, does hlist_add_head_rcu() need to replace its >>>>>>>>>>>>> "n->next = first" with "WRITE_ONCE(n->next, first)"? >>>>>>>>>>>> >>>>>>>>>>>> I think users who want to use hlist with SLAB_TYPESAFE_BY_RCU should use rculist_nulls? >>>>>>>>>>> >>>>>>>>>>> I believe that you are correct. Would you like to propose a patch, or >>>>>>>>>>> would you rather I put something together? My current thought is to >>>>>>>>>> >>>>>>>>>> Feel free to add. >>>>>>>>>> >>>>>>>>>> One thing I think would be useful is to tell readers where the ‘next' is. >>>>>>>>>> The document mentions ’next’ many times, but it’s hard for me, as a reader, to realize that >>>>>>>>>> the ‘next' is within hlist_add_head_rcu(). (I have no idea where to put the hint.) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> keep the examples, but to show why the one with smp_rmb() is broken. >>>>>>>>>> >>>>>>>>>> I think the example needs to be fixed. :) >>>>>>>>> >>>>>>>>> Even better! I will take a look, but in the meantime, would you be >>>>>>>>> interested in updating the wording to explain how the back-pointer works? >>>>>>>> >>>>>>>> Which document needs to be updated? >>>>>>>> And is there anything that I can refer to? It’s the first time I have ever heard about it. >>>>>>> >>>>>>> Documentation/RCU/rculist_nulls.rst, the one that you are updating. >>>>>>> >>>>>>> There admittedly isn't a whole lot of commentary. >>>>>>> >>>>>>>>> (Looks similar to the is_a_nulls() pointer, but in each element instead of >>>>>>>>> just at the end. One advantage is the ability to detect a move mid-list, >>>>>>>>> though that is not a big deal in well-tuned hash tables, which tend to >>>>>>>>> have short hash chains. The need to move elements to the front of the >>>>>>>>> destination list remains, though in both cases only if it has been less >>>>>>>>> than a grace period since the last move.) >>>>>>>> >>>>>>>> Looks like that I need to learn it first. :) >>>>>>> >>>>>>> Well, you wrote the code, so... ;-) >>>>>> >>>>>> If I understand correctly, it works only for 64-bit machines? >>>>>> >>>>>> And the number of slots of the hash table will be limited? >>>>> >>>>> You are asking about the is_a_nulls() value? 
If so, it works on both >>>>> 32-bit and 64-bit machines. They each have enough bits for the nulls >>>>> value to cover all possible two-byte objects in the full address space. >>>>> >>>>> If that wasn't what you were asking, please help me with your question. >>>> >>>> I’m asking the ‘back-pointer’, which >>>> >>>> "Looks similar to the is_a_nulls() pointer, but in each element instead of >>>> just at the end” >>>> >>>> If I understand correctly, we need to store the nulls in the upper unused bits. >>>> >>>> And we only have several bits unused. One advantage I can think of is that it >>>> will improve the performance once we have something like the Intel Upper Address Ignore (UAI), >>>> which also works only for 64-bit machine. >>> >>> Ah, I was using the word "similar" very loosely. Not "implemented in >>> a manner similar to is_a_nulls(), but rather "serves roughly the same >>> function as is_a_nulls()". >> >> I got it. >> >>> >>> You should be able to just have the pointer to the bucket in each >>> element, and just compare for each element. No need for extra bits, >> >> It might have the same functionality if we put the nulls value in each element. > > There are some strange things that are easier to do with a nulls value > than with an explicit pointer, but yes. > > (For example, if for some odd reason you wanted to group the hash buckets > into classes within which moving a reader to another bucket was OK. > Why would you want to do this? I have no idea!) Interesting example. I’ll try to update the wording to explain how the back-pointer works. By the way, would you like a separate patch or a patch V4 containing the content of V3 of "docs/RCU: Bring back smp_wmb()” and the back-pointer? > > Thanx, Paul > >>> though UAI, like its pre-existing ARM counterpart, might also help catch >>> use-after-free bugs when resizing the hash table. >>> >>> Thanx, Paul >>> >>>>>>>>>>>> I didn’t find a case using hlist with SLAB_TYPESAFE_BY_RCU, but I did find a case using list >>>>>>>>>>>> with SLAB_TYPESAFE_BY_RCU in drivers/gpu/drm/i915, the driver also doesn’t use _ONCE >>>>>>>>>>>> on the fields of the objects allocated with SLAB_TYPESAFE_BY_RCU. >>>>>>>>>>> >>>>>>>>>>> Feel free to send them a patch, though I cannot speak for their >>>>>>>>>>> reception of it. >>>>>>>>>>> >>>>>>>>>>> Thanx, Paul >>>>>>>>>>> >>>>>>>>>>>>> Thanx, Paul >>>>>>>>>>>>> >>>>>>>>>>>>>> ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; }); >>>>>>>>>>>>>> pos = rcu_dereference(next)) >>>>>>>>>>>>>> - if (obj->key == key) >>>>>>>>>>>>>> + if (READ_ONCE(obj->key) == key) >>>>>>>>>>>>>> return obj; >>>>>>>>>>>>>> return NULL; >>>>>>>>>>>>>> } >>>>>>>>>>>>>> @@ -111,8 +111,13 @@ detect the fact that it missed following items in original chain. >>>>>>>>>>>>>> */ >>>>>>>>>>>>>> obj = kmem_cache_alloc(...); >>>>>>>>>>>>>> lock_chain(); // typically a spin_lock() >>>>>>>>>>>>>> - obj->key = key; >>>>>>>>>>>>>> - atomic_set_release(&obj->refcnt, 1); // key before refcnt >>>>>>>>>>>>>> + WRITE_ONCE(obj->key, key); >>>>>>>>>>>>>> + /* >>>>>>>>>>>>>> + * We need to make sure obj->key is updated before obj->obj_node.next >>>>>>>>>>>>>> + * and obj->refcnt. 
>>>>>>>>>>>>>> + */ >>>>>>>>>>>>>> + smp_wmb(); >>>>>>>>>>>>>> + atomic_set(&obj->refcnt, 1); >>>>>>>>>>>>>> hlist_add_head_rcu(&obj->obj_node, list); >>>>>>>>>>>>>> unlock_chain(); // typically a spin_unlock() >>>>>>>>>>>>>> >>>>>>>>>>>>>> @@ -165,12 +170,12 @@ Note that using hlist_nulls means the type of 'obj_node' field of >>>>>>>>>>>>>> begin: >>>>>>>>>>>>>> rcu_read_lock(); >>>>>>>>>>>>>> hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) { >>>>>>>>>>>>>> - if (obj->key == key) { >>>>>>>>>>>>>> + if (READ_ONCE(obj->key) == key) { >>>>>>>>>>>>>> if (!try_get_ref(obj)) { // might fail for free objects >>>>>>>>>>>>>> rcu_read_unlock(); >>>>>>>>>>>>>> goto begin; >>>>>>>>>>>>>> } >>>>>>>>>>>>>> - if (obj->key != key) { // not the object we expected >>>>>>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected >>>>>>>>>>>>>> put_ref(obj); >>>>>>>>>>>>>> rcu_read_unlock(); >>>>>>>>>>>>>> goto begin; >>>>>>>>>>>>>> @@ -206,7 +211,7 @@ hlist_add_head_rcu(). >>>>>>>>>>>>>> */ >>>>>>>>>>>>>> obj = kmem_cache_alloc(cachep); >>>>>>>>>>>>>> lock_chain(); // typically a spin_lock() >>>>>>>>>>>>>> - obj->key = key; >>>>>>>>>>>>>> + WRITE_ONCE(obj->key, key); >>>>>>>>>>>>>> atomic_set_release(&obj->refcnt, 1); // key before refcnt >>>>>>>>>>>>>> /* >>>>>>>>>>>>>> * insert obj in RCU way (readers might be traversing chain) >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> 2.34.1
On Thu, Aug 03, 2023 at 09:50:35PM +0800, Alan Huang wrote: > > > 2023年8月1日 上午4:06,Paul E. McKenney <paulmck@kernel.org> 写道: > > > > On Sat, Jul 29, 2023 at 11:17:31PM +0800, Alan Huang wrote: > >> > >>> 2023年7月25日 00:09,Paul E. McKenney <paulmck@kernel.org> 写道: > >>> > >>> On Sat, Jul 22, 2023 at 09:02:46AM +0800, Alan Huang wrote: > >>>> > >>>>> 2023年7月22日 上午6:49,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>> > >>>>> On Wed, Jul 19, 2023 at 12:08:33AM +0800, Alan Huang wrote: > >>>>>> > >>>>>>> 2023年7月18日 03:06,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>>>> > >>>>>>> On Tue, Jul 18, 2023 at 01:53:10AM +0800, Alan Huang wrote: > >>>>>>>>> 2023年7月18日 00:02,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>>>>>> On Sun, Jul 16, 2023 at 07:21:28PM +0800, Alan Huang wrote: > >>>>>>>>>>> 2023年7月16日 01:19,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>>>>>>>> On Sat, Jul 15, 2023 at 08:50:23AM +0800, Alan Huang wrote: > >>>>>>>>>>>>> 2023年7月15日 07:23,Paul E. McKenney <paulmck@kernel.org> 写道: > >>>>>>>>>>>>> On Tue, Jul 11, 2023 at 03:09:06PM +0000, Alan Huang wrote: > >>>>>>>>>>>>>> The objects are allocated with SLAB_TYPESAFE_BY_RCU, and there is > >>>>>>>>>>>>>> n->next = first within hlist_add_head_rcu() before rcu_assign_pointer(), > >>>>>>>>>>>>>> which modifies obj->obj_node.next. There may be readers holding the > >>>>>>>>>>>>>> reference of obj in lockless_lookup, and when updater modifies ->next, > >>>>>>>>>>>>>> readers can see the change immediately because ofSLAB_TYPESAFE_BY_RCU. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> There are two memory ordering required in the insertion algorithm, > >>>>>>>>>>>>>> we need to make sure obj->key is updated before obj->obj_node.next > >>>>>>>>>>>>>> and obj->refcnt, atomic_set_release is not enough to provide the > >>>>>>>>>>>>>> required memory barrier. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Signed-off-by: Alan Huang <mmpgouride@gmail.com> > >>>>>>>>>>>>> > >>>>>>>>>>>>> This is an interesting one!!! > >>>>>>>>>>>>> > >>>>>>>>>>>>> Now I am having a hard time believing that the smp_rmb() suffices. > >>>>>>>>>>>>> > >>>>>>>>>>>>>> --- > >>>>>>>>>>>>>> Changelog: > >>>>>>>>>>>>>> v1 -> v2: Use _ONCE to protect obj->key. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Documentation/RCU/rculist_nulls.rst | 21 +++++++++++++-------- > >>>>>>>>>>>>>> 1 file changed, 13 insertions(+), 8 deletions(-) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst > >>>>>>>>>>>>>> index 21e40fcc08de..2a9f5a63d334 100644 > >>>>>>>>>>>>>> --- a/Documentation/RCU/rculist_nulls.rst > >>>>>>>>>>>>>> +++ b/Documentation/RCU/rculist_nulls.rst > >>>>>>>>>>>>>> @@ -47,7 +47,7 @@ objects, which is having below type. 
> >>>>>>>>>>>>>> * reuse these object before the RCU grace period, we
> >>>>>>>>>>>>>> * must check key after getting the reference on object
> >>>>>>>>>>>>>> */
> >>>>>>>>>>>>>> - if (obj->key != key) { // not the object we expected
> >>>>>>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected
> >>>>>>>>>>>>>> put_ref(obj);
> >>>>>>>>>>>>>> rcu_read_unlock();
> >>>>>>>>>>>>>> goto begin;
> >>>>>>>>>>>>>> @@ -64,10 +64,10 @@ but a version with an additional memory barrier (smp_rmb())
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>> struct hlist_node *node, *next;
> >>>>>>>>>>>>>> for (pos = rcu_dereference((head)->first);
> >>>>>>>>>>>>>> - pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
> >>>>>>>>>>>>>> + pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) &&
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Suppose that lockless_lookup() is delayed just before fetching pos->next,
> >>>>>>>>>>>>> and that there were 17 more node to search in the list.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Then consider the following sequence of events:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> o The updater deletes this same node and kmem_cache_free()s it.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> o Another updater kmem_cache_alloc()s that same memory and
> >>>>>>>>>>>>> inserts it into an empty hash chain with a different key.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> o Then lockless_lookup() fetches pos->next and sees a NULL pointer,
> >>>>>>>>>>>>> thus failing to search the remaining 17 nodes in the list,
> >>>>>>>>>>>>> one of which had the desired key value.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> o The lookup algorithm resumes and sees the NULL return from
> >>>>>>>>>>>>> lockless_lookup(), and ends up with a NULL obj.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> And this happens even with the strongest possible ordering
> >>>>>>>>>>>>> everywhere.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> OK, yes, it is late on Friday. So what am I missing here?
> >>>>>>>>>>>>
> >>>>>>>>>>>> You missed nothing!
> >>>>>>>>>>>>
> >>>>>>>>>>>> The lockless_lockup should not be a function, but a macro like hlist_for_each_entry_rcu.
> >>>>>>>>>>>
> >>>>>>>>>>> How would you fix this using a macro?
> >>>>>>>>>>
> >>>>>>>>>> With additional detection code. A moved object (in another chain) will have a different slot.
> >>>>>>>>>> (I have sent patch v3. )
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>> Independent of that, does hlist_add_head_rcu() need to replace its
> >>>>>>>>>>>>> "n->next = first" with "WRITE_ONCE(n->next, first)"?
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think users who want to use hlist with SLAB_TYPESAFE_BY_RCU should use rculist_nulls?
> >>>>>>>>>>>
> >>>>>>>>>>> I believe that you are correct. Would you like to propose a patch, or
> >>>>>>>>>>> would you rather I put something together? My current thought is to
> >>>>>>>>>>
> >>>>>>>>>> Feel free to add.
> >>>>>>>>>>
> >>>>>>>>>> One thing I think would be useful is to tell readers where the ‘next' is.
> >>>>>>>>>> The document mentions ’next’ many times, but it’s hard for me, as a reader, to realize that
> >>>>>>>>>> the ‘next' is within hlist_add_head_rcu(). (I have no idea where to put the hint.)
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> keep the examples, but to show why the one with smp_rmb() is broken.
> >>>>>>>>>>
> >>>>>>>>>> I think the example needs to be fixed. :)
> >>>>>>>>>
> >>>>>>>>> Even better! I will take a look, but in the meantime, would you be
> >>>>>>>>> interested in updating the wording to explain how the back-pointer works?
> >>>>>>>>
> >>>>>>>> Which document needs to be updated?
> >>>>>>>> And is there anything that I can refer to? It’s the first time I have ever heard about it.
> >>>>>>>
> >>>>>>> Documentation/RCU/rculist_nulls.rst, the one that you are updating.
> >>>>>>>
> >>>>>>> There admittedly isn't a whole lot of commentary.
> >>>>>>>
> >>>>>>>>> (Looks similar to the is_a_nulls() pointer, but in each element instead of
> >>>>>>>>> just at the end. One advantage is the ability to detect a move mid-list,
> >>>>>>>>> though that is not a big deal in well-tuned hash tables, which tend to
> >>>>>>>>> have short hash chains. The need to move elements to the front of the
> >>>>>>>>> destination list remains, though in both cases only if it has been less
> >>>>>>>>> than a grace period since the last move.)
> >>>>>>>>
> >>>>>>>> Looks like that I need to learn it first. :)
> >>>>>>>
> >>>>>>> Well, you wrote the code, so... ;-)
> >>>>>>
> >>>>>> If I understand correctly, it works only for 64-bit machines?
> >>>>>>
> >>>>>> And the number of slots of the hash table will be limited?
> >>>>>
> >>>>> You are asking about the is_a_nulls() value? If so, it works on both
> >>>>> 32-bit and 64-bit machines. They each have enough bits for the nulls
> >>>>> value to cover all possible two-byte objects in the full address space.
> >>>>>
> >>>>> If that wasn't what you were asking, please help me with your question.
> >>>>
> >>>> I’m asking the ‘back-pointer’, which
> >>>>
> >>>> "Looks similar to the is_a_nulls() pointer, but in each element instead of
> >>>> just at the end”
> >>>>
> >>>> If I understand correctly, we need to store the nulls in the upper unused bits.
> >>>>
> >>>> And we only have several bits unused. One advantage I can think of is that it
> >>>> will improve the performance once we have something like the Intel Upper Address Ignore (UAI),
> >>>> which also works only for 64-bit machine.
> >>>
> >>> Ah, I was using the word "similar" very loosely. Not "implemented in
> >>> a manner similar to is_a_nulls(), but rather "serves roughly the same
> >>> function as is_a_nulls()".
> >>
> >> I got it.
> >>
> >>>
> >>> You should be able to just have the pointer to the bucket in each
> >>> element, and just compare for each element. No need for extra bits,
> >>
> >> It might have the same functionality if we put the nulls value in each element.
> >
> > There are some strange things that are easier to do with a nulls value
> > than with an explicit pointer, but yes.
> >
> > (For example, if for some odd reason you wanted to group the hash buckets
> > into classes within which moving a reader to another bucket was OK.
> > Why would you want to do this? I have no idea!)
>
> Interesting example.
>
> I’ll try to update the wording to explain how the back-pointer works.
>
> By the way, would you like a separate patch or a patch V4 containing the content of
> V3 of "docs/RCU: Bring back smp_wmb()” and the back-pointer?

A V4 containing the V3, please.

							Thanx, Paul

> >>> though UAI, like its pre-existing ARM counterpart, might also help catch
> >>> use-after-free bugs when resizing the hash table.
> >>>
> >>> 							Thanx, Paul
> >>>
> >>>>>>>>>>>> I didn’t find a case using hlist with SLAB_TYPESAFE_BY_RCU, but I did find a case using list
> >>>>>>>>>>>> with SLAB_TYPESAFE_BY_RCU in drivers/gpu/drm/i915, the driver also doesn’t use _ONCE
> >>>>>>>>>>>> on the fields of the objects allocated with SLAB_TYPESAFE_BY_RCU.
> >>>>>>>>>>>
> >>>>>>>>>>> Feel free to send them a patch, though I cannot speak for their
> >>>>>>>>>>> reception of it.
> >>>>>>>>>>>
> >>>>>>>>>>> 							Thanx, Paul
> >>>>>>>>>>>
> >>>>>>>>>>>>> 							Thanx, Paul
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; });
> >>>>>>>>>>>>>> pos = rcu_dereference(next))
> >>>>>>>>>>>>>> - if (obj->key == key)
> >>>>>>>>>>>>>> + if (READ_ONCE(obj->key) == key)
> >>>>>>>>>>>>>> return obj;
> >>>>>>>>>>>>>> return NULL;
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>> @@ -111,8 +111,13 @@ detect the fact that it missed following items in original chain.
> >>>>>>>>>>>>>> */
> >>>>>>>>>>>>>> obj = kmem_cache_alloc(...);
> >>>>>>>>>>>>>> lock_chain(); // typically a spin_lock()
> >>>>>>>>>>>>>> - obj->key = key;
> >>>>>>>>>>>>>> - atomic_set_release(&obj->refcnt, 1); // key before refcnt
> >>>>>>>>>>>>>> + WRITE_ONCE(obj->key, key);
> >>>>>>>>>>>>>> + /*
> >>>>>>>>>>>>>> + * We need to make sure obj->key is updated before obj->obj_node.next
> >>>>>>>>>>>>>> + * and obj->refcnt.
> >>>>>>>>>>>>>> + */
> >>>>>>>>>>>>>> + smp_wmb();
> >>>>>>>>>>>>>> + atomic_set(&obj->refcnt, 1);
> >>>>>>>>>>>>>> hlist_add_head_rcu(&obj->obj_node, list);
> >>>>>>>>>>>>>> unlock_chain(); // typically a spin_unlock()
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> @@ -165,12 +170,12 @@ Note that using hlist_nulls means the type of 'obj_node' field of
> >>>>>>>>>>>>>> begin:
> >>>>>>>>>>>>>> rcu_read_lock();
> >>>>>>>>>>>>>> hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) {
> >>>>>>>>>>>>>> - if (obj->key == key) {
> >>>>>>>>>>>>>> + if (READ_ONCE(obj->key) == key) {
> >>>>>>>>>>>>>> if (!try_get_ref(obj)) { // might fail for free objects
> >>>>>>>>>>>>>> rcu_read_unlock();
> >>>>>>>>>>>>>> goto begin;
> >>>>>>>>>>>>>> }
> >>>>>>>>>>>>>> - if (obj->key != key) { // not the object we expected
> >>>>>>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected
> >>>>>>>>>>>>>> put_ref(obj);
> >>>>>>>>>>>>>> rcu_read_unlock();
> >>>>>>>>>>>>>> goto begin;
> >>>>>>>>>>>>>> @@ -206,7 +211,7 @@ hlist_add_head_rcu().
> >>>>>>>>>>>>>> */
> >>>>>>>>>>>>>> obj = kmem_cache_alloc(cachep);
> >>>>>>>>>>>>>> lock_chain(); // typically a spin_lock()
> >>>>>>>>>>>>>> - obj->key = key;
> >>>>>>>>>>>>>> + WRITE_ONCE(obj->key, key);
> >>>>>>>>>>>>>> atomic_set_release(&obj->refcnt, 1); // key before refcnt
> >>>>>>>>>>>>>> /*
> >>>>>>>>>>>>>> * insert obj in RCU way (readers might be traversing chain)
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> 2.34.1
>
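To make the "back-pointer" idea discussed in the thread above concrete, here is a hypothetical sketch: each element records the bucket it was inserted into, so a reader can detect in the middle of the list that the object it is traversing was freed and recycled into another chain, which is possible under SLAB_TYPESAFE_BY_RCU. The struct layout, the bucket field, and lookup_backptr() are made-up illustrations; hlist_for_each_entry_rcu() and READ_ONCE() are the only real kernel APIs used.

  struct object {
          struct hlist_node obj_node;
          struct hlist_head *bucket; /* back-pointer, written under the chain lock */
          atomic_t refcnt;
          int key;
  };

  /* Returns a candidate object or NULL; sets *moved when the caller must restart. */
  static struct object *lookup_backptr(struct hlist_head *bucket, int key, bool *moved)
  {
          struct object *obj;

          *moved = false;
          hlist_for_each_entry_rcu(obj, bucket, obj_node) {
                  if (READ_ONCE(obj->bucket) != bucket) {
                          /* Object was recycled into another chain. */
                          *moved = true;
                          return NULL;
                  }
                  if (READ_ONCE(obj->key) == key)
                          return obj; /* caller still needs try_get_ref() + key recheck */
          }
          return NULL;
  }

A caller would run this under rcu_read_lock() and restart when *moved is set, playing the same role as the nulls-value recheck at the end of the document's hlist_nulls lookup, except that a move is caught mid-list rather than only at the end of the chain.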
diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst
index 21e40fcc08de..2a9f5a63d334 100644
--- a/Documentation/RCU/rculist_nulls.rst
+++ b/Documentation/RCU/rculist_nulls.rst
@@ -47,7 +47,7 @@ objects, which is having below type.
        * reuse these object before the RCU grace period, we
        * must check key after getting the reference on object
        */
-      if (obj->key != key) { // not the object we expected
+      if (READ_ONCE(obj->key) != key) { // not the object we expected
         put_ref(obj);
         rcu_read_unlock();
         goto begin;
@@ -64,10 +64,10 @@ but a version with an additional memory barrier (smp_rmb())
   {
     struct hlist_node *node, *next;
     for (pos = rcu_dereference((head)->first);
-         pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
+         pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) &&
          ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; });
          pos = rcu_dereference(next))
-      if (obj->key == key)
+      if (READ_ONCE(obj->key) == key)
         return obj;
     return NULL;
   }
@@ -111,8 +111,13 @@ detect the fact that it missed following items in original chain.
     */
    obj = kmem_cache_alloc(...);
    lock_chain(); // typically a spin_lock()
-   obj->key = key;
-   atomic_set_release(&obj->refcnt, 1); // key before refcnt
+   WRITE_ONCE(obj->key, key);
+   /*
+    * We need to make sure obj->key is updated before obj->obj_node.next
+    * and obj->refcnt.
+    */
+   smp_wmb();
+   atomic_set(&obj->refcnt, 1);
    hlist_add_head_rcu(&obj->obj_node, list);
    unlock_chain(); // typically a spin_unlock()
 
@@ -165,12 +170,12 @@ Note that using hlist_nulls means the type of 'obj_node' field of
  begin:
    rcu_read_lock();
    hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) {
-     if (obj->key == key) {
+     if (READ_ONCE(obj->key) == key) {
        if (!try_get_ref(obj)) { // might fail for free objects
          rcu_read_unlock();
          goto begin;
        }
-      if (obj->key != key) { // not the object we expected
+      if (READ_ONCE(obj->key) != key) { // not the object we expected
          put_ref(obj);
          rcu_read_unlock();
          goto begin;
@@ -206,7 +211,7 @@ hlist_add_head_rcu().
     */
    obj = kmem_cache_alloc(cachep);
    lock_chain(); // typically a spin_lock()
-   obj->key = key;
+   WRITE_ONCE(obj->key, key);
    atomic_set_release(&obj->refcnt, 1); // key before refcnt
    /*
     * insert obj in RCU way (readers might be traversing chain)
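As a reading aid for the diff above, here is the updated insertion path again with annotations showing what the new smp_wmb() orders and what it pairs with. The statements are the ones from the patch; the comments are annotations, and they assume try_get_ref() is an atomic_inc_not_zero()-style operation, which is fully ordered when it succeeds:

  obj = kmem_cache_alloc(...);
  lock_chain();                              // typically a spin_lock()
  WRITE_ONCE(obj->key, key);                 // A: the new key
  smp_wmb();                                 // orders A before B and before C
  atomic_set(&obj->refcnt, 1);               // B: what try_get_ref() checks
  hlist_add_head_rcu(&obj->obj_node, list);  // C: plain store to obj->obj_node.next,
                                             //    then rcu_assign_pointer() publication
  unlock_chain();                            // typically a spin_unlock()

On the reader side, the smp_wmb() pairs with the smp_rmb() in lockless_lookup(), which sits between loading pos->next (C) and loading obj->key (A), and with the ordering of a successful try_get_ref() before the key recheck. An atomic_set_release(&obj->refcnt, 1) would only guarantee A before B: nothing would then order A before the ->next store inside C, and under SLAB_TYPESAFE_BY_RCU that ->next store is exactly what a concurrent lockless_lookup() may observe first.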
The objects are allocated with SLAB_TYPESAFE_BY_RCU, and there is
n->next = first within hlist_add_head_rcu() before rcu_assign_pointer(),
which modifies obj->obj_node.next. There may be readers holding the
reference of obj in lockless_lookup, and when updater modifies ->next,
readers can see the change immediately because of SLAB_TYPESAFE_BY_RCU.

There are two memory ordering required in the insertion algorithm,
we need to make sure obj->key is updated before obj->obj_node.next
and obj->refcnt, atomic_set_release is not enough to provide the
required memory barrier.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
---
Changelog:
 v1 -> v2: Use _ONCE to protect obj->key.

 Documentation/RCU/rculist_nulls.rst | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)
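For readers trying to locate the "n->next = first" store that the changelog refers to: it lives inside hlist_add_head_rcu() itself, which (slightly simplified, as of the kernels discussed in this thread) looks approximately like the sketch below; whether that plain store should become WRITE_ONCE() is the separate question raised earlier in the thread.

  static inline void hlist_add_head_rcu(struct hlist_node *n, struct hlist_head *h)
  {
          struct hlist_node *first = h->first;

          n->next = first;                            // plain store; a reader already holding
                                                      // obj may see it early under
                                                      // SLAB_TYPESAFE_BY_RCU
          n->pprev = &h->first;
          rcu_assign_pointer(hlist_first_rcu(h), n);  // publication point for ordinary
                                                      // RCU readers entering via the head
          if (first)
                  first->pprev = &n->next;
  }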