Message ID | 20180518130413.16997-2-roman.penyaev@profitbricks.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, May 18, 2018 at 6:07 AM Roman Pen <roman.penyaev@profitbricks.com> wrote: > Function is going to be used in transport over RDMA module > in subsequent patches. Does this really merit its own helper macro in a generic header? It honestly smells more like "just have an inline helper function that is specific to rdma" to me. Particularly since it's probably just one specific list where you want this oddly specific behavior. Also, if we really want a round-robin list traversal macro, this isn't the way it should be implemented, I suspect, and it probably shouldn't be RCU-specific to begin with. Side note: I notice that I should already have been more critical of even the much simpler "list_next_or_null_rcu()" macro. The "documentation" comment above the macro is pure and utter cut-and-paste garbage. Paul, mind giving this a look? Linus
On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote: > Function is going to be used in transport over RDMA module > in subsequent patches. > > Function returns next element in round-robin fashion, > i.e. head will be skipped. NULL will be returned if list > is observed as empty. > > Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > Cc: linux-kernel@vger.kernel.org > --- > include/linux/rculist.h | 19 +++++++++++++++++++ > 1 file changed, 19 insertions(+) > > diff --git a/include/linux/rculist.h b/include/linux/rculist.h > index 127f534fec94..b0840d5ab25a 100644 > --- a/include/linux/rculist.h > +++ b/include/linux/rculist.h > @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, > }) > > /** > + * list_next_or_null_rr_rcu - get next list element in round-robin fashion. > + * @head: the head for the list. > + * @ptr: the list head to take the next element from. > + * @type: the type of the struct this is embedded in. > + * @memb: the name of the list_head within the struct. > + * > + * Next element returned in round-robin fashion, i.e. head will be skipped, > + * but if list is observed as empty, NULL will be returned. > + * > + * This primitive may safely run concurrently with the _rcu list-mutation > + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock(). Of course, all the set of list_next_or_null_rr_rcu() invocations that are round-robining a given list must all be under the same RCU read-side critical section. For example, the following will break badly: struct foo *take_rr_step(struct list_head *head, struct foo *ptr) { struct foo *ret; rcu_read_lock(); ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist); rcu_read_unlock(); /* BUG */ return ret; } You need a big fat comment stating this, at the very least. The resulting bug can be very hard to trigger and even harder to debug. And yes, I know that the same restriction applies to list_next_rcu() and friends. The difference is that if you try to invoke those in an infinite loop, you will be rapped on the knuckles as soon as you hit the list header. Without that knuckle-rapping, RCU CPU stall warnings might tempt people to do something broken like take_rr_step() above. Is it possible to instead do some sort of list_for_each_entry_rcu()-like macro that makes it more obvious that the whole thing need to be under a single RCU read-side critical section? Such a macro would of course be an infinite loop if the list never went empty, so presumably there would be a break or return statement in there somewhere. > + */ > +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \ > +({ \ > + list_next_or_null_rcu(head, ptr, type, memb) ?: \ > + list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \ Are there any uses for this outside of RDMA? If not, I am with Linus. Define this within RDMA, where a smaller number of people can more easily be kept aware of the restrictions on use. If it turns out to be more generally useful, we can take a look at exactly what makes sense more globally. Even within RDMA, I strongly recommend the big fat comment called out above. And the list_for_each_entry_rcu()-like formulation, if that can be made to work within RDMA's code structure. Seem reasonable, or am I missing something here? Thanx, Paul > +}) > + > +/** > * list_for_each_entry_rcu - iterate over rcu list of given type > * @pos: the type * to use as a loop cursor. > * @head: the head for your list. > -- > 2.13.1 >
On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote: >> Function is going to be used in transport over RDMA module >> in subsequent patches. >> >> Function returns next element in round-robin fashion, >> i.e. head will be skipped. NULL will be returned if list >> is observed as empty. >> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> >> Cc: linux-kernel@vger.kernel.org >> --- >> include/linux/rculist.h | 19 +++++++++++++++++++ >> 1 file changed, 19 insertions(+) >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h >> index 127f534fec94..b0840d5ab25a 100644 >> --- a/include/linux/rculist.h >> +++ b/include/linux/rculist.h >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, >> }) >> >> /** >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion. >> + * @head: the head for the list. >> + * @ptr: the list head to take the next element from. >> + * @type: the type of the struct this is embedded in. >> + * @memb: the name of the list_head within the struct. >> + * >> + * Next element returned in round-robin fashion, i.e. head will be skipped, >> + * but if list is observed as empty, NULL will be returned. >> + * >> + * This primitive may safely run concurrently with the _rcu list-mutation >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock(). > > Of course, all the set of list_next_or_null_rr_rcu() invocations that > are round-robining a given list must all be under the same RCU read-side > critical section. For example, the following will break badly: > > struct foo *take_rr_step(struct list_head *head, struct foo *ptr) > { > struct foo *ret; > > rcu_read_lock(); > ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist); > rcu_read_unlock(); /* BUG */ > return ret; > } > > You need a big fat comment stating this, at the very least. The resulting > bug can be very hard to trigger and even harder to debug. > > And yes, I know that the same restriction applies to list_next_rcu() > and friends. The difference is that if you try to invoke those in an > infinite loop, you will be rapped on the knuckles as soon as you hit > the list header. Without that knuckle-rapping, RCU CPU stall warnings > might tempt people to do something broken like take_rr_step() above. Hi Paul, I need -rr behaviour for doing IO load-balancing when I choose next RDMA connection from the list in order to send a request, i.e. my code is something like the following: static struct conn *get_and_set_next_conn(void) { struct conn *conn; conn = rcu_dereferece(rcu_conn); if (unlikely(!conn)) return conn; conn = list_next_or_null_rr_rcu(&conn_list, &conn->entry, typeof(*conn), entry); rcu_assign_pointer(rcu_conn, conn); return conn; } rcu_read_lock(); conn = get_and_set_next_conn(); if (unlikely(!conn)) { /* ... */ } err = rdma_io(conn, request); rcu_read_unlock(); i.e. usage of the @next pointer is under an RCU critical section. > Is it possible to instead do some sort of list_for_each_entry_rcu()-like > macro that makes it more obvious that the whole thing need to be under > a single RCU read-side critical section? Such a macro would of course be > an infinite loop if the list never went empty, so presumably there would > be a break or return statement in there somewhere. The difference is that I do not need a loop, I take the @next conn pointer, save it for the following IO request and do IO for current IO request. It seems list_for_each_entry_rcu()-like with immediate "break" in the body of the loop does not look nice, I personally do not like it, i.e.: static struct conn *get_and_set_next_conn(void) { struct conn *conn; conn = rcu_dereferece(rcu_conn); if (unlikely(!conn)) return conn; list_for_each_entry_rr_rcu(conn, &conn_list, entry) { break; } rcu_assign_pointer(rcu_conn, conn); return conn; } or maybe I did not fully get your idea? >> + */ >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \ >> +({ \ >> + list_next_or_null_rcu(head, ptr, type, memb) ?: \ >> + list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \ > > Are there any uses for this outside of RDMA? If not, I am with Linus. > Define this within RDMA, where a smaller number of people can more > easily be kept aware of the restrictions on use. If it turns out to be > more generally useful, we can take a look at exactly what makes sense > more globally. The only one list_for_each_entry_rcu()-like macro I am aware of is used in block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr(): https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370 Does it make sense to implement generic list_next_or_null_rr_rcu() reusing my list_next_or_null_rr_rcu() variant? > Even within RDMA, I strongly recommend the big fat comment called out above. > And the list_for_each_entry_rcu()-like formulation, if that can be made to > work within RDMA's code structure. > > Seem reasonable, or am I missing something here? Thanks for clear explanation. -- Roman
On Fri, May 18, 2018 at 6:56 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, May 18, 2018 at 6:07 AM Roman Pen <roman.penyaev@profitbricks.com> > wrote: > >> Function is going to be used in transport over RDMA module >> in subsequent patches. > > Does this really merit its own helper macro in a generic header? > > It honestly smells more like "just have an inline helper function that is > specific to rdma" to me. Particularly since it's probably just one specific > list where you want this oddly specific behavior. > > Also, if we really want a round-robin list traversal macro, this isn't the > way it should be implemented, I suspect, and it probably shouldn't be > RCU-specific to begin with. Hi Linus, Another one list_for_each_entry_rcu()-like macro I am aware of is used in block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr(): https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370 Can we do something generic with -rr semantics to cover both cases? -- Roman > > Side note: I notice that I should already have been more critical of even > the much simpler "list_next_or_null_rcu()" macro. The "documentation" > comment above the macro is pure and utter cut-and-paste garbage. > > Paul, mind giving this a look? > > Linus
On Sat, May 19, 2018 at 1:21 PM Roman Penyaev < roman.penyaev@profitbricks.com> wrote: > I need -rr behaviour for doing IO load-balancing when I choose next RDMA > connection from the list in order to send a request, i.e. my code is > something like the following: [ incomplete pseudoicode ] > i.e. usage of the @next pointer is under an RCU critical section. That's not enough. The whole chain to look up the pointer you are taking 'next' of needs to be under RCU, and that's not clear from your example. It's *probably* the case, but basically you have to prove that the starting point is still on the same RCU list. That wasn't clear from your example. The above is (as Paul said) true of list_next_rcu() too, so it's not like this is anything specific to the 'rr' version. Linus
On Sat, May 19, 2018 at 1:25 PM Roman Penyaev < roman.penyaev@profitbricks.com> wrote: > Another one list_for_each_entry_rcu()-like macro I am aware of is used in > block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr(): https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370 > Can we do something generic with -rr semantics to cover both cases? That loop actually looks more like what Paul was asking for, and makes it (perhaps) a bit more obvious that the whole loop has to be done under the same RCU read sequence that looked up that first 'skip' entry. (Again, stronger locking than RCU is obviously also acceptable for the "look up skip entry"). But another reason I really dislike that list_next_or_null_rr_rcu() macro in the patch under discussion is that it's *really* not the right way to skip one entry. It may work, but it's really ugly. Again, the list_for_each_entry_rcu_rr() in blk-mq-sched.c looks better in that regard, in that the skipping seems at least a _bit_ more explicit about what it's doing. And again, if you make this specific to one particular list (and it really likely is just one particular list that wants this), you can use a nice legible helper inline function instead of the macro with the member name. Don't get me wrong - I absolutely adore our generic list handling macros, but I think they work because they are simple. Once we get to "take the next entry, but skip it if it's the head entry, and then return NULL if you get back to the entry you started with" kind of semantics, an inline function that takes a particular list and has a big comment about *why* you want those semantics for that particular case sounds _much_ better to me than adding some complex "generic" macro for a very very unusual special case. Linus
On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote: > On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote: > >> Function is going to be used in transport over RDMA module > >> in subsequent patches. > >> > >> Function returns next element in round-robin fashion, > >> i.e. head will be skipped. NULL will be returned if list > >> is observed as empty. > >> > >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> > >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > >> Cc: linux-kernel@vger.kernel.org > >> --- > >> include/linux/rculist.h | 19 +++++++++++++++++++ > >> 1 file changed, 19 insertions(+) > >> > >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h > >> index 127f534fec94..b0840d5ab25a 100644 > >> --- a/include/linux/rculist.h > >> +++ b/include/linux/rculist.h > >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, > >> }) > >> > >> /** > >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion. > >> + * @head: the head for the list. > >> + * @ptr: the list head to take the next element from. > >> + * @type: the type of the struct this is embedded in. > >> + * @memb: the name of the list_head within the struct. > >> + * > >> + * Next element returned in round-robin fashion, i.e. head will be skipped, > >> + * but if list is observed as empty, NULL will be returned. > >> + * > >> + * This primitive may safely run concurrently with the _rcu list-mutation > >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock(). > > > > Of course, all the set of list_next_or_null_rr_rcu() invocations that > > are round-robining a given list must all be under the same RCU read-side > > critical section. For example, the following will break badly: > > > > struct foo *take_rr_step(struct list_head *head, struct foo *ptr) > > { > > struct foo *ret; > > > > rcu_read_lock(); > > ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist); > > rcu_read_unlock(); /* BUG */ > > return ret; > > } > > > > You need a big fat comment stating this, at the very least. The resulting > > bug can be very hard to trigger and even harder to debug. > > > > And yes, I know that the same restriction applies to list_next_rcu() > > and friends. The difference is that if you try to invoke those in an > > infinite loop, you will be rapped on the knuckles as soon as you hit > > the list header. Without that knuckle-rapping, RCU CPU stall warnings > > might tempt people to do something broken like take_rr_step() above. > > Hi Paul, > > I need -rr behaviour for doing IO load-balancing when I choose next RDMA > connection from the list in order to send a request, i.e. my code is > something like the following: > > static struct conn *get_and_set_next_conn(void) > { > struct conn *conn; > > conn = rcu_dereferece(rcu_conn); > if (unlikely(!conn)) > return conn; Wait. Don't you need to restart from the beginning of the list in this case? Or does the list never have anything added to it and is rcu_conn initially the first element in the list? > conn = list_next_or_null_rr_rcu(&conn_list, > &conn->entry, > typeof(*conn), > entry); > rcu_assign_pointer(rcu_conn, conn); Linus is correct to doubt this code. You assign a pointer to the current element to rcu_conn, which is presumably a per-CPU or global variable. So far, so good ... > return conn; > } > > rcu_read_lock(); > conn = get_and_set_next_conn(); > if (unlikely(!conn)) { > /* ... */ > } > err = rdma_io(conn, request); > rcu_read_unlock(); ... except that some other CPU might well remove the entry referenced by rcu_conn at this point. It would have to wait for a grace period (e.g., synchronize_rcu()), but the current CPU has exited its RCU read-side critical section, and therefore is not blocking the grace period. Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it might well be referencing the freelist, or, even worse, some other type of structure. What is your code doing to prevent this from happening? (There are ways, but I want to know what you were doing in this case.) > i.e. usage of the @next pointer is under an RCU critical section. > > > Is it possible to instead do some sort of list_for_each_entry_rcu()-like > > macro that makes it more obvious that the whole thing need to be under > > a single RCU read-side critical section? Such a macro would of course be > > an infinite loop if the list never went empty, so presumably there would > > be a break or return statement in there somewhere. > > The difference is that I do not need a loop, I take the @next conn pointer, > save it for the following IO request and do IO for current IO request. > > It seems list_for_each_entry_rcu()-like with immediate "break" in the body > of the loop does not look nice, I personally do not like it, i.e.: > > > static struct conn *get_and_set_next_conn(void) > { > struct conn *conn; > > conn = rcu_dereferece(rcu_conn); > if (unlikely(!conn)) > return conn; > list_for_each_entry_rr_rcu(conn, &conn_list, > entry) { > break; > } > rcu_assign_pointer(rcu_conn, conn); > return conn; > } > > > or maybe I did not fully get your idea? That would not help at all because you are still leaking the pointer out of the RCU read-side critical section. That is completely and utterly broken unless you are somehow cleaning up rcu_conn when you remove the element. And getting that cleanup right is -extremely- tricky. Unless you have some sort of proof of correctness, you will get a NACK from me. More like this: list_for_each_entry_rr_rcu(conn, &conn_list, entry) { do_something_with(conn); if (done_for_now()) break; } > >> + */ > >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \ > >> +({ \ > >> + list_next_or_null_rcu(head, ptr, type, memb) ?: \ > >> + list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \ > > > > Are there any uses for this outside of RDMA? If not, I am with Linus. > > Define this within RDMA, where a smaller number of people can more > > easily be kept aware of the restrictions on use. If it turns out to be > > more generally useful, we can take a look at exactly what makes sense > > more globally. > > The only one list_for_each_entry_rcu()-like macro I am aware of is used in > block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr(): > > https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370 > > Does it make sense to implement generic list_next_or_null_rr_rcu() reusing > my list_next_or_null_rr_rcu() variant? Let's start with the basics: It absolutely does not make sense to leak pointers across rcu_read_unlock() unless you have arranged something else to protect the pointed-to data in the meantime. There are a number of ways of implementing this protection. Again, what protection are you using? Your code at the above URL looks plausible to me at first glance: You do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then rcu_read_unlock(). But at second glance, it looks like htcx->queue might have the same vulnerability as rcu_conn in your earlier code. Thanx, Paul > > Even within RDMA, I strongly recommend the big fat comment called out above. > > And the list_for_each_entry_rcu()-like formulation, if that can be made to > > work within RDMA's code structure. > > > > Seem reasonable, or am I missing something here? > > Thanks for clear explanation. > > -- > Roman >
On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote: >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney >> <paulmck@linux.vnet.ibm.com> wrote: >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote: >> >> Function is going to be used in transport over RDMA module >> >> in subsequent patches. >> >> >> >> Function returns next element in round-robin fashion, >> >> i.e. head will be skipped. NULL will be returned if list >> >> is observed as empty. >> >> >> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> >> >> Cc: linux-kernel@vger.kernel.org >> >> --- >> >> include/linux/rculist.h | 19 +++++++++++++++++++ >> >> 1 file changed, 19 insertions(+) >> >> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h >> >> index 127f534fec94..b0840d5ab25a 100644 >> >> --- a/include/linux/rculist.h >> >> +++ b/include/linux/rculist.h >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, >> >> }) >> >> >> >> /** >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion. >> >> + * @head: the head for the list. >> >> + * @ptr: the list head to take the next element from. >> >> + * @type: the type of the struct this is embedded in. >> >> + * @memb: the name of the list_head within the struct. >> >> + * >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped, >> >> + * but if list is observed as empty, NULL will be returned. >> >> + * >> >> + * This primitive may safely run concurrently with the _rcu list-mutation >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock(). >> > >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that >> > are round-robining a given list must all be under the same RCU read-side >> > critical section. For example, the following will break badly: >> > >> > struct foo *take_rr_step(struct list_head *head, struct foo *ptr) >> > { >> > struct foo *ret; >> > >> > rcu_read_lock(); >> > ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist); >> > rcu_read_unlock(); /* BUG */ >> > return ret; >> > } >> > >> > You need a big fat comment stating this, at the very least. The resulting >> > bug can be very hard to trigger and even harder to debug. >> > >> > And yes, I know that the same restriction applies to list_next_rcu() >> > and friends. The difference is that if you try to invoke those in an >> > infinite loop, you will be rapped on the knuckles as soon as you hit >> > the list header. Without that knuckle-rapping, RCU CPU stall warnings >> > might tempt people to do something broken like take_rr_step() above. >> >> Hi Paul, >> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA >> connection from the list in order to send a request, i.e. my code is >> something like the following: >> >> static struct conn *get_and_set_next_conn(void) >> { >> struct conn *conn; >> >> conn = rcu_dereferece(rcu_conn); >> if (unlikely(!conn)) >> return conn; > > Wait. Don't you need to restart from the beginning of the list in > this case? Or does the list never have anything added to it and is > rcu_conn initially the first element in the list? Hi Paul, No, I continue from the pointer, which I assigned on the previous IO in order to send IO fairly and keep load balanced. Initially @rcu_conn points to the first element, but elements can be deleted from the list and list can become empty. The deletion code is below. > >> conn = list_next_or_null_rr_rcu(&conn_list, >> &conn->entry, >> typeof(*conn), >> entry); >> rcu_assign_pointer(rcu_conn, conn); > > Linus is correct to doubt this code. You assign a pointer to the current > element to rcu_conn, which is presumably a per-CPU or global variable. > So far, so good ... I use per-CPU, in the first example I did not show that not to overcomplicate the code. > >> return conn; >> } >> >> rcu_read_lock(); >> conn = get_and_set_next_conn(); >> if (unlikely(!conn)) { >> /* ... */ >> } >> err = rdma_io(conn, request); >> rcu_read_unlock(); > > ... except that some other CPU might well remove the entry referenced by > rcu_conn at this point. It would have to wait for a grace period (e.g., > synchronize_rcu()), but the current CPU has exited its RCU read-side > critical section, and therefore is not blocking the grace period. > Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it > might well be referencing the freelist, or, even worse, some other type > of structure. > > What is your code doing to prevent this from happening? (There are ways, > but I want to know what you were doing in this case.) Probably I should have shown the way of removal at the very beginning, my fault. So deletion looks as the following (a bit changed and simplified for the sake of clearness): static void remove_connection(conn) { bool need_to_wait = false; int cpu; /* Do not let RCU list add/delete happen in parallel */ mutex_lock(&conn_lock); list_del_rcu(&conn->entry); /* Make sure everybody observes element removal */ synchronize_rcu(); /* * At this point nobody sees @conn in the list, but still we have * dangling pointer @rcu_conn which _can_ point to @conn. Since * nobody can observe @conn in the list, we guarantee that IO path * will not assign @conn to @rcu_conn, i.e. @rcu_conn can be equal * to @conn, but can never again become @conn. */ /* * Get @next connection from current @conn which is going to be * removed. */ next = list_next_or_null_rr_rcu(&conn_list, &conn->entry, typeof(*next), entry); /* * Here @rcu_conn can be changed by reader side, so use @cmpxchg * in order to keep fairness in load-balancing and do not touch * the pointer which can be already changed by the IO path. * * Current path can be faster than IO path and the following race * exists: * * CPU0 CPU1 * ---- ---- * conn = rcu_dereferece(rcu_conn); * next = list_next_or_null_rr_rcu(conn) * * conn == cmpxchg(rcu_conn, conn, next); * synchronize_rcu(); * * rcu_assign_pointer(rcu_conn, next); * ^^^^^^^^^^^^^^^^^^ * * Here @rcu_conn is already equal to @next (done by @cmpxchg), * so assignment to the same pointer is harmless. * */ for_each_possible_cpu(cpu) { struct conn **rcu_conn; rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu); if (*rcu_conn != conn) /* * This @cpu will never again pick up @conn, * so it is safe just to choose next CPU. */ continue; if (conn == cmpxchg(rcu_conn, conn, next)) /* * @rcu_conn was successfully replaced with @next, * that means that someone can also hold a @conn * and dereferencing it, so wait for a grace period * is required. */ need_to_wait = true; } if (need_to_wait) synchronize_rcu(); mutex_unlock(&conn_lock); kfree(conn); } > >> i.e. usage of the @next pointer is under an RCU critical section. >> >> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like >> > macro that makes it more obvious that the whole thing need to be under >> > a single RCU read-side critical section? Such a macro would of course be >> > an infinite loop if the list never went empty, so presumably there would >> > be a break or return statement in there somewhere. >> >> The difference is that I do not need a loop, I take the @next conn pointer, >> save it for the following IO request and do IO for current IO request. >> >> It seems list_for_each_entry_rcu()-like with immediate "break" in the body >> of the loop does not look nice, I personally do not like it, i.e.: >> >> >> static struct conn *get_and_set_next_conn(void) >> { >> struct conn *conn; >> >> conn = rcu_dereferece(rcu_conn); >> if (unlikely(!conn)) >> return conn; >> list_for_each_entry_rr_rcu(conn, &conn_list, >> entry) { >> break; >> } >> rcu_assign_pointer(rcu_conn, conn); >> return conn; >> } >> >> >> or maybe I did not fully get your idea? > > That would not help at all because you are still leaking the pointer out > of the RCU read-side critical section. That is completely and utterly > broken unless you are somehow cleaning up rcu_conn when you remove > the element. And getting that cleanup right is -extremely- tricky. > Unless you have some sort of proof of correctness, you will get a NACK > from me. I understand all the consequences of the leaking pointer, and of course wrapped loop with RCU lock/unlock is simpler, but in order to keep load-balancing and IO fairness avoiding any locks on IO path I've come up with these RCU tricks and list_next_or_null_rr_rcu() macro. > More like this: > > list_for_each_entry_rr_rcu(conn, &conn_list, entry) { > do_something_with(conn); > if (done_for_now()) > break; > } > >> >> + */ >> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \ >> >> +({ \ >> >> + list_next_or_null_rcu(head, ptr, type, memb) ?: \ >> >> + list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \ >> > >> > Are there any uses for this outside of RDMA? If not, I am with Linus. >> > Define this within RDMA, where a smaller number of people can more >> > easily be kept aware of the restrictions on use. If it turns out to be >> > more generally useful, we can take a look at exactly what makes sense >> > more globally. >> >> The only one list_for_each_entry_rcu()-like macro I am aware of is used in >> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr(): >> >> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370 >> >> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing >> my list_next_or_null_rr_rcu() variant? > > Let's start with the basics: It absolutely does not make sense to leak > pointers across rcu_read_unlock() unless you have arranged something else > to protect the pointed-to data in the meantime. There are a number of ways > of implementing this protection. Again, what protection are you using? > > Your code at the above URL looks plausible to me at first glance: You > do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then > rcu_read_unlock(). But at second glance, it looks like htcx->queue > might have the same vulnerability as rcu_conn in your earlier code. I am not the author of the code at the URL I specified. I provided the link answering the question to show other possible users of round-robin semantics for RCU list traversal. In my 'list_next_or_null_rr_rcu()' case I can't use a loop, I leak the pointer and indeed have to be very careful. But perhaps we can come up with some generic solution to cover both cases: -rr loop and -rr next. -- Roman
On Mon, May 21, 2018 at 6:51 AM Roman Penyaev < roman.penyaev@profitbricks.com> wrote: > No, I continue from the pointer, which I assigned on the previous IO > in order to send IO fairly and keep load balanced. Right. And that's exactly what has both me and Paul nervous. You're no longer in the RCU domain. You're using a pointer where the lifetime has nothing to do with RCU any more. Can it be done? Sure. But you need *other* locking for it (that you haven't explained), and it's fragile as hell. It's probably best to not use RCU for it at all, but depend on that "other locking" that you have to have anyway, to keep the pointer valid over the non-RCU region. Linus
On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote: > On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote: > >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney > >> <paulmck@linux.vnet.ibm.com> wrote: > >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote: > >> >> Function is going to be used in transport over RDMA module > >> >> in subsequent patches. > >> >> > >> >> Function returns next element in round-robin fashion, > >> >> i.e. head will be skipped. NULL will be returned if list > >> >> is observed as empty. > >> >> > >> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> > >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > >> >> Cc: linux-kernel@vger.kernel.org > >> >> --- > >> >> include/linux/rculist.h | 19 +++++++++++++++++++ > >> >> 1 file changed, 19 insertions(+) > >> >> > >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h > >> >> index 127f534fec94..b0840d5ab25a 100644 > >> >> --- a/include/linux/rculist.h > >> >> +++ b/include/linux/rculist.h > >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, > >> >> }) > >> >> > >> >> /** > >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion. > >> >> + * @head: the head for the list. > >> >> + * @ptr: the list head to take the next element from. > >> >> + * @type: the type of the struct this is embedded in. > >> >> + * @memb: the name of the list_head within the struct. > >> >> + * > >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped, > >> >> + * but if list is observed as empty, NULL will be returned. > >> >> + * > >> >> + * This primitive may safely run concurrently with the _rcu list-mutation > >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock(). > >> > > >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that > >> > are round-robining a given list must all be under the same RCU read-side > >> > critical section. For example, the following will break badly: > >> > > >> > struct foo *take_rr_step(struct list_head *head, struct foo *ptr) > >> > { > >> > struct foo *ret; > >> > > >> > rcu_read_lock(); > >> > ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist); > >> > rcu_read_unlock(); /* BUG */ > >> > return ret; > >> > } > >> > > >> > You need a big fat comment stating this, at the very least. The resulting > >> > bug can be very hard to trigger and even harder to debug. > >> > > >> > And yes, I know that the same restriction applies to list_next_rcu() > >> > and friends. The difference is that if you try to invoke those in an > >> > infinite loop, you will be rapped on the knuckles as soon as you hit > >> > the list header. Without that knuckle-rapping, RCU CPU stall warnings > >> > might tempt people to do something broken like take_rr_step() above. > >> > >> Hi Paul, > >> > >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA > >> connection from the list in order to send a request, i.e. my code is > >> something like the following: > >> > >> static struct conn *get_and_set_next_conn(void) > >> { > >> struct conn *conn; > >> > >> conn = rcu_dereferece(rcu_conn); > >> if (unlikely(!conn)) > >> return conn; > > > > Wait. Don't you need to restart from the beginning of the list in > > this case? Or does the list never have anything added to it and is > > rcu_conn initially the first element in the list? > > Hi Paul, > > No, I continue from the pointer, which I assigned on the previous IO > in order to send IO fairly and keep load balanced. > > Initially @rcu_conn points to the first element, but elements can > be deleted from the list and list can become empty. > > The deletion code is below. > > > > >> conn = list_next_or_null_rr_rcu(&conn_list, > >> &conn->entry, > >> typeof(*conn), > >> entry); > >> rcu_assign_pointer(rcu_conn, conn); > > > > Linus is correct to doubt this code. You assign a pointer to the current > > element to rcu_conn, which is presumably a per-CPU or global variable. > > So far, so good ... > > I use per-CPU, in the first example I did not show that not to overcomplicate > the code. > > > > >> return conn; > >> } > >> > >> rcu_read_lock(); > >> conn = get_and_set_next_conn(); > >> if (unlikely(!conn)) { > >> /* ... */ > >> } > >> err = rdma_io(conn, request); > >> rcu_read_unlock(); > > > > ... except that some other CPU might well remove the entry referenced by > > rcu_conn at this point. It would have to wait for a grace period (e.g., > > synchronize_rcu()), but the current CPU has exited its RCU read-side > > critical section, and therefore is not blocking the grace period. > > Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it > > might well be referencing the freelist, or, even worse, some other type > > of structure. > > > > What is your code doing to prevent this from happening? (There are ways, > > but I want to know what you were doing in this case.) > > Probably I should have shown the way of removal at the very beginning, > my fault. So deletion looks as the following (a bit changed and > simplified for the sake of clearness): Thank you! Let's see... > static void remove_connection(conn) > { > bool need_to_wait = false; > int cpu; > > /* Do not let RCU list add/delete happen in parallel */ > mutex_lock(&conn_lock); > > list_del_rcu(&conn->entry); > > /* Make sure everybody observes element removal */ > synchronize_rcu(); At this point, any reader who saw the element in the list is done, as you comment in fact says. But there might be a pointer to that element in the per-CPU variables, however, from this point forward it cannot be the case that one of the per-CPU variables gets set to the newly deleted element. Which is your next block of code... > /* > * At this point nobody sees @conn in the list, but > still we have > * dangling pointer @rcu_conn which _can_ point to @conn. Since > * nobody can observe @conn in the list, we guarantee > that IO path > * will not assign @conn to @rcu_conn, i.e. @rcu_conn > can be equal > * to @conn, but can never again become @conn. > */ > > /* > * Get @next connection from current @conn which is going to be > * removed. > */ > next = list_next_or_null_rr_rcu(&conn_list, &conn->entry, > typeof(*next), entry); > > /* > * Here @rcu_conn can be changed by reader side, so use @cmpxchg > * in order to keep fairness in load-balancing and do not touch > * the pointer which can be already changed by the IO path. > * > * Current path can be faster than IO path and the > following race > * exists: > * > * CPU0 CPU1 > * ---- ---- > * conn = rcu_dereferece(rcu_conn); > * next = list_next_or_null_rr_rcu(conn) > * > * conn == > cmpxchg(rcu_conn, conn, next); > * synchronize_rcu(); > * > * rcu_assign_pointer(rcu_conn, next); > * ^^^^^^^^^^^^^^^^^^ > * > * Here @rcu_conn is already equal to @next (done by > @cmpxchg), > * so assignment to the same pointer is harmless. > * > */ > for_each_possible_cpu(cpu) { > struct conn **rcu_conn; > > rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu); > if (*rcu_conn != conn) > /* > * This @cpu will never again pick up @conn, > * so it is safe just to choose next CPU. > */ > continue; ... Someone else might have picked up rcu_conn at this point... > if (conn == cmpxchg(rcu_conn, conn, next)) > /* > * @rcu_conn was successfully replaced > with @next, > * that means that someone can also hold a @conn > * and dereferencing it, so wait for a > grace period > * is required. > */ > need_to_wait = true; ... But if there was any possibility of that, need_to_wait is true, and it still cannot be the case that a reader finds the newly deleted element in the list, so they cannot find that element, so the pcpu_rcu_conn variables cannot be set to it. > } > if (need_to_wait) > synchronize_rcu(); And at this point, the reader that might have picked up rcu_conn just before the cmpxchg must have completed. (Good show, by the way! Many people miss the fact that they need this second synchronize_rcu().) Hmmm... What happens if this was the last element in the list, and the relevant pcpu_rcu_conn variable references that newly removed element? Taking a look at list_next_or_null_rcu() and thus at list_next_or_null_rcu(), and it does appear that you get NULL in that case, as is right and good. > mutex_unlock(&conn_lock); > > kfree(conn); > } > > > > > >> i.e. usage of the @next pointer is under an RCU critical section. > >> > >> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like > >> > macro that makes it more obvious that the whole thing need to be under > >> > a single RCU read-side critical section? Such a macro would of course be > >> > an infinite loop if the list never went empty, so presumably there would > >> > be a break or return statement in there somewhere. > >> > >> The difference is that I do not need a loop, I take the @next conn pointer, > >> save it for the following IO request and do IO for current IO request. > >> > >> It seems list_for_each_entry_rcu()-like with immediate "break" in the body > >> of the loop does not look nice, I personally do not like it, i.e.: > >> > >> > >> static struct conn *get_and_set_next_conn(void) > >> { > >> struct conn *conn; > >> > >> conn = rcu_dereferece(rcu_conn); > >> if (unlikely(!conn)) > >> return conn; > >> list_for_each_entry_rr_rcu(conn, &conn_list, > >> entry) { > >> break; > >> } > >> rcu_assign_pointer(rcu_conn, conn); > >> return conn; > >> } > >> > >> > >> or maybe I did not fully get your idea? > > > > That would not help at all because you are still leaking the pointer out > > of the RCU read-side critical section. That is completely and utterly > > broken unless you are somehow cleaning up rcu_conn when you remove > > the element. And getting that cleanup right is -extremely- tricky. > > Unless you have some sort of proof of correctness, you will get a NACK > > from me. > > I understand all the consequences of the leaking pointer, and of course > wrapped loop with RCU lock/unlock is simpler, but in order to keep > load-balancing and IO fairness avoiding any locks on IO path I've come > up with these RCU tricks and list_next_or_null_rr_rcu() macro. At first glance, it appears that you have handled this correctly. But I can make mistakes just as easily as the next guy, so what have you done to validate your algorithm? > > More like this: > > > > list_for_each_entry_rr_rcu(conn, &conn_list, entry) { > > do_something_with(conn); > > if (done_for_now()) > > break; > > } > > > >> >> + */ > >> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \ > >> >> +({ \ > >> >> + list_next_or_null_rcu(head, ptr, type, memb) ?: \ > >> >> + list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \ > >> > > >> > Are there any uses for this outside of RDMA? If not, I am with Linus. > >> > Define this within RDMA, where a smaller number of people can more > >> > easily be kept aware of the restrictions on use. If it turns out to be > >> > more generally useful, we can take a look at exactly what makes sense > >> > more globally. > >> > >> The only one list_for_each_entry_rcu()-like macro I am aware of is used in > >> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr(): > >> > >> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370 > >> > >> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing > >> my list_next_or_null_rr_rcu() variant? > > > > Let's start with the basics: It absolutely does not make sense to leak > > pointers across rcu_read_unlock() unless you have arranged something else > > to protect the pointed-to data in the meantime. There are a number of ways > > of implementing this protection. Again, what protection are you using? > > > > Your code at the above URL looks plausible to me at first glance: You > > do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then > > rcu_read_unlock(). But at second glance, it looks like htcx->queue > > might have the same vulnerability as rcu_conn in your earlier code. > > I am not the author of the code at the URL I specified. I provided the > link answering the question to show other possible users of round-robin > semantics for RCU list traversal. In my 'list_next_or_null_rr_rcu()' > case I can't use a loop, I leak the pointer and indeed have to be very > careful. But perhaps we can come up with some generic solution to cover > both cases: -rr loop and -rr next. Ah. Could you please check their update-side code to make sure that it looks correct to you? Thanx, Paul
On Mon, May 21, 2018 at 08:16:59AM -0700, Linus Torvalds wrote: > On Mon, May 21, 2018 at 6:51 AM Roman Penyaev < > roman.penyaev@profitbricks.com> wrote: > > > No, I continue from the pointer, which I assigned on the previous IO > > in order to send IO fairly and keep load balanced. > > Right. And that's exactly what has both me and Paul nervous. You're no > longer in the RCU domain. You're using a pointer where the lifetime has > nothing to do with RCU any more. > > Can it be done? Sure. But you need *other* locking for it (that you haven't > explained), and it's fragile as hell. He looks to actually have it right, but I would want to see a big comment on the read side noting the leak of the pointer and documenting why it is OK. Thanx, Paul > It's probably best to not use RCU for it at all, but depend on that "other > locking" that you have to have anyway, to keep the pointer valid over the > non-RCU region. > > Linus >
On Mon, May 21, 2018 at 5:33 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Mon, May 21, 2018 at 08:16:59AM -0700, Linus Torvalds wrote: >> On Mon, May 21, 2018 at 6:51 AM Roman Penyaev < >> roman.penyaev@profitbricks.com> wrote: >> >> > No, I continue from the pointer, which I assigned on the previous IO >> > in order to send IO fairly and keep load balanced. >> >> Right. And that's exactly what has both me and Paul nervous. You're no >> longer in the RCU domain. You're using a pointer where the lifetime has >> nothing to do with RCU any more. >> >> Can it be done? Sure. But you need *other* locking for it (that you haven't >> explained), and it's fragile as hell. > > He looks to actually have it right, but I would want to see a big comment > on the read side noting the leak of the pointer and documenting why it > is OK. Hi Paul and Linus, Should I resend current patch with more clear comments about how careful caller should be with a leaking pointer? Also I will update read side with a fat comment about "rcu_assign_pointer()" which leaks the pointer out of RCU domain and what is done to prevent nasty consequences. Does that sound acceptable? -- Roman
On Mon, May 21, 2018 at 5:31 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote: >> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney >> <paulmck@linux.vnet.ibm.com> wrote: >> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote: >> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney >> >> <paulmck@linux.vnet.ibm.com> wrote: >> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote: >> >> >> Function is going to be used in transport over RDMA module >> >> >> in subsequent patches. >> >> >> >> >> >> Function returns next element in round-robin fashion, >> >> >> i.e. head will be skipped. NULL will be returned if list >> >> >> is observed as empty. >> >> >> >> >> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> >> >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> >> >> >> Cc: linux-kernel@vger.kernel.org >> >> >> --- >> >> >> include/linux/rculist.h | 19 +++++++++++++++++++ >> >> >> 1 file changed, 19 insertions(+) >> >> >> >> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h >> >> >> index 127f534fec94..b0840d5ab25a 100644 >> >> >> --- a/include/linux/rculist.h >> >> >> +++ b/include/linux/rculist.h >> >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, >> >> >> }) >> >> >> >> >> >> /** >> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion. >> >> >> + * @head: the head for the list. >> >> >> + * @ptr: the list head to take the next element from. >> >> >> + * @type: the type of the struct this is embedded in. >> >> >> + * @memb: the name of the list_head within the struct. >> >> >> + * >> >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped, >> >> >> + * but if list is observed as empty, NULL will be returned. >> >> >> + * >> >> >> + * This primitive may safely run concurrently with the _rcu list-mutation >> >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock(). >> >> > >> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that >> >> > are round-robining a given list must all be under the same RCU read-side >> >> > critical section. For example, the following will break badly: >> >> > >> >> > struct foo *take_rr_step(struct list_head *head, struct foo *ptr) >> >> > { >> >> > struct foo *ret; >> >> > >> >> > rcu_read_lock(); >> >> > ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist); >> >> > rcu_read_unlock(); /* BUG */ >> >> > return ret; >> >> > } >> >> > >> >> > You need a big fat comment stating this, at the very least. The resulting >> >> > bug can be very hard to trigger and even harder to debug. >> >> > >> >> > And yes, I know that the same restriction applies to list_next_rcu() >> >> > and friends. The difference is that if you try to invoke those in an >> >> > infinite loop, you will be rapped on the knuckles as soon as you hit >> >> > the list header. Without that knuckle-rapping, RCU CPU stall warnings >> >> > might tempt people to do something broken like take_rr_step() above. >> >> >> >> Hi Paul, >> >> >> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA >> >> connection from the list in order to send a request, i.e. my code is >> >> something like the following: >> >> >> >> static struct conn *get_and_set_next_conn(void) >> >> { >> >> struct conn *conn; >> >> >> >> conn = rcu_dereferece(rcu_conn); >> >> if (unlikely(!conn)) >> >> return conn; >> > >> > Wait. Don't you need to restart from the beginning of the list in >> > this case? Or does the list never have anything added to it and is >> > rcu_conn initially the first element in the list? >> >> Hi Paul, >> >> No, I continue from the pointer, which I assigned on the previous IO >> in order to send IO fairly and keep load balanced. >> >> Initially @rcu_conn points to the first element, but elements can >> be deleted from the list and list can become empty. >> >> The deletion code is below. >> >> > >> >> conn = list_next_or_null_rr_rcu(&conn_list, >> >> &conn->entry, >> >> typeof(*conn), >> >> entry); >> >> rcu_assign_pointer(rcu_conn, conn); >> > >> > Linus is correct to doubt this code. You assign a pointer to the current >> > element to rcu_conn, which is presumably a per-CPU or global variable. >> > So far, so good ... >> >> I use per-CPU, in the first example I did not show that not to overcomplicate >> the code. >> >> > >> >> return conn; >> >> } >> >> >> >> rcu_read_lock(); >> >> conn = get_and_set_next_conn(); >> >> if (unlikely(!conn)) { >> >> /* ... */ >> >> } >> >> err = rdma_io(conn, request); >> >> rcu_read_unlock(); >> > >> > ... except that some other CPU might well remove the entry referenced by >> > rcu_conn at this point. It would have to wait for a grace period (e.g., >> > synchronize_rcu()), but the current CPU has exited its RCU read-side >> > critical section, and therefore is not blocking the grace period. >> > Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it >> > might well be referencing the freelist, or, even worse, some other type >> > of structure. >> > >> > What is your code doing to prevent this from happening? (There are ways, >> > but I want to know what you were doing in this case.) >> >> Probably I should have shown the way of removal at the very beginning, >> my fault. So deletion looks as the following (a bit changed and >> simplified for the sake of clearness): > > Thank you! Let's see... > >> static void remove_connection(conn) >> { >> bool need_to_wait = false; >> int cpu; >> >> /* Do not let RCU list add/delete happen in parallel */ >> mutex_lock(&conn_lock); >> >> list_del_rcu(&conn->entry); >> >> /* Make sure everybody observes element removal */ >> synchronize_rcu(); > > At this point, any reader who saw the element in the list is done, as you > comment in fact says. But there might be a pointer to that element in the > per-CPU variables, however, from this point forward it cannot be the case > that one of the per-CPU variables gets set to the newly deleted element. > Which is your next block of code... > >> /* >> * At this point nobody sees @conn in the list, but >> still we have >> * dangling pointer @rcu_conn which _can_ point to @conn. Since >> * nobody can observe @conn in the list, we guarantee >> that IO path >> * will not assign @conn to @rcu_conn, i.e. @rcu_conn >> can be equal >> * to @conn, but can never again become @conn. >> */ >> >> /* >> * Get @next connection from current @conn which is going to be >> * removed. >> */ >> next = list_next_or_null_rr_rcu(&conn_list, &conn->entry, >> typeof(*next), entry); >> >> /* >> * Here @rcu_conn can be changed by reader side, so use @cmpxchg >> * in order to keep fairness in load-balancing and do not touch >> * the pointer which can be already changed by the IO path. >> * >> * Current path can be faster than IO path and the >> following race >> * exists: >> * >> * CPU0 CPU1 >> * ---- ---- >> * conn = rcu_dereferece(rcu_conn); >> * next = list_next_or_null_rr_rcu(conn) >> * >> * conn == >> cmpxchg(rcu_conn, conn, next); >> * synchronize_rcu(); >> * >> * rcu_assign_pointer(rcu_conn, next); >> * ^^^^^^^^^^^^^^^^^^ >> * >> * Here @rcu_conn is already equal to @next (done by >> @cmpxchg), >> * so assignment to the same pointer is harmless. >> * >> */ >> for_each_possible_cpu(cpu) { >> struct conn **rcu_conn; >> >> rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu); >> if (*rcu_conn != conn) >> /* >> * This @cpu will never again pick up @conn, >> * so it is safe just to choose next CPU. >> */ >> continue; > > ... Someone else might have picked up rcu_conn at this point... > >> if (conn == cmpxchg(rcu_conn, conn, next)) >> /* >> * @rcu_conn was successfully replaced >> with @next, >> * that means that someone can also hold a @conn >> * and dereferencing it, so wait for a >> grace period >> * is required. >> */ >> need_to_wait = true; > > ... But if there was any possibility of that, need_to_wait is true, and it > still cannot be the case that a reader finds the newly deleted element > in the list, so they cannot find that element, so the pcpu_rcu_conn > variables cannot be set to it. > >> } >> if (need_to_wait) >> synchronize_rcu(); > > And at this point, the reader that might have picked up rcu_conn > just before the cmpxchg must have completed. (Good show, by the way! > Many people miss the fact that they need this second synchronize_rcu().) > > Hmmm... What happens if this was the last element in the list, and > the relevant pcpu_rcu_conn variable references that newly removed > element? Taking a look at list_next_or_null_rcu() and thus at > list_next_or_null_rcu(), and it does appear that you get NULL in that > case, as is right and good. Thanks for explicit comments. What I always lack is a good description. Indeed it is worth to mention that @next can become NULL if that was the last element, will add comments then. > >> mutex_unlock(&conn_lock); >> >> kfree(conn); >> } >> >> >> > >> >> i.e. usage of the @next pointer is under an RCU critical section. >> >> >> >> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like >> >> > macro that makes it more obvious that the whole thing need to be under >> >> > a single RCU read-side critical section? Such a macro would of course be >> >> > an infinite loop if the list never went empty, so presumably there would >> >> > be a break or return statement in there somewhere. >> >> >> >> The difference is that I do not need a loop, I take the @next conn pointer, >> >> save it for the following IO request and do IO for current IO request. >> >> >> >> It seems list_for_each_entry_rcu()-like with immediate "break" in the body >> >> of the loop does not look nice, I personally do not like it, i.e.: >> >> >> >> >> >> static struct conn *get_and_set_next_conn(void) >> >> { >> >> struct conn *conn; >> >> >> >> conn = rcu_dereferece(rcu_conn); >> >> if (unlikely(!conn)) >> >> return conn; >> >> list_for_each_entry_rr_rcu(conn, &conn_list, >> >> entry) { >> >> break; >> >> } >> >> rcu_assign_pointer(rcu_conn, conn); >> >> return conn; >> >> } >> >> >> >> >> >> or maybe I did not fully get your idea? >> > >> > That would not help at all because you are still leaking the pointer out >> > of the RCU read-side critical section. That is completely and utterly >> > broken unless you are somehow cleaning up rcu_conn when you remove >> > the element. And getting that cleanup right is -extremely- tricky. >> > Unless you have some sort of proof of correctness, you will get a NACK >> > from me. >> >> I understand all the consequences of the leaking pointer, and of course >> wrapped loop with RCU lock/unlock is simpler, but in order to keep >> load-balancing and IO fairness avoiding any locks on IO path I've come >> up with these RCU tricks and list_next_or_null_rr_rcu() macro. > > At first glance, it appears that you have handled this correctly. But I > can make mistakes just as easily as the next guy, so what have you done > to validate your algorithm? What we only have is a set of unit-tests which run every night. For this particular case there is a special stress test which adds/removes RDMA connections in a loop while IO is performing. Unfortunately I did not write any synthetic test-case just for testing this isolated algorithm. (e.g. module with only these RCU functions can be created and list modification can be easily simulated. Should not be very much difficult) Do you think it is worth to do? Unfortunately it also does not prove correctness, like Linus said it is fragile as hell, but for sure I can burn CPUs testing it for couple of days. >> > More like this: >> > >> > list_for_each_entry_rr_rcu(conn, &conn_list, entry) { >> > do_something_with(conn); >> > if (done_for_now()) >> > break; >> > } >> > >> >> >> + */ >> >> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \ >> >> >> +({ \ >> >> >> + list_next_or_null_rcu(head, ptr, type, memb) ?: \ >> >> >> + list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \ >> >> > >> >> > Are there any uses for this outside of RDMA? If not, I am with Linus. >> >> > Define this within RDMA, where a smaller number of people can more >> >> > easily be kept aware of the restrictions on use. If it turns out to be >> >> > more generally useful, we can take a look at exactly what makes sense >> >> > more globally. >> >> >> >> The only one list_for_each_entry_rcu()-like macro I am aware of is used in >> >> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr(): >> >> >> >> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370 >> >> >> >> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing >> >> my list_next_or_null_rr_rcu() variant? >> > >> > Let's start with the basics: It absolutely does not make sense to leak >> > pointers across rcu_read_unlock() unless you have arranged something else >> > to protect the pointed-to data in the meantime. There are a number of ways >> > of implementing this protection. Again, what protection are you using? >> > >> > Your code at the above URL looks plausible to me at first glance: You >> > do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then >> > rcu_read_unlock(). But at second glance, it looks like htcx->queue >> > might have the same vulnerability as rcu_conn in your earlier code. >> >> I am not the author of the code at the URL I specified. I provided the >> link answering the question to show other possible users of round-robin >> semantics for RCU list traversal. In my 'list_next_or_null_rr_rcu()' >> case I can't use a loop, I leak the pointer and indeed have to be very >> careful. But perhaps we can come up with some generic solution to cover >> both cases: -rr loop and -rr next. > > Ah. Could you please check their update-side code to make sure that it > looks correct to you? BTW authors of this particular -rr loop are in CC, but they keep silence so far :) According to my shallow understanding @queue can't disappear from the list on this calling path, i.e. existence of a @queue should be guaranteed by the fact that the queue has no IO in-flights and only then it can be removed. -- Roman
On Tue, May 22, 2018 at 11:09:08AM +0200, Roman Penyaev wrote: > On Mon, May 21, 2018 at 5:33 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > On Mon, May 21, 2018 at 08:16:59AM -0700, Linus Torvalds wrote: > >> On Mon, May 21, 2018 at 6:51 AM Roman Penyaev < > >> roman.penyaev@profitbricks.com> wrote: > >> > >> > No, I continue from the pointer, which I assigned on the previous IO > >> > in order to send IO fairly and keep load balanced. > >> > >> Right. And that's exactly what has both me and Paul nervous. You're no > >> longer in the RCU domain. You're using a pointer where the lifetime has > >> nothing to do with RCU any more. > >> > >> Can it be done? Sure. But you need *other* locking for it (that you haven't > >> explained), and it's fragile as hell. > > > > He looks to actually have it right, but I would want to see a big comment > > on the read side noting the leak of the pointer and documenting why it > > is OK. > > Hi Paul and Linus, > > Should I resend current patch with more clear comments about how careful > caller should be with a leaking pointer? Also I will update read side > with a fat comment about "rcu_assign_pointer()" which leaks the pointer > out of RCU domain and what is done to prevent nasty consequences. > Does that sound acceptable? That sounds good to me. Thanx, Paul
On Tue, May 22, 2018 at 2:09 AM Roman Penyaev < roman.penyaev@profitbricks.com> wrote: > Should I resend current patch with more clear comments about how careful > caller should be with a leaking pointer? No. Even if we go your way, there is *one* single user, and that one is special and needs to take a lot more care. Just roll your own version, and make it an inline function like I've asked now now many times, and add a shit-ton of explanations of why it's safe to use in that *one* situation. I don't want any crazy and unsafe stuff in the generic header file that absolutely *nobody* should ever use. Linus
On Tue, May 22, 2018 at 11:09:16AM +0200, Roman Penyaev wrote: > On Mon, May 21, 2018 at 5:31 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote: > >> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney > >> <paulmck@linux.vnet.ibm.com> wrote: > >> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote: > >> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney > >> >> <paulmck@linux.vnet.ibm.com> wrote: > >> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote: > >> >> >> Function is going to be used in transport over RDMA module > >> >> >> in subsequent patches. > >> >> >> > >> >> >> Function returns next element in round-robin fashion, > >> >> >> i.e. head will be skipped. NULL will be returned if list > >> >> >> is observed as empty. > >> >> >> > >> >> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> > >> >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > >> >> >> Cc: linux-kernel@vger.kernel.org > >> >> >> --- > >> >> >> include/linux/rculist.h | 19 +++++++++++++++++++ > >> >> >> 1 file changed, 19 insertions(+) > >> >> >> > >> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h > >> >> >> index 127f534fec94..b0840d5ab25a 100644 > >> >> >> --- a/include/linux/rculist.h > >> >> >> +++ b/include/linux/rculist.h > >> >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, > >> >> >> }) > >> >> >> > >> >> >> /** > >> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion. > >> >> >> + * @head: the head for the list. > >> >> >> + * @ptr: the list head to take the next element from. > >> >> >> + * @type: the type of the struct this is embedded in. > >> >> >> + * @memb: the name of the list_head within the struct. > >> >> >> + * > >> >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped, > >> >> >> + * but if list is observed as empty, NULL will be returned. > >> >> >> + * > >> >> >> + * This primitive may safely run concurrently with the _rcu list-mutation > >> >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock(). > >> >> > > >> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that > >> >> > are round-robining a given list must all be under the same RCU read-side > >> >> > critical section. For example, the following will break badly: > >> >> > > >> >> > struct foo *take_rr_step(struct list_head *head, struct foo *ptr) > >> >> > { > >> >> > struct foo *ret; > >> >> > > >> >> > rcu_read_lock(); > >> >> > ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist); > >> >> > rcu_read_unlock(); /* BUG */ > >> >> > return ret; > >> >> > } > >> >> > > >> >> > You need a big fat comment stating this, at the very least. The resulting > >> >> > bug can be very hard to trigger and even harder to debug. > >> >> > > >> >> > And yes, I know that the same restriction applies to list_next_rcu() > >> >> > and friends. The difference is that if you try to invoke those in an > >> >> > infinite loop, you will be rapped on the knuckles as soon as you hit > >> >> > the list header. Without that knuckle-rapping, RCU CPU stall warnings > >> >> > might tempt people to do something broken like take_rr_step() above. > >> >> > >> >> Hi Paul, > >> >> > >> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA > >> >> connection from the list in order to send a request, i.e. my code is > >> >> something like the following: > >> >> > >> >> static struct conn *get_and_set_next_conn(void) > >> >> { > >> >> struct conn *conn; > >> >> > >> >> conn = rcu_dereferece(rcu_conn); > >> >> if (unlikely(!conn)) > >> >> return conn; > >> > > >> > Wait. Don't you need to restart from the beginning of the list in > >> > this case? Or does the list never have anything added to it and is > >> > rcu_conn initially the first element in the list? > >> > >> Hi Paul, > >> > >> No, I continue from the pointer, which I assigned on the previous IO > >> in order to send IO fairly and keep load balanced. > >> > >> Initially @rcu_conn points to the first element, but elements can > >> be deleted from the list and list can become empty. > >> > >> The deletion code is below. > >> > >> > > >> >> conn = list_next_or_null_rr_rcu(&conn_list, > >> >> &conn->entry, > >> >> typeof(*conn), > >> >> entry); > >> >> rcu_assign_pointer(rcu_conn, conn); > >> > > >> > Linus is correct to doubt this code. You assign a pointer to the current > >> > element to rcu_conn, which is presumably a per-CPU or global variable. > >> > So far, so good ... > >> > >> I use per-CPU, in the first example I did not show that not to overcomplicate > >> the code. > >> > >> > > >> >> return conn; > >> >> } > >> >> > >> >> rcu_read_lock(); > >> >> conn = get_and_set_next_conn(); > >> >> if (unlikely(!conn)) { > >> >> /* ... */ > >> >> } > >> >> err = rdma_io(conn, request); > >> >> rcu_read_unlock(); > >> > > >> > ... except that some other CPU might well remove the entry referenced by > >> > rcu_conn at this point. It would have to wait for a grace period (e.g., > >> > synchronize_rcu()), but the current CPU has exited its RCU read-side > >> > critical section, and therefore is not blocking the grace period. > >> > Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it > >> > might well be referencing the freelist, or, even worse, some other type > >> > of structure. > >> > > >> > What is your code doing to prevent this from happening? (There are ways, > >> > but I want to know what you were doing in this case.) > >> > >> Probably I should have shown the way of removal at the very beginning, > >> my fault. So deletion looks as the following (a bit changed and > >> simplified for the sake of clearness): > > > > Thank you! Let's see... > > > >> static void remove_connection(conn) > >> { > >> bool need_to_wait = false; > >> int cpu; > >> > >> /* Do not let RCU list add/delete happen in parallel */ > >> mutex_lock(&conn_lock); > >> > >> list_del_rcu(&conn->entry); > >> > >> /* Make sure everybody observes element removal */ > >> synchronize_rcu(); > > > > At this point, any reader who saw the element in the list is done, as you > > comment in fact says. But there might be a pointer to that element in the > > per-CPU variables, however, from this point forward it cannot be the case > > that one of the per-CPU variables gets set to the newly deleted element. > > Which is your next block of code... > > > >> /* > >> * At this point nobody sees @conn in the list, but > >> still we have > >> * dangling pointer @rcu_conn which _can_ point to @conn. Since > >> * nobody can observe @conn in the list, we guarantee > >> that IO path > >> * will not assign @conn to @rcu_conn, i.e. @rcu_conn > >> can be equal > >> * to @conn, but can never again become @conn. > >> */ > >> > >> /* > >> * Get @next connection from current @conn which is going to be > >> * removed. > >> */ > >> next = list_next_or_null_rr_rcu(&conn_list, &conn->entry, > >> typeof(*next), entry); > >> > >> /* > >> * Here @rcu_conn can be changed by reader side, so use @cmpxchg > >> * in order to keep fairness in load-balancing and do not touch > >> * the pointer which can be already changed by the IO path. > >> * > >> * Current path can be faster than IO path and the > >> following race > >> * exists: > >> * > >> * CPU0 CPU1 > >> * ---- ---- > >> * conn = rcu_dereferece(rcu_conn); > >> * next = list_next_or_null_rr_rcu(conn) > >> * > >> * conn == > >> cmpxchg(rcu_conn, conn, next); > >> * synchronize_rcu(); > >> * > >> * rcu_assign_pointer(rcu_conn, next); > >> * ^^^^^^^^^^^^^^^^^^ > >> * > >> * Here @rcu_conn is already equal to @next (done by > >> @cmpxchg), > >> * so assignment to the same pointer is harmless. > >> * > >> */ > >> for_each_possible_cpu(cpu) { > >> struct conn **rcu_conn; > >> > >> rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu); > >> if (*rcu_conn != conn) > >> /* > >> * This @cpu will never again pick up @conn, > >> * so it is safe just to choose next CPU. > >> */ > >> continue; > > > > ... Someone else might have picked up rcu_conn at this point... > > > >> if (conn == cmpxchg(rcu_conn, conn, next)) > >> /* > >> * @rcu_conn was successfully replaced > >> with @next, > >> * that means that someone can also hold a @conn > >> * and dereferencing it, so wait for a > >> grace period > >> * is required. > >> */ > >> need_to_wait = true; > > > > ... But if there was any possibility of that, need_to_wait is true, and it > > still cannot be the case that a reader finds the newly deleted element > > in the list, so they cannot find that element, so the pcpu_rcu_conn > > variables cannot be set to it. > > > >> } > >> if (need_to_wait) > >> synchronize_rcu(); > > > > And at this point, the reader that might have picked up rcu_conn > > just before the cmpxchg must have completed. (Good show, by the way! > > Many people miss the fact that they need this second synchronize_rcu().) > > > > Hmmm... What happens if this was the last element in the list, and > > the relevant pcpu_rcu_conn variable references that newly removed > > element? Taking a look at list_next_or_null_rcu() and thus at > > list_next_or_null_rcu(), and it does appear that you get NULL in that > > case, as is right and good. > > Thanks for explicit comments. What I always lack is a good description. > Indeed it is worth to mention that @next can become NULL if that was the > last element, will add comments then. Very good. And for the record, I agree with Linus that this needs to be private. I am very glad that you got it right (or appear to have), and there might come a time when it needs to be generally available, but that time is certainly not now and might well never come. > >> mutex_unlock(&conn_lock); > >> > >> kfree(conn); > >> } > >> > >> > >> > > >> >> i.e. usage of the @next pointer is under an RCU critical section. > >> >> > >> >> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like > >> >> > macro that makes it more obvious that the whole thing need to be under > >> >> > a single RCU read-side critical section? Such a macro would of course be > >> >> > an infinite loop if the list never went empty, so presumably there would > >> >> > be a break or return statement in there somewhere. > >> >> > >> >> The difference is that I do not need a loop, I take the @next conn pointer, > >> >> save it for the following IO request and do IO for current IO request. > >> >> > >> >> It seems list_for_each_entry_rcu()-like with immediate "break" in the body > >> >> of the loop does not look nice, I personally do not like it, i.e.: > >> >> > >> >> > >> >> static struct conn *get_and_set_next_conn(void) > >> >> { > >> >> struct conn *conn; > >> >> > >> >> conn = rcu_dereferece(rcu_conn); > >> >> if (unlikely(!conn)) > >> >> return conn; > >> >> list_for_each_entry_rr_rcu(conn, &conn_list, > >> >> entry) { > >> >> break; > >> >> } > >> >> rcu_assign_pointer(rcu_conn, conn); > >> >> return conn; > >> >> } > >> >> > >> >> > >> >> or maybe I did not fully get your idea? > >> > > >> > That would not help at all because you are still leaking the pointer out > >> > of the RCU read-side critical section. That is completely and utterly > >> > broken unless you are somehow cleaning up rcu_conn when you remove > >> > the element. And getting that cleanup right is -extremely- tricky. > >> > Unless you have some sort of proof of correctness, you will get a NACK > >> > from me. > >> > >> I understand all the consequences of the leaking pointer, and of course > >> wrapped loop with RCU lock/unlock is simpler, but in order to keep > >> load-balancing and IO fairness avoiding any locks on IO path I've come > >> up with these RCU tricks and list_next_or_null_rr_rcu() macro. > > > > At first glance, it appears that you have handled this correctly. But I > > can make mistakes just as easily as the next guy, so what have you done > > to validate your algorithm? > > What we only have is a set of unit-tests which run every night. For this > particular case there is a special stress test which adds/removes RDMA > connections in a loop while IO is performing. Unfortunately I did not > write any synthetic test-case just for testing this isolated algorithm. > (e.g. module with only these RCU functions can be created and list > modification can be easily simulated. Should not be very much difficult) > Do you think it is worth to do? Unfortunately it also does not prove > correctness, like Linus said it is fragile as hell, but for sure I can > burn CPUs testing it for couple of days. Yes, this definitely deserves some -serious- focused stress testing. And formal verification, if that can be arranged. The Linux-kernel memory model does handle RCU, and also minimal linked lists, so might be worth a try. Unfortunately, I don't know of any other tooling that handles RCU. :-( > >> > More like this: > >> > > >> > list_for_each_entry_rr_rcu(conn, &conn_list, entry) { > >> > do_something_with(conn); > >> > if (done_for_now()) > >> > break; > >> > } > >> > > >> >> >> + */ > >> >> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \ > >> >> >> +({ \ > >> >> >> + list_next_or_null_rcu(head, ptr, type, memb) ?: \ > >> >> >> + list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \ > >> >> > > >> >> > Are there any uses for this outside of RDMA? If not, I am with Linus. > >> >> > Define this within RDMA, where a smaller number of people can more > >> >> > easily be kept aware of the restrictions on use. If it turns out to be > >> >> > more generally useful, we can take a look at exactly what makes sense > >> >> > more globally. > >> >> > >> >> The only one list_for_each_entry_rcu()-like macro I am aware of is used in > >> >> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr(): > >> >> > >> >> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370 > >> >> > >> >> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing > >> >> my list_next_or_null_rr_rcu() variant? > >> > > >> > Let's start with the basics: It absolutely does not make sense to leak > >> > pointers across rcu_read_unlock() unless you have arranged something else > >> > to protect the pointed-to data in the meantime. There are a number of ways > >> > of implementing this protection. Again, what protection are you using? > >> > > >> > Your code at the above URL looks plausible to me at first glance: You > >> > do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then > >> > rcu_read_unlock(). But at second glance, it looks like htcx->queue > >> > might have the same vulnerability as rcu_conn in your earlier code. > >> > >> I am not the author of the code at the URL I specified. I provided the > >> link answering the question to show other possible users of round-robin > >> semantics for RCU list traversal. In my 'list_next_or_null_rr_rcu()' > >> case I can't use a loop, I leak the pointer and indeed have to be very > >> careful. But perhaps we can come up with some generic solution to cover > >> both cases: -rr loop and -rr next. > > > > Ah. Could you please check their update-side code to make sure that it > > looks correct to you? > > BTW authors of this particular -rr loop are in CC, but they keep silence > so far :) According to my shallow understanding @queue can't disappear > from the list on this calling path, i.e. existence of a @queue should be > guaranteed by the fact that the queue has no IO in-flights and only then > it can be removed. But something has to enforce that, right? For example, how does the code know that there are no longer any I/Os in flight, and how does it know that more I/Os won't be issued in the near future, possibly based on old state? But yes, the authors are free to respond. Thanx, Paul
On Tue, May 22, 2018 at 09:38:13AM -0700, Linus Torvalds wrote: > On Tue, May 22, 2018 at 2:09 AM Roman Penyaev < > roman.penyaev@profitbricks.com> wrote: > > > Should I resend current patch with more clear comments about how careful > > caller should be with a leaking pointer? > > No. Even if we go your way, there is *one* single user, and that one is > special and needs to take a lot more care. > > Just roll your own version, and make it an inline function like I've asked > now now many times, and add a shit-ton of explanations of why it's safe to > use in that *one* situation. > > I don't want any crazy and unsafe stuff in the generic header file that > absolutely *nobody* should ever use. Completely agreed! I was perhaps foolishly assuming that they would be making that adjustment based on earlier emails, but yes, I should have explicitly stated this requirement in my earlier reply. Thanx, Paul
diff --git a/include/linux/rculist.h b/include/linux/rculist.h index 127f534fec94..b0840d5ab25a 100644 --- a/include/linux/rculist.h +++ b/include/linux/rculist.h @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, }) /** + * list_next_or_null_rr_rcu - get next list element in round-robin fashion. + * @head: the head for the list. + * @ptr: the list head to take the next element from. + * @type: the type of the struct this is embedded in. + * @memb: the name of the list_head within the struct. + * + * Next element returned in round-robin fashion, i.e. head will be skipped, + * but if list is observed as empty, NULL will be returned. + * + * This primitive may safely run concurrently with the _rcu list-mutation + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock(). + */ +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \ +({ \ + list_next_or_null_rcu(head, ptr, type, memb) ?: \ + list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \ +}) + +/** * list_for_each_entry_rcu - iterate over rcu list of given type * @pos: the type * to use as a loop cursor. * @head: the head for your list.
Function is going to be used in transport over RDMA module in subsequent patches. Function returns next element in round-robin fashion, i.e. head will be skipped. NULL will be returned if list is observed as empty. Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: linux-kernel@vger.kernel.org --- include/linux/rculist.h | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)