[v2,01/26] rculist: introduce list_next_or_null_rr_rcu()

Message ID	20180518130413.16997-2-roman.penyaev@profitbricks.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> From: Roman Pen <roman.penyaev@profitbricks.com> To: linux-block@vger.kernel.org, linux-rdma@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org>, Sagi Grimberg <sagi@grimberg.me>, Bart Van Assche <bart.vanassche@sandisk.com>, Or Gerlitz <ogerlitz@mellanox.com>, Doug Ledford <dledford@redhat.com>, Swapnil Ingle <swapnil.ingle@profitbricks.com>, Danil Kipnis <danil.kipnis@profitbricks.com>, Jack Wang <jinpu.wang@profitbricks.com>, Roman Pen <roman.penyaev@profitbricks.com>, "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>, linux-kernel@vger.kernel.org Subject: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu() Date: Fri, 18 May 2018 15:03:48 +0200 Message-Id: <20180518130413.16997-2-roman.penyaev@profitbricks.com> In-Reply-To: <20180518130413.16997-1-roman.penyaev@profitbricks.com> References: <20180518130413.16997-1-roman.penyaev@profitbricks.com> Sender: linux-block-owner@vger.kernel.org Precedence: bulk

Roman Pen May 18, 2018, 1:03 p.m. UTC

Function is going to be used in transport over RDMA module
in subsequent patches.

Function returns next element in round-robin fashion,
i.e. head will be skipped.  NULL will be returned if list
is observed as empty.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: linux-kernel@vger.kernel.org
---
 include/linux/rculist.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Linus Torvalds May 18, 2018, 4:56 p.m. UTC | #1

On Fri, May 18, 2018 at 6:07 AM Roman Pen <roman.penyaev@profitbricks.com>
wrote:

> Function is going to be used in transport over RDMA module
> in subsequent patches.

Does this really merit its own helper macro in a generic header?

It honestly smells more like "just have an inline helper function that is
specific to rdma" to me. Particularly since it's probably just one specific
list where you want this oddly specific behavior.

Also, if we really want a round-robin list traversal macro, this isn't the
way it should be implemented, I suspect, and it probably shouldn't be
RCU-specific to begin with.

Side note: I notice that I should already  have been more critical of even
the much simpler "list_next_or_null_rcu()" macro. The "documentation"
comment above the macro is pure and utter cut-and-paste garbage.

Paul, mind giving this a look?

                 Linus

Paul E. McKenney May 19, 2018, 4:37 p.m. UTC | #2

On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
> Function is going to be used in transport over RDMA module
> in subsequent patches.
> 
> Function returns next element in round-robin fashion,
> i.e. head will be skipped.  NULL will be returned if list
> is observed as empty.
> 
> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: linux-kernel@vger.kernel.org
> ---
>  include/linux/rculist.h | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 127f534fec94..b0840d5ab25a 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>  })
> 
>  /**
> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
> + * @head:	the head for the list.
> + * @ptr:        the list head to take the next element from.
> + * @type:       the type of the struct this is embedded in.
> + * @memb:       the name of the list_head within the struct.
> + *
> + * Next element returned in round-robin fashion, i.e. head will be skipped,
> + * but if list is observed as empty, NULL will be returned.
> + *
> + * This primitive may safely run concurrently with the _rcu list-mutation
> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().

Of course, all the set of list_next_or_null_rr_rcu() invocations that
are round-robining a given list must all be under the same RCU read-side
critical section.  For example, the following will break badly:

	struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
	{
		struct foo *ret;

		rcu_read_lock();
		ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
		rcu_read_unlock();  /* BUG */
		return ret;
	}

You need a big fat comment stating this, at the very least.  The resulting
bug can be very hard to trigger and even harder to debug.

And yes, I know that the same restriction applies to list_next_rcu()
and friends.  The difference is that if you try to invoke those in an
infinite loop, you will be rapped on the knuckles as soon as you hit
the list header.  Without that knuckle-rapping, RCU CPU stall warnings
might tempt people to do something broken like take_rr_step() above.

Is it possible to instead do some sort of list_for_each_entry_rcu()-like
macro that makes it more obvious that the whole thing need to be under
a single RCU read-side critical section?  Such a macro would of course be
an infinite loop if the list never went empty, so presumably there would
be a break or return statement in there somewhere.

> + */
> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
> +({ \
> +	list_next_or_null_rcu(head, ptr, type, memb) ?: \
> +		list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \

Are there any uses for this outside of RDMA?  If not, I am with Linus.
Define this within RDMA, where a smaller number of people can more
easily be kept aware of the restrictions on use.  If it turns out to be
more generally useful, we can take a look at exactly what makes sense
more globally.

Even within RDMA, I strongly recommend the big fat comment called out above.
And the list_for_each_entry_rcu()-like formulation, if that can be made to
work within RDMA's code structure.

Seem reasonable, or am I missing something here?

								Thanx, Paul

> +})
> +
> +/**
>   * list_for_each_entry_rcu	-	iterate over rcu list of given type
>   * @pos:	the type * to use as a loop cursor.
>   * @head:	the head for your list.
> -- 
> 2.13.1
>

Roman Pen May 19, 2018, 8:20 p.m. UTC | #3

On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
>> Function is going to be used in transport over RDMA module
>> in subsequent patches.
>>
>> Function returns next element in round-robin fashion,
>> i.e. head will be skipped.  NULL will be returned if list
>> is observed as empty.
>>
>> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
>> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>  include/linux/rculist.h | 19 +++++++++++++++++++
>>  1 file changed, 19 insertions(+)
>>
>> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> index 127f534fec94..b0840d5ab25a 100644
>> --- a/include/linux/rculist.h
>> +++ b/include/linux/rculist.h
>> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>>  })
>>
>>  /**
>> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
>> + * @head:    the head for the list.
>> + * @ptr:        the list head to take the next element from.
>> + * @type:       the type of the struct this is embedded in.
>> + * @memb:       the name of the list_head within the struct.
>> + *
>> + * Next element returned in round-robin fashion, i.e. head will be skipped,
>> + * but if list is observed as empty, NULL will be returned.
>> + *
>> + * This primitive may safely run concurrently with the _rcu list-mutation
>> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
>
> Of course, all the set of list_next_or_null_rr_rcu() invocations that
> are round-robining a given list must all be under the same RCU read-side
> critical section.  For example, the following will break badly:
>
>         struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
>         {
>                 struct foo *ret;
>
>                 rcu_read_lock();
>                 ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
>                 rcu_read_unlock();  /* BUG */
>                 return ret;
>         }
>
> You need a big fat comment stating this, at the very least.  The resulting
> bug can be very hard to trigger and even harder to debug.
>
> And yes, I know that the same restriction applies to list_next_rcu()
> and friends.  The difference is that if you try to invoke those in an
> infinite loop, you will be rapped on the knuckles as soon as you hit
> the list header.  Without that knuckle-rapping, RCU CPU stall warnings
> might tempt people to do something broken like take_rr_step() above.

Hi Paul,

I need -rr behaviour for doing IO load-balancing when I choose next RDMA
connection from the list in order to send a request, i.e. my code is
something like the following:

        static struct conn *get_and_set_next_conn(void)
        {
                struct conn *conn;

                conn = rcu_dereferece(rcu_conn);
                if (unlikely(!conn))
                    return conn;
                conn = list_next_or_null_rr_rcu(&conn_list,
                                                &conn->entry,
                                                typeof(*conn),
                                                entry);
                rcu_assign_pointer(rcu_conn, conn);
                return conn;
        }

        rcu_read_lock();
        conn = get_and_set_next_conn();
        if (unlikely(!conn)) {
                /* ... */
        }
        err = rdma_io(conn, request);
        rcu_read_unlock();

i.e. usage of the @next pointer is under an RCU critical section.

> Is it possible to instead do some sort of list_for_each_entry_rcu()-like
> macro that makes it more obvious that the whole thing need to be under
> a single RCU read-side critical section?  Such a macro would of course be
> an infinite loop if the list never went empty, so presumably there would
> be a break or return statement in there somewhere.

The difference is that I do not need a loop, I take the @next conn pointer,
save it for the following IO request and do IO for current IO request.

It seems list_for_each_entry_rcu()-like with immediate "break" in the body
of the loop does not look nice, I personally do not like it, i.e.:


        static struct conn *get_and_set_next_conn(void)
        {
                struct conn *conn;

                conn = rcu_dereferece(rcu_conn);
                if (unlikely(!conn))
                    return conn;
                list_for_each_entry_rr_rcu(conn, &conn_list,
                                           entry) {
                        break;
                }
                rcu_assign_pointer(rcu_conn, conn);
                return conn;
        }


or maybe I did not fully get your idea?

>> + */
>> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
>> +({ \
>> +     list_next_or_null_rcu(head, ptr, type, memb) ?: \
>> +             list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
>
> Are there any uses for this outside of RDMA?  If not, I am with Linus.
> Define this within RDMA, where a smaller number of people can more
> easily be kept aware of the restrictions on use.  If it turns out to be
> more generally useful, we can take a look at exactly what makes sense
> more globally.

The only one list_for_each_entry_rcu()-like macro I am aware of is used in
block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():

https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370

Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
my list_next_or_null_rr_rcu() variant?

> Even within RDMA, I strongly recommend the big fat comment called out above.
> And the list_for_each_entry_rcu()-like formulation, if that can be made to
> work within RDMA's code structure.
>
> Seem reasonable, or am I missing something here?

Thanks for clear explanation.

--
Roman

Roman Pen May 19, 2018, 8:25 p.m. UTC | #4

On Fri, May 18, 2018 at 6:56 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, May 18, 2018 at 6:07 AM Roman Pen <roman.penyaev@profitbricks.com>
> wrote:
>
>> Function is going to be used in transport over RDMA module
>> in subsequent patches.
>
> Does this really merit its own helper macro in a generic header?
>
> It honestly smells more like "just have an inline helper function that is
> specific to rdma" to me. Particularly since it's probably just one specific
> list where you want this oddly specific behavior.
>
> Also, if we really want a round-robin list traversal macro, this isn't the
> way it should be implemented, I suspect, and it probably shouldn't be
> RCU-specific to begin with.

Hi Linus,

Another one list_for_each_entry_rcu()-like macro I am aware of is used in
block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():

https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370

Can we do something generic with -rr semantics to cover both cases?

--
Roman

>
> Side note: I notice that I should already  have been more critical of even
> the much simpler "list_next_or_null_rcu()" macro. The "documentation"
> comment above the macro is pure and utter cut-and-paste garbage.
>
> Paul, mind giving this a look?
>
>                  Linus

Linus Torvalds May 19, 2018, 8:56 p.m. UTC | #5

On Sat, May 19, 2018 at 1:21 PM Roman Penyaev <
roman.penyaev@profitbricks.com> wrote:

> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> connection from the list in order to send a request, i.e. my code is
> something like the following:
[ incomplete pseudoicode ]
> i.e. usage of the @next pointer is under an RCU critical section.

That's not enough. The whole chain to look up the pointer you are taking
'next' of needs to be under RCU, and that's not clear from your example.

It's *probably* the case, but basically you have to prove that the starting
point is still on the same RCU list. That wasn't clear from your example.

The above is (as Paul said) true of list_next_rcu() too, so it's not like
this is anything specific to the 'rr' version.

              Linus

Linus Torvalds May 19, 2018, 9:04 p.m. UTC | #6

On Sat, May 19, 2018 at 1:25 PM Roman Penyaev <
roman.penyaev@profitbricks.com> wrote:

> Another one list_for_each_entry_rcu()-like macro I am aware of is used in
> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():

https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370

> Can we do something generic with -rr semantics to cover both cases?

That loop actually looks more like what Paul was asking for, and makes it
(perhaps) a bit more obvious that the whole loop has to be done under the
same RCU read sequence that looked up that first 'skip' entry.

(Again, stronger locking than RCU is obviously also acceptable for the
"look up skip entry").

But another reason I really dislike that list_next_or_null_rr_rcu() macro
in the patch under discussion is that it's *really* not the right way to
skip one entry. It may work, but it's really ugly. Again, the
list_for_each_entry_rcu_rr() in blk-mq-sched.c looks better in that regard,
in that the skipping seems at least a _bit_ more explicit about what it's
doing.

And again, if you make this specific to one particular list (and it really
likely is just one particular list that wants this), you can use a nice
legible helper inline function instead of the macro with the member name.

Don't get me wrong - I absolutely adore our generic list handling macros,
but I think they work because they are simple. Once we get to "take the
next entry, but skip it if it's the head entry, and then return NULL if you
get back to the entry you started with" kind of semantics, an inline
function that takes a particular list and has a big comment about *why* you
want those semantics for that particular case sounds _much_ better to me
than adding some complex "generic" macro for a very very unusual special
case.

                  Linus

Paul E. McKenney May 20, 2018, 12:43 a.m. UTC | #7

On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
> >> Function is going to be used in transport over RDMA module
> >> in subsequent patches.
> >>
> >> Function returns next element in round-robin fashion,
> >> i.e. head will be skipped.  NULL will be returned if list
> >> is observed as empty.
> >>
> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> Cc: linux-kernel@vger.kernel.org
> >> ---
> >>  include/linux/rculist.h | 19 +++++++++++++++++++
> >>  1 file changed, 19 insertions(+)
> >>
> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> >> index 127f534fec94..b0840d5ab25a 100644
> >> --- a/include/linux/rculist.h
> >> +++ b/include/linux/rculist.h
> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
> >>  })
> >>
> >>  /**
> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
> >> + * @head:    the head for the list.
> >> + * @ptr:        the list head to take the next element from.
> >> + * @type:       the type of the struct this is embedded in.
> >> + * @memb:       the name of the list_head within the struct.
> >> + *
> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
> >> + * but if list is observed as empty, NULL will be returned.
> >> + *
> >> + * This primitive may safely run concurrently with the _rcu list-mutation
> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
> >
> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
> > are round-robining a given list must all be under the same RCU read-side
> > critical section.  For example, the following will break badly:
> >
> >         struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
> >         {
> >                 struct foo *ret;
> >
> >                 rcu_read_lock();
> >                 ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
> >                 rcu_read_unlock();  /* BUG */
> >                 return ret;
> >         }
> >
> > You need a big fat comment stating this, at the very least.  The resulting
> > bug can be very hard to trigger and even harder to debug.
> >
> > And yes, I know that the same restriction applies to list_next_rcu()
> > and friends.  The difference is that if you try to invoke those in an
> > infinite loop, you will be rapped on the knuckles as soon as you hit
> > the list header.  Without that knuckle-rapping, RCU CPU stall warnings
> > might tempt people to do something broken like take_rr_step() above.
> 
> Hi Paul,
> 
> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> connection from the list in order to send a request, i.e. my code is
> something like the following:
> 
>         static struct conn *get_and_set_next_conn(void)
>         {
>                 struct conn *conn;
> 
>                 conn = rcu_dereferece(rcu_conn);
>                 if (unlikely(!conn))
>                     return conn;

Wait.  Don't you need to restart from the beginning of the list in
this case?  Or does the list never have anything added to it and is
rcu_conn initially the first element in the list?

>                 conn = list_next_or_null_rr_rcu(&conn_list,
>                                                 &conn->entry,
>                                                 typeof(*conn),
>                                                 entry);
>                 rcu_assign_pointer(rcu_conn, conn);

Linus is correct to doubt this code.  You assign a pointer to the current
element to rcu_conn, which is presumably a per-CPU or global variable.
So far, so good ...

>                 return conn;
>         }
> 
>         rcu_read_lock();
>         conn = get_and_set_next_conn();
>         if (unlikely(!conn)) {
>                 /* ... */
>         }
>         err = rdma_io(conn, request);
>         rcu_read_unlock();

... except that some other CPU might well remove the entry referenced by
rcu_conn at this point.  It would have to wait for a grace period (e.g.,
synchronize_rcu()), but the current CPU has exited its RCU read-side
critical section, and therefore is not blocking the grace period.
Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it
might well be referencing the freelist, or, even worse, some other type
of structure.

What is your code doing to prevent this from happening?  (There are ways,
but I want to know what you were doing in this case.)

> i.e. usage of the @next pointer is under an RCU critical section.
> 
> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like
> > macro that makes it more obvious that the whole thing need to be under
> > a single RCU read-side critical section?  Such a macro would of course be
> > an infinite loop if the list never went empty, so presumably there would
> > be a break or return statement in there somewhere.
> 
> The difference is that I do not need a loop, I take the @next conn pointer,
> save it for the following IO request and do IO for current IO request.
> 
> It seems list_for_each_entry_rcu()-like with immediate "break" in the body
> of the loop does not look nice, I personally do not like it, i.e.:
> 
> 
>         static struct conn *get_and_set_next_conn(void)
>         {
>                 struct conn *conn;
> 
>                 conn = rcu_dereferece(rcu_conn);
>                 if (unlikely(!conn))
>                     return conn;
>                 list_for_each_entry_rr_rcu(conn, &conn_list,
>                                            entry) {
>                         break;
>                 }
>                 rcu_assign_pointer(rcu_conn, conn);
>                 return conn;
>         }
> 
> 
> or maybe I did not fully get your idea?

That would not help at all because you are still leaking the pointer out
of the RCU read-side critical section.  That is completely and utterly
broken unless you are somehow cleaning up rcu_conn when you remove
the element.  And getting that cleanup right is -extremely- tricky.
Unless you have some sort of proof of correctness, you will get a NACK
from me.

More like this:

	list_for_each_entry_rr_rcu(conn, &conn_list, entry) {
		do_something_with(conn);
		if (done_for_now())
			break;
	}

> >> + */
> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
> >> +({ \
> >> +     list_next_or_null_rcu(head, ptr, type, memb) ?: \
> >> +             list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
> >
> > Are there any uses for this outside of RDMA?  If not, I am with Linus.
> > Define this within RDMA, where a smaller number of people can more
> > easily be kept aware of the restrictions on use.  If it turns out to be
> > more generally useful, we can take a look at exactly what makes sense
> > more globally.
> 
> The only one list_for_each_entry_rcu()-like macro I am aware of is used in
> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():
> 
> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
> 
> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
> my list_next_or_null_rr_rcu() variant?

Let's start with the basics:  It absolutely does not make sense to leak
pointers across rcu_read_unlock() unless you have arranged something else
to protect the pointed-to data in the meantime.  There are a number of ways
of implementing this protection.  Again, what protection are you using?

Your code at the above URL looks plausible to me at first glance: You
do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then
rcu_read_unlock().  But at second glance, it looks like htcx->queue
might have the same vulnerability as rcu_conn in your earlier code.

							Thanx, Paul

> > Even within RDMA, I strongly recommend the big fat comment called out above.
> > And the list_for_each_entry_rcu()-like formulation, if that can be made to
> > work within RDMA's code structure.
> >
> > Seem reasonable, or am I missing something here?
> 
> Thanks for clear explanation.
> 
> --
> Roman
>

Roman Pen May 21, 2018, 1:50 p.m. UTC | #8

On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
>> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
>> >> Function is going to be used in transport over RDMA module
>> >> in subsequent patches.
>> >>
>> >> Function returns next element in round-robin fashion,
>> >> i.e. head will be skipped.  NULL will be returned if list
>> >> is observed as empty.
>> >>
>> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
>> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> >> Cc: linux-kernel@vger.kernel.org
>> >> ---
>> >>  include/linux/rculist.h | 19 +++++++++++++++++++
>> >>  1 file changed, 19 insertions(+)
>> >>
>> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> >> index 127f534fec94..b0840d5ab25a 100644
>> >> --- a/include/linux/rculist.h
>> >> +++ b/include/linux/rculist.h
>> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>> >>  })
>> >>
>> >>  /**
>> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
>> >> + * @head:    the head for the list.
>> >> + * @ptr:        the list head to take the next element from.
>> >> + * @type:       the type of the struct this is embedded in.
>> >> + * @memb:       the name of the list_head within the struct.
>> >> + *
>> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
>> >> + * but if list is observed as empty, NULL will be returned.
>> >> + *
>> >> + * This primitive may safely run concurrently with the _rcu list-mutation
>> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
>> >
>> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
>> > are round-robining a given list must all be under the same RCU read-side
>> > critical section.  For example, the following will break badly:
>> >
>> >         struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
>> >         {
>> >                 struct foo *ret;
>> >
>> >                 rcu_read_lock();
>> >                 ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
>> >                 rcu_read_unlock();  /* BUG */
>> >                 return ret;
>> >         }
>> >
>> > You need a big fat comment stating this, at the very least.  The resulting
>> > bug can be very hard to trigger and even harder to debug.
>> >
>> > And yes, I know that the same restriction applies to list_next_rcu()
>> > and friends.  The difference is that if you try to invoke those in an
>> > infinite loop, you will be rapped on the knuckles as soon as you hit
>> > the list header.  Without that knuckle-rapping, RCU CPU stall warnings
>> > might tempt people to do something broken like take_rr_step() above.
>>
>> Hi Paul,
>>
>> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
>> connection from the list in order to send a request, i.e. my code is
>> something like the following:
>>
>>         static struct conn *get_and_set_next_conn(void)
>>         {
>>                 struct conn *conn;
>>
>>                 conn = rcu_dereferece(rcu_conn);
>>                 if (unlikely(!conn))
>>                     return conn;
>
> Wait.  Don't you need to restart from the beginning of the list in
> this case?  Or does the list never have anything added to it and is
> rcu_conn initially the first element in the list?

Hi Paul,

No, I continue from the pointer, which I assigned on the previous IO
in order to send IO fairly and keep load balanced.

Initially @rcu_conn points to the first element, but elements can
be deleted from the list and list can become empty.

The deletion code is below.

>
>>                 conn = list_next_or_null_rr_rcu(&conn_list,
>>                                                 &conn->entry,
>>                                                 typeof(*conn),
>>                                                 entry);
>>                 rcu_assign_pointer(rcu_conn, conn);
>
> Linus is correct to doubt this code.  You assign a pointer to the current
> element to rcu_conn, which is presumably a per-CPU or global variable.
> So far, so good ...

I use per-CPU, in the first example I did not show that not to overcomplicate
the code.

>
>>                 return conn;
>>         }
>>
>>         rcu_read_lock();
>>         conn = get_and_set_next_conn();
>>         if (unlikely(!conn)) {
>>                 /* ... */
>>         }
>>         err = rdma_io(conn, request);
>>         rcu_read_unlock();
>
> ... except that some other CPU might well remove the entry referenced by
> rcu_conn at this point.  It would have to wait for a grace period (e.g.,
> synchronize_rcu()), but the current CPU has exited its RCU read-side
> critical section, and therefore is not blocking the grace period.
> Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it
> might well be referencing the freelist, or, even worse, some other type
> of structure.
>
> What is your code doing to prevent this from happening?  (There are ways,
> but I want to know what you were doing in this case.)

Probably I should have shown the way of removal at the very beginning,
my fault.  So deletion looks as the following (a bit changed and
simplified for the sake of clearness):

        static void remove_connection(conn)
        {
                bool need_to_wait = false;
                int cpu;

                /* Do not let RCU list add/delete happen in parallel */
                mutex_lock(&conn_lock);

                list_del_rcu(&conn->entry);

                /* Make sure everybody observes element removal */
                synchronize_rcu();

                /*
                 * At this point nobody sees @conn in the list, but
still we have
                 * dangling pointer @rcu_conn which _can_ point to @conn.  Since
                 * nobody can observe @conn in the list, we guarantee
that IO path
                 * will not assign @conn to @rcu_conn, i.e. @rcu_conn
can be equal
                 * to @conn, but can never again become @conn.
                 */

                /*
                 * Get @next connection from current @conn which is going to be
                 * removed.
                 */
                next = list_next_or_null_rr_rcu(&conn_list, &conn->entry,
                                                typeof(*next), entry);

                /*
                 * Here @rcu_conn can be changed by reader side, so use @cmpxchg
                 * in order to keep fairness in load-balancing and do not touch
                 * the pointer which can be already changed by the IO path.
                 *
                 * Current path can be faster than IO path and the
following race
                 * exists:
                 *
                 *   CPU0                         CPU1
                 *   ----                         ----
                 *   conn = rcu_dereferece(rcu_conn);
                 *   next = list_next_or_null_rr_rcu(conn)
                 *
                 *                                conn ==
cmpxchg(rcu_conn, conn, next);
                 *                                synchronize_rcu();
                 *
                 *   rcu_assign_pointer(rcu_conn, next);
                 *   ^^^^^^^^^^^^^^^^^^
                 *
                 *   Here @rcu_conn is already equal to @next (done by
@cmpxchg),
                 *   so assignment to the same pointer is harmless.
                 *
                 */
                for_each_possible_cpu(cpu) {
                        struct conn **rcu_conn;

                        rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu);
                        if (*rcu_conn != conn)
                                /*
                                 * This @cpu will never again pick up @conn,
                                 * so it is safe just to choose next CPU.
                                 */
                                continue;

                        if (conn == cmpxchg(rcu_conn, conn, next))
                                /*
                                 * @rcu_conn was successfully replaced
with @next,
                                 * that means that someone can also hold a @conn
                                 * and dereferencing it, so wait for a
grace period
                                 * is required.
                                 */
                                need_to_wait = true;
                }
                if (need_to_wait)
                        synchronize_rcu();

                mutex_unlock(&conn_lock);

                kfree(conn);
        }


>
>> i.e. usage of the @next pointer is under an RCU critical section.
>>
>> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like
>> > macro that makes it more obvious that the whole thing need to be under
>> > a single RCU read-side critical section?  Such a macro would of course be
>> > an infinite loop if the list never went empty, so presumably there would
>> > be a break or return statement in there somewhere.
>>
>> The difference is that I do not need a loop, I take the @next conn pointer,
>> save it for the following IO request and do IO for current IO request.
>>
>> It seems list_for_each_entry_rcu()-like with immediate "break" in the body
>> of the loop does not look nice, I personally do not like it, i.e.:
>>
>>
>>         static struct conn *get_and_set_next_conn(void)
>>         {
>>                 struct conn *conn;
>>
>>                 conn = rcu_dereferece(rcu_conn);
>>                 if (unlikely(!conn))
>>                     return conn;
>>                 list_for_each_entry_rr_rcu(conn, &conn_list,
>>                                            entry) {
>>                         break;
>>                 }
>>                 rcu_assign_pointer(rcu_conn, conn);
>>                 return conn;
>>         }
>>
>>
>> or maybe I did not fully get your idea?
>
> That would not help at all because you are still leaking the pointer out
> of the RCU read-side critical section.  That is completely and utterly
> broken unless you are somehow cleaning up rcu_conn when you remove
> the element.  And getting that cleanup right is -extremely- tricky.
> Unless you have some sort of proof of correctness, you will get a NACK
> from me.

I understand all the consequences of the leaking pointer, and of course
wrapped loop with RCU lock/unlock is simpler, but in order to keep
load-balancing and IO fairness avoiding any locks on IO path I've come
up with these RCU tricks and list_next_or_null_rr_rcu() macro.

> More like this:
>
>         list_for_each_entry_rr_rcu(conn, &conn_list, entry) {
>                 do_something_with(conn);
>                 if (done_for_now())
>                         break;
>         }
>
>> >> + */
>> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
>> >> +({ \
>> >> +     list_next_or_null_rcu(head, ptr, type, memb) ?: \
>> >> +             list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
>> >
>> > Are there any uses for this outside of RDMA?  If not, I am with Linus.
>> > Define this within RDMA, where a smaller number of people can more
>> > easily be kept aware of the restrictions on use.  If it turns out to be
>> > more generally useful, we can take a look at exactly what makes sense
>> > more globally.
>>
>> The only one list_for_each_entry_rcu()-like macro I am aware of is used in
>> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():
>>
>> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
>>
>> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
>> my list_next_or_null_rr_rcu() variant?
>
> Let's start with the basics:  It absolutely does not make sense to leak
> pointers across rcu_read_unlock() unless you have arranged something else
> to protect the pointed-to data in the meantime.  There are a number of ways
> of implementing this protection.  Again, what protection are you using?
>
> Your code at the above URL looks plausible to me at first glance: You
> do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then
> rcu_read_unlock().  But at second glance, it looks like htcx->queue
> might have the same vulnerability as rcu_conn in your earlier code.

I am not the author of the code at the URL I specified. I provided the
link answering the question to show other possible users of round-robin
semantics for RCU list traversal.  In my 'list_next_or_null_rr_rcu()'
case I can't use a loop, I leak the pointer and indeed have to be very
careful.  But perhaps we can come up with some generic solution to cover
both cases: -rr loop and -rr next.

--
Roman

Linus Torvalds May 21, 2018, 3:16 p.m. UTC | #9

On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <
roman.penyaev@profitbricks.com> wrote:

> No, I continue from the pointer, which I assigned on the previous IO
> in order to send IO fairly and keep load balanced.

Right. And that's exactly what has both me and Paul nervous. You're no
longer in the RCU domain. You're using a pointer where the lifetime has
nothing to do with RCU any more.

Can it be done? Sure. But you need *other* locking for it (that you haven't
explained), and it's fragile as hell.

It's probably best to not use RCU for it at all, but depend on that "other
locking" that you have to have anyway, to keep the pointer valid over the
non-RCU region.

                Linus

Paul E. McKenney May 21, 2018, 3:31 p.m. UTC | #10

On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote:
> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
> >> >> Function is going to be used in transport over RDMA module
> >> >> in subsequent patches.
> >> >>
> >> >> Function returns next element in round-robin fashion,
> >> >> i.e. head will be skipped.  NULL will be returned if list
> >> >> is observed as empty.
> >> >>
> >> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> >> Cc: linux-kernel@vger.kernel.org
> >> >> ---
> >> >>  include/linux/rculist.h | 19 +++++++++++++++++++
> >> >>  1 file changed, 19 insertions(+)
> >> >>
> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> >> >> index 127f534fec94..b0840d5ab25a 100644
> >> >> --- a/include/linux/rculist.h
> >> >> +++ b/include/linux/rculist.h
> >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
> >> >>  })
> >> >>
> >> >>  /**
> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
> >> >> + * @head:    the head for the list.
> >> >> + * @ptr:        the list head to take the next element from.
> >> >> + * @type:       the type of the struct this is embedded in.
> >> >> + * @memb:       the name of the list_head within the struct.
> >> >> + *
> >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
> >> >> + * but if list is observed as empty, NULL will be returned.
> >> >> + *
> >> >> + * This primitive may safely run concurrently with the _rcu list-mutation
> >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
> >> >
> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
> >> > are round-robining a given list must all be under the same RCU read-side
> >> > critical section.  For example, the following will break badly:
> >> >
> >> >         struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
> >> >         {
> >> >                 struct foo *ret;
> >> >
> >> >                 rcu_read_lock();
> >> >                 ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
> >> >                 rcu_read_unlock();  /* BUG */
> >> >                 return ret;
> >> >         }
> >> >
> >> > You need a big fat comment stating this, at the very least.  The resulting
> >> > bug can be very hard to trigger and even harder to debug.
> >> >
> >> > And yes, I know that the same restriction applies to list_next_rcu()
> >> > and friends.  The difference is that if you try to invoke those in an
> >> > infinite loop, you will be rapped on the knuckles as soon as you hit
> >> > the list header.  Without that knuckle-rapping, RCU CPU stall warnings
> >> > might tempt people to do something broken like take_rr_step() above.
> >>
> >> Hi Paul,
> >>
> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> >> connection from the list in order to send a request, i.e. my code is
> >> something like the following:
> >>
> >>         static struct conn *get_and_set_next_conn(void)
> >>         {
> >>                 struct conn *conn;
> >>
> >>                 conn = rcu_dereferece(rcu_conn);
> >>                 if (unlikely(!conn))
> >>                     return conn;
> >
> > Wait.  Don't you need to restart from the beginning of the list in
> > this case?  Or does the list never have anything added to it and is
> > rcu_conn initially the first element in the list?
> 
> Hi Paul,
> 
> No, I continue from the pointer, which I assigned on the previous IO
> in order to send IO fairly and keep load balanced.
> 
> Initially @rcu_conn points to the first element, but elements can
> be deleted from the list and list can become empty.
> 
> The deletion code is below.
> 
> >
> >>                 conn = list_next_or_null_rr_rcu(&conn_list,
> >>                                                 &conn->entry,
> >>                                                 typeof(*conn),
> >>                                                 entry);
> >>                 rcu_assign_pointer(rcu_conn, conn);
> >
> > Linus is correct to doubt this code.  You assign a pointer to the current
> > element to rcu_conn, which is presumably a per-CPU or global variable.
> > So far, so good ...
> 
> I use per-CPU, in the first example I did not show that not to overcomplicate
> the code.
> 
> >
> >>                 return conn;
> >>         }
> >>
> >>         rcu_read_lock();
> >>         conn = get_and_set_next_conn();
> >>         if (unlikely(!conn)) {
> >>                 /* ... */
> >>         }
> >>         err = rdma_io(conn, request);
> >>         rcu_read_unlock();
> >
> > ... except that some other CPU might well remove the entry referenced by
> > rcu_conn at this point.  It would have to wait for a grace period (e.g.,
> > synchronize_rcu()), but the current CPU has exited its RCU read-side
> > critical section, and therefore is not blocking the grace period.
> > Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it
> > might well be referencing the freelist, or, even worse, some other type
> > of structure.
> >
> > What is your code doing to prevent this from happening?  (There are ways,
> > but I want to know what you were doing in this case.)
> 
> Probably I should have shown the way of removal at the very beginning,
> my fault.  So deletion looks as the following (a bit changed and
> simplified for the sake of clearness):

Thank you!  Let's see...

>         static void remove_connection(conn)
>         {
>                 bool need_to_wait = false;
>                 int cpu;
> 
>                 /* Do not let RCU list add/delete happen in parallel */
>                 mutex_lock(&conn_lock);
> 
>                 list_del_rcu(&conn->entry);
> 
>                 /* Make sure everybody observes element removal */
>                 synchronize_rcu();

At this point, any reader who saw the element in the list is done, as you
comment in fact says.  But there might be a pointer to that element in the
per-CPU variables, however, from this point forward it cannot be the case
that one of the per-CPU variables gets set to the newly deleted element.
Which is your next block of code...

>                 /*
>                  * At this point nobody sees @conn in the list, but
> still we have
>                  * dangling pointer @rcu_conn which _can_ point to @conn.  Since
>                  * nobody can observe @conn in the list, we guarantee
> that IO path
>                  * will not assign @conn to @rcu_conn, i.e. @rcu_conn
> can be equal
>                  * to @conn, but can never again become @conn.
>                  */
> 
>                 /*
>                  * Get @next connection from current @conn which is going to be
>                  * removed.
>                  */
>                 next = list_next_or_null_rr_rcu(&conn_list, &conn->entry,
>                                                 typeof(*next), entry);
> 
>                 /*
>                  * Here @rcu_conn can be changed by reader side, so use @cmpxchg
>                  * in order to keep fairness in load-balancing and do not touch
>                  * the pointer which can be already changed by the IO path.
>                  *
>                  * Current path can be faster than IO path and the
> following race
>                  * exists:
>                  *
>                  *   CPU0                         CPU1
>                  *   ----                         ----
>                  *   conn = rcu_dereferece(rcu_conn);
>                  *   next = list_next_or_null_rr_rcu(conn)
>                  *
>                  *                                conn ==
> cmpxchg(rcu_conn, conn, next);
>                  *                                synchronize_rcu();
>                  *
>                  *   rcu_assign_pointer(rcu_conn, next);
>                  *   ^^^^^^^^^^^^^^^^^^
>                  *
>                  *   Here @rcu_conn is already equal to @next (done by
> @cmpxchg),
>                  *   so assignment to the same pointer is harmless.
>                  *
>                  */
>                 for_each_possible_cpu(cpu) {
>                         struct conn **rcu_conn;
> 
>                         rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu);
>                         if (*rcu_conn != conn)
>                                 /*
>                                  * This @cpu will never again pick up @conn,
>                                  * so it is safe just to choose next CPU.
>                                  */
>                                 continue;

... Someone else might have picked up rcu_conn at this point...

>                         if (conn == cmpxchg(rcu_conn, conn, next))
>                                 /*
>                                  * @rcu_conn was successfully replaced
> with @next,
>                                  * that means that someone can also hold a @conn
>                                  * and dereferencing it, so wait for a
> grace period
>                                  * is required.
>                                  */
>                                 need_to_wait = true;

... But if there was any possibility of that, need_to_wait is true, and it
still cannot be the case that a reader finds the newly deleted element
in the list, so they cannot find that element, so the pcpu_rcu_conn
variables cannot be set to it.

>                 }
>                 if (need_to_wait)
>                         synchronize_rcu();

And at this point, the reader that might have picked up rcu_conn
just before the cmpxchg must have completed.  (Good show, by the way!
Many people miss the fact that they need this second synchronize_rcu().)

Hmmm...  What happens if this was the last element in the list, and
the relevant pcpu_rcu_conn variable references that newly removed
element?  Taking a look at list_next_or_null_rcu() and thus at
list_next_or_null_rcu(), and it does appear that you get NULL in that
case, as is right and good.

>                 mutex_unlock(&conn_lock);
> 
>                 kfree(conn);
>         }
> 
> 
> >
> >> i.e. usage of the @next pointer is under an RCU critical section.
> >>
> >> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like
> >> > macro that makes it more obvious that the whole thing need to be under
> >> > a single RCU read-side critical section?  Such a macro would of course be
> >> > an infinite loop if the list never went empty, so presumably there would
> >> > be a break or return statement in there somewhere.
> >>
> >> The difference is that I do not need a loop, I take the @next conn pointer,
> >> save it for the following IO request and do IO for current IO request.
> >>
> >> It seems list_for_each_entry_rcu()-like with immediate "break" in the body
> >> of the loop does not look nice, I personally do not like it, i.e.:
> >>
> >>
> >>         static struct conn *get_and_set_next_conn(void)
> >>         {
> >>                 struct conn *conn;
> >>
> >>                 conn = rcu_dereferece(rcu_conn);
> >>                 if (unlikely(!conn))
> >>                     return conn;
> >>                 list_for_each_entry_rr_rcu(conn, &conn_list,
> >>                                            entry) {
> >>                         break;
> >>                 }
> >>                 rcu_assign_pointer(rcu_conn, conn);
> >>                 return conn;
> >>         }
> >>
> >>
> >> or maybe I did not fully get your idea?
> >
> > That would not help at all because you are still leaking the pointer out
> > of the RCU read-side critical section.  That is completely and utterly
> > broken unless you are somehow cleaning up rcu_conn when you remove
> > the element.  And getting that cleanup right is -extremely- tricky.
> > Unless you have some sort of proof of correctness, you will get a NACK
> > from me.
> 
> I understand all the consequences of the leaking pointer, and of course
> wrapped loop with RCU lock/unlock is simpler, but in order to keep
> load-balancing and IO fairness avoiding any locks on IO path I've come
> up with these RCU tricks and list_next_or_null_rr_rcu() macro.

At first glance, it appears that you have handled this correctly.  But I
can make mistakes just as easily as the next guy, so what have you done
to validate your algorithm?

> > More like this:
> >
> >         list_for_each_entry_rr_rcu(conn, &conn_list, entry) {
> >                 do_something_with(conn);
> >                 if (done_for_now())
> >                         break;
> >         }
> >
> >> >> + */
> >> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
> >> >> +({ \
> >> >> +     list_next_or_null_rcu(head, ptr, type, memb) ?: \
> >> >> +             list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
> >> >
> >> > Are there any uses for this outside of RDMA?  If not, I am with Linus.
> >> > Define this within RDMA, where a smaller number of people can more
> >> > easily be kept aware of the restrictions on use.  If it turns out to be
> >> > more generally useful, we can take a look at exactly what makes sense
> >> > more globally.
> >>
> >> The only one list_for_each_entry_rcu()-like macro I am aware of is used in
> >> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():
> >>
> >> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
> >>
> >> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
> >> my list_next_or_null_rr_rcu() variant?
> >
> > Let's start with the basics:  It absolutely does not make sense to leak
> > pointers across rcu_read_unlock() unless you have arranged something else
> > to protect the pointed-to data in the meantime.  There are a number of ways
> > of implementing this protection.  Again, what protection are you using?
> >
> > Your code at the above URL looks plausible to me at first glance: You
> > do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then
> > rcu_read_unlock().  But at second glance, it looks like htcx->queue
> > might have the same vulnerability as rcu_conn in your earlier code.
> 
> I am not the author of the code at the URL I specified. I provided the
> link answering the question to show other possible users of round-robin
> semantics for RCU list traversal.  In my 'list_next_or_null_rr_rcu()'
> case I can't use a loop, I leak the pointer and indeed have to be very
> careful.  But perhaps we can come up with some generic solution to cover
> both cases: -rr loop and -rr next.

Ah.  Could you please check their update-side code to make sure that it
looks correct to you?

							Thanx, Paul

Paul E. McKenney May 21, 2018, 3:33 p.m. UTC | #11

On Mon, May 21, 2018 at 08:16:59AM -0700, Linus Torvalds wrote:
> On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <
> roman.penyaev@profitbricks.com> wrote:
> 
> > No, I continue from the pointer, which I assigned on the previous IO
> > in order to send IO fairly and keep load balanced.
> 
> Right. And that's exactly what has both me and Paul nervous. You're no
> longer in the RCU domain. You're using a pointer where the lifetime has
> nothing to do with RCU any more.
> 
> Can it be done? Sure. But you need *other* locking for it (that you haven't
> explained), and it's fragile as hell.

He looks to actually have it right, but I would want to see a big comment
on the read side noting the leak of the pointer and documenting why it
is OK.

							Thanx, Paul

> It's probably best to not use RCU for it at all, but depend on that "other
> locking" that you have to have anyway, to keep the pointer valid over the
> non-RCU region.
> 
>                 Linus
>

Roman Pen May 22, 2018, 9:09 a.m. UTC | #12

On Mon, May 21, 2018 at 5:33 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, May 21, 2018 at 08:16:59AM -0700, Linus Torvalds wrote:
>> On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <
>> roman.penyaev@profitbricks.com> wrote:
>>
>> > No, I continue from the pointer, which I assigned on the previous IO
>> > in order to send IO fairly and keep load balanced.
>>
>> Right. And that's exactly what has both me and Paul nervous. You're no
>> longer in the RCU domain. You're using a pointer where the lifetime has
>> nothing to do with RCU any more.
>>
>> Can it be done? Sure. But you need *other* locking for it (that you haven't
>> explained), and it's fragile as hell.
>
> He looks to actually have it right, but I would want to see a big comment
> on the read side noting the leak of the pointer and documenting why it
> is OK.

Hi Paul and Linus,

Should I resend current patch with more clear comments about how careful
caller should be with a leaking pointer?  Also I will update read side
with a fat comment about "rcu_assign_pointer()" which leaks the pointer
out of RCU domain and what is done to prevent nasty consequences.
Does that sound acceptable?

--
Roman

Roman Pen May 22, 2018, 9:09 a.m. UTC | #13

On Mon, May 21, 2018 at 5:31 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote:
>> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
>> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
>> >> <paulmck@linux.vnet.ibm.com> wrote:
>> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
>> >> >> Function is going to be used in transport over RDMA module
>> >> >> in subsequent patches.
>> >> >>
>> >> >> Function returns next element in round-robin fashion,
>> >> >> i.e. head will be skipped.  NULL will be returned if list
>> >> >> is observed as empty.
>> >> >>
>> >> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
>> >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> >> >> Cc: linux-kernel@vger.kernel.org
>> >> >> ---
>> >> >>  include/linux/rculist.h | 19 +++++++++++++++++++
>> >> >>  1 file changed, 19 insertions(+)
>> >> >>
>> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> >> >> index 127f534fec94..b0840d5ab25a 100644
>> >> >> --- a/include/linux/rculist.h
>> >> >> +++ b/include/linux/rculist.h
>> >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>> >> >>  })
>> >> >>
>> >> >>  /**
>> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
>> >> >> + * @head:    the head for the list.
>> >> >> + * @ptr:        the list head to take the next element from.
>> >> >> + * @type:       the type of the struct this is embedded in.
>> >> >> + * @memb:       the name of the list_head within the struct.
>> >> >> + *
>> >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
>> >> >> + * but if list is observed as empty, NULL will be returned.
>> >> >> + *
>> >> >> + * This primitive may safely run concurrently with the _rcu list-mutation
>> >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
>> >> >
>> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
>> >> > are round-robining a given list must all be under the same RCU read-side
>> >> > critical section.  For example, the following will break badly:
>> >> >
>> >> >         struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
>> >> >         {
>> >> >                 struct foo *ret;
>> >> >
>> >> >                 rcu_read_lock();
>> >> >                 ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
>> >> >                 rcu_read_unlock();  /* BUG */
>> >> >                 return ret;
>> >> >         }
>> >> >
>> >> > You need a big fat comment stating this, at the very least.  The resulting
>> >> > bug can be very hard to trigger and even harder to debug.
>> >> >
>> >> > And yes, I know that the same restriction applies to list_next_rcu()
>> >> > and friends.  The difference is that if you try to invoke those in an
>> >> > infinite loop, you will be rapped on the knuckles as soon as you hit
>> >> > the list header.  Without that knuckle-rapping, RCU CPU stall warnings
>> >> > might tempt people to do something broken like take_rr_step() above.
>> >>
>> >> Hi Paul,
>> >>
>> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
>> >> connection from the list in order to send a request, i.e. my code is
>> >> something like the following:
>> >>
>> >>         static struct conn *get_and_set_next_conn(void)
>> >>         {
>> >>                 struct conn *conn;
>> >>
>> >>                 conn = rcu_dereferece(rcu_conn);
>> >>                 if (unlikely(!conn))
>> >>                     return conn;
>> >
>> > Wait.  Don't you need to restart from the beginning of the list in
>> > this case?  Or does the list never have anything added to it and is
>> > rcu_conn initially the first element in the list?
>>
>> Hi Paul,
>>
>> No, I continue from the pointer, which I assigned on the previous IO
>> in order to send IO fairly and keep load balanced.
>>
>> Initially @rcu_conn points to the first element, but elements can
>> be deleted from the list and list can become empty.
>>
>> The deletion code is below.
>>
>> >
>> >>                 conn = list_next_or_null_rr_rcu(&conn_list,
>> >>                                                 &conn->entry,
>> >>                                                 typeof(*conn),
>> >>                                                 entry);
>> >>                 rcu_assign_pointer(rcu_conn, conn);
>> >
>> > Linus is correct to doubt this code.  You assign a pointer to the current
>> > element to rcu_conn, which is presumably a per-CPU or global variable.
>> > So far, so good ...
>>
>> I use per-CPU, in the first example I did not show that not to overcomplicate
>> the code.
>>
>> >
>> >>                 return conn;
>> >>         }
>> >>
>> >>         rcu_read_lock();
>> >>         conn = get_and_set_next_conn();
>> >>         if (unlikely(!conn)) {
>> >>                 /* ... */
>> >>         }
>> >>         err = rdma_io(conn, request);
>> >>         rcu_read_unlock();
>> >
>> > ... except that some other CPU might well remove the entry referenced by
>> > rcu_conn at this point.  It would have to wait for a grace period (e.g.,
>> > synchronize_rcu()), but the current CPU has exited its RCU read-side
>> > critical section, and therefore is not blocking the grace period.
>> > Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it
>> > might well be referencing the freelist, or, even worse, some other type
>> > of structure.
>> >
>> > What is your code doing to prevent this from happening?  (There are ways,
>> > but I want to know what you were doing in this case.)
>>
>> Probably I should have shown the way of removal at the very beginning,
>> my fault.  So deletion looks as the following (a bit changed and
>> simplified for the sake of clearness):
>
> Thank you!  Let's see...
>
>>         static void remove_connection(conn)
>>         {
>>                 bool need_to_wait = false;
>>                 int cpu;
>>
>>                 /* Do not let RCU list add/delete happen in parallel */
>>                 mutex_lock(&conn_lock);
>>
>>                 list_del_rcu(&conn->entry);
>>
>>                 /* Make sure everybody observes element removal */
>>                 synchronize_rcu();
>
> At this point, any reader who saw the element in the list is done, as you
> comment in fact says.  But there might be a pointer to that element in the
> per-CPU variables, however, from this point forward it cannot be the case
> that one of the per-CPU variables gets set to the newly deleted element.
> Which is your next block of code...
>
>>                 /*
>>                  * At this point nobody sees @conn in the list, but
>> still we have
>>                  * dangling pointer @rcu_conn which _can_ point to @conn.  Since
>>                  * nobody can observe @conn in the list, we guarantee
>> that IO path
>>                  * will not assign @conn to @rcu_conn, i.e. @rcu_conn
>> can be equal
>>                  * to @conn, but can never again become @conn.
>>                  */
>>
>>                 /*
>>                  * Get @next connection from current @conn which is going to be
>>                  * removed.
>>                  */
>>                 next = list_next_or_null_rr_rcu(&conn_list, &conn->entry,
>>                                                 typeof(*next), entry);
>>
>>                 /*
>>                  * Here @rcu_conn can be changed by reader side, so use @cmpxchg
>>                  * in order to keep fairness in load-balancing and do not touch
>>                  * the pointer which can be already changed by the IO path.
>>                  *
>>                  * Current path can be faster than IO path and the
>> following race
>>                  * exists:
>>                  *
>>                  *   CPU0                         CPU1
>>                  *   ----                         ----
>>                  *   conn = rcu_dereferece(rcu_conn);
>>                  *   next = list_next_or_null_rr_rcu(conn)
>>                  *
>>                  *                                conn ==
>> cmpxchg(rcu_conn, conn, next);
>>                  *                                synchronize_rcu();
>>                  *
>>                  *   rcu_assign_pointer(rcu_conn, next);
>>                  *   ^^^^^^^^^^^^^^^^^^
>>                  *
>>                  *   Here @rcu_conn is already equal to @next (done by
>> @cmpxchg),
>>                  *   so assignment to the same pointer is harmless.
>>                  *
>>                  */
>>                 for_each_possible_cpu(cpu) {
>>                         struct conn **rcu_conn;
>>
>>                         rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu);
>>                         if (*rcu_conn != conn)
>>                                 /*
>>                                  * This @cpu will never again pick up @conn,
>>                                  * so it is safe just to choose next CPU.
>>                                  */
>>                                 continue;
>
> ... Someone else might have picked up rcu_conn at this point...
>
>>                         if (conn == cmpxchg(rcu_conn, conn, next))
>>                                 /*
>>                                  * @rcu_conn was successfully replaced
>> with @next,
>>                                  * that means that someone can also hold a @conn
>>                                  * and dereferencing it, so wait for a
>> grace period
>>                                  * is required.
>>                                  */
>>                                 need_to_wait = true;
>
> ... But if there was any possibility of that, need_to_wait is true, and it
> still cannot be the case that a reader finds the newly deleted element
> in the list, so they cannot find that element, so the pcpu_rcu_conn
> variables cannot be set to it.
>
>>                 }
>>                 if (need_to_wait)
>>                         synchronize_rcu();
>
> And at this point, the reader that might have picked up rcu_conn
> just before the cmpxchg must have completed.  (Good show, by the way!
> Many people miss the fact that they need this second synchronize_rcu().)
>
> Hmmm...  What happens if this was the last element in the list, and
> the relevant pcpu_rcu_conn variable references that newly removed
> element?  Taking a look at list_next_or_null_rcu() and thus at
> list_next_or_null_rcu(), and it does appear that you get NULL in that
> case, as is right and good.

Thanks for explicit comments.  What I always lack is a good description.
Indeed it is worth to mention that @next can become NULL if that was the
last element, will add comments then.

>
>>                 mutex_unlock(&conn_lock);
>>
>>                 kfree(conn);
>>         }
>>
>>
>> >
>> >> i.e. usage of the @next pointer is under an RCU critical section.
>> >>
>> >> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like
>> >> > macro that makes it more obvious that the whole thing need to be under
>> >> > a single RCU read-side critical section?  Such a macro would of course be
>> >> > an infinite loop if the list never went empty, so presumably there would
>> >> > be a break or return statement in there somewhere.
>> >>
>> >> The difference is that I do not need a loop, I take the @next conn pointer,
>> >> save it for the following IO request and do IO for current IO request.
>> >>
>> >> It seems list_for_each_entry_rcu()-like with immediate "break" in the body
>> >> of the loop does not look nice, I personally do not like it, i.e.:
>> >>
>> >>
>> >>         static struct conn *get_and_set_next_conn(void)
>> >>         {
>> >>                 struct conn *conn;
>> >>
>> >>                 conn = rcu_dereferece(rcu_conn);
>> >>                 if (unlikely(!conn))
>> >>                     return conn;
>> >>                 list_for_each_entry_rr_rcu(conn, &conn_list,
>> >>                                            entry) {
>> >>                         break;
>> >>                 }
>> >>                 rcu_assign_pointer(rcu_conn, conn);
>> >>                 return conn;
>> >>         }
>> >>
>> >>
>> >> or maybe I did not fully get your idea?
>> >
>> > That would not help at all because you are still leaking the pointer out
>> > of the RCU read-side critical section.  That is completely and utterly
>> > broken unless you are somehow cleaning up rcu_conn when you remove
>> > the element.  And getting that cleanup right is -extremely- tricky.
>> > Unless you have some sort of proof of correctness, you will get a NACK
>> > from me.
>>
>> I understand all the consequences of the leaking pointer, and of course
>> wrapped loop with RCU lock/unlock is simpler, but in order to keep
>> load-balancing and IO fairness avoiding any locks on IO path I've come
>> up with these RCU tricks and list_next_or_null_rr_rcu() macro.
>
> At first glance, it appears that you have handled this correctly.  But I
> can make mistakes just as easily as the next guy, so what have you done
> to validate your algorithm?

What we only have is a set of unit-tests which run every night.  For this
particular case there is a special stress test which adds/removes RDMA
connections in a loop while IO is performing.  Unfortunately I did not
write any synthetic test-case just for testing this isolated algorithm.
(e.g. module with only these RCU functions can be created and list
modification can be easily simulated. Should not be very much difficult)
Do you think it is worth to do?  Unfortunately it also does not prove
correctness, like Linus said it is fragile as hell, but for sure I can
burn CPUs testing it for couple of days.

>> > More like this:
>> >
>> >         list_for_each_entry_rr_rcu(conn, &conn_list, entry) {
>> >                 do_something_with(conn);
>> >                 if (done_for_now())
>> >                         break;
>> >         }
>> >
>> >> >> + */
>> >> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
>> >> >> +({ \
>> >> >> +     list_next_or_null_rcu(head, ptr, type, memb) ?: \
>> >> >> +             list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
>> >> >
>> >> > Are there any uses for this outside of RDMA?  If not, I am with Linus.
>> >> > Define this within RDMA, where a smaller number of people can more
>> >> > easily be kept aware of the restrictions on use.  If it turns out to be
>> >> > more generally useful, we can take a look at exactly what makes sense
>> >> > more globally.
>> >>
>> >> The only one list_for_each_entry_rcu()-like macro I am aware of is used in
>> >> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():
>> >>
>> >> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
>> >>
>> >> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
>> >> my list_next_or_null_rr_rcu() variant?
>> >
>> > Let's start with the basics:  It absolutely does not make sense to leak
>> > pointers across rcu_read_unlock() unless you have arranged something else
>> > to protect the pointed-to data in the meantime.  There are a number of ways
>> > of implementing this protection.  Again, what protection are you using?
>> >
>> > Your code at the above URL looks plausible to me at first glance: You
>> > do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then
>> > rcu_read_unlock().  But at second glance, it looks like htcx->queue
>> > might have the same vulnerability as rcu_conn in your earlier code.
>>
>> I am not the author of the code at the URL I specified. I provided the
>> link answering the question to show other possible users of round-robin
>> semantics for RCU list traversal.  In my 'list_next_or_null_rr_rcu()'
>> case I can't use a loop, I leak the pointer and indeed have to be very
>> careful.  But perhaps we can come up with some generic solution to cover
>> both cases: -rr loop and -rr next.
>
> Ah.  Could you please check their update-side code to make sure that it
> looks correct to you?

BTW authors of this particular -rr loop are in CC, but they keep silence
so far :)  According to my shallow understanding @queue can't disappear
from the list on this calling path, i.e. existence of a @queue should be
guaranteed by the fact that the queue has no IO in-flights and only then
it can be removed.

--
Roman

Paul E. McKenney May 22, 2018, 4:36 p.m. UTC | #14

On Tue, May 22, 2018 at 11:09:08AM +0200, Roman Penyaev wrote:
> On Mon, May 21, 2018 at 5:33 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Mon, May 21, 2018 at 08:16:59AM -0700, Linus Torvalds wrote:
> >> On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <
> >> roman.penyaev@profitbricks.com> wrote:
> >>
> >> > No, I continue from the pointer, which I assigned on the previous IO
> >> > in order to send IO fairly and keep load balanced.
> >>
> >> Right. And that's exactly what has both me and Paul nervous. You're no
> >> longer in the RCU domain. You're using a pointer where the lifetime has
> >> nothing to do with RCU any more.
> >>
> >> Can it be done? Sure. But you need *other* locking for it (that you haven't
> >> explained), and it's fragile as hell.
> >
> > He looks to actually have it right, but I would want to see a big comment
> > on the read side noting the leak of the pointer and documenting why it
> > is OK.
> 
> Hi Paul and Linus,
> 
> Should I resend current patch with more clear comments about how careful
> caller should be with a leaking pointer?  Also I will update read side
> with a fat comment about "rcu_assign_pointer()" which leaks the pointer
> out of RCU domain and what is done to prevent nasty consequences.
> Does that sound acceptable?

That sounds good to me.

							Thanx, Paul

Linus Torvalds May 22, 2018, 4:38 p.m. UTC | #15

On Tue, May 22, 2018 at 2:09 AM Roman Penyaev <
roman.penyaev@profitbricks.com> wrote:

> Should I resend current patch with more clear comments about how careful
> caller should be with a leaking pointer?

No. Even if we go your way, there is *one* single user, and that one is
special and needs to take a lot more care.

Just roll your own version, and make it an inline function like I've asked
now now many times, and add a shit-ton of explanations of why it's safe to
use in that *one* situation.

I don't want any crazy and unsafe stuff in the generic header file that
absolutely *nobody* should ever use.

                  Linus

Paul E. McKenney May 22, 2018, 5:03 p.m. UTC | #16

On Tue, May 22, 2018 at 11:09:16AM +0200, Roman Penyaev wrote:
> On Mon, May 21, 2018 at 5:31 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote:
> >> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney
> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
> >> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
> >> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
> >> >> >> Function is going to be used in transport over RDMA module
> >> >> >> in subsequent patches.
> >> >> >>
> >> >> >> Function returns next element in round-robin fashion,
> >> >> >> i.e. head will be skipped.  NULL will be returned if list
> >> >> >> is observed as empty.
> >> >> >>
> >> >> >> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
> >> >> >> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> >> >> Cc: linux-kernel@vger.kernel.org
> >> >> >> ---
> >> >> >>  include/linux/rculist.h | 19 +++++++++++++++++++
> >> >> >>  1 file changed, 19 insertions(+)
> >> >> >>
> >> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> >> >> >> index 127f534fec94..b0840d5ab25a 100644
> >> >> >> --- a/include/linux/rculist.h
> >> >> >> +++ b/include/linux/rculist.h
> >> >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
> >> >> >>  })
> >> >> >>
> >> >> >>  /**
> >> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
> >> >> >> + * @head:    the head for the list.
> >> >> >> + * @ptr:        the list head to take the next element from.
> >> >> >> + * @type:       the type of the struct this is embedded in.
> >> >> >> + * @memb:       the name of the list_head within the struct.
> >> >> >> + *
> >> >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
> >> >> >> + * but if list is observed as empty, NULL will be returned.
> >> >> >> + *
> >> >> >> + * This primitive may safely run concurrently with the _rcu list-mutation
> >> >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
> >> >> >
> >> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
> >> >> > are round-robining a given list must all be under the same RCU read-side
> >> >> > critical section.  For example, the following will break badly:
> >> >> >
> >> >> >         struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
> >> >> >         {
> >> >> >                 struct foo *ret;
> >> >> >
> >> >> >                 rcu_read_lock();
> >> >> >                 ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
> >> >> >                 rcu_read_unlock();  /* BUG */
> >> >> >                 return ret;
> >> >> >         }
> >> >> >
> >> >> > You need a big fat comment stating this, at the very least.  The resulting
> >> >> > bug can be very hard to trigger and even harder to debug.
> >> >> >
> >> >> > And yes, I know that the same restriction applies to list_next_rcu()
> >> >> > and friends.  The difference is that if you try to invoke those in an
> >> >> > infinite loop, you will be rapped on the knuckles as soon as you hit
> >> >> > the list header.  Without that knuckle-rapping, RCU CPU stall warnings
> >> >> > might tempt people to do something broken like take_rr_step() above.
> >> >>
> >> >> Hi Paul,
> >> >>
> >> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> >> >> connection from the list in order to send a request, i.e. my code is
> >> >> something like the following:
> >> >>
> >> >>         static struct conn *get_and_set_next_conn(void)
> >> >>         {
> >> >>                 struct conn *conn;
> >> >>
> >> >>                 conn = rcu_dereferece(rcu_conn);
> >> >>                 if (unlikely(!conn))
> >> >>                     return conn;
> >> >
> >> > Wait.  Don't you need to restart from the beginning of the list in
> >> > this case?  Or does the list never have anything added to it and is
> >> > rcu_conn initially the first element in the list?
> >>
> >> Hi Paul,
> >>
> >> No, I continue from the pointer, which I assigned on the previous IO
> >> in order to send IO fairly and keep load balanced.
> >>
> >> Initially @rcu_conn points to the first element, but elements can
> >> be deleted from the list and list can become empty.
> >>
> >> The deletion code is below.
> >>
> >> >
> >> >>                 conn = list_next_or_null_rr_rcu(&conn_list,
> >> >>                                                 &conn->entry,
> >> >>                                                 typeof(*conn),
> >> >>                                                 entry);
> >> >>                 rcu_assign_pointer(rcu_conn, conn);
> >> >
> >> > Linus is correct to doubt this code.  You assign a pointer to the current
> >> > element to rcu_conn, which is presumably a per-CPU or global variable.
> >> > So far, so good ...
> >>
> >> I use per-CPU, in the first example I did not show that not to overcomplicate
> >> the code.
> >>
> >> >
> >> >>                 return conn;
> >> >>         }
> >> >>
> >> >>         rcu_read_lock();
> >> >>         conn = get_and_set_next_conn();
> >> >>         if (unlikely(!conn)) {
> >> >>                 /* ... */
> >> >>         }
> >> >>         err = rdma_io(conn, request);
> >> >>         rcu_read_unlock();
> >> >
> >> > ... except that some other CPU might well remove the entry referenced by
> >> > rcu_conn at this point.  It would have to wait for a grace period (e.g.,
> >> > synchronize_rcu()), but the current CPU has exited its RCU read-side
> >> > critical section, and therefore is not blocking the grace period.
> >> > Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it
> >> > might well be referencing the freelist, or, even worse, some other type
> >> > of structure.
> >> >
> >> > What is your code doing to prevent this from happening?  (There are ways,
> >> > but I want to know what you were doing in this case.)
> >>
> >> Probably I should have shown the way of removal at the very beginning,
> >> my fault.  So deletion looks as the following (a bit changed and
> >> simplified for the sake of clearness):
> >
> > Thank you!  Let's see...
> >
> >>         static void remove_connection(conn)
> >>         {
> >>                 bool need_to_wait = false;
> >>                 int cpu;
> >>
> >>                 /* Do not let RCU list add/delete happen in parallel */
> >>                 mutex_lock(&conn_lock);
> >>
> >>                 list_del_rcu(&conn->entry);
> >>
> >>                 /* Make sure everybody observes element removal */
> >>                 synchronize_rcu();
> >
> > At this point, any reader who saw the element in the list is done, as you
> > comment in fact says.  But there might be a pointer to that element in the
> > per-CPU variables, however, from this point forward it cannot be the case
> > that one of the per-CPU variables gets set to the newly deleted element.
> > Which is your next block of code...
> >
> >>                 /*
> >>                  * At this point nobody sees @conn in the list, but
> >> still we have
> >>                  * dangling pointer @rcu_conn which _can_ point to @conn.  Since
> >>                  * nobody can observe @conn in the list, we guarantee
> >> that IO path
> >>                  * will not assign @conn to @rcu_conn, i.e. @rcu_conn
> >> can be equal
> >>                  * to @conn, but can never again become @conn.
> >>                  */
> >>
> >>                 /*
> >>                  * Get @next connection from current @conn which is going to be
> >>                  * removed.
> >>                  */
> >>                 next = list_next_or_null_rr_rcu(&conn_list, &conn->entry,
> >>                                                 typeof(*next), entry);
> >>
> >>                 /*
> >>                  * Here @rcu_conn can be changed by reader side, so use @cmpxchg
> >>                  * in order to keep fairness in load-balancing and do not touch
> >>                  * the pointer which can be already changed by the IO path.
> >>                  *
> >>                  * Current path can be faster than IO path and the
> >> following race
> >>                  * exists:
> >>                  *
> >>                  *   CPU0                         CPU1
> >>                  *   ----                         ----
> >>                  *   conn = rcu_dereferece(rcu_conn);
> >>                  *   next = list_next_or_null_rr_rcu(conn)
> >>                  *
> >>                  *                                conn ==
> >> cmpxchg(rcu_conn, conn, next);
> >>                  *                                synchronize_rcu();
> >>                  *
> >>                  *   rcu_assign_pointer(rcu_conn, next);
> >>                  *   ^^^^^^^^^^^^^^^^^^
> >>                  *
> >>                  *   Here @rcu_conn is already equal to @next (done by
> >> @cmpxchg),
> >>                  *   so assignment to the same pointer is harmless.
> >>                  *
> >>                  */
> >>                 for_each_possible_cpu(cpu) {
> >>                         struct conn **rcu_conn;
> >>
> >>                         rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu);
> >>                         if (*rcu_conn != conn)
> >>                                 /*
> >>                                  * This @cpu will never again pick up @conn,
> >>                                  * so it is safe just to choose next CPU.
> >>                                  */
> >>                                 continue;
> >
> > ... Someone else might have picked up rcu_conn at this point...
> >
> >>                         if (conn == cmpxchg(rcu_conn, conn, next))
> >>                                 /*
> >>                                  * @rcu_conn was successfully replaced
> >> with @next,
> >>                                  * that means that someone can also hold a @conn
> >>                                  * and dereferencing it, so wait for a
> >> grace period
> >>                                  * is required.
> >>                                  */
> >>                                 need_to_wait = true;
> >
> > ... But if there was any possibility of that, need_to_wait is true, and it
> > still cannot be the case that a reader finds the newly deleted element
> > in the list, so they cannot find that element, so the pcpu_rcu_conn
> > variables cannot be set to it.
> >
> >>                 }
> >>                 if (need_to_wait)
> >>                         synchronize_rcu();
> >
> > And at this point, the reader that might have picked up rcu_conn
> > just before the cmpxchg must have completed.  (Good show, by the way!
> > Many people miss the fact that they need this second synchronize_rcu().)
> >
> > Hmmm...  What happens if this was the last element in the list, and
> > the relevant pcpu_rcu_conn variable references that newly removed
> > element?  Taking a look at list_next_or_null_rcu() and thus at
> > list_next_or_null_rcu(), and it does appear that you get NULL in that
> > case, as is right and good.
> 
> Thanks for explicit comments.  What I always lack is a good description.
> Indeed it is worth to mention that @next can become NULL if that was the
> last element, will add comments then.

Very good.

And for the record, I agree with Linus that this needs to be private.
I am very glad that you got it right (or appear to have), and there might
come a time when it needs to be generally available, but that time is
certainly not now and might well never come.

> >>                 mutex_unlock(&conn_lock);
> >>
> >>                 kfree(conn);
> >>         }
> >>
> >>
> >> >
> >> >> i.e. usage of the @next pointer is under an RCU critical section.
> >> >>
> >> >> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like
> >> >> > macro that makes it more obvious that the whole thing need to be under
> >> >> > a single RCU read-side critical section?  Such a macro would of course be
> >> >> > an infinite loop if the list never went empty, so presumably there would
> >> >> > be a break or return statement in there somewhere.
> >> >>
> >> >> The difference is that I do not need a loop, I take the @next conn pointer,
> >> >> save it for the following IO request and do IO for current IO request.
> >> >>
> >> >> It seems list_for_each_entry_rcu()-like with immediate "break" in the body
> >> >> of the loop does not look nice, I personally do not like it, i.e.:
> >> >>
> >> >>
> >> >>         static struct conn *get_and_set_next_conn(void)
> >> >>         {
> >> >>                 struct conn *conn;
> >> >>
> >> >>                 conn = rcu_dereferece(rcu_conn);
> >> >>                 if (unlikely(!conn))
> >> >>                     return conn;
> >> >>                 list_for_each_entry_rr_rcu(conn, &conn_list,
> >> >>                                            entry) {
> >> >>                         break;
> >> >>                 }
> >> >>                 rcu_assign_pointer(rcu_conn, conn);
> >> >>                 return conn;
> >> >>         }
> >> >>
> >> >>
> >> >> or maybe I did not fully get your idea?
> >> >
> >> > That would not help at all because you are still leaking the pointer out
> >> > of the RCU read-side critical section.  That is completely and utterly
> >> > broken unless you are somehow cleaning up rcu_conn when you remove
> >> > the element.  And getting that cleanup right is -extremely- tricky.
> >> > Unless you have some sort of proof of correctness, you will get a NACK
> >> > from me.
> >>
> >> I understand all the consequences of the leaking pointer, and of course
> >> wrapped loop with RCU lock/unlock is simpler, but in order to keep
> >> load-balancing and IO fairness avoiding any locks on IO path I've come
> >> up with these RCU tricks and list_next_or_null_rr_rcu() macro.
> >
> > At first glance, it appears that you have handled this correctly.  But I
> > can make mistakes just as easily as the next guy, so what have you done
> > to validate your algorithm?
> 
> What we only have is a set of unit-tests which run every night.  For this
> particular case there is a special stress test which adds/removes RDMA
> connections in a loop while IO is performing.  Unfortunately I did not
> write any synthetic test-case just for testing this isolated algorithm.
> (e.g. module with only these RCU functions can be created and list
> modification can be easily simulated. Should not be very much difficult)
> Do you think it is worth to do?  Unfortunately it also does not prove
> correctness, like Linus said it is fragile as hell, but for sure I can
> burn CPUs testing it for couple of days.

Yes, this definitely deserves some -serious- focused stress testing.

And formal verification, if that can be arranged.  The Linux-kernel
memory model does handle RCU, and also minimal linked lists, so might
be worth a try.  Unfortunately, I don't know of any other tooling
that handles RCU.  :-(

> >> > More like this:
> >> >
> >> >         list_for_each_entry_rr_rcu(conn, &conn_list, entry) {
> >> >                 do_something_with(conn);
> >> >                 if (done_for_now())
> >> >                         break;
> >> >         }
> >> >
> >> >> >> + */
> >> >> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
> >> >> >> +({ \
> >> >> >> +     list_next_or_null_rcu(head, ptr, type, memb) ?: \
> >> >> >> +             list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
> >> >> >
> >> >> > Are there any uses for this outside of RDMA?  If not, I am with Linus.
> >> >> > Define this within RDMA, where a smaller number of people can more
> >> >> > easily be kept aware of the restrictions on use.  If it turns out to be
> >> >> > more generally useful, we can take a look at exactly what makes sense
> >> >> > more globally.
> >> >>
> >> >> The only one list_for_each_entry_rcu()-like macro I am aware of is used in
> >> >> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():
> >> >>
> >> >> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
> >> >>
> >> >> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
> >> >> my list_next_or_null_rr_rcu() variant?
> >> >
> >> > Let's start with the basics:  It absolutely does not make sense to leak
> >> > pointers across rcu_read_unlock() unless you have arranged something else
> >> > to protect the pointed-to data in the meantime.  There are a number of ways
> >> > of implementing this protection.  Again, what protection are you using?
> >> >
> >> > Your code at the above URL looks plausible to me at first glance: You
> >> > do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then
> >> > rcu_read_unlock().  But at second glance, it looks like htcx->queue
> >> > might have the same vulnerability as rcu_conn in your earlier code.
> >>
> >> I am not the author of the code at the URL I specified. I provided the
> >> link answering the question to show other possible users of round-robin
> >> semantics for RCU list traversal.  In my 'list_next_or_null_rr_rcu()'
> >> case I can't use a loop, I leak the pointer and indeed have to be very
> >> careful.  But perhaps we can come up with some generic solution to cover
> >> both cases: -rr loop and -rr next.
> >
> > Ah.  Could you please check their update-side code to make sure that it
> > looks correct to you?
> 
> BTW authors of this particular -rr loop are in CC, but they keep silence
> so far :)  According to my shallow understanding @queue can't disappear
> from the list on this calling path, i.e. existence of a @queue should be
> guaranteed by the fact that the queue has no IO in-flights and only then
> it can be removed.

But something has to enforce that, right?  For example, how does the
code know that there are no longer any I/Os in flight, and how does it
know that more I/Os won't be issued in the near future, possibly based
on old state?

But yes, the authors are free to respond.

							Thanx, Paul

Paul E. McKenney May 22, 2018, 5:04 p.m. UTC | #17

On Tue, May 22, 2018 at 09:38:13AM -0700, Linus Torvalds wrote:
> On Tue, May 22, 2018 at 2:09 AM Roman Penyaev <
> roman.penyaev@profitbricks.com> wrote:
> 
> > Should I resend current patch with more clear comments about how careful
> > caller should be with a leaking pointer?
> 
> No. Even if we go your way, there is *one* single user, and that one is
> special and needs to take a lot more care.
> 
> Just roll your own version, and make it an inline function like I've asked
> now now many times, and add a shit-ton of explanations of why it's safe to
> use in that *one* situation.
> 
> I don't want any crazy and unsafe stuff in the generic header file that
> absolutely *nobody* should ever use.

Completely agreed!

I was perhaps foolishly assuming that they would be making that adjustment
based on earlier emails, but yes, I should have explicitly stated this
requirement in my earlier reply.

							Thanx, Paul

[v2,01/26] rculist: introduce list_next_or_null_rr_rcu()

Commit Message

Comments

Patch