[1/2] rcu: Do not release a wait-head from a GP kthread

Message ID 20240305195720.42687-1-urezki@gmail.com (mailing list archive)
State New
Series [1/2] rcu: Do not release a wait-head from a GP kthread

Commit Message

Uladzislau Rezki March 5, 2024, 7:57 p.m. UTC
Fix the race below by not releasing a wait-head from the
GP kthread: doing so can lead to the wait-head being reused
while a worker can still access it, causing newly added
callbacks to be executed too early.

CPU 0                              CPU 1
-----                              -----

// wait_tail == HEAD1
rcu_sr_normal_gp_cleanup() {
    // has passed SR_MAX_USERS_WAKE_FROM_GP
    wait_tail->next = next;
    // done_tail = HEAD1
    smp_store_release(&rcu_state.srs_done_tail, wait_tail);
    queue_work() {
        test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))
        __queue_work()
    }
}

                               set_work_pool_and_clear_pending()
                               rcu_sr_normal_gp_cleanup_work() {
// new GP, wait_tail == HEAD2
rcu_sr_normal_gp_cleanup() {
    // executes all completions, but stops at HEAD1
    wait_tail->next = HEAD1;
    // done_tail = HEAD2
    smp_store_release(&rcu_state.srs_done_tail, wait_tail);
    queue_work() {
        test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))
        __queue_work()
    }
}
                                 // done = HEAD2
                                 done = smp_load_acquire(&rcu_state.srs_done_tail);
                                 // head = HEAD1
                                 head = done->next;
                                 done->next = NULL;
                                 llist_for_each_safe() {
                                 // completes all callbacks, releases HEAD1
                                 }
                               }
                               // Process second queue
                               set_work_pool_and_clear_pending()
                               rcu_sr_normal_gp_cleanup_work() {
                               // done = HEAD2
                               done = smp_load_acquire(&rcu_state.srs_done_tail);

// new GP, wait_tail == HEAD3
rcu_sr_normal_gp_cleanup() {
    // Finds HEAD2 with ->next == NULL at the end
    rcu_sr_put_wait_head(HEAD2)
    ...

// A few more GPs later
rcu_sr_normal_gp_init() {
     HEAD2 = rcu_sr_get_wait_head();
     llist_add(HEAD2, &rcu_state.srs_next);
                               // head == rcu_state.srs_next
                               head = done->next;
                               done->next = NULL;
                               llist_for_each_safe() {
                                // EXECUTE CALLBACKS TOO EARLY!!!
                                }
                               }
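
For illustration only, the interleaving above can be replayed as a
minimal userspace sketch (all names are invented here; this is not
kernel code): once the worker has sampled done == HEAD2, releasing
and reusing HEAD2 lets done->next point at a user whose grace period
has not yet ended.

#include <stdio.h>

struct node {
        struct node *next;
        int gp_completed;       /* callbacks may run only once this is set */
};

int main(void)
{
        /* HEAD2 models a wait-head, user a new synchronize_rcu() caller. */
        struct node head2 = { .next = NULL, .gp_completed = 0 };
        struct node user = { .next = NULL, .gp_completed = 0 };

        /* Worker (CPU 1): done = smp_load_acquire(&srs_done_tail); */
        struct node *done = &head2;

        /*
         * GP kthread (CPU 0): HEAD2->next == NULL, so the old code did
         * rcu_sr_put_wait_head(HEAD2); a few GPs later HEAD2 is re-got
         * and pushed in front of a brand-new, not-yet-completed user:
         */
        head2.next = &user;

        /* Worker resumes with its stale pointer: */
        struct node *cb = done->next;   /* == &user */
        if (cb && !cb->gp_completed)
                printf("callback executed too early!\n");
        return 0;
}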

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Fixes: 05a10b921000 ("rcu: Support direct wake-up of synchronize_rcu() users")
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

Comments

Joel Fernandes March 6, 2024, 10:31 p.m. UTC | #1
On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
> Fix the race below by not releasing a wait-head from the
> GP kthread: doing so can lead to the wait-head being reused
> while a worker can still access it, causing newly added
> callbacks to be executed too early.
>
> [...]
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 31f3a61f9c38..475647620b12 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1656,21 +1656,11 @@ static void rcu_sr_normal_gp_cleanup(void)
>  	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
>  
>  	/*
> -	 * Process (a) and (d) cases. See an illustration. Apart of
> -	 * that it handles the scenario when all clients are done,
> -	 * wait-head is released if last. The worker is not kicked.
> +	 * Process (a) and (d) cases. See an illustration.
>  	 */
>  	llist_for_each_safe(rcu, next, wait_tail->next) {
> -		if (rcu_sr_is_wait_head(rcu)) {
> -			if (!rcu->next) {
> -				rcu_sr_put_wait_head(rcu);
> -				wait_tail->next = NULL;
> -			} else {
> -				wait_tail->next = rcu;
> -			}
> -
> +		if (rcu_sr_is_wait_head(rcu))
>  			break;
> -		}
>  
>  		rcu_sr_normal_complete(rcu);
>  		// It can be last, update a next on this step.
> @@ -1684,8 +1674,12 @@ static void rcu_sr_normal_gp_cleanup(void)
>  	smp_store_release(&rcu_state.srs_done_tail, wait_tail);
>  	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
>  
> -	if (wait_tail->next)
> -		queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
> +	/*
> +	 * We schedule a work in order to perform a final processing
> +	 * of outstanding users(if still left) and releasing wait-heads
> +	 * added by rcu_sr_normal_gp_init() call.
> +	 */
> +	queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
>  }

Ah, nice. So instead of freeing wait-heads from both the GP thread and the
worker, you allocate heads only in the GP thread and free them only in the
worker, thus essentially fixing the UAF that Frederic found.

AFAICS, this fixes the issue.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

There might be a way to prevent queuing new work as a fast-path optimization,
in case the number of CBs per GP is always < SR_MAX_USERS_WAKE_FROM_GP, but I
could not find a workqueue API that helps there, and work_busy() has comments
saying not to use it.

thanks,

 - Joel
Joel Fernandes March 6, 2024, 10:44 p.m. UTC | #2
On Wed, Mar 6, 2024 at 5:31 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
>
> On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
> > Fix the race below by not releasing a wait-head from the
> > GP kthread: doing so can lead to the wait-head being reused
> > while a worker can still access it, causing newly added
> > callbacks to be executed too early.
> >
[...]
> There might be a way to prevent queuing new work as a fast-path optimization,
> in case the number of CBs per GP is always < SR_MAX_USERS_WAKE_FROM_GP, but I
> could not find a workqueue API that helps there, and work_busy() has comments
> saying not to use it.

One way to do this would be to maintain a count of how many CBs are in
flight via the worker route, and then fast-path-free the thing if the
count is 0. Should I send a patch for something like that? It saves
one more wakeup per synchronize_rcu(), I think.

 - Joel
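
A rough userspace sketch of that counting idea (every name below is
invented, and the ordering needed against the worker's sampling of
srs_done_tail, which is the hard part, is glossed over):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int srs_inflight;   /* wait-heads still owned by the worker */

/* GP-kthread side, after waking up to SR_MAX_USERS_WAKE_FROM_GP users. */
static bool gp_side_cleanup(bool only_wait_head_left)
{
        if (only_wait_head_left &&
            atomic_load_explicit(&srs_inflight, memory_order_acquire) == 0) {
                /*
                 * Nothing in flight on the worker side, so no stale
                 * done-tail can exist: free the last wait-head here
                 * and skip the queue_work() wakeup entirely.
                 */
                return false;
        }
        atomic_fetch_add_explicit(&srs_inflight, 1, memory_order_relaxed);
        return true;    /* caller does queue_work(system_highpri_wq, ...) */
}

/* Worker side, after a batch of wait-heads has been fully released. */
static void worker_side_done(void)
{
        atomic_fetch_sub_explicit(&srs_inflight, 1, memory_order_release);
}

int main(void)
{
        bool queued = gp_side_cleanup(true);    /* nothing left: fast path */

        if (queued)
                worker_side_done();
        return 0;
}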
Frederic Weisbecker March 7, 2024, 12:04 a.m. UTC | #3
Le Tue, Mar 05, 2024 at 08:57:19PM +0100, Uladzislau Rezki (Sony) a écrit :
> Fix the race below by not releasing a wait-head from the
> GP kthread: doing so can lead to the wait-head being reused
> while a worker can still access it, causing newly added
> callbacks to be executed too early.
>
> [...]
> 
> Reported-by: Frederic Weisbecker <frederic@kernel.org>
> Fixes: 05a10b921000 ("rcu: Support direct wake-up of synchronize_rcu() users")
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Joel Fernandes March 7, 2024, 6:21 a.m. UTC | #4
On 3/6/2024 5:31 PM, Joel Fernandes wrote:
> 
> 
> On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
>> Fix the race below by not releasing a wait-head from the
>> GP kthread: doing so can lead to the wait-head being reused
>> while a worker can still access it, causing newly added
>> callbacks to be executed too early.
>>
>> [...]
>>
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index 31f3a61f9c38..475647620b12 100644
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -1656,21 +1656,11 @@ static void rcu_sr_normal_gp_cleanup(void)
>>  	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
>>  
>>  	/*
>> -	 * Process (a) and (d) cases. See an illustration. Apart of
>> -	 * that it handles the scenario when all clients are done,
>> -	 * wait-head is released if last. The worker is not kicked.
>> +	 * Process (a) and (d) cases. See an illustration.
>>  	 */
>>  	llist_for_each_safe(rcu, next, wait_tail->next) {
>> -		if (rcu_sr_is_wait_head(rcu)) {
>> -			if (!rcu->next) {
>> -				rcu_sr_put_wait_head(rcu);
>> -				wait_tail->next = NULL;
>> -			} else {
>> -				wait_tail->next = rcu;
>> -			}
>> -
>> +		if (rcu_sr_is_wait_head(rcu))
>>  			break;
>> -		}
>>  
>>  		rcu_sr_normal_complete(rcu);
>>  		// It can be last, update a next on this step.
>> @@ -1684,8 +1674,12 @@ static void rcu_sr_normal_gp_cleanup(void)
>>  	smp_store_release(&rcu_state.srs_done_tail, wait_tail);
>>  	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
>>  
>> -	if (wait_tail->next)
>> -		queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
>> +	/*
>> +	 * We schedule a work in order to perform a final processing
>> +	 * of outstanding users(if still left) and releasing wait-heads
>> +	 * added by rcu_sr_normal_gp_init() call.
>> +	 */
>> +	queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
>>  }

One question: why do you need to queue_work() if wait_tail->next == NULL?

AFAICS, at this stage, if wait_tail->next == NULL you are in CASE f, so the last
remaining HEAD stays? (And llist_for_each_safe() in
rcu_sr_normal_gp_cleanup_work() becomes a NOOP.)

It could be something like this, but maybe I missed something:

@@ -1672,7 +1674,7 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
  */
 static void rcu_sr_normal_gp_cleanup(void)
 {
-       struct llist_node *wait_tail, *next, *rcu;
+       struct llist_node *wait_tail, *next = NULL, *rcu = NULL;
        int done = 0;

        wait_tail = rcu_state.srs_wait_tail;
@@ -1707,7 +1709,8 @@ static void rcu_sr_normal_gp_cleanup(void)
         * of outstanding users(if still left) and releasing wait-heads
         * added by rcu_sr_normal_gp_init() call.
         */
-       queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
+       if (rcu)
+               queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
 }

 /*
Joel Fernandes March 7, 2024, 7:09 a.m. UTC | #5
On 3/7/2024 1:21 AM, Joel Fernandes wrote:
> 
> 
> On 3/6/2024 5:31 PM, Joel Fernandes wrote:
>>
>>
>> On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
>>> Fix the race below by not releasing a wait-head from the
>>> GP kthread: doing so can lead to the wait-head being reused
>>> while a worker can still access it, causing newly added
>>> callbacks to be executed too early.
>>>
>>> [...]
> 
> One question: why do you need to queue_work() if wait_tail->next == NULL?
> 
> AFAICS, at this stage, if wait_tail->next == NULL you are in CASE f, so the last
> remaining HEAD stays? (And llist_for_each_safe() in
> rcu_sr_normal_gp_cleanup_work() becomes a NOOP.)
> 

Never mind, sorry for spewing nonsense. You can never end up with CASE f here, so
you still need the queue_work(). I think it is worth looking into how to free
the last HEAD without queuing work, though (while not causing wreckage).

thanks,

 - Joel
Uladzislau Rezki March 7, 2024, 12:17 p.m. UTC | #6
On Thu, Mar 07, 2024 at 01:04:25AM +0100, Frederic Weisbecker wrote:
> Le Tue, Mar 05, 2024 at 08:57:19PM +0100, Uladzislau Rezki (Sony) a écrit :
> > Fix the race below by not releasing a wait-head from the
> > GP kthread: doing so can lead to the wait-head being reused
> > while a worker can still access it, causing newly added
> > callbacks to be executed too early.
> > 
> > [...]
> > 
> > Reported-by: Frederic Weisbecker <frederic@kernel.org>
> > Fixes: 05a10b921000 ("rcu: Support direct wake-up of synchronize_rcu() users")
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> 
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
>
Thank you, I will update the patch with your Reviewed-by tag!

--
Uladzislau Rezki
Uladzislau Rezki March 7, 2024, 12:25 p.m. UTC | #7
On Wed, Mar 06, 2024 at 05:44:04PM -0500, Joel Fernandes wrote:
> On Wed, Mar 6, 2024 at 5:31 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> >
> >
> > On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
> > > Fix the race below by not releasing a wait-head from the
> > > GP kthread: doing so can lead to the wait-head being reused
> > > while a worker can still access it, causing newly added
> > > callbacks to be executed too early.
> > >
> [...]
> > There might be a way to prevent queuing new work as a fast-path optimization,
> > in case the number of CBs per GP is always < SR_MAX_USERS_WAKE_FROM_GP, but I
> > could not find a workqueue API that helps there, and work_busy() has comments
> > saying not to use it.
> 
> One way to do this would be to maintain a count of how many CBs are in
> flight via the worker route, and then fast-path-free the thing if the
> count is 0. Should I send a patch for something like that? It saves
> one more wakeup per synchronize_rcu(), I think.
> 
We can release the last wait-head if we know that the worker is not
pending/running; then we guarantee that Frederic's case is not possible.
On the other hand, it would again add complexity, because the idea was,
from the beginning, to start with something really simple :)

--
Uladzislau Rezki
Uladzislau Rezki March 7, 2024, 12:28 p.m. UTC | #8
On Thu, Mar 07, 2024 at 01:21:50AM -0500, Joel Fernandes wrote:
> 
> 
> On 3/6/2024 5:31 PM, Joel Fernandes wrote:
> > 
> > 
> > On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
> >> Fix the race below by not releasing a wait-head from the
> >> GP kthread: doing so can lead to the wait-head being reused
> >> while a worker can still access it, causing newly added
> >> callbacks to be executed too early.
> >>
> >> [...]
> 
> One question: why do you need to queue_work() if wait_tail->next == NULL?
> 
> AFAICS, at this stage, if wait_tail->next == NULL you are in CASE f, so the last
> remaining HEAD stays? (And llist_for_each_safe() in
> rcu_sr_normal_gp_cleanup_work() becomes a NOOP.)
> 
> It could be something like this, but maybe I missed something:
> 
> @@ -1672,7 +1674,7 @@ static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
>   */
>  static void rcu_sr_normal_gp_cleanup(void)
>  {
> -       struct llist_node *wait_tail, *next, *rcu;
> +       struct llist_node *wait_tail, *next = NULL, *rcu = NULL;
>         int done = 0;
> 
>         wait_tail = rcu_state.srs_wait_tail;
> @@ -1707,7 +1709,8 @@ static void rcu_sr_normal_gp_cleanup(void)
>          * of outstanding users(if still left) and releasing wait-heads
>          * added by rcu_sr_normal_gp_init() call.
>          */
> -       queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
> +       if (rcu)
> +               queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
>  }
> 
>  /*
>
Unneeded queueing happens only the first time after boot. After one cycle
we always need to queue the work to do the previous cleanups:

if (wait_tail->next)
    queue_the_work();

wait_tail->next is always non-NULL except for the first cycle after boot.
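
To illustrate (list shapes only, not code):

1st cleanup after boot:
    wait_tail == HEAD1 -> user .. user -> NULL
    if all users fit under SR_MAX_USERS_WAKE_FROM_GP they are completed
    in place, leaving HEAD1->next == NULL

every later cleanup:
    wait_tail == HEADn -> user .. user -> HEADn-1 (the previous done_tail,
    which only the worker may release)
    so wait_tail->next != NULL and the worker must always be kicked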

--
Uladzislau Rezki
Uladzislau Rezki March 7, 2024, 12:31 p.m. UTC | #9
On Thu, Mar 07, 2024 at 02:09:38AM -0500, Joel Fernandes wrote:
> 
> 
> On 3/7/2024 1:21 AM, Joel Fernandes wrote:
> > 
> > 
> > On 3/6/2024 5:31 PM, Joel Fernandes wrote:
> >>
> >>
> >> On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
> >>> Fix the race below by not releasing a wait-head from the
> >>> GP kthread: doing so can lead to the wait-head being reused
> >>> while a worker can still access it, causing newly added
> >>> callbacks to be executed too early.
> >>>
> >>> [...]
> > 
> > One question: why do you need to queue_work() if wait_tail->next == NULL?
> > 
> > AFAICS, at this stage, if wait_tail->next == NULL you are in CASE f, so the last
> > remaining HEAD stays? (And llist_for_each_safe() in
> > rcu_sr_normal_gp_cleanup_work() becomes a NOOP.)
> > 
> 
> Never mind, sorry for spewing nonsense. You can never end up with CASE f here, so
> you still need the queue_work(). I think it is worth looking into how to free
> the last HEAD without queuing work, though (while not causing wreckage).
> 
No problem at all :)

--
Uladzislau Rezki
Uladzislau Rezki March 7, 2024, 12:57 p.m. UTC | #10
On Wed, Mar 06, 2024 at 05:31:31PM -0500, Joel Fernandes wrote:
> 
> 
> On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
> > Fix the race below by not releasing a wait-head from the
> > GP kthread: doing so can lead to the wait-head being reused
> > while a worker can still access it, causing newly added
> > callbacks to be executed too early.
> > 
> > [...]
> 
> Ah, nice. So instead of freeing wait-heads from both the GP thread and the
> worker, you allocate heads only in the GP thread and free them only in the
> worker, thus essentially fixing the UAF that Frederic found.
> 
> AFAICS, this fixes the issue.
> 
> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
Thank you for the Reviewed-by!

> There might be a way to prevent queuing new work as a fast-path optimization,
> in case the number of CBs per GP is always < SR_MAX_USERS_WAKE_FROM_GP, but I
> could not find a workqueue API that helps there, and work_busy() has comments
> saying not to use it.
> 
This is not really critical but yes, we can think of it.

--
Uladzislau Rezki
Joel Fernandes March 7, 2024, 1:13 p.m. UTC | #11
On 3/7/2024 7:57 AM, Uladzislau Rezki wrote:
> On Wed, Mar 06, 2024 at 05:31:31PM -0500, Joel Fernandes wrote:
>>
>>
>> On 3/5/2024 2:57 PM, Uladzislau Rezki (Sony) wrote:
>>> Fix the race below by not releasing a wait-head from the
>>> GP kthread: doing so can lead to the wait-head being reused
>>> while a worker can still access it, causing newly added
>>> callbacks to be executed too early.
>>>
>>> [...]
>>
>> Ah, nice. So instead of freeing wait-heads from both the GP thread and the
>> worker, you allocate heads only in the GP thread and free them only in the
>> worker, thus essentially fixing the UAF that Frederic found.
>>
>> AFAICS, this fixes the issue.
>>
>> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>>
> Thank you for the Reviewed-by!
> 
>> There might be a way to prevent queuing new work as a fast-path optimization,
>> in case the number of CBs per GP is always < SR_MAX_USERS_WAKE_FROM_GP, but I
>> could not find a workqueue API that helps there, and work_busy() has comments
>> saying not to use it.
>>
> This is not really critical but yes, we can think of it.
> 

Thanks, I have a patch that does that. I could not help but write it as soon as
I woke up this morning ;-). It passes torture testing and I will push it for
further review after some more testing.

thanks,

 - Joel

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 31f3a61f9c38..475647620b12 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1656,21 +1656,11 @@ static void rcu_sr_normal_gp_cleanup(void)
 	WARN_ON_ONCE(!rcu_sr_is_wait_head(wait_tail));
 
 	/*
-	 * Process (a) and (d) cases. See an illustration. Apart of
-	 * that it handles the scenario when all clients are done,
-	 * wait-head is released if last. The worker is not kicked.
+	 * Process (a) and (d) cases. See an illustration.
 	 */
 	llist_for_each_safe(rcu, next, wait_tail->next) {
-		if (rcu_sr_is_wait_head(rcu)) {
-			if (!rcu->next) {
-				rcu_sr_put_wait_head(rcu);
-				wait_tail->next = NULL;
-			} else {
-				wait_tail->next = rcu;
-			}
-
+		if (rcu_sr_is_wait_head(rcu))
 			break;
-		}
 
 		rcu_sr_normal_complete(rcu);
 		// It can be last, update a next on this step.
@@ -1684,8 +1674,12 @@ static void rcu_sr_normal_gp_cleanup(void)
 	smp_store_release(&rcu_state.srs_done_tail, wait_tail);
 	ASSERT_EXCLUSIVE_WRITER(rcu_state.srs_done_tail);
 
-	if (wait_tail->next)
-		queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
+	/*
+	 * We schedule a work in order to perform a final processing
+	 * of outstanding users(if still left) and releasing wait-heads
+	 * added by rcu_sr_normal_gp_init() call.
+	 */
+	queue_work(system_highpri_wq, &rcu_state.srs_cleanup_work);
 }
 
 /*
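
As a closing illustration of the rule this patch establishes, here is a
tiny ownership sketch (names invented, not kernel code): a wait-head now
cycles freelist -> GP kthread -> worker -> freelist, and only the worker
returns it, so a stale done pointer can never observe a recycled head.

#include <assert.h>

enum owner { FREELIST, GP_KTHREAD, WORKER };

struct wait_head { enum owner owner; };

/* Models rcu_sr_get_wait_head(): only the GP kthread takes heads. */
static void gp_get(struct wait_head *h)
{
        assert(h->owner == FREELIST);
        h->owner = GP_KTHREAD;
}

/* Models smp_store_release(&srs_done_tail, ...): hand off to the worker. */
static void gp_publish(struct wait_head *h)
{
        assert(h->owner == GP_KTHREAD);
        h->owner = WORKER;
}

/* Models rcu_sr_put_wait_head(): after the patch, worker-only. */
static void worker_put(struct wait_head *h)
{
        assert(h->owner == WORKER);
        h->owner = FREELIST;
}

int main(void)
{
        struct wait_head h = { FREELIST };

        gp_get(&h);
        gp_publish(&h);
        worker_put(&h);         /* the only legal release site now */
        return 0;
}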