diff mbox series

[RFC] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too

Message ID 20220203214339.1889971-1-atomlin@redhat.com (mailing list archive)
State Superseded
Headers show
Series [RFC] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too | expand

Commit Message

Aaron Tomlin Feb. 3, 2022, 9:43 p.m. UTC
Hi Frederic,

If I understand correctly, in the context of the idle task and a nohz_full
CPU, quiet_vmstat() can be called before stopping the idle tick, on entering
an idle state, and on exit. In particular, for the latter case, when the
idle task is required to reschedule, the idle tick can remain stopped and
the timer expiration time can be endless, i.e., KTIME_MAX. Now, indeed, before
a nohz_full CPU enters an idle state, CPU-specific vmstat counters should
be processed to ensure the respective values have been reset and folded
into the zone-specific vm_stat[]. That being said, this can be missed only
when the idle tick was previously stopped and reprogramming of the timer is
not required.

A customer provided some evidence which indicates that the idle tick was
stopped, yet CPU-specific vmstat counters remained populated.
Thus one can only assume quiet_vmstat() was not invoked on return to the
idle loop.

Unfortunately, I suspect this divergence might erroneously prevent a
reclaim attempt by kswapd. If the number of zone-specific free pages is
below the per-cpu drift value, then zone_page_state_snapshot() is used to
compute a more accurate view of the aforementioned statistic.
Thus any task blocked on the NUMA-node-specific pfmemalloc_wait queue will
be unable to make significant progress via direct reclaim unless it is
killed after being woken up by kswapd (see throttle_direct_reclaim()).
That being said, eventually reclaim should give up if the conditions are
correct, no?

Consider the following theoretical scenario:

        1.      CPU Y migrated running task A to CPU X that was
                in an idle state, i.e. waiting for an IRQ - not
                polling; marked the current task on CPU X as
                needing a reschedule, i.e. set TIF_NEED_RESCHED,
                and sent a reschedule IPI to CPU X (see
                sched_move_task())

        2.      CPU X acknowledged the reschedule IPI from CPU Y;
                generic idle loop code noticed the
                TIF_NEED_RESCHED flag against the idle task,
                attempts to exit the loop, and calls the main
                scheduler function, i.e. __schedule().

                Since the idle tick was previously stopped, no
                scheduling-clock tick would occur.
                So, no deferred timers would be handled.

        3.      After the transition to kernel execution, task A
                running on CPU Y indirectly released a few pages
                (e.g. see __free_one_page()); CPU Y's
                vm_stat_diff[NR_FREE_PAGES] was updated and the
                zone-specific vm_stat[] update was deferred as per
                the CPU-specific stat threshold

        4.      Task A invokes exit(2) and the kernel removes it
                from the run-queue; the idle task is selected to
                execute next since there are no other runnable
                tasks assigned to the given CPU (see
                pick_next_task() and pick_next_task_idle())

        5.      On return to the idle loop, since the idle tick
                was already stopped and can remain so (see [1]
                below), e.g. no pending soft IRQs, no attempt is
                made to zero and fold CPU Y's vmstat counters
                since reprogramming of the scheduling-clock tick
                is not required (see [2])

              ...
                do_idle
                {

                  __current_set_polling()
                  tick_nohz_idle_enter()

                  while (!need_resched()) {

                    local_irq_disable()

                    ...

                    /* No polling or broadcast event */
                    cpuidle_idle_call()
                    {

                      if (cpuidle_not_available(drv, dev)) {
                        tick_nohz_idle_stop_tick()
                          __tick_nohz_idle_stop_tick(this_cpu_ptr(&tick_cpu_sched))
                          {
                            int cpu = smp_processor_id()

                            if (ts->timer_expires_base)
                              expires = ts->timer_expires
                            else if (can_stop_idle_tick(cpu, ts))
          (1) ------->        expires = tick_nohz_next_event(ts, cpu)
                            else
                              return

                            ts->idle_calls++

                            if (expires > 0LL) {

                              tick_nohz_stop_tick(ts, cpu)
                              {

                                if (ts->tick_stopped && (expires == ts->next_tick)) {
          (2) ------->            if (tick == KTIME_MAX || ts->next_tick ==
                                    hrtimer_get_expires(&ts->sched_timer))
                                    return
                                }
                                ...
                              }


The idea with this patch is to ensure refresh_cpu_vm_stats(false) is called
on return to the idle loop when the idle tick was previously stopped.

Any feedback or testing would be appreciated.


Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
---
 kernel/time/tick-sched.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Phil Auld Feb. 3, 2022, 10:22 p.m. UTC | #1
Hi Aaron,

On Thu, Feb 03, 2022 at 09:43:39PM +0000 Aaron Tomlin wrote:
> Hi Frederic,
> 
> If I understand correctly, in the context of the idle task and a nohz_full
> CPU, quiet_vmstat() can be called: before stopping the idle tick, entering
> an idle state and on exit. In particular, for the latter case, when the
> idle task is required to reschedule, the idle tick can remain stopped and
> the timer expiration time endless i.e., KTIME_MAX. Now, indeed before a
> nohz_full CPU enters an idle state, CPU-specific vmstat counters should
> be processed to ensure the respective values have been reset and folded
> into the zone specific vm_stat[]. That being said, it can only occur when:
> the idle tick was previously stopped, and reprogramming of the timer is not
> required.
> 
> A customer provided some evidence which indicates that the idle tick was
> stopped; albeit, CPU-specific vmstat counters still remained populated.
> Thus one can only assume quiet_vmstat() was not invoked on return to the
> idle loop.
> 
> Unfortunately, I suspect this divergence might erroneously prevent a
> reclaim attempt by kswapd. If the number of zone specific free pages are
> below their per-cpu drift value then zone_page_state_snapshot() is used to
> compute a more accurate view of the aforementioned statistic.
> Thus any task blocked on the NUMA node specific pfmemalloc_wait queue will
> be unable to make significant progress via direct reclaim unless it is
> killed after being woken up by kswapd (see throttle_direct_reclaim()).
> That being said, eventually reclaim should give up if the conditions are
> correct, no?
> 
> Consider the following theoretical scenario:
> 
>         1.      CPU Y migrated running task A to CPU X that was
>                 in an idle state i.e. waiting for an IRQ - not
>                 polling; marked the current task on CPU X to
>                 need/or require a reschedule i.e., set
>                 TIF_NEED_RESCHED and invoked a reschedule IPI to
>                 CPU X (see sched_move_task())
> 
>         2.      CPU X acknowledged the reschedule IPI from CPU Y;
>                 generic idle loop code noticed the
>                 TIF_NEED_RESCHED flag against the idle task and
>                 attempts to exit of the loop and calls the main
>                 scheduler function i.e. __schedule().
> 
>                 Since the idle tick was previously stopped no
>                 scheduling-clock tick would occur.
>                 So, no deferred timers would be handled
> 
>         3.      Post transition to kernel execution Task A
>                 running on CPU Y, indirectly released a few pages
>                 (e.g. see __free_one_page()); CPU Y's
>                 vm_stat_diff[NR_FREE_PAGES] was updated and zone
>                 specific vm_stat[] update was deferred as per the
>                 CPU-specific stat threshold
> 
>         4.      Task A does invoke exit(2) and the kernel does
>                 remove the task from the run-queue; the idle task
>                 was selected to execute next since there are no
>                 other runnable tasks assigned to the given CPU
>                 (see pick_next_task() and pick_next_task_idle())
> 
>         5.      On return to the idle loop since the idle tick
>                 was already stopped and can remain so (see [1]
>                 below) e.g. no pending soft IRQs, no attempt is
>                 made to zero and fold CPU Y's vmstat counters
>                 since reprogramming of the scheduling-clock tick
>                 is not required/or needed (see [2])
> 
>               ...
>                 do_idle
>                 {
> 
>                   __current_set_polling()
>                   tick_nohz_idle_enter()
> 
>                   while (!need_resched()) {
> 
>                     local_irq_disable()
> 
>                     ...
> 
>                     /* No polling or broadcast event */
>                     cpuidle_idle_call()
>                     {
> 
>                       if (cpuidle_not_available(drv, dev)) {
>                         tick_nohz_idle_stop_tick()
>                           __tick_nohz_idle_stop_tick(this_cpu_ptr(&tick_cpu_sched))
>                           {
>                             int cpu = smp_processor_id()
> 
>                             if (ts->timer_expires_base)
>                               expires = ts->timer_expires
>                             else if (can_stop_idle_tick(cpu, ts))
>           (1) ------->        expires = tick_nohz_next_event(ts, cpu)
>                             else
>                               return
> 
>                             ts->idle_calls++
> 
>                             if (expires > 0LL) {
> 
>                               tick_nohz_stop_tick(ts, cpu)
>                               {
> 
>                                 if (ts->tick_stopped && (expires == ts->next_tick)) {
>           (2) ------->            if (tick == KTIME_MAX || ts->next_tick ==
>                                     hrtimer_get_expires(&ts->sched_timer))
>                                     return
>                                 }
>                                 ...
>                               }
> 
> 
> The idea with this patch is to ensure refresh_cpu_vm_stats(false) is called
> on return to the idle loop when the idle tick was previously stopped.
> 
> Any feedback/or testing would be appreciated.
> 
> 
> Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
> ---
>  kernel/time/tick-sched.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 17a283ce2b20..61874df064b6 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -876,6 +876,9 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
>  		ts->do_timer_last = 0;
>  	}
>  
> +	/* Attempt to fold when the idle tick is stopped or not */
> +	quiet_vmstat();
> +
>  	/* Skip reprogram of event if its not changed */
>  	if (ts->tick_stopped && (expires == ts->next_tick)) {
>  		/* Sanity check: make sure clockevent is actually programmed */
> @@ -897,7 +900,6 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
>  	 */
>  	if (!ts->tick_stopped) {
>  		calc_load_nohz_start();
> -		quiet_vmstat();
>  
>  		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
>  		ts->tick_stopped = 1;
> -- 
> 2.34.1
> 
> 

As I said earlier, I don't think you want to call quiet_vmstat() unconditionally, and
I don't think this will catch the cases you are trying to fix. Once the tick is stopped,
tick_nohz_stop_tick() should not be called again until it has been restarted.

Something like this, I think, should catch cases where the task goes idle after
changing the counters but without restarting the tick.

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index ed1fd55fc55b..8fbb5167ceb4 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1121,6 +1121,9 @@ void tick_nohz_idle_enter(void)
 
        WARN_ON_ONCE(ts->timer_expires_base);
 
+       if (ts->tick_stopped)
+               quiet_vmstat();
+
        ts->inidle = 1;
        tick_nohz_start_idle(ts);
 

But I could be wrong...


Cheers,
Phil

--
Aaron Tomlin Feb. 16, 2022, 2:34 p.m. UTC | #2
On Thu 2022-02-03 17:22 -0500, Phil Auld wrote:
> As I said earlier, I don't think you want to call quiet_vmstat()
> unconditionally. And I don't think this will catch the cases you are
> trying to fix. Once the tick is stopped tick_nohz_stop_tick should not be
> getting called again until it's been restarted.

Phil,

Sorry about the delay. If I understand correctly, I see a scenario by which
tick_nohz_stop_tick() can be called on transition/exit from idle (e.g. via
default_idle_call()):

	1.	The idle scheduling-clock tick was previously
		stopped

	2.	It is considered safe for the scheduling-clock
	        tick to remain stopped; there is no need to
		reprogram and enable a periodic tick
		(e.g. no queued or expired pending timer)

	  ...
            do_idle
	      cpuidle_idle_call
	      {

		...

	.--     default_idle_call
	|         arch_cpu_idle
	|         goto exit_idle
        |
        |       exit_idle:
	|         __current_set_polling()
        |
	|     }
        |     tick_nohz_idle_exit()
        |     {
        |
        |       tick_stopped = ts->tick_stopped
        |
        |       if (tick_stopped)
        |         tick_nohz_idle_update_tick(ts, now)
        |           if (tick_nohz_full_cpu(smp_processor_id()))
        |             __tick_nohz_full_update_tick(ts, now)
        |             {
        |               int cpu = smp_processor_id()
        |
        |               if (can_stop_full_tick(cpu, ts))
        |                 tick_nohz_stop_sched_tick(ts, cpu)
        |                   if (tick_nohz_next_event(ts, cpu))
        '--                   tick_nohz_stop_tick(ts, cpu)
                      }
              }

If I understand correctly, __tick_nohz_full_update_tick() can return with
no changes to the current tick (e.g. expire time == KTIME_MAX), no?


Kind regards,
Phil Auld Feb. 16, 2022, 9:20 p.m. UTC | #3
On Wed, Feb 16, 2022 at 02:34:12PM +0000 Aaron Tomlin wrote:
> On Thu 2022-02-03 17:22 -0500, Phil Auld wrote:
> > As I said earlier, I don't think you want to call quiet_vmstat()
> > unconditionally. And I don't think this will catch the cases you are
> > trying to fix. Once the tick is stopped tick_nohz_stop_tick should not be
> > getting called again until it's been restarted.
> 
> Phil,
> 
> Sorry about the delay. If I understand correctly, I see a scenario by which
> tick_nohz_stop_tick() can be called on transition/or exit from idle (e.g.
> default_idle_call()):

No worries. It's possible that I am misunderstanding the issue still... :)

> 
> 	1.	The idle/or scheduling-clock was previously
> 		stopped
> 
> 	2.	It is considered safe for the scheduling-clock
> 	        tick to remain "stopped"/or omitted; no need to
> 		reprogram and enable a periodic tick
> 		(e.g. no queued/or expired pending timer)
> 
> 	  ...
>             do_idle
> 	      cpuidle_idle_call
> 	      {
> 
> 		...
> 
> 	.--     default_idle_call
> 	|         arch_cpu_idle
> 	|         goto exit_idle
>         |
>         |       exit_idle:
> 	|         __current_set_polling()
>         |
> 	|     }
>         |     tick_nohz_idle_exit()
>         |     {
>         |
>         |       tick_stopped = ts->tick_stopped
>         |
>         |       if (tick_stopped)
>         |         tick_nohz_idle_update_tick(ts, now)
>         |           if (tick_nohz_full_cpu(smp_processor_id()))
>         |             __tick_nohz_full_update_tick(ts, now)
>         |             {
>         |               int cpu = smp_processor_id()
>         |
>         |               if (can_stop_full_tick(cpu, ts))
>         |                 tick_nohz_stop_sched_tick(ts, cpu)
>         |                   if (tick_nohz_next_event(ts, cpu))
>         '--                   tick_nohz_stop_tick(ts, cpu)
>                       }
>               }
> 
> If I understand correctly, __tick_nohz_full_update_tick() can return with
> no changes to the current tick (e.g. expire time == KTIME_MAX), no?
>

Yeah, I missed that path, so tick_nohz_stop_tick() can get called with the tick
already stopped.  My concern was about calling quiet_vmstat() if the tick was not
stopped, as per the comment on that function. But looking more closely at
tick_nohz_stop_tick(), it won't be doing that with your patch.

If this is fixing the issue you are seeing (I don't remember if you had a
reproducible case or not) then I think this could be a good way to do it.

It does seem to rely on a few things lining up right to get to the call to
tick_nohz_stop_tick(). 


Cheers,
Phil


> 
> Kind regards,
> 
> -- 
> Aaron Tomlin
> 

--
Frederic Weisbecker Feb. 17, 2022, 12:47 p.m. UTC | #4
Hi Aaron,

I fear my blood-brain barrier doesn't let much of mm/ code in, so I'm adding a
few interested people in Cc. Meanwhile a few comments below:


On Thu, Feb 03, 2022 at 09:43:39PM +0000, Aaron Tomlin wrote:
> Hi Frederic,
> 
> If I understand correctly, in the context of the idle task and a nohz_full
> CPU, quiet_vmstat() can be called: before stopping the idle tick, entering
> an idle state and on exit. In particular, for the latter case, when the
> idle task is required to reschedule, the idle tick can remain stopped and
> the timer expiration time endless i.e., KTIME_MAX. Now, indeed before a
> nohz_full CPU enters an idle state, CPU-specific vmstat counters should
> be processed to ensure the respective values have been reset and folded
> into the zone specific vm_stat[]. That being said, it can only occur when:
> the idle tick was previously stopped, and reprogramming of the timer is not
> required.

So, to make sure I understand, the issue is that with nohz_full, we may
well enter into the idle loop with the tick already stopped. We may also
exit from idle without restarting the tick (again only with nohz_full). And
so this can cause the vmstat to not be flushed upon idle entry. Right?

> 
> A customer provided some evidence which indicates that the idle tick was
> stopped; albeit, CPU-specific vmstat counters still remained populated.
> Thus one can only assume quiet_vmstat() was not invoked on return to the
> idle loop.
> 
> Unfortunately, I suspect this divergence might erroneously prevent a
> reclaim attempt by kswapd. If the number of zone specific free pages are
> below their per-cpu drift value then zone_page_state_snapshot() is used to
> compute a more accurate view of the aforementioned statistic.
> Thus any task blocked on the NUMA node specific pfmemalloc_wait queue will
> be unable to make significant progress via direct reclaim unless it is
> killed after being woken up by kswapd (see throttle_direct_reclaim()).
> That being said, eventually reclaim should give up if the conditions are
> correct, no?

Now if quiet_vmstat() isn't called, the vmstat_work should fix this later,
right? Or does that happen too late perhaps?

Thanks!
Frederic Weisbecker Feb. 17, 2022, 12:57 p.m. UTC | #5
On Wed, Feb 16, 2022 at 02:34:12PM +0000, Aaron Tomlin wrote:
> On Thu 2022-02-03 17:22 -0500, Phil Auld wrote:
> > As I said earlier, I don't think you want to call quiet_vmstat()
> > unconditionally. And I don't think this will catch the cases you are
> > trying to fix. Once the tick is stopped tick_nohz_stop_tick should not be
> > getting called again until it's been restarted.
> 
> Phil,
> 
> Sorry about the delay. If I understand correctly, I see a scenario by which
> tick_nohz_stop_tick() can be called on transition/or exit from idle (e.g.
> default_idle_call()):
> 
> 	1.	The idle/or scheduling-clock was previously
> 		stopped
> 
> 	2.	It is considered safe for the scheduling-clock
> 	        tick to remain "stopped"/or omitted; no need to
> 		reprogram and enable a periodic tick
> 		(e.g. no queued/or expired pending timer)
> 
> 	  ...
>             do_idle
> 	      cpuidle_idle_call
> 	      {
> 
> 		...
> 
> 	.--     default_idle_call
> 	|         arch_cpu_idle
> 	|         goto exit_idle
>         |
>         |       exit_idle:
> 	|         __current_set_polling()
>         |
> 	|     }
>         |     tick_nohz_idle_exit()
>         |     {
>         |
>         |       tick_stopped = ts->tick_stopped
>         |
>         |       if (tick_stopped)
>         |         tick_nohz_idle_update_tick(ts, now)
>         |           if (tick_nohz_full_cpu(smp_processor_id()))
>         |             __tick_nohz_full_update_tick(ts, now)
>         |             {
>         |               int cpu = smp_processor_id()
>         |
>         |               if (can_stop_full_tick(cpu, ts))
>         |                 tick_nohz_stop_sched_tick(ts, cpu)
>         |                   if (tick_nohz_next_event(ts, cpu))
>         '--                   tick_nohz_stop_tick(ts, cpu)
>                       }
>               }
> 
> If I understand correctly, __tick_nohz_full_update_tick() can return with
> no changes to the current tick (e.g. expire time == KTIME_MAX), no?

Hmm, but does it matter? The issue seems to be that we can enter the idle loop
without flushing vmstat. Or am I missing something else?

Thanks.

> 
> 
> Kind regards,
> 
> -- 
> Aaron Tomlin
>
Aaron Tomlin Feb. 17, 2022, 2:26 p.m. UTC | #6
On Thu 2022-02-17 13:47 +0100, Frederic Weisbecker wrote:
> So, to make sure I understand, the issue is that with nohz_full, we may
> well enter into the idle loop with the tick already stopped. We may also
> exit from idle without restarting the tick (again only with nohz_full). And
> so this can cause the vmstat to not be flushed upon idle entry. Right?

Hi Frederic,

Yes - this is exactly it.

> > A customer provided some evidence which indicates that the idle tick was
> > stopped; albeit, CPU-specific vmstat counters still remained populated.
> > Thus one can only assume quiet_vmstat() was not invoked on return to the
> > idle loop.
> > 
> > Unfortunately, I suspect this divergence might erroneously prevent a
> > reclaim attempt by kswapd. If the number of zone specific free pages are
> > below their per-cpu drift value then zone_page_state_snapshot() is used to
> > compute a more accurate view of the aforementioned statistic.
> > Thus any task blocked on the NUMA node specific pfmemalloc_wait queue will
> > be unable to make significant progress via direct reclaim unless it is
> > killed after being woken up by kswapd (see throttle_direct_reclaim()).
> > That being said, eventually reclaim should give up if the conditions are
> > correct, no?

> Now if quiet_vmstat() isn't called, the vmstat_work should fix this later,
> right? Or does that happen too late perhaps?

If I understand correctly, in the context of nohz_full, since such work is
deferred, it will only be handled when the periodic scheduling-clock tick is
enabled, i.e. the timer was reprogrammed on exit from idle.


Kind regards,
Aaron Tomlin Feb. 17, 2022, 2:45 p.m. UTC | #7
On Thu 2022-02-17 13:57 +0100, Frederic Weisbecker wrote:
> Hmm, but does it matter? The issue seem to be that we can enter in idle loop without
> flushing vmstat. Or am I missing something else?
Frederic,

Yes, this is indeed the concern. So, the idea I had was to invoke
quiet_vmstat() regardless of whether the tick was stopped or not. If I
understand correctly, this should resolve the issue. Furthermore, folding of
all outstanding differentials will only occur when required, so the
performance impact should be negligible.


Kind regards,
Frederic Weisbecker Feb. 17, 2022, 4:32 p.m. UTC | #8
On Thu, Feb 17, 2022 at 02:26:15PM +0000, Aaron Tomlin wrote:
> On Thu 2022-02-17 13:47 +0100, Frederic Weisbecker wrote:
> > So, to make sure I understand, the issue is that with nohz_full, we may
> > well enter into the idle loop with the tick already stopped. We may also
> > exit from idle without restarting the tick (again only with nohz_full). And
> > so this can cause the vmstat to not be flushed upon idle entry. Right?
> 
> Hi Frederic,
> 
> Yes - this is exactly it.
> 
> > > A customer provided some evidence which indicates that the idle tick was
> > > stopped; albeit, CPU-specific vmstat counters still remained populated.
> > > Thus one can only assume quiet_vmstat() was not invoked on return to the
> > > idle loop.
> > > 
> > > Unfortunately, I suspect this divergence might erroneously prevent a
> > > reclaim attempt by kswapd. If the number of zone specific free pages are
> > > below their per-cpu drift value then zone_page_state_snapshot() is used to
> > > compute a more accurate view of the aforementioned statistic.
> > > Thus any task blocked on the NUMA node specific pfmemalloc_wait queue will
> > > be unable to make significant progress via direct reclaim unless it is
> > > killed after being woken up by kswapd (see throttle_direct_reclaim()).
> > > That being said, eventually reclaim should give up if the conditions are
> > > correct, no?
> 
> > Now if quiet_vmstat() isn't called, the vmstat_work should fix this later,
> > right? Or does that happen too late perhaps?
> 
> If I understand correctly, in the context of nohz_full, since such work is
> deferred, it will only be handled in a scenario when the periodic/or
> scheduling-clock tick is enabled i.e. the timer was reprogrammed on exit
> from idle.

Oh I see, it's a deferrable delayed work...
Then I can see two other issues:

1) Can an interrupt in idle modify the vmstat and thus trigger the need to
   flush it? I believe it's the case and then the problem goes beyond nohz_full
   because if the idle interrupt fired while the tick is stopped and didn't set
   TIF_RESCHED, we go back to sleep without calling quiet_vmstat().

2) What if we are running task A in kernel mode while the tick is stopped
   (nohz_full). Task A modifies the vmstat and goes to userspace for a long
   while.

Your patch fixes case 1) but not case 2). The problem is that TIMER_DEFERRABLE
should really be about dynticks-idle only and not dynticks-full. I've always
been afraid about enforcing that rule though because that would break old
noise-free setups. But perhaps I should...
Aaron Tomlin Feb. 18, 2022, 12:54 p.m. UTC | #9
On Thu 2022-02-17 17:32 +0100, Frederic Weisbecker wrote:
> > If I understand correctly, in the context of nohz_full, since such work is
> > deferred, it will only be handled in a scenario when the periodic/or
> > scheduling-clock tick is enabled i.e. the timer was reprogrammed on exit
> > from idle.
> 
> Oh I see, it's a deferrable delayed work...
> Then I can see two other issues:
> 
> 1) Can an interrupt in idle modify the vmstat and thus trigger the need to
>    flush it? I believe it's the case and then the problem goes beyond nohz_full
>    because if the idle interrupt fired while the tick is stopped and didn't set
>    TIF_RESCHED, we go back to sleep without calling quiet_vmstat().

Yes: e.g. a nohz_full CPU, in idle code, could indeed receive a reschedule
IPI, re-enable local IRQs, and the generic idle code would see the
TIF_NEED_RESCHED flag against the idle task. Additionally, the selected task
could indirectly release a few pages [to satisfy a low-memory condition] and
modify CPU-specific vmstat data, i.e. vm_stat_diff[NR_FREE_PAGES].


> 2) What if we are running task A in kernel mode while the tick is stopped
>    (nohz_full). Task A modifies the vmstat and goes to userspace for a long
>    while.
> Your patch fixes case 1) but not case 2). The problem is that TIMER_DEFERRABLE
> should really be about dynticks-idle only and not dynticks-full. I've always
> been afraid about enforcing that rule though because that would break old
> noise-free setups. But perhaps I should...

If I understand correctly, I agree. For the latter case, nothing can be
done unfortunately since the scheduling-clock tick is stopped.


Kind regards,
Aaron Tomlin Feb. 19, 2022, 3:46 p.m. UTC | #10
On Fri 2022-02-18 12:54 +0000, Aaron Tomlin wrote:
> On Thu 2022-02-17 17:32 +0100, Frederic Weisbecker wrote:
> > > If I understand correctly, in the context of nohz_full, since such work is
> > > deferred, it will only be handled in a scenario when the periodic/or
> > > scheduling-clock tick is enabled i.e. the timer was reprogrammed on exit
> > > from idle.
> > 
> > Oh I see, it's a deferrable delayed work...
> > Then I can see two other issues:
> > 
> > 1) Can an interrupt in idle modify the vmstat and thus trigger the need to
> >    flush it? I believe it's the case and then the problem goes beyond nohz_full
> >    because if the idle interrupt fired while the tick is stopped and didn't set
> >    TIF_RESCHED, we go back to sleep without calling quiet_vmstat().
> 
> Yes: e.g. a nohz_full CPU, in idle code, could indeed receive a reschedule
> IPI; re-enable local IRQs and generic idle code sees the TIF_NEED_RESCHED
> flag against the idle task. Additionally, the selected task could
> indirectly released a few pages [to satisfy a low-memory condition] and
> modify CPU-specific vmstat data i.e. vm_stat_diff[NR_FREE_PAGES].
> 
> 
> > 2) What if we are running task A in kernel mode while the tick is stopped
> >    (nohz_full). Task A modifies the vmstat and goes to userspace for a long
> >    while.
> > Your patch fixes case 1) but not case 2). The problem is that TIMER_DEFERRABLE
> > should really be about dynticks-idle only and not dynticks-full. I've always
> > been afraid about enforcing that rule though because that would break old
> > noise-free setups. But perhaps I should...
> 
> If I understand correctly, I agree. For the latter case, nothing can be
> done unfortunately since the scheduling-clock tick is stopped.

Hi Frederic,

As far as I understand, in the context of nohz_full, options are indeed
limited; still, if we can ensure CPU-specific vmstat data is folded on
return to idle [when required], then this should be good enough.


Kind regards,
Marcelo Tosatti Feb. 24, 2022, 12:27 p.m. UTC | #11
On Sat, Feb 19, 2022 at 03:46:16PM +0000, Aaron Tomlin wrote:
> On Fri 2022-02-18 12:54 +0000, Aaron Tomlin wrote:
> > On Thu 2022-02-17 17:32 +0100, Frederic Weisbecker wrote:
> > > > If I understand correctly, in the context of nohz_full, since such work is
> > > > deferred, it will only be handled in a scenario when the periodic/or
> > > > scheduling-clock tick is enabled i.e. the timer was reprogrammed on exit
> > > > from idle.
> > > 
> > > Oh I see, it's a deferrable delayed work...
> > > Then I can see two other issues:
> > > 
> > > 1) Can an interrupt in idle modify the vmstat and thus trigger the need to
> > >    flush it? 

Yes. Page allocation and page freeing for example.

   6   3730  ../mm/page_alloc.c <<rmqueue>>
             __mod_zone_freepage_state(zone, -(1 << order),
   4   1096  ../mm/page_alloc.c <<__free_one_page>>
             __mod_zone_freepage_state(zone, -(1 << order),

> > >    I believe it's the case and then the problem goes beyond nohz_full
> > >    because if the idle interrupt fired while the tick is stopped and didn't set
> > >    TIF_RESCHED, we go back to sleep without calling quiet_vmstat().
> > 
> > Yes: e.g. a nohz_full CPU, in idle code, could indeed receive a reschedule
> > IPI; re-enable local IRQs and generic idle code sees the TIF_NEED_RESCHED
> > flag against the idle task. Additionally, the selected task could
> > indirectly released a few pages [to satisfy a low-memory condition] and
> > modify CPU-specific vmstat data i.e. vm_stat_diff[NR_FREE_PAGES].
> > 
> > 
> > > 2) What if we are running task A in kernel mode while the tick is stopped
> > >    (nohz_full). Task A modifies the vmstat and goes to userspace for a long
> > >    while.
> > > Your patch fixes case 1) but not case 2). The problem is that TIMER_DEFERRABLE
> > > should really be about dynticks-idle only and not dynticks-full. I've always
> > > been afraid about enforcing that rule though because that would break old
> > > noise-free setups. But perhaps I should...
> > 
> > If I understand correctly, I agree. For the latter case, nothing can be
> > done unfortunately since the scheduling-clock tick is stopped.
> 
> Hi Frederic,
> 
> As far as I understand, in the context of nohz_full, options are indeed
> limited; albeit, if we can ensure CPU-specific vmstat data is folded on
> return to idle [when required] then this should be good enough.

I suppose the desired behaviour, with the deferred timer for vmstat_sync, is:

"Allow the per-CPU vmstats to be out of sync, but for a maximum of 
sysctl_stat_interval".

But Aaron, vmstat_shepherd should be ensuring that the per-CPU vmstat_update
work is queued if the per-CPU vmstats are out of sync.

And:

static void
trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
{
        if (!is_timers_nohz_active())
                return;

        /*
         * TODO: This wants some optimizing similar to the code below, but we
         * will do that when we switch from push to pull for deferrable timers.
         */
        if (timer->flags & TIMER_DEFERRABLE) {
                if (tick_nohz_full_cpu(base->cpu))
                        wake_up_nohz_cpu(base->cpu);
                return;
        }

 * @TIMER_DEFERRABLE: A deferrable timer will work normally when the
 * system is busy, but will not cause a CPU to come out of idle just
 * to service it; instead, the timer will be serviced when the CPU
 * eventually wakes up with a subsequent non-deferrable timer.

You'd want that vmstat_update to execute regardless of whether there are 
armed non-deferrable timers.

Should fix both 1 and 2 AFAICS.
Marcelo Tosatti Feb. 24, 2022, 12:30 p.m. UTC | #12
On Thu, Feb 24, 2022 at 09:27:14AM -0300, Marcelo Tosatti wrote:
> On Sat, Feb 19, 2022 at 03:46:16PM +0000, Aaron Tomlin wrote:
> > On Fri 2022-02-18 12:54 +0000, Aaron Tomlin wrote:
> > > On Thu 2022-02-17 17:32 +0100, Frederic Weisbecker wrote:
> > > > > If I understand correctly, in the context of nohz_full, since such work is
> > > > > deferred, it will only be handled in a scenario when the periodic/or
> > > > > scheduling-clock tick is enabled i.e. the timer was reprogrammed on exit
> > > > > from idle.
> > > > 
> > > > Oh I see, it's a deferrable delayed work...
> > > > Then I can see two other issues:
> > > > 
> > > > 1) Can an interrupt in idle modify the vmstat and thus trigger the need to
> > > >    flush it? 
> 
> Yes. Page allocation and page freeing for example.
> 
>    6   3730  ../mm/page_alloc.c <<rmqueue>>
>              __mod_zone_freepage_state(zone, -(1 << order),
>    4   1096  ../mm/page_alloc.c <<__free_one_page>>
>              __mod_zone_freepage_state(zone, -(1 << order),
> 
> > > >    I believe it's the case and then the problem goes beyond nohz_full
> > > >    because if the idle interrupt fired while the tick is stopped and didn't set
> > > >    TIF_RESCHED, we go back to sleep without calling quiet_vmstat().
> > > 
> > > Yes: e.g. a nohz_full CPU, in idle code, could indeed receive a reschedule
> > > IPI; re-enable local IRQs and generic idle code sees the TIF_NEED_RESCHED
> > > flag against the idle task. Additionally, the selected task could
> > > indirectly released a few pages [to satisfy a low-memory condition] and
> > > modify CPU-specific vmstat data i.e. vm_stat_diff[NR_FREE_PAGES].
> > > 
> > > 
> > > > 2) What if we are running task A in kernel mode while the tick is stopped
> > > >    (nohz_full). Task A modifies the vmstat and goes to userspace for a long
> > > >    while.
> > > > Your patch fixes case 1) but not case 2). The problem is that TIMER_DEFERRABLE
> > > > should really be about dynticks-idle only and not dynticks-full. I've always
> > > > been afraid about enforcing that rule though because that would break old
> > > > noise-free setups. But perhaps I should...

Can't grasp the sentence above "The problem is that ...".
What rule?
Marcelo Tosatti Feb. 24, 2022, 12:37 p.m. UTC | #13
On Thu, Feb 24, 2022 at 09:27:14AM -0300, Marcelo Tosatti wrote:
> On Sat, Feb 19, 2022 at 03:46:16PM +0000, Aaron Tomlin wrote:
> > On Fri 2022-02-18 12:54 +0000, Aaron Tomlin wrote:
> > > On Thu 2022-02-17 17:32 +0100, Frederic Weisbecker wrote:
> > > > > If I understand correctly, in the context of nohz_full, since such work is
> > > > > deferred, it will only be handled in a scenario when the periodic/or
> > > > > scheduling-clock tick is enabled i.e. the timer was reprogrammed on exit
> > > > > from idle.
> > > > 
> > > > Oh I see, it's a deferrable delayed work...
> > > > Then I can see two other issues:
> > > > 
> > > > 1) Can an interrupt in idle modify the vmstat and thus trigger the need to
> > > >    flush it? 
> 
> Yes. Page allocation and page freeing for example.
> 
>    6   3730  ../mm/page_alloc.c <<rmqueue>>
>              __mod_zone_freepage_state(zone, -(1 << order),
>    4   1096  ../mm/page_alloc.c <<__free_one_page>>
>              __mod_zone_freepage_state(zone, -(1 << order),
> 
> > > >    I believe it's the case and then the problem goes beyond nohz_full
> > > >    because if the idle interrupt fired while the tick is stopped and didn't set
> > > >    TIF_RESCHED, we go back to sleep without calling quiet_vmstat().
> > > 
> > > Yes: e.g. a nohz_full CPU, in idle code, could indeed receive a reschedule
> > > IPI; re-enable local IRQs and generic idle code sees the TIF_NEED_RESCHED
> > > flag against the idle task. Additionally, the selected task could
> > > indirectly released a few pages [to satisfy a low-memory condition] and
> > > modify CPU-specific vmstat data i.e. vm_stat_diff[NR_FREE_PAGES].
> > > 
> > > 
> > > > 2) What if we are running task A in kernel mode while the tick is stopped
> > > >    (nohz_full). Task A modifies the vmstat and goes to userspace for a long
> > > >    while.
> > > > Your patch fixes case 1) but not case 2). The problem is that TIMER_DEFERRABLE
> > > > should really be about dynticks-idle only and not dynticks-full. I've always
> > > > been afraid about enforcing that rule though because that would break old
> > > > noise-free setups. But perhaps I should...
> > > 
> > > If I understand correctly, I agree. For the latter case, nothing can be
> > > done unfortunately since the scheduling-clock tick is stopped.
> > 
> > Hi Frederic,
> > 
> > As far as I understand, in the context of nohz_full, options are indeed
> > limited; albeit, if we can ensure CPU-specific vmstat data is folded on
> > return to idle [when required] then this should be good enough.
> 
> I suppose the desired behaviour, with the deferred timer for vmstat_sync, is:
> 
> "Allow the per-CPU vmstats to be out of sync, but for a maximum of 
> sysctl_stat_interval".
> 
> But Aaron, vmstat_shepherd should be ensuring that per-CPU vmstat_update
> work are queued, if the per-CPU vmstat are out of sync.
> 
> And:
> 
> static void
> trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
> {
>         if (!is_timers_nohz_active())
>                 return;
> 
>         /*
>          * TODO: This wants some optimizing similar to the code below, but we
>          * will do that when we switch from push to pull for deferrable timers.
>          */
>         if (timer->flags & TIMER_DEFERRABLE) {
>                 if (tick_nohz_full_cpu(base->cpu))
>                         wake_up_nohz_cpu(base->cpu);
>                 return;
>         }
> 
>  * @TIMER_DEFERRABLE: A deferrable timer will work normally when the
>  * system is busy, but will not cause a CPU to come out of idle just
>  * to service it; instead, the timer will be serviced when the CPU
>  * eventually wakes up with a subsequent non-deferrable timer.
> 
> You'd want that vmstat_update to execute regardless of whether there are 
> armed non-deferrable timers.
> 
> Should fix both 1 and 2 AFAICS.

Maybe just switching to a non-deferrable timer does not increase the
frequency of vmstat_update calls much? It should happen once per
second anyway.

Then the "vmstats out of sync, but for a maximum of sysctl_stat_interval"
rule would be respected, rather than depending on the existence of
non-deferrable timers.
Aaron Tomlin Feb. 24, 2022, 1 p.m. UTC | #14
On Thu 2022-02-24 09:27 -0300, Marcelo Tosatti wrote:
> But Aaron, vmstat_shepherd should be ensuring that per-CPU vmstat_update
> work are queued, if the per-CPU vmstat are out of sync.

Hi Marcelo,

Yes, I agree; albeit, as far as I understand, in the context of a nohz_full
CPU that has its scheduling-clock tick stopped, we cannot rely on any
deferred work.

The purpose of my patch was to prevent a nohz_full CPU from entering idle
state when CPU-specific vmstat data is non-zero.

> And:
> 
> static void
> trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
> {
>         if (!is_timers_nohz_active())
>                 return;
> 
>         /*
>          * TODO: This wants some optimizing similar to the code below, but we
>          * will do that when we switch from push to pull for deferrable timers.
>          */
>         if (timer->flags & TIMER_DEFERRABLE) {
>                 if (tick_nohz_full_cpu(base->cpu))
>                         wake_up_nohz_cpu(base->cpu);
>                 return;
>         }
> 
>  * @TIMER_DEFERRABLE: A deferrable timer will work normally when the
>  * system is busy, but will not cause a CPU to come out of idle just
>  * to service it; instead, the timer will be serviced when the CPU
>  * eventually wakes up with a subsequent non-deferrable timer.
> 
> You'd want that vmstat_update to execute regardless of whether there are 
> armed non-deferrable timers.
> 
> Should fix both 1 and 2 AFAICS.
> 

If I understand correctly, you are suggesting to switch to a non-deferred
timer for such work when the scheduling-clock tick is stopped? Indeed, it
would address both scenarios yet I'm not sure we'd want that due to the
performance impact which might be more than negligible.


Kind regards,
Aaron Tomlin Feb. 24, 2022, 1:01 p.m. UTC | #15
On Thu 2022-02-24 09:30 -0300, Marcelo Tosatti wrote:
> > > > > 2) What if we are running task A in kernel mode while the tick is stopped
> > > > >    (nohz_full). Task A modifies the vmstat and goes to userspace for a long
> > > > >    while.
> > > > > Your patch fixes case 1) but not case 2). The problem is that TIMER_DEFERRABLE
> > > > > should really be about dynticks-idle only and not dynticks-full. I've always
> > > > > been afraid about enforcing that rule though because that would break old
> > > > > noise-free setups. But perhaps I should...
> 
> Can't grasp the sentence above "The problem is that ...".
> What rule?

Hi Marcelo,

That statement was from Frederic.


Kind regards,
Marcelo Tosatti Feb. 24, 2022, 1:14 p.m. UTC | #16
On Thu, Feb 24, 2022 at 01:00:14PM +0000, Aaron Tomlin wrote:
> On Thu 2022-02-24 09:27 -0300, Marcelo Tosatti wrote:
> > But Aaron, vmstat_shepherd should be ensuring that per-CPU vmstat_update
> > work are queued, if the per-CPU vmstat are out of sync.
> 
> Hi Marcelo,
> 
> Yes, I agree; albeit, as far as I understand, in the context of a nohz_full
> CPU that has its scheduling-clock tick stopped, we cannot rely on any
> deferred work.
> 
> The purpose of my patch was to prevent a nohz_full CPU from entering idle
> state when CPU-specific vmstat data is non-zero.
> 
> > And:
> > 
> > static void
> > trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
> > {
> >         if (!is_timers_nohz_active())
> >                 return;
> > 
> >         /*
> >          * TODO: This wants some optimizing similar to the code below, but we
> >          * will do that when we switch from push to pull for deferrable timers.
> >          */
> >         if (timer->flags & TIMER_DEFERRABLE) {
> >                 if (tick_nohz_full_cpu(base->cpu))
> >                         wake_up_nohz_cpu(base->cpu);
> >                 return;
> >         }
> > 
> >  * @TIMER_DEFERRABLE: A deferrable timer will work normally when the
> >  * system is busy, but will not cause a CPU to come out of idle just
> >  * to service it; instead, the timer will be serviced when the CPU
> >  * eventually wakes up with a subsequent non-deferrable timer.
> > 
> > You'd want that vmstat_update to execute regardless of whether there are 
> > armed non-deferrable timers.
> > 
> > Should fix both 1 and 2 AFAICS.
> > 
> 
> If I understand correctly, you are suggesting to switch to a non-deferred
> timer for such work when the scheduling-clock tick is stopped? Indeed, it
> would address both scenarios yet I'm not sure we'd want that due to the
> performance impact which might be more than negligible.

Aaron,

If the per-CPU vmstat_update is limited to happen once per second, that
shouldn't be a significant performance impact?
Aaron Tomlin Feb. 24, 2022, 1:28 p.m. UTC | #17
On Thu 2022-02-24 10:14 -0300, Marcelo Tosatti wrote:
> If the per-CPU vmstat_update is limited to happen once per second, that
> shouldnt be a significant performance impact?

Perhaps not. Albeit, is the interrupt worth it? Then again it could indeed
be a long time before the idle task is selected and a return to idle code
where we'd check for any remaining differentials with the aforementioned
patch.

Kind regards,
Marcelo Tosatti Feb. 24, 2022, 1:40 p.m. UTC | #18
On Thu, Feb 24, 2022 at 01:28:16PM +0000, Aaron Tomlin wrote:
> On Thu 2022-02-24 10:14 -0300, Marcelo Tosatti wrote:
> > If the per-CPU vmstat_update is limited to happen once per second, that
> > shouldnt be a significant performance impact?
> 
> Perhaps not. Albeit, is the interrupt worth it? 

Its a requirement for correctness right?

> Then again it could indeed
> be a long time before the idle task is selected and a return to idle code
> were we'd check for any remaining differentials with the aforementioned
> patch.

Or a long time before userspace returns to the kernel.
Aaron Tomlin Feb. 24, 2022, 1:44 p.m. UTC | #19
On Thu 2022-02-24 10:40 -0300, Marcelo Tosatti wrote:
> On Thu, Feb 24, 2022 at 01:28:16PM +0000, Aaron Tomlin wrote:
> > On Thu 2022-02-24 10:14 -0300, Marcelo Tosatti wrote:
> > > If the per-CPU vmstat_update is limited to happen once per second, that
> > > shouldnt be a significant performance impact?
> > 
> > Perhaps not. Albeit, is the interrupt worth it? 
> 
> Its a requirement for correctness right?

Yes, this is true.

> > Then again it could indeed
> > be a long time before the idle task is selected and a return to idle code
> > were we'd check for any remaining differentials with the aforementioned
> > patch.
> 
> Or a long time before userspace returns to kernel.

Indeed. I'll put together a patch for comment.


Kind regards,
Aaron Tomlin March 31, 2022, 2:33 p.m. UTC | #20
On Thu 2022-02-17 17:32 +0100, Frederic Weisbecker wrote:
> Then I can see two other issues:
> 
> 1) Can an interrupt in idle modify the vmstat and thus trigger the need to
>    flush it? I believe it's the case and then the problem goes beyond nohz_full
>    because if the idle interrupt fired while the tick is stopped and didn't set
>    TIF_RESCHED, we go back to sleep without calling quiet_vmstat().
> 
> 2) What if we are running task A in kernel mode while the tick is stopped
>    (nohz_full). Task A modifies the vmstat and goes to userspace for a long
>    while.
> 
> Your patch fixes case 1) but not case 2). The problem is that TIMER_DEFERRABLE
> should really be about dynticks-idle only and not dynticks-full. I've always
> been afraid about enforcing that rule though because that would break old
> noise-free setups. But perhaps I should...

Hi Frederic,


Firstly, apologies for the delay.

In reference to case 2:

If I understand correctly, even if TIMER_DEFERRABLE is removed,
refresh_cpu_vm_stats() cannot be invoked since the scheduling-clock tick is
disabled, i.e., non-deferrable timers are serviced by the tick, no?
So, the only option would be to interrupt the workload - not desirable - or
detect any remaining differentials prior to entering userspace?


Kind regards,

Patch

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 17a283ce2b20..61874df064b6 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -876,6 +876,9 @@  static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
 		ts->do_timer_last = 0;
 	}
 
+	/* Attempt to fold when the idle tick is stopped or not */
+	quiet_vmstat();
+
 	/* Skip reprogram of event if its not changed */
 	if (ts->tick_stopped && (expires == ts->next_tick)) {
 		/* Sanity check: make sure clockevent is actually programmed */
@@ -897,7 +900,6 @@  static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
 	 */
 	if (!ts->tick_stopped) {
 		calc_load_nohz_start();
-		quiet_vmstat();
 
 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 		ts->tick_stopped = 1;