Message ID | 1567048502-6064-1-git-send-email-jing-ting.wu@mediatek.com (mailing list archive)
State      | New, archived
Series     | [1/1] sched/rt: avoid contend with CFS task
On 29/08/2019 04:15, Jing-Ting Wu wrote:
> At original linux design, RT & CFS scheduler are independent.
> Current RT task placement policy will select the first cpu in
> lowest_mask, even if the first CPU is running a CFS task.
> This may put RT task to a running cpu and let CFS task runnable.
>
> So we select idle cpu in lowest_mask first to avoid preempting
> CFS task.
>

Regarding the RT & CFS thing, that's working as intended. RT is a whole class above CFS, it shouldn't have to worry about CFS.

On the other side of things, CFS does worry about RT. We have the concept of RT-pressure in the CFS scheduler, where RT tasks will reduce a CPU's capacity (see fair.c::scale_rt_capacity()).

CPU capacity is looked at on CFS wakeup (see wake_cap() and find_idlest_cpu()), and the periodic load balancer tries to spread load over capacity, so it'll tend to put fewer things on CPUs that are also running RT tasks.

If RT were to start avoiding rqs with CFS tasks, we'd end up with a nasty situation where both are avoiding each other. It's even more striking when you see that RT pressure is done with a rq-wide RT util_avg, which *doesn't* get migrated when a RT task migrates. So if you decide to move a RT task to an idle CPU "B" because CPU "A" had runnable CFS tasks, the CFS scheduler will keep seeing CPU "B" as not significantly RT-pressured while that util_avg signal ramps up, whereas it would correctly see CPU "A" as RT-pressured if the RT task previously ran there.

So overall I think this is the wrong approach.

> Signed-off-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
> ---
On 08/29/19 11:38, Valentin Schneider wrote:
> On 29/08/2019 04:15, Jing-Ting Wu wrote:
[...]
> So overall I think this is the wrong approach.

I like the idea, but yeah, tend to agree the current approach might not be enough.

I think the major problem here is that on generic systems where CFS is a first class citizen, RT tasks can be hostile to them - not always necessarily for a good reason.

To further complicate the matter, even among CFS tasks we can't tell which are more important than the others - though hopefully the latency-nice proposal will make the situation better.

So I agree we have a problem here, but I think this patch is just a temporary band aid and we need to do better. Though I have no concrete suggestion yet on how to do that.

Another thing I couldn't quantify yet is how common and how severe this problem is. Jing-Ting, if you can share the details of your use case that'd be great.

Cheers

--
Qais Yousef
On Fri, 2019-08-30 at 15:55 +0100, Qais Yousef wrote:
> On 08/29/19 11:38, Valentin Schneider wrote:
[...]
> Another thing I couldn't quantify yet is how common and how severe this problem
> is. Jing-Ting, if you can share the details of your use case that'd be great.

I agree that the nasty situation will happen. The current approach and this patch might not be enough.

But for the requirement of performance, I think it is better to differentiate between an idle CPU and a CPU that has a CFS task.

For example, we use rt-app to evaluate runnable time on a non-patched environment. There are (NR_CPUS-1) heavy CFS tasks and 1 RT task. When a CFS task is running, the RT task wakes up and chooses the same CPU. The CFS task is preempted and stays runnable until it is migrated to another cpu by load balance. But load balance is not triggered immediately; it is only triggered when a timer tick hits with some condition satisfied (e.g. rq->next_balance). CFS tasks may be runnable for a long time. In this test case, it adds 332.091 ms of runnable time for the CFS task.

The detailed log is shown as follows; the CFS task (thread1-6580) is preempted by the RT task (thread0-6674) for about 332 ms:

thread1-6580 [003] dnh2 94.452898: sched_wakeup: comm=thread0 pid=6674 prio=89 target_cpu=003
thread1-6580 [003] d..2 94.452916: sched_switch: prev_comm=thread1 prev_pid=6580 prev_prio=120 prev_state=R ==> next_comm=thread0 next_pid=6674 next_prio=89
....                               (332.091ms)
krtatm-1930  [001] d..2 94.785007: sched_migrate_task: comm=thread1 pid=6580 prio=120 orig_cpu=3 dest_cpu=1
krtatm-1930  [001] d..2 94.785020: sched_switch: prev_comm=krtatm prev_pid=1930 prev_prio=100 prev_state=S ==> next_comm=thread1 next_pid=6580 next_prio=120

So I think choosing an idle CPU in the RT wakeup flow could reduce the CFS runnable time.

Best regards,
Jing-Ting Wu
Hi Jing-Ting,

On Thu, 5 Sep 2019 at 15:26, Jing-Ting Wu <jing-ting.wu@mediatek.com> wrote:
> On Fri, 2019-08-30 at 15:55 +0100, Qais Yousef wrote:
[...]
> I agree that the nasty situation will happen. The current approach and this patch might not be enough.

An RT task should not harm its cache hotness and responsiveness for the benefit of a CFS task.

> But for requirement of performance, I think it is better to differentiate between idle CPU and CPU has CFS task.
>
> For example, we use rt-app to evaluate runnable time on non-patched environment.
> There are (NR_CPUS-1) heavy CFS tasks and 1 RT Task. When a CFS task is running, the RT task wakes up and choose the same CPU.
> The CFS task will be preempted and keep runnable until it is migrated to another cpu by load balance.
> But load balance is not triggered immediately, it will be triggered until timer tick hits with some condition satisfied (ex. rq->next_balance).

Yes, you will have to wait for the next tick that will trigger an idle load balance, because you have an idle cpu and 2 runnable tasks (1 RT + 1 CFS) on the same CPU. But you should not wait for more than 1 tick.

The current load_balance doesn't correctly handle the situation of 1 CFS and 1 RT task on the same CPU while 1 CPU is idle. There is a rework of the load_balance that is under review on the mailing list that fixes this problem, and your CFS task should migrate to the idle CPU faster than now.

> CFS tasks may be runnable for a long time. In this test case, it increase 332.091 ms runnable time for CFS task.
>
> The detailed log is shown as following, CFS task(thread1-6580) is preempted by RT task(thread0-6674) about 332ms:

332ms is quite long and is probably not an idle load balance but a busy load balance.

> thread1-6580 [003] dnh2 94.452898: sched_wakeup: comm=thread0 pid=6674 prio=89 target_cpu=003
[...]
> krtatm-1930 [001] d..2 94.785007: sched_migrate_task: comm=thread1 pid=6580 prio=120 orig_cpu=3 dest_cpu=1
> krtatm-1930 [001] d..2 94.785020: sched_switch: prev_comm=krtatm prev_pid=1930 prev_prio=100 prev_state=S ==> next_comm=thread1 next_pid=6580 next_prio=120

Your CFS task has not moved to the idle CPU but has replaced another task.

Regards,
Vincent

> So I think choose idle CPU at RT wake up flow could reduce the CFS runnable time.
On Thu, 2019-09-05 at 16:01 +0200, Vincent Guittot wrote:
> Hi Jing-Ting,
>
> On Thu, 5 Sep 2019 at 15:26, Jing-Ting Wu <jing-ting.wu@mediatek.com> wrote:
[...]
> RT task should not harm its cache hotness and responsiveness for the
> benefit of a CFS task

Yes, it's a good point to also consider cache hotness. We have revised the implementation to select a better idle CPU in the same sched_domain as prev_cpu (with the same cache hotness) when the RT task wakes up.

I modified the code of find_lowest_rq as follows:

@@ -1648,6 +1629,9 @@ static int find_lowest_rq(struct task_struct *task)
 	struct cpumask *lowest_mask = this_cpu_cpumask_var_ptr(local_cpu_mask);
 	int this_cpu = smp_processor_id();
 	int cpu = task_cpu(task);
+	int i;
+	struct rq *prev_rq = cpu_rq(cpu);
+	struct sched_domain *prev_sd;

 	/* Make sure the mask is initialized first */
 	if (unlikely(!lowest_mask))
@@ -1659,6 +1643,24 @@ static int find_lowest_rq(struct task_struct *task)
 	if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
 		return -1; /* No targets found */

+	/* Choose previous cpu if it is idle and it fits lowest_mask */
+	if (cpumask_test_cpu(cpu, lowest_mask) && idle_cpu(cpu))
+		return cpu;
+
+	rcu_read_lock();
+	prev_sd = rcu_dereference(prev_rq->sd);
+
+	if (prev_sd) {
+		/* Choose idle_cpu among lowest_mask and it is closest to our hot cache data */
+		for_each_cpu(i, lowest_mask) {
+			if (idle_cpu(i) && cpumask_test_cpu(i, sched_domain_span(prev_sd))) {
+				rcu_read_unlock();
+				return i;
+			}
+		}
+	}
+	rcu_read_unlock();
+
 	/*
 	 * At this point we have built a mask of CPUs representing the
 	 * lowest priority tasks in the system. Now we want to elect

> > But load balance is not triggered immediately, it will be triggered until timer tick hits with some condition satisfied (ex. rq->next_balance).
>
> Yes you will have to wait for the next tick that will trigger an idle
> load balance because you have an idle cpu and 2 runnable tasks (1 RT +
> 1 CFS) on the same CPU.
> But you should not wait for more than 1 tick.
>
> The current load_balance doesn't handle correctly the situation of 1
> CFS and 1 RT task on same CPU while 1 CPU is idle. There is a rework
> of the load_balance that is under review on the mailing list that
> fixes this problem and your CFS task should migrate to the idle CPU
> faster than now

Periodic load balance should be triggered when the current jiffies value is past rq->next_balance, but rq->next_balance is often not exactly the next tick. If the CPU is busy, interval = sd->balance_interval * sd->busy_factor, and the interval is clamped between 1 and max_load_balance_interval. By experiment, on a system with HZ=250 and available_cpus = 8, max_load_balance_interval = HZ * available_cpus / 10 = 250 * 8 / 10 = 200 jiffies. Since rq->next_balance = sd->last_balance + interval, with a maximum interval of 200 jiffies it can take more than 1 sched-tick to migrate a CFS task.

> > CFS tasks may be runnable for a long time. In this test case, it increase 332.091 ms runnable time for CFS task.
>
> 332ms is quite long and is probably not an idle load balance but a
> busy load balance
>
> > thread1-6580 [003] dnh2 94.452898: sched_wakeup: comm=thread0 pid=6674 prio=89 target_cpu=003
[...]
> > krtatm-1930 [001] d..2 94.785007: sched_migrate_task: comm=thread1 pid=6580 prio=120 orig_cpu=3 dest_cpu=1
> > krtatm-1930 [001] d..2 94.785020: sched_switch: prev_comm=krtatm prev_pid=1930 prev_prio=100 prev_state=S ==> next_comm=thread1 next_pid=6580 next_prio=120
>
> your CFS task has not moved on the idle CPU but has replaced another task

I think it is minor and reasonable, because CPU1 has triggered idle balance (when the krtatm task is the last task leaving CPU1) to pull thread1-6580.

Best regards,
Jing-Ting Wu

> Regards,
> Vincent
On Thu, 19 Sep 2019 at 13:22, Jing-Ting Wu <jing-ting.wu@mediatek.com> wrote:
> On Thu, 2019-09-05 at 16:01 +0200, Vincent Guittot wrote:
[...]
> Yes, it's a good point to also consider cache hotness. We have revised
> the implementation to select a better idle CPU in the same sched_domain
> as prev_cpu (with the same cache hotness) when the RT task wakes up.
>
> I modified the code of find_lowest_rq as follows:
[...]
> Periodic load balance should be triggered when the current jiffies value
> is past rq->next_balance, but rq->next_balance is often not exactly the
> next tick.
> If the CPU is busy, interval = sd->balance_interval * sd->busy_factor, and

But if there is an idle CPU on the system, the next idle load balance should apply shortly, because the busy_factor is not used for this CPU, which is not busy. In this case, the next_balance interval is sd_weight, which is probably 4ms at cluster level and 8ms at system level in your case. This means between 1 and 2 ticks.

A longer interval means that:
- all CPUs are busy,
- the CPU has triggered an all_pinned case, or
- the idle load balance fails to migrate the task.

And this is probably your case.

https://lore.kernel.org/patchwork/patch/1129117/ should reduce the time before migrating the CFS task to 1-2 ticks.

> the interval is clamped between 1 and max_load_balance_interval.
> By experiment, on a system with HZ=250 and available_cpus = 8,
> max_load_balance_interval = HZ * available_cpus / 10 = 250 * 8 / 10 =
> 200 jiffies.
[...]
> I think it is minor and reasonable, because CPU1 has triggered idle
> balance (when the krtatm task is the last task leaving CPU1) to pull
> thread1-6580.

So it was a newly-idle load_balance, not an idle load balance.

Regards,
Vincent
On 09/19/19 14:27, Vincent Guittot wrote:
[...]
> But if there is an idle CPU on the system, the next idle load balance
> should apply shortly because the busy_factor is not used for this CPU
> which is not busy.
> In this case, the next_balance interval is sd_weight which is probably
> 4ms at cluster level and 8ms at system level in your case. This means
> between 1 and 2 ticks

But if the CFS task we're preempting was latency sensitive, this 1 or 2 ticks is too late of a recovery.

So while it's good that we recover, a preventative approach would be useful too. Just saying :-) I'm still not sure if this is the best longer-term approach.

--
Qais Yousef
On Thu, 19 Sep 2019 at 16:23, Qais Yousef <qais.yousef@arm.com> wrote:
> On 09/19/19 14:27, Vincent Guittot wrote:
[...]
> But if the CFS task we're preempting was latency sensitive, this 1 or 2
> ticks is too late of a recovery.
>
> So while it's good that we recover, a preventative approach would be useful too.
> Just saying :-) I'm still not sure if this is the best longer-term approach.

Like using an RT task?

> --
> Qais Yousef
On Thu, 19 Sep 2019 at 16:32, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Thu, 19 Sep 2019 at 16:23, Qais Yousef <qais.yousef@arm.com> wrote:

[...]

> > So while it's good that we recover, a preventative approach would be useful too.
> > Just saying :-) I'm still not sure if this is the best longer term approach.
>
> like using a rt task ?

I mean, the RT task would have to select a sub-optimal CPU because of CFS.
If you want to favor CFS over RT, it's probably because your task should be
RT too.

> > --
> > Qais Yousef
On 09/19/19 16:37, Vincent Guittot wrote:

[...]

> > like using a rt task ?
>
> I mean, the RT task would have to select a sub-optimal CPU because of CFS.
> If you want to favor CFS over RT, it's probably because your task should be
> RT too.

Yes possibly. But I don't think this is always doable. Especially when you're
running on a generic system, not a special-purposed one.

And we don't need to favor CFS over RT. But I think they can play nicely
together.

For example on Android there are few RT tasks and rarely more than 1 runnable
RT task at a time. But if one happened to wake up on the same CPU that is
running the UI thread you could lose a frame. And from what I've seen as well
we have 1-3 CFS tasks runnable, weighted more towards 1 task. So we do have
plenty of idle CPUs on average.

But as I mentioned earlier I couldn't prove yet that this is a serious problem.
I was hoping the use case presented here was based on a real workload, but it's
synthetic. So I agree we need stronger reasons, but I think conceptually we do
have a conflict of interest where an RT task could unnecessarily hurt the
performance of a CFS task.

Another way to look at the problem is that the system is not partitioned
correctly and the admin could do a better job to prevent this.

--
Qais Yousef
On Thu, 2019-09-19 at 16:11 +0100, Qais Yousef wrote:

[...]

> But as I mentioned earlier I couldn't prove yet that this is a serious problem.
> I was hoping the use case presented here was based on a real workload, but it's
> synthetic. So I agree we need stronger reasons, but I think conceptually we do
> have a conflict of interest where an RT task could unnecessarily hurt the
> performance of a CFS task.
>
> Another way to look at the problem is that the system is not partitioned
> correctly and the admin could do a better job to prevent this.
>
> --
> Qais Yousef

I used some third-party applications, such as weibo and others, to test
application launch time. I applied this RT patch and compared it with the
original design. Both the RT patch test case and the original design test
case already apply the patch:
https://lore.kernel.org/patchwork/patch/1129117/

After applying the RT patch, the launch time of weibo decreases from
1325.72 ms to 1214.88 ms, a reduction of 110.84 ms (about 8.36%). Other
applications also decrease by 7~13%.

In the original design test case, RT tasks (surfaceflinger) could preempt
some CFS tasks; if we add up the runnable time of all these CFS tasks, it
may have some impact on app launch time. So even if we already use the load
balance patch and reduce a lot of CFS runnable time, I think choosing an
idle CPU in the RT scheduler could also reduce some CFS runnable time.

Best regards,
Jing-Ting Wu
[+ Steven Rostedt <rostedt@goodmis.org>]

On 29/08/2019 05:15, Jing-Ting Wu wrote:
> At original linux design, RT & CFS scheduler are independent.
> Current RT task placement policy will select the first cpu in
> lowest_mask, even if the first CPU is running a CFS task.
> This may put RT task to a running cpu and let CFS task runnable.
>
> So we select idle cpu in lowest_mask first to avoid preempting
> CFS task.
>
> Signed-off-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
> ---
>  kernel/sched/rt.c | 42 +++++++++++++++++-------------------------
>  1 file changed, 17 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index a532558..626ca27 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1388,7 +1388,6 @@ static void yield_task_rt(struct rq *rq)
>  static int
>  select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
>  {
> -	struct task_struct *curr;
>  	struct rq *rq;
>
>  	/* For anything but wake ups, just return the task_cpu */
> @@ -1398,33 +1397,15 @@ static void yield_task_rt(struct rq *rq)
>  	rq = cpu_rq(cpu);
>
>  	rcu_read_lock();
> -	curr = READ_ONCE(rq->curr); /* unlocked access */
>
>  	/*
> -	 * If the current task on @p's runqueue is an RT task, then
> -	 * try to see if we can wake this RT task up on another
> -	 * runqueue. Otherwise simply start this RT task
> -	 * on its current runqueue.
> -	 *
> -	 * We want to avoid overloading runqueues. If the woken
> -	 * task is a higher priority, then it will stay on this CPU
> -	 * and the lower prio task should be moved to another CPU.
> -	 * Even though this will probably make the lower prio task
> -	 * lose its cache, we do not want to bounce a higher task
> -	 * around just because it gave up its CPU, perhaps for a
> -	 * lock?
> -	 *
> -	 * For equal prio tasks, we just let the scheduler sort it out.
> -	 *
> -	 * Otherwise, just let it ride on the affined RQ and the
> -	 * post-schedule router will push the preempted task away
> -	 *
> -	 * This test is optimistic, if we get it wrong the load-balancer
> -	 * will have to sort it out.
> +	 * If the task p is allowed to put more than one CPU or
> +	 * it is not allowed to put on this CPU.
> +	 * Let p use find_lowest_rq to choose other idle CPU first,
> +	 * instead of choose this cpu and preempt curr cfs task.
>  	 */
> -	if (curr && unlikely(rt_task(curr)) &&
> -	    (curr->nr_cpus_allowed < 2 ||
> -	     curr->prio <= p->prio)) {
> +	if ((p->nr_cpus_allowed > 1) ||
> +	    (!cpumask_test_cpu(cpu, p->cpus_ptr))) {
>  		int target = find_lowest_rq(p);

I'm sure RT folks won't like the idea of changing this condition.

I remember a similar approach, and Steven Rostedt NAKed the idea back then:

https://lore.kernel.org/r/1415099585-31174-2-git-send-email-pang.xunlei@linaro.org

Back then, Xunlei Pang even tried to create a lower mask of idle CPUs,
for find_lower_mask() to return:

https://lore.kernel.org/r/1415099585-31174-1-git-send-email-pang.xunlei@linaro.org

[...]
On 10/02/19 09:20, Jing-Ting Wu wrote:

[...]

> After applying the RT patch, the launch time of weibo decreases from
> 1325.72 ms to 1214.88 ms, a reduction of 110.84 ms (about 8.36%). Other
> applications also decrease by 7~13%.
>
> In the original design test case, RT tasks (surfaceflinger) could preempt
> some CFS tasks; if we add up the runnable time of all these CFS tasks, it
> may have some impact on app launch time. So even if we already use the load
> balance patch and reduce a lot of CFS runnable time, I think choosing an
> idle CPU in the RT scheduler could also reduce some CFS runnable time.

It'd be good if you spin a v2 of your patch with these values added to the
commit message. If you include numbers for 2 or 3 apps that'd be even better.

Make sure to add Steven Rostedt on CC, along with the other
maintainers/reviewers from get_maintainer.pl

Cheers

--
Qais Yousef
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a532558..626ca27 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1388,7 +1388,6 @@ static void yield_task_rt(struct rq *rq)
 static int
 select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
-	struct task_struct *curr;
 	struct rq *rq;
 
 	/* For anything but wake ups, just return the task_cpu */
@@ -1398,33 +1397,15 @@ static void yield_task_rt(struct rq *rq)
 	rq = cpu_rq(cpu);
 
 	rcu_read_lock();
-	curr = READ_ONCE(rq->curr); /* unlocked access */
 
 	/*
-	 * If the current task on @p's runqueue is an RT task, then
-	 * try to see if we can wake this RT task up on another
-	 * runqueue. Otherwise simply start this RT task
-	 * on its current runqueue.
-	 *
-	 * We want to avoid overloading runqueues. If the woken
-	 * task is a higher priority, then it will stay on this CPU
-	 * and the lower prio task should be moved to another CPU.
-	 * Even though this will probably make the lower prio task
-	 * lose its cache, we do not want to bounce a higher task
-	 * around just because it gave up its CPU, perhaps for a
-	 * lock?
-	 *
-	 * For equal prio tasks, we just let the scheduler sort it out.
-	 *
-	 * Otherwise, just let it ride on the affined RQ and the
-	 * post-schedule router will push the preempted task away
-	 *
-	 * This test is optimistic, if we get it wrong the load-balancer
-	 * will have to sort it out.
+	 * If the task p is allowed to put more than one CPU or
+	 * it is not allowed to put on this CPU.
+	 * Let p use find_lowest_rq to choose other idle CPU first,
+	 * instead of choose this cpu and preempt curr cfs task.
 	 */
-	if (curr && unlikely(rt_task(curr)) &&
-	    (curr->nr_cpus_allowed < 2 ||
-	     curr->prio <= p->prio)) {
+	if ((p->nr_cpus_allowed > 1) ||
+	    (!cpumask_test_cpu(cpu, p->cpus_ptr))) {
 		int target = find_lowest_rq(p);
 
 		/*
@@ -1648,6 +1629,7 @@ static int find_lowest_rq(struct task_struct *task)
 	struct cpumask *lowest_mask = this_cpu_cpumask_var_ptr(local_cpu_mask);
 	int this_cpu = smp_processor_id();
 	int cpu = task_cpu(task);
+	int i;
 
 	/* Make sure the mask is initialized first */
 	if (unlikely(!lowest_mask))
@@ -1659,6 +1641,16 @@ static int find_lowest_rq(struct task_struct *task)
 	if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
 		return -1; /* No targets found */
 
+	/* Choose previous cpu if it is idle and it fits lowest_mask */
+	if (cpumask_test_cpu(cpu, lowest_mask) && idle_cpu(cpu))
+		return cpu;
+
+	/* Choose idle_cpu among lowest_mask */
+	for_each_cpu(i, lowest_mask) {
+		if (idle_cpu(i))
+			return i;
+	}
+
 	/*
 	 * At this point we have built a mask of CPUs representing the
 	 * lowest priority tasks in the system. Now we want to elect
In the original Linux design, the RT and CFS schedulers are independent.
The current RT task placement policy selects the first CPU in
lowest_mask, even if that CPU is running a CFS task.
This may place the RT task on a busy CPU and leave the CFS task runnable.

So select an idle CPU in lowest_mask first, to avoid preempting
CFS tasks.

Signed-off-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
---
 kernel/sched/rt.c | 42 +++++++++++++++++-------------------------
 1 file changed, 17 insertions(+), 25 deletions(-)