From patchwork Tue Apr 10 15:22:48 2018
X-Patchwork-Submitter: Thomas Ilsche
X-Patchwork-Id: 10333339
From: Thomas Ilsche
Subject: Re: [RFT][PATCH v7.3 5/8] cpuidle: Return nohz hint from cpuidle_select()
To: "Rafael J. Wysocki"
CC: Peter Zijlstra, Linux PM, Frederic Weisbecker, Thomas Gleixner,
 Paul McKenney, Doug Smythies, Rik van Riel, Aubrey Li, Mike Galbraith, LKML
References: <2390019.oHdSGtR3EE@aspire.rjw.lan>
 <7336733.EjMJpVF2xN@aspire.rjw.lan>
 <8197de8d-db1f-107e-e224-2b9600bf8a87@tu-dresden.de>
 <24561274.DadG5k7Qxl@aspire.rjw.lan>
Message-ID: <079e16b6-2e04-2518-0553-66becab226a6@tu-dresden.de>
Date: Tue, 10 Apr 2018 17:22:48 +0200
In-Reply-To: <24561274.DadG5k7Qxl@aspire.rjw.lan>
X-Mailing-List: linux-pm@vger.kernel.org

>> However my fundamental concerns about the policy of whether to disable
>> the sched tick remain:
>>
>> Mixing the precise timer and the vague heuristic for the decision is
>> dangerous. The timer should not be wrong; the heuristic may be.
>
> Well, I wouldn't say "dangerous". It may be suboptimal, but even that is
> not a given. Besides ->
>
>> Decisions should use actual time points rather than the generic tick
>> duration and residency time, e.g.
>>
>>     expected_interval < delta_next_us
>> vs
>>     expected_interval < TICK_USEC
>
> -> the role of this check is to justify taking the overhead of stopping
> the tick, and it certainly is justifiable if the governor doesn't
> anticipate any wakeups (timer or not) in the TICK_USEC range. It may be
> justifiable in other cases too, but that's a matter of some more complex
> checks and may not be worth the extra complexity at all.

I tried that change. It's just a bit bulky because I cache the result of
ktime_to_us(delta_next) early.
This works as I expect in the sense of stopping the tick more often and
allowing deeper sleep states in some cases. However, it drastically
*increases* the power consumption for some affected workloads on the test
system (SKL-SP). So while I believe this generally improves the behavior,
I can't recommend it based on the practical implications. Below are some
details for the curious:

Power consumption for various workload configurations with 250 Hz kernels
(4.16.0, v9, v9+delta_next patch):
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v9_250_Hz_power.png
Practically, v9 has the best power consumption in most cases.

The following detailed looks are with 1000 Hz kernels.

v9 with a synchronized 950 us sleep workload:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v9_poll_sync.png
v9 with a staggered 950 us sleep workload:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v9_poll_stagger.png
Both show that the sched tick is kept on, and this causes more requests to
C1E than to C6.

v9+delta_next with a synchronized 950 us sleep workload:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v9_delta_poll_sync.png
v9+delta_next with a staggered 950 us sleep workload:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v9_delta_poll_stagger.png
No more sched tick, no more C1E requests, but increased power.

Besides:
- the hardware reports 0 residency in C6 (both core and PKG) for both v9
  and v9+delta_next_us.
- the increased power consumption comes after a ramp-up of ~200 ms for the
  staggered and ~2 s for the synchronized workload.

For reference, traces from an unmodified 4.16.0:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v4.16.0_poll_sync.png
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v4.16.0_poll_stagger.png
It behaves similarly to the delta_next patch but does not show the
increased power consumption in this exact workload configuration.
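For clarity, the kind of sleep workload used above can be sketched roughly
as follows. This is a hypothetical reconstruction, not the actual benchmark:
the real thread count, run time, and CPU pinning are not shown in this mail,
and the names (run_workload, worker) are invented for the sketch. "Staggered"
here means each thread's wakeups are phase-shifted by a fixed offset, while
"synchronized" means all threads wake at the same time.

```c
/* Sketch of a synchronized/staggered ~950 us sleep workload.
 * Assumptions: plain usleep() periods, no CPU pinning shown. */
#include <pthread.h>
#include <unistd.h>

#define SLEEP_US   950
#define NTHREADS   4
#define ITERATIONS 3   /* kept tiny here; real runs last many seconds */

struct worker_arg {
	long initial_offset_us; /* 0 = synchronized, i * stagger = staggered */
};

static void *worker(void *p)
{
	struct worker_arg *arg = p;

	/* Phase-shift this thread relative to the others. */
	if (arg->initial_offset_us > 0)
		usleep(arg->initial_offset_us);

	/* Each iteration gives the idle governor a ~950 us idle period. */
	for (int i = 0; i < ITERATIONS; i++)
		usleep(SLEEP_US);
	return NULL;
}

/* Run the workload; stagger_us == 0 gives the synchronized variant. */
static int run_workload(long stagger_us)
{
	pthread_t tid[NTHREADS];
	struct worker_arg args[NTHREADS];

	for (int i = 0; i < NTHREADS; i++) {
		args[i].initial_offset_us = i * stagger_us;
		if (pthread_create(&tid[i], NULL, worker, &args[i]))
			return -1;
	}
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}
```

The interesting property is that the per-thread idle period (~950 us) sits
just below a 1000 Hz tick length, which is what makes the tick-stop
decision in the governor visible in the traces.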
I couldn't help but dig into the effect a bit more, and I am able to
reproduce it even under unmodified kernels with staggered sleep cycles
between ~1.2 ms and ~2.5 ms, where power is increased by > 40 W. Anyway,
this effect seems to be beyond what the governor should consider. It is an
example where it doesn't seem possible to decide on the optimal C-state
without considering the state of other cores and such unexpected hardware
behavior. And these are only the results from one system and a limited set
of workload configurations.

>> For some cases the unmodified sched tick is not efficient as a fallback.
>> Is it feasible to
>> 1) enable the sched tick when it's currently disabled, instead of
>> blindly choosing a different C state?
>
> It is not "blindly", if you will. It is very much "consciously". :-)
>
> Restarting the tick from within menu_select() wouldn't work IMO, and
> restarting it from cpuidle_idle_call() every time it was stopped might
> be wasteful.
>
> It could be done, but AFAICS if the CPU in deep idle is woken up by an
> occasional interrupt that doesn't set need_resched, it is more likely
> to go into deep idle again than to go into shallow idle at that point.
>
>> 2) modify the next upcoming sched tick to be better suitable as a
>> fallback timer?
>
> I'm not sure what you mean.
>
>> I think with the infrastructure changes it should be possible to
>> implement the policy I envisioned previously
>> (https://marc.info/?l=linux-pm&m=151384941425947&w=2), which is based
>> on the ordering of timers and the heuristically predicted idle time.
>> If the sleep_length issue is fixed and I have some mechanism for a
>> modifiable fallback timer, I'll try to demonstrate that on top of your
>> changes.
>
> Sure. I'm not against adding more complexity to this in principle, but
> there needs to be a good enough justification for it.
>
> As I said in one of the previous messages, if simple code gets the job
> done, the extra complexity may just not be worth it.
> That's why I went for very simple code here. Still, if there is a clear
> case for making it more complex, that can be done.
>
> Thanks!

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index a6eca02..cad87bf 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -296,6 +296,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	unsigned long nr_iowaiters, cpu_load;
 	int resume_latency = dev_pm_qos_raw_read_value(device);
 	ktime_t delta_next;
+	unsigned long delta_next_us;
 
 	if (data->needs_update) {
 		menu_update(drv, dev);
@@ -314,6 +315,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 
 	/* determine the expected residency time, round up */
 	data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length(&delta_next));
+	delta_next_us = ktime_to_us(delta_next);
 
 	get_iowait_load(&nr_iowaiters, &cpu_load);
 	data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
@@ -364,7 +366,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		 */
 		if (data->predicted_us < TICK_USEC)
 			data->predicted_us = min_t(unsigned int, TICK_USEC,
-						   ktime_to_us(delta_next));
+						   delta_next_us);
 	} else {
 		/*
 		 * Use the performance multiplier and the user-configurable
@@ -412,9 +414,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	 * expected idle duration is shorter than the tick period length.
 	 */
 	if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
-	    expected_interval < TICK_USEC) {
-		unsigned int delta_next_us = ktime_to_us(delta_next);
-
+	    expected_interval < delta_next_us) {
 		*stop_tick = false;

 		if (!tick_nohz_tick_stopped() && idx > 0 &&
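To spell out the difference the two criteria make: the governor keeps the
tick running when the condition holds. The following user-space sketch is
illustrative only (not kernel code); TICK_USEC_EXAMPLE stands in for
TICK_USEC on a 1000 Hz kernel, and both helper names are invented here.

```c
#include <assert.h>

/* Stand-in for TICK_USEC on a 1000 Hz kernel (hypothetical constant). */
#define TICK_USEC_EXAMPLE 1000u

/* Mainline v9: keep the tick whenever the predicted idle period is
 * shorter than one generic tick period. */
static int keep_tick_mainline(unsigned int expected_interval_us)
{
	return expected_interval_us < TICK_USEC_EXAMPLE;
}

/* v9+delta_next: keep the tick only when the predicted idle period is
 * shorter than the actual time until the next timer event. */
static int keep_tick_delta_next(unsigned int expected_interval_us,
				unsigned int delta_next_us)
{
	return expected_interval_us < delta_next_us;
}
```

With the 950 us sleep workload on a 1000 Hz kernel, the prediction
(~950 us) and the next timer (~950 us away) make the mainline check keep
the tick, while the delta_next variant stops it; that is consistent with
the "no more sched tick" traces above.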